Ticket 21826

Summary:	Error getting jobs with sacct from dbd - DBD_GET_JOBS_COND
Product:	Slurm	Reporter:	hpc-ops
Component:	Accounting	Assignee:	Patrick Wigger <patrick>
Status:	RESOLVED TIMEDOUT	QA Contact:
Severity:	4 - Minor Issue
Priority:	---
Version:	24.05.5
Hardware:	Linux
OS:	Linux
Site:	Ghent	Slinky Site:	---

Description hpc-ops 2025-01-16 02:35:26 MST

Hi,

We are trying to pull all jobs from slurmdbd for ingestion into XDMoD.

Ran into the following error:


sacct --clusters doduo --allusers --parsable2 --noheader --allocations --duplicates \
  --format jobid,jobidraw,cluster,partition,qos,account,group,gid,user,uid,submit,eligible,start,end,elapsed,exitcode,state,nnodes,ncpus,reqcpus,reqmem,reqtres,alloctres,timelimit,nodelist,jobname \
  --state CANCELLED,COMPLETED,FAILED,NODE_FAIL,PREEMPTED,TIMEOUT,OUT_OF_MEMORY,REQUEUED \
  --starttime 2024-01-25T00:00:00 --endtime 2024-09-28T23:59:59

sacct: error: Getting response to message type: DBD_GET_JOBS_COND
sacct: error: DBD_GET_JOBS_COND failure: Unspecified error


Any suggestions on how to proceed? Our slurmdbd was updated to 24.05.x last November.

Another ticket suggested lowering the log level, but ours is already set to DebugLevel=info:

[root@masterdb01 ~]# cat /etc/slurm/slurmdbd.conf

ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no
AuthInfo=socket=/run/munge/munge.socket.2
AuthType=auth/munge
DbdHost=masterdb01.gastly.os
DebugLevel=info
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd/slurmdbd.pid
PrivateData=users,accounts,jobs,usage,events
PurgeEventAfter=720hours
PurgeJobAfter=25920hours
PurgeResvAfter=25920hours
PurgeStepAfter=720hours
PurgeSuspendAfter=720hours
PurgeTXNAfter=25920hours
PurgeUsageAfter=25920hours
SlurmUser=slurm
StoragePass=<snip>
StorageType=accounting_storage/mysql
StorageUser=slurm


Thanks,
-- Andy

Comment 1 Patrick Wigger 2025-01-21 08:55:39 MST

Hi Andy,

To debug this further, could you please:

1. Enable detailed logging by setting DebugFlags=DB_QUERY,DB_JOB in slurmdbd.conf.
2. Run the problematic sacct command again.
3. While it is hanging, run SHOW PROCESSLIST; at the mysql> prompt to inspect database activity.
4. Attach the slurmdbd log covering the duration of the sacct run, then reset DebugFlags to avoid excessive logging.
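
For step 3, one way to sample the process list for the whole time sacct is running might look like the sketch below. It assumes the mysql client can log in non-interactively (for example through ~/.my.cnf) and that GNU sleep is available; watch_processlist and its arguments are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: record SHOW FULL PROCESSLIST output while another command runs.
# Assumption: `mysql` authenticates without prompting (e.g. via ~/.my.cnf).
# watch_processlist is a hypothetical helper, not part of Slurm or MySQL.

# Usage: watch_processlist OUTFILE INTERVAL_SECONDS COMMAND [ARGS...]
watch_processlist() {
    local outfile=$1 interval=$2
    shift 2

    "$@" &                      # start the command (e.g. sacct) in the background
    local pid=$!

    # Keep sampling for as long as the background command is still alive.
    while kill -0 "$pid" 2>/dev/null; do
        {
            date -u +%Y-%m-%dT%H:%M:%SZ
            mysql -e 'SHOW FULL PROCESSLIST'
        } >>"$outfile" 2>&1
        sleep "$interval"
    done

    wait "$pid"                 # return the command's own exit status
}
```

It could be called as `watch_processlist /tmp/processlist.log 5 sacct ...` with the full sacct arguments from the description, and the resulting file attached next to the slurmdbd log.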

Does this also happen when you query shorter time intervals with --starttime and --endtime? Additionally, could you check the output of "sacctmgr show runawayjobs"? It lists any jobs that have finished but are still missing an end time in the database.
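
To test shorter intervals systematically, the range could be split into fixed-size windows and queried one window at a time. The following is a minimal sketch, assuming GNU date; sacct_windows is a hypothetical helper name and the 7-day default is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch: split a date range into consecutive windows for sacct.
# Assumption: GNU date (for `date -d`). sacct_windows is a hypothetical helper.

# Usage: sacct_windows FIRST_DAY LAST_DAY [STEP_DAYS]
# Prints one "START END" pair per line, covering both days inclusively.
sacct_windows() {
    local first_day=$1 last_day=$2 step_days=${3:-7}
    local cur end

    cur=$(date -u -d "$first_day" +%Y-%m-%d)
    last_day=$(date -u -d "$last_day" +%Y-%m-%d)

    # ISO 8601 dates sort correctly as plain strings.
    while [[ ! $cur > $last_day ]]; do
        end=$(date -u -d "$cur + $((step_days - 1)) days" +%Y-%m-%d)
        if [[ $end > $last_day ]]; then
            end=$last_day       # clamp the final window to the requested range
        fi
        echo "${cur}T00:00:00 ${end}T23:59:59"
        cur=$(date -u -d "$end + 1 day" +%Y-%m-%d)
    done
}

# Example use (not run here; "..." stands for the remaining sacct options):
#   while read -r start end; do
#       sacct --clusters doduo ... --starttime "$start" --endtime "$end"
#   done < <(sacct_windows 2024-01-25 2024-09-28 7)
```

If only some windows fail, that narrows down which period holds the problematic records. Jobs that span a window boundary may show up in more than one window, so the combined output may need deduplicating (for example on jobidraw) before it is loaded elsewhere.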

Best,
Patrick

Comment 2 Patrick Wigger 2025-05-28 10:48:57 MDT

Hi Andy,

Are you still facing this issue, or should I close this ticket?

Thanks,
Patrick

Comment 3 Patrick Wigger 2025-06-30 13:17:50 MDT

Hi Andy,

I am going to go ahead and close this out for now. Please feel free to reopen this ticket if you still require assistance with this issue.

Best,
Patrick