Ticket 21826 - Error getting jobs with sacct from dbd - DBD_GET_JOBS_COND
Summary: Error getting jobs with sacct from dbd - DBD_GET_JOBS_COND
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 24.05.5
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Patrick Wigger
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2025-01-16 02:35 MST by hpc-ops
Modified: 2025-01-21 08:55 MST

See Also:
Site: Ghent
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description hpc-ops 2025-01-16 02:35:26 MST
Hi,

Trying to get all jobs from the dbd for injection into XDMoD.

Ran into the following error:


sacct --clusters doduo --allusers --parsable2 --noheader --allocations --duplicates \
    --format jobid,jobidraw,cluster,partition,qos,account,group,gid,user,uid,submit,eligible,start,end,elapsed,exitcode,state,nnodes,ncpus,reqcpus,reqmem,reqtres,alloctres,timelimit,nodelist,jobname \
    --state CANCELLED,COMPLETED,FAILED,NODE_FAIL,PREEMPTED,TIMEOUT,OUT_OF_MEMORY,REQUEUED \
    --starttime 2024-01-25T00:00:00 --endtime 2024-09-28T23:59:59

sacct: error: Getting response to message type: DBD_GET_JOBS_COND
sacct: error: DBD_GET_JOBS_COND failure: Unspecified error


Any suggestions on how to proceed? The DBD was upgraded to 24.05.x last November.

Another ticket suggested reducing the log level, but we are already at:

[root@masterdb01 ~]# cat /etc/slurm/slurmdbd.conf

ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no
AuthInfo=socket=/run/munge/munge.socket.2
AuthType=auth/munge
DbdHost=masterdb01.gastly.os
DebugLevel=info
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd/slurmdbd.pid
PrivateData=users,accounts,jobs,usage,events
PurgeEventAfter=720hours
PurgeJobAfter=25920hours
PurgeResvAfter=25920hours
PurgeStepAfter=720hours
PurgeSuspendAfter=720hours
PurgeTXNAfter=25920hours
PurgeUsageAfter=25920hours
SlurmUser=slurm
StoragePass=<snip>
StorageType=accounting_storage/mysql
StorageUser=slurm


Thanks,
-- Andy
Comment 1 Patrick Wigger 2025-01-21 08:55:39 MST
Hi Andy,

To debug this further, could you please:

1. Enable detailed logging by setting DebugFlags=DB_QUERY,DB_JOB in slurmdbd.conf.
2. Run the problematic sacct command again.
3. While it is hanging, run "SHOW PROCESSLIST;" in mysql to inspect database activity.
4. Attach the slurmdbd log covering the duration of the sacct run, then remove the DebugFlags again to avoid excessive logging.
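
For step 1, the relevant slurmdbd.conf lines would look roughly like this (a minimal sketch; restart slurmdbd, or send it a SIGHUP, so the change takes effect, and note the extra flags are verbose, so revert them once the log is captured):

```
DebugLevel=info
DebugFlags=DB_QUERY,DB_JOB
```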

Does the error still occur with shorter time windows specified by --starttime and --endtime? Additionally, please check the output of "sacctmgr show runawayjobs"; this will list any completed jobs that are missing an end time in the database.
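
The shorter-interval approach can be sketched as a wrapper that splits the nine-month window into roughly month-sized chunks, so each DBD_GET_JOBS_COND query stays small. This is only an illustration: it assumes GNU date(1) is available, and it echoes the sacct invocation instead of executing it so it can be dry-run safely (note that consecutive chunks share a boundary day, so the --endtime/--starttime overlap may need trimming for XDMoD ingestion):

```shell
#!/bin/sh
# Sketch: emit (from, to) date pairs covering [start, end] in ~1-month steps.
# Assumes GNU date(1) for the "+ 1 month" relative-date arithmetic.
month_chunks() {
    end=$2
    cur=$1
    while [ "$(date -d "$cur" +%s)" -lt "$(date -d "$end" +%s)" ]; do
        next=$(date -d "$cur + 1 month" +%Y-%m-%d)
        # Clamp the final chunk to the requested end date.
        if [ "$(date -d "$next" +%s)" -gt "$(date -d "$end" +%s)" ]; then
            next=$end
        fi
        printf '%s %s\n' "$cur" "$next"
        cur=$next
    done
}

# Dry-run: print one sacct command per chunk instead of executing it.
month_chunks 2024-01-25 2024-09-28 | while read -r from to; do
    echo "sacct --clusters doduo --allusers --parsable2 --noheader --allocations" \
         "--starttime ${from}T00:00:00 --endtime ${to}T23:59:59"
done
```

Dropping the leading "echo" (and adding back the --format, --state, and --duplicates options from the original command) would run the chunked queries for real.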

Best,
Patrick