Ticket 21340

Summary: sacct v squeue mismatch
Product: Slurm
Reporter: jay.kubeck
Component: Accounting
Assignee: Ben Roberts <ben>
Status: OPEN
QA Contact: ---
Severity: 4 - Minor Issue
Priority: ---
CC: eric.peskin
Version: 23.11.7
Hardware: Linux
OS: Linux
Site: Yale
Attachments: sdiag.txt
squeue.state.R.txt
sacct.state.R.txt
slurmctld p1
slurmctld p2
slurmctld p3
slurmctld p4
slurmdbd logs
slurm.conf
slurmdbd.conf no pwd
slurmdbd logs after network debugging turned on 24 hours
slurmdbd logs after network debugging turned on

Description jay.kubeck 2024-11-05 07:46:46 MST
Hello,

I'm trying to understand the discrepancy between what squeue and sacct report for running jobs - 

[root@mgt1.grace ~]# squeue -o "%.18i %.9T" --state R | wc -l
3181
[root@mgt1.grace ~]# sacct --format=JobID,State --state R -a -X | wc -l
662

I checked a similar ticket, and it recommended looking for runaway jobs, which I did, clearing the one it reported.
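For reference, the runaway check was the standard one:
  sacctmgr show runawayjobs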

Please let me know what else I can do to get the DB to catch up to what squeue reports.

Best,

Jay
Comment 1 Ben Roberts 2024-11-05 10:27:40 MST
Hi Jay,

Can I have you send the output of the following commands so we can track down where the disconnect is between these two lists?
squeue --state R
sacct --state R -a -X --format=jobid,partition,user,nodelist,state,submit,start

I would also like to see if there is a delay between what is happening on the scheduler side and what is being reported to slurmdbd.  To check that, please also run 'sdiag'.

Thanks,
Ben
Comment 2 jay.kubeck 2024-11-05 10:48:10 MST
Created attachment 39564 [details]
sdiag.txt

Good afternoon,

Thank you for the quick reply. I've attached txt files of the requested output.

Best,

Jay
Comment 3 jay.kubeck 2024-11-05 10:48:10 MST
Created attachment 39565 [details]
squeue.state.R.txt
Comment 4 jay.kubeck 2024-11-05 10:48:10 MST
Created attachment 39566 [details]
sacct.state.R.txt
Comment 5 Ben Roberts 2024-11-05 12:57:44 MST
Thanks for sending that.  It looks like there may have been a few time periods where job records weren't being recorded properly in the database.  I can see that the sacct output only shows a handful of jobs that were submitted/started on the 4th, but squeue has a large number of jobs that started then.  Looking at 8112304 (a job array), there are 11 records for that array in sacct, but squeue shows over 900 currently running.  

This could happen if there are records that the controller tries to send to slurmdbd, but the connection can't be made for some reason.  There is a mechanism that holds on to messages if slurmdbd can't be reached, but there is a limit to how many of these messages the controller can keep track of before it starts dropping them.  In the sdiag output you sent I can see that there is a queue of messages that aren't going through:
  DBD Agent queue size: 1047

But there must be some messages going through because some of the most recent jobs do show up in the sacct output.  

It seems to me that there must be some sort of intermittent communication issue that is causing some of these messages to be missed.  I would like to see your slurmctld and slurmdbd logs for the last couple of days to see if there are any clues there about what's happening.  I would also like to see how many messages your cluster is configured to track before it starts dropping them.  You can see that with the following command:
  scontrol show config | grep MaxDBDMsgs
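
If it's useful to keep an eye on whether that agent queue is growing or draining while we investigate, something simple like this on the controller should do it (just a sketch):
  while true; do date; sdiag | grep -i 'dbd agent queue size'; sleep 60; done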

Thanks,
Ben
Comment 6 jay.kubeck 2024-11-05 13:17:23 MST
Created attachment 39571 [details]
slurmctld p1
Comment 7 jay.kubeck 2024-11-05 13:18:05 MST
Created attachment 39572 [details]
slurmctld p2
Comment 8 jay.kubeck 2024-11-05 13:18:31 MST
Created attachment 39573 [details]
slurmctld p3
Comment 9 jay.kubeck 2024-11-05 13:18:50 MST
Created attachment 39574 [details]
slurmctld p4
Comment 10 jay.kubeck 2024-11-05 13:19:48 MST
Created attachment 39575 [details]
slurmdbd logs
Comment 11 jay.kubeck 2024-11-05 13:20:06 MST
Thanks for the quick reply again!

Here is the information you requested -

[root@mgt1.grace ~]# scontrol show config | grep -i MaxDBDMsgs
MaxDBDMsgs              = 1003128

The logs are attached to the ticket (the slurmctld logs were pretty large at 2 days, so I split them).

Best,

Jay
Comment 12 Ben Roberts 2024-11-06 09:31:12 MST
I went through the logs you sent and I do see some things that indicate there could be a problem, but not necessarily a smoking gun for what caused the discrepancy in running jobs.  In the slurmctld logs I can see that there are frequently messages like this:
Nov 03 15:11:24 r912u23n01 slurmctld[163760]: slurmctld: Resending TERMINATE_JOB request JobId=7903368 Nodelist=r915u32n03

The fact that it is trying to resend events like these means that the first attempt didn't go through.  

The slurmdbd logs don't have much in them, as I'm sure you saw.  They do periodically show that there is an error processing a message from an IP that I assume belongs to the controller.  

I would like to see some additional detail about what the slurmdbd logs show for network connections.  Can I have you update your slurmdbd.conf file to have the following:
  DebugLevel=verbose
  DebugFlags=network

You can have slurmdbd pick up that change without restarting it by running 'sacctmgr reconfigure'.  Then let things run for a few minutes before reverting the settings to what they were previously and doing another reconfigure.  I would expect to see that there are connections opened and closed from the controller regularly.  If we don't see these then that would be a good clue that something is wrong.  
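Concretely, the round trip would look something like this on the slurmdbd host (just a sketch):
  # edit slurmdbd.conf to add DebugLevel=verbose and DebugFlags=network
  sacctmgr reconfigure
  # ...let it run for a few minutes...
  # edit slurmdbd.conf back to the previous settings
  sacctmgr reconfigure
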

I would also like to see your slurm.conf and slurmdbd.conf (with username and password information for the database removed).

Thanks,
Ben
Comment 13 Ben Roberts 2024-12-06 13:27:30 MST
Hi Jay,

I wanted to follow up and see if you've had a chance to gather the information I was looking for in comment 12.  Let me know if you still need help with this ticket or if it's ok to close.

Thanks,
Ben
Comment 14 jay.kubeck 2024-12-19 07:18:45 MST
Created attachment 40206 [details]
slurm.conf
Comment 15 jay.kubeck 2024-12-19 07:19:06 MST
Created attachment 40207 [details]
slurmdbd.conf no pwd
Comment 16 jay.kubeck 2024-12-19 07:21:39 MST
Good morning, I apologize for the delay in getting back to you; I'm still interested in pursuing this issue. 

I have attached the slurm.conf and slurmdbd.conf (password redacted) files to the ticket.

You'll see in the slurmdbd.conf the debug options are - 

DebugLevelSyslog=error

Should that line be changed to DebugLevel=verbose, or should I add 

  DebugLevel=verbose
  DebugFlags=network

to slurmdbd.conf?

Please let me know and I'll set that up. 

Best,

Jay
Comment 17 Ben Roberts 2024-12-19 11:31:07 MST
I looked over your configuration files and there were a couple of things that stood out.  One is that you have SlurmdTimeout set to 2 hours:
SlurmdTimeout=7200

I'm curious if there was an issue that you were dealing with that prompted you to set it that high.  If there are long delays where slurmd can't communicate with the controller then that could account for why there are differences in how many running jobs are shown by squeue vs sacct.

Another thing I noticed is that you have UnkillableStepTimeout set to 600 seconds, but don't have an UnkillableStepProgram defined.  UnkillableStepProgram lets you configure a script that tries to clean up any stray processes that are preventing a job step from terminating correctly.  Without a defined UnkillableStepProgram, the timeout won't cause the controller to take any additional action on the node.  You can read more about this here:
https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram
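
If it would help, a minimal sketch of what that could look like (the path and the drain behavior are just examples, assuming root on the compute node is allowed to run 'scontrol update'):
  # slurm.conf
  UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh

and /usr/local/sbin/unkillable_step.sh could be as simple as:
  #!/bin/bash
  # Log the event and drain the node so it can be scheduled for a reboot
  # instead of silently accumulating stuck processes.
  logger -t unkillable_step "unkillable step detected on $(hostname)"
  scontrol update NodeName=$(hostname) State=DRAIN Reason="unkillable step"

Even if most of these cases end in a reboot, draining the node automatically keeps new jobs from landing on it in the meantime.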

I do still want to look at some additional logs.  The DebugLevelSyslog is fine to use most of the time.  For trying to capture debug logs I would rather have you write the information to a separate log file.  There are frequently log entries that don't have a direct reference to a particular job, so trying to pull them out of a bigger log file can be tricky.  If you can define a LogFile for long enough to capture some logs that would be great.
https://slurm.schedmd.com/slurmdbd.conf.html#OPT_LogFile

With that, you would want to define the parameters mentioned earlier:
  DebugLevel=verbose
  DebugFlags=network
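
Putting that together, the temporary additions to slurmdbd.conf might look something like this (the log path is just an example; adjust for your environment):
  LogFile=/var/log/slurm/slurmdbd.log
  DebugLevel=verbose
  DebugFlags=network
Then 'sacctmgr reconfigure' will have slurmdbd pick up the change, and you can revert and reconfigure again once you've captured a useful window of logs.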

Thanks,
Ben
Comment 18 jay.kubeck 2025-01-03 11:57:39 MST
Created attachment 40274 [details]
slurmdbd logs after network debugging turned on 24 hours
Comment 19 jay.kubeck 2025-01-03 11:59:37 MST
Good afternoon,

I've uploaded slurmdbd logs from the last 24 hours with network debugging and verbose turned on. 

Here is some info on the two parameters you had questions about - 

For "SlurmdTimeout=7200" we set it that high to provide buffer time and help avoid the situation of having slurmd terminate all running jobs on a node if that node had been marked as down and later reconnected to the scheduler. This setting does not impact how frequently slurmctld attempts to talk to slurmd.
We don't have UnkillableStepProgram defined because these zombies are hard to kill: if slurmd isn't able to end the process on its own, it is likely because the PID is stuck in an I/O wait state or is/was running on an unresponsive GPU. 99% of these types of cases require a reboot to actually clear them.

Please let me know if that's helpful or if I can provide any further information.

Best,

Jay
Comment 20 Ben Roberts 2025-01-03 14:05:07 MST
Hi Jay,

Thanks for the additional information and the logs.  Unfortunately the logs don't show any real issue.  The majority of the log entries are about connections being opened and closed (normally).  Then there are a handful of other entries about the hourly rollups taking a while to process, from 7 to 20 seconds.  There aren't any entries that indicate there was a connection that was dropped in the middle of a transaction or anything like that.  Did you have any instances of jobs that weren't recorded in slurmdbd during the time covered by these logs?  

Thanks,
Ben
Comment 21 Ben Roberts 2025-01-31 14:41:30 MST
Hi Jay,

I wanted to follow up and see if you were able to track down any other instances of this happening with some logs to go with that occurrence.  Let me know if you still need help with this ticket.

Thanks,
Ben
Comment 22 jay.kubeck 2025-02-12 14:08:47 MST
Created attachment 40773 [details]
slurmdbd logs after network debugging turned on
Comment 23 jay.kubeck 2025-02-12 14:11:39 MST
Good afternoon,

The discrepancy between sacct and squeue job counts returned this week, and network and verbose debugging were turned on soon after it was discovered; hopefully that was picked up in the attached logs.

When we are seeing the running-job discrepancy, mariadb is completely saturating a core (it is single-threaded, I guess), and while watching it the process seemed to be busy just adding, updating, and checking job scripts and environments.

Does it make sense that better DB performance could be had by using the database for fewer things?  Would scaling the DB back to 1 month of active data help?  Any other tweaks?  Right now the "keep" is specified like "3months", which apparently means purge once a month and keep the most recent 3 months.  Specifying it like "90days" would mean purge daily and keep the most recent 90 days.

Please let me know what you find and what you think of making some config changes.

Best,

Jay
Comment 24 Ben Roberts 2025-02-17 11:12:25 MST
Hi Jay,

I looked at the logs you sent and I see that there are some errors that show up that look like this:
Feb 10 11:43:01 r912u23n01 slurmdbd[157721]: slurmdbd: error: Processing last message from connection 13(10.181.0.8) uid(0)
Feb 10 11:43:01 r912u23n01 slurmdbd[157721]: error: Processing last message from connection 13(10.181.0.8) uid(0)

I assume that 10.181.0.8 is your scheduler.  Is that right?  It doesn't have details of what was in the messages that generated these errors, but they could definitely be related to a mismatch.

It is possible that the size of your database could be a related factor if you have a high-throughput type of workflow.  How many jobs do you typically run in a day?  

You're right that specifying 90 days, to have it do a nightly purge, would probably be a better way of cleaning up old information than having it do the purge monthly.  If your job and step tables are very large (in the millions of records) then it would be good to reduce the amount of time you keep those records.  You should reduce that amount of time gradually rather than trying to purge two months' worth of records in one night.  I would recommend starting with a change from 3 months to 97 days (which only purges about one week's worth of data, since we're roughly halfway through the month), then the next night you could reduce it to 90, and so forth until you're at the point you would like to stay at.

I also notice that you are configured to store job script and environment information in the database:
AccountingStoreFlags=job_script,job_env

If you are having database performance related issues, this could be a factor.  This would especially be the case if some of your users have relatively large job scripts.  Do you have a hard requirement to store job scripts in the database?
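
If it turns out you don't have a hard requirement for the scripts, the flag could be trimmed to keep only what you use, for example (just an illustration, not a recommendation yet):
  AccountingStoreFlags=job_env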

Thanks,
Ben
Comment 25 jay.kubeck 2025-03-17 12:31:44 MDT
Good afternoon,

Thank you for the continued efforts to get this issue resolved. 

Concerning storing the job scripts - 

"I also notice that you are configured to store job script and environment information in the database:
AccountingStoreFlags=job_script,job_env

If you are having database performance related issues, this could be a factor.  This would especially be the case if some of your users have relatively large job scripts.  Do you have a hard requirement to store job scripts in the database?"

It was our understanding that this is a supported feature, and one that is useful for our site. Do you think it is causing this issue to pop up regularly, and if so, is there something we can tweak in the config to continue saving the job scripts without the database lagging? 

Also, if you recommend this change I'll implement it, as it's been approved by our research staff-

before -
# Purge database - keep 3 months max
PurgeEventAfter=3months
PurgeJobAfter=3months
PurgeResvAfter=2months
PurgeStepAfter=2months
PurgeSuspendAfter=1month
PurgeTXNAfter=3months
PurgeUsageAfter=3months


after- 
# Purge database - first step: keep ~97 days max
PurgeEventAfter=97days
PurgeJobAfter=97days
PurgeResvAfter=60days
PurgeStepAfter=60days
PurgeSuspendAfter=30days
PurgeTXNAfter=97days
PurgeUsageAfter=97days


And then we'll change the 97days values to 90days the following night. 

Please let me know if that looks correct.

Best,

Jay
Comment 26 Ben Roberts 2025-03-18 14:54:21 MDT
Hi Jay,

I think that changing the purge times to use days rather than months is a good place to start.  If you make that change and then continue to see performance issues, it would be good to look at the job scripts that are being saved.  Let me know how things look after making the change to days.

Thanks,
Ben
Comment 27 jay.kubeck 2025-03-20 09:15:01 MDT
Good morning,

I've implemented the purge change described above in hopes that it solves the issue. If it returns I will let you know. 

Thank you very much for your continued help!

Best,

Jay
Comment 28 Ben Roberts 2025-03-20 10:09:32 MDT
Sounds good, I'll wait to hear how things look after making this change.

Thanks,
Ben