Ticket 21698 - Jobs not writing to the database
Summary: Jobs not writing to the database
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 23.11.10
Hardware: Linux
Severity: 2 - High Impact
Assignee: Chad Vizino
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2024-12-19 10:27 MST by Nicole
Modified: 2025-04-03 08:31 MDT
CC: 8 users

See Also:
Site: Sandia National Laboratories
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
image001.png (4.36 KB, image/png) - 2025-03-27 11:42 MDT, Nicole
image002.png (1.61 KB, image/png) - 2025-03-27 11:42 MDT, Nicole
image003.png (1.83 KB, image/png) - 2025-03-27 11:42 MDT, Nicole
image004.png (1.73 KB, image/png) - 2025-03-27 11:42 MDT, Nicole
image005.png (2.30 KB, image/png) - 2025-03-27 11:42 MDT, Nicole
image006.png (1.54 KB, image/png) - 2025-03-27 11:42 MDT, Nicole
image007.png (1.63 KB, image/png) - 2025-03-27 11:42 MDT, Nicole
Logs after Job submitted (18.14 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document) - 2025-03-27 16:43 MDT, Nicole
job log - No Grep'ing (52.97 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document) - 2025-04-03 08:31 MDT, Nicole

Description Nicole 2024-12-19 10:27:13 MST
Good Morning,
I'm looking for any help regarding an issue we have seen three times in the last two weeks; we have never encountered it before. We have a common database for multiple clusters, and in the last two weeks 2 of our 6 clusters randomly stopped writing jobs to the common database. Sometimes days have gone by before we realize this. slurmctld is running just fine and everything is logging fine on the clusters; the only reason we know there is an issue is that squeue and sacct are not in sync. Once we restart slurmctld on the cluster I am able to see runaway jobs and fix them, but prior to restarting slurmctld I show 'no runaway jobs'. Once I restart, everything seems to work fine again. This is a major issue, as I need detailed job information for reporting to our management; when it died over the weekend, I lost 3 days of job information. Please let me know what I can look for to detect when this happens so I can immediately restart slurmctld, and please help me try to figure out why this is happening.

Also, let me note that we are at Sandia National Labs and the person who clears logs for release is on Christmas vacation, so I would prefer to be told what to look for rather than having to send over logs. Also, if there is anything you can think of that I could query when this happens, I could immediately restart slurmctld. We already have alerts set up for when slurmctld stops running, but in this case it doesn't stop; it basically just breaks.

 Thank You! Nicole
Comment 1 Stephen Kendall 2024-12-19 16:33:57 MST
There may be log entries in the 'slurmctld', 'slurmdbd', and/or MySQL/MariaDB logs indicating communication issues. Common causes of such issues are networking problems, daemon failure, or an overwhelmed DB server. Without visibility into the logs, our ability to diagnose is limited, but hopefully this helps as you look for signs in the logs.

Let me also explain the intended behavior in these cases. When the 'slurmctld' is unable to send messages to the 'slurmdbd', those messages are held in the "DBD Agent Queue", which can be displayed with the 'sdiag' command. This queue is limited by the value of 'MaxDBDMsgs', which is at least 10,000. This value can be increased at some cost to memory utilization.
https://slurm.schedmd.com/slurm.conf.html#OPT_MaxDBDMsgs

When the agent queue reaches a large enough size, additional error messages will be printed in the 'slurmctld' log, and once it fills up completely the configured 'SlurmctldParameters=max_dbd_msg_action' will be taken. By default, pending messages will be discarded starting with the least important types as explained in the docs. (This may be the reason for the runaway jobs.) In your case perhaps the accounting information is important enough that it is preferable to stop the 'slurmctld' when filled so that it can hold onto all the currently pending messages until communication is restored.
https://slurm.schedmd.com/slurm.conf.html#OPT_max_dbd_msg_action
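As a rough illustration only (these values are invented for this sketch, not recommendations, and should be checked against the linked docs), the two settings above live in slurm.conf:

```conf
# slurm.conf fragment -- illustrative values only
MaxDBDMsgs=100000                            # floor is 10,000; raising it costs slurmctld memory
SlurmctldParameters=max_dbd_msg_action=exit  # default 'discard' drops the least important pending messages
```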

In any case, it is good to have some monitoring of errors in the log files and of the DBD Agent Queue size to help identify these issues early on.
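A minimal sketch of the monitoring Stephen suggests, assuming a cron-style check; the threshold is an invented example, and the here-doc sample stands in for live `sdiag` output:

```shell
# Parse the "DBD Agent queue size" line out of sdiag-style output.
dbd_queue_size() {
  awk -F': *' '/^DBD Agent queue size/ {print $2}'
}

THRESHOLD=1000   # example alert threshold, not a recommendation

# In production this would be: size=$(sdiag | dbd_queue_size)
size=$(dbd_queue_size <<'EOF'
Server thread count:  2
Agent queue size:     0
DBD Agent queue size: 0
EOF
)

if [ "$size" -gt "$THRESHOLD" ]; then
  echo "ALERT: DBD agent queue size is $size"
else
  echo "OK: DBD agent queue size is $size"
fi
```

Wired into existing alerting, a check like this could flag a stuck slurmctld-to-slurmdbd path before days of accounting data diverge.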

Best regards,
Stephen
Comment 2 Nicole 2024-12-24 12:05:58 MST
Thank you, Stephen, for all of this; it is very helpful.


I did look through the logs and saw zero communication errors, whether from the database, the network, or a daemon.
I will send logs the next time it happens, and hopefully you can help me get to the bottom of it. I will keep an eye on sdiag when it happens.

I verified MaxDBDMsgs is set to the value SchedMD recommends. There are no messages in the slurmctld log, or anywhere else, indicating it filled up.
We actually use Splunk to look for that exact error and have tested it in the past to make sure we are notified when MaxDBDMsgs is hit.

We continue to see a mismatch in what the cluster and the database (squeue vs sacct) see. Here is an example. I didn't want to restart or fix anything until you could give me direction on what would cause the below, or how to prevent it from happening:
(squeue vs sacct: a failed/cancelled job that squeue thinks is still running and holding those nodes)
[root@clusteradmin1 ~]# squeue -j 1234
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   1234     batch interact   xyz  R 1-20:10:55     96 cl[222-230,245-255,505-516,931-942,947-959,1108-1121,1302-1314,1336-1347]
[root@clusteradmin1 ~]# sacct -j 1234
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1234      interacti+      batch   account       4608     FAILED      1:0
1234.bat+      batch              account         48     FAILED      1:0
1234.0    name.202+              account        4608 CANCELLED+      0:9
1234.1    name.202+              account       4608 CANCELLED+      0:9
1234.2     name.202+              account       4608     FAILED      0:0

-Nicole



Comment 3 Ben Roberts 2024-12-31 09:11:04 MST
Hi Nicole,

I was looking at this ticket and had a possibility cross my mind.  It's possible that the job being displayed by sacct is a previous instance of the job.  Slurm will reuse job ids when it gets to the maximum allowable value, or if something happens to reset the job ids.  You might want to have sacct show some additional details to verify that the job information you're looking at is from the job you expect.  I would recommend running something like this:

sacct -D -j 1234 --format=jobid,jobname,partition,state,start,end

Can you send the output of that command?

Thanks,
Ben
Comment 4 Nicole 2025-01-02 10:03:05 MST
Good Morning,
Unfortunately this is no longer showing in squeue (I'm not sure if one of my team members manually scancelled it or it cleaned up on its own), but I'm still interested in chasing the mismatch and figuring out why this might have happened, to prevent it from happening again.
We do set starting job numbers to make sure job numbers don't step on each other.

We are still having some mis-matches and I’ll reply after I get more information.
Here is the current information you requested:


Here is the result as requested. It is a current job

[root@mzadmin1 ~]# sacct -D -j 1234 --format=jobid,jobname,partition,state,start,end

JobID           JobName  Partition      State               Start                 End

------------ ---------- ---------- ---------- ------------------- -------------------

1234      interacti+      batch    TIMEOUT 2024-12-22T15:45:52 2024-12-24T15:46:56

1234.bat+      batch                FAILED 2024-12-19T08:55:55 2024-12-19T13:14:20

1234.0    abc.202+            CANCELLED+ 2024-12-23T11:14:17 2024-12-23T16:44:23

1234.1    abc.202+            CANCELLED+ 2024-12-23T20:19:27 2024-12-23T20:20:52

1234.2    abc.202+            CANCELLED+ 2024-12-23T20:21:05 2024-12-24T13:54:46



Comment 5 Ben Roberts 2025-01-02 13:23:06 MST
Thanks for sending that sacct output.  It does confirm that there isn't a second record of this job id in the database from the job being requeued that might have a different status.  

If you are seeing new cases of the same behavior I'd like to have you send some logs.  If you can identify a newer job that has a status mismatch, then collect the 'scontrol show job' and 'sacct' output for that job to show the discrepancy.  The scontrol output will show the 'SubmitTime' for the job.  I would like to have you collect the slurmctld logs that cover just a little before the submit time of that job to the present.  Please send all those logs rather than just grepping for the job id.  There are frequently relevant log entries that don't include the job id that I would need to see.  If you could do the same for the slurmdbd logs that would be great.

Hopefully the logs will give some indication of what is happening.  It's possible that these logs will provide a clue, but it may require higher level logging to find the exact cause.  We can look at increasing the log level later, if necessary.
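The SubmitTime lookup described above can be scripted; a minimal sketch, where the job id and the here-doc sample are stand-ins for real `scontrol show job` output:

```shell
# Pull SubmitTime out of scontrol-style output so the log collection
# window can start just before it.
submit_time() {
  grep -o 'SubmitTime=[^ ]*' | head -1 | cut -d= -f2
}

# In production this would be: scontrol show job <jobid> | submit_time
submit_time <<'EOF'
JobId=1234 JobName=example
   SubmitTime=2024-12-17T08:13:49 EligibleTime=2024-12-17T08:13:49
EOF
```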

Thanks,
Ben
Comment 6 Nicole 2025-01-02 13:58:21 MST
Well, this is different but along the same lines: a squeue and sacct mismatch. This morning we saw this job running (another cluster, same database).

The job log showed it completed:
[root@admin1 slurm]# grep -i 128..... slurm-joblog.txt

JobId=128.....
JobState=COMPLETED Partition=batch TimeLimit=901 StartTime=2025-01-01T20:49:03 EndTime=2025-01-02T08:56:55 


[root@admin1 slurm]#  sacct -D -j 128..... --format=jobid,jobname,partition,state,start,end

JobID           JobName  Partition      State               Start                 End 

------------ ---------- ---------- ---------- ------------------- ------------------- 

[root@swaadmin1 slurm]# sacctmgr show runaway

Runaway Jobs: No runaway jobs found on cluster attaway

Why wouldn't a job that was completed show up with the sacct command?

Thanks,
Nicole
Comment 7 Ben Roberts 2025-01-03 08:36:38 MST
Hi Nicole,

You're right that you should see a record for this job in the database if it has completed.  I know Stephen mentioned looking at the 'DBD Agent queue size' in comment 1.  Does that show that there are messages destined for slurmdbd that aren't able to make it?  

Are you able to send logs for us to review?  

Thanks,
Ben
Comment 8 Nicole 2025-01-03 08:59:16 MST
Hi Ben,
Yes, I recalled what Stephen mentioned about the DBD agent queue. I immediately ran sdiag on our admin node and saw the following:
Server thread count:  2
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Our site is very sensitive about log data regarding job information. I'll have to sanitize it. Can you be specific about a timeframe you would like to see based on the information below and which logs you would like to see?
Thank You,
Nicole
Comment 9 Ben Roberts 2025-01-03 10:03:27 MST
That's good that sdiag doesn't show messages queued.  I understand that sharing logs can be challenging for privacy reasons.  We can start by just looking for errors that might show up in the logs around the time this job was started and when it ended.  Those time stamps show up in the output you sent in comment 6.

If there were errors sending information to slurmdbd from slurmctld, then I would expect to see errors like this in the slurmctld logs:

[2025-01-03T10:34:37.896] error: _open_persist_conn: failed to open persistent connection to host:localhost:6819: Connection refused
[2025-01-03T10:34:37.896] error: Failure sending message: 11: Connection refused


But if those were the errors you were seeing I would also expect to see messages queued in the sdiag output.

It would also be good to look for any errors in the slurmdbd log file.  If you see any "error" messages around these times in either log file then we can start there and see if we need to dig deeper into what's happening.

Thanks,
Ben
Comment 10 Nicole 2025-01-03 10:12:14 MST
Wonderful, I'll start working on that right now.
-Nicole
Comment 11 Nicole 2025-01-08 12:55:39 MST
Ben,
I'm still working on those logs, but during our monitoring of squeue and sacct mismatches I would like to know if what I am seeing is expected or if I should see something different.

Job abc123 was submitted and cancelled last month; sacct shows cancelled.
Job abc123 was resubmitted this week and shows running in squeue, but still cancelled in sacct. I would think the status would update to running in sacct. Please confirm. Here is the printout:

[root@clusteradmin1 scripts]# sacct -D -j abc123 --format=jobid,jobname,partition,state,start,end,user
JobID           JobName  Partition      State               Start                 End      User 
------------ ---------- ---------- ---------- ------------------- ------------------- --------- 
Abc123     /jobname/+      batch CANCELLED+ 2024-12-28T07:21:33 2024-12-28T07:21:33   
Abc123.ba+      batch             CANCELLED 2025-01-08T00:52:23 2024-12-28T07:21:33           
Abc123.0        orted             CANCELLED 2025-01-08T00:52:23 2024-12-28T07:21:33           

[root@clusteradmin1 scripts]# squeue -j abc123 
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  Abc123     batch /jobname  user123  R   12:01:18     16 nodes[1-16]

-Nicole
Comment 12 Ben Roberts 2025-01-08 14:12:57 MST
If you have a job that was cancelled and requeued, then I would expect to see multiple entries for that job from sacct.  The '-D' flag tells sacct to show duplicate entries for the job you specify.  When a job is requeued there will be multiple instances of that job in sacct, one for each time it was requeued.  

The sacct output you show for this job is a little confusing though.  It looks like the start time for the steps was today, but the end time was Dec 28.  Do you have first hand knowledge about the life of this job?  Do these dates align with the original submit time of the job or when it was able to start again after being requeued?  

Thanks,
Ben
Comment 13 Nicole 2025-01-08 14:20:00 MST
yes that's correct it was restarted today...
sacct shows 2025-01-08T00:52:23 2024-12-28T07:21:33  as start/end

scontrol show job Abc123
SubmitTime=2024-12-17T08:13:49 EligibleTime=2024-12-17T08:13:49
   AccrueTime=2024-12-17T08:13:49
   StartTime=2025-01-08T00:52:23 EndTime=2025-01-09T06:52:2


-Nicole
Comment 14 Nicole 2025-01-08 14:25:09 MST
Ben,
The job completed, and sacct now reflects just the new start and stop time with no duplicates. In other words, I cannot see the previous instance of this job, but from your statement it seems like I should see both instances: the cancelled one and now the updated one.

please see new entry:

sacct -D -j abc123 --format=jobid,jobname,partition,state,start,end,user

JobID           JobName  Partition      State               Start                 End      User 
------------ ---------- ---------- ---------- ------------------- ------------------- --------- 
Abc123     /jobname/+      batch  COMPLETED 2025-01-08T00:52:23 2025-01-08T13:06:12   user1 
Abc123.ba+      batch             COMPLETED 2025-01-08T00:52:23 2025-01-08T13:06:12           
Abc123.0        orted             COMPLETED 2025-01-08T00:52:23 2025-01-08T13:06:12           

Nicole
Comment 15 Ben Roberts 2025-01-09 10:47:08 MST
That's interesting that you're not seeing a duplicate entry for this requeued job.  Can I have you go into the database and query it directly?  I would like to see the following things (although you can still edit the jobid) for the example job you're looking at.  Here's an example job that I requeued.  The second instance doesn't have an end time because I didn't wait for it to complete.  The important things are that each instance of the job should have a unique job_db_inx and a count of the number of times it restarted.

select job_db_inx,id_job,restart_cnt,time_start,time_end from knight_job_table where id_job='8555';
+---------------------+--------+-------------+------------+------------+
| job_db_inx          | id_job | restart_cnt | time_start | time_end   |
+---------------------+--------+-------------+------------+------------+
| 9797107308264082432 |   8555 |           0 | 1736444076 | 1736444087 |
| 9797107308280664064 |   8555 |           1 | 1736444288 |          0 |
+---------------------+--------+-------------+------------+------------+
2 rows in set (0.000 sec)


Bear in mind that you will need to substitute the name of your cluster where I have 'knight' for the name of the job table and the appropriate job id.

Thanks,
Ben
Comment 16 Nicole 2025-01-14 13:32:38 MST
Hi Ben,
See below.
Any idea why our job table wouldn't have the restart_cnt field?

I'm not sure how all that works; I'm fairly new to exploring the Slurm database outside of sacct commands.

Unknown column 'restart_cnt' in 'SELECT'

MariaDB [slurm_acct_db]> select job_db_inx,id_job,time_start,time_end from attaway_job_table where id_job='abc123';
+------------+----------+------------+------------+
| job_db_inx | id_job   | time_start | time_end   |
+------------+----------+------------+------------+
|    4xxxxxx | abc123 | 1736322743 | 1736366772 |
+------------+----------+------------+------------+


-Nicole

Comment 17 Ben Roberts 2025-01-15 10:04:07 MST
My apologies about the restart_cnt field. That was added in 24.11.

I think it would be worthwhile to double-check a couple of basic things. It's possible that job records are being purged soon enough that it's causing confusion, or it could be that the tables are getting large enough to cause other problems. Can I have you send the output of the following commands? (For the database query, you would need to replace 'slurm_2411' with the name of your database.)

sacctmgr show config | grep Purge

SELECT table_name 'Table Name', ROUND(((data_length + index_length) / 1024 / 1024), 2) 'Size in MiB' FROM information_schema.tables WHERE table_schema = 'slurm_2411' ORDER BY (data_length + index_length) DESC;


Thanks,
Ben
Comment 18 Nicole 2025-01-15 11:48:12 MST
Ben,
I can't remember if I mentioned: we have multiple large clusters on the same database. (I pulled out just this one particular cluster, which we have had the most issues with, and am sharing its table names.)


| Table Name                      | Size in MiB |
+---------------------------------+-------------+
| clustername_step_table              |    23776.00 |
| clustername_job_table               |     2848.83 |
| clustername_assoc_usage_hour_table  |      256.63 |
| clustername_assoc_usage_day_table   |       32.02 |
| clustername_usage_hour_table        |       28.06 |
| clustername_event_table             |        8.55 |
| clustername_assoc_usage_month_table |        5.02 |
| clustername_assoc_table             |        3.13 |
| clustername_usage_day_table         |        1.73 |
| clustername_resv_table              |        0.08 |
| clustername_usage_month_table       |        0.05 |
| clustername_suspend_table           |        0.05 |
| clustername_job_env_table           |        0.03 |
| clustername_job_script_table        |        0.03 |
| clustername_wckey_usage_day_table   |        0.03 |
| clustername_wckey_table             |        0.03 |
| clustername_wckey_usage_month_table |        0.03 |
| clustername_wckey_usage_hour_table  |        0.03 |
| clustername_last_ran_table          |        0.02 |
| slurmdb_step_table                |        0.03 |
| tres_table                        |        0.03 |
| slurmdb_job_script_table          |        0.03 |
| res_table                         |        0.03 |
| cluster_table                     |        0.02 |
| convert_version_table             |        0.02 |
| federation_table                  |        0.02 |
| slurmdb_last_ran_table            |        0.02 |


sacctmgr show config | grep Purge

PurgeEventAfter        = NONE

PurgeJobAfter          = NONE

PurgeResvAfter         = NONE

PurgeStepAfter         = NONE

PurgeSuspendAfter      = NONE

PurgeTXNAfter          = NONE

PurgeUsageAfter        = NONE
Comment 19 Ben Roberts 2025-01-17 10:52:19 MST
Thanks for that output. It doesn't look like a case of records being purged too quickly, but the opposite problem could be coming into play. The step table for this cluster is about 23 GiB. If the other clusters have tables of a similar size, that could cause some slowness in the database, which might account for the strange behavior with some of these job records. Without detailed logs from slurmdbd I can't say for sure that this is what's happening, but it's a good possibility.

Are you keeping old job records around because you run reports on the older data?  If so you may want to consider setting up an archive server to house older records.  We have some information in our documentation about setting up an archive server:
https://slurm.schedmd.com/accounting.html#archive

Once you have configured an archive server you will want to gradually purge the old records from your current database to move them to the archive server.  There are options in the slurmdbd.conf to purge different types of records, as you can see in the output you sent.  The types of records you would want to focus on in your case are steps and jobs.  You can find information on these parameters starting here:
https://slurm.schedmd.com/slurmdbd.conf.html#OPT_PurgeEventAfter

Along with the purge options, there are options to enable the archiving of the records that are purged.  When you enable the archiving of these records then slurmdbd will write the records that are removed to a file that you can then transfer to the archive server and load there.  The documentation for the Archive options start here:
https://slurm.schedmd.com/slurmdbd.conf.html#OPT_ArchiveEvents
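A hedged slurmdbd.conf sketch combining the purge and archive options described above (retention periods and the directory are placeholders, not recommendations):

```conf
# slurmdbd.conf fragment -- illustrative values only
ArchiveDir=/var/spool/slurm/archive   # purged records are written here for the archive server
ArchiveJobs=yes
ArchiveSteps=yes
PurgeJobAfter=12month
PurgeStepAfter=6month
```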

When you first start purging records it's not a good idea to try to purge a huge block of time at once.  This can take a long time to complete and will prevent other updates from happening while it works on the purge.  We generally recommend looking at your oldest record and calculating the number of days that have elapsed since that time.  Then you can set the purge time to 7 days more recent than that record so that the nightly purge only has to do 7 days worth of records at once.  You can look at the logs to see how long that purge took and decide if it makes sense to decrease the next night's purge by 7 days again, or increase the window of time further.  

It can take some time to get the window of time down to where you want to keep it long term, but once it's there it should be pretty easy to maintain.  
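The window arithmetic in the recommendation above can be sketched as follows; the oldest-record epoch is a placeholder that would in practice come from something like a MIN(time_submit) query on the job table:

```shell
# Start the purge window 7 days more recent than the oldest record,
# so the first nightly purge only covers ~7 days of rows.
purge_start_days() {
  # $1 = epoch of oldest record, $2 = current epoch
  echo $(( ($2 - $1) / 86400 - 7 ))
}

oldest=1700000000   # placeholder; not real data from this site
echo "Initial purge window: $(purge_start_days "$oldest" "$(date +%s)") days"
```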

Do you have a machine you can use as an archive server?  Does this sound like something you can move towards?

Thanks,
Ben
Comment 20 Nicole 2025-01-27 08:14:20 MST
Ben,
Thank you. We will discuss an archive server as a team and let you know if that seems to be an option.
I'm going to work on getting you some more detailed logs from slurmdbd as this continues to occur.
We continue to see oddities in the differences between squeue and sacct.
Below is an example I'm hoping you can help me understand. This job is no longer running and is no longer seen by squeue; however, sacct thinks it is resizing. Can you please explain this? I read the Slurm documentation on resizing and it didn't seem to match what could be happening with this job. Is there something I can do to get it out of this state? It has been 3 days since it ended.

JobID           JobName  Partition      State               Start                 End      User
------------ ---------- ---------- ---------- ------------------- ------------------- ---------
1111          jobname      batch   RESIZING 2025-01-24T09:09:55 2025-01-24T09:16:52   user
1111.bat+      batch             CANCELLED 2025-01-24T09:09:55 2025-01-24T09:16:52
1111.0         orted              RESIZING 2025-01-24T09:09:55 2025-01-24T09:16:52
1111          jobname      batch    TIMEOUT 2025-01-24T09:09:55 2025-01-26T09:11:24   user

-Nicole
Comment 21 Ben Roberts 2025-01-27 09:38:28 MST
Hi Nicole,

You should be able to fix jobs that are in an unexpected state by running "sacctmgr show runawayjobs".  This will look for jobs that are not in a completed state in the database, but have completed on the controller side, and mark them as completed.  

I was talking with a colleague about the behavior you're seeing, and he suspects there might be an issue with connections timing out. It could be the connection between slurmctld and slurmdbd, or the connection between slurmdbd and MySQL. Without the logs we still don't have evidence that this is what's happening, but it could explain why jobs aren't being updated. Can you look in the logs for messages about a connection being dropped, or about a daemon trying to use a connection that is no longer there?

If you do see logs like this, there are KeepAlive* parameters that you can set in the slurm.conf and slurmdbd.conf files.  These let you control how long/often to send keepalive messages.  You can read more about these here:
https://slurm.schedmd.com/slurm.conf.html#OPT_CommunicationParameters
https://slurm.schedmd.com/slurmdbd.conf.html#OPT_CommunicationParameters
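For orientation only, keepalive tuning is set through CommunicationParameters in both files; the sub-option names and numbers below are assumptions and should be verified against the linked docs for your version:

```conf
# slurm.conf / slurmdbd.conf fragment -- illustrative, verify against the docs
CommunicationParameters=keepaliveinterval=30,keepaliveprobes=3,keepalivetime=120
```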

Also, the ticket shows that you are on 23.11.0.  Is that right?  There were improvements made to these parameters, starting in 23.11.8.  If you are going to use them I would recommend you upgrade to at least the latest version in that branch (23.11.10), or preferably upgrade to 24.05 or 24.11.  Is that possible for you in the near future?

Thanks,
Ben
Comment 22 Nicole 2025-01-27 10:45:23 MST
Sorry, correction. We are on 23.11.10.
-I will check those logs for communication failures.
-Nicole

Comment 23 Nicole 2025-01-27 11:06:51 MST
My apologies, we are on version: 23.11.10
And we will be doing a major update soon.


Also, there are no runaway jobs. Please let me know if you have another suggestion; it is still showing resizing:

sacctmgr show runaway

Runaway Jobs: No runaway jobs found on cluster clustername:

-Nicole
Comment 24 Ben Roberts 2025-01-27 14:24:16 MST
Ok, thanks for confirming the version.  I'm updating the ticket to reflect that.

That's interesting that it isn't showing up as a runaway job.  I assume that it doesn't show up if you do 'scontrol show job <jobid>'.  Can you confirm that's true?  

I'm looking at the sacct output you sent in comment 20.  It looks like there are two entries for the job allocation (not counting job steps).  Did you gather this output by including the '-D' flag for sacct?  If you run the same command without the '-D' what does it show?

Thanks,
Ben
Comment 25 Nicole 2025-01-29 10:19:23 MST
Ben,

Correct, no result from scontrol show job (because it's not in the queue).

I ran the command with and without the -D and got the same results.  
The same job Started AND Ended on the 24th and is in state 'resizing'.
Looks like it was re-run on the 26th and is in state 'Timeout'.

Is state resizing something that should remain in the database? I guess that's my bottom line question. Should it eventually no longer say resizing?

I searched back through 2024 doing the following command:

sacct -s resizing --starttime=2024-01-01 --endtime=2025-01-01 --cluster=xyz --format=jobid,jobname,partition,state,start,end 

and found plenty of resizing jobs all the way back to January 2024 (I didn't search further)



Thanks,
Nicole
Comment 26 Ben Roberts 2025-01-30 12:22:11 MST
Hi Nicole,

I experimented with a few different scenarios and I am able to reproduce the behavior you're seeing with job records showing up as resizing.

I submitted a job that requested 3 nodes.

$ sbatch -N3 --exclusive --wrap='srun sleep 300'
Submitted batch job 8777

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8777     debug     wrap      ben  R       0:02      3 node[25-27]




While it was running I updated the job, removing one of the nodes from its node list.  This left the job record in a state similar to what you show in comment 20 where there is a job record with a 'RESIZING' state.

$ scontrol update jobid=8777 nodelist=node25,node26

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)


$ sacct -j 8777
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
8777               wrap      debug       sub1         72   RESIZING      0:0 
8777.batch        batch                  sub1         24   RESIZING      0:0 
8777.extern      extern                  sub1         72   RESIZING      0:0 
8777.0            sleep                  sub1         72  CANCELLED     0:15 
8777               wrap      debug       sub1         48     FAILED     15:0 



This doesn't look like an unexpected state for job records, which is why it wasn't identified as a runaway job.  So I believe the latest status on the issue of some records not being written to the database is that you were trying to sanitize some logs for me to review.  Does it look like you will be able to send some sanitized logs?

Thanks,
Ben
Comment 27 Nicole 2025-01-30 12:34:40 MST
Thank you for confirming that about resizing.
Actually our database has behaved for the last month. I plan to wait for it to happen again and dig deep into the logs rather than trying to go back. Would you like to close the ticket and have me reopen it when it happens again?
I'm not confident we are getting everything in the database, but every morning we compare squeue and sacct to make sure they are in sync.
Thank You,
Nicole
Comment 28 Ben Roberts 2025-01-31 09:09:35 MST
I'm glad to hear that you have a plan to monitor for inconsistencies.  I will point out that it is possible for there to be a delay between what squeue shows and what's in the database.  For example, if a job is started then squeue should show that pretty immediately, but it could take a little time for that update to be sent to the database, which could look like inconsistencies if these commands are both run at exactly the same time.  You may have already seen this and taken it into account, but I wanted to make sure you're aware of it.
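
The morning squeue/sacct comparison described here can be scripted. A minimal sketch (the flag choices are assumptions based on standard squeue/sacct usage; given the accounting delay noted above, treat only a persistent mismatch as a problem):

```shell
#!/bin/bash
# Sketch: list running job IDs known to squeue but missing from sacct.
# Guarded so it exits quietly on hosts without Slurm client tools.
if command -v squeue >/dev/null && command -v sacct >/dev/null; then
    squeue -h -t RUNNING -o '%A' | sort -u > /tmp/squeue_ids
    sacct -n -X -s R -o jobid | tr -d ' ' | sort -u > /tmp/sacct_ids
    # IDs present in squeue but absent from sacct; a persistent nonzero
    # result matches the desync in this ticket (a brief lag is normal).
    comm -23 /tmp/squeue_ids /tmp/sacct_ids
fi
```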

I do think that closing the ticket for now and reopening it if this comes up again is a good idea.  Any logging you're able to provide if it does happen again would be helpful.

Thanks,
Ben
Comment 29 Nicole 2025-03-26 07:29:48 MDT
HI Ben,
Can we reopen this ticket?
We are seeing this state again.
I can restart slurm on the database server and on the cluster and probably get it working again, but we would lose these jobs.  I'd like to get any information and logs you'd be curious about right away.  However, we can't keep this going for very long or it affects our metrics, so if you could please get back to me soon that would be great.

Right now we have 54 jobs that are in a state different in squeue than from sacct.  All our other clusters that use the same database are running fine and have a 0 mismatch. We have no runaway jobs at this time. (we did have a few runaway jobs but those have all been fixed)

Here is one of many examples: (You can see it has been running for 2 hours so sacct should be caught up)

[root@cluster slurm]# sacct -D -j X516968 --format=jobid,jobname,partition,state,start,end
JobID           JobName  Partition      State               Start                 End 
------------ ---------- ---------- ---------- ------------------- ------------------- 
X516968         jobname      batch    PENDING             Unknown             Unknown 



[root@cluster slurm]# squeue -j X516968
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   X516968     batch  jobname  Username  R    2:00:13     13 cluster[96,133,162,232,355,656,830,967,994-995,1000,1027,1085]

-Nicole
Comment 30 Nicole 2025-03-26 07:45:59 MDT
I should also add that sdiag does not show that it isn't in sync with sacct.
The dbd agent queue is empty:
Server thread count:  2
Agent queue size:     0
Agent count:          1
Agent thread count:   12
DBD Agent queue size: 0
Comment 31 Nicole 2025-03-26 07:53:34 MDT
You asked for an scontrol show job as well:
here you go:

[root@cluster slurm]# scontrol show job xx16968
JobId=xx16968 JobName=jobname
   UserId=user(1234) GroupId=username(1234) MCS_label=N/A
   Priority=1629640 Nice=0 Account=fy1234 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=02:23:17 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2025-03-17T12:55:42 EligibleTime=2025-03-26T01:12:00
   AccrueTime=2025-03-26T01:12:00
   StartTime=2025-03-26T05:25:13 EndTime=2025-03-28T05:25:13 Deadline=N/A
   PreemptEligibleTime=2025-03-26T05:25:13 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-03-26T05:25:13 Scheduler=Backfill
   Partition=batch AllocNode:Sid=cluster-login3:153351
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cluster[96,133,162,232,355,656,830,967,994-995,1000,1027,1085]
   BatchHost=cluster96
   NumNodes=13 NumCPUs=624 NumTasks=13 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=13,mem=2437.50G,node=13,billing=13
   AllocTRES=cpu=624,mem=2437.50G,node=13,billing=624
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=187.50G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=mylicense Network=(null)
Comment 32 Ben Roberts 2025-03-26 08:40:14 MDT
Hi Nicole,

I'm sorry to hear this has popped up again.  If you can capture some logs for one of these jobs that would be great.  Can you run these commands to enable some logging that should show what is happening to the request that gets sent to slurmdbd.
scontrol setdebug debug2
scontrol setdebugflags +dbd_agent


It would be nice to have this for when a job starts running and that message doesn't get relayed to slurmdbd, but if I remember right, the completion doesn't get picked up either, at least for some jobs.  Hopefully you can capture a failure when a job completes and then turn the logging back down, like this:
scontrol setdebug info
scontrol setdebugflags -dbd_agent


Do you have a rough idea of how long it will be before one of these jobs completes?  Can you leave this logging on that long?

Thanks,
Ben
Comment 33 Nicole 2025-03-26 08:46:56 MDT
Ben,
We ran those commands.
I did a for each loop for the jobs to see when they start and end.
these two will end soon:
   JobID   Start                End
  x81737   2025-03-26T06:31:33  2025-03-26T10:31:33
   x81850   2025-03-26T07:21:26 2025-03-26T11:21:26

-Nicole
Comment 34 Nicole 2025-03-26 08:48:19 MDT
Ben,
I should add at this time, sacct does not know that either of these jobs exist.
I will keep an eye on them.
-Nicole
Comment 35 Nicole 2025-03-26 08:59:02 MDT
Ben,
Can you please see if there is a typo in the second command? It didn't work on our slurm controller node:

[root@clusteradmin1 mismatch]# scontrol setdebug debug2
[root@clusteradmin1 mismatch]# scontrol setdebugflags +dbd_agent
scontrol: error: Invalid DebugFlag: dbd_agent
invalid debug flag: +dbd_agent

-Nicole
Comment 36 Ben Roberts 2025-03-26 09:07:37 MDT
Ok, I'm glad that you've identified some jobs that shouldn't require you to leave the logging at debug levels for too long.  I assume that there are other jobs that are starting and stopping since you started seeing this, and those jobs are being recorded without a problem.  Is that right?  

I'll also note that I would like to see the slurmdbd logs that cover the time period when these jobs are ending.

I'm sorry, it looks like that debugflag wasn't added until 24.05.  There was a similar looking one named "agent" that I hope will get the same information.  

Thanks,
Ben
Comment 37 Nicole 2025-03-26 09:24:59 MDT
Ben,
Ok, I got the 'agent' debug flag set.

you asked 'I assume that there are other jobs that are starting and stopping since you started seeing this, and those jobs are being recorded without a problem.  Is that right?  '

No this is not right. Any new jobs are not known by the database.
Here is an example:

JobId=x82467 JobName=job2
   UserId=a(xxx) GroupId=a(xxx0) MCS_label=N/A
   Priority=1696556 Nice=0 Account=fyxxx QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:03:16 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2025-03-26T09:17:31

sacct -D -j x82467 --format=jobid,jobname,partition,state,start,end
JobID           JobName  Partition      State               Start                 End 
------------ ---------- ---------- ---------- ------------------- ------------------- 

ANY jobs starting, ending, or being submitted are not being updated in sacct.
I can restart the services on the database server and the controller and this will fix it , but I don't want to do that yet.

-Nicole
Comment 38 Ben Roberts 2025-03-26 11:34:17 MDT
Hi Nicole,

I wanted to send a quick note to see how this is going.  Let me know if you are having problems collecting those logs.

Thanks,
Ben
Comment 39 Nicole 2025-03-26 11:47:37 MDT
Ben,
Here is the output from the slurm controller. sacct still doesn't know about the job:

[root@clustseradmin1 slurm]# sacct -D -j xxx1737 --format=jobid,jobname,partition,state,start,end
JobID           JobName  Partition      State               Start                 End 
------------ ---------- ---------- ---------- ------------------- ------------------- 


[root@clusteradmin1 slurm]# grep -i xxx37 slurmctld.log
[2025-03-26T06:31:33.768] sched: _slurm_rpc_allocate_resources JobId=xxxxx37 NodeList=cluster[1-3,25-40] usec=21114
[2025-03-26T10:31:38.000] Time limit exhausted for JobId=xxxxx37
[2025-03-26T10:31:41.262] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster33 
[2025-03-26T10:31:41.289] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster36 
[2025-03-26T10:31:41.293] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster32 
[2025-03-26T10:31:41.330] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster35 
[2025-03-26T10:31:41.342] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster40 
[2025-03-26T10:31:41.343] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster37 
[2025-03-26T10:31:41.346] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster34 
[2025-03-26T10:31:41.355] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster29 
[2025-03-26T10:31:41.367] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster30 
[2025-03-26T10:31:41.389] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster28 
[2025-03-26T10:31:41.391] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster1 
[2025-03-26T10:31:41.400] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster38 
[2025-03-26T10:31:41.415] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster31 
[2025-03-26T10:31:41.420] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster26 
[2025-03-26T10:31:41.423] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster27 
[2025-03-26T10:31:41.431] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster25 
[2025-03-26T10:31:41.434] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster2 
[2025-03-26T10:31:41.440] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster3 
[2025-03-26T10:31:41.493] debug2: _slurm_rpc_epilog_complete: JobId=xxxxx37 Node=cluster39 
[2025-03-26T10:31:49.000] debug2: _purge_files_thread: purging files from JobId=xxxxx37
[2025-03-26T11:32:05.025] debug2: acct_policy_job_begin: after adding JobId=xx83683, assoc 3(nd/(null)/(null)) grp_used_tres_run_secs(cpu) is xxx
[2025-03-26T11:32:05.025] debug2: acct_policy_job_begin: after adding JobId=xx83683, assoc 3(nd/(null)/(null)) grp_used_tres_run_secs(billing) is xxx3
[2025-03-26T11:32:05.025] debug2: acct_policy_job_begin: after adding JobId=xx83683, assoc 1(root/(null)/(null)) grp_used_tres_run_secs(cpu) is xxx
[2025-03-26T11:32:05.025] debug2: acct_policy_job_begin: after adding JobId=xx83683, assoc 1(root/(null)/(null)) grp_used_tres_run_secs(billing) is xxx

-Nicole
Comment 40 Nicole 2025-03-26 11:49:49 MDT
Here is the joblog.txt

slurm-joblog.txt:JobId=xx37 UserId=user(xx1) GroupId=user (xxx71) Name=interactive JobState=TIMEOUT Partition=short,batch TimeLimit=240 StartTime=2025-03-26T06:31:33 EndTime=2025-03-26T10:31:37 NodeList=cluster[1-3,25-40] NodeCnt=19 ProcCnt=912 WorkDir=/home/ ReservationName= Tres=cpu=19,mem=3562.50G,node=19,billing=19 Account=fyxxx QOS=normal WcKey= Cluster=clustername SubmitTime=2025-03-26T06:31:33 EligibleTime=2025-03-26T06:31:33 DerivedExitCode=1:0 ExitCode=0:0
Comment 41 Nicole 2025-03-26 11:54:17 MDT
Here is the other job that completed:

slurmctld.log:[2025-03-26T07:21:26.547] sched: Allocate JobId=x1850 NodeList=cluster[5-8] #CPUs=192 Partition=short
slurmctld.log:[2025-03-26T11:21:27.000] Time limit exhausted for JobId=x1850
slurmctld.log:[2025-03-26T11:21:30.299] debug2: _slurm_rpc_epilog_complete: JobId=x1850 Node=cluster6 
slurmctld.log:[2025-03-26T11:21:30.340] debug2: _slurm_rpc_epilog_complete: JobId=x1850 Node=cluster8 
slurmctld.log:[2025-03-26T11:21:30.376] debug2: _slurm_rpc_epilog_complete: JobId=x1850 Node=cluster5 
slurmctld.log:[2025-03-26T11:21:30.381] debug2: _slurm_rpc_epilog_complete: JobId=x1850 Node=cluster7 
slurmctld.log:[2025-03-26T11:21:46.000] debug2: _purge_files_thread: purging files from JobId=x1850
-Nicole
Comment 42 Nicole 2025-03-27 07:06:59 MDT
Hi Ben,
I wanted to see if you got anything valuable from the log data I sent?
I didn’t think so and want to see if there’s anything else I can look at, We are at over 24 hour of missing job data which will have a negative impact on our metrics, not only do we have to report utilization hours (which I can estimate) but we categorize the type of work being done on the cluster for funding. I’d like to get it up and running as soon as possible with service restarts, but at the same time I don’t want to miss any findings on why this is happening.
Thank you,
Nicole
Comment 43 Nicole 2025-03-27 07:08:37 MDT
...As of now we have 517 jobs in the queue that sacct is not aware of.
-Nicole
Comment 44 Nicole 2025-03-27 11:17:18 MDT
Hi Ben,
I changed this to priority 2, High impact because we can't afford to lose any more jobs in the database.
Is there anything you can do to escalate this? It's been 24 hours since the last communication.
Thank you so much,
Nicole
Comment 45 Ben Roberts 2025-03-27 11:27:58 MDT
Hi Nicole,

My apologies for the delay.  I forgot that you're at a secure site and probably can't share the logs you gathered with me.  Frequently with debug logging, there are entries that don't include the job id.  You have to find the relevant section in the logs for the job you're interested in and look at the surrounding entries.  

I'm not sure how failures might look since I didn't have the problem on my side, but some of the log entries I see when slurmctld sends an update to slurmdbd look like this:
[2025-03-26T09:34:08.002] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
[2025-03-26T09:34:08.002] debug2: _slurm_rpc_node_registration complete for node32 usec=79
[2025-03-26T09:34:08.019] debug2: Spawning RPC agent for msg_type REQUEST_LAUNCH_PROLOG
[2025-03-26T09:34:08.019] debug2: Spawning RPC agent for msg_type REQUEST_BATCH_JOB_LAUNCH
[2025-03-26T09:34:08.065] debug2: Processing RPC: REQUEST_COMPLETE_PROLOG from UID=0
[2025-03-26T09:34:08.100] debug2: Processing RPC: REQUEST_HET_JOB_ALLOC_INFO from UID=1000
[2025-03-26T09:34:08.101] debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from UID=1000


There are similar entries when the job and steps complete.  Do you see logs like these when you had the debugflag enabled?  If you did, I was curious to see if there are any indications on the slurmdbd side of things that these requests are coming through, or if there are errors there.

Thanks,
Ben
Comment 46 Nicole 2025-03-27 11:42:01 MDT
Created attachment 41289 [details]
image001.png

I did a grep for those:

The first entry, node registration, is going good:

[2025-03-27T11:36:02.060] debug2: _slurm_rpc_node_registration complete for cl534 usec=30

Spawning is going good:

[2025-03-27T11:37:00.001] debug2: Spawning RPC agent for msg_type REQUEST_PING
[2025-03-27T11:37:00.001] debug2: Spawning RPC agent for msg_type REQUEST_NODE_REGISTRATION_STATUS

Prolog good:

[2025-03-27T11:30:18.140] debug2: _slurm_rpc_complete_prolog JobId=xx usec=9

Step creates good:

[2025-03-27T11:38:48.165] debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from UID=zzz
[2025-03-27T11:38:48.165] debug:  step_create: external launcher step requested

Here are the last two minutes of grepping on RPC:

[2025-03-27T11:32:29.313] debug2: Processing RPC: REQUEST_FED_INFO from UID=x
[2025-03-27T11:32:29.313] debug2: _slurm_rpc_get_fed usec=25
[2025-03-27T11:32:29.315] debug2: Processing RPC: REQUEST_JOB_USER_INFO from UID=xx
[2025-03-27T11:32:29.317] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=xx
[2025-03-27T11:32:29.317] debug2: _slurm_rpc_dump_partitions, size=1802 usec=15
[2025-03-27T11:32:29.801] debug2: Processing RPC: REQUEST_FED_INFO from UID=xx
[2025-03-27T11:32:29.801] debug2: _slurm_rpc_get_fed usec=36
[2025-03-27T11:32:29.803] debug2: Processing RPC: REQUEST_JOB_USER_INFO from UID=xx
[2025-03-27T11:32:29.804] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=xx
[2025-03-27T11:32:29.804] debug2: _slurm_rpc_dump_partitions, size=1802 usec=61
[2025-03-27T11:32:31.225] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=xx
[2025-03-27T11:32:31.225] debug2: _slurm_rpc_dump_partitions, size=1210 usec=13
[2025-03-27T11:32:31.228] debug2: Processing RPC: REQUEST_NODE_INFO from UID=xx
[2025-03-27T11:32:31.366] debug2: Processing RPC: REQUEST_HET_JOB_ALLOC_INFO from UID=xx
[2025-03-27T11:32:31.366] debug2: _slurm_rpc_het_job_alloc_info: JobId=ssss NodeList=cluster[17-20] usec=1
[2025-03-27T11:32:31.369] debug2: Processing RPC: REQUEST_JOB_STEP_CREATE from UID=xx
[2025-03-27T11:32:32.363] debug2: Processing RPC: REQUEST_FED_INFO from UID=xx
[2025-03-27T11:32:32.363] debug2: _slurm_rpc_get_fed usec=52
[2025-03-27T11:32:32.364] debug2: Processing RPC: REQUEST_JOB_USER_INFO from UID=xx
[2025-03-27T11:32:32.366] debug2: Processing RPC: REQUEST_PARTITION_INFO from UID=xx
[2025-03-27T11:32:32.366] debug2: _slurm_rpc_dump_partitions, size=1802 usec=18

No errors that I'm seeing on the slurmdbd side. No timeouts from the cluster.
There are only two entries in slurmdbd.log for today and they look like this:

Comment 47 Nicole 2025-03-27 11:42:02 MDT
Created attachment 41290 [details]
image002.png
Comment 48 Nicole 2025-03-27 11:42:02 MDT
Created attachment 41291 [details]
image003.png
Comment 49 Nicole 2025-03-27 11:42:02 MDT
Created attachment 41292 [details]
image004.png
Comment 50 Nicole 2025-03-27 11:42:02 MDT
Created attachment 41293 [details]
image005.png
Comment 51 Nicole 2025-03-27 11:42:02 MDT
Created attachment 41294 [details]
image006.png
Comment 52 Nicole 2025-03-27 11:42:02 MDT
Created attachment 41295 [details]
image007.png
Comment 53 Nicole 2025-03-27 11:43:56 MDT
Sorry, I hit send too soon. I meant to finish with: these are the only two log entries from slurmdbd.log today:

[2025-03-27T09:49:47.089] error: _conn_readable: persistent connection for fd 32 experienced error[104]: Connection reset by peer
[2025-03-27T09:49:54.280] error: _conn_readable: persistent connection for fd 35 experienced error[104]: Connection reset by peer
Comment 54 Nicole 2025-03-27 13:45:26 MDT
Ben,
You mentioned looking at the log for the job in question. I just want to reiterate that this is ALL jobs at this point. Over 600! There has to be something I can run from the controller or from the database server to show the two aren't communicating. Do you have any ideas? I really need to restart services and get this back up ASAP. I can get you logs, and I can clean them up, but I'm not seeing anything that points to any service communication failure. This is the 3rd or 4th time this has happened on this cluster.
Thanks,
Nicole

Comment 55 Ben Roberts 2025-03-27 14:03:03 MDT
Thanks for sending what you could.  If those are the only two log entries that you have for slurmdbd then it sounds like you might be running at a low log level, like 'error'.  
https://slurm.schedmd.com/slurmdbd.conf.html#OPT_DebugLevel

It would be nice to have some more information from slurmdbd in the future if something like this comes up again.  Unfortunately there isn't a way to increase the log level for slurmdbd without updating your config and restarting or reconfiguring slurmdbd.  It sounds like this would fix the problem though, so we're probably not going to be able to get more information for this time it's happening.  
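
For next time, a minimal slurmdbd.conf fragment might look like the following. The exact level is a suggestion; see the DebugLevel documentation linked above, and note that slurmdbd must be restarted (or reconfigured) for the change to take effect:

```ini
# slurmdbd.conf — raise from a low level such as 'error' so connection
# and RPC handling details are recorded the next time this happens
DebugLevel=debug2
```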

Can we have you look for a couple more things in the slurmctld logs that you're getting right now?  We would like to have you look for log entries that include "agent queue".  These would indicate that messages destined for slurmdbd are queuing up.  Can we also have you double check that the DBD Agent queue is still at 0 in the sdiag output?

Can you also send the dbd related settings in your configuration:
scontrol show config | grep -i dbd

Thanks,
Ben
Comment 56 Nicole 2025-03-27 14:36:35 MDT
Ben,
I submitted this comment but got 'mid air collision' I think it didn't take:sorry if its a duplicate:

No agent queue messages
[root@cladmin1 slurm]# grep -i agent slurmctld.log | grep -i queue^C
[root@cladmin1 slurm]# grep -i 'agent queue' slurmctld.log
[root@cladmin1 slurm]# grep -i agent slurmctld.log | grep -i queue


Agent queue still 0
[root@cladmin1 slurm]# sdiag
*******************************************************
sdiag output at Thu Mar 27 14:22:34 2025 (1743106954)
Data since      Wed Mar 26 18:00:01 2025 (1743033601)
*******************************************************
Server thread count:  2
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Config:

[root@cladmin1 slurm]# scontrol show config | grep -i dbd
AccountingStorageType   = accounting_storage/slurmdbd
MaxDBDMsgs              = 405952

What else? Happy to send anything at this point.


-Nicole
Comment 57 Ben Roberts 2025-03-27 14:43:37 MDT
You mention that you can send some additional logs that you clean up.  That would be useful.  If we are going to have you collect and clean up some logs like that, I would like to get as much information as we can from it.  Can you turn up the logging a little higher long enough to get these logs:
 scontrol setdebug debug3

Then if you would submit a test job and collect the logs for the full second after you see an entry that looks like this for the job:
 sched: Allocate JobId=10649 NodeList=node25 #CPUs=2 Partition=debug


Then you can turn the logs back down with this:
 scontrol setdebug info


You're asking about a way to confirm that the controller and slurmdbd are communicating.  We can run a quick command to confirm that slurmdbd still recognizes that the controller is configured to talk to it.
 sacctmgr show cluster format=cluster,controlhost,controlport,rpc

If this doesn't have information about the controller it would confirm there is a problem, but to understand why we would need some details in the logs, which is why I've been focusing there.

You mention that sacct doesn't see these jobs.  Can we also confirm that there isn't an entry in the database for them?  It's possible that sacct isn't returning them for some reason and the records are still fine.  Can you pick an example job and run the following:
mysql -D <StorageLoc> -e "select * from <Cluster>_job_table where id_job=X\G"

You would need to substitute appropriate values for:
<StorageLoc>  from your slurmdbd.conf
<Cluster>     your cluster name from slurm.conf
X             the appropriate job id
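As a concrete illustration, here is what that substitution might look like with hypothetical values (a database named slurm_acct_db, a cluster named mycluster, and job id 10649 -- none of these are from this site's actual configuration):

```shell
# Hypothetical values -- substitute your own from slurmdbd.conf and slurm.conf.
STORAGE_LOC="slurm_acct_db"   # StorageLoc from slurmdbd.conf
CLUSTER="mycluster"           # ClusterName from slurm.conf
JOB_ID=10649                  # the job id to look up

# \G makes mysql print one field per line, which is easier to read
# for the wide job table.
QUERY="select * from ${CLUSTER}_job_table where id_job=${JOB_ID}\\G"
echo mysql -D "$STORAGE_LOC" -e "$QUERY"
```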

Thanks,
Ben
Comment 58 Nicole 2025-03-27 16:35:11 MDT
Hi Ben,
I've submitted a job and am attaching all the log information for it.

[me@dbserver ~]$ sacctmgr show cluster format=cluster,controlhost,controlport,rpc
   Cluster     ControlHost  ControlPort   RPC 
---------- --------------- ------------ ----- 
        xx           xx.17         6817 10240 
       yyy          yy.147         6817 10240 
        zz          zz.201         6817 10240 
        aa         aaa.147         6817 10240 
        bb          bbb.17         6817 10240 
        cc         ccc.151         6817 10240 
MyproblemCluster    ddd.14         6817 10240 

I've confirmed I could see multiple jobs in the job table from 2 days ago.
I cannot see the one I just submitted, which has already ended:
mysql -D slurm_acct_db -e "select * from myproblemcluster_job_table where id_job=12345"

Nicole
Comment 59 Nicole 2025-03-27 16:43:40 MDT
Created attachment 41298 [details]
Logs after Job submitted

These are all the logs for the short time period after the job was submitted.
Comment 60 Nicole 2025-03-27 17:35:48 MDT
Ben,
I've run the following command and can see that the last job completion recorded in the database was the morning of the 25th (09:22). I've looked at all logs before and after that time and there is nothing unusual.

mysql -D slurm_acct_db -e "select id_job,time_end from clustername_job_table where time_end>1742910910 order by time_end asc"

Last one reported:
| 222222222 | 1742916144 |
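For reference, the epoch values in `time_end` can be converted to readable timestamps with GNU `date` (a convenience note; the epoch below is the value reported above):

```shell
# Convert the time_end epoch of the last recorded job to a readable date.
# Requires GNU date (the -d @epoch form); 1742916144 is the value above.
date -u -d @1742916144 +'%Y-%m-%d %H:%M:%S UTC'
# -> 2025-03-25 15:22:24 UTC (09:22 MDT)
```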
-Nicole
Comment 61 Nicole 2025-03-27 18:12:15 MDT
Ben,
I couldn't risk any more job data loss, so I had to restart slurmctld.
However, I was unable to restart slurmctld on the controller:

systemctl restart slurmctld.service
Job for slurmctld.service failed because a fatal signal was delivered to the control process.
See "systemctl status slurmctld.service" and "journalctl -xe" for details.

I'm going to work with another admin to try to get Slurm back up.
Any insight you might have would be great.

-Nicole
Comment 62 Ben Roberts 2025-03-27 21:30:09 MDT
Hi Nicole,

Thanks for gathering that information again.  It looks like you still grepped for the jobid in the logs you sent.  Some of the relevant log entries that will show what's going on are not going to have the job id included in the line.

I understand not being able to leave things in this state anymore.  Were you able to restart the slurmdbd service successfully?  Does restarting slurmdbd not resolve the problem, or is it only when slurmctld is restarted?  Are you able to send the journalctl output for us to see what is keeping it from restarting?  If you can't send that, are you able to try starting slurmctld by calling the binary directly with flags to show what's happening?  If you start slurmctld like this (slurmctld -Dvvv) it should tell us what's wrong, and then you can start it normally with systemd once the problem has been resolved.

Thanks,
Ben
Comment 64 Nicole 2025-03-28 07:15:55 MDT
Ben,
I will send you the log entries without grepping, SORRY about that.
Yes, Slurm is restarted; we had to reboot the server. Jobs are once again writing to the database.
-Nicole
Comment 65 Chad Vizino 2025-03-28 15:39:11 MDT
(In reply to Nicole from comment #64)
> I will send you the log entries without grepping, SORRY about that.
> Yes SLURM is restarted , we had to reboot the server.   Jobs are once again
> writing to the database.
Hi. Ben is out today so am covering for him on this one. I can look at the logs if you have them. Just curious, when you restarted the slurm daemons did you enable more logging?
Comment 66 Nicole 2025-03-31 07:59:16 MDT
Chad,
The last thing Ben had me do with the debug level was to set it back to admin.
At one point we had it set to debug3, just to submit a job.
Right now it is back at admin.
This problem is happening about once a quarter. What would be a realistic debugging level that is sustainable for that amount of time?

Also sorry for not responding Friday, I had logged in on my day off just to respond and verify things were working.

Thank you,
Nicole
Comment 67 Chad Vizino 2025-03-31 10:13:27 MDT
(In reply to Nicole from comment #66)
> The last thing Ben had me do with debug level is to set it back to admin.
> At one point we had it set to 3, just to submit a job.
> Right now it is back at admin.
> This problem is happening about once a quarter. What would be a
> realistic debugging level which is sustainable for that amount of time?
Level 3 is a bit much to keep on for that long, until the next quarter. Even just "debug" can be a lot, depending on the volume of jobs you have. Do you have a feel for how many jobs per day (starting, completing) you have? Once I know this, I can better advise you on the level.

With it happening at that frequency I'm wondering if it's correlated to another activity on your system. Is there a specific month/day period this is happening each quarter?

> Also sorry for not responding Friday, I had logged in on my day off just to
> respond and verify things were working.
No problem.
Comment 68 Nicole 2025-03-31 10:38:05 MDT
Chad,
No particular pattern; it just started happening a few cycles back. When we restarted slurmctld this time we did see some SIGBUS errors, which I'm asking our hardware team to look into to see if they could be related.

BUT EVEN IF IT IS RELATED:

The oddest thing to me is that slurmctld doesn't realize it isn't communicating with the database, and nothing is being queued in the DBD agent queue. Have you ever seen this before? It seems like a check/balance is needed for slurmctld to realize it isn't writing to the database the way it thinks it is.

This is a fairly large cluster. We currently have ~290 jobs actively running, for up to 4 days each. We have anywhere from ~20-60 jobs complete per day, which I would guess is about the same estimate for submissions. At the end of every day we do a count of how many jobs completed; when we get a 0 from sacct, that is the clue that something is going wrong.

Unfortunately, when it goes down midday, we still get a count for that day (in this example, Wednesday), so we don't catch it for an additional 24 hours, when we get a count of 0.
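The end-of-day check described above could be automated with something like the following sketch (a hypothetical script, not from this site; the real input would come from sacct, shown in the comment, but sample output is used here so the sketch is self-contained):

```shell
#!/bin/sh
# Hypothetical end-of-day check: count the jobs sacct reports as completed
# and warn when the count is zero, which at this site signals that slurmctld
# has silently stopped writing to the database.

count_completed() {
    # Count non-blank lines of sacct-style job-id output read from stdin.
    grep -c '[^[:space:]]'
}

# In production the input would come from something like:
#   sacct -a -X --noheader -S "$(date -d yesterday +%F)" -s CD -o JobID
# Sample output is used here so the sketch runs anywhere.
SAMPLE_OUTPUT="12345
12346
12347"

COUNT=$(printf '%s\n' "$SAMPLE_OUTPUT" | count_completed)
if [ "$COUNT" -eq 0 ]; then
    echo "WARNING: 0 completed jobs recorded -- check slurmctld/slurmdbd sync"
else
    echo "OK: $COUNT completed jobs recorded"
fi
```

Run from cron shortly after midnight, a script like this would catch the silent failure within a day instead of after the next zero count.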

Thanks,
Nicole
Comment 69 Nicole 2025-04-03 08:30:51 MDT
Chad/Ben,
Please see the attached log from when you had me submit a job and it started running, captured at the recommended debug level.
-Nicole
Comment 70 Nicole 2025-04-03 08:31:25 MDT
Created attachment 41356 [details]
job log - No Grep'ing