Ticket 1836

Summary: "Runaway" Jobs in SLURM DB
Product: Slurm
Reporter: Will French <will>
Component: Other
Assignee: Brian Christiansen <brian>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Priority: ---
CC: brian, charles.johnson, da, sean
Version: 14.11.8
Hardware: Linux
OS: Linux
Site: Vanderbilt
Attachments: slurmdbd.log
logs from slurmctld starting shortly after upgrade

Description Will French 2015-07-30 10:58:15 MDT
Hi again,

We updated to 14.11.8 yesterday and we are now noticing that there are a lot of "runaway" jobs in the database since the upgrade. By runaway I mean jobs that have finished but are still accumulating time in the SLURM database. For example:

[frenchwr@vmps14 python-job]$ date && sacct --format=User,JobID,Timelimit,elapsed,ExitCode,state
Thu Jul 30 17:50:50 CDT 2015
     User        JobID  Timelimit    Elapsed ExitCode      State 
--------- ------------ ---------- ---------- -------- ---------- 
 frenchwr 4100939_2      02:00:00   00:21:09      0:0    RUNNING 
 frenchwr 4100939_0      02:00:00   00:21:09      0:0    RUNNING 
 frenchwr 4100939_1      02:00:00   00:21:09      0:0    RUNNING 
 frenchwr 4100989_2      02:00:00   00:18:40      0:0  CANCELLED 
 frenchwr 4100989_0      02:00:00   00:18:40      0:0  CANCELLED 
 frenchwr 4100989_1      02:00:00   00:18:40      0:0  CANCELLED 
 frenchwr 4101140        00:10:00   00:12:12      0:0  COMPLETED 
 frenchwr 4101313        00:10:00   00:06:46      0:0    RUNNING 

[frenchwr@vmps14 python-job]$ date && sacct --format=User,JobID,Timelimit,elapsed,ExitCode,state
Thu Jul 30 17:51:04 CDT 2015
     User        JobID  Timelimit    Elapsed ExitCode      State 
--------- ------------ ---------- ---------- -------- ---------- 
 frenchwr 4100939_2      02:00:00   00:21:23      0:0    RUNNING 
 frenchwr 4100939_0      02:00:00   00:21:23      0:0    RUNNING 
 frenchwr 4100939_1      02:00:00   00:21:23      0:0    RUNNING 
 frenchwr 4100989_2      02:00:00   00:18:54      0:0  CANCELLED 
 frenchwr 4100989_0      02:00:00   00:18:54      0:0  CANCELLED 
 frenchwr 4100989_1      02:00:00   00:18:54      0:0  CANCELLED 
 frenchwr 4101140        00:10:00   00:12:26      0:0  COMPLETED 
 frenchwr 4101313        00:10:00   00:07:00      0:0    RUNNING

All of these jobs have completed (some I cancelled with scancel, others finished normally), but they are still accumulating time in the SLURM db. This is happening to a large number of jobs (possibly all?) on our cluster since the upgrade yesterday. 
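One way to enumerate runaways like these is to cross-reference what sacct reports as RUNNING against what the controller actually knows about via squeue: jobs the database thinks are running but squeue no longer lists are the runaways. A minimal sketch using hypothetical sample data in place of real `sacct -n --format=JobID,State` and `squeue -h -o %i` output:

```shell
# Sketch: flag jobs the accounting DB thinks are RUNNING but that
# squeue no longer knows about. The two files below are hypothetical
# stand-ins for real sacct/squeue output.
cat > sacct.txt <<'EOF'
4100939_2    RUNNING
4100939_0    RUNNING
4101140      COMPLETED
4101313      RUNNING
EOF
cat > squeue.txt <<'EOF'
4101313
EOF

# RUNNING in the DB but absent from the controller => runaway
awk 'NR==FNR { active[$1]=1; next }
     $2=="RUNNING" && !($1 in active) { print $1 }' squeue.txt sacct.txt
# -> prints 4100939_2 and 4100939_0
```

For what it's worth, Slurm releases newer than the 14.11 in this ticket added a `sacctmgr show runawayjobs` subcommand that lists such jobs and offers to fix them directly.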

We are seeing lots of these messages in the slurmctld logs:

root@slurmsched1:~# tail -n 50000 /var/log/messages | grep slurmdbd
Jul 30 15:58:25 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:00:26 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:02:27 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:04:28 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:05:06 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 135850
Jul 30 16:06:29 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:08:09 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 132100
Jul 30 16:08:30 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:10:44 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:12:45 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:14:46 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:16:47 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:18:48 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:20:54 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:22:55 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:24:56 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:26:57 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:28:58 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:30:59 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:33:00 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:33:13 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 104900
Jul 30 16:35:01 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:37:02 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:39:03 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:41:05 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:43:06 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:45:13 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:47:34 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:47:57 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 87600
Jul 30 16:49:37 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:59:50 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 72100
Jul 30 17:11:55 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 56350
Jul 30 17:41:26 slurmsched1 slurmctld[20588]: slurmdbd: reopening connection
Jul 30 17:41:31 slurmsched1 slurmctld[20588]: error: slurmdbd: Sending message type 1472: 11: Connection refused
Jul 30 17:41:31 slurmsched1 slurmctld[20588]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Jul 30 17:41:36 slurmsched1 slurmctld[20588]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Jul 30 17:41:39 slurmsched1 slurmctld[20588]: Registering slurmctld at port 6817 with slurmdbd.

Note that the last four lines occurred as I was restarting slurmdbd.

The problem still persists after restarting slurmdbd. I have run a couple of extremely short (5-second) test jobs and the db thinks they are still running. I waited about 15 minutes after restarting slurmdbd and tried again, with the same result.

The "agent queue size" and "error: slurmdbd: agent queue filling" messages have not reappeared since restarting slurmdbd.

Will
Comment 1 Will French 2015-07-30 11:07:34 MDT
FWIW, the first "agent queue filling, RESTART SLURMDBD NOW" message occurred at around 4PM CST today (about two hours ago). We performed the upgrade yesterday at around 1PM CST.
Comment 2 Danny Auble 2015-07-30 11:12:13 MDT
Will, are you still getting the RESTART SLURMDBD NOW messages?

Do you see anything in your slurmdbd log that is interesting?

It doesn't appear anything has been lost yet, was the DBD down for a long time?
Comment 3 Will French 2015-07-30 11:17:10 MDT
And now the problem seems to have resolved itself. I suppose it just took 20-25 minutes for all the job information to make it to slurmdbd after it was restarted. 

slurmdbd was not down for a long period of time so far as we know (other than the 10 seconds or so it took yesterday to restart the daemon during the upgrade). Are you aware of anything else that could cause these "agent queue filling" messages?
Comment 4 Danny Auble 2015-07-30 11:30:22 MDT
This is what I would expect (that it would resolve itself eventually).  All the RPCs sent to the DBD are in time order, so you would have to wait for all those thousands of RPCs to make it over before you would see the latest jobs finishing.  You most likely saw messages in the dbd log like "Need to reroll usage from $TIME Job $JOBID from $CLUSTER started then and we are just now hearing about it"

Was there something else happening on the DBD node?  It looks like something was crazy busy there for a bit.
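The time-ordering described above means a job-end RPC enqueued behind the whole backlog is not applied until everything older has been flushed first. A back-of-envelope check, using the peak queue depth from the logs and the ~18 RPC/s drain rate estimated later in this comment:

```shell
# A job-end record enqueued behind the 135,850-deep FIFO backlog
# waits for the whole queue to flush at ~18 RPC/s, i.e. roughly
# two hours of apparent "runaway" time for an already-finished job.
awk 'BEGIN { printf "%.0f minutes\n", 135850 / 18 / 60 }'
# -> 126 minutes
```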

Looks like things cleared up some time between

Jul 30 17:11:55 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 56350
Jul 30 17:41:26 slurmsched1 slurmctld[20588]: slurmdbd: reopening connection

Based on

Jul 30 16:08:09 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 132100
Jul 30 16:33:13 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 104900

Handling 27200 messages in roughly 1500 seconds, it appears your database was able to process around 18 RPCs a second.  That seems a bit slow.  Still, do you know if something happened around 3-4 today that would have caused a load on the node?

Perhaps someone submitted a ridiculous number of jobs at that moment without making them part of an array?  Or perhaps something on the node outside of Slurm was doing something?  Perhaps it is still happening?
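The estimate above can be reproduced directly from the two "agent queue size" log lines (16:08:09 showing 132100 queued, 16:33:13 showing 104900):

```shell
# Reproduce the ~18 RPC/s figure from the two queue-size samples.
drained=$((132100 - 104900))                                  # 27200 messages
elapsed=$(( (16*3600 + 33*60 + 13) - (16*3600 + 8*60 + 9) ))  # 1504 seconds
awk -v d="$drained" -v e="$elapsed" \
    'BEGIN { printf "%.1f RPCs per second\n", d / e }'
# -> 18.1 RPCs per second
```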

What does

sacct -X | wc
then
sacct | wc

say?  I am looking at the line count from each to see first how many jobs ran today, and then how many steps as well.

At least it appears things are back to normal; from the ctld log you sent, it doesn't appear anything was lost.
Comment 5 Will French 2015-07-31 07:04:20 MDT
Hey Danny,
 
> Looks like things cleared up some time between 
> Jul 30 17:11:55 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size
> 56350
> Jul 30 17:41:26 slurmsched1 slurmctld[20588]: slurmdbd: reopening connection

That second message was right around the time I restarted slurmdbd.


> Based on
> Jul 30 16:08:09 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size
> 132100
> Jul 30 16:33:13 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size
> 104900
> 1500 seconds handling 27200 messages it appears your database was able to
> handle around 18 RPCs a second.  That seems a bit slow, still, do you know
> if something happened around 3-4 today that would have caused a load on the
> node?
> Perhaps someone submitted a ridiculous amount of jobs at that moment without
> making them part of an array?  Or perhaps something on the node outside of
> Slurm was doing something?  Perhaps it is still happening?

Well, it does look like there were a large number of jobs submitted between 2:30-3:00pm yesterday:

root@slurmsched1:~# sacct -X --starttime=07/30/15-13:00:00 --endtime=07/30/15-13:30:00 --state=PD --format=User,JobID,account,Submit,Start,End | wc -l
4175
root@slurmsched1:~# sacct -X --starttime=07/30/15-13:30:00 --endtime=07/30/15-14:00:00 --state=PD --format=User,JobID,account,Submit,Start,End | wc -l
3915
root@slurmsched1:~# sacct -X --starttime=07/30/15-14:00:00 --endtime=07/30/15-14:30:00 --state=PD --format=User,JobID,account,Submit,Start,End | wc -l
5113
root@slurmsched1:~# sacct -X --starttime=07/30/15-14:30:00 --endtime=07/30/15-15:00:00 --state=PD --format=User,JobID,account,Submit,Start,End | wc -l
6569
root@slurmsched1:~# sacct -X --starttime=07/30/15-15:00:00 --endtime=07/30/15-15:30:00 --state=PD --format=User,JobID,account,Submit,Start,End | wc -l
4365
root@slurmsched1:~# sacct -X --starttime=07/30/15-15:30:00 --endtime=07/30/15-16:00:00 --state=PD --format=User,JobID,account,Submit,Start,End | wc -l
4590

Might this have triggered it? slurmdbd runs on its own dedicated server so it's unlikely that another process/application impacted it.


> what does 
> sacct -X | wc
> then
> sacct | wc
> say?  I am looking for the count of lines from each to see how many jobs ran
> today first and then how many steps as well.

root@slurmsched1:~# sacct --starttime=07.30.15 --endtime=07.30.15 --format=User,JobID,account,Submit,Start,End | wc
  18987  105712 1803765
root@slurmsched1:~# sacct -X --starttime=07.30.15 --endtime=07.30.15 --format=User,JobID,account,Submit,Start,End | wc
  10777   64662 1023815


> At least it appears things are back to normal, from your ctld log you sent
> it doesn't appear anything was lost.

So far as we can tell nothing was lost. Phew!
Comment 6 Brian Christiansen 2015-07-31 10:08:42 MDT
A couple more questions:

What type of hardware is the dbd on? Is it a VM? SSD?

Will you grep the logs for "agent queue" from the time the dbd was upgraded? We want to see when the queue size started growing.

Did you have a network problem by chance?

Can you send your dbd logs from the time of the upgrade?

Thanks,
Brian
Comment 7 Will French 2015-07-31 10:29:30 MDT
Created attachment 2089 [details]
slurmdbd.log
Comment 8 Will French 2015-07-31 10:35:41 MDT
Hi Brian,

> What type of hardware is the dbd on? Is it a VM? SSD?

The database is running on a 6-month-old Dell server: 2 x hex-core Intel Xeon processors, 128 GB RAM, pretty standard spinning disk. No VM. No SSD.



> Will you grep the logs for "agent queue" from the time the dbd was upgraded?
> We want to see when the queue size started growing.

The upgrade took place around 1:15 PM on July 29:

Jul 28 22:08:17 slurmsched1 slurmctld[48105]: slurmdbd: agent queue size 50
Jul 29 16:15:10 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 450
Jul 29 21:57:12 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 29 22:07:21 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 29 22:29:38 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 100
Jul 29 23:22:14 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 30 15:58:25 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:00:26 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:02:27 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:04:28 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:05:06 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 135850
Jul 30 16:06:29 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:08:09 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 132100
Jul 30 16:08:30 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:10:44 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:12:45 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:14:46 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:16:47 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:18:48 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:20:54 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:22:55 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:24:56 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:26:57 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:28:58 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:30:59 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:33:00 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:33:13 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 104900
Jul 30 16:35:01 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:37:02 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:39:03 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:41:05 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:43:06 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:45:13 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:47:34 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:47:57 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 87600
Jul 30 16:49:37 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:59:50 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 72100
Jul 30 17:11:55 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 56350
Jul 30 18:42:25 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 30 23:27:53 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 11:00:25 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 12:29:29 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 14:33:07 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50



> Did you have a network problem by chance?

Not that we are aware of, but it's possible.

> Can you send your dbd logs from the time of the upgrade?

Attached. I do see some interesting messages around the time of the upgrade. This popped up about fifteen minutes prior to the upgrade:

[2015-07-29T13:00:04.738] error: We have more time than is possible (22809907+1296000+0)(24105907) > 23198400 for cluster accre(6444) from 2015-07-29T12:00:00 - 2015-07-29T13:00:00

You'll also see a lot of missing UID messages. These are users we have deactivated on our cluster but we have not yet removed their association(s) from the SLURM db. Is this a problem?
Comment 9 Brian Christiansen 2015-07-31 11:09:48 MDT
I don't think the missing UIDs will cause this situation. Will you also attach all of the slurmctld logs since the upgrade, so I can see if there is anything that correlates?

I'll continue looking at these next week.
Comment 10 Will French 2015-08-04 06:08:24 MDT
(In reply to Brian Christiansen from comment #9)
> I don't think the missing UID's will cause the situation. Will you also
> attach all of the slurmctld logs since the upgrade -- so I see if there is
> anything that correlates?
> 
> I'll continue looking at these next week.

I'm attaching the logs starting slightly before the upgrade, which occurred at 1:15 PM on July 29. Note that these are our system logs, so you'll see messages from other services scattered here and there.

Sorry for the delay.

Will
Comment 11 Will French 2015-08-04 06:10:11 MDT
Created attachment 2095 [details]
logs from slurmctld starting shortly after upgrade

Note we are using 

SlurmctldDebug = info
Comment 12 Brian Christiansen 2015-08-07 09:13:16 MDT
Have you seen any more "RESTART SLURMDBD NOW" messages since it last happened? Do you see the "agent queue size" getting higher than 100? In your slurmctld logs I see about 5-10 jobs completing per second, but that shouldn't be a problem. I don't see a reason for it starting to back up.
Comment 13 Brian Christiansen 2015-08-21 03:14:51 MDT
Checking in. Do you see any of the information in the last comment?
Comment 14 Will French 2015-08-21 03:38:54 MDT
(In reply to Brian Christiansen from comment #13)
> Checking in. Do you see any of the information in the last comment?

Hey Brian. Sorry I missed your replies. Things look better. We have not seen the "RESTART SLURMDBD NOW" messages since July 30. The agent queue size has also been much smaller:

root@slurmsched1:~# cat /var/log/messages-* | grep " agent queue size "
Jul 19 04:39:54 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 04:50:00 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 05:22:19 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 06:33:00 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 07:05:19 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 08:09:58 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 09:18:40 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 14:39:08 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 14:42:57 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 19:50:59 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 20:22:46 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 23:50:12 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 02:11:42 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 02:27:12 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 03:50:51 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 04:54:34 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 07:11:02 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 08:55:01 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 13:39:34 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 13:41:45 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 13:43:56 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 20 14:22:56 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 15:02:07 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 15:43:54 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 15:48:47 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 16:32:06 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 21 04:05:51 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 21 18:51:28 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 21 23:03:43 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 21 23:12:19 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 00:39:01 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 03:00:08 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 09:25:02 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 10:05:20 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 10:11:21 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 10:37:34 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 11:31:50 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 11:38:10 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 22 11:58:41 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 13:39:57 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 15:13:45 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 15:47:38 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 15:50:31 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 15:58:35 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 16:42:41 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 17:14:40 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 19:35:36 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 20:28:43 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 22:59:49 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 23 11:02:03 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 23 11:15:00 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 23 16:40:49 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 23 16:41:32 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 23 18:07:20 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 23 18:55:33 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 00:20:53 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 01:01:51 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 04:17:31 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 05:42:30 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 16:26:30 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 19:34:07 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 19:54:41 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 22:14:05 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 25 11:26:44 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 300
Jul 26 13:31:26 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 26 15:09:02 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 15:17:14 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 15:18:59 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 26 16:37:26 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 16:37:47 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 17:08:18 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 18:54:23 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 19:00:19 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 19:22:21 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 19:34:37 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 26 19:41:03 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 19:58:48 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 20:41:25 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 26 20:59:48 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 21:01:16 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 21:07:40 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 21:09:42 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 23:39:37 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 00:43:58 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 27 01:28:11 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 03:44:39 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 05:55:20 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 27 06:19:28 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 06:58:16 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 16:37:55 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 150
Jul 27 16:39:59 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 16:44:07 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 19:25:13 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 20:09:45 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 28 02:18:55 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 28 04:01:27 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 28 12:14:40 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 28 12:16:35 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 150
Jul 28 12:34:39 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 28 12:35:27 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 28 16:49:15 slurmsched1 slurmctld[48105]: slurmdbd: agent queue size 50
Jul 28 22:08:17 slurmsched1 slurmctld[48105]: slurmdbd: agent queue size 50
Jul 29 16:15:10 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 450
Jul 29 21:57:12 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 29 22:07:21 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 29 22:29:38 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 100
Jul 29 23:22:14 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 30 16:05:06 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 135850
Jul 30 16:08:09 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 132100
Jul 30 16:33:13 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 104900
Jul 30 16:47:57 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 87600
Jul 30 16:59:50 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 72100
Jul 30 17:11:55 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 56350
Jul 30 18:42:25 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 30 23:27:53 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 11:00:25 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 12:29:29 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 14:33:07 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 19:36:08 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 21:09:04 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 22:25:57 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Aug  1 03:37:26 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Aug  1 03:45:35 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Aug  1 05:55:12 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Aug  1 17:23:01 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Aug  3 07:20:51 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Aug  3 07:23:22 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Aug  3 07:26:05 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 100
Aug  4 14:29:50 slurmsched1 slurmctld[38758]: slurmdbd: agent queue size 50
Aug  5 13:34:31 slurmsched1 slurmctld[38758]: slurmdbd: agent queue size 50
Aug  6 18:26:15 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug  6 21:08:44 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 200
Aug  6 21:10:47 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 250
Aug  6 21:11:34 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 150
Aug  6 21:21:53 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug  6 22:16:17 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 200
Aug  6 22:18:14 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug  6 22:18:21 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 200
Aug  6 22:58:07 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug  6 23:24:44 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug  7 00:36:18 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug  7 00:44:20 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug  7 01:58:33 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug  7 13:11:50 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 550
Aug  8 09:25:53 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug  9 15:46:03 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 200
Aug  9 15:52:23 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 200
Aug 10 01:16:00 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 10 04:30:43 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 10 06:00:48 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 10 07:17:57 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 10 07:18:47 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 10 08:34:18 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 10 11:37:21 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 10 12:15:00 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 10 14:59:46 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 10 15:00:33 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 10 16:57:03 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 10 18:36:56 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 10 19:15:45 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 10 23:05:13 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 11 04:17:53 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 150
Aug 11 04:54:04 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 11 04:55:10 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 11 05:30:40 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 11 07:05:18 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 11 08:02:41 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 11 20:35:39 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 11 20:42:41 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 200
Aug 12 12:01:50 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 12 16:46:01 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 250
Aug 13 07:49:15 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 13 16:41:13 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 13 18:12:34 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 14 03:39:11 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 15 08:28:29 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 500


So it was out of control on July 30 but has been much better since then. The queue size has definitely been larger than 100 in some cases (as high as 500 last week). Not sure how problematic that might be...

Will
Comment 15 Brian Christiansen 2015-08-21 04:14:36 MDT
The 50s to 500s are fine. It's not increasing, so that's good. I would be concerned if the size were consistently in the 1000s, and in that case would look into some tuning. Just watch for "RESTART SLURMDBD NOW"; that will tell you whether there is an issue. It sounds like the systems were just busy on July 30 for some reason.
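The tuning usually meant here is on the MySQL side of the accounting database. An illustrative my.cnf fragment with the parameters commonly adjusted for slurmdbd backing databases (values are examples only, not a sizing recommendation for this host):

```ini
# Illustrative my.cnf fragment for a slurmdbd backing database.
# Values are placeholders; size them to the actual host.
[mysqld]
innodb_buffer_pool_size  = 4G    # keep hot accounting tables in memory
innodb_log_file_size     = 64M   # larger redo log smooths commit bursts
innodb_lock_wait_timeout = 900   # avoid lock timeouts during usage rollups
```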

I'll close this for now. Let us know if you have problems with this again.

Thanks,
Brian