| Summary: | "Runaway" Jobs in SLURM DB | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Will French <will> |
| Component: | Other | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | brian, charles.johnson, da, sean |
| Version: | 14.11.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Vanderbilt | | |
| Attachments: | slurmdbd.log; logs from slurmctld starting shortly after upgrade | | |
FWIW, the first "agent queue filling, RESTART SLURMDBD NOW" message occurred at around 4PM CST today (about two hours ago). We performed the upgrade yesterday at around 1PM CST. Will, are you still getting the RESTART SLURMDBD NOW messages? Do you see anything in your slurmdbd log that is interesting? It doesn't appear anything has been lost yet, was the DBD down for a long time? And now the problem seems to have resolved itself. I suppose it just took 20-25 minutes for all the job information to make it to slurmdbd after it was restarted. slurmdbd was not down for a long period of time so long as we know (other than the 10 seconds or so yesterday it took to restart the daemon during the upgrade). Could anything else that you're aware of cause these "agent queue filling" messages? This is what I would expect (it to resolve itself eventually). All the RPCs sent to the DBD are in time order so you would have to wait for all those thousands of RPCs to make it over before you would see the latest jobs finishing. You most likely saw messages in the dbd log like "Need to reroll usage from $TIME Job $JOBID from $CLUSTER started then and we are just now hearing about it" Was there something else going on on the DBD node? Looks like something was crazy busy there for a bit. Looks like things cleared up some time between Jul 30 17:11:55 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 56350 Jul 30 17:41:26 slurmsched1 slurmctld[20588]: slurmdbd: reopening connection Based on Jul 30 16:08:09 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 132100 Jul 30 16:33:13 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 104900 1500 seconds handling 27200 messages it appears your database was able to handle around 18 RPCs a second. That seems a bit slow, still, do you know if something happened around 3-4 today that would of caused a load on the node? Perhaps someone submitted a ridiculous amount of jobs at that moment without making them part of an array? Or perhaps something on the node outside of Slurm was doing something? Perhaps it is still happening? what does sacct -X | wc then sacct | wc say? I am looking for the count of lines from each to see how many jobs ran today first and then how many steps as well. At least it appears things are back to normal, from your ctld log you sent it doesn't appear anything was lost. Hey Danny, > Looks like things cleared up some time between > Jul 30 17:11:55 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size > 56350 > Jul 30 17:41:26 slurmsched1 slurmctld[20588]: slurmdbd: reopening connection That second messages was right around the time I restarted slurmdbd. > Based on > Jul 30 16:08:09 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size > 132100 > Jul 30 16:33:13 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size > 104900 > 1500 seconds handling 27200 messages it appears your database was able to > handle around 18 RPCs a second. That seems a bit slow, still, do you know > if something happened around 3-4 today that would of caused a load on the > node? > Perhaps someone submitted a ridiculous amount of jobs at that moment without > making them part of an array? Or perhaps something on the node outside of > Slurm was doing something? Perhaps it is still happening? 
Hey Danny,

> Looks like things cleared up some time between
> Jul 30 17:11:55 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 56350
> Jul 30 17:41:26 slurmsched1 slurmctld[20588]: slurmdbd: reopening connection

That second message was right around the time I restarted slurmdbd.

> Based on
> Jul 30 16:08:09 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 132100
> Jul 30 16:33:13 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 104900
> that is about 1500 seconds to handle 27200 messages, so it appears your
> database was able to handle around 18 RPCs a second. That seems a bit slow.
> Still, do you know if something happened around 3-4 today that would have
> caused a load on the node? Perhaps someone submitted a ridiculous number of
> jobs at that moment without making them part of an array? Or perhaps
> something on the node outside of Slurm was doing something? Perhaps it is
> still happening?

Well, it does look like there were a large number of jobs submitted between 2:30-3:00 PM yesterday:

root@slurmsched1:~# sacct -X --starttime=07/30/15-13:00:00 --endtime=07/30/15-13:30:00 --state=PD --format=User,JobID,account,Submit,Start,End | wc -l
4175
root@slurmsched1:~# sacct -X --starttime=07/30/15-13:30:00 --endtime=07/30/15-14:00:00 --state=PD --format=User,JobID,account,Submit,Start,End | wc -l
3915
root@slurmsched1:~# sacct -X --starttime=07/30/15-14:00:00 --endtime=07/30/15-14:30:00 --state=PD --format=User,JobID,account,Submit,Start,End | wc -l
5113
root@slurmsched1:~# sacct -X --starttime=07/30/15-14:30:00 --endtime=07/30/15-15:00:00 --state=PD --format=User,JobID,account,Submit,Start,End | wc -l
6569
root@slurmsched1:~# sacct -X --starttime=07/30/15-15:00:00 --endtime=07/30/15-15:30:00 --state=PD --format=User,JobID,account,Submit,Start,End | wc -l
4365
root@slurmsched1:~# sacct -X --starttime=07/30/15-15:30:00 --endtime=07/30/15-16:00:00 --state=PD --format=User,JobID,account,Submit,Start,End | wc -l
4590

Might this have triggered it? slurmdbd runs on its own dedicated server, so it's unlikely that another process/application impacted it.

> what does
> sacct -X | wc
> then
> sacct | wc
> say? I am looking for the count of lines from each to see how many jobs ran
> today first and then how many steps as well.

root@slurmsched1:~# sacct --starttime=07.30.15 --endtime=07.30.15 --format=User,JobID,account,Submit,Start,End | wc
  18987  105712 1803765
root@slurmsched1:~# sacct -X --starttime=07.30.15 --endtime=07.30.15 --format=User,JobID,account,Submit,Start,End | wc
  10777   64662 1023815

> At least it appears things are back to normal, from your ctld log you sent
> it doesn't appear anything was lost.

As far as we can tell, nothing was lost. Phew!

A couple more questions:

What type of hardware is the dbd on? Is it a VM? SSD?

Will you grep the logs for "agent queue" from the time the dbd was upgraded? We want to see when the queue size started growing.

Did you have a network problem by chance?

Can you send your dbd logs from the time of the upgrade?

Thanks,
Brian

Created attachment 2089 [details]
slurmdbd.log
Hi Brian,

> What type of hardware is the dbd on? Is it a VM? SSD?

The database is running on a 6-month-old Dell server: 2 x Intel Xeon hex-core processors, 128 GB RAM, pretty standard spinning disk. No VM. No SSD.

> Will you grep the logs for "agent queue" from the time the dbd was upgraded?
> We want to see when the queue size started growing.

The upgrade took place around 1:15 PM on July 29:

Jul 28 22:08:17 slurmsched1 slurmctld[48105]: slurmdbd: agent queue size 50
Jul 29 16:15:10 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 450
Jul 29 21:57:12 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 29 22:07:21 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 29 22:29:38 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 100
Jul 29 23:22:14 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 30 15:58:25 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:00:26 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:02:27 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:04:28 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:05:06 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 135850
Jul 30 16:06:29 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:08:09 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 132100
Jul 30 16:08:30 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:10:44 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:12:45 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:14:46 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:16:47 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:18:48 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:20:54 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:22:55 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:24:56 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:26:57 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:28:58 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:30:59 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:33:00 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:33:13 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 104900
Jul 30 16:35:01 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:37:02 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:39:03 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:41:05 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:43:06 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:45:13 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:47:34 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:47:57 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 87600
Jul 30 16:49:37 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:59:50 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 72100
Jul 30 17:11:55 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 56350
Jul 30 18:42:25 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 30 23:27:53 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 11:00:25 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 12:29:29 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 14:33:07 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50

> Did you have a network problem by chance?

Not that we are aware of, but it's possible.

> Can you send your dbd logs from the time of the upgrade?

Attached. I do see some interesting messages around the time of the upgrade. This popped up about fifteen minutes prior to the upgrade:

[2015-07-29T13:00:04.738] error: We have more time than is possible (22809907+1296000+0)(24105907) > 23198400 for cluster accre(6444) from 2015-07-29T12:00:00 - 2015-07-29T13:00:00

You'll also see a lot of missing-UID messages. These are users we have deactivated on our cluster but whose association(s) we have not yet removed from the SLURM db. Is this a problem?

I don't think the missing UIDs will cause the situation. Will you also attach all of the slurmctld logs since the upgrade, so I can see if there is anything that correlates?

I'll continue looking at these next week.

(In reply to Brian Christiansen from comment #9)
> I don't think the missing UIDs will cause the situation. Will you also
> attach all of the slurmctld logs since the upgrade, so I can see if there
> is anything that correlates?
>
> I'll continue looking at these next week.

I'm attaching the logs starting slightly before the upgrade, which occurred at 1:15 PM on July 29. Note that these are our system logs, so you'll see messages from other services scattered here and there. Sorry for the delay.

Will

Created attachment 2095 [details]
logs from slurmctld starting shortly after upgrade
Note we are using
SlurmctldDebug = info
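For context, that setting lives in slurm.conf. A minimal illustrative excerpt follows; the log-file path and the commented-out higher level are assumptions for illustration, not this site's actual configuration:

# slurm.conf (excerpt) -- illustrative only
SlurmctldDebug=info                             # verbosity in use at this site
#SlurmctldDebug=debug2                          # a higher level one might use while debugging
SlurmctldLogFile=/var/log/slurm/slurmctld.log   # path is an assumption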
Have you seen any more "RESTART SLURMDBD NOW" messages since it happened last? Do you see the "agent queue size" getting higher than 100? In your slurmctld logs, I see about 5-10 jobs completing per second, but that shouldn't be a problem. I don't see a reason for it starting to back up.

Checking in. Do you see any of the information in the last comment?

(In reply to Brian Christiansen from comment #13)
> Checking in. Do you see any of the information in the last comment?

Hey Brian. Sorry I missed your replies. Things look better. We have not seen the "RESTART SLURMDBD NOW" messages since July 30. The agent queue size has also been much smaller:

root@slurmsched1:~# cat /var/log/messages-* | grep " agent queue size "
Jul 19 04:39:54 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 04:50:00 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 05:22:19 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 06:33:00 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 07:05:19 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 08:09:58 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 09:18:40 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 14:39:08 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 14:42:57 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 19:50:59 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 20:22:46 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 19 23:50:12 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 02:11:42 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 02:27:12 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 03:50:51 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 04:54:34 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 07:11:02 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 08:55:01 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 13:39:34 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 13:41:45 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 13:43:56 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 20 14:22:56 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 15:02:07 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 15:43:54 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 15:48:47 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 20 16:32:06 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 21 04:05:51 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 21 18:51:28 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 21 23:03:43 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 21 23:12:19 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 00:39:01 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 03:00:08 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 09:25:02 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 10:05:20 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 10:11:21 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 10:37:34 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 11:31:50 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 11:38:10 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 22 11:58:41 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 13:39:57 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 15:13:45 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 15:47:38 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 15:50:31 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 15:58:35 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 16:42:41 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 17:14:40 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 19:35:36 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 20:28:43 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 22 22:59:49 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 23 11:02:03 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 23 11:15:00 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 23 16:40:49 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 23 16:41:32 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 23 18:07:20 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 23 18:55:33 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 00:20:53 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 01:01:51 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 04:17:31 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 05:42:30 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 16:26:30 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 19:34:07 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 19:54:41 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 24 22:14:05 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 25 11:26:44 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 300
Jul 26 13:31:26 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 26 15:09:02 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 15:17:14 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 15:18:59 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 26 16:37:26 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 16:37:47 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 17:08:18 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 18:54:23 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 19:00:19 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 19:22:21 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 19:34:37 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 26 19:41:03 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 19:58:48 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 20:41:25 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 26 20:59:48 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 21:01:16 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 21:07:40 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 21:09:42 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 26 23:39:37 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 00:43:58 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 27 01:28:11 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 03:44:39 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 05:55:20 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 27 06:19:28 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 06:58:16 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 16:37:55 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 150
Jul 27 16:39:59 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 16:44:07 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 19:25:13 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 27 20:09:45 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 28 02:18:55 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 28 04:01:27 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 28 12:14:40 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 28 12:16:35 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 150
Jul 28 12:34:39 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 100
Jul 28 12:35:27 slurmsched1 slurmctld[7866]: slurmdbd: agent queue size 50
Jul 28 16:49:15 slurmsched1 slurmctld[48105]: slurmdbd: agent queue size 50
Jul 28 22:08:17 slurmsched1 slurmctld[48105]: slurmdbd: agent queue size 50
Jul 29 16:15:10 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 450
Jul 29 21:57:12 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 29 22:07:21 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 29 22:29:38 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 100
Jul 29 23:22:14 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 30 16:05:06 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 135850
Jul 30 16:08:09 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 132100
Jul 30 16:33:13 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 104900
Jul 30 16:47:57 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 87600
Jul 30 16:59:50 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 72100
Jul 30 17:11:55 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 56350
Jul 30 18:42:25 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 30 23:27:53 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 11:00:25 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 12:29:29 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 14:33:07 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 19:36:08 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 21:09:04 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Jul 31 22:25:57 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Aug 1 03:37:26 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Aug 1 03:45:35 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Aug 1 05:55:12 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Aug 1 17:23:01 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Aug 3 07:20:51 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Aug 3 07:23:22 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 50
Aug 3 07:26:05 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 100
Aug 4 14:29:50 slurmsched1 slurmctld[38758]: slurmdbd: agent queue size 50
Aug 5 13:34:31 slurmsched1 slurmctld[38758]: slurmdbd: agent queue size 50
Aug 6 18:26:15 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 6 21:08:44 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 200
Aug 6 21:10:47 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 250
Aug 6 21:11:34 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 150
Aug 6 21:21:53 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 6 22:16:17 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 200
Aug 6 22:18:14 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 6 22:18:21 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 200
Aug 6 22:58:07 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 6 23:24:44 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 7 00:36:18 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 7 00:44:20 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 7 01:58:33 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 7 13:11:50 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 550
Aug 8 09:25:53 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 9 15:46:03 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 200
Aug 9 15:52:23 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 200
Aug 10 01:16:00 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 10 04:30:43 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 10 06:00:48 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 10 07:17:57 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 10 07:18:47 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 10 08:34:18 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 10 11:37:21 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 10 12:15:00 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 10 14:59:46 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 10 15:00:33 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 10 16:57:03 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 10 18:36:56 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 10 19:15:45 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 10 23:05:13 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 11 04:17:53 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 150
Aug 11 04:54:04 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 11 04:55:10 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 100
Aug 11 05:30:40 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 11 07:05:18 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 11 08:02:41 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 11 20:35:39 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 11 20:42:41 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 200
Aug 12 12:01:50 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 12 16:46:01 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 250
Aug 13 07:49:15 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 13 16:41:13 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 13 18:12:34 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 14 03:39:11 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 50
Aug 15 08:28:29 slurmsched1 slurmctld[5671]: slurmdbd: agent queue size 500

So it was out of control on July 30 but has been much better since then. The queue size has definitely been larger than 100 in some cases (as high as 500 last week). Not sure how problematic that might be...

Will
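If you'd rather not eyeball a listing like that, a one-liner can pull out just the worrying entries; a sketch against the syslog format shown above, with an arbitrary threshold of 1000 (not a Slurm-defined limit):

# Show only queue-size reports above a threshold (the size is the last field).
cat /var/log/messages-* | grep " agent queue size " | awk '$NF+0 > 1000'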
The 50's to 500's are fine. It's not increasing, so that's good. I would be concerned if the size were consistently in the 1000's, and I would look into some tuning. Just watch for the "RESTART SLURMDBD NOW" message; that will tell you whether there is an issue. It sounds like the systems were just busy on July 30 for some reason.

I'll close this for now. Let us know if you have problems with this again.

Thanks,
Brian
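Brian's "just watch for it" advice lends itself to a trivial periodic check. A cron-able sketch follows; the log path, tail window, and mail alert are per-site assumptions, not anything built into Slurm:

#!/bin/sh
# Sketch: alert if slurmctld has recently logged the agent-queue warning.
# Assumes syslog lands in /var/log/messages and mail(1) is configured.
if tail -n 20000 /var/log/messages | grep -q "agent queue filling, RESTART SLURMDBD NOW"; then
    echo "slurmdbd agent queue filling on $(hostname)" | mail -s "slurmdbd alert" root
fi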
Hi again,

We updated to 14.11.8 yesterday, and we are now noticing a lot of "runaway" jobs in the database since the upgrade. By "runaway" I mean jobs that have finished but are still accumulating time in the SLURM database. For example:

[frenchwr@vmps14 python-job]$ date && sacct --format=User,JobID,Timelimit,elapsed,ExitCode,state
Thu Jul 30 17:50:50 CDT 2015
     User        JobID  Timelimit    Elapsed ExitCode      State
--------- ------------ ---------- ---------- -------- ----------
 frenchwr    4100939_2   02:00:00   00:21:09      0:0    RUNNING
 frenchwr    4100939_0   02:00:00   00:21:09      0:0    RUNNING
 frenchwr    4100939_1   02:00:00   00:21:09      0:0    RUNNING
 frenchwr    4100989_2   02:00:00   00:18:40      0:0  CANCELLED
 frenchwr    4100989_0   02:00:00   00:18:40      0:0  CANCELLED
 frenchwr    4100989_1   02:00:00   00:18:40      0:0  CANCELLED
 frenchwr      4101140   00:10:00   00:12:12      0:0  COMPLETED
 frenchwr      4101313   00:10:00   00:06:46      0:0    RUNNING

[frenchwr@vmps14 python-job]$ date && sacct --format=User,JobID,Timelimit,elapsed,ExitCode,state
Thu Jul 30 17:51:04 CDT 2015
     User        JobID  Timelimit    Elapsed ExitCode      State
--------- ------------ ---------- ---------- -------- ----------
 frenchwr    4100939_2   02:00:00   00:21:23      0:0    RUNNING
 frenchwr    4100939_0   02:00:00   00:21:23      0:0    RUNNING
 frenchwr    4100939_1   02:00:00   00:21:23      0:0    RUNNING
 frenchwr    4100989_2   02:00:00   00:18:54      0:0  CANCELLED
 frenchwr    4100989_0   02:00:00   00:18:54      0:0  CANCELLED
 frenchwr    4100989_1   02:00:00   00:18:54      0:0  CANCELLED
 frenchwr      4101140   00:10:00   00:12:26      0:0  COMPLETED
 frenchwr      4101313   00:10:00   00:07:00      0:0    RUNNING

All of these jobs have completed (some I cancelled with scancel, others finished normally), but they are still accumulating time in the SLURM db. This is happening to a large number of jobs (possibly all?) on our cluster since the upgrade yesterday. We are seeing lots of these messages in the slurmctld logs:

root@slurmsched1:~# tail -n 50000 /var/log/messages | grep slurmdbd
Jul 30 15:58:25 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:00:26 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:02:27 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:04:28 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:05:06 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 135850
Jul 30 16:06:29 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:08:09 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 132100
Jul 30 16:08:30 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:10:44 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:12:45 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:14:46 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:16:47 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:18:48 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:20:54 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:22:55 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:24:56 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:26:57 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:28:58 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:30:59 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:33:00 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:33:13 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 104900
Jul 30 16:35:01 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:37:02 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:39:03 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:41:05 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:43:06 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:45:13 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:47:34 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:47:57 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 87600
Jul 30 16:49:37 slurmsched1 slurmctld[20588]: error: slurmdbd: agent queue filling, RESTART SLURMDBD NOW
Jul 30 16:59:50 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 72100
Jul 30 17:11:55 slurmsched1 slurmctld[20588]: slurmdbd: agent queue size 56350
Jul 30 17:41:26 slurmsched1 slurmctld[20588]: slurmdbd: reopening connection
Jul 30 17:41:31 slurmsched1 slurmctld[20588]: error: slurmdbd: Sending message type 1472: 11: Connection refused
Jul 30 17:41:31 slurmsched1 slurmctld[20588]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Jul 30 17:41:36 slurmsched1 slurmctld[20588]: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Jul 30 17:41:39 slurmsched1 slurmctld[20588]: Registering slurmctld at port 6817 with slurmdbd.

Note that the last four lines occurred as I was restarting slurmdbd. The problem is still persisting after the restart: I have run a couple of extremely short (5-second) test jobs and the db thinks they are still running. I waited about 15 minutes after restarting slurmdbd and tried again. Same result. The "agent queue size" and "error: slurmdbd: agent queue filling" messages have not reappeared since restarting slurmdbd.

Will
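A note for later readers: Slurm releases after the 14.11 series discussed here (15.08 or thereabouts, if the version history is remembered correctly) added a dedicated sacctmgr query for exactly this condition, so the command below is not available on the version in this ticket:

# List jobs the database still marks as running that the controller has no
# record of; sacctmgr then offers to mark them complete ("fix" them).
sacctmgr show runawayjobs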