| Summary: | runaway reservations | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Ryan Day <day36> |
| Component: | Database | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | sts |
| Version: | 18.08.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | LLNL | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | patch for 18.08.6 | ||
I'm marking this as a sev-2, since this can cause lost accounting data. You'll have to add the missing reservation back in; I'm looking up the correct SQL right now. We've added a fix for this in 19.05 (commit 89910fd50a3, bug 7155), so we also recommend upgrading as a permanent fix.

Created attachment 11875 [details]
patch for 18.08.6
This is actually commit 89910fd50a3 but applied to 18.08.6. Can you apply this commit, rebuild and restart slurmdbd, and let us know if the problem goes away? This is easier and safer than manually modifying the database, since this will fix all "runaway" reservations.
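One concrete way to confirm the backlog is draining after the restart is to watch the agent queue counter that sdiag reports. A rough sketch, assuming the "Agent queue size" label sdiag uses for this counter; the sample text below is fabricated for illustration, not output from a live cluster:

```shell
# Hedged sketch: checking whether the DBD agent queue is draining.
# "Agent queue size" is the label sdiag prints for this counter; the
# sample text below is fabricated for illustration, not real output.
sdiag_sample='Server thread count: 3
Agent queue size:    32371
Jobs submitted: 1204'

# Extract the numeric queue size from the (sample) sdiag output:
printf '%s\n' "$sdiag_sample" | awk -F': *' '/Agent queue size/ {print $2}'

# On a live cluster you would run something like:
#   sdiag | grep 'Agent queue size'
# and re-run it periodically: a shrinking number means recovery.
```

Re-running this every minute or so gives a simple trend line: increasing means slurmdbd is still falling behind, decreasing toward zero means the patched daemon is catching up.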
I forgot to say - use the command sdiag to monitor the DBD Agent Queue Size. If that number keeps increasing, then there is a problem. If that number decreases down towards zero, then things are looking good.

Were you able to apply this patch? Did it resolve the problem (see the slurmdbd log file and "DBD Agent Queue Size" in sdiag)? I'd really like to make sure that you don't lose accounting data - if that happens, then the problems become much worse and harder to solve.

I have it built, and I'm about to push it out.

I put the patched slurmdbd on our database server and it seems to have cleared up the problem. Thank you for the very fast response on this. What would data loss look like? I did have some runaway jobs on a couple of clusters, although I'm not sure if those were from the same power outage that led to the problem reservations or from the agent queue filling. The slurmctld logs have a lot of messages about

[2019-10-08T15:39:07.862] error: slurmdbd: agent queue filling (32371), RESTART SLURMDBD NOW

which is what prompted me to look at the slurmdbd, but they never got any messages that tell me that the agent queue was completely full.

(In reply to Ryan Day from comment #6)
> I put the patched slurmdbd on our database server and it seems to have
> cleared up the problem. Thank you for the very fast response on this. What
> would data loss look like?

When the DBD agent queue is full, every time slurmctld sends something to slurmdbd to write in the database, it is lost. Examples include jobs and steps starting or ending, a node going down or coming back up, a reservation starting or ending, and other things. Without the patch I provided you, losing reservation start times can unfortunately be self-perpetuating and cause more data loss.
You would see slurmdbd log messages indicating that the agent queue is full:

slurmdbd: agent queue is full

> I did have some runaway jobs on a couple of
> clusters, although I'm not sure if those were from the same power outage
> that led to the problem reservations or from the agent queue filling. The
> slurmctld logs have a lot of messages about
>
> [2019-10-08T15:39:07.862] error: slurmdbd: agent queue filling (32371),
> RESTART SLURMDBD NOW
>
> which is what prompted me to look at the slurmdbd, but they never got any
> messages that tell me that the agent queue was completely full.

Great! That means the angry log messages worked. I'm glad we got this taken care of before any accounting data was lost - that probably would have been a massive headache for all involved. I recommend keeping this patch in place until you upgrade to 19.05. Is there anything else we can do for this ticket?

Sounds good then. I think you've fixed us up. Thanks again for the fast response.

You're welcome. Closing as infogiven.
I have a bunch of errors in my slurmdbd.log along the lines of:

[2019-10-08T13:41:09.137] error: There is no reservation by id 66, time_start 1570270371, and cluster 'quartz'

It seems that the slurmdbd still thinks that there are a set of reservations out there that the various slurmctld's know have been removed:

```
[day36@quartz1148:~]$ scontrol show reservation
No reservations in the system
[day36@quartz1148:~]$ sacctmgr show reservation
   Cluster            Name       TRES           TimeStart             TimeEnd UnusedWall
---------- --------------- ---------- ------------------- ------------------- ----------
     borax   b451_downtime   cpu=1548 2019-10-04T12:55:43 2020-10-03T12:55:43 3.4586e+05
     flash   b451_downtime    cpu=680 2019-10-04T12:55:43 2019-10-08T09:43:05 3.3404e+05
     flash   b451_downtime    cpu=680 2019-10-08T09:43:05 2019-10-08T13:29:30 1.1815e+04
     oslic   b451_downtime    cpu=432 2019-10-04T12:55:43 2019-10-07T09:19:00 2.4620e+05
     oslic   b451_downtime    cpu=432 2019-10-07T09:19:00 2019-10-07T11:43:37 8.6770e+03
    pascal   b451_downtime   cpu=5868 2019-10-04T12:55:43 2020-10-03T12:55:43 3.4586e+05
    quartz   b451_downtime  cpu=93744 2019-10-04T12:55:43 2020-10-03T12:55:43 3.4946e+05
     syrah   b451_downtime   cpu=5024 2019-10-06T16:59:41 2019-10-07T17:32:58 8.4778e+04
[day36@quartz1148:~]$
```

How should I get rid of these? Is it sufficient to just "update quartz_resv_table set deleted='1' where id_resv='66'" (and repeated for other clusters)?
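For reference, the manual route being asked about would amount to something like the fragment below. This is a sketch only: the table name and reservation id come from the reporter's own proposed statement, it is untested here, and the advice elsewhere in the ticket is to prefer the patched slurmdbd over hand-editing the database. Back up slurm_acct_db before any manual change.

```sql
-- Sketch of the proposed manual cleanup (untested; verify the schema and
-- back up slurm_acct_db first). Repeat against each cluster's *_resv_table.
UPDATE quartz_resv_table
   SET deleted = 1
 WHERE id_resv = 66;
```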