| Summary: | runaway reservations | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Ryan Day <day36> |
| Component: | Database | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | sts |
| Version: | 18.08.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | LLNL | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | patch for 18.08.6 | ||
I'm marking this as a sev-2, since this can cause lost accounting data. You'll have to add the missing reservation back in; I'm looking up the correct SQL right now. We've added a fix for this in 19.05 (commit 89910fd50a3, bug 7155), so we also recommend upgrading as a permanent fix.

Created attachment 11875 [details]
patch for 18.08.6
This is actually commit 89910fd50a3 but applied to 18.08.6. Can you apply this commit, rebuild and restart slurmdbd, and let us know if the problem goes away? This is easier and safer than manually modifying the database, since this will fix all "runaway" reservations.
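One concrete way to confirm the backlog is draining after the restart is to watch the agent queue counter that sdiag reports. A rough sketch, assuming the "Agent queue size" label sdiag uses for this counter; the sample text below is fabricated for illustration, not output from a live cluster:

```shell
# Hedged sketch: checking whether the DBD agent queue is draining.
# "Agent queue size" is the label sdiag prints for this counter; the
# sample text below is fabricated for illustration, not real output.
sdiag_sample='Server thread count: 3
Agent queue size:    32371
Jobs submitted: 1204'

# Extract the numeric queue size from the (sample) sdiag output:
printf '%s\n' "$sdiag_sample" | awk -F': *' '/Agent queue size/ {print $2}'

# On a live cluster you would run something like:
#   sdiag | grep 'Agent queue size'
# and re-run it periodically: a shrinking number means recovery.
```

Re-running this every minute or so gives a simple trend line: increasing means slurmdbd is still falling behind, decreasing toward zero means the patched daemon is catching up.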
I forgot to say - use the command sdiag to monitor the DBD Agent Queue Size. If that number keeps increasing, then there is a problem. If that number decreases down towards zero, then things are looking good.

Were you able to apply this patch? Did it resolve the problem (see the slurmdbd log file and "DBD Agent Queue Size" in sdiag)? I'd really like to make sure that you don't lose accounting data - if that happens, then the problems become much worse and harder to solve.

I have it built, and I'm about to push it out.

I put the patched slurmdbd on our database server and it seems to have cleared up the problem. Thank you for the very fast response on this. What would data loss look like? I did have some runaway jobs on a couple of clusters, although I'm not sure if those were from the same power outage that led to the problem reservations or from the agent queue filling. The slurmctld logs have a lot of messages about

[2019-10-08T15:39:07.862] error: slurmdbd: agent queue filling (32371), RESTART SLURMDBD NOW

which is what prompted me to look at the slurmdbd, but they never got any messages that tell me that the agent queue was completely full.

(In reply to Ryan Day from comment #6)
> I put the patched slurmdbd on our database server and it seems to have
> cleared up the problem. Thank you for the very fast response on this. What
> would data loss look like?

When the DBD agent queue is full, every time slurmctld sends something to slurmdbd to write in the database, it is lost. Examples include jobs and steps starting or ending, a node going down or coming back up, a reservation starting or ending, and other things. Without the patch I provided you, losing reservation start times can unfortunately be self-perpetuating and cause more data loss.
You would see slurmdbd log messages indicating that the agent queue is full:

slurmdbd: agent queue is full

> I did have some runaway jobs on a couple of
> clusters, although I'm not sure if those were from the same power outage
> that led to the problem reservations or from the agent queue filling. The
> slurmctld logs have a lot of messages about
>
> [2019-10-08T15:39:07.862] error: slurmdbd: agent queue filling (32371),
> RESTART SLURMDBD NOW
>
> which is what prompted me to look at the slurmdbd, but they never got any
> messages that tell me that the agent queue was completely full.

Great! That means the angry log messages worked. I'm glad we got this taken care of before any accounting data was lost - that probably would have been a massive headache for all involved. I recommend keeping this patch in place until you upgrade to 19.05. Is there anything else we can do for this ticket?

Sounds good then. I think you've fixed us up. Thanks again for the fast response.

You're welcome. Closing as infogiven.
I have a bunch of errors in my slurmdbd.log along the lines of:

[2019-10-08T13:41:09.137] error: There is no reservation by id 66, time_start 1570270371, and cluster 'quartz'

It seems that the slurmdbd still thinks that there are a set of reservations out there that the various slurmctld's know have been removed:

```
[day36@quartz1148:~]$ scontrol show reservation
No reservations in the system
[day36@quartz1148:~]$ sacctmgr show reservation
   Cluster            Name       TRES           TimeStart             TimeEnd UnusedWall
---------- --------------- ---------- ------------------- ------------------- ----------
     borax   b451_downtime   cpu=1548 2019-10-04T12:55:43 2020-10-03T12:55:43 3.4586e+05
     flash   b451_downtime    cpu=680 2019-10-04T12:55:43 2019-10-08T09:43:05 3.3404e+05
     flash   b451_downtime    cpu=680 2019-10-08T09:43:05 2019-10-08T13:29:30 1.1815e+04
     oslic   b451_downtime    cpu=432 2019-10-04T12:55:43 2019-10-07T09:19:00 2.4620e+05
     oslic   b451_downtime    cpu=432 2019-10-07T09:19:00 2019-10-07T11:43:37 8.6770e+03
    pascal   b451_downtime   cpu=5868 2019-10-04T12:55:43 2020-10-03T12:55:43 3.4586e+05
    quartz   b451_downtime  cpu=93744 2019-10-04T12:55:43 2020-10-03T12:55:43 3.4946e+05
     syrah   b451_downtime   cpu=5024 2019-10-06T16:59:41 2019-10-07T17:32:58 8.4778e+04
[day36@quartz1148:~]$
```

How should I get rid of these? Is it sufficient to just "update quartz_resv_table set deleted='1' where id_resv='66'" (and repeated for other clusters)?
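For reference, the manual route being asked about would amount to something like the fragment below. This is a sketch only: the table name and reservation id come from the reporter's own proposed statement, it is untested here, and the advice elsewhere in the ticket is to prefer the patched slurmdbd over hand-editing the database. Back up slurm_acct_db before any manual change.

```sql
-- Sketch of the proposed manual cleanup (untested; verify the schema and
-- back up slurm_acct_db first). Repeat against each cluster's *_resv_table.
UPDATE quartz_resv_table
   SET deleted = 1
 WHERE id_resv = 66;
```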