| Summary: | Hundreds of "runaway" jobs | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Mark Bartelt <mark> |
| Component: | slurmdbd | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | naveed |
| Version: | 19.05.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=7383, https://bugs.schedmd.com/show_bug.cgi?id=8131 | | |
| Site: | Caltech | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Output of "sdiag" command | | |
Description
Mark Bartelt
2019-12-17 11:37:23 MST
Marshall Garey:

There are multiple possibilities for how database updates might be dropped. A few:

- slurmdbd losing its connection to slurmctld
- mysql timeouts
- slurmdbd crashes
- slurmctld crashes
- Slurm bugs (*)

(*) One bug that caused runaways was fixed in 19.05.1, and I'm not currently aware of any other bug that causes runaways (though of course that doesn't mean there are none).

Could you upload the slurmctld and slurmdbd logs for a period of time when these runaways occurred, and let me know some of the runaway job IDs from that period? Can you also upload the output of sdiag?

In particular, if you ever see this message in the slurmctld log file:

"slurmdbd: agent queue filling (%d), RESTART SLURMDBD NOW"

things are in trouble, and if you ever see this message in the slurmctld log file:

"slurmdbd: agent queue is full (%u), discarding %s:%u request"

you've definitely lost database updates.

Mark Bartelt:

Re your request to "upload the slurmctld and slurmdbd logs for a period of time where these runaways occurred": that's not possible, since we have logs going back only as far as the early morning of December 9th, but the most recent runaway jobs date from the 3rd.

Well, unless older logs are getting copied somewhere and then deleted from /var/log; I don't think that's happening, but I'll ask our Slurm guru to confirm.

None of the logs that we do have contain any messages about "agent queue filling" or "agent queue full".

I'll attach the output of "sdiag".

A question or two: When I do "sacctmgr show RunawayJobs", it asks whether I'd like to fix those runaway jobs, and it goes on to say:

"This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed."

Of the three, wouldn't the start time always be the latest?
Anyway, the main thing that I wanted to confirm with you is: If, when I do "sacctmgr show RunawayJobs", I answer "y" to the question it asks, can we be guaranteed that the runaway jobs are the only ones which will have their accounting info mucked with? I.e., all other jobs' information will remain as it is?

Created attachment 12597 [details]
Output of "sdiag" command
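The two slurmctld log messages Marshall warns about can be watched for with a simple grep. A minimal sketch, assuming the log lives at /var/log/slurmctld.log (the path is an assumption; adjust for your site):

```shell
# Scan the slurmctld log for the two slurmdbd agent-queue warnings
# quoted above. The log path below is an assumption, not a Slurm default.
LOG="${1:-/var/log/slurmctld.log}"

if grep -E 'agent queue (filling|is full)' "$LOG" 2>/dev/null; then
    echo "WARNING: slurmdbd agent queue problems found in $LOG" >&2
else
    echo "no agent queue warnings in $LOG"
fi
```

The pattern matches both the "agent queue filling" warning (restart slurmdbd now) and the "agent queue is full" message (updates already lost), so either one trips the alert.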
Marshall Garey:

(In reply to Mark Bartelt from comment #4)

> Re your request to "upload the slurmctld and slurmdbd logs for a period of time where these runaways occurred": that's not possible, since we have logs going back only as far as the early morning of December 9th, but the most recent runaway jobs date from the 3rd.
>
> Well, unless older logs are getting copied somewhere and then deleted from /var/log; I don't think that's happening, but I'll ask our Slurm guru to confirm.

That's too bad. Next time you see some new runaways occur, save the slurmctld and slurmdbd logs from that time period (if they still exist). That will help us find out what's causing them.

> None of the logs that we do have contain any messages about "agent queue filling" or "agent queue full".

Good!

> I'll attach the output of "sdiag".

Thanks.

> A question or two: When I do "sacctmgr show RunawayJobs", it asks whether I'd like to fix those runaway jobs, and it goes on to say:
>
> "This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed."
>
> Of the three, wouldn't the start time always be the latest?

That actually used to be the behavior. However, it was a bug: the start time is set to 0 until the job starts, and if the job is ineligible (for example, held or dependent), the eligible time is also set to 0. If an ineligible pending job becomes runaway - for example, it gets cancelled but the job cancel update to the database gets dropped - then both the job's start and eligible times are 0. If you fix runaways in that instance, then the <clustername>_last_ran_table in the database is set to 0 and the database re-rolls from time 0 (the UNIX epoch). There was actually one site where this happened, and it took a couple of months to re-roll their database. The job submit time is never zero, however, so we include it in that list.
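The end-time rule described above ("latest out of the start, eligible, or submit times") is just a max over three timestamps, and the reason submit time is included is exactly the zero-timestamp case. A minimal sketch with made-up illustrative values (not real job data):

```shell
# Pick the end time for a runaway job the way sacctmgr describes:
# the latest of its start, eligible, and submit times. For an
# ineligible pending job that went runaway, start and eligible are
# both 0, so the submit time (never zero) keeps the end time off
# the UNIX epoch. The values below are illustrative assumptions.
submit=1575360000
eligible=0   # job was held/dependent, so never became eligible
start=0      # job never started

end=$submit
if [ "$eligible" -gt "$end" ]; then end=$eligible; fi
if [ "$start" -gt "$end" ]; then end=$start; fi

echo "end time: $end"
```

With the old start-time-only behavior, this job's end time would have been 0, which is what dragged `<clustername>_last_ran_table` back to the epoch.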
> Anyway, the main thing that I wanted to confirm with you is: If, when I do "sacctmgr show RunawayJobs", I answer "y" to the question it asks, can we be guaranteed that the runaway jobs are the only ones which will have their accounting info mucked with? I.e., all other jobs' information will remain as it is?

Yes, when the next rollup happens. When fixing runaways, we figure out when we need to re-roll usage from. Then we delete usage from the various assoc and wckey usage tables in the database. When the database re-rolls (which happens every hour), that usage is recalculated and filled into the proper database tables. So just wait until the next hourly rollup and you should see all the usage.

If the usage isn't properly re-rolled for some reason, then that's a bug. If that happens (I don't expect it to), I'll give you a few mysql commands to run to get the usage re-rolled.

Marshall Garey:

Did the usage re-roll correctly? (*)

(*) The fixed runaway jobs will have 0 as their usage; the rest of the jobs should contribute to usage normally.

sdiag didn't show anything strange. I'm afraid that without logs I can't find out what might have caused the runaways. Did more runaways happen? If so, can you upload slurmctld and slurmdbd logs? If not, can we close this bug as timedout? You can re-open it if runaways happen again.

Mark Bartelt:

Things seem OK after using "sacctmgr show RunawayJobs" to get rid of all those runaway jobs. We haven't had any recurrence of this problem. I also have a short script that runs as an hourly cron job to check whether any new runaway jobs have appeared. So if this problem occurs again, we'll know shortly afterward, and we'll be back in touch. But for now, I think it's OK to close this one.

Marshall Garey:

Sounds good. I'll close it as cannotreproduce for now.
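A watchdog like the one Mark describes can be a one-screen script driven by cron. His actual script isn't shown, so everything below is an assumed sketch: the install path, the mail recipient, and the output-parsing heuristic are all guesses, not Slurm-documented behavior.

```shell
# check_runaways.sh - hourly watchdog in the spirit of Mark's cron job
# (an assumed sketch, not his actual script). Install with a crontab
# entry such as:
#   0 * * * * /usr/local/sbin/check_runaways.sh
ADMIN="root"    # assumed mail recipient

# "sacctmgr show runawayjobs" prompts before fixing anything, so feed
# it "n" to inspect the list without touching the database.
out=$(printf 'n\n' | sacctmgr show runawayjobs 2>/dev/null || true)

# Heuristic (an assumption about the output format): any line that
# starts with a digit is taken to be a runaway job row.
if printf '%s\n' "$out" | grep -q '^[0-9]'; then
    printf '%s\n' "$out" | mail -s "runaway Slurm jobs detected" "$ADMIN" || true
fi
```

Answering "n" is the important design choice: it keeps the check read-only, so an admin still decides interactively whether to let sacctmgr rewrite the runaway jobs' end times.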