Created attachment 2262 [details] clus152 log

Job 15647267_247 (15647864) was submitted and ran on clus275 for a while. Then something bad happened to the node and the job was requeued to another node, clus152, which was also in a horrible state. And then the job disappeared from the queue...

------------------------------------------

- output of zgrep 15647864 slurmctld_log.gz

[2015-09-27T15:09:25.906] sched: Allocate JobId=15647267_247 (15647864) NodeList=clus275 #CPUs=32
[2015-09-27T18:18:45.583] requeue job 15647864 due to failure of node clus275
[2015-09-27T18:18:45.583] Requeuing JobID=15647267_247(15647864) State=0x0 NodeCnt=0
[2015-09-27T20:25:47.650] sched: Allocate JobId=15647267_247 (15647864) NodeList=clus152 #CPUs=40
[2015-09-27T20:53:16.047] requeue job 15647864 due to failure of node clus152
[2015-09-27T20:53:16.048] Requeuing JobID=15647267_247(15647864) State=0x0 NodeCnt=0
[2015-09-27T20:53:36.321] sched: Allocate JobId=15647267_247 (15647864) NodeList=clus152 #CPUs=40
[2015-09-27T21:51:37.731] job_complete: JobID=15647267_247(15647864) State=0x1 NodeCnt=1 WIFEXITED 0 WEXITSTATUS 0
[2015-09-27T21:51:37.731] job_complete: JobID=15647267_247(15647864) State=0x8003 NodeCnt=1 done
[2015-09-27T21:54:09.809] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T21:56:49.429] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T21:59:20.031] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T21:59:56.985] job_complete: JobID=15647267_247(15647864) State=0x8003 NodeCnt=1 WIFEXITED 0 WEXITSTATUS 0
[2015-09-27T22:01:59.635] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:04:39.379] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:07:19.186] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:09:59.772] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:12:39.544] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:15:19.977] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:17:59.808] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:20:39.248] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:23:19.432] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:25:59.735] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:28:39.647] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:31:19.783] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:33:59.859] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:36:39.297] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:39:19.182] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:41:59.060] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:44:39.124] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:47:19.221] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:49:59.600] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:52:39.180] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:55:19.069] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:57:59.483] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:00:39.743] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:03:19.215] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:05:50.433] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:08:21.647] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:11:01.906] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:13:32.027] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:16:11.459] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:18:51.801] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:21:31.586] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:24:02.594] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:26:42.330] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:29:22.739] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:32:02.497] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:34:33.561] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:36:09.110] cleanup_completing: job 15647864 completion process took 6272 seconds

- A portion of the clus152 log is attached

- sacct output

[root@pud22 ~]# sacct -j 15647864
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
15647267_247 rt_m141_g+ teamrotto+                    40  COMPLETED      0:9
15647267_24+      batch                               40  CANCELLED      0:9

Long story short: once again, with a node that was seriously ill, Slurm failed to recover the task back into the queue, instead dropping it on the floor. This is why I (Phil) continue to insist that the default policy should be to put tasks back into the queue unless the task is explicitly cancelled by the user, or it runs absolutely to completion and exits with a 0 exit code.
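For reference, these are the knobs Slurm already exposes for requeue behaviour. This is only a generic sketch, not our actual configuration; the script name "job.sh" and "<jobid>" are placeholders:

# slurm.conf: let batch jobs be requeued after node failure (site-wide default)
JobRequeue=1

# per-job overrides at submission time
sbatch --requeue job.sh      # this job may be requeued
sbatch --no-requeue job.sh   # this job must never be requeued

# manual requeue by an operator
scontrol requeue <jobid>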
Here are some relevant bits of the slurmctld log file with my comments preceded by "***":

[2015-09-27T15:09:25.906] sched: Allocate JobId=15647267_247 (15647864) NodeList=clus275 #CPUs=32
[2015-09-27T18:18:45.583] requeue job 15647864 due to failure of node clus275
[2015-09-27T18:18:45.583] Requeuing JobID=15647267_247(15647864) State=0x0 NodeCnt=0
*** JOB ALLOCATED RESOURCES, STARTS, NODE FAILS, JOB REQUEUED
[2015-09-27T20:25:47.650] sched: Allocate JobId=15647267_247 (15647864) NodeList=clus152 #CPUs=40
[2015-09-27T20:53:16.047] requeue job 15647864 due to failure of node clus152
[2015-09-27T20:53:16.048] Requeuing JobID=15647267_247(15647864) State=0x0 NodeCnt=0
*** JOB ALLOCATED RESOURCES, STARTS, NODE FAILS, JOB REQUEUED
[2015-09-27T20:53:36.321] sched: Allocate JobId=15647267_247 (15647864) NodeList=clus152 #CPUs=40
*** JOB ALLOCATED BACK TO SAME NODE clus152
*** AT THIS POINT WE HAVE TWO COPIES OF THE SAME JOB RUNNING ON THE SAME BAD NODE
[2015-09-27T21:51:37.731] job_complete: JobID=15647267_247(15647864) State=0x1 NodeCnt=1 (RUNNING)
*** FIRST COPY OF THE JOB COMPLETES AND SLURM CONSIDERS THE JOB DONE BASED UPON MATCHING HOSTNAME AND JOB ID

=========================

Do you have something that configures down nodes back online immediately? If so, you could have all sorts of bad vestigial state, including jobs. Slurm uses the job ID and host names to uniquely identify jobs, so it isn't currently possible to distinguish between a vestigial job and a new one started on the same node.

Are you familiar with the available node health check tools? It may be valuable to run something like this either at the end of each job or on a periodic basis:
http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check
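To sketch what that could look like in slurm.conf (the nhc path, the interval, and the ReturnToService value below are illustrative assumptions, not taken from your configuration):

# run a node health check program on the compute nodes
HealthCheckProgram=/usr/sbin/nhc   # assumed install path for the NHC script
HealthCheckInterval=300            # seconds between checks (illustrative)
HealthCheckNodeState=ANY           # check nodes regardless of state

# related: how aggressively DOWN nodes are returned to service
ReturnToService=1                  # 0=admin only, 1=only if DOWN for being non-responsive, 2=any DOWN node that registers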
> Do you have something that configures down nodes back online immediately?

We don't. This happened while/after recovering from a power outage, so it could have been a flickering network.

> Are you familiar with available node health check tools?
> It may be valuable to run something like this either at the end of each job or
> on a periodic basis.

We don't use that tool; we run our monitoring (Nagios/Opsview) checks periodically, which monitor and sometimes fix issues on the nodes.

So the job vanished because it landed on the same node twice, one of the copies finished, and Slurm removed the job from the queue?
Correct.
I've got a pretty good idea of what happened then. It took some bad luck and bad timing, but I've implemented changes to prevent it.

1. Job launched
2. Node in some bad state, set DOWN and cancel job request sent
3. Node starts responding, set back to IDLE state
4. Job launched for a second time on the same node; at this point two copies are running (the problem only happens when that same node is used)
5. First copy of the job ends due to the cancel request in step #2, job marked as completed
6. Second copy of the job discovered and cancelled, as the job was already considered complete

What this patch does differently starts at step 4:

4. Job launch request sent to compute node. Slurmd discovers this is a duplicate job request and issues a requeue/hold request on the job.

Here is the change:
https://github.com/SchedMD/slurm/commit/c15139562ff19948c076d2076b05b3fb3f68287e
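With that change a duplicate launch should end up requeued and held rather than silently finishing the job. Roughly, using ordinary scontrol commands (<jobid> is a placeholder), an operator could then do something like:

# confirm the job was requeued and is now held
scontrol show job <jobid> | grep -E 'JobState|Reason|Restarts'

# release it once the node has been dealt with
scontrol release <jobid>

The same requeue-and-hold can also be done manually with "scontrol requeuehold <jobid>".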
Thanks, Moe!

So just to be clear, if we omit the second job launching from the sequence:

> 1. Job launched
> 2. Node in some bad state, set DOWN and cancel job request sent
> 3. Node starts responding, set back to IDLE state
> 4. First copy of job ends due to cancel request in step #2

...then the job would definitely not be marked as completed, right? Because the fact that the job was marked completed was the only truly troubling part of the story, for me.

Cheers,
-Phil
(In reply to Phil Schwan from comment #5)
> Thanks, Moe!
>
> So just to be clear, if we omit the second job launching from the sequence:
>
> > 1. Job launched
> > 2. Node in some bad state, set DOWN and cancel job request sent
> > 3. Node starts responding, set back to IDLE state
> > 4. First copy of job ends due to cancel request in step #2
>
> ...then the job would definitely not be marked as completed, right? Because
> the fact that the job was marked completed was the only truly troubling part
> of the story, for me.

Correct. The job only gets marked as completed when a message saying the job ended arrives for a job that was in RUNNING state on the same node that it was dispatched to. It needs to match:

1. Job ID
2. Job state of running
3. Same node as dispatched to