Ticket 833 - job was killed on a dead node, but epilog was told that the job was successful
Summary: job was killed on a dead node, but epilog was told that the job was successful
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 14.11.x
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2014-05-23 13:32 MDT by Phil Schwan
Modified: 2014-05-27 21:37 MDT

See Also:
Site: DownUnder GeoSolutions
Version Fixed: 14.03.4


Attachments
attachment-21289-0.html (2.56 KB, text/html)
2014-05-26 11:47 MDT, Moe Jette
Details

Description Phil Schwan 2014-05-23 13:32:07 MDT
This was running your version b5ace9a8083f337 plus our standard modifications.

We had a node that failed in a sort of incomplete way (its swap disk died), which I think led to the slurm daemon on that node being half alive, but not really operating properly.

Anyway, slurmctld did the right thing, recognised the node as down, and killed its job.  From the slurm log:

> [2014-05-21T03:48:47.558] sched: Allocate JobId=2168558 NodeList=clus497 #CPUs=24
> [2014-05-21T05:43:35.740] Batch JobId=2168558 missing from node 0
> [2014-05-21T05:43:35.745] completing job 2168558 status -2

However, it ran the epilog with an exit code of 0:0, so our script thought it ran successfully to completion, and removed it from the system.  From our epilog log:

> Wed May 21 05:43:35 WST 2014 SLURM_ARRAY_JOB_ID=2168073 SLURM_ARRAY_TASK_ID=486 SLURM_CLUSTER_NAME=perth
> SLURM_JOBID=2168558 SLURM_JOB_ACCOUNT=(null) SLURM_JOB_CONSTRAINTS=localdisk SLURM_JOB_DERIVED_EC=0
> SLURM_JOB_EXIT_CODE=0 SLURM_JOB_EXIT_CODE2=0:0 SLURM_JOB_GID=2102 SLURM_JOB_GROUP=bm
> SLURM_JOB_ID=2168558 SLURM_JOB_NAME=dp_P_Broad SLURM_JOB_NODELIST=clus497
> SLURM_JOB_PARTITION=teambm SLURM_JOB_UID=1233 SLURM_JOB_USER=kd JOB_EXIT=0

We probably agree this is not what should happen?  Or have I misinterpreted the tea leaves?
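For illustration, the decision the epilog script is making can be sketched as below. This is a hypothetical stand-in, not the site's actual epilog: `decide_job_fate` and the `remove`/`requeue` outputs are invented names for the site-specific actions, but the input format matches the `SLURM_JOB_EXIT_CODE2` value ("exit_code:signal") that Slurm exports to the epilog.

```shell
#!/bin/sh
# Minimal sketch of an epilog's success check, assuming the job's fate
# is decided solely from SLURM_JOB_EXIT_CODE2 ("exit_code:signal").

decide_job_fate() {
    # $1 is the SLURM_JOB_EXIT_CODE2 value, e.g. "0:0"
    exit_code=${1%%:*}   # part before the colon
    signal=${1##*:}      # part after the colon
    if [ "$exit_code" -eq 0 ] && [ "$signal" -eq 0 ]; then
        echo remove      # looks like a clean success: drop it from our system
    else
        echo requeue     # non-zero exit or killed by signal: keep it around
    fi
}

# The bug: a job killed on a dead node still reached the epilog as "0:0",
# so logic like this removed it instead of requeueing it.
decide_job_fate "0:0"    # prints: remove
decide_job_fate "1:0"    # prints: requeue
```

Any non-zero exit code (or signal) reported for the killed job would be enough to steer this logic to the `requeue` branch, which is the behaviour being requested in this ticket.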
Comment 1 Phil Schwan 2014-05-24 14:32:32 MDT
During some cluster maintenance this weekend -- in which nodes would be expected to be unresponsive for a time -- hundreds of jobs disappeared in what appears to be similar fashion.

It's hard to separate the signal from the noise with so many thousands of jobs, but I see many instances where slurmctld.log records a non-zero status while a zero status is passed to the epilog:

> [2014-05-24T20:35:30.694] completing job 2264172 status 15
> [2014-05-24T20:35:30.694] _slurm_rpc_complete_batch_script JobId=2264172: Job/step already completing or completed

And in the epilog:

> Sat May 24 20:35:27 WST 2014    SLURM_ARRAY_JOB_ID=2263492 SLURM_ARRAY_TASK_ID=681 SLURM_CLUSTER_NAME=perth SLURM_JOBID=2264172
> SLURM_JOB_ACCOUNT=(null) SLURM_JOB_CONSTRAINTS=localdisk SLURM_JOB_DERIVED_EC=0 SLURM_JOB_EXIT_CODE=0 SLURM_JOB_EXIT_CODE2=0:0
> SLURM_JOB_GID=2102 SLURM_JOB_GROUP=teambm SLURM_JOB_ID=2264172 SLURM_JOB_NAME=dp_PBroad SLURM_JOB_NODELIST=clus230
> SLURM_JOB_PARTITION=teambm SLURM_JOB_UID=1233 SLURM_JOB_USER=kd JOB_EXIT=0
Comment 2 Moe Jette 2014-05-26 11:47:03 MDT
Created attachment 869
attachment-21289-0.html

Would you happen to be changing the order in which your nodes are defined in slurm.conf and running "scontrol reconfigure"? Something is clearly messing up the bitmap-to-node ordering.

Comment 3 Phil Schwan 2014-05-27 00:29:50 MDT
(In reply to Moe Jette from comment #2)
> 
> Would you happen to be changing the order in which your nodes are defined in
> Slurm.conf and running "scontrol reconfigure"? Something is clearly messing
> up the bitmap to node ordering.

I don't believe the slurm.conf was changed at all before this Saturday's maintenance event, and any slurm reconfigure that might have happened (if one happened at all, e.g. triggered by logrotate) would have been at least 12 hours prior to this instance.

Regardless, this bug is really about the fact that slurm _knew_ it was killing the job before it finished, but (as far as I can tell) it still told the epilog that it was successful, so the job was removed rather than being requeued.
Comment 4 Moe Jette 2014-05-27 08:35:32 MDT
I'm not sure what would be the best exit code value to set for a missing batch job, but I agree that zero definitely isn't good. Here is a patch to set the exit code to 1, but I am open to suggestions for other values.

https://github.com/SchedMD/slurm/commit/38a78b3fcd890f274ad5481c375863ad60e251d8
Comment 5 Phil Schwan 2014-05-27 21:37:41 MDT
Thanks.  I agree; no idea what the code should be, but I'm happy as long as it's non-zero.

I think I see what might be the root cause, which I'll file separately.