Ticket 866

Summary: Batch JobId= missing from node 0 / job credential revoked
Product: Slurm
Reporter: Phil Schwan <phils>
Component: slurmctld
Assignee: David Bigagli <david>
Status: RESOLVED FIXED
Severity: 1 - System not usable
CC: da, stuartm
Version: 14.11.x
Hardware: Linux
OS: Linux
Site: DownUnder GeoSolutions
Version Fixed: 14.03.4
Attachments: slurmctld epilog in C

Description Phil Schwan 2014-06-06 13:21:26 MDT
Running your 02e6033e9e26 + our standard patches.

...

Unlike bug 840, this message is occurring when the job *isn't* missing:

From slurmctld.log:

> [2014-06-07T08:46:03.746] _slurm_rpc_submit_batch_job JobId=2736098 usec=1892
> [2014-06-07T08:47:48.706] update_job: setting partition to teamswanTest,teamswan,teambmIdle for job_id 2736098
> [2014-06-07T08:47:48.708] _slurm_rpc_update_job complete JobId=2736098 uid=0 usec=1747
> [2014-06-07T08:48:07.858] backfill: Started JobId=2736098 on clus080
> [2014-06-07T08:57:28.029] Batch JobId=2736098 missing from node 0
> [2014-06-07T08:57:28.029] completing job 2736098 status -2
> [2014-06-07T08:57:28.029] Job 2736098 cancelled from interactive user
> [2014-06-07T08:57:28.034] Requeue JobId=2736098 due to node failure
> [2014-06-07T08:57:28.035] sched: job_complete for JobId=2736098 successful, exit code=4294967294

(I can assure you that "Job 2736098 cancelled from interactive user" is false.  It's my job; I canceled nothing.)

From the node's syslog:

> 2014-06-07T08:57:38+08:00 clus048 slurmstepd[64148]: task_p_pre_launch_priv: 2736098.4294967294
> 2014-06-07T08:57:38+08:00 clus048 slurmstepd[64144]: Job 2736098 memory used:1560 limit:7520256 KB
> 2014-06-07T08:57:38+08:00 clus048 slurmstepd[64144]: job_container none plugin loaded
> 2014-06-07T08:57:38+08:00 clus048 slurmstepd[64148]: task_p_pre_launch: 2736098.4294967294, task 0
> 2014-06-07T08:57:38+08:00 clus048 slurmstepd[64148]: [job 2736098] attempting to run slurm task_prolog [/d/sw/slurm/etc/slurm_task_prolog.sh]
> 2014-06-07T08:57:38+08:00 clus048 slurmstepd[64148]: export name:DUG_LICENSE:val:PER2-PUG1-L38W-Y8PG-WNXT-TL:
> 2014-06-07T08:57:38+08:00 clus048 slurmstepd[64148]: export name:DUG_LICENSE_SERVER:val:dugy:8080:
> 2014-06-07T08:57:38+08:00 clus048 slurmstepd[64148]: export name:COI_SEP_DISABLE:val:TRUE:
> 2014-06-07T08:57:38+08:00 clus048 slurmstepd[64148]: export name:USP_PRINT_PREFIX:val:DUG_2736098_:
> 2014-06-07T08:57:38+08:00 clus048 slurmstepd[64148]: export name:TMPDIR:val:/localData/queue-teamswan/job2736098.mjlM:
> 2014-06-07T08:57:38+08:00 clus048 slurmstepd[64148]: export name:CWP_TMPDIR:val:/localData/queue-teamswan/job2736098.mjlM:
> 2014-06-07T08:57:39+08:00 clus048 slurmstepd[64144]: Job 2736098 memory used:17616 limit:7520256 KB
> 2014-06-07T08:57:40+08:00 clus048 slurmd[21337]: debug:  _rpc_terminate_job, uid = 601
> 2014-06-07T08:57:40+08:00 clus048 slurmd[21337]: debug:  task_p_slurmd_release_resources: 2736098
> 2014-06-07T08:57:40+08:00 clus048 slurmd[21337]: debug:  credential for job 2736098 revoked
> 2014-06-07T08:57:40+08:00 clus048 slurmstepd[64144]: Null authentication plugin loaded
> 2014-06-07T08:57:40+08:00 clus048 slurmstepd[64144]: Handling REQUEST_SIGNAL_CONTAINER
> 2014-06-07T08:57:40+08:00 clus048 slurmstepd[64144]: _handle_signal_container for step=2736098.4294967294 uid=0 signal=18
> 2014-06-07T08:57:40+08:00 clus048 slurmstepd[64144]: Sent signal 18 to 2736098.4294967294
> 2014-06-07T08:57:40+08:00 clus048 slurmstepd[64144]: Handling REQUEST_SIGNAL_CONTAINER
> 2014-06-07T08:57:40+08:00 clus048 slurmstepd[64144]: _handle_signal_container for step=2736098.4294967294 uid=0 signal=15
> 2014-06-07T08:57:40+08:00 clus048 slurmstepd[64144]: error: *** JOB 2736098 CANCELLED AT 2014-06-07T08:57:40 ***
> 2014-06-07T08:57:40+08:00 clus048 slurmstepd[64144]: Sent signal 15 to 2736098.4294967294
> 2014-06-07T08:57:40+08:00 clus048 slurmstepd[64144]: Handling REQUEST_STATE
> 2014-06-07T08:57:40+08:00 clus048 slurmstepd[64144]: Job 2736098 memory used:0 limit:7520256 KB
> 2014-06-07T08:57:40+08:00 clus048 slurmstepd[64144]: task 0 (64148) exited. Killed by signal 15.
> 2014-06-07T08:57:40+08:00 clus048 slurmstepd[64144]: task_p_post_term: 2736098.4294967294, task 0
> 2014-06-07T08:57:41+08:00 clus048 slurmstepd[64144]: Handling REQUEST_STATE
> 2014-06-07T08:57:41+08:00 clus048 slurmstepd[64144]: cpu_freq_reset: #cpus reset = 0
> 2014-06-07T08:57:41+08:00 clus048 slurmstepd[64144]: Message thread exited
> 2014-06-07T08:57:41+08:00 clus048 slurmstepd[64144]: get_exit_code task 0 killed by cmd
> 2014-06-07T08:57:41+08:00 clus048 slurmstepd[64144]: job 2736098 completed with slurm_rc = 0, job_rc = -2
> 2014-06-07T08:57:41+08:00 clus048 slurmstepd[64144]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
> 2014-06-07T08:57:41+08:00 clus048 slurmstepd[64144]: done with job
> 2014-06-07T08:57:42+08:00 clus048 slurmd[21337]: debug:  Waiting for job 2736098's prolog to complete
> 2014-06-07T08:57:42+08:00 clus048 slurmd[21337]: debug:  Finished wait for job 2736098's prolog to complete
> 2014-06-07T08:57:42+08:00 clus048 slurmd[21337]: debug:  Calling /d/sw/slurm/20140606/sbin/slurmstepd spank epilog
> 2014-06-07T08:57:42+08:00 clus048 spank-epilog[64966]: Reading slurm.conf file: /d/sw/slurm/20140606/etc/slurm.conf
> 2014-06-07T08:57:42+08:00 clus048 spank-epilog[64966]: Running spank/epilog for jobid [2736098] uid [1047]
> 2014-06-07T08:57:42+08:00 clus048 spank-epilog[64966]: spank: opening plugin stack /d/sw/slurm/20140606/etc/plugstack.conf
> 2014-06-07T08:57:42+08:00 clus048 slurmd[21337]: debug:  [job 2736098] attempting to run epilog [/d/sw/slurm/etc/slurm_epilog.sh]
> 2014-06-07T08:57:43+08:00 clus048 slurmd[21337]: debug:  completed epilog for jobid 2736098
> 2014-06-07T08:57:43+08:00 clus048 slurmd[21337]: debug:  Job 2736098: sent epilog complete msg: rc = 0


What causes "credential for job 2736098 revoked"?

Because that happened *again* after I released the job.  Only this time, the "Job credential revoked" message appeared in the slurmctld.log, instead of "missing from node 0":

> [2014-06-07T08:57:33.309] sched: Allocate JobId=2736098 NodeList=clus048 #CPUs=64
> [2014-06-07T08:57:40.177] _slurm_rpc_requeue: 2736098: usec=18650
> [2014-06-07T08:57:41.264] completing job 2736098 status 15
> [2014-06-07T08:57:43.781] requeue batch job 2736098
> [2014-06-07T09:03:37.568] sched: update_job: releasing hold for job_id 2736098, new priority is 100
> [2014-06-07T09:03:37.568] _slurm_rpc_update_job complete JobId=2736098 uid=1047 usec=144
> [2014-06-07T09:03:43.748] sched: Allocate JobId=2736098 NodeList=clus044 #CPUs=64
> [2014-06-07T09:03:44.826] _slurm_rpc_requeue: 2736098: usec=21613
> [2014-06-07T09:03:46.173] requeue batch job 2736098
> [2014-06-07T09:03:47.198] slurmd error running JobId=2736098 from node(s)=clus044: Job credential revoked
> [2014-06-07T09:03:47.198] completing job 2736098 status 1
> [2014-06-07T09:03:47.198] Job 2736098 cancelled from interactive user


Third time's a charm?  Nope.  This time, slurmctld.log doesn't say anything about why it's being killed (neither "missing from node" nor "job credential revoked"):

> [2014-06-07T09:15:21.594] sched: update_job: releasing hold for job_id 2736098, new priority is 100
> [2014-06-07T09:15:21.594] _slurm_rpc_update_job complete JobId=2736098 uid=1047 usec=217
> [2014-06-07T09:15:24.887] sched: Allocate JobId=2736098 NodeList=clus050 #CPUs=64
> [2014-06-07T09:15:33.947] _slurm_rpc_requeue: 2736098: usec=19561
> [2014-06-07T09:15:34.998] completing job 2736098 status 15

but the syslog on clus050 reveals:

> 2014-06-07T09:15:33+08:00 clus050 slurmd[29549]: debug:  credential for job 2736098 revoked

None of the other tasks in this job seem to have this problem.  These logs are from three different cluster nodes, so it doesn't seem machine-specific.  I give up!
Comment 1 Phil Schwan 2014-06-07 10:40:11 MDT
The root cause seems to be that the epilog thinks its requeue is failing, when in fact it appears to be succeeding.  From our epilog log:

> Sat Jun  7 01:44:50 WST 2014    SLURM_ARRAY_JOB_ID=2735047 SLURM_ARRAY_TASK_ID=1996 SLURM_CLUSTER_NAME=perth SLURM_JOBID=2735054
> SLURM_JOB_ACCOUNT=(null) SLURM_JOB_CONSTRAINTS=(null) SLURM_JOB_DERIVED_EC=0 SLURM_JOB_EXIT_CODE=25600 SLURM_JOB_EXIT_CODE2=100:0
> SLURM_JOB_GID=2114 SLURM_JOB_GROUP=teamswan SLURM_JOB_ID=2735054 SLURM_JOB_NAME=dp_NMO SLURM_JOB_NODELIST=clus509
> SLURM_JOB_PARTITION=teamswan,teamswanTest SLURM_JOB_UID=1288 JOB_EXIT=100
> Job can not be altered now, try again later for job 2735047_1996 (2735054)
> slurm_requeue error: Job can not be altered now, try again later

To ensure that jobs don't vanish (see bug 717), our epilog retries the requeue in a loop until it succeeds.
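
For illustration, the loop amounts to the following, recast here in C against the public slurm_requeue() API (the real epilog is a shell script looping over "scontrol requeue"; the retry interval below is invented):

    /* Illustrative recast of the epilog's requeue-until-success logic.
     * Assumes the public slurm_requeue() from slurm/slurm.h, which in
     * releases of this era takes a job id plus a flags/state argument
     * (older releases take only the job id). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <slurm/slurm.h>
    #include <slurm/slurm_errno.h>

    int main(int argc, char **argv)
    {
        uint32_t job_id;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <job_id>\n", argv[0]);
            return 1;
        }
        job_id = (uint32_t) strtoul(argv[1], NULL, 10);

        /* No retry bound: if the controller reports an error even though
         * the requeue actually succeeded (the bug in this ticket), this
         * loop never exits and epilog processes pile up. */
        while (slurm_requeue(job_id, 0) != SLURM_SUCCESS)
            sleep(1);
        return 0;
    }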

Unfortunately, the requeue appears to "fail" no matter how many times the epilog attempts it:

> Job can not be altered now, try again later for job 2735047_1996 (2735054)
> slurm_requeue error: Job can not be altered now, try again later

...which is why we have this on our slurmctld node:

> # ps auxw | grep epilog.sh | wc -l
> 1154

But of course it's not actually failing: it's requeueing the job just fine, which is why the job keeps getting killed and put into SE (SPECIAL_EXIT).

It's also managing to bring jobs back from the dead after an explicit scancel (!)

I'm probably going to try to roll back to a version from a few days ago, to see if I can find some stability.
Comment 2 Phil Schwan 2014-06-07 10:49:24 MDT
I wonder if perhaps bab22e4f8ad335 is to blame for the epilog thinking it's failing over and over, since it will be attempting to requeue jobs that are still in CG (completing) state?
Comment 3 David Bigagli 2014-06-09 04:54:31 MDT
Yes, the commit bab22e4f8ad335 is not right; we got carried away.
We should return success and requeue the job, basically allowing requeue
of jobs in completing state. I realized this over the weekend, but you
were faster... sorry.
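
As a toy model (invented names, not Slurm source), the change in behavior is essentially:

    /* Toy model of the behavior change, with invented names.  The bad
     * commit rejected a requeue of a job still completing; the fix
     * accepts it, deferring the requeue until completion finishes. */
    #include <stdio.h>

    enum state { RUNNING, COMPLETING };
    struct job { enum state state; int requeue_pending; };

    /* bab22e4f8ad335-style behavior: error out on completing jobs. */
    static int requeue_bad(struct job *j)
    {
        if (j->state == COMPLETING)
            return -1;  /* "Job can not be altered now, try again later" */
        j->requeue_pending = 1;
        return 0;
    }

    /* Intended behavior: accept and defer the requeue instead. */
    static int requeue_fixed(struct job *j)
    {
        j->requeue_pending = 1;  /* honored once completion finishes */
        return 0;
    }

    int main(void)
    {
        struct job j = { COMPLETING, 0 };
        printf("bad commit: rc=%d\n", requeue_bad(&j));
        int rc = requeue_fixed(&j);
        printf("after fix:  rc=%d, pending=%d\n", rc, j.requeue_pending);
        return 0;
    }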

In any case, I would change your requeue script to not retry scontrol forever,
but just a few times. Another option would be to use the job_requeue() API;
it will be faster and give you better control over the return codes.
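
A minimal sketch of such a bounded-retry requeue via the C API, assuming the public slurm_requeue() wrapper from slurm/slurm.h (the retry count and interval are invented):

    /* Minimal sketch of a bounded-retry requeue via the Slurm C API.
     * slurm_strerror()/slurm_get_errno() yield the same text scontrol
     * prints, e.g. "Job can not be altered now, try again later". */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <slurm/slurm.h>
    #include <slurm/slurm_errno.h>

    #define MAX_TRIES 5   /* invented bound: a few tries, never forever */

    int main(int argc, char **argv)
    {
        uint32_t job_id;
        int i, rc = SLURM_ERROR;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <job_id>\n", argv[0]);
            return 1;
        }
        job_id = (uint32_t) strtoul(argv[1], NULL, 10);

        /* A bounded loop cannot leave thousands of stuck epilog
         * processes running on the controller node. */
        for (i = 0; i < MAX_TRIES; i++) {
            rc = slurm_requeue(job_id, 0);
            if (rc == SLURM_SUCCESS)
                break;
            fprintf(stderr, "requeue of job %u failed: %s\n",
                    job_id, slurm_strerror(slurm_get_errno()));
            sleep(2);
        }
        return (rc == SLURM_SUCCESS) ? 0 : 1;
    }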

David
Comment 4 David Bigagli 2014-06-09 04:59:34 MDT
Created attachment 914
slurmctld epilog in C
Comment 5 David Bigagli 2014-06-09 06:11:21 MDT
Commit: a9c1c8e5f50f98f0f6 reverts bab22e4f8ad335.

David
Comment 6 Phil Schwan 2014-06-10 06:32:35 MDT
Well, I think the intention is to do away with the epilog requeue entirely (per bug 717) -- right?
Comment 7 David Bigagli 2014-06-10 06:37:26 MDT
Absolutely, yes. Actually, I am going to retest the development in light of
the new fixes, especially the dependencies.

David
Comment 8 David Bigagli 2014-06-10 06:38:40 MDT
However, that will take some time for you to implement, I guess, so the option
of using the API is probably better: faster, and with better control over the
return code than a script. Just an idea...

David
Comment 9 David Bigagli 2014-06-10 06:52:46 MDT
Revert bad commit.

David