Created attachment 899 [details]
attachment-12028-0.html

Running your 666bbe05fe2bf + our standard patches.

The setup:

- we have a single-task array job 2534575 which depends on 10-task array job 2534565 (tasks 1916-1925)
- below we examine the log entries for the dependent job (2534575) and one of the jobs on which it depends (2534566, aka 2534565_1917)
- at 08:52, there was a slurm reconfig for an unknown reason (still tracking that down)
- at 09:58, we recovered from a very localised power outage that affected the slurmctld nodes, but not the cluster nodes
- after that second recovery, job 2534566 "completes" (suspiciously, but we'll get to that)
- because it completed unsuccessfully, 2534566 gets requeued immediately
- even though 2534566 is still running, 2534575 gets allocated

> [2014-06-04T00:59:40.615] _slurm_rpc_submit_batch_job JobId=2534565 usec=516135
> [2014-06-04T00:59:42.643] _slurm_rpc_submit_batch_job JobId=2534575 usec=1260057
> ...
> [2014-06-04T03:28:46.169] backfill: Started JobId=2534566 on clus551
> ...
> [2014-06-04T08:52:05.557] Recovered job 2534566 0
> [2014-06-04T08:52:05.557] Recovered job 2534575 0
> ...
> [2014-06-04T09:58:06.201] Recovered job 2534566 0
> [2014-06-04T09:58:06.201] Recovered job 2534575 0
> [2014-06-04T09:58:15.042] completing job 2534566 status 25600
> [2014-06-04T09:58:15.049] sched: job_complete for JobId=2534566 successful, exit code=25600
> [2014-06-04T09:58:20.482] _slurm_rpc_requeue: 2534566: usec=50723
> [2014-06-04T09:58:30.249] sched: Allocate JobId=2534575 NodeList=clus202 #CPUs=24

This leads me to believe that 2534575's dependencies have been lost, possibly during the 09:58 restart. (That said, I don't think this is the only situation in which dependencies are getting lost. I've had complaints for the past 3 or 4 days, but not enough information in those complaints to put together an actionable bug report.)

Question: is status 25600 significant? I'm skeptical that this job actually completed.
What are the odds that a job that had been running for 6.5 hours just happened to finish 1 second after slurmctld was restored from a power outage? I don't believe in coincidences like that.

From the log on clus551, where it was running:

> 2014-06-04T09:58:16+08:00 clus551 slurmstepd[35230]: done with job
> 2014-06-04T09:58:16+08:00 clus551 slurmd[55834]: debug: _rpc_terminate_job, uid = 601
> 2014-06-04T09:58:16+08:00 clus551 slurmd[55834]: debug: task_p_slurmd_release_resources: 2534566

I'm happy to file this separately if you think there's something to it. It smells a little bit like bug 805, but without 805's tell-tale "Killing job" messages.
This may be relevant... The job record contains three dependency fields:

    List depend_list;       /* list of job_ptr:state pairs */
    char *dependency;       /* wait for other jobs */
    char *orig_dependency;  /* original value (for archiving) */

Once a dependency is satisfied, it gets removed from the depend_list and dependency fields. If a job gets requeued, its dependency should probably be restored from orig_dependency, but that needs to be checked. I'm not sure what should happen if a job gets requeued that some other job is dependent upon (i.e. the dependency was once satisfied, but is no longer satisfied when considering the requeued job's new state).
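To make the suggestion concrete, here is a toy Python model (not Slurm source; the attribute names just mirror the struct fields above) of restoring the active dependency from orig_dependency when a job is requeued:

```python
# Toy model of the three dependency fields on a job record. This is NOT
# Slurm code -- it only illustrates the proposed behaviour: prune the
# active dependency as clauses are satisfied, and rebuild it from the
# archived original if the job is requeued.

class Job:
    def __init__(self, dependency):
        self.dependency = dependency         # active spec, pruned as deps finish
        self.orig_dependency = dependency    # original spec, kept for archiving
        self.depend_list = dependency.split(",") if dependency else []

    def satisfy(self, spec):
        """A dependency clause was satisfied: prune it from the active fields."""
        self.depend_list = [d for d in self.depend_list if d != spec]
        self.dependency = ",".join(self.depend_list)

    def requeue(self):
        """Proposed fix: rebuild the active dependency from the original."""
        self.dependency = self.orig_dependency
        self.depend_list = self.orig_dependency.split(",") if self.orig_dependency else []

job = Job("afterany:2534565")
job.satisfy("afterany:2534565")   # dependency pruned -> job becomes eligible
assert job.dependency == ""
job.requeue()                     # after a requeue, the dependency is back
assert job.dependency == "afterany:2534565"
```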
Can you tell me what the original dependency looked like? If it was "afterany:2534565", then the dependency on job 2534566, aka 2534565_1917, would be satisfied when that job completed unsuccessfully, even though it was then requeued. If you want the dependency to be satisfied only when the job completes successfully, the dependency should be "afterok:2534565".
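For illustration, here is a simplified Python model (an assumption-laden sketch, not Slurm's actual scheduler code) of which terminal states satisfy each dependency type:

```python
# Simplified model of how afterany/afterok/afternotok clauses evaluate
# against a depended-upon job's state. NOT Slurm source -- the state
# names are borrowed from squeue's long-form state labels.

TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}

def dependency_satisfied(dep_type, state, exit_code):
    if state not in TERMINAL:
        return False                       # job still pending/running/completing
    if dep_type == "afterany":
        return True                        # any termination satisfies it
    if dep_type == "afterok":
        return state == "COMPLETED" and exit_code == 0
    if dep_type == "afternotok":
        return state != "COMPLETED" or exit_code != 0
    raise ValueError(dep_type)

# afterany is satisfied by an unsuccessful completion, afterok is not:
assert dependency_satisfied("afterany", "FAILED", 100)
assert not dependency_satisfied("afterok", "FAILED", 100)
# while the depended-upon job is merely running again, neither triggers:
assert not dependency_satisfied("afterany", "RUNNING", 0)
```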
We use afterany. It was my understanding that this triggered when jobs left the queue, which doesn't happen in our environment (due to use of SE) when jobs error.
It means "after any exit code"; the job exits before being requeued.
According to the sbatch documentation:

    afterany:job_id[:jobid...]
        This job can begin execution after the specified jobs have terminated.

My testing when we started indicated that afterany only occurred when the job left the queue (i.e. squeue no longer showed it). If the job went from CG to SE then it worked as we expected (as the job hasn't terminated).

The afterok did not work for us. If the job exited non-zero (went into SE) and we then scancel'ed it, then the dependent jobs did not run... which sort of makes sense given its documentation (i.e. requiring an exit code of 0).
The job doesn't go from CG to SE; it goes from CG to completed, with exit code either 0 or != 0, and then it gets requeued by your script. Indeed there is a very small window in which squeue won't show it unless -t all is specified. This works as designed.

What exactly is the behaviour you want?
Created attachment 900 [details] attachment-23591-0.html
What we want is for jobs that have dependencies which end up in SE state to still depend on them... however they get to SE state. We also need an scancel of a job to satisfy the dependency. I guess we want an afterleftqueue
> The job doesn't go from CG to SE; it goes from CG to completed, with exit code
> either 0 or != 0, and then it gets requeued by your script.

Really? Really really?

I thought the epilog script ran while the job was still in CG, and thus it was moved directly from CG to SE? If there's an intermediate "completed" state, then it sounds like the race condition is worse than we feared.
Really! Really! Really! :-)

Indeed there is a race between the slurmctld epilog and the slurmd epilog. The job's completing flag is set when the controller receives the job exit status from the slurmstepd, at the same time it sets the job state to JOB_FAILED (I am considering exit code != 0). The completing flag is cleared and the job is requeued *only* after the slurmd epilog completes.

Consider 2 cases when the job has the completing flag set and is in JOB_FAILED state because it exited with code != 0:

1) The slurmctld epilog finishes first (slurmd epilog still running). The scontrol requeue came through, but since the job is in the completing transition state we don't requeue it; we just remember the requeue request. However, since the job state is JOB_FAILED, when the scheduler checks the dependency, afterany evaluates true regardless of the completing flag, and the dependent starts. We have fixed this today in commit 992229d1a57d.

2) The slurmd epilog finishes first (slurmctld epilog still running). The job is considered completed and the completing flag is cleared. The scheduler starts the dependent job, of course. The argument is that it should not, since the slurmctld epilog is still running and we are even waiting for it. We think we will have a fix by tomorrow.

These two fixes should give us the desired behavior. I think that afterok will help you for now. After you scancel it, you then have to modify the dependency to become afterany or afternotok.

David
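A toy Python model of the race described above (an illustration only, not Slurm source; it just encodes the rule that a failed job must not satisfy an afterany dependency while the completing flag is set or a requeue request is pending):

```python
# Toy model of the completing-flag race between the slurmctld epilog and
# the slurmd epilog. NOT Slurm code -- it only encodes the fixed rule:
# a failed job should not satisfy an afterany dependency while it is
# still completing or has a pending (remembered) requeue request.

class Job:
    def __init__(self):
        self.state = "JOB_FAILED"     # exited with code != 0
        self.completing = True        # set when slurmstepd reports exit status
        self.requeue_pending = False

    def requeue_request(self):
        if self.completing:
            self.requeue_pending = True   # remember it; don't requeue mid-epilog
        else:
            self.state = "JOB_PENDING"    # requeued immediately

    def slurmd_epilog_done(self):
        self.completing = False
        if self.requeue_pending:          # act on the remembered request now
            self.state = "JOB_PENDING"
            self.requeue_pending = False

def afterany_satisfied(job):
    # Fixed behaviour: a terminal state alone is not enough; the job must
    # also be done completing and have no pending requeue.
    return (job.state == "JOB_FAILED"
            and not job.completing and not job.requeue_pending)

job = Job()
job.requeue_request()                 # case 1: requeue arrives while completing
assert not afterany_satisfied(job)    # dependent must NOT start yet
job.slurmd_epilog_done()              # epilog ends -> job is actually requeued
assert job.state == "JOB_PENDING"
assert not afterany_satisfied(job)    # pending again, so dependency unsatisfied
```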
The commit to clear the completing flag when the last epilog runs is 351900d44c597. However, there are some other fixes that followed which you should have, especially a9c1c8e5f50f, as mentioned in bug 866.

David
Fixed. David