| Summary: | Job losing its dependencies, possibly after a slurmctld restart | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Phil Schwan <phils> |
| Component: | slurmctld | Assignee: | David Bigagli <david> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | CC: | da, stuartm |
| Version: | 14.03.2 | Version Fixed: | 14.03.4 |
| Hardware: | Linux | OS: | Linux |
| Site: | DownUnder GeoSolutions | | |
| Attachments: | attachment-12028-0.html, attachment-23591-0.html | | |
Description
Phil Schwan
2014-06-03 23:07:38 MDT
This may be relevant... The job record contains three dependency fields:

    List depend_list;       /* list of job_ptr:state pairs */
    char *dependency;       /* wait for other jobs */
    char *orig_dependency;  /* original value (for archiving) */

Once a dependency is satisfied, it gets removed from the depend_list and dependency fields. If a job gets requeued, its dependency should probably be restored from orig_dependency, but that needs to be checked. I'm not sure what should happen if a job that some other job is dependent upon gets requeued (i.e. the dependency was once satisfied, but no longer is when considering the requeued job's new state).

Can you tell me what the original dependency looked like? If it was "afterany:2534565", then the dependency on job 2534566 (aka 2534565_1917) would be satisfied when that job completed unsuccessfully, even though it was requeued. If you want the dependency to be satisfied only when the job completes successfully, the dependency should be "afterok:2534565".

We use afterany. It was my understanding that this triggered when jobs left the queue, which doesn't happen in our environment (due to use of SE) when jobs error.

It means "after any exit code"; the job exits before being requeued.

According to the sbatch documentation:
afterany:job_id[:jobid...]
This job can begin execution after the specified jobs have terminated.
My testing when we started indicated that afterany only occurred when the job left the queue (i.e. squeue no longer showed it).
If the job went from CG to SE then it worked as we expected (as the job hasn't terminated).
The afterok did not work for us. If the job exited non-zero (went into SE) and we then scancel'ed it, the dependent jobs did not run... which sort of makes sense given its documentation (i.e. requiring an exit code of 0).
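As a sketch of the distinction discussed above: the dependency spec is just a colon-separated string passed to sbatch's `--dependency` option. The job IDs and script names below are hypothetical, and only the string-building part actually executes; the sbatch invocations are shown as comments since they require a live cluster.

```shell
#!/bin/sh
# Build an "afterany" dependency spec for two (made-up) parent job IDs.
# The resulting string is what sbatch's --dependency option expects.
deps="afterany"
for jid in 2534565 2534566; do
  deps="${deps}:${jid}"
done
echo "$deps"    # afterany:2534565:2534566

# On a real cluster you would then submit the dependent job, e.g.:
#   sbatch --dependency="$deps" child.sh            # after parents terminate, any exit code
#   sbatch --dependency="afterok:2534565" child.sh  # only if that parent exits 0
```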
The job doesn't go from CG to SE; it goes from CG to completed with exit code either 0 or != 0, and then it gets requeued by your script. Indeed, there is a very small window in which squeue won't show it unless -t all is specified. This works as designed. What exactly is the behaviour you want?

Created attachment 900 [details]
attachment-23591-0.html
What we want is for jobs that have dependencies which end up in SE state to still depend on them, however they get to SE state. We also need an scancel of a job to satisfy the dependency. I guess we want an "afterleftqueue".

> The job doesn't go from CG to SE; it goes from CG to completed with exit code
> either 0 or != 0, and then it gets requeued by your script.
Really? Really really?
I thought the epilog script ran while the job was still in CG, and thus it moved directly from CG to SE?
If there's an intermediate "completed" state, then it sounds like the race condition is worse than we feared.
Really! Really! Really! :-)

Indeed, there is a race between the slurmctld epilog and the slurmd epilog. The job's completing flag is set when the controller receives the job exit status from the slurmstepd; at the same time it sets the job state to JOB_FAILED (I am considering exit code != 0). The completing flag is cleared and the job is requeued *only* after the slurmd epilog completes.

Consider two cases where the job has the completing flag set and is in JOB_FAILED state, since it exited with code != 0:

1) The slurmctld epilog finishes first (slurmd epilog still running). The scontrol requeue came through, but since the job is in the completing transition state we don't requeue it; we just remember the requeue request. However, since the job state is JOB_FAILED, when the scheduler checks the dependency, afterany evaluates true regardless of the completing flag, and the dependent job starts. We have fixed this today in commit 992229d1a57d.

2) The slurmd epilog finishes first (slurmctld epilog still running). The job is considered completed and the completing flag is cleared. The scheduler starts the dependent job, of course. The argument is that it should not, since the slurmctld epilog is still running and we are even waiting for it. We think we will have a fix by tomorrow.

These two fixes should give us the desired behavior.

I think that afterok will help you for now. After you scancel the job, you then have to modify the dependency to become afterany or afternotok.

David

The commit to clear the completing flag when the last epilog runs is 351900d44c597; however, there are some other fixes that followed that you should have, especially a9c1c8e5f50f, as mentioned in bug 866.

David

Fixed.

David
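A minimal sketch of the interim workaround suggested above: submit the dependent job with afterok, and if the parent is later scancel'ed, rewrite the still-pending dependent's dependency with scontrol update. The job IDs here are invented for illustration, and the block only constructs and prints the command rather than talking to a real slurmctld.

```shell
#!/bin/sh
# Hypothetical job IDs for illustration only.
parent=2534565
dependent=2534570
# scontrol can edit a pending job's dependency in place. After the
# parent has been scancel'ed, afterok can never be satisfied, so switch
# the dependent to afternotok (or afterany) to let it run.
cmd="scontrol update JobId=${dependent} Dependency=afternotok:${parent}"
echo "$cmd"
# On a live cluster you would then execute it:
#   eval "$cmd"
```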