Summary: | Broken afterok dependency when state is OUT_OF_MEMORY | |
---|---|---|---
Product: | Slurm | Reporter: | Stephane Thiell <sthiell> |
Component: | Other | Assignee: | Alejandro Sanchez <alex> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 3 - Medium Impact | |
Priority: | --- | CC: | alex, kaizaad, kilian, sthiell |
Version: | 17.11.0 | |
Hardware: | Linux | |
OS: | Linux | |
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=4590 | |
Site: | Stanford | |
Ticket Depends on: | 3820 | Ticket Blocks: |
Description
Stephane Thiell 2018-01-12 15:37:34 MST
Alejandro Sanchez

Hi Stephane. We are internally reviewing a patch for bug 3820 so that the job state will only change to OUT_OF_MEMORY if the oom-killer actually killed a task. That would avoid situations where pages were reclaimed by the kernel and the process managed to succeed, but the job state still got marked OOM. I am going to mark this bug as dependent on bug 3820 for now; then we will address the issue here. Taking a quick look at test_job_array_completed(), it doesn't consider the state OUT_OF_MEMORY (which has ExitCode 0:125), and I can see derived problems in the test_job_dependency() logic in src/slurmctld/job_scheduler.c. We will study the situation further and come back to you, but most probably we will need to solve bug 3820 beforehand. Thanks for your understanding.

Stephane Thiell

Hi Alejandro,

Thanks much! This is impacting several jobs and multiple users have reported the issue. I'll closely follow bug 3820 too. I do hope you'll find a solution and provide a patch soon.

Best regards,
Stephane

Alejandro Sanchez

Stephane, after doing some more tests today and discussing this internally, we think Slurm is behaving as expected with regard to the 'afterok' dependency type. Let me elaborate with a few examples and some notes.

The following example satisfies the 'afterok' dependency:

$ sbatch --wrap "sleep 20"
Submitted batch job 20012
$ squeue
  JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
  20012        p1  wrap  alex  R  0:01     1 compute1
$ sbatch -d afterok:20012 --wrap "sleep 88888"
Submitted batch job 20013
$ squeue
  JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
  20013        p1  wrap  alex PD  0:00     1 (Dependency)
  20012        p1  wrap  alex  R  0:10     1 compute1
$ squeue   # (eventually, after 20012 finishes)
  JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
  20013        p1  wrap  alex  R  0:04     1 compute1
$ sacct -j 20012
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
20012              wrap         p1      acct1          2  COMPLETED      0:0
20012.batch       batch                 acct1          2  COMPLETED      0:0

As you can see, 20012 finished with ExitCode 0:0 and state COMPLETED. All of these states (from slurm.h) should be considered failed with respect to dependencies:

	JOB_CANCELLED,	/* cancelled by user */
	JOB_FAILED,	/* completed execution unsuccessfully */
	JOB_TIMEOUT,	/* terminated on reaching time limit */
	JOB_NODE_FAIL,	/* terminated on node failure */
	JOB_PREEMPTED,	/* terminated due to preemption */
	JOB_BOOT_FAIL,	/* terminated due to node boot failure */
	JOB_DEADLINE,	/* terminated on deadline */
	JOB_OOM,	/* experienced out of memory error */

All of them indicate the job did not run to completion with exit code 0. So for instance a CANCELLED job looks like this in sacct:

20010              wrap         p1      acct1          2 CANCELLED+      0:0
20010.batch       batch                 acct1          2  CANCELLED     0:15

and an OOM one like this:

20014         mem_eater         p1      acct1          2 OUT_OF_ME+    0:125

and since they don't have ExitCode 0:0 and state COMPLETED, they will never satisfy an 'afterok' dependency. You can view the logic in the test_job_dependency() function in src/slurmctld/job_scheduler.c, around this spot:

		} else if (dep_ptr->depend_type == SLURM_DEPEND_AFTER_OK) {
			if (!IS_JOB_COMPLETED(djob_ptr))
				depends = true;
			else if (IS_JOB_COMPLETE(djob_ptr))
				clear_dep = true;
			else {
				failure = true;
				break;
			}

The IS_JOB_COMPLETE macro is defined like this in src/common/slurm_protocol_defs.h:

#define IS_JOB_COMPLETE(_X)	\
	((_X->job_state & JOB_STATE_BASE) == JOB_COMPLETE)

Thus, in terms of dependencies, a job will only satisfy 'afterok' if it finished with state JOB_COMPLETE; other finished states won't.
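As a practical aside (not part of the original exchange): since a job that ends OUT_OF_MEMORY will never satisfy 'afterok', a follow-up job that must run regardless of the parent's outcome is better expressed with 'afterany', and one that should run only on failure with 'afternotok'. A minimal sketch; the mem_eater.sh script and job IDs are hypothetical, and it assumes the scheduler is not configured to kill jobs whose dependencies can never be satisfied:

$ sbatch --mem=100M mem_eater.sh                    # expected to end in state OUT_OF_MEMORY (ExitCode 0:125)
Submitted batch job 20020
$ sbatch -d afterany:20020 --wrap "echo cleanup"    # released once 20020 terminates, whatever its final state
$ sbatch -d afternotok:20020 --wrap "echo retry"    # released only if 20020 ends in a failed state (e.g. OUT_OF_MEMORY)
$ sbatch -d afterok:20020 --wrap "echo next"        # never released if 20020 ends OUT_OF_MEMORY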
Alejandro Sanchez

Now, regarding those jobs whose spawned step tasks' memory usage hit the limit but that aren't oom-killed and instead manage to finish successfully: they won't be marked as JOB_OOM anymore after the patch prepared for bug 3820. Just FYI, it was a quite involved patch, but things seem to be working as expected now and we are just waiting for another team member to decide which version(s) to check it into. I think after this explanation we can proceed and mark this one as resolved/infogiven, unless you have any more questions.

Stephane Thiell

Hi Alejandro,

Thank you for the thorough explanation! Indeed, this behavior makes sense to me if bug 3820 is finally fixed.

Thanks!
Stephane

Alejandro Sanchez

(In reply to Stephane Thiell from comment #6)
> Hi Alejandro,
>
> Thank you for the thorough explanation! Indeed, this behavior makes sense to
> me if bug 3820 is finally fixed.
>
> Thanks!
> Stephane

All right, closing this bug then, since I finally fixed bug 3820 too.
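For reference (an addition, not from the original ticket): a pending job whose 'afterok' dependency can no longer be satisfied is typically held with reason DependencyNeverSatisfied, and its dependency can be rewritten by hand with scontrol so the job can still run. A minimal sketch with hypothetical job IDs; exact reason strings and update behavior may vary by Slurm version:

$ squeue -j 20021 -o "%i %t %r"
JOBID ST REASON
20021 PD DependencyNeverSatisfied
$ scontrol update JobId=20021 Dependency=afterany:20020    # relax the dependency so the job can be released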