| Summary: | Partition based preemption works incorrectly | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Sergey Meirovich <sergey_meirovich> |
| Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | bart |
| Version: | 15.08.12 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | AMAT | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | 17.02.8 | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: |
slurm.conf
patch 15.08
Updated patch with memory leak fixed
Diagnostic info for comment #20
All the diagnostic data for comment #25
All the diagnostic data for comment #33
patch 15.08 |
||
> Why did Slurm preempt the low partition job?
I can confirm and reproduce this behavior, although I can't exactly explain it at the moment.
This does appear to be a side effect of the current implementation - multiple levels of preemption weren't originally anticipated.
Oddly enough, if you flip the order of the second two jobs around they will both run.
My test system is set up with:
PartitionName=p1 Nodes=node001 Priority=1 PreemptMode=SUSPEND
PartitionName=p2 Nodes=node001 Priority=2 PreemptMode=SUSPEND
PartitionName=p3 Nodes=node001 Priority=3
node001 has 8 CPUs available. Reproducing what you're seeing:
tim@zoidberg:~$ sbatch -p p1 --wrap "sleep 100" -n 8
Submitted batch job 38922
tim@zoidberg:~$ sbatch -p p2 --wrap "sleep 100" -n 4
Submitted batch job 38923
tim@zoidberg:~$ sbatch -p p3 --wrap "sleep 100" -n 4
Submitted batch job 38924
tim@zoidberg:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
38924 p3 wrap tim R 0:00 1 node001
38922 p1 wrap tim S 0:06 1 node001
38923 p2 wrap tim S 0:03 1 node001
But if I flip the order of jobs submitted to p2 and p3:
tim@zoidberg:~$ sbatch -p p1 --wrap "sleep 100" -n 8
Submitted batch job 38925
tim@zoidberg:~$ sbatch -p p3 --wrap "sleep 100" -n 4
Submitted batch job 38926
tim@zoidberg:~$ sbatch -p p2 --wrap "sleep 100" -n 4
Submitted batch job 38927
tim@zoidberg:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
38926 p3 wrap tim R 0:01 1 node001
38927 p2 wrap tim R 0:01 1 node001
38925 p1 wrap tim S 0:04 1 node001
then all the slots on the node are filled up.
I'm going to see if Dominik can track this down, although it may take some time to address.
Hi, I have found the problem in the source code; now I am working on it. Dominik

Hi, this is more of a side effect of our preemption model/algorithm than a bug. We are working on tuning this, but it is not simple, so we need more time. Could you drop the severity to 3? Dominik

Lowered to Sev-3 as requested.

Even if this is a side effect of the preemption model/algorithm, it still leads to inefficient utilization of cluster resources, and we see clear benefits from fixing it. Is there any approximate ETA on this?

Hi, I think that in the next few days I will provide a solution. Dominik

Hi, we have improved the selection of jobs to preempt when there are multiple partitions. This commit is only in 17.02:
https://github.com/SchedMD/slurm/commit/47b5fe608b7a8ab58b416a3218c8644b7e67da09
I will provide a patch for 15.08, but I recommend updating to the current version. Dominik

Created attachment 4735 [details]
patch 15.08
Hi, in the patch from comment 15 we found a non-critical memory leak. The new version does not have this issue. Dominik

Thanks Dominik! Appreciate this.

Hi, I am marking this as resolved, but feel free to reopen if any problem occurs. Dominik

Hi, I have sad news: I am reproducing behavior similar to what we had. Submitting a 72-core (with -N2) job, #308921, into the license partition (Priority=5) preempts a normal partition (Priority=3) job, while there were enough low partition (Priority=2) jobs available for that.

Situation after #308921 submission
==================================
-bash-4.1$ squeue -w 'dcalph[001-002]'
JOBID  USER    ST PARTITION NAME       COMMAND    SUBMIT_TIME  CPUS NODES NODELIST(REASON)
308921 e154466 R  license   fluent     /tmp/tmp.F Aug 2 16:08  72   2     dcalph[001-002]
308499 e153547 S  normal    VASP 5.4.1 /tmp/tmp.q Aug 2 8:34   72   2     dcalph[001-002]
307973 e157618 S  low       VASP 5.4.1 /tmp/tmp.g Aug 1 10:25  36   1     dcalph001
307640 e157618 S  low       VASP 5.4.1 /tmp/tmp.f Jul 31 18:59 36   1     dcalph002
-bash-4.1$

Example of the low partition jobs which were better candidates for preemption
=============================================================================
308529 e157618 R  low       VASP 5.4.1 /tmp/tmp.w Aug 2 9:11   36   1     dcalph008
308015 e157618 R  low       VASP 5.4.1 /tmp/tmp.z Aug 1 11:11  36   1     dcalph009

Submission itself and diagnostic info collection around that
============================================================
-bash-4.1$ fluentslurm -x"-C E5-2699v3 -N2" -n 72 -j Jun27C.jou -v 3ddp -f 171 -p license
Submitted batch job 308921
-bash-4.1$ sinfo > /tmp/sinfo.2; squeue > /tmp/squeue.2; scontrol show -dd jobid > /tmp/scontrol-show-dd-jobid.2
-bash-4.1$ diff -u /tmp/squeue.1 /tmp/squeue.2
--- /tmp/squeue.1 2017-08-02 16:08:30.194243398 -0700
+++ /tmp/squeue.2 2017-08-02 16:09:02.998294563 -0700
@@ -25,9 +25,10 @@
 304040 x071102 R license  cfdace 14. /tmp/tmp.g Jul 25 23:14 10 1 dcalph036
 289556 e111472 R interact VNC /user/e111 Jun 20 15:21 1 1 dcalph075
 240545 e121045 R interact VNC /user/e121 Mar 29 23:25 1 1 dcalph075
+308921 e154466 R license  fluent /tmp/tmp.F Aug 2 16:08 72 2 dcalph[001-002]
 216154 e154466 R interact VNC /user/e154 Mar 15 10:06 1 1 dcalph075
 302350 e61958  S open     tri3dynl ./tri3dynl Jul 24 10:16 1 1 dcalph010
-308499 e153547 R normal   VASP 5.4.1 /tmp/tmp.q Aug 2 8:34 72 2 dcalph[001-002]
+308499 e153547 S normal   VASP 5.4.1 /tmp/tmp.q Aug 2 8:34 72 2 dcalph[001-002]
 304614 e153547 R normal   run.sh /dat/usr/e Jul 26 17:02 32 1 dcalph010
 303933 e153547 R normal   run.sh /dat/usr/e Jul 25 19:45 36 1 dcalph019
 306467 e153547 R normal   run.sh /dat/usr/e Jul 29 18:25 32 1 dcalph020
@@ -110,6 +111,7 @@
 308901 e158714 PD open radicals/c /tmp/tmp.Q Aug 2 15:35 36 1 (Priority)
 308903 e158714 PD open radicals/c /tmp/tmp.0 Aug 2 15:39 36 1 (Priority)
 308904 e158714 PD open radicals/c /tmp/tmp.v Aug 2 15:40 36 1 (Priority)
+308919 e158714 PD open radicals/c /tmp/tmp.j Aug 2 16:08 36 1 (Priority)
 308587 e158714 R normal radicals/c /tmp/tmp.T Aug 2 10:03 36 1 dcalph003
 308485 e158714 S open entropy/su /tmp/tmp.i Aug 2 8:22 36 1 dcalph033
 308630 e158714 R open radicals/c /tmp/tmp.a Aug 2 11:17 36 1 dcalph037
-bash-4.1$

I am going to upload all diagnostic info shortly.

Created attachment 5018 [details]
Diagnostic info for comment #20

Diagnostic info for comment #20 has been uploaded, including sinfo/squeue/scontrol show -dd jobid before and after the submission, as well as slurm.conf and the slurmctld log with debug level 7.

Sergey,

When searching for a job to preempt, the controller makes a list of job candidates that is ordered by priority. It then orders the list again by size to minimize the number of preempted jobs, so the resulting list is a compromise between priority and jobs with a similar node count. This could be why a job from the normal partition got preempted rather than jobs from the low partition.

You can turn off this second ordering by setting the following in slurm.conf:

SchedulerParameters=preempt_strict_order

This will keep the list of job candidates ordered strictly by job priority.
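The two-pass ordering described above can be sketched as a toy model. This is an illustrative sketch only, not slurmctld's actual code; the job IDs and node counts are loosely taken from the squeue output above.

```python
from dataclasses import dataclass

@dataclass
class Job:
    jobid: int
    prio: int    # partition priority: lower values are preempted first
    nodes: int   # node count of the candidate job

def order_candidates(jobs, wanted_nodes, strict_order=False):
    # First pass: order candidates by partition priority (lowest first).
    cand = sorted(jobs, key=lambda j: j.prio)
    if strict_order:
        # preempt_strict_order stops here, keeping pure priority order.
        return cand
    # Second pass: reorder by how close each job's size is to the
    # incoming job's size, to minimize the number of preempted jobs.
    return sorted(cand, key=lambda j: abs(j.nodes - wanted_nodes))

# Candidates loosely modeled on the squeue output above:
jobs = [Job(307973, prio=2, nodes=1),   # low
        Job(307640, prio=2, nodes=1),   # low
        Job(308499, prio=3, nodes=2)]   # normal

# For a 2-node incoming job, the size pass puts the 2-node "normal" job
# ahead of the two 1-node "low" jobs, despite its higher priority.
print([j.jobid for j in order_candidates(jobs, wanted_nodes=2)])
# With strict ordering, the "low" jobs stay first.
print([j.jobid for j in order_candidates(jobs, wanted_nodes=2, strict_order=True)])
```

In this toy model the size-based second pass is exactly what moves the 72-core normal job to the head of the victim list.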
Try this setting to see if it gives you the desired behavior, and let me know how it goes. Regards, Tim

Unfortunately, SchedulerParameters=preempt_strict_order does not help us. An example is job #309580.

Before the submission of 309580:
================================
[e154466@DCALPH000 slowdown]$ squeue -wdcalph[045-046]
JOBID  USER    ST PARTITION NAME               COMMAND             SUBMIT_TIME  CPUS NODES NODELIST(REASON)
309305 e157618 R  low       VASP 5.4.1.05Feb16 /tmp/tmp.1Ych8TT57C Aug 3 8:31   36   1     dcalph045
309304 e157618 R  normal    VASP 5.4.4-vtst    /tmp/tmp.txlFOSDvOu Aug 3 8:30   72   2     dcalph[046-047]
307280 e157618 S  low       VASP 5.4.1.05Feb16 /tmp/tmp.uKZQDQfAO7 Jul 31 10:28 36   1     dcalph046
[e154466@DCALPH000 slowdown]$

Example of the low partition jobs which were better candidates for preemption
=============================================================================
307638 e157618 R  low       VASP 5.4.1.05Feb16 /tmp/tmp.GzBIN35FkA Jul 31 18:56 36   1     dcalph003
307982 e157618 R  low       VASP 5.4.1.05Feb16 /tmp/tmp.tx7evYaGLL Aug 1 10:34  36   1     dcalph005

Submission itself and diagnostic info collection around that
============================================================
[e154466@DCALPH000 slowdown]$ scontrol show config > /tmp/scontrol-show-config
[e154466@DCALPH000 slowdown]$ sinfo > /tmp/sinfo.1; squeue > /tmp/squeue.1; scontrol show -dd jobid > /tmp/scontrol-show-dd-jobid.1
[e154466@DCALPH000 slowdown]$ fluentslurm -x"-C E5-2699v3 -N2" -n 72 -j Jun27C.jou -v 3ddp -f 171 -p license
Submitted batch job 309580
[e154466@DCALPH000 slowdown]$ sinfo > /tmp/sinfo.2; squeue > /tmp/squeue.2; scontrol show -dd jobid > /tmp/scontrol-show-dd-jobid.2

#309580 in fact suspended the normal job:
=========================================
[e154466@DCALPH000 slowdown]$ squeue -wdcalph[045-046]
JOBID  USER    ST PARTITION NAME               COMMAND             SUBMIT_TIME  CPUS NODES NODELIST(REASON)
309580 e154466 R  license   fluent             /tmp/tmp.8XEz5kmz9K Aug 3 16:43  72   2     dcalph[045-046]
309305 e157618 S  low       VASP 5.4.1.05Feb16 /tmp/tmp.1Ych8TT57C Aug 3 8:31   36   1     dcalph045
309304 e157618 S  normal    VASP 5.4.4-vtst    /tmp/tmp.txlFOSDvOu Aug 3 8:30   72   2     dcalph[046-047]
307280 e157618 S  low       VASP 5.4.1.05Feb16 /tmp/tmp.uKZQDQfAO7 Jul 31 10:28 36   1     dcalph046
[e154466@DCALPH000 slowdown]$

Will upload all the diagnostic info shortly.

Created attachment 5021 [details]
All the diagnostic data for comment #25

All the data from comment #25, as well as slurm.conf and the slurmctld log.

Hi, could you send me slurmctld.log with the SelectType debug flag enabled?
Turn on:  scontrol setdebugflags +SelectType
Turn off: scontrol setdebugflags -SelectType
Dominik

The cluster is so full at the moment that it is hard to reproduce the issue. Give us about a week, please.

Sergey - any luck getting logs for this?

Hi Tim, give us a bit more time.

Here goes a reproduction with "scontrol setdebugflags +SelectType":
==============================================================================
-bash-4.1$ scontrol show config > /tmp/scontrol-show-config
-bash-4.1$ sinfo > /tmp/sinfo.1; squeue > /tmp/squeue.1; scontrol show -dd jobid > /tmp/scontrol-show-dd-jobid.1
-bash-4.1$ fluentslurm -x"-N2" -n 72 -j Jun27C.jou -v 3ddp -f 171 -p license
Submitted batch job 353018
-bash-4.1$ sinfo > /tmp/sinfo.2; squeue > /tmp/squeue.2; scontrol show -dd jobid > /tmp/scontrol-show-dd-jobid.2
-bash-4.1$ squeue -w 'dcalph[001-002]'
JOBID  USER    ST PARTITION NAME       COMMAND    SUBMIT_TIME  CPUS NODES NODELIST(REASON)
353018 e154466 R  license   fluent     /tmp/tmp.4 Aug 27 14:42 72   2     dcalph[001-002]
342903 e154414 S  normal    VASP 5.4.1 /tmp/tmp.y Aug 26 6:16  180  5     dcalph[001,026-027,034-035]
352710 e157618 S  low       VASP 5.4.1 /tmp/tmp.b Aug 27 13:13 36   1     dcalph002
-bash-4.1$ squeue | grep 'R' | grep low
352978 e154414 R  low       8I11S4-USP /dat/usr/e Aug 27 14:15 12   1     dcalph017
343048 e154414 R  low       USPEX (mas /tmp/tmp.f Aug 26 10:43 1    1     dcalph075
352680 e157618 R  low       VASP 5.4.1 /tmp/tmp.G Aug 27 13:08 36   1     dcalph020
352610 e157618 R  low       VASP 5.4.1 /tmp/tmp.6 Aug 27 12:36 36   1     dcalph024
342310 e157618 R  low       VASP 5.4.1 /tmp/tmp.f Aug 25 16:09 36   1     dcalph030
352623 e157618 R  low       VASP 5.4.1 /tmp/tmp.9 Aug 27 12:38 36   1     dcalph050
352686 e157618 R  low       VASP 5.4.1 /tmp/tmp.e Aug 27 13:12 36   1     dcalph071
352676 e157618 R  low       VASP 5.4.1 /tmp/tmp.y Aug 27 13:06 36   1     dcalph074
-bash-4.1$
==============================================================================

Will upload the log and diagnostic info shortly.

Created attachment 5157 [details]
All the diagnostic data for comment #33

Hi, we can recreate this issue. We have prepared a patch which improves the node selection logic, but we need more time and tests before we commit it to git. If you want, I can give you a 15.08 version now. Dominik

Created attachment 5195 [details]
patch 15.08
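The mention above of improving the node selection logic suggests one reading of why preempt_strict_order alone did not help: preemption victims are limited to jobs resident on the nodes actually chosen for the incoming job. The following is a toy illustration of that idea only, not slurmctld code; the node contents are a hypothetical simplification of the squeue output earlier in this report.

```python
# Hypothetical node map: each node lists its resident (partition, priority).
NODES = {
    "dcalph001": [("normal", 3)],
    "dcalph002": [("low", 2)],
    "dcalph008": [("low", 2)],
    "dcalph009": [("low", 2)],
}

def pick_nodes(n, priority_aware):
    # Baseline: pick nodes in name order, ignoring what runs on them.
    names = sorted(NODES)
    if priority_aware:
        # Prefer nodes whose resident jobs have the lowest priority,
        # so preemption falls on the cheapest victims.
        names.sort(key=lambda x: max(p for _, p in NODES[x]))
    return names[:n]

# A naive pick lands on the node hosting the "normal" job:
print(pick_nodes(2, priority_aware=False))  # ['dcalph001', 'dcalph002']
# A priority-aware pick chooses nodes holding only "low" jobs:
print(pick_nodes(2, priority_aware=True))   # ['dcalph002', 'dcalph008']
```

Under this toy model, strictly ordering the victim list does nothing if node selection has already pinned the job to nodes occupied by a higher-priority victim.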
Hi Dominik,

That is great news, thanks! About 10 minutes before your last comment we had already upgraded to Slurm 17.02.7. Does that issue affect 17.02.7 as well?

Hi, I am afraid it does. I can give you a preliminary patch, or you can wait until it is in the 17.02 branch. Dominik

We could certainly wait until the fix lands in the 17.02 branch. Thanks!

I see the following commit:

commit 0f501359c635801b08cbc2b5e61164284f2610b7
Author: Dominik Bartkiewicz <bart@schedmd.com>
Date: Thu Sep 7 14:46:45 2017 -0600

    Optimization enhancements for partition based job preemption
    bug 3824

Do you plan to commit anything related to #3824 on top of this before the 17.02.8 release? I understand that any change might bring unexpected side effects or might not fully achieve what it was expected to achieve. But if, in the meantime, you do not think anything else is going to be committed on top of that, I am willing to try it out.

Hi, if it is working as you expect, we are not planning any modification of this algorithm. This commit contains all the necessary changes, and it can be applied separately. Dominik

Hi, I am marking this as resolved, but feel free to reopen if any problem occurs. Dominik |
Created attachment 4592 [details]
slurm.conf

Hi,

-bash-4.1$ scontrol show node=dcalph090
NodeName=dcalph090 Arch=x86_64 CoresPerSocket=8 CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.02
   Features=E5-2680,64G,cae Gres=(null)
   NodeAddr=dcalph090 NodeHostName=dcalph090 Version=(null)
   OS=Linux RealMemory=64386 AllocMem=0 FreeMem=53661 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=1951 Weight=1 Owner=N/A
   BootTime=Mar 5 10:41 SlurmdStartTime=Mar 5 10:46
   CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
-bash-4.1$ sbatch -p open -n 16 -w dcalph090 --wrap="sleep 1000"
Submitted batch job 279893
-bash-4.1$ sbatch -p low -n 8 -w dcalph090 --wrap="sleep 1000"
Submitted batch job 279894
-bash-4.1$ squeue -w dcalph090
JOBID  USER    ST PARTITION NAME COMMAND SUBMIT_TIME  CPUS NODES NODELIST(REASON)
279894 e154466 R  low       wrap (null)  May 18 18:12 8    1     dcalph090
279893 e154466 S  open      wrap (null)  May 18 18:12 16   1     dcalph090
-bash-4.1$ sbatch -p normal -n 8 -w dcalph090 --wrap="sleep 1000"
Submitted batch job 279895
-bash-4.1$ squeue -w dcalph090
JOBID  USER    ST PARTITION NAME COMMAND SUBMIT_TIME  CPUS NODES NODELIST(REASON)
279895 e154466 R  normal    wrap (null)  May 18 18:12 8    1     dcalph090
279894 e154466 S  low       wrap (null)  May 18 18:12 8    1     dcalph090
279893 e154466 S  open      wrap (null)  May 18 18:12 16   1     dcalph090
-bash-4.1$

Why did Slurm preempt the low partition job?