| Summary: | Partition based preemption works incorrectly | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Sergey Meirovich <sergey_meirovich> |
| Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | bart |
| Version: | 15.08.12 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | AMAT | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | 17.02.8 | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: |
slurm.conf
patch 15.08
Updated patch with memory leak fixed
Diagnostic info for comment #20
All the diagnostic data for comment #25
All the diagnostic data for comment #33
patch 15.08 |
||
> Why did Slurm preempt the low partition job?
I can confirm and reproduce this behavior, although I can't exactly explain it at the moment.
This does appear to be a side effect of the current implementation - multiple levels of preemption weren't originally anticipated.
Oddly enough, if you flip the order of the second two jobs around they will both run.
My test system is set up with:
PartitionName=p1 Nodes=node001 Priority=1 PreemptMode=SUSPEND
PartitionName=p2 Nodes=node001 Priority=2 PreemptMode=SUSPEND
PartitionName=p3 Nodes=node001 Priority=3
node001 has 8 CPUs available. Reproducing what you're seeing:
tim@zoidberg:~$ sbatch -p p1 --wrap "sleep 100" -n 8
Submitted batch job 38922
tim@zoidberg:~$ sbatch -p p2 --wrap "sleep 100" -n 4
Submitted batch job 38923
tim@zoidberg:~$ sbatch -p p3 --wrap "sleep 100" -n 4
Submitted batch job 38924
tim@zoidberg:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
38924 p3 wrap tim R 0:00 1 node001
38922 p1 wrap tim S 0:06 1 node001
38923 p2 wrap tim S 0:03 1 node001
But if I flip the order of jobs submitted to p2 and p3:
tim@zoidberg:~$ sbatch -p p1 --wrap "sleep 100" -n 8
Submitted batch job 38925
tim@zoidberg:~$ sbatch -p p3 --wrap "sleep 100" -n 4
Submitted batch job 38926
tim@zoidberg:~$ sbatch -p p2 --wrap "sleep 100" -n 4
Submitted batch job 38927
tim@zoidberg:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
38926 p3 wrap tim R 0:01 1 node001
38927 p2 wrap tim R 0:01 1 node001
38925 p1 wrap tim S 0:04 1 node001
then all the slots on the node are filled up.
I'm going to see if Dominik can track this down, although it may take some time to address.
Hi, I have found the problem in the source code; now I am working on it. Dominik

Hi, this is more of a side effect of our preemption model/algorithm than a bug. We are working on tuning this, but it is not simple, so we need more time. Could you drop the severity to 3? Dominik

Lowered to Sev-3 as requested.

Even if this is a side effect of the preemption model/algorithm, it still leads to inefficient utilization of cluster resources, and we see clear benefits from fixing it. Is there any approximate ETA on this?

Hi, I think that in the next few days I will provide a solution. Dominik

Hi, we have improved the selection of jobs to preempt when there are multiple partitions. This commit is only in 17.02:
https://github.com/SchedMD/slurm/commit/47b5fe608b7a8ab58b416a3218c8644b7e67da09
I will provide a patch for 15.08, but I recommend updating to the current version. Dominik

Created attachment 4735 [details]
patch 15.08
Hi, in the patch from comment 15 we found a non-critical memory leak. The new version does not have this issue. Dominik

Thanks Dominik! Appreciate this.

Hi, I am marking this as resolved, but feel free to reopen if any problem occurs. Dominik

Hi, I have sad news: I am reproducing behavior similar to what we had. Submitting a 72-core (with -N2) job, #308921, into the license partition (Priority=5) preempts a normal partition (Priority=3) job, while there were enough low partition (Priority=2) jobs available for that.

Situation after #308921 submission
==================================
-bash-4.1$ squeue -w 'dcalph[001-002]'
JOBID  USER    ST PARTITION NAME       COMMAND    SUBMIT_TIME  CPUS NODES NODELIST(REASON)
308921 e154466 R  license   fluent     /tmp/tmp.F Aug 2 16:08  72   2     dcalph[001-002]
308499 e153547 S  normal    VASP 5.4.1 /tmp/tmp.q Aug 2 8:34   72   2     dcalph[001-002]
307973 e157618 S  low       VASP 5.4.1 /tmp/tmp.g Aug 1 10:25  36   1     dcalph001
307640 e157618 S  low       VASP 5.4.1 /tmp/tmp.f Jul 31 18:59 36   1     dcalph002
-bash-4.1$

Example of the low partition jobs which were better candidates for preemption
=============================================================================
308529 e157618 R  low       VASP 5.4.1 /tmp/tmp.w Aug 2 9:11   36   1     dcalph008
308015 e157618 R  low       VASP 5.4.1 /tmp/tmp.z Aug 1 11:11  36   1     dcalph009

Submission itself and diagnostic info collection around that
============================================================
-bash-4.1$ fluentslurm -x"-C E5-2699v3 -N2" -n 72 -j Jun27C.jou -v 3ddp -f 171 -p license
Submitted batch job 308921
-bash-4.1$ sinfo > /tmp/sinfo.2; squeue > /tmp/squeue.2; scontrol show -dd jobid > /tmp/scontrol-show-dd-jobid.2
-bash-4.1$ diff -u /tmp/squeue.1 /tmp/squeue.2
--- /tmp/squeue.1 2017-08-02 16:08:30.194243398 -0700
+++ /tmp/squeue.2 2017-08-02 16:09:02.998294563 -0700
@@ -25,9 +25,10 @@
 304040 x071102 R license  cfdace 14. /tmp/tmp.g Jul 25 23:14 10 1 dcalph036
 289556 e111472 R interact VNC /user/e111 Jun 20 15:21 1 1 dcalph075
 240545 e121045 R interact VNC /user/e121 Mar 29 23:25 1 1 dcalph075
+308921 e154466 R license  fluent /tmp/tmp.F Aug 2 16:08 72 2 dcalph[001-002]
 216154 e154466 R interact VNC /user/e154 Mar 15 10:06 1 1 dcalph075
 302350 e61958  S open     tri3dynl ./tri3dynl Jul 24 10:16 1 1 dcalph010
-308499 e153547 R normal   VASP 5.4.1 /tmp/tmp.q Aug 2 8:34 72 2 dcalph[001-002]
+308499 e153547 S normal   VASP 5.4.1 /tmp/tmp.q Aug 2 8:34 72 2 dcalph[001-002]
 304614 e153547 R normal   run.sh /dat/usr/e Jul 26 17:02 32 1 dcalph010
 303933 e153547 R normal   run.sh /dat/usr/e Jul 25 19:45 36 1 dcalph019
 306467 e153547 R normal   run.sh /dat/usr/e Jul 29 18:25 32 1 dcalph020
@@ -110,6 +111,7 @@
 308901 e158714 PD open radicals/c /tmp/tmp.Q Aug 2 15:35 36 1 (Priority)
 308903 e158714 PD open radicals/c /tmp/tmp.0 Aug 2 15:39 36 1 (Priority)
 308904 e158714 PD open radicals/c /tmp/tmp.v Aug 2 15:40 36 1 (Priority)
+308919 e158714 PD open radicals/c /tmp/tmp.j Aug 2 16:08 36 1 (Priority)
 308587 e158714 R normal radicals/c /tmp/tmp.T Aug 2 10:03 36 1 dcalph003
 308485 e158714 S open entropy/su /tmp/tmp.i Aug 2 8:22 36 1 dcalph033
 308630 e158714 R open radicals/c /tmp/tmp.a Aug 2 11:17 36 1 dcalph037
-bash-4.1$

I am going to upload all diagnostic info shortly.

Created attachment 5018 [details]
Diagnostic info for comment #20

Diagnostic info for comment #20 has been uploaded, including sinfo/squeue/scontrol show -dd jobid before and after the submission, as well as slurm.conf and the slurmctld log with debug level 7.

Sergey,

When searching for a job to preempt, the controller makes a list of job candidates that is ordered by priority. It then orders the list again by size to minimize the number of preempted jobs, so the resulting list is a compromise between priority and jobs with a similar node count. This could be why a job from the normal partition got preempted rather than jobs from the low partition.

You can turn off this second ordering by setting the following in slurm.conf:

SchedulerParameters=preempt_strict_order

This will keep the list of job candidates ordered strictly by job priority.
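The two-pass ordering described above can be sketched as a toy model. This is an illustrative sketch only, not slurmctld's actual code; the job IDs and node counts are loosely taken from the squeue output above.

```python
from dataclasses import dataclass

@dataclass
class Job:
    jobid: int
    prio: int    # partition priority: lower values are preempted first
    nodes: int   # node count of the candidate job

def order_candidates(jobs, wanted_nodes, strict_order=False):
    # First pass: order candidates by partition priority (lowest first).
    cand = sorted(jobs, key=lambda j: j.prio)
    if strict_order:
        # preempt_strict_order stops here, keeping pure priority order.
        return cand
    # Second pass: reorder by how close each job's size is to the
    # incoming job's size, to minimize the number of preempted jobs.
    return sorted(cand, key=lambda j: abs(j.nodes - wanted_nodes))

# Candidates loosely modeled on the squeue output above:
jobs = [Job(307973, prio=2, nodes=1),   # low
        Job(307640, prio=2, nodes=1),   # low
        Job(308499, prio=3, nodes=2)]   # normal

# For a 2-node incoming job, the size pass puts the 2-node "normal" job
# ahead of the two 1-node "low" jobs, despite its higher priority.
print([j.jobid for j in order_candidates(jobs, wanted_nodes=2)])
# With strict ordering, the "low" jobs stay first.
print([j.jobid for j in order_candidates(jobs, wanted_nodes=2, strict_order=True)])
```

In this toy model the size-based second pass is exactly what moves the 72-core normal job to the head of the victim list.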
Try this setting to see if it gives you the desired behavior, and let me know how it goes. Regards, Tim

Unfortunately, SchedulerParameters=preempt_strict_order does not help us. An example is job #309580.

Before the submission of 309580:
================================
[e154466@DCALPH000 slowdown]$ squeue -wdcalph[045-046]
JOBID  USER    ST PARTITION NAME               COMMAND             SUBMIT_TIME  CPUS NODES NODELIST(REASON)
309305 e157618 R  low       VASP 5.4.1.05Feb16 /tmp/tmp.1Ych8TT57C Aug 3 8:31   36   1     dcalph045
309304 e157618 R  normal    VASP 5.4.4-vtst    /tmp/tmp.txlFOSDvOu Aug 3 8:30   72   2     dcalph[046-047]
307280 e157618 S  low       VASP 5.4.1.05Feb16 /tmp/tmp.uKZQDQfAO7 Jul 31 10:28 36   1     dcalph046
[e154466@DCALPH000 slowdown]$

Example of the low partition jobs which were better candidates for preemption
=============================================================================
307638 e157618 R  low       VASP 5.4.1.05Feb16 /tmp/tmp.GzBIN35FkA Jul 31 18:56 36   1     dcalph003
307982 e157618 R  low       VASP 5.4.1.05Feb16 /tmp/tmp.tx7evYaGLL Aug 1 10:34  36   1     dcalph005

Submission itself and diagnostic info collection around that
============================================================
[e154466@DCALPH000 slowdown]$ scontrol show config > /tmp/scontrol-show-config
[e154466@DCALPH000 slowdown]$ sinfo > /tmp/sinfo.1; squeue > /tmp/squeue.1; scontrol show -dd jobid > /tmp/scontrol-show-dd-jobid.1
[e154466@DCALPH000 slowdown]$ fluentslurm -x"-C E5-2699v3 -N2" -n 72 -j Jun27C.jou -v 3ddp -f 171 -p license
Submitted batch job 309580
[e154466@DCALPH000 slowdown]$ sinfo > /tmp/sinfo.2; squeue > /tmp/squeue.2; scontrol show -dd jobid > /tmp/scontrol-show-dd-jobid.2

#309580 in fact suspended the normal job:
=========================================
[e154466@DCALPH000 slowdown]$ squeue -wdcalph[045-046]
JOBID  USER    ST PARTITION NAME               COMMAND             SUBMIT_TIME  CPUS NODES NODELIST(REASON)
309580 e154466 R  license   fluent             /tmp/tmp.8XEz5kmz9K Aug 3 16:43  72   2     dcalph[045-046]
309305 e157618 S  low       VASP 5.4.1.05Feb16 /tmp/tmp.1Ych8TT57C Aug 3 8:31   36   1     dcalph045
309304 e157618 S  normal    VASP 5.4.4-vtst    /tmp/tmp.txlFOSDvOu Aug 3 8:30   72   2     dcalph[046-047]
307280 e157618 S  low       VASP 5.4.1.05Feb16 /tmp/tmp.uKZQDQfAO7 Jul 31 10:28 36   1     dcalph046
[e154466@DCALPH000 slowdown]$

Will upload all the diagnostic info shortly.

Created attachment 5021 [details]
All the diagnostic data for comment #25

All the data from comment #25, as well as slurm.conf and the slurmctld log.

Hi, could you send me slurmctld.log with the SelectType debug flag enabled?
Turn on:  scontrol setdebugflags +SelectType
Turn off: scontrol setdebugflags -SelectType
Dominik

The cluster is so full at the moment that it is hard to reproduce the issue. Give us about a week, please.

Sergey - any luck getting logs for this?

Hi Tim, give us a bit more time.

Here goes a reproduction with "scontrol setdebugflags +SelectType":
==============================================================================
-bash-4.1$ scontrol show config > /tmp/scontrol-show-config
-bash-4.1$ sinfo > /tmp/sinfo.1; squeue > /tmp/squeue.1; scontrol show -dd jobid > /tmp/scontrol-show-dd-jobid.1
-bash-4.1$ fluentslurm -x"-N2" -n 72 -j Jun27C.jou -v 3ddp -f 171 -p license
Submitted batch job 353018
-bash-4.1$ sinfo > /tmp/sinfo.2; squeue > /tmp/squeue.2; scontrol show -dd jobid > /tmp/scontrol-show-dd-jobid.2
-bash-4.1$ squeue -w 'dcalph[001-002]'
JOBID  USER    ST PARTITION NAME       COMMAND    SUBMIT_TIME  CPUS NODES NODELIST(REASON)
353018 e154466 R  license   fluent     /tmp/tmp.4 Aug 27 14:42 72   2     dcalph[001-002]
342903 e154414 S  normal    VASP 5.4.1 /tmp/tmp.y Aug 26 6:16  180  5     dcalph[001,026-027,034-035]
352710 e157618 S  low       VASP 5.4.1 /tmp/tmp.b Aug 27 13:13 36   1     dcalph002
-bash-4.1$ squeue | grep 'R' | grep low
352978 e154414 R  low       8I11S4-USP /dat/usr/e Aug 27 14:15 12   1     dcalph017
343048 e154414 R  low       USPEX (mas /tmp/tmp.f Aug 26 10:43 1    1     dcalph075
352680 e157618 R  low       VASP 5.4.1 /tmp/tmp.G Aug 27 13:08 36   1     dcalph020
352610 e157618 R  low       VASP 5.4.1 /tmp/tmp.6 Aug 27 12:36 36   1     dcalph024
342310 e157618 R  low       VASP 5.4.1 /tmp/tmp.f Aug 25 16:09 36   1     dcalph030
352623 e157618 R  low       VASP 5.4.1 /tmp/tmp.9 Aug 27 12:38 36   1     dcalph050
352686 e157618 R  low       VASP 5.4.1 /tmp/tmp.e Aug 27 13:12 36   1     dcalph071
352676 e157618 R  low       VASP 5.4.1 /tmp/tmp.y Aug 27 13:06 36   1     dcalph074
-bash-4.1$
==============================================================================

Will upload the log and diagnostic info shortly.

Created attachment 5157 [details]
All the diagnostic data for comment #33

Hi, we can recreate this issue. We have prepared a patch which improves the node selection logic, but we need more time and tests before we commit it to git. If you want, I can give you a 15.08 version now. Dominik

Created attachment 5195 [details]
patch 15.08
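The mention above of improving the node selection logic suggests one reading of why preempt_strict_order alone did not help: preemption victims are limited to jobs resident on the nodes actually chosen for the incoming job. The following is a toy illustration of that idea only, not slurmctld code; the node contents are a hypothetical simplification of the squeue output earlier in this report.

```python
# Hypothetical node map: each node lists its resident (partition, priority).
NODES = {
    "dcalph001": [("normal", 3)],
    "dcalph002": [("low", 2)],
    "dcalph008": [("low", 2)],
    "dcalph009": [("low", 2)],
}

def pick_nodes(n, priority_aware):
    # Baseline: pick nodes in name order, ignoring what runs on them.
    names = sorted(NODES)
    if priority_aware:
        # Prefer nodes whose resident jobs have the lowest priority,
        # so preemption falls on the cheapest victims.
        names.sort(key=lambda x: max(p for _, p in NODES[x]))
    return names[:n]

# A naive pick lands on the node hosting the "normal" job:
print(pick_nodes(2, priority_aware=False))  # ['dcalph001', 'dcalph002']
# A priority-aware pick chooses nodes holding only "low" jobs:
print(pick_nodes(2, priority_aware=True))   # ['dcalph002', 'dcalph008']
```

Under this toy model, strictly ordering the victim list does nothing if node selection has already pinned the job to nodes occupied by a higher-priority victim.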
Hi Dominik,

That is great news, thanks! About 10 minutes before your last comment we had already upgraded to Slurm 17.02.7. Does that issue affect 17.02.7 as well?

Hi, I am afraid it does. I can give you a preliminary patch, or you can wait until it is in the 17.02 branch. Dominik

We could certainly wait until the fix lands in the 17.02 branch. Thanks!

I see the following commit:

commit 0f501359c635801b08cbc2b5e61164284f2610b7
Author: Dominik Bartkiewicz <bart@schedmd.com>
Date: Thu Sep 7 14:46:45 2017 -0600

    Optimization enhancements for partition based job preemption
    bug 3824

Do you plan to commit anything related to #3824 on top of this before the 17.02.8 release? I understand that any change might bring unexpected side effects or might not fully achieve what it was expected to achieve. But if, in the meantime, you do not think anything else is going to be committed on top of that, I am willing to try it out.

Hi, if it is working as you expect, we are not planning any modification of this algorithm. This commit contains all the necessary changes, and it can be applied separately. Dominik

Hi, I am marking this as resolved, but feel free to reopen if any problem occurs. Dominik |
Created attachment 4592 [details]
slurm.conf

Hi,

-bash-4.1$ scontrol show node=dcalph090
NodeName=dcalph090 Arch=x86_64 CoresPerSocket=8 CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.02
   Features=E5-2680,64G,cae Gres=(null)
   NodeAddr=dcalph090 NodeHostName=dcalph090 Version=(null)
   OS=Linux RealMemory=64386 AllocMem=0 FreeMem=53661 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=1951 Weight=1 Owner=N/A
   BootTime=Mar 5 10:41 SlurmdStartTime=Mar 5 10:46
   CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
-bash-4.1$ sbatch -p open -n 16 -w dcalph090 --wrap="sleep 1000"
Submitted batch job 279893
-bash-4.1$ sbatch -p low -n 8 -w dcalph090 --wrap="sleep 1000"
Submitted batch job 279894
-bash-4.1$ squeue -w dcalph090
JOBID  USER    ST PARTITION NAME COMMAND SUBMIT_TIME  CPUS NODES NODELIST(REASON)
279894 e154466 R  low       wrap (null)  May 18 18:12 8    1     dcalph090
279893 e154466 S  open      wrap (null)  May 18 18:12 16   1     dcalph090
-bash-4.1$ sbatch -p normal -n 8 -w dcalph090 --wrap="sleep 1000"
Submitted batch job 279895
-bash-4.1$ squeue -w dcalph090
JOBID  USER    ST PARTITION NAME COMMAND SUBMIT_TIME  CPUS NODES NODELIST(REASON)
279895 e154466 R  normal    wrap (null)  May 18 18:12 8    1     dcalph090
279894 e154466 S  low       wrap (null)  May 18 18:12 8    1     dcalph090
279893 e154466 S  open      wrap (null)  May 18 18:12 16   1     dcalph090
-bash-4.1$

Why did Slurm preempt the low partition job?