Ticket 7540 - Node getting oversubscribed in spite of settings - Suspend,Gang
Summary: Node getting oversubscribed in spite of settings - Suspend,Gang
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 18.08.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
 
Reported: 2019-08-08 14:38 MDT by Deb Crocker
Modified: 2019-08-23 02:21 MDT
Site: University of Alabama


Attachments
MAIN SLURM.CONF (3.17 KB, text/plain)
2019-08-08 14:38 MDT, Deb Crocker
NODE CONFIGURATION (11.87 KB, text/x-matlab)
2019-08-08 14:38 MDT, Deb Crocker
PARTITION CONFIGURATION (1.97 KB, text/x-matlab)
2019-08-08 14:39 MDT, Deb Crocker
SMALL PART OF CONFIG INCLUDE FILE (123 bytes, text/plain)
2019-08-08 14:39 MDT, Deb Crocker
SLURMCTLD LOG FILE (87.69 MB, text/plain)
2019-08-08 14:42 MDT, Deb Crocker

Description Deb Crocker 2019-08-08 14:38:32 MDT
Created attachment 11162 [details]
MAIN SLURM.CONF

We have two jobs that ended up sharing a node and running on more than the available cores, even though we have OverSubscribe set to NO. The jobs were submitted to different partitions. The second one was added to the node as -n 1 -c 16 when there were only 9 free cores. We have not seen anything like this in the several years we have been running Slurm. I'm including some information here and will attach our config files.

Here is the squeue output for the node:

         268833      main   main  2D.sh oazadehran  R    3:33:26   20:26:34   11/ 3     4G compute-2-[4-6] (null)
         263453      long   long sbatch     hhan19  R    1:35:59 6-22:24:01   16/ 1     8G compute-2-6 (null)

Here are the job submission files for each job:

268833:
#!/bin/bash
#SBATCH --mem-per-cpu 4G
#SBATCH -C intel
#SBATCH -p main
#SBATCH --qos main
#SBATCH -n 11

rm -f *.o *.mod

srun  ./P3


263453:
#!/bin/bash
#SBATCH -n 1
#SBATCH -c 16
#SBATCH -p long
#SBATCH --qos long
#SBATCH --mem=8gb
#SBATCH -e errors.%A
#SBATCH -o output.%A

python meta_test.py
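For reference, the CPU footprint each script requests is simply ntasks (-n) times cpus-per-task (-c, default 1). A trivial sketch (`cpus_for` is just an illustrative helper, not a Slurm command):

```shell
# Total CPUs requested = ntasks (-n) * cpus-per-task (-c, default 1)
cpus_for() { echo $(( $1 * $2 )); }

cpus_for 11 1   # job 268833: -n 11       -> 11 CPUs, spreadable across nodes
cpus_for 1 16   # job 263453: -n 1 -c 16  -> 16 CPUs, all on a single node
```

Because -c CPUs belong to a single task, the 16 CPUs of 263453 must all come from one node, which is why that request could only legitimately fit on a node with 16 free cores.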


Here is the qstat output for each job:

268833:
        Job_Name = 2D.sh
        Job_Owner = oazadehranjbar@uahpc
        job_state = R
        queue = main
        qtime = Thu Aug  8 11:13:36 2019
        mtime = Thu Aug  8 11:13:36 2019
        ctime = Fri Aug  9 11:13:36 2019
        Account_Name = avolkov1_grp
        exec_host = compute-2-4/1+compute-2-5/3+compute-2-6/7
        Priority = 3655
        euser = oazadehranjbar(146695)
        egroup = users(100)
        Resource_List.walltime = 24:00:00
        Resource_List.nodect = 3
        Resource_List.ncpus = 11

263453:
        Job_Name = sbatch
        Job_Owner = hhan19@uahpc
        job_state = R
        queue = long
        qtime = Sun Aug  4 16:40:11 2019
        mtime = Thu Aug  8 13:11:03 2019
        ctime = Thu Aug 15 13:11:03 2019
        Account_Name = hhan19_grp
        exec_host = compute-2-6/16
        Priority = 2520
        euser = hhan19(105563)
        egroup = users(100)
        Resource_List.walltime = 168:00:00
        Resource_List.nodect = 1
        Resource_List.ncpus = 16
Comment 1 Deb Crocker 2019-08-08 14:38:58 MDT
Created attachment 11163 [details]
NODE CONFIGURATION
Comment 2 Deb Crocker 2019-08-08 14:39:15 MDT
Created attachment 11164 [details]
PARTITION CONFIGURATION
Comment 3 Deb Crocker 2019-08-08 14:39:40 MDT
Created attachment 11165 [details]
SMALL PART OF CONFIG INCLUDE FILE
Comment 4 Deb Crocker 2019-08-08 14:42:45 MDT
Created attachment 11166 [details]
SLURMCTLD LOG FILE
Comment 5 Jason Booth 2019-08-08 15:06:36 MDT
Hi Deborah - does this happen on every job, or does it occur once in a while?
Comment 6 Deb Crocker 2019-08-08 15:07:44 MDT
We haven't seen this before. It is happening to the one user hhan19 on a number of jobs he has submitted.
Comment 7 Jason Booth 2019-08-08 16:10:27 MDT
Would you also send us the output of 

scontrol show job <jobId>

For both of these jobs?
Comment 8 Deb Crocker 2019-08-08 19:10:11 MDT
Job 268833 has finished, but we capture job output for our records. I believe this has everything from scontrol show job:

 JobId=268833 JobName=2D.sh
   UserId=oazadehranjbar(146695) GroupId=users(100) MCS_label=N/A
   Priority=3655 Nice=0 Account=avolkov1_grp QOS=main WCKey=*default
   JobState=COMPLETING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=08:42:41 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2019-08-08T11:13:36 EligibleTime=2019-08-08T11:13:36
   AccrueTime=2019-08-08T11:13:36
   StartTime=2019-08-08T11:13:36 EndTime=2019-08-08T19:56:17 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-08-08T11:13:36
   Partition=main AllocNode:Sid=uahpc:117894
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-2-[4-6]
   BatchHost=compute-2-4
   NumNodes=3 NumCPUs=11 NumTasks=11 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=11,mem=44G,node=3,billing=22
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=compute-2-4 CPU_IDs=8 Mem=4096 GRES_IDX=
     Nodes=compute-2-5 CPU_IDs=4,6-7 Mem=12288 GRES_IDX=
     Nodes=compute-2-6 CPU_IDs=0-6 Mem=28672 GRES_IDX=
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=intel DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bighome/oazadehranjbar/2D/1/100mic/Test_TEM10/2D.sh
   WorkDir=/bighome/oazadehranjbar/2D/1/100mic/Test_TEM10
   StdErr=/bighome/oazadehranjbar/2D/1/100mic/Test_TEM10/slurm-268833.out
   StdIn=/dev/null
   StdOut=/bighome/oazadehranjbar/2D/1/100mic/Test_TEM10/slurm-268833.out
   Power=
     

From 263453:

JobId=263453 JobName=sbatch
   UserId=hhan19(105563) GroupId=users(100) MCS_label=N/A
   Priority=2520 Nice=0 Account=hhan19_grp QOS=long WCKey=*
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=06:58:29 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2019-08-04T16:40:11 EligibleTime=2019-08-04T16:40:11
   AccrueTime=2019-08-04T16:40:11
   StartTime=2019-08-08T13:11:03 EndTime=2019-08-15T13:11:03 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-08-08T13:11:03
   Partition=long AllocNode:Sid=uahpc:46852
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-2-6
   BatchHost=compute-2-6
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=8G,node=1,billing=18
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryNode=8G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/hhan19/BMeta
   StdErr=/home/hhan19/BMeta/errors.263453
   StdIn=/dev/null
   StdOut=/home/hhan19/BMeta/output.263453
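As a sanity check on the numbers above: job 268833 holds CPU_IDs=0-6 (7 cores) on compute-2-6, and job 263453 was then given 16 more cores on the same 16-core node. A small POSIX-shell helper (hypothetical, not a Slurm tool) that expands a CPU_IDs range list and counts it makes the oversubscription explicit:

```shell
# count_cpu_ids: expand a Slurm CPU_IDs list such as "8-9,12-14"
# and print how many CPUs it covers.
count_cpu_ids() {
    total=0
    old_ifs=$IFS
    IFS=','
    for part in $1; do
        case $part in
            *-*) lo=${part%-*}; hi=${part#*-}
                 total=$(( total + hi - lo + 1 )) ;;
            *)   total=$(( total + 1 )) ;;
        esac
    done
    IFS=$old_ifs
    echo "$total"
}

on_node=$(count_cpu_ids "0-6")                    # job 268833 on compute-2-6 -> 7
echo "allocated: $(( on_node + 16 )) of 16 cores" # -> allocated: 23 of 16 cores
```

The per-node CPU_IDs lines quoted above come from the detailed job view (`scontrol -d show job <jobid>`).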
Comment 9 Felip Moll 2019-08-09 02:11:56 MDT
Hi Deb,

I would need the slurmd log on compute-2-6.

What I've seen so far is that the main scheduler is responsible for allocating the job to the wrong node:

[2019-08-08T13:11:02.986] _job_complete: JobId=263702 done
[2019-08-08T13:11:03.123] debug:  sched: Running job scheduler
[2019-08-08T13:11:03.124] BillingWeight: JobId=263453 is either new or it was resized
[2019-08-08T13:11:03.124] BillingWeight: JobId=263453 using "CPU=1.0,Mem=0.25G" from partition long
[2019-08-08T13:11:03.124] BillingWeight: JobId=263453 SUM(TRES) = 18.000000
[2019-08-08T13:11:03.125] sched: Allocate JobId=263453 NodeList=compute-2-6 #CPUs=16 Partition=long

The backfill is doing it correctly all the time:
[2019-08-08T13:10:16.897] =========================================
[2019-08-08T13:10:16.897] backfill test for JobId=263453 Prio=2520 Partition=long
[2019-08-08T13:10:16.897] Test JobId=263453 at 2019-08-08T13:10:16 on compute-2-[1,3-7]
[2019-08-08T13:10:16.898] JobId=263453 to start at 2019-08-10T10:54:28, end at 2019-08-17T10:54:00 on nodes compute-2-7 in partition long

Given that, I would need info for 263702: an 'scontrol show job 263702' or the information you say you record for each job.
Comment 10 Deb Crocker 2019-08-09 11:45:32 MDT
The slurmd log is empty on compute-2-6. Here is the job info you requested:

JobId=263702 JobName=RC2_12c1_n1_T300_s_r0.sh
   UserId=avolkov1(79148) GroupId=users(100) MCS_label=N/A
   Priority=3663 Nice=0 Account=avolkov1_grp QOS=long WCKey=*
   JobState=COMPLETING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=2-04:44:32 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2019-08-06T08:26:29 EligibleTime=2019-08-06T08:26:29
   AccrueTime=2019-08-06T08:26:29
   StartTime=2019-08-06T08:26:30 EndTime=2019-08-08T13:11:02 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-08-06T08:26:30
   Partition=long AllocNode:Sid=uahpc:17802
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-2-[4-6]
   BatchHost=compute-2-4
   NumNodes=3 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=80G,node=3,billing=36
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=compute-2-4 CPU_IDs=0-7 Mem=40960 GRES_IDX=
     Nodes=compute-2-5 CPU_IDs=0-2 Mem=15360 GRES_IDX=
     Nodes=compute-2-6 CPU_IDs=8-9,12-14 Mem=25600 GRES_IDX=
   MinCPUsNode=1 MinMemoryCPU=5G MinTmpDiskNode=0
   Features=intel DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bighome/avolkov1/CNT_M4/RC2_12c1_n1_T300_s_r0/RC2_12c1_n1_T300_s_r0.sh
   WorkDir=/bighome/avolkov1/CNT_M4/RC2_12c1_n1_T300_s_r0
   StdErr=/bighome/avolkov1/CNT_M4/RC2_12c1_n1_T300_s_r0/slurm-263702.out
   StdIn=/dev/null
   StdOut=/bighome/avolkov1/CNT_M4/RC2_12c1_n1_T300_s_r0/slurm-263702.out
   Power=
Comment 11 Felip Moll 2019-08-15 10:04:29 MDT
(In reply to Deb Crocker from comment #10)
> The slurmd log is empty on compute-2-6. Here is the job info you requested:
> [...]

Deb,

I haven't been able to reproduce so far. I see, in your scontrol show job:

JobState=COMPLETING

Is this the state you captured while the job was ending?
Another question: do you know if the job was modified in some way after submission?

Have you seen the issue again?
Comment 12 Deb Crocker 2019-08-15 10:28:20 MDT
On COMPLETING - yes, that was captured as the job was finishing.

It is unlikely that the users modified the job after submission. They are new to the system and probably would not know this is possible.

I think we did have a setting in the configuration file that we changed while testing the problem. It was there when we changed our preemption to cancel. We left it because the docs say it is overridden by partition settings. It was:

PreemptMode             = SUSPEND,GANG
Comment 13 Felip Moll 2019-08-20 05:37:18 MDT
> I think we did have a setting in the configuration file that we changed in
> testing the problem. It that was there when we changed our preemption to
> cancel. We left it because docs say it is overwritten by partition settings.
> It was
> 
> PreemptMode             = SUSPEND,GANG

That's interesting - are you saying that you previously had SUSPEND instead of the current setting, PreemptMode=CANCEL?

That may indeed be part of the cause; I remember having seen before that a suspended job that was later resumed ended up oversubscribing new jobs allocated to the node.
Let me do some tests and confirm or rule out this possibility.

Also note:
      preempt/partition_prio
             Job preemption is based upon partition priority tier.
             Jobs in higher priority partitions (queues) may preempt jobs from lower priority partitions.
             This is not  compatible  with  PreemptMode=OFF.

You have all except two partitions set to PreemptMode=OFF. I will also check whether this is a problem, since the documentation suggests it is not OK.
Comment 14 Deb Crocker 2019-08-20 10:23:22 MDT
On the partitions where it is off, what do we have to set that to, and how do we keep jobs from preempting where we don't want them to? The only preemption allowed should be:

Highmem preempts main
Owners preempts main
Highmem can't preempt owners and vice versa
Long can't preempt anyone
Main can't preempt anyone
Comment 15 Deb Crocker 2019-08-20 10:28:56 MDT
From the Docs:

preempt/partition_prio
    Job preemption is based upon partition priority tier. Jobs in higher priority partitions (queues) may preempt jobs from lower priority partitions. This is not compatible with PreemptMode=OFF.

Are they only referring to the cluster level value (not the node level value)?
Comment 16 Felip Moll 2019-08-22 10:12:42 MDT
This small example achieves what you want, and it turns out to be exactly how you have it configured, so I think your current configuration is good. As you guessed, PreemptMode=OFF is incompatible with partition_prio at the global level, *but not at the partition level*.

PreemptType=preempt/partition_prio
PreemptMode=CANCEL

PartitionName=wheel Nodes=ALL Priority=10
PartitionName=highmem Nodes=ALL Priority=2
PartitionName=owners Nodes=ALL Priority=2
PartitionName=long Nodes=ALL Priority=1 PreemptMode=OFF
PartitionName=main Nodes=ALL Priority=1 PreemptMode=CANCEL


Having said that, main and long are partitions with overlapping nodes, specifically:

compute-0-[4-11],compute-1-[0-9],compute-2-[0-7]

Job 268833 ran first in partition main on nodes compute-2-[4-6].
Job 263453 appeared to run second in partition long on node compute-2-6.

I tried your configuration and ran a set of tests, and I could reproduce your issue exactly.
The reproducer includes your PreemptMode=SUSPEND,GANG.

My logs look like:

slurmctld: debug:  sched: Running job scheduler
slurmctld: gang: entering gs_job_start for JobId=142
slurmctld: gang: _add_job_to_part: adding JobId=142 to long   <------------
slurmctld: gang: _add_job_to_part: JobId=142 remains running
slurmctld: gang:  part long has 1 jobs, 0 shadows:            <------------ looking at part level
slurmctld: gang:   JobId=142 row_s GS_FILLER, sig_s GS_RESUME
slurmctld: gang:  active resmap has 16 of 64 bits set
slurmctld: gang: update_active_row: rebuilding part wheel...
slurmctld: gang: update_active_row: rebuilding part highmem...
slurmctld: gang: update_active_row: rebuilding part owners...
slurmctld: gang: update_active_row: rebuilding part long...
slurmctld: gang: update_active_row: rebuilding part main...
slurmctld: gang: leaving gs_job_start
slurmctld: sched: Allocate JobId=142 NodeList=gamba1 #CPUs=16 Partition=long
slurmctld: prolog_running_decr: Configuration for JobId=142 is complete

Here we can see that even though the resources are busy with other jobs, job 142 is added to long. After that, the scheduler allocates and starts the job.

I then reduced the issue and found that it happens exactly when you have two partitions with equal priority and SUSPEND,GANG enabled. In that situation,
the resources are oversubscribed even if OverSubscribe=NO.

In part it makes sense: SUSPEND+GANG will try to preempt and time-slice a job in two different partitions but won't know which of them has priority, so it ends up starting the jobs
in both partitions. Having SUSPEND,GANG with partitions at the same priority doesn't make sense. To make it worse, GANG works at the partition level, so overlapping nodes
in different partitions are also not recommended when using GANG.

From documentation:

  GANG        enables gang scheduling (time slicing) of jobs in the same partition.  NOTE: Gang scheduling is performed independently for each partition,
              so configuring partitions with overlapping nodes and gang scheduling is generally not recommended.

Interesting note:

Preemption Design and Operation

The select plugin will identify resources where a pending job can begin execution. When PreemptMode is configured to CANCEL, CHECKPOINT, SUSPEND or REQUEUE, the select
plugin will also preempt running jobs as needed to initiate the pending job. When PreemptMode=SUSPEND,GANG the select plugin will initiate the pending job and rely upon
the gang scheduling logic to perform job suspend and resume as described below. 


*********** I think it is clear that this is what you want: ***********

PreemptType=preempt/partition_prio
PreemptMode=CANCEL

and your current settings in each partition.

***********

It is also clear that, as the documentation says, a user who wants preemption between partitions plus GANG needs to set:

PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

and partitions with different priorities, because otherwise GANG just ignores PreemptMode on each partition and schedules per partition without preempting anything between partitions.
Finally, disjoint nodes in every partition are also recommended when using GANG.
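A sketch of that alternative layout (the priorities and disjoint node ranges below are illustrative, not a recommendation for this cluster):

```
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

# Different priority tiers so GANG knows which partition preempts which,
# and disjoint node sets so gang scheduling never spans partitions:
PartitionName=highmem Nodes=compute-0-[4-11] Priority=2
PartitionName=main    Nodes=compute-1-[0-9]  Priority=1
```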

Remember that GANG only does time-slicing within the same partition; if you combine it with SUSPEND, you also get preemption between partitions.


Does it make sense?
Comment 17 Deb Crocker 2019-08-22 10:20:01 MDT
Thank you, I do understand. It looks like we are okay now. You may close the ticket.
Comment 18 Felip Moll 2019-08-23 02:21:40 MDT
Thanks Deb, closing the issue.