Created attachment 11162 [details] MAIN SLURM.CONF

We have two jobs that ended up sharing a node and running on more than the available cores, even though we have OverSubscribe set to NO. The jobs were submitted to different partitions. The one submitted with -n 1 -c 16 was added to a node with only 9 free cores. We have not seen anything like this in the several years we have been running Slurm. I'm including some information here and will attach our config files.

Here is the squeue output for the node:

268833 main main 2D.sh oazadehran R 3:33:26 20:26:34 11/ 3 4G compute-2-[4-6] (null)
263453 long long sbatch hhan19 R 1:35:59 6-22:24:01 16/ 1 8G compute-2-6 (null)

Here are the job submission files for each job:

268833:

#!/bin/bash
#SBATCH --mem-per-cpu 4G
#SBATCH -C intel
#SBATCH -p main
#SBATCH --qos main
#SBATCH -n 11
rm -f *.o *.mod
srun ./P3

263453:

#!/bin/bash
#SBATCH -n 1
#SBATCH -c 16
#SBATCH -p long
#SBATCH --qos long
#SBATCH --mem=8gb
#SBATCH -e errors.%A
#SBATCH -o output.%A
python meta_test.py

Here is the qstat output for each job:

268833:

Job_Name = 2D.sh
Job_Owner = oazadehranjbar@uahpc
job_state = R
queue = main
qtime = Thu Aug 8 11:13:36 2019
mtime = Thu Aug 8 11:13:36 2019
ctime = Fri Aug 9 11:13:36 2019
Account_Name = avolkov1_grp
exec_host = compute-2-4/1+compute-2-5/3+compute-2-6/7
Priority = 3655
euser = oazadehranjbar(146695)
egroup = users(100)
Resource_List.walltime = 24:00:00
Resource_List.nodect = 3
Resource_List.ncpus = 11

263453:

Job_Name = sbatch
Job_Owner = hhan19@uahpc
job_state = R
queue = long
qtime = Sun Aug 4 16:40:11 2019
mtime = Thu Aug 8 13:11:03 2019
ctime = Thu Aug 15 13:11:03 2019
Account_Name = hhan19_grp
exec_host = compute-2-6/16
Priority = 2520
euser = hhan19(105563)
egroup = users(100)
Resource_List.walltime = 168:00:00
Resource_List.nodect = 1
Resource_List.ncpus = 16
Created attachment 11163 [details] NODE CONFIGURATION
Created attachment 11164 [details] PARTITION CONFIGURATION
Created attachment 11165 [details] SMALL PART OF CONFIG INCLUDE FILE
Created attachment 11166 [details] SLURMCTLD LOG FILE
Hi Deborah - does this happen on every job, or does it occur only once in a while?
We haven't seen this before. It is happening to the one user hhan19 on a number of jobs he has submitted.
Would you also send us the output of scontrol show job <jobId> for both of these jobs?
The job 268833 has finished, but we capture job output for our records. I believe this has everything from scontrol show job:

JobId=268833 JobName=2D.sh
   UserId=oazadehranjbar(146695) GroupId=users(100) MCS_label=N/A
   Priority=3655 Nice=0 Account=avolkov1_grp QOS=main WCKey=*default
   JobState=COMPLETING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=08:42:41 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2019-08-08T11:13:36 EligibleTime=2019-08-08T11:13:36
   AccrueTime=2019-08-08T11:13:36
   StartTime=2019-08-08T11:13:36 EndTime=2019-08-08T19:56:17 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-08-08T11:13:36
   Partition=main AllocNode:Sid=uahpc:117894
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-2-[4-6]
   BatchHost=compute-2-4
   NumNodes=3 NumCPUs=11 NumTasks=11 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=11,mem=44G,node=3,billing=22
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   Nodes=compute-2-4 CPU_IDs=8 Mem=4096 GRES_IDX=
   Nodes=compute-2-5 CPU_IDs=4,6-7 Mem=12288 GRES_IDX=
   Nodes=compute-2-6 CPU_IDs=0-6 Mem=28672 GRES_IDX=
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=intel DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bighome/oazadehranjbar/2D/1/100mic/Test_TEM10/2D.sh
   WorkDir=/bighome/oazadehranjbar/2D/1/100mic/Test_TEM10
   StdErr=/bighome/oazadehranjbar/2D/1/100mic/Test_TEM10/slurm-268833.out
   StdIn=/dev/null
   StdOut=/bighome/oazadehranjbar/2D/1/100mic/Test_TEM10/slurm-268833.out
   Power=

From 263453:

JobId=263453 JobName=sbatch
   UserId=hhan19(105563) GroupId=users(100) MCS_label=N/A
   Priority=2520 Nice=0 Account=hhan19_grp QOS=long WCKey=*
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=06:58:29 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2019-08-04T16:40:11 EligibleTime=2019-08-04T16:40:11
   AccrueTime=2019-08-04T16:40:11
   StartTime=2019-08-08T13:11:03 EndTime=2019-08-15T13:11:03 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-08-08T13:11:03
   Partition=long AllocNode:Sid=uahpc:46852
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-2-6
   BatchHost=compute-2-6
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=8G,node=1,billing=18
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryNode=8G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/hhan19/BMeta
   StdErr=/home/hhan19/BMeta/errors.263453
   StdIn=/dev/null
   StdOut=/home/hhan19/BMeta/output.263453
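Side note: the oversubscription on compute-2-6 is visible from plain arithmetic on the CPU_IDs fields above. A rough sketch (count_cpus is our own helper here, not a Slurm tool):

```shell
# Expand a Slurm CPU_IDs list such as "0-6" or "8-9,12-14" and count
# how many cores it covers.
count_cpus() {
  echo "$1" | tr ',' '\n' | awk -F- '{ n += (NF == 2) ? $2 - $1 + 1 : 1 } END { print n }'
}

held=$(count_cpus "0-6")   # cores job 268833 holds on compute-2-6
alloc=16                   # cores allocated to job 263453 on the same node
echo "$held + $alloc = $((held + alloc)) cores committed on a 16-core node"
```

That prints "7 + 16 = 23 cores committed on a 16-core node", which is more cores than the node has.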
Hi Deb, I would need the slurmd log from compute-2-6. What I've seen so far is that the main scheduler is responsible for allocating the job onto the wrong node:

[2019-08-08T13:11:02.986] _job_complete: JobId=263702 done
[2019-08-08T13:11:03.123] debug: sched: Running job scheduler
[2019-08-08T13:11:03.124] BillingWeight: JobId=263453 is either new or it was resized
[2019-08-08T13:11:03.124] BillingWeight: JobId=263453 using "CPU=1.0,Mem=0.25G" from partition long
[2019-08-08T13:11:03.124] BillingWeight: JobId=263453 SUM(TRES) = 18.000000
[2019-08-08T13:11:03.125] sched: Allocate JobId=263453 NodeList=compute-2-6 #CPUs=16 Partition=long

The backfill scheduler is doing it correctly all the time:

[2019-08-08T13:10:16.897] =========================================
[2019-08-08T13:10:16.897] backfill test for JobId=263453 Prio=2520 Partition=long
[2019-08-08T13:10:16.897] Test JobId=263453 at 2019-08-08T13:10:16 on compute-2-[1,3-7]
[2019-08-08T13:10:16.898] JobId=263453 to start at 2019-08-10T10:54:28, end at 2019-08-17T10:54:00 on nodes compute-2-7 in partition long

Given that, I would need info for 263702: an 'scontrol show job 263702', or the information you say you record for each job.
The slurmd log is empty on compute-2-6. Here is the job info you requested:

JobId=263702 JobName=RC2_12c1_n1_T300_s_r0.sh
   UserId=avolkov1(79148) GroupId=users(100) MCS_label=N/A
   Priority=3663 Nice=0 Account=avolkov1_grp QOS=long WCKey=*
   JobState=COMPLETING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=2-04:44:32 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2019-08-06T08:26:29 EligibleTime=2019-08-06T08:26:29
   AccrueTime=2019-08-06T08:26:29
   StartTime=2019-08-06T08:26:30 EndTime=2019-08-08T13:11:02 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-08-06T08:26:30
   Partition=long AllocNode:Sid=uahpc:17802
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-2-[4-6]
   BatchHost=compute-2-4
   NumNodes=3 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=80G,node=3,billing=36
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   Nodes=compute-2-4 CPU_IDs=0-7 Mem=40960 GRES_IDX=
   Nodes=compute-2-5 CPU_IDs=0-2 Mem=15360 GRES_IDX=
   Nodes=compute-2-6 CPU_IDs=8-9,12-14 Mem=25600 GRES_IDX=
   MinCPUsNode=1 MinMemoryCPU=5G MinTmpDiskNode=0
   Features=intel DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bighome/avolkov1/CNT_M4/RC2_12c1_n1_T300_s_r0/RC2_12c1_n1_T300_s_r0.sh
   WorkDir=/bighome/avolkov1/CNT_M4/RC2_12c1_n1_T300_s_r0
   StdErr=/bighome/avolkov1/CNT_M4/RC2_12c1_n1_T300_s_r0/slurm-263702.out
   StdIn=/dev/null
   StdOut=/bighome/avolkov1/CNT_M4/RC2_12c1_n1_T300_s_r0/slurm-263702.out
   Power=
(In reply to Deb Crocker from comment #10)
> The slurmd log is empty on compute-2-6. Here is the job info you requested:
>
> JobId=263702 JobName=RC2_12c1_n1_T300_s_r0.sh
> [...]
> JobState=COMPLETING Reason=None Dependency=(null)
> [...]

Deb, I haven't been able to reproduce the issue so far. I see JobState=COMPLETING in your scontrol show job output; is that the state as captured while the job was ending? Another question: do you know if the job was modified in some way after submission? And have you seen the issue again?
On COMPLETING - yes, that was captured as the job was finishing. It is unlikely that the users modified the job after submission; they are new to the system and probably would not know that is possible.

I think we did have a setting in the configuration file that we changed while testing the problem. It was left over from when we changed our preemption to cancel; we kept it because the docs say it is overridden by partition settings. It was:

PreemptMode = SUSPEND,GANG
> I think we did have a setting in the configuration file that we changed in
> testing the problem. [...] It was
>
> PreemptMode = SUSPEND,GANG

That's interesting. Are you saying that you previously had SUSPEND instead of the current setting PreemptMode=CANCEL? That may indeed be part of the cause; I remember having seen before that a suspended job that was later resumed ended up oversubscribing against new jobs allocated to the node. Let me do some tests and confirm or discard this possibility.

Also note, from the documentation:

preempt/partition_prio
    Job preemption is based upon partition priority tier. Jobs in higher priority partitions (queues) may preempt jobs from lower priority partitions. This is not compatible with PreemptMode=OFF.

You have all except two partitions set to PreemptMode=OFF. I will also check whether this is a problem, since the documentation suggests it is not OK.
On the partitions where it is off, what do we have to set that to, and how do we keep jobs from preempting where we don't want them to? The only preemption allowed should be:

Highmem preempts main
Owners preempts main
Highmem can't preempt owners, and vice versa
Long can't preempt anyone
Main can't preempt anyone
From the docs:

preempt/partition_prio
    Job preemption is based upon partition priority tier. Jobs in higher priority partitions (queues) may preempt jobs from lower priority partitions. This is not compatible with PreemptMode=OFF.

Are they referring only to the cluster-level value (not the partition-level value)?
This small example achieves what you want, and it turns out to be exactly how you have it configured, so I think your current configuration is good. As you guessed, PreemptMode=OFF is incompatible with partition_prio at the global level, *but not at the partition level*:

PreemptType=preempt/partition_prio
PreemptMode=CANCEL

PartitionName=wheel Nodes=ALL Priority=10
PartitionName=highmem Nodes=ALL Priority=2
PartitionName=owners Nodes=ALL Priority=2
PartitionName=long Nodes=ALL Priority=1 PreemptMode=OFF
PartitionName=main Nodes=ALL Priority=1 PreemptMode=CANCEL

Having said that, main and long are partitions with overlapping nodes, specifically:

compute-0-[4-11],compute-1-[0-9],compute-2-[0-7]

Job 268833 ran first, in partition main, on nodes compute-2-[4-6].
Job 263453 then ran second, in partition long, on node compute-2-6.

I tried your configurations and ran a set of tests, and I could reproduce your issue exactly. The reproducer includes your PreemptMode=SUSPEND,GANG. My logs look like:

slurmctld: debug: sched: Running job scheduler
slurmctld: gang: entering gs_job_start for JobId=142
slurmctld: gang: _add_job_to_part: adding JobId=142 to long <------------
slurmctld: gang: _add_job_to_part: JobId=142 remains running
slurmctld: gang: part long has 1 jobs, 0 shadows: <------------ looking at part level
slurmctld: gang: JobId=142 row_s GS_FILLER, sig_s GS_RESUME
slurmctld: gang: active resmap has 16 of 64 bits set
slurmctld: gang: update_active_row: rebuilding part wheel...
slurmctld: gang: update_active_row: rebuilding part highmem...
slurmctld: gang: update_active_row: rebuilding part owners...
slurmctld: gang: update_active_row: rebuilding part long...
slurmctld: gang: update_active_row: rebuilding part main...
slurmctld: gang: leaving gs_job_start
slurmctld: sched: Allocate JobId=142 NodeList=gamba1 #CPUs=16 Partition=long
slurmctld: prolog_running_decr: Configuration for JobId=142 is complete

Here we can see that even though the resources are busy with other jobs, JobId=142 is added to long. After that, the scheduler allocates and starts the job.

I then reduced the issue further and found that it happens exactly when you have two partitions with equal priority and SUSPEND,GANG enabled. In that situation the resources are oversubscribed even with OverSubscribe=NO. In part this makes sense: SUSPEND+GANG will try to preempt and time-slice a job across two different partitions, but it won't know which of them has priority, so it ends up starting the jobs in both partitions. Having SUSPEND,GANG with partitions at the same priority doesn't make sense.

To make it worse, GANG works at the partition level, so overlapping nodes in different partitions are also not recommended when using GANG. From the documentation:

GANG
    enables gang scheduling (time slicing) of jobs in the same partition. NOTE: Gang scheduling is performed independently for each partition, so configuring partitions with overlapping nodes and gang scheduling is generally not recommended.

An interesting note, from "Preemption Design and Operation":

The select plugin will identify resources where a pending job can begin execution. When PreemptMode is configured to CANCEL, CHECKPOINT, SUSPEND or REQUEUE, the select plugin will also preempt running jobs as needed to initiate the pending job. When PreemptMode=SUSPEND,GANG the select plugin will initiate the pending job and rely upon the gang scheduling logic to perform job suspend and resume as described below.

***********
I think it is clear that you want:
***********

PreemptType=preempt/partition_prio
PreemptMode=CANCEL

and your current settings in each partition.
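Spelling that out, a minimal slurm.conf fragment for that recommendation might look like the following (a sketch only; Nodes=ALL is a placeholder for your real node lists, and the partition names are the ones from this ticket):

```
PreemptType=preempt/partition_prio
PreemptMode=CANCEL

PartitionName=highmem Nodes=ALL Priority=2
PartitionName=owners  Nodes=ALL Priority=2
PartitionName=main    Nodes=ALL Priority=1 PreemptMode=CANCEL
PartitionName=long    Nodes=ALL Priority=1 PreemptMode=OFF
```

With these tiers, highmem and owners (tier 2) may preempt main (tier 1), equal tiers never preempt each other, and PreemptMode=OFF on long keeps its jobs from being preempted.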
***********
It is also clear that, as the documentation says, a user who wants preemption between partitions plus GANG needs to set:

PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

and partitions with different priorities, because otherwise GANG just ignores PreemptMode on each partition and schedules partition by partition without preempting anything between partitions. Finally, disjoint node sets in every partition are also recommended when using GANG.

Remember that GANG only does time-slicing within the same partition, while if you combine it with SUSPEND you also get preemption between partitions.

Does that make sense?
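As a toy illustration of the partition_prio tier rule discussed above (this is our own sketch, not Slurm code): preemption is decided purely by comparing partition priority tiers, which is why two partitions at the same priority never preempt each other:

```shell
# Hypothetical helper: preempt/partition_prio allows preemption only when
# the pending job's partition sits in a strictly higher priority tier.
can_preempt() {  # usage: can_preempt <pending_tier> <running_tier>
  if [ "$1" -gt "$2" ]; then echo yes; else echo no; fi
}

can_preempt 2 1   # highmem (tier 2) vs main (tier 1): prints "yes"
can_preempt 1 1   # long vs main, equal tiers: prints "no"
```

Equal tiers is exactly the situation in this ticket, which is why SUSPEND,GANG fell back to starting jobs in both partitions instead of preempting.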
Thank you, I do understand. It looks like we are okay now. You may close the ticket.
Thanks Deb, closing the issue.