Ticket 9636

Summary: Requesting multiple GPUs via --gpus=2 rejected despite being valid request
Product: Slurm Reporter: Trey Dockendorf <tdockendorf>
Component: GPU Assignee: Director of Support <support>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue    
Priority: --- CC: cinek, felip.moll, tdockendorf, troy
Version: 20.02.4   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=9716
https://bugs.schedmd.com/show_bug.cgi?id=10569
https://bugs.schedmd.com/show_bug.cgi?id=10623
Site: Ohio State OSC
Version Fixed: 20.11.3
Attachments: slurm.conf
gres.conf

Description Trey Dockendorf 2020-08-21 10:36:30 MDT
Created attachment 15539 [details]
slurm.conf

I am unable to submit a job with --gpus=2 to a partition that has MaxNodes=1.  What I've noticed is that if I submit to a partition with MinNodes=2 on the same host with --gpus=2, the request is accepted, but I am allocated 2 nodes with 1 GPU per node rather than the expected 1 node with 2 GPUs.

$ sbatch -w p0302 --gpus=2 -p gpuserial-48core hostname.sbatch 
sbatch: error: Batch job submission failed: Requested partition configuration not available now

$ sbatch -w p0302 --gpus=1 -p gpuserial-48core hostname.sbatch 
Submitted batch job 14116

$ sbatch --gpus=2 -p gpuparallel-48core hostname.sbatch 
Submitted batch job 14117

$ scontrol show job=14117
JobId=14117 JobName=hostname.sbatch
   UserId=tdockendorf(20821) GroupId=PZS0708(5509) MCS_label=N/A
   Priority=100047874 Nice=0 Account=pzs0708 QOS=pitzer-all
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2020-08-21T12:33:30 EligibleTime=2020-08-21T12:33:30
   AccrueTime=2020-08-21T12:33:30
   StartTime=2020-08-21T12:33:32 EndTime=2020-08-21T12:33:32 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-08-21T12:33:32
   Partition=gpuparallel-48core AllocNode:Sid=pitzer-rw01:128863
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=p[0301-0302]
   BatchHost=p0301
   NumNodes=2 NumCPUs=96 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=96,node=2,billing=96,gres/gpfs:ess=0,gres/gpfs:project=0,gres/gpfs:scratch=0,gres/gpu=4,gres/gpu:v100-32g=4,gres/ime=0,gres/pfsdir=0,gres/pfsdir:ess=0,gres/pfsdir:scratch=0
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4556M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/users/sysp/tdockendorf/slurm-tests/hostname.sbatch
   WorkDir=/users/sysp/tdockendorf/slurm-tests
   Comment=stdout=/users/sysp/tdockendorf/slurm-tests/output/hostname-14117.out 
   StdErr=/users/sysp/tdockendorf/slurm-tests/output/hostname-14117.out
   StdIn=/dev/null
   StdOut=/users/sysp/tdockendorf/slurm-tests/output/hostname-14117.out
   Power=
   TresPerJob=gpu:2
   MailUser=(null) MailType=NONE


$ cat hostname.sbatch 
#!/bin/bash
#SBATCH -t 00:05:00
#SBATCH -o output/hostname-%j.out

env | sort
echo "SLURM_NODELIST"
echo $SLURM_NODELIST
echo "hostname"
hostname

$ scontrol show partition=gpuserial-48core
PartitionName=gpuserial-48core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-gpuserial-partition
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=48
   Nodes=p03[01-42]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2016 TotalNodes=42 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerCPU=9293

$ scontrol show partition=gpuparallel-48core
PartitionName=gpuparallel-48core
   AllowGroups=ALL DenyAccounts=pcon0060,pcon0003,pcon0014,pcon0015,pcon0016,pcon0008,pcon0010,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041,pcon0009,pcon0020,pcon0022,pcon0023,pcon0024,pcon0025,pcon0026,pcon0040,pcon0041 AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=pitzer-gpuparallel-partition
   DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=4-00:00:00 MinNodes=2 LLN=NO MaxCPUsPerNode=48
   Nodes=p03[01-42]
   PriorityJobFactor=1000 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2016 TotalNodes=42 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerCPU=9293

$ scontrol show node=p0302
NodeName=p0302 Arch=x86_64 CoresPerSocket=24 
   CPUAlloc=0 CPUTot=48 CPULoad=0.19
   AvailableFeatures=48core,expansion,exp,r740,gpu,eth-pitzer-rack09h1,ib-i4l1s12,ib-i4,pitzer-rack08,v100-32g
   ActiveFeatures=48core,expansion,exp,r740,gpu,eth-pitzer-rack09h1,ib-i4l1s12,ib-i4,pitzer-rack08,v100-32g
   Gres=gpu:v100-32g:2(S:0-1),pfsdir:scratch:1,pfsdir:ess:1,ime:1,gpfs:project:1,gpfs:scratch:1,gpfs:ess:1
   NodeAddr=10.4.8.2 NodeHostName=p0302 Version=20.02.4
   OS=Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Wed Feb 12 14:08:31 UTC 2020 
   RealMemory=371712 AllocMem=0 FreeMem=376018 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=4 Owner=N/A MCS_label=N/A
   Partitions=batch,gpubackfill-parallel,gpubackfill-serial,gpudebug,gpuparallel,gpuparallel-48core,gpuserial,gpuserial-48core,systems 
   BootTime=2020-08-19T16:22:31 SlurmdStartTime=2020-08-19T16:23:51
   CfgTRES=cpu=48,mem=363G,billing=48,gres/gpfs:ess=1,gres/gpfs:project=1,gres/gpfs:scratch=1,gres/gpu=2,gres/gpu:v100-32g=2,gres/ime=1,gres/pfsdir=2,gres/pfsdir:ess=1,gres/pfsdir:scratch=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

# sacctmgr show qos --parsable
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|
pitzer-all|0|00:00:00|||cluster|||1.000000|cpu=29856||||||||||||1400||665|||
pitzer-default|0|00:00:00|||cluster|||1.000000|||||||||||||1000|cpu=2040|384|||
pitzer-override-tres|0|00:00:00|||cluster|||1.000000|||||||||||||1000||384|||
pitzer-gpuserial-partition|0|00:00:00|||cluster|DenyOnLimit||1.000000|||||||gres/gpu=4||||||||||gres/gpu=1|
pitzer-gpuparallel-partition|0|00:00:00|||cluster|DenyOnLimit||1.000000||||||||gres/gpu=4|||||||||gres/gpu=1|
pitzer-hugemem-partition|0|00:00:00|||cluster|DenyOnLimit||1.000000|||||||||||||||||mem=754G|
debug|0|00:00:00|||cluster|DenyOnLimit||1.000000||||||||||||1||||||
gpudebug|0|00:00:00|||cluster|DenyOnLimit||1.000000||||||||||||1|||||gres/gpu=1|
pitzer-datta|0|00:00:00|||cluster|||1.000000|node=44|||||||||||||||||
pitzer-largemem-partition|0|00:00:00|||cluster|DenyOnLimit||1.000000|||||||||||||||||mem=363G|
Comment 1 Trey Dockendorf 2020-08-21 10:36:45 MDT
Created attachment 15540 [details]
gres.conf
Comment 2 Michael Hinton 2020-08-21 16:26:56 MDT
Hi Trey,

It appears I'm able to reproduce this. It looks like MaxNodes=1 is what's tripping things up: when I submit to a partition without a MaxNodes limit, I can get a one-node job with 2 GPUs. I will try to get to the bottom of this and get back to you.

Have you observed this behavior in 19.05?

Thanks,
-Michael
Comment 4 Trey Dockendorf 2020-08-24 06:10:32 MDT
We have not installed or tested 19.05 because our SLURM install is new and we started on 20.02.
Comment 5 Michael Hinton 2020-08-26 10:23:05 MDT
I'm able to reproduce this on 19.05.

As a workaround, I noticed that if you replace --gpus=2 with --gres=gpu:2, it works as expected with MaxNodes=1. The difference is that --gpus=2 requests 2 GPUs for the whole job, while --gres=gpu:2 requests 2 GPUs on each node of the job. When the job is limited to a single node, the two are effectively the same.

It looks like some edge case in the logic. Hopefully, this turns out to be an easy fix.
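For anyone else hitting this before a fix lands, the workaround submission would look like this (node, partition, and script names taken from the reproduction above):

```shell
# Workaround: request GPUs per node (--gres=gpu:N) instead of per job (--gpus=N).
# On a MaxNodes=1 partition the two forms describe the same single-node allocation.
sbatch -w p0302 --gres=gpu:2 -p gpuserial-48core hostname.sbatch

# The per-job form is the one rejected by this bug:
# sbatch -w p0302 --gpus=2 -p gpuserial-48core hostname.sbatch
```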
Comment 6 Michael Hinton 2020-08-26 10:29:55 MDT
(In reply to Trey Dockendorf from comment #0)
> What I've noticed is if I submit to a partition with MinNodes=2 on
> the same host with --gpus=2, the request is accepted but I am allocated 2
> nodes with 1 GPU per node rather than the expected 1 node with 2 GPUs.
Because MinNodes=2, I would expect no fewer than 2 nodes. So this case appears to be working as expected.
Comment 7 Trey Dockendorf 2020-08-26 10:34:53 MDT
Switching to --gres=gpu:2 isn't really a good solution for us if this is a bug; I don't want to have to retrain our thousands of users once the bug is fixed.  We are still in the testing phase for our SLURM install. We begin letting early users test SLURM next week and go into production on October 1st, so ideally a patch would come sooner rather than later.
Comment 32 Michael Hinton 2021-01-15 12:13:25 MST
Hi Trey,

Thanks for the report. This has finally been fixed with commit ba353b8c13 and will make it into 20.11.3. See https://github.com/SchedMD/slurm/commit/ba353b8c13d0f523a84ef9fa522ff2980929e01c.
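Once a site is on 20.11.3 or later, rerunning the original reproduction (names from the report above) should be the quickest confirmation:

```shell
# Confirm the running Slurm version includes the fix (20.11.3 or later):
scontrol --version

# Then rerun the originally rejected request; it should now be accepted:
sbatch -w p0302 --gpus=2 -p gpuserial-48core hostname.sbatch
```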

I'll go ahead and close this out. Feel free to reopen if this does not fix things for you.

Thanks!
-Michael