We upgraded to Slurm 20.02 roughly a month ago on a cluster and began using the cons_tres plugin. When a user requests an allocation using --gpus-per-task and the number of tasks is greater than one, all GPUs on the node are allocated. The extra allocated GPUs are not available in the user's job; they are simply unavailable for anyone else to use. Using --gpus or --gpus-per-node does not appear to have the same issue. Is there something that can be done to prevent these GPUs from idling?
Jonas,

I tried to reproduce it with:

># sbatch --mem=10 --gpus-per-task=2 -n1 -w test01 --wrap="sleep 100"
>Submitted batch job 54114
># sbatch --mem=10 --gpus-per-task=2 -n1 -w test01 --wrap="sleep 100"
>Submitted batch job 54115
># squeue
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> 54114 AllNodes wrap root R 0:01 1 test01
> 54115 AllNodes wrap root R 0:01 1 test01
># scontrol show node test01
>NodeName=test01 Arch=x86_64 CoresPerSocket=16
> CPUAlloc=8 CPUTot=128 CPULoad=0.01
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=gpu:4(S:0-1)
> NodeAddr=slurmctl NodeHostName=slurmctl Port=30001 Version=20.02.5
> OS=Linux 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019
> RealMemory=900 AllocMem=20 FreeMem=42 Sockets=2 Boards=1
> State=MIXED ThreadsPerCore=4 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=AllNodes
> BootTime=2020-11-02T09:32:25 SlurmdStartTime=2020-11-02T14:43:21
> CfgTRES=cpu=128,mem=900M,billing=128,gres/gpu=4
> AllocTRES=cpu=8,mem=20M,gres/gpu=4
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Did I understand your description correctly? Could you please share some command results with a reproducer?

cheers,
Marcin
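As a side note, one way to watch for the symptom is to compare the configured and allocated GPU counts in `scontrol show node` output. Here is a minimal sketch of that check; the here-string stands in for live `scontrol` output (no cluster is assumed here), and on a real system you would pipe `scontrol show node <node>` straight into the awk:

```shell
# Sketch: extract GPU counts from captured `scontrol show node` output.
# The variable below stands in for the live command (assumption: no
# Slurm cluster available); the values are copied from the output above.
scontrol_out='   CfgTRES=cpu=128,mem=900M,billing=128,gres/gpu=4
   AllocTRES=cpu=8,mem=20M,gres/gpu=4'

printf '%s\n' "$scontrol_out" | awk -F'gres/gpu=' '
  /CfgTRES/   { cfg = $2 + 0 }     # GPUs configured on the node
  /AllocTRES/ { alloc = $2 + 0 }   # GPUs currently allocated
  END { printf "configured=%d allocated=%d\n", cfg, alloc }'
# → configured=4 allocated=4
```

If a single 2-GPU job shows allocated=4 here, the node has been over-allocated as described in the report.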
Yes. Here is an example where we try starting two jobs, each asking for (what I would expect to be) 2 GPUs, on the same node that has 4 GPUs.

First job:

> [jonst@sign ~]$ srun -n1 -t10 --gpus-per-task=v100:2 -Ansc --reservation=gpu --pty -E /bin/bash -l
> [jonst@n2017 ~]$ scontrol show job $SLURM_JOBID
> JobId=1075339 JobName=bash
> UserId=jonst(1041) GroupId=jonst(1041) MCS_label=N/A
> Priority=1000164420 Nice=0 Account=nsc QOS=nsc
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
> RunTime=00:00:53 TimeLimit=00:10:00 TimeMin=N/A
> SubmitTime=2020-11-02T17:05:42 EligibleTime=2020-11-02T17:05:42
> AccrueTime=Unknown
> StartTime=2020-11-02T17:05:42 EndTime=2020-11-02T17:15:46 Deadline=N/A
> PreemptEligibleTime=2020-11-02T17:05:42 PreemptTime=None
> SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-11-02T17:05:42
> Partition=sigma AllocNode:Sid=sign:215598
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=n2017
> BatchHost=n2017
> NumNodes=1 NumCPUs=18 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=18,mem=52272M,node=1,billing=18
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
> MinCPUsNode=1 MinMemoryCPU=2904M MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> Reservation=gpu
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/bin/bash
> WorkDir=/home/jonst
> Power=
> TresPerTask=gpu:v100:2
> MailUser=(null) MailType=NONE

nvidia-smi shows 2 GPUs:

> [jonst@n2017 ~]$ nvidia-smi
> Mon Nov 2 17:09:55 2020
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
> |-------------------------------+----------------------+----------------------+
> | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
> | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
> |===============================+======================+======================|
> | 0 Tesla V100-SXM2... On | 00000000:61:00.0 Off | 0 |
> | N/A 40C P0 41W / 300W | 0MiB / 32510MiB | 0% Default |
> +-------------------------------+----------------------+----------------------+
> | 1 Tesla V100-SXM2... On | 00000000:62:00.0 Off | 0 |
> | N/A 40C P0 41W / 300W | 0MiB / 32510MiB | 0% Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes: GPU Memory |
> | GPU PID Type Process name Usage |
> |=============================================================================|
> | No running processes found |
> +-----------------------------------------------------------------------------+

But the job has been allocated 4?

> [jonst@n2017 ~]$ sacct -j $SLURM_JOBID --format=JobID,Start,END,ReqGRES%20,ReqTRES%40,AllocGRES,AllocTRES%40
> JobID Start End ReqGRES ReqTRES AllocGRES AllocTRES
> ------------ ------------------- ------------------- -------------------- ---------------------------------------- ------------ ----------------------------------------
> 1075339 2020-11-02T17:05:42 Unknown PER_TASK:gpu:v100:2 billing=1,cpu=1,mem=2904M,node=1 gpu:4 billing=18,cpu=18,mem=52272M,node=1
> 1075339.ext+ 2020-11-02T17:05:42 Unknown PER_TASK:gpu:v100:2 gpu:4 billing=18,cpu=18,mem=52272M,node=1
> 1075339.0 2020-11-02T17:05:46 Unknown PER_TASK:gpu:v100:2 gpu:4 cpu=1,mem=0,node=1

Trying to start a second job (using -w to place it on the same node):

> [jonst@sign ~]$ srun -n1 -t10 --gpus-per-task=v100:2 -Ansc --reservation=gpu --pty -E -w n2017 /bin/bash -l
> srun: job 1075340 queued and waiting for resources

The job gets stuck pending but starts as soon as the first job exits:

> [jonst@n2017 ~]$ squeue -u jonst
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> 1075340 sigma bash jonst PD 0:00 1 (Resources)
> 1075339 sigma bash jonst R 3:55 1 n2017
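The discrepancy above — ReqGRES says PER_TASK:gpu:v100:2 while AllocGRES says gpu:4 — can be pulled out of a sacct row mechanically. A minimal sketch, with the row inlined as a stand-in for live `sacct` output (assumption: no cluster available here); the field positions are simplified from the wide sacct format:

```shell
# Sketch: compare requested vs allocated GPU counts from a captured
# sacct row (assumption: the row is inlined rather than fetched live;
# only the three fields we parse are kept).
sacct_row='1075339 PER_TASK:gpu:v100:2 gpu:4'

# Requested GPUs per task: last colon-separated field of the ReqGRES spec.
req=$(printf '%s\n' "$sacct_row" | grep -o 'PER_TASK:gpu:[^ ]*' | awk -F: '{print $NF}')
# Allocated GPUs: the standalone " gpu:N" AllocGRES field.
alloc=$(printf '%s\n' "$sacct_row" | grep -o ' gpu:[0-9]*' | awk -F: '{print $NF}')

if [ "$req" -ne "$alloc" ]; then
    echo "mismatch: requested $req GPU(s) per task, allocated $alloc"
fi
# → mismatch: requested 2 GPU(s) per task, allocated 4
```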
Jonas,

Could you please attach your configuration files? I took a look at other bugs, but I can't find a config with a sigma partition. Is this for "tetralith" or another machine?

Does it look reservation-related, or does it happen on an empty node as well? Could you please share the command used to create the reservation?

cheers,
Marcin
Created attachment 16471 [details]
slurm.conf for sigma
This is what the reservation looks like:

> ReservationName=gpu StartTime=2020-09-14T12:54:44 EndTime=2030-07-24T12:54:44 Duration=3600-00:00:00
> Nodes=n[2017-2018] NodeCnt=2 CoreCnt=72 Features=(null) PartitionName=(null) Flags=SPEC_NODES
> TRES=cpu=72
> Users=(null) Accounts=nsc,liu-gpu-2020-1,liu-gpu-2020-2,liu-gpu-2020-3 Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
> MaxStartDelay=(null)

We are probably misusing reservations a bit. The GPU nodes should probably have been in their own partition, but the way we create users only supports one partition per cluster (at the moment).
Jonas,

I think I reproduced the issue, and it looks like it's related to --cpus-per-gpu being set from the slurm.conf default (DefCpuPerGPU). This may be a duplicate of Bug 9947, where we already have a patch in the QA process — are you able to apply it (attachment 16578 [details]) and verify that it fixes the issue for you?

Alternatively, can you add an explicit --cpus-per-gpu=9 to your srun calls and check if it works correctly?

cheers,
Marcin
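For anyone hitting this, a quick way to check whether a site config carries a DefCpuPerGPU default is to grep slurm.conf. A minimal sketch over an illustrative fragment — the partition line and the value 9 below are assumptions for demonstration, not the actual sigma config, and the fragment stands in for the real /etc/slurm/slurm.conf:

```shell
# Sketch: look for a DefCpuPerGPU default in slurm.conf.
# The fragment below is illustrative (assumed values), standing in
# for the real slurm.conf on an affected cluster.
cat > /tmp/slurm.conf.fragment <<'EOF'
PartitionName=sigma Nodes=n[2017-2018] DefCpuPerGPU=9 State=UP
EOF

grep -o 'DefCpuPerGPU=[0-9]*' /tmp/slurm.conf.fragment
# → DefCpuPerGPU=9
```

If the option is present, affected 20.02 releases may mis-size GPU allocations for --gpus-per-task jobs unless --cpus-per-gpu is passed explicitly, per the diagnosis above.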
(In reply to Marcin Stolarek from comment #6)
> Jonas,
>
> I think I reproduced the issue and it looks like it's related to
> --cpus-per-gpu being set from slurm.conf default. This may be a duplicate of
> a Bug 9947 where we already have a patch in QA process, are you able to
> apply it and verify if it will fix the issue for you? (attachment 16578 [details])
>
> Alternatively, Can you add direct --cpus-per-gpu=9 to your srun calls and
> check if it works correctly?
>
> cheers,
> Marcin

My colleague tested this, and it seems that was indeed the problem. I saw that there is a patch for DefCpuPerGPU on the 20.02 branch; is that the one that fixes this?

https://github.com/SchedMD/slurm/commit/0b6faf691c6fb5445fdb01c74daf81ecb87e05db
Yes - the commit you're asking about is exactly the same one I shared in the comment 6 attachment. You should be able to apply it manually, or just wait and upgrade - it will be part of the Slurm 20.02.7 release.

I'm marking this case as a duplicate of the original Bug 9947 now.

cheers,
Marcin

*** This ticket has been marked as a duplicate of ticket 9947 ***