| Summary: | Question about --gpus-per-task | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Jonas Stare <jonst> |
| Component: | Configuration | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | cinek |
| Version: | 20.02.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | SNIC | Slinky Site: | --- |
| SNIC sites: | NSC | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | sigma |
| Attachments: | slurm.conf for sigma | ||
|
Description
Jonas Stare
2020-10-30 09:19:43 MDT
Jonas,
I tried to reproduce it with:
># sbatch --mem=10 --gpus-per-task=2 -n1 -w test01 --wrap="sleep 100"
>Submitted batch job 54114
># sbatch --mem=10 --gpus-per-task=2 -n1 -w test01 --wrap="sleep 100"
>Submitted batch job 54115
># squeue
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> 54114 AllNodes wrap root R 0:01 1 test01
> 54115 AllNodes wrap root R 0:01 1 test01
># scontrol show node test01
>NodeName=test01 Arch=x86_64 CoresPerSocket=16
> CPUAlloc=8 CPUTot=128 CPULoad=0.01
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=gpu:4(S:0-1)
> NodeAddr=slurmctl NodeHostName=slurmctl Port=30001 Version=20.02.5
> OS=Linux 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019
> RealMemory=900 AllocMem=20 FreeMem=42 Sockets=2 Boards=1
> State=MIXED ThreadsPerCore=4 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=AllNodes
> BootTime=2020-11-02T09:32:25 SlurmdStartTime=2020-11-02T14:43:21
> CfgTRES=cpu=128,mem=900M,billing=128,gres/gpu=4
> AllocTRES=cpu=8,mem=20M,gres/gpu=4
> CapWatts=n/a
> CurrentWatts=0 AveWatts=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Did I understand your description correctly? Could you please share some commands results with reproducer?
cheers,
Marcin
Yes. Here is an example where we try to start two jobs on the same node (which has 4 GPUs), each asking for what I would expect to be 2 GPUs.

First job:

> [jonst@sign ~]$ srun -n1 -t10 --gpus-per-task=v100:2 -Ansc --reservation=gpu --pty -E /bin/bash -l
> [jonst@n2017 ~]$ scontrol show job $SLURM_JOBID
> JobId=1075339 JobName=bash
>    UserId=jonst(1041) GroupId=jonst(1041) MCS_label=N/A
>    Priority=1000164420 Nice=0 Account=nsc QOS=nsc
>    JobState=RUNNING Reason=None Dependency=(null)
>    Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>    RunTime=00:00:53 TimeLimit=00:10:00 TimeMin=N/A
>    SubmitTime=2020-11-02T17:05:42 EligibleTime=2020-11-02T17:05:42
>    AccrueTime=Unknown
>    StartTime=2020-11-02T17:05:42 EndTime=2020-11-02T17:15:46 Deadline=N/A
>    PreemptEligibleTime=2020-11-02T17:05:42 PreemptTime=None
>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-11-02T17:05:42
>    Partition=sigma AllocNode:Sid=sign:215598
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=n2017
>    BatchHost=n2017
>    NumNodes=1 NumCPUs=18 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=18,mem=52272M,node=1,billing=18
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
>    MinCPUsNode=1 MinMemoryCPU=2904M MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    Reservation=gpu
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/bin/bash
>    WorkDir=/home/jonst
>    Power=
>    TresPerTask=gpu:v100:2
>    MailUser=(null) MailType=NONE

nvidia-smi shows 2 GPUs:

> [jonst@n2017 ~]$ nvidia-smi
> Mon Nov  2 17:09:55 2020
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |===============================+======================+======================|
> |   0  Tesla V100-SXM2...  On   | 00000000:61:00.0 Off |                    0 |
> | N/A   40C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   1  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
> | N/A   40C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID   Type   Process name                             Usage      |
> |=============================================================================|
> |  No running processes found                                                 |
> +-----------------------------------------------------------------------------+

But the job has been allocated 4?

> [jonst@n2017 ~]$ sacct -j $SLURM_JOBID --format=JobID,Start,END,ReqGRES%20,ReqTRES%40,AllocGRES,AllocTRES%40
> JobID        Start               End                 ReqGRES              ReqTRES                           AllocGRES  AllocTRES
> ------------ ------------------- ------------------- -------------------- --------------------------------- ---------- -----------------------------------
> 1075339      2020-11-02T17:05:42 Unknown             PER_TASK:gpu:v100:2  billing=1,cpu=1,mem=2904M,node=1  gpu:4      billing=18,cpu=18,mem=52272M,node=1
> 1075339.ext+ 2020-11-02T17:05:42 Unknown             PER_TASK:gpu:v100:2                                    gpu:4      billing=18,cpu=18,mem=52272M,node=1
> 1075339.0    2020-11-02T17:05:46 Unknown             PER_TASK:gpu:v100:2                                    gpu:4      cpu=1,mem=0,node=1

Trying to start a second job (using -w to place it on the same node):

> [jonst@sign ~]$ srun -n1 -t10 --gpus-per-task=v100:2 -Ansc --reservation=gpu --pty -E -w n2017 /bin/bash -l
> srun: job 1075340 queued and waiting for resources

The second job gets stuck pending, but starts as soon as the first job exits.

> [jonst@n2017 ~]$ squeue -u jonst
>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> 1075340     sigma     bash    jonst PD       0:00      1 (Resources)
> 1075339     sigma     bash    jonst  R       3:55      1 n2017

Jonas,

Could you please attach your configuration files?
I took a look at other bugs, but I can't find a config with a sigma partition. Is this for "tetralith" or another machine? Does it look reservation-related, or does it happen on an empty node as well? Could you please share the command used to create the reservation?

cheers,
Marcin

Created attachment 16471 [details]
slurm.conf for sigma
This is what the reservation looks like.
> ReservationName=gpu StartTime=2020-09-14T12:54:44 EndTime=2030-07-24T12:54:44 Duration=3600-00:00:00
> Nodes=n[2017-2018] NodeCnt=2 CoreCnt=72 Features=(null) PartitionName=(null) Flags=SPEC_NODES
> TRES=cpu=72
> Users=(null) Accounts=nsc,liu-gpu-2020-1,liu-gpu-2020-2,liu-gpu-2020-3 Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
> MaxStartDelay=(null)
We are probably misusing reservations a bit. The GPU nodes should probably have been in their own partition, but the way we create users only supports one partition per cluster (at the moment).
Jonas,

I think I reproduced the issue, and it looks like it's related to --cpus-per-gpu being set from the slurm.conf default. This may be a duplicate of Bug 9947, where we already have a patch in the QA process. Are you able to apply it and verify whether it fixes the issue for you? (attachment 16578 [details])

Alternatively, can you add an explicit --cpus-per-gpu=9 to your srun calls and check if it works correctly?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #6)
> Jonas,
>
> I think I reproduced the issue and it looks like it's related to
> --cpus-per-gpu being set from slurm.conf default. This may be a duplicate of
> a Bug 9947 where we already have a patch in QA process, are you able to
> apply it and verify if it will fix the issue for you?(attachment 16578 [details])
>
> Alternatively, Can you add direct --cpus-per-gpu=9 to your srun calls and
> check if it works correctly?
>
> cheers,
> Marcin

My colleague tested it, and it seems like this was the problem. I saw that there was a patch for DefCpuPerGPU on the 20.02 branch; is that the one that fixes this?

https://github.com/SchedMD/slurm/commit/0b6faf691c6fb5445fdb01c74daf81ecb87e05db

Yes - the commit you're asking about is exactly the same one I shared in the comment 6 attachment. You should be able to apply it manually, or just wait and upgrade - it will be part of the Slurm 20.02.7 release.

I'm marking the case as a duplicate of the original Bug 9947 now.

cheers,
Marcin

*** This ticket has been marked as a duplicate of ticket 9947 ***
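Editor's note: the allocation numbers in the ticket are self-consistent once a per-GPU CPU default is in play. The sketch below shows the arithmetic, assuming a DefCpuPerGPU=9 setting in slurm.conf; the value 9 is inferred from the job showing NumCPUs=18 for a 2-GPU request (and from the suggested --cpus-per-gpu=9 workaround), not stated verbatim in this ticket.

```shell
# Sketch (assumption, not from the ticket): with DefCpuPerGPU=9 in slurm.conf,
# an `srun -n1 --gpus-per-task=v100:2` request implies 2 GPUs * 9 CPUs/GPU,
# matching the NumCPUs=18 reported by `scontrol show job` above.
gpus=2            # from --gpus-per-task=v100:2
def_cpu_per_gpu=9 # assumed DefCpuPerGPU value
echo "implied CPU allocation: $(( gpus * def_cpu_per_gpu ))"
```

Until the DefCpuPerGPU fix lands (Slurm 20.02.7 per the thread), passing --cpus-per-gpu=9 explicitly on each srun line, as Marcin suggests, sidesteps the buggy default handling on a per-job basis.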