Ticket 11083

Summary: Erratic GPU allocation
Product: Slurm
Reporter: Greg Wickham <greg.wickham>
Component: Scheduling
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED
QA Contact: ---
Severity: 4 - Minor Issue
Priority: ---
CC: ahmed.mazaty, alex, bart, bas.vandervlies, kilian
Version: 20.11.2
Hardware: Linux
OS: Linux
Site: KAUST
Version Fixed: 20.11.6
Attachments: slurm.conf
gres.conf
Fragments related to two jobs from Slurmctld
Testing submission - slurmctld log
Partitions.conf

Description Greg Wickham 2021-03-15 01:46:13 MDT
Allocating resources with:

    srun --gpus-per-task=1 --ntasks=2 --nodes=2 --time 00:10:00 --pty /bin/bash -i

has produced different GPU allocations across otherwise identical submissions:

JobID|AllocTRES
14726565|billing=8,cpu=8,gres/gpu=3,mem=16G,node=2
14726565.extern|billing=8,cpu=8,gres/gpu=3,mem=16G,node=2
14726565.0|cpu=2,gres/gpu:gtx1080ti=2,gres/gpu=2,mem=0,node=2
14733058|billing=8,cpu=8,gres/gpu=5,mem=16G,node=2
14733058.extern|billing=8,cpu=8,gres/gpu=5,mem=16G,node=2
14733058.0|cpu=2,gres/gpu:gtx1080ti=2,gres/gpu=2,mem=0,node=2


Job # 14726565 was allocated 3 GPUs
Job # 14733058 was allocated 5 GPUs

Expected behavior is 1 task on each node, with each task being allocated 1 GPU.

   -Greg
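The AllocTRES table above is in sacct's parsable format; a query along these lines (the exact invocation here is an assumption, though the same format string appears in comment 7) reproduces it:

    $ sacct -j 14726565,14733058 -P --format=jobid,alloctres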
Comment 1 Ahmed Essam ElMazaty 2021-03-15 02:36:12 MDT
Created attachment 18426 [details]
slurm.conf
Comment 2 Ahmed Essam ElMazaty 2021-03-15 02:36:40 MDT
Created attachment 18427 [details]
gres.conf
Comment 5 Marcin Stolarek 2021-03-16 02:38:37 MDT
Could you please set SlurmctldDebug to at least verbose, enable the Gres debug flag, and share slurmctld logs from the time the jobs are submitted and started?

cheers,
Marcin
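The requested change amounts to a slurm.conf fragment along these lines (a sketch; any DebugFlags already set at the site would need to be preserved):

    # slurm.conf — raise controller log verbosity and enable GRES debugging
    SlurmctldDebug=verbose
    DebugFlags=Gres

The same settings can also be applied to a running slurmctld without a restart, via `scontrol setdebug verbose` and `scontrol setdebugflags +gres`.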
Comment 6 Greg Wickham 2021-03-16 07:53:29 MDT
Created attachment 18468 [details]
Fragments related to two jobs from Slurmctld
Comment 7 Greg Wickham 2021-03-16 08:02:03 MDT
Created attachment 18469 [details]
Testing submission - slurmctld log

$ srun --gpus-per-task=1 --ntasks=2 --nodes=2 --time 00:10:00 --pty /bin/bash -i
srun: job 590 queued and waiting for resources
srun: job 590 has been allocated resources


$ scontrol show -d job=590
JobId=590 JobName=bash
   UserId=wickhagj(100302) GroupId=g-wickhagj(1100302) MCS_label=N/A
   Priority=889 Nice=0 Account=root QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:13 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2021-03-16T16:55:24 EligibleTime=2021-03-16T16:55:24
   AccrueTime=2021-03-16T16:55:24
   StartTime=2021-03-16T16:55:24 EndTime=2021-03-16T17:05:24 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-03-16T16:55:24
   Partition=batch AllocNode:Sid=slurm-02:2418
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=dgpu502-[29,33]
   BatchHost=dgpu502-29
   NumNodes=2 NumCPUs=8 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=16G,node=2,billing=8,gres/gpu=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=gpu:8
     Nodes=dgpu502-[29,33] CPU_IDs=0-3 Mem=8192 GRES=gpu:4(IDX:0-3)
   MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=nolmem DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/home/wickhagj
   Power=
   CpusPerTres=gpu:4
   TresPerTask=gpu:1
   NtasksPerTRES:0

$ sacct -j 590 -P --format=jobid,alloctres
JobID|AllocTRES
590|billing=8,cpu=8,gres/gpu=8,mem=16G,node=2
590.extern|billing=8,cpu=8,gres/gpu=8,mem=16G,node=2
590.0|cpu=2,gres/gpu:gtx1080ti=2,gres/gpu=2,mem=0,node=2
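The job-level record again shows more GPUs (8) than the expected ntasks × gpus-per-task (2 × 1 = 2). A hypothetical helper to flag such mismatches from parsable sacct output, using the data above as a sample:

```python
# Hypothetical helper: parse "sacct -P --format=jobid,alloctres" output and
# flag top-level job records whose allocated GPU count differs from the
# expected ntasks * gpus-per-task.

SAMPLE = """JobID|AllocTRES
590|billing=8,cpu=8,gres/gpu=8,mem=16G,node=2
590.extern|billing=8,cpu=8,gres/gpu=8,mem=16G,node=2
590.0|cpu=2,gres/gpu:gtx1080ti=2,gres/gpu=2,mem=0,node=2"""

def gpu_count(alloc_tres: str) -> int:
    """Return the gres/gpu count from an AllocTRES string, 0 if absent."""
    for field in alloc_tres.split(","):
        key, _, value = field.partition("=")
        if key == "gres/gpu":
            return int(value)
    return 0

def mismatched_jobs(sacct_output: str, expected_gpus: int):
    """Yield (jobid, allocated) for job records with unexpected GPU totals."""
    for line in sacct_output.strip().splitlines()[1:]:  # skip header row
        jobid, alloc = line.split("|", 1)
        if "." in jobid:  # skip .extern and step records like 590.0
            continue
        got = gpu_count(alloc)
        if got != expected_gpus:
            yield jobid, got

print(list(mismatched_jobs(SAMPLE, expected_gpus=2)))  # → [('590', 8)]
```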
Comment 8 Dominik Bartkiewicz 2021-03-16 08:08:58 MDT
Hi

Could you send us partitions.conf?

Dominik
Comment 9 Greg Wickham 2021-03-16 08:10:50 MDT
The full debug logs will be uploaded tomorrow.
Comment 10 Greg Wickham 2021-03-16 08:11:24 MDT
Created attachment 18470 [details]
Partitions.conf
Comment 11 Dominik Bartkiewicz 2021-03-16 09:04:10 MDT
Hi

I can reproduce this issue and will let you know when a fix is available.

Dominik
Comment 18 Dominik Bartkiewicz 2021-03-31 09:41:49 MDT
Hi

This commit should fix the issue; it will be included in Slurm 20.11.6 and later:
https://github.com/SchedMD/slurm/commit/bdf66674f9e0f03

Dominik
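Since the fix lands in 20.11.6, a site could gate any workaround on the running version. A minimal sketch of that comparison, assuming version strings as reported by `sinfo --version` (e.g. "slurm 20.11.2"):

```python
# Minimal sketch: compare a Slurm version string against the release that
# carries the fix (20.11.6), using numeric tuples so that e.g. 20.11.10
# correctly sorts after 20.11.6.

FIX_VERSION = (20, 11, 6)

def parse_version(version: str) -> tuple:
    """Parse '20.11.2' (or 'slurm 20.11.2') into an integer tuple."""
    return tuple(int(p) for p in version.split()[-1].split("."))

def has_gpu_fix(version: str) -> bool:
    """True if the given Slurm release includes the fix from this ticket."""
    return parse_version(version) >= FIX_VERSION

print(has_gpu_fix("slurm 20.11.2"))  # → False
print(has_gpu_fix("slurm 20.11.6"))  # → True
```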
Comment 19 Dominik Bartkiewicz 2021-04-02 04:35:50 MDT
Hi

Is there anything else I can do to help or are you ok to close this ticket?

Dominik
Comment 20 Greg Wickham 2021-04-04 00:37:43 MDT
Hi Dominik,

If the bug has been resolved, the ticket can be closed.

thanks,

   -greg
Comment 22 Greg Wickham 2021-04-28 11:28:13 MDT
We upgraded to 20.11.6 today and it's working great.

Thanks Dominik.

   -Greg