| Summary: | Erratic GPU allocation | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Greg Wickham <greg.wickham> |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | ahmed.mazaty, alex, bart, bas.vandervlies, kilian |
| Version: | 20.11.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | KAUST | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 20.11.6 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, gres.conf, Fragments related to two jobs from Slurmctld, Testing submission - slurmctld log, Partitions.conf | | |
Created attachment 18426 [details]
slurm.conf
Created attachment 18427 [details]
gres.conf
Could you please set SlurmctldDebug to at least verbose, enable the GRES debug flag, and share slurmctld logs from the time when the jobs are submitted and started?

cheers,
Marcin

Created attachment 18468 [details]
Fragments related to two jobs from Slurmctld
Created attachment 18469 [details]
Testing submission - slurmctld log
$ srun --gpus-per-task=1 --ntasks=2 --nodes=2 --time 00:10:00 --pty /bin/bash -i
srun: job 590 queued and waiting for resources
srun: job 590 has been allocated resources
$ scontrol show -d job=590
JobId=590 JobName=bash
UserId=wickhagj(100302) GroupId=g-wickhagj(1100302) MCS_label=N/A
Priority=889 Nice=0 Account=root QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:13 TimeLimit=00:10:00 TimeMin=N/A
SubmitTime=2021-03-16T16:55:24 EligibleTime=2021-03-16T16:55:24
AccrueTime=2021-03-16T16:55:24
StartTime=2021-03-16T16:55:24 EndTime=2021-03-16T17:05:24 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-03-16T16:55:24
Partition=batch AllocNode:Sid=slurm-02:2418
ReqNodeList=(null) ExcNodeList=(null)
NodeList=dgpu502-[29,33]
BatchHost=dgpu502-29
NumNodes=2 NumCPUs=8 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=16G,node=2,billing=8,gres/gpu=8
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
JOB_GRES=gpu:8
Nodes=dgpu502-[29,33] CPU_IDs=0-3 Mem=8192 GRES=gpu:4(IDX:0-3)
MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
Features=nolmem DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/bin/bash
WorkDir=/home/wickhagj
Power=
CpusPerTres=gpu:4
TresPerTask=gpu:1
NtasksPerTRES:0
$ sacct -j 590 -P --format=jobid,alloctres
JobID|AllocTRES
590|billing=8,cpu=8,gres/gpu=8,mem=16G,node=2
590.extern|billing=8,cpu=8,gres/gpu=8,mem=16G,node=2
590.0|cpu=2,gres/gpu:gtx1080ti=2,gres/gpu=2,mem=0,node=2
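The mismatch above can be checked mechanically: `--gpus-per-task=1` with `--ntasks=2` should yield 2 GPUs in total, yet the job-level AllocTRES reports `gres/gpu=8`. Below is a minimal illustrative Python sketch (the helper names are mine, not part of any Slurm tooling) that computes the expected count and parses the actual count out of an AllocTRES string:

```python
def expected_gpus(ntasks: int, gpus_per_task: int) -> int:
    """--gpus-per-task is per task, so the total should be ntasks * gpus_per_task."""
    return ntasks * gpus_per_task

def allocated_gpus(alloctres: str) -> int:
    """Extract the gres/gpu count from an AllocTRES string such as
    'billing=8,cpu=8,gres/gpu=8,mem=16G,node=2' (sacct -P output)."""
    for field in alloctres.split(","):
        key, _, value = field.partition("=")
        if key == "gres/gpu":
            return int(value)
    return 0

# Job 590 from the scontrol/sacct output above:
tres = "billing=8,cpu=8,gres/gpu=8,mem=16G,node=2"
print(expected_gpus(ntasks=2, gpus_per_task=1))  # → 2 (what was requested)
print(allocated_gpus(tres))                      # → 8 (what was allocated)
```

The typed GRES field of a step (e.g. `gres/gpu:gtx1080ti=2`) deliberately does not match the plain `gres/gpu` key, so only the untyped total is counted.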
Hi

Could you send us partitions.conf?

Dominik

The full debug logs will be uploaded tomorrow.

Created attachment 18470 [details]
Partitions.conf

Hi

I can recreate this issue. I will let you know when the fix is available.

Dominik

Hi

This commit should fix the issue. It will be available in Slurm 20.11.6 and above.

https://github.com/SchedMD/slurm/commit/bdf66674f9e0f03

Dominik

Hi

Is there anything else I can do to help, or are you OK to close this ticket?

Dominik

Hi Dominik,

If the bug has been resolved, the ticket can be closed.

thanks,
-greg

We upgraded to 20.11.6 today and it's working great. Thanks Dominik.

-Greg
Allocating resources with:

srun --gpus-per-task=1 --ntasks=2 --nodes=2 --time 00:10:00 --pty /bin/bash -i

has resulted in different allocations:

JobID|AllocTRES
14726565|billing=8,cpu=8,gres/gpu=3,mem=16G,node=2
14726565.extern|billing=8,cpu=8,gres/gpu=3,mem=16G,node=2
14726565.0|cpu=2,gres/gpu:gtx1080ti=2,gres/gpu=2,mem=0,node=2
14733058|billing=8,cpu=8,gres/gpu=5,mem=16G,node=2
14733058.extern|billing=8,cpu=8,gres/gpu=5,mem=16G,node=2
14733058.0|cpu=2,gres/gpu:gtx1080ti=2,gres/gpu=2,mem=0,node=2

Job 14726565 was allocated 3 GPUs; job 14733058 was allocated 5 GPUs.

Expected behavior is 1 task on each node, with each task allocated 1 GPU.

-Greg
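The erratic behavior is easy to spot by scanning the pipe-delimited sacct output for the job-level records (those without a `.extern`/`.0` step suffix). A small illustrative Python sketch, using the sample output copied from the report above:

```python
# Sample "sacct -P --format=jobid,alloctres" output from this ticket.
sacct_output = """\
JobID|AllocTRES
14726565|billing=8,cpu=8,gres/gpu=3,mem=16G,node=2
14726565.extern|billing=8,cpu=8,gres/gpu=3,mem=16G,node=2
14726565.0|cpu=2,gres/gpu:gtx1080ti=2,gres/gpu=2,mem=0,node=2
14733058|billing=8,cpu=8,gres/gpu=5,mem=16G,node=2
14733058.extern|billing=8,cpu=8,gres/gpu=5,mem=16G,node=2
14733058.0|cpu=2,gres/gpu:gtx1080ti=2,gres/gpu=2,mem=0,node=2
"""

def job_gpu_counts(text: str) -> dict:
    """Return {jobid: gres/gpu count} for top-level job records only."""
    counts = {}
    for line in text.splitlines()[1:]:   # skip the JobID|AllocTRES header
        jobid, tres = line.split("|", 1)
        if "." in jobid:                 # skip .extern and numbered steps
            continue
        for field in tres.split(","):
            key, _, value = field.partition("=")
            if key == "gres/gpu":
                counts[jobid] = int(value)
    return counts

print(job_gpu_counts(sacct_output))  # → {'14726565': 3, '14733058': 5}
```

Both jobs made the identical request (2 tasks, 1 GPU per task), yet were allocated 3 and 5 GPUs respectively, which is exactly the erratic allocation the ticket reports.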