Allocating resources with:

    srun --gpus-per-task=1 --ntasks=2 --nodes=2 --time 00:10:00 --pty /bin/bash -i

has resulted in different allocations:

JobID|AllocTRES
14726565|billing=8,cpu=8,gres/gpu=3,mem=16G,node=2
14726565.extern|billing=8,cpu=8,gres/gpu=3,mem=16G,node=2
14726565.0|cpu=2,gres/gpu:gtx1080ti=2,gres/gpu=2,mem=0,node=2
14733058|billing=8,cpu=8,gres/gpu=5,mem=16G,node=2
14733058.extern|billing=8,cpu=8,gres/gpu=5,mem=16G,node=2
14733058.0|cpu=2,gres/gpu:gtx1080ti=2,gres/gpu=2,mem=0,node=2

Job 14726565 was allocated 3 GPUs; job 14733058 was allocated 5 GPUs.

Expected behavior is 1 task on each node, with each task being allocated 1 GPU.

-Greg
Created attachment 18426 [details]
slurm.conf

Created attachment 18427 [details]
gres.conf
Could you please set SlurmctldDebug to at least verbose, enable the Gres debug flag, and share slurmctld logs from the time the jobs are submitted and started?

cheers,
Marcin
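For reference, a sketch of how the requested debug settings can be enabled, either in slurm.conf or at runtime (this is not part of the original comment; adjust to your deployment):

```shell
# In slurm.conf, then apply with "scontrol reconfigure":
#   SlurmctldDebug=verbose
#   DebugFlags=Gres

# Or toggle both on a running slurmctld without editing the file:
scontrol setdebug verbose
scontrol setdebugflags +gres
```

The runtime route is handy for capturing logs around a single reproduction without permanently raising log verbosity.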
Created attachment 18468 [details]
Fragments related to the two jobs from the slurmctld log
Created attachment 18469 [details]
Testing submission - slurmctld log

$ srun --gpus-per-task=1 --ntasks=2 --nodes=2 --time 00:10:00 --pty /bin/bash -i
srun: job 590 queued and waiting for resources
srun: job 590 has been allocated resources

$ scontrol show -d job=590
JobId=590 JobName=bash
   UserId=wickhagj(100302) GroupId=g-wickhagj(1100302) MCS_label=N/A
   Priority=889 Nice=0 Account=root QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:13 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2021-03-16T16:55:24 EligibleTime=2021-03-16T16:55:24
   AccrueTime=2021-03-16T16:55:24
   StartTime=2021-03-16T16:55:24 EndTime=2021-03-16T17:05:24 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-03-16T16:55:24
   Partition=batch AllocNode:Sid=slurm-02:2418
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=dgpu502-[29,33]
   BatchHost=dgpu502-29
   NumNodes=2 NumCPUs=8 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=16G,node=2,billing=8,gres/gpu=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=gpu:8
     Nodes=dgpu502-[29,33] CPU_IDs=0-3 Mem=8192 GRES=gpu:4(IDX:0-3)
   MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=nolmem DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/home/wickhagj
   Power=
   CpusPerTres=gpu:4
   TresPerTask=gpu:1
   NtasksPerTRES:0

$ sacct -j 590 -P --format=jobid,alloctres
JobID|AllocTRES
590|billing=8,cpu=8,gres/gpu=8,mem=16G,node=2
590.extern|billing=8,cpu=8,gres/gpu=8,mem=16G,node=2
590.0|cpu=2,gres/gpu:gtx1080ti=2,gres/gpu=2,mem=0,node=2
Hi,

Could you send us partitions.conf?

Dominik
The full debug logs will be uploaded tomorrow.
Created attachment 18470 [details]
partitions.conf
Hi,

I can reproduce this issue. I will let you know when a fix is available.

Dominik
Hi,

This commit should fix the issue. It will be available in Slurm 20.11.6 and above:

https://github.com/SchedMD/slurm/commit/bdf66674f9e0f03

Dominik
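Once the fixed release is deployed, the original reproducer can be re-run to confirm the behavior (a sketch, not from the original thread; the job ID will differ on your cluster):

```shell
# Same request as in the report: 2 tasks across 2 nodes, 1 GPU per task.
srun --gpus-per-task=1 --ntasks=2 --nodes=2 --time 00:10:00 --pty /bin/bash -i

# After the job starts, check the recorded allocation for that job ID.
# With the fix, the job-level AllocTRES should show gres/gpu=2
# (1 GPU per task), rather than the inflated 3, 5, or 8 seen above.
sacct -j <jobid> -P --format=jobid,alloctres
```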
Hi,

Is there anything else I can do to help, or are you OK to close this ticket?

Dominik
Hi Dominik,

If the bug has been resolved, the ticket can be closed.

Thanks,
-Greg
We upgraded to 20.11.6 today and it's working great. Thanks, Dominik.

-Greg