Created attachment 24870 [details]
slurm.conf

Trying to run a single task (srun -n1) in an allocation with multiple nodes sometimes results in an empty CPU task set. When this happens, the slurmd logs include:

[2022-05-05T16:04:31.262] error: cons_res: zero processors allocated to step
[2022-05-05T16:04:31.316] [1375095.0] debug: task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0'
[2022-05-05T16:04:31.316] [1375095.0] debug: task/cgroup: task_cgroup_cpuset_create: step abstract cores are ''
[2022-05-05T16:04:31.316] [1375095.0] debug: task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0'
[2022-05-05T16:04:31.316] [1375095.0] debug: task/cgroup: task_cgroup_cpuset_create: step physical CPUs are ''

Here's an example:

>salloc -N 2 --ntasks-per-node=1 --gpus-per-task=1 -p gpu -w workergpu15,workergpu052 -t 1:00:00 bash
salloc: Pending job allocation 1375095
salloc: job 1375095 queued and waiting for resources
salloc: job 1375095 has been allocated resources
salloc: Granted job allocation 1375095
salloc: Waiting for resource configuration
salloc: Nodes workergpu[15,052] are ready for job

>scontrol -d show job $SLURM_JOBID
JobId=1375095 JobName=interactive
   UserId=dylan(1135) GroupId=dylan(1135) MCS_label=N/A
   Priority=4294829213 Nice=0 Account=scc QOS=gen
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:08 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2022-05-05T16:03:22 EligibleTime=2022-05-05T16:03:22
   AccrueTime=2022-05-05T16:03:22
   StartTime=2022-05-05T16:04:15 EndTime=2022-05-05T17:04:15 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-05-05T16:04:15 Scheduler=Backfill
   Partition=gpu AllocNode:Sid=rustyamd1:111648
   ReqNodeList=workergpu[15,052] ExcNodeList=(null)
   NodeList=workergpu[15,052]
   BatchHost=workergpu15
   NumNodes=2 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=32000M,node=2,billing=2,gres/gpu=2
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   JOB_GRES=gpu:p100-16gb:1,gpu:v100-32gb:1
     Nodes=workergpu15 CPU_IDs=0 Mem=16000 GRES=gpu:p100-16gb:1(IDX:0)
     Nodes=workergpu052 CPU_IDs=0 Mem=16000 GRES=gpu:v100-32gb:1(IDX:0)
   MinCPUsNode=1 MinMemoryCPU=16000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/mnt/home/dylan/scc/disBatch
   Power=
   TresPerTask=gres:gpu:1

rustyamd1:~/scc/disBatch [0]>scontrol -d show node $SLURM_NODELIST
NodeName=workergpu15 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=1 CPUTot=36 CPULoad=0.05
   AvailableFeatures=gpu,skylake,v100,v100-32gb,nvlink,sxm2,numai18,centos7
   ActiveFeatures=gpu,skylake,v100,v100-32gb,nvlink,sxm2,numai18,centos7
   Gres=gpu:v100-32gb:4(S:0-1)
   GresDrain=N/A
   GresUsed=gpu:v100-32gb:1(IDX:0),gdr:0
   NodeAddr=workergpu15 NodeHostName=workergpu15 Version=21.08.6
   OS=Linux 5.4.163.1.fi #1 SMP Wed Dec 1 05:10:33 EST 2021
   RealMemory=768000 AllocMem=16000 FreeMem=754182 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=450000 Weight=55 Owner=N/A MCS_label=N/A
   Partitions=gpu,request
   BootTime=2022-05-05T15:46:17 SlurmdStartTime=2022-05-05T15:46:17
   LastBusyTime=2022-05-05T16:03:22
   CfgTRES=cpu=36,mem=750G,billing=36,gres/gpu=4
   AllocTRES=cpu=1,mem=16000M,gres/gpu=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=workergpu052 Arch=x86_64 CoresPerSocket=14
   CPUAlloc=1 CPUTot=28 CPULoad=0.01
   AvailableFeatures=gpu,p100,ib,numai14,centos7
   ActiveFeatures=gpu,p100,ib,numai14,centos7
   Gres=gpu:p100-16gb:2(S:0-1)
   GresDrain=N/A
   GresUsed=gpu:p100-16gb:1(IDX:0),gdr:0
   NodeAddr=workergpu052 NodeHostName=workergpu052 Version=21.08.6
   OS=Linux 5.4.163.1.fi #1 SMP Wed Dec 1 05:10:33 EST 2021
   RealMemory=512000 AllocMem=16000 FreeMem=503742 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=950000 Weight=45 Owner=N/A MCS_label=N/A
   Partitions=gpu,request
   BootTime=2022-05-05T15:46:17 SlurmdStartTime=2022-05-05T15:46:18
   LastBusyTime=2022-05-05T16:03:22
   CfgTRES=cpu=28,mem=500G,billing=28,gres/gpu=2
   AllocTRES=cpu=1,mem=16000M,gres/gpu=1
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

>srun -N1 -n1 -w workergpu15 nproc
srun: error: task 0 launch failed: Slurmd could not execve job
slurmstepd: error: common_file_write_uint32s: write pid 361312 to /sys/fs/cgroup/cpuset/slurm/uid_1135/job_1375095/step_0/cgroup.procs failed: No space left on device
slurmstepd: error: unable to add pids to '/sys/fs/cgroup/cpuset/slurm/uid_1135/job_1375095/step_0'
slurmstepd: error: task_g_pre_set_affinity: No space left on device
slurmstepd: error: _exec_wait_child_wait_for_parent: failed: No error

The slurmctld log has nothing unusual, but I'll attach the "slurmd -d" log. I tried turning on the cpu_bind DebugFlag but didn't see much; I'm happy to try again if you have specific suggestions.

We've only seen this happen on GPU nodes, and it seems to mainly happen when the two nodes have different CPU configurations in some way, either in their NUMA CPU maps or their gres GPU cores. For example, the two nodes above:

workergpu052 (two GPUs, one on each socket):

NodeName=workergpu[048,049,051,052] Name=gpu Type=p100-16gb Count=1 File=/dev/nvidia0 Cores=0,2,4,6,8,10,12,14,16,18,20,22,24,26
NodeName=workergpu[048,049,051,052] Name=gpu Type=p100-16gb Count=1 File=/dev/nvidia1 Cores=1,3,5,7,9,11,13,15,17,19,21,23,25,27

NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27

workergpu15 (four GPUs, all on NUMA node 0, so we leave out Cores=):

NodeName=workergpu[053,054],workergpu[15,23-30,32-34] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia0
NodeName=workergpu[053,054],workergpu[15,23-30,32-34] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia1
NodeName=workergpu[053,054],workergpu[15,23-30,32-34] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia2
NodeName=workergpu[053,054],workergpu[15,23-30,32-34] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia3

NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35
Created attachment 24871 [details] workergpu15 slurmd log
I just managed to reproduce this on these same nodes with --exclusive. Note particularly the job GRES:

   JOB_GRES=gpu:p100-16gb:2,gpu:v100-32gb:4
     Nodes=workergpu15 CPU_IDs=0-27 Mem=512000 GRES=gpu:p100-16gb:2(IDX:0-1)
     Nodes=workergpu052 CPU_IDs=0-35 Mem=768000 GRES=gpu:v100-32gb:4(IDX:0-3)

These are reversed: workergpu15 has 36 CPUs and 4 GPUs, and 052 has 28 CPUs and 2 GPUs! The slurmd log also gets the physical CPU mapping wrong:

[2022-05-05T16:38:13.724] [1375114.1] debug: task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-27'
[2022-05-05T16:38:13.724] [1375114.1] debug: task/cgroup: task_cgroup_cpuset_create: step abstract cores are ''
[2022-05-05T16:38:13.724] [1375114.1] debug: task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-20,22,24,26,28,30,32,34'
[2022-05-05T16:38:13.724] [1375114.1] debug: task/cgroup: task_cgroup_cpuset_create: step physical CPUs are ''

JobId=1375114 JobName=interactive
   UserId=dylan(1135) GroupId=dylan(1135) MCS_label=N/A
   Priority=4294829194 Nice=0 Account=scc QOS=unlimit
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:01:47 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2022-05-05T16:35:49 EligibleTime=2022-05-05T16:35:49
   AccrueTime=2022-05-05T16:36:34
   StartTime=2022-05-05T16:37:41 EndTime=2022-05-05T17:37:41 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-05-05T16:37:16 Scheduler=Backfill
   Partition=request AllocNode:Sid=rustyamd1:111648
   ReqNodeList=workergpu[15,052] ExcNodeList=(null)
   NodeList=workergpu[15,052]
   BatchHost=workergpu15
   NumNodes=2 NumCPUs=64 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=1250G,node=2,billing=64,gres/gpu=6
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   JOB_GRES=gpu:p100-16gb:2,gpu:v100-32gb:4
     Nodes=workergpu15 CPU_IDs=0-27 Mem=512000 GRES=gpu:p100-16gb:2(IDX:0-1)
     Nodes=workergpu052 CPU_IDs=0-35 Mem=768000 GRES=gpu:v100-32gb:4(IDX:0-3)
   MinCPUsNode=1 MinMemoryNode=500G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/mnt/home/dylan
   Power=
   TresPerTask=gres:gpu:1

NodeName=workergpu15 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=36 CPUTot=36 CPULoad=0.03
   AvailableFeatures=gpu,skylake,v100,v100-32gb,nvlink,sxm2,numai18,centos7
   ActiveFeatures=gpu,skylake,v100,v100-32gb,nvlink,sxm2,numai18,centos7
   Gres=gpu:v100-32gb:4(S:0-1)
   GresDrain=N/A
   GresUsed=gpu:v100-32gb:4(IDX:0-3),gdr:0
   NodeAddr=workergpu15 NodeHostName=workergpu15 Version=21.08.6
   OS=Linux 5.4.163.1.fi #1 SMP Wed Dec 1 05:10:33 EST 2021
   RealMemory=768000 AllocMem=768000 FreeMem=754182 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=450000 Weight=55 Owner=N/A MCS_label=N/A
   Partitions=gpu,request
   BootTime=2022-05-05T16:37:17 SlurmdStartTime=2022-05-05T16:37:17
   LastBusyTime=2022-05-05T16:37:17
   CfgTRES=cpu=36,mem=750G,billing=36,gres/gpu=4
   AllocTRES=cpu=36,mem=750G,gres/gpu=4
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=workergpu052 Arch=x86_64 CoresPerSocket=14
   CPUAlloc=28 CPUTot=28 CPULoad=0.00
   AvailableFeatures=gpu,p100,ib,numai14,centos7
   ActiveFeatures=gpu,p100,ib,numai14,centos7
   Gres=gpu:p100-16gb:2(S:0-1)
   GresDrain=N/A
   GresUsed=gpu:p100-16gb:2(IDX:0-1),gdr:0
   NodeAddr=workergpu052 NodeHostName=workergpu052 Version=21.08.6
   OS=Linux 5.4.163.1.fi #1 SMP Wed Dec 1 05:10:33 EST 2021
   RealMemory=512000 AllocMem=512000 FreeMem=503747 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=950000 Weight=45 Owner=N/A MCS_label=N/A
   Partitions=gpu,request
   BootTime=2022-05-05T16:37:17 SlurmdStartTime=2022-05-05T16:37:19
   LastBusyTime=2022-05-05T16:37:19
   CfgTRES=cpu=28,mem=500G,billing=28,gres/gpu=2
   AllocTRES=cpu=28,mem=500G,gres/gpu=2
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
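In case it's useful, a rough back-of-the-envelope check (plain Python, not Slurm code; it assumes abstract cores are numbered socket by socket and that, per the lscpu output above, even physical CPU IDs are on socket 0 of workergpu15 and odd ones on socket 1): applying workergpu052's abstract cores 0-27 to workergpu15's 2x18-core layout reproduces the exact "job physical CPUs" string from the slurmd log, which is consistent with workergpu052's core bitmap being applied on workergpu15.

# Hypothetical reconstruction, not Slurm source: translate abstract core IDs
# to physical CPU IDs on workergpu15 (2 sockets x 18 cores, 1 thread/core),
# assuming abstract cores are numbered block-wise by socket and physical CPU
# IDs alternate sockets (even = socket 0, odd = socket 1).
CORES_PER_SOCKET = 18

def physical_cpu(abstract_core):
    socket = abstract_core // CORES_PER_SOCKET
    core = abstract_core % CORES_PER_SOCKET
    return 2 * core + socket  # socket 0 -> even IDs, socket 1 -> odd IDs

# workergpu052's allocation (abstract cores 0-27) applied to workergpu15:
cpus = sorted(physical_cpu(c) for c in range(28))
print(cpus)
# 0-20 plus 22,24,26,28,30,32,34, i.e. '0-20,22,24,26,28,30,32,34', the exact
# string in the slurmd log above (21 is missing because it would be socket 1
# core 10, abstract core 28, which is outside 0-27).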
Maybe this only happens when we mix our (crazy) 2- and 3-digit hostnames (workergpuXX and workergpuXXX) and something is sorting them in different orders?
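For example (plain Python, not Slurm's hostlist code, just to illustrate the two orderings): a strict string sort puts workergpu052 before workergpu15 because '0' sorts before '1', while a numeric-aware sort on the suffix puts workergpu15 first.

import re

names = ["workergpu15", "workergpu052"]

# Strict alphanumeric (string) sort: '0' < '1', so workergpu052 comes first.
print(sorted(names))  # ['workergpu052', 'workergpu15']

# "Natural" sort: numeric chunks compare as integers, so 15 < 52 and
# workergpu15 comes first (leading zeros are effectively ignored).
def natural_key(name):
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]

print(sorted(names, key=natural_key))  # ['workergpu15', 'workergpu052']

If one part of Slurm orders the nodes one way and another part the other way, the two nodes' per-node CPU/GRES entries would end up swapped, which would match what the --exclusive job above showed.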
Hi Dylan,

(In reply to Dylan Simon from comment #3)
> Maybe this only happens when we mix our (crazy) 2- and 3-digit hostnames
> (workergpuXX and workergpuXXX) and something is sorting them in different
> orders?

That would be my first guess. There may be different sorting approaches at different points. My initial guess is that Slurm builds the GRES bitmaps sorting nodes with a strict alphanumeric sort, while scontrol perhaps prints out host lists according to a natural sort where leading 0's are ignored. Or vice versa.

Could you try making your GPU node names a consistent width and see if that fixes the issue?

Thanks,
-Michael
As for the "No space left on device" error, I wonder if this is related to bug 5082. In that bug, the user also was using CentOS 7. What exact version of CentOS is node workergpu15 running? Also, maybe double check to make sure you aren't actually running out of disk space.
We're running CentOS 7.9.2009 with our own 5.4.163 kernel. However, I can reproduce this issue on Rocky 8.5 as well.

I'm very sure that the "No space" message is because the cpuset cgroup has no CPUs assigned to it.

I just tried renaming workergpu052 to workergpu52 and repeating the same test with workergpu15, and I did not see the problem, so it does seem likely to be a sorting issue.
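A minimal way to see that directly, assuming the cgroup v1 layout from the slurmstepd errors above (the uid/job/step IDs are from that earlier example and would need adjusting for whichever step is failing): the step cgroup's cpuset.cpus is empty, and attaching a task to a cpuset with no CPUs is what the kernel reports as ENOSPC, hence the misleading "No space left on device".

# Minimal sketch: inspect the step cgroup that slurmstepd failed to attach to.
# Paths follow the cgroup v1 layout in the error messages above; adjust the
# uid/job/step IDs for the step being debugged.
from pathlib import Path

step = Path("/sys/fs/cgroup/cpuset/slurm/uid_1135/job_1375095/step_0")

cpus = (step / "cpuset.cpus").read_text().strip()
print(f"step cpuset.cpus = {cpus!r}")
# An empty string here means the cpuset has zero CPUs, so any write of a pid
# into its cgroup.procs is rejected with ENOSPC ("No space left on device").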
Hi Dylan,

I'm going to reduce this to a severity 4, since there is a workaround for the issue. The fix for this won't make it into 22.05, since that is coming out this month, but we'll look into fixing this.

Thanks!
-Michael
*** Ticket 14701 has been marked as a duplicate of this ticket. ***
Hi Dylan,

We've fixed this issue in the following commits ahead of 22.05.4:

| * 8bebdd1147 (origin/slurm-22.05) NEWS for the previous 5 commits
| * be3362824a Sort job and step nodelists in Slurm user commands
| * f9d976b8f4 Add slurm_sort_node_list_str() to sort a node_list string
| * bcda9684a9 Add hostset_[de]ranged_string_xmalloc functions
| * bd7641da57 Change all hostset_finds to hostlist_finds
| * 9129b8a211 Fix expected ordering for hostlists from controller and daemons

Let us know if you have any questions.

Thanks,
Brian