| Field | Value |
|---|---|
| Summary | cpuset: No space left on device on heterogeneous GPU nodes |
| Product | Slurm |
| Component | Scheduling |
| Version | 21.08.6 |
| Hardware | Linux |
| OS | Linux |
| Reporter | Dylan Simon <dsimon> |
| Assignee | Brian Christiansen <brian> |
| CC | brian, marshall, shall |
| Status | RESOLVED FIXED |
| Severity | 4 - Minor Issue |
| See Also | https://bugs.schedmd.com/show_bug.cgi?id=5082, https://bugs.schedmd.com/show_bug.cgi?id=16295 |
| Site | Simons Foundation & Flatiron Institute |
| Version Fixed | 22.05.4, 23.02.1pre1 |
| Attachments | slurm.conf, workergpu15 slurmd log |
Description
Dylan Simon
2022-05-05 14:23:01 MDT
Created attachment 24871
workergpu15 slurmd log
I just managed to reproduce this on these same nodes with --exclusive. Note particularly the job GRES:

    JOB_GRES=gpu:p100-16gb:2,gpu:v100-32gb:4
      Nodes=workergpu15 CPU_IDs=0-27 Mem=512000 GRES=gpu:p100-16gb:2(IDX:0-1)
      Nodes=workergpu052 CPU_IDs=0-35 Mem=768000 GRES=gpu:v100-32gb:4(IDX:0-3)

These are reversed: workergpu15 has 36 CPUs and 4 GPUs, and workergpu052 has 28 CPUs and 2 GPUs! The slurmd log also messes up the physical CPU mapping:

    [2022-05-05T16:38:13.724] [1375114.1] debug: task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-27'
    [2022-05-05T16:38:13.724] [1375114.1] debug: task/cgroup: task_cgroup_cpuset_create: step abstract cores are ''
    [2022-05-05T16:38:13.724] [1375114.1] debug: task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-20,22,24,26,28,30,32,34'
    [2022-05-05T16:38:13.724] [1375114.1] debug: task/cgroup: task_cgroup_cpuset_create: step physical CPUs are ''

    JobId=1375114 JobName=interactive
       UserId=dylan(1135) GroupId=dylan(1135) MCS_label=N/A
       Priority=4294829194 Nice=0 Account=scc QOS=unlimit
       JobState=RUNNING Reason=None Dependency=(null)
       Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0 DerivedExitCode=0:0
       RunTime=00:01:47 TimeLimit=01:00:00 TimeMin=N/A
       SubmitTime=2022-05-05T16:35:49 EligibleTime=2022-05-05T16:35:49
       AccrueTime=2022-05-05T16:36:34
       StartTime=2022-05-05T16:37:41 EndTime=2022-05-05T17:37:41 Deadline=N/A
       SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-05-05T16:37:16 Scheduler=Backfill
       Partition=request AllocNode:Sid=rustyamd1:111648
       ReqNodeList=workergpu[15,052] ExcNodeList=(null)
       NodeList=workergpu[15,052] BatchHost=workergpu15
       NumNodes=2 NumCPUs=64 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
       TRES=cpu=64,mem=1250G,node=2,billing=64,gres/gpu=6
       Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
       JOB_GRES=gpu:p100-16gb:2,gpu:v100-32gb:4
         Nodes=workergpu15 CPU_IDs=0-27 Mem=512000 GRES=gpu:p100-16gb:2(IDX:0-1)
         Nodes=workergpu052 CPU_IDs=0-35 Mem=768000 GRES=gpu:v100-32gb:4(IDX:0-3)
       MinCPUsNode=1 MinMemoryNode=500G MinTmpDiskNode=0
       Features=(null) DelayBoot=00:00:00
       OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
       Command=(null) WorkDir=/mnt/home/dylan Power=
       TresPerTask=gres:gpu:1

    NodeName=workergpu15 Arch=x86_64 CoresPerSocket=18 CPUAlloc=36 CPUTot=36 CPULoad=0.03
       AvailableFeatures=gpu,skylake,v100,v100-32gb,nvlink,sxm2,numai18,centos7
       ActiveFeatures=gpu,skylake,v100,v100-32gb,nvlink,sxm2,numai18,centos7
       Gres=gpu:v100-32gb:4(S:0-1) GresDrain=N/A GresUsed=gpu:v100-32gb:4(IDX:0-3),gdr:0
       NodeAddr=workergpu15 NodeHostName=workergpu15 Version=21.08.6
       OS=Linux 5.4.163.1.fi #1 SMP Wed Dec 1 05:10:33 EST 2021
       RealMemory=768000 AllocMem=768000 FreeMem=754182 Sockets=2 Boards=1
       State=ALLOCATED ThreadsPerCore=1 TmpDisk=450000 Weight=55 Owner=N/A MCS_label=N/A
       Partitions=gpu,request
       BootTime=2022-05-05T16:37:17 SlurmdStartTime=2022-05-05T16:37:17
       LastBusyTime=2022-05-05T16:37:17
       CfgTRES=cpu=36,mem=750G,billing=36,gres/gpu=4
       AllocTRES=cpu=36,mem=750G,gres/gpu=4
       CapWatts=n/a CurrentWatts=0 AveWatts=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

    NodeName=workergpu052 Arch=x86_64 CoresPerSocket=14 CPUAlloc=28 CPUTot=28 CPULoad=0.00
       AvailableFeatures=gpu,p100,ib,numai14,centos7
       ActiveFeatures=gpu,p100,ib,numai14,centos7
       Gres=gpu:p100-16gb:2(S:0-1) GresDrain=N/A GresUsed=gpu:p100-16gb:2(IDX:0-1),gdr:0
       NodeAddr=workergpu052 NodeHostName=workergpu052 Version=21.08.6
       OS=Linux 5.4.163.1.fi #1 SMP Wed Dec 1 05:10:33 EST 2021
       RealMemory=512000 AllocMem=512000 FreeMem=503747 Sockets=2 Boards=1
       State=ALLOCATED ThreadsPerCore=1 TmpDisk=950000 Weight=45 Owner=N/A MCS_label=N/A
       Partitions=gpu,request
       BootTime=2022-05-05T16:37:17 SlurmdStartTime=2022-05-05T16:37:19
       LastBusyTime=2022-05-05T16:37:19
       CfgTRES=cpu=28,mem=500G,billing=28,gres/gpu=2
       AllocTRES=cpu=28,mem=500G,gres/gpu=2
       CapWatts=n/a CurrentWatts=0 AveWatts=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Maybe this only happens when we mix our (crazy) 2- and 3-digit hostnames (workergpuXX and workergpuXXX) and something is sorting them in different orders?

Hi Dylan,

(In reply to Dylan Simon from comment #3)
> Maybe this only happens when we mix our (crazy) 2- and 3-digit hostnames
> (workergpuXX and workergpuXXX) and something is sorting them in different
> orders?

That would be my first guess. There might be different sorting approaches at different points. My initial guess is that Slurm builds the GRES bitmaps by sorting nodes with a strict alphanumeric sort, while perhaps scontrol prints out host lists according to a natural sort where leading zeros are ignored. Or vice versa.

Could you try making your GPU node names a consistent width and see if that fixes the issue?

Thanks,
-Michael

As for the "No space left on device" error, I wonder if this is related to bug 5082. In that bug, the user was also using CentOS 7. What exact version of CentOS is node workergpu15 running? Also, maybe double-check that you aren't actually running out of disk space.

We're running CentOS 7.9.2009 with our own 5.4.163 kernel. However, I can also reproduce this issue on Rocky 8.5. I'm very sure that the "No space" message is because the cpuset cgroup has no CPUs assigned to it.

I just tried renaming workergpu052 to workergpu52 and repeating the same test with workergpu15 and did not see the problem, so it does seem likely to be a sorting issue.
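For reference, the behaviour described above matches cgroup v1 cpuset semantics: the kernel refuses to attach a task to a cpuset whose cpuset.cpus (or cpuset.mems) is empty and returns ENOSPC, which strerror() renders as "No space left on device". Below is a minimal standalone sketch of that failure mode. It is not Slurm code; the cgroup mount point and the empty_demo directory name are assumptions, and it needs root on a host with the v1 cpuset controller mounted.

```c
/* Minimal sketch (assumptions: cgroup v1 cpuset mounted at
 * /sys/fs/cgroup/cpuset, run as root): attaching a task to a cpuset
 * with no CPUs or mems configured fails with ENOSPC, i.e.
 * "No space left on device". */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    const char *dir = "/sys/fs/cgroup/cpuset/empty_demo"; /* hypothetical name */
    char path[256], pid[32];

    if (mkdir(dir, 0755) && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }

    /* Deliberately leave cpuset.cpus and cpuset.mems empty, then try
     * to move the current process into the new cpuset. */
    snprintf(path, sizeof(path), "%s/tasks", dir);
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror("open tasks");
        return 1;
    }

    int len = snprintf(pid, sizeof(pid), "%d\n", getpid());
    if (write(fd, pid, len) < 0) {
        /* Expected here: ENOSPC -> "No space left on device" */
        fprintf(stderr, "attach failed: %s\n", strerror(errno));
    }
    close(fd);
    return 0;
}
```

If that is what is happening, the error points at an empty cpuset rather than an actually full filesystem, which fits the mis-paired CPU_IDs shown in the job output above.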
Hi Dylan,

I'm going to reduce this to a severity 4, since there is a workaround for the issue. The fix for this won't make it into 22.05, since that is coming out this month, but we'll look into fixing this.

Thanks!
-Michael

*** Ticket 14701 has been marked as a duplicate of this ticket. ***

Hi Dylan,

We've fixed this issue in the following commits ahead of 22.05.4:

    * 8bebdd1147 (origin/slurm-22.05) NEWS for the previous 5 commits
    * be3362824a Sort job and step nodelists in Slurm user commands
    * f9d976b8f4 Add slurm_sort_node_list_str() to sort a node_list string
    * bcda9684a9 Add hostset_[de]ranged_string_xmalloc functions
    * bd7641da57 Change all hostset_finds to hostlist_finds
    * 9129b8a211 Fix expected ordering for hostlists from controller and daemons

Let us know if you have any questions.

Thanks,
Brian
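As a side note on why the ordering mattered: the fix commits above change how host lists are sorted. The following small program is illustrative only and is not Slurm's hostlist code; natural_cmp is a hypothetical helper written for this example. It shows how a strict lexicographic comparison and a numeric-aware "natural" comparison disagree on workergpu15 versus workergpu052, the kind of mismatch Michael hypothesized between the GRES bitmap ordering and the printed host lists.

```c
/* Illustration only (not Slurm source): two ways of ordering the
 * hostnames from this ticket. A plain strcmp() puts workergpu052
 * first ('0' < '1'), while a numeric-aware comparison puts
 * workergpu15 first (15 < 52). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: compare the common alphabetic prefix as a
 * string and the trailing digit run as a number, so leading zeros
 * are ignored. */
static int natural_cmp(const char *a, const char *b)
{
    size_t pa = strcspn(a, "0123456789");
    size_t pb = strcspn(b, "0123456789");

    if (pa != pb || strncmp(a, b, pa) != 0)
        return strcmp(a, b);    /* different prefixes: plain compare */

    long na = strtol(a + pa, NULL, 10);
    long nb = strtol(b + pb, NULL, 10);
    return (na > nb) - (na < nb);
}

int main(void)
{
    const char *x = "workergpu15", *y = "workergpu052";

    printf("lexicographic: %s sorts first\n",
           strcmp(x, y) < 0 ? x : y);       /* prints workergpu052 */
    printf("natural:       %s sorts first\n",
           natural_cmp(x, y) < 0 ? x : y);  /* prints workergpu15 */
    return 0;
}
```

If one code path builds the per-node CPU/GRES arrays in one of these orders while another prints the node list in the other, the CPU_IDs and GRES lines end up paired with the wrong hostnames, which is exactly what the job record above shows, and why renaming workergpu052 to workergpu52 (making the widths consistent) avoided the problem.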