Created attachment 21094 [details]
slurm.conf
Please also attach your cgroup.conf and gres.conf. Did you reboot the nodes after upgrading? Also, in this test system, are these real devices you are constraining or fake devices?

Created attachment 21112 [details]
cgroup.conf
Created attachment 21113 [details]
gres.conf
(In reply to Jason Booth from comment #3)
> Please also attach your cgroup.conf and gres.conf. Did you reboot the nodes
> after upgrading? Also, in this test system are these real devices you are
> constraining or fake devices?

Nodes were rebooted for 20.11.0. All real devices.

-Greg

(In reply to Greg Wickham from comment #6)
> Nodes were rebooted for 20.11.0

Nodes were rebooted for . . 21.08 . .

Would you also send the output of "<cgroup_path>/cgroup.clone_children"?
I am running multiple slurmd's but here is what it looks like on my system.
> $:/sys/fs/cgroup/memory$ cat slurm_n1/cgroup.clone_children
> 0
Also, do new jobs spawn with the default of 0 for cgroup.clone_children (inherited)?

Greg - it would also be helpful to know what is under these directories, where /sys/fs/cgroup is your cgroup mount location.

ls "/sys/fs/cgroup/memory"
ls "/sys/fs/cgroup/cpuset/"
ls "/sys/fs/cgroup/devices/"
ls "/sys/fs/cgroup/cpu"

The error comes from setting up each respective cgroup:
https://github.com/SchedMD/slurm/blob/slurm-21-08-0-1/src/plugins/task/cgroup/task_cgroup.c#L156

Hi Jason,

We have some users on our test cluster so I can't swap between releases. Give me another day to spin up another cluster so I can provide the required information.

-Greg

Greg,

I'll be taking over this bug for Jason. While we're waiting for the information, I just wanted to give you some background about why we're asking for it. We refactored our cgroup plugins for 21.08. One of the things that changed was setting cgroup.clone_children to zero, since there were some problems with it (which look similar to the errors we see in your log) on some systems where clone_children was set to 1. See commit 2bf25fdaf2c20 for more information.

- Marshall

Hi Marshall,

On our new test cluster:

$ srun --gres gpu:1 --time 00:10:00 --pty /bin/bash -i
srun: job 1 queued and waiting for resources
srun: job 1 has been allocated resources
srun: Force Terminated job 1

root@gpu203-23-l: ~ # ls "/sys/fs/cgroup/memory"
cgroup.clone_children           memory.kmem.slabinfo                memory.memsw.limit_in_bytes      memory.swappiness
cgroup.event_control            memory.kmem.tcp.failcnt             memory.memsw.max_usage_in_bytes  memory.usage_in_bytes
cgroup.procs                    memory.kmem.tcp.limit_in_bytes      memory.memsw.usage_in_bytes      memory.use_hierarchy
cgroup.sane_behavior            memory.kmem.tcp.max_usage_in_bytes  memory.move_charge_at_immigrate  notify_on_release
memory.failcnt                  memory.kmem.tcp.usage_in_bytes      memory.numa_stat                 release_agent
memory.force_empty              memory.kmem.usage_in_bytes          memory.oom_control               slurm
memory.kmem.failcnt             memory.limit_in_bytes               memory.pressure_level            system.slice
memory.kmem.limit_in_bytes      memory.max_usage_in_bytes           memory.soft_limit_in_bytes       tasks
memory.kmem.max_usage_in_bytes  memory.memsw.failcnt                memory.stat                      user.slice
root@gpu203-23-l: ~ # ls "/sys/fs/cgroup/cpuset/"
cgroup.clone_children  cpuset.cpus            cpuset.memory_migrate           cpuset.mems                      slurm
cgroup.event_control   cpuset.effective_cpus  cpuset.memory_pressure          cpuset.sched_load_balance        system
cgroup.procs           cpuset.effective_mems  cpuset.memory_pressure_enabled  cpuset.sched_relax_domain_level  tasks
cgroup.sane_behavior   cpuset.mem_exclusive   cpuset.memory_spread_page       notify_on_release                weka-client
cpuset.cpu_exclusive   cpuset.mem_hardwall    cpuset.memory_spread_slab       release_agent                    weka-default
root@gpu203-23-l: ~ # ls "/sys/fs/cgroup/devices/"
cgroup.clone_children  cgroup.procs          devices.allow  devices.list       release_agent  system.slice  user.slice
cgroup.event_control   cgroup.sane_behavior  devices.deny   notify_on_release  slurm          tasks
root@gpu203-23-l: ~ # ls "/sys/fs/cgroup/cpu"
cgroup.clone_children  cgroup.sane_behavior  cpuacct.usage_percpu  cpu.rt_period_us   cpu.stat           system.slice
cgroup.event_control   cpuacct.stat          cpu.cfs_period_us     cpu.rt_runtime_us  notify_on_release  tasks
cgroup.procs           cpuacct.usage         cpu.cfs_quota_us      cpu.shares         release_agent      user.slice
root@gpu203-23-l: /sys/fs/cgroup/memory # cat slurm/cgroup.clone_children
0

Hi Greg,
Thanks for that data. Can you run these commands for me? Most likely these are okay, but I'm just double-checking a couple of things.
find /sys/fs/cgroup/cpuset/slurm/ -name cgroup.clone_children -exec echo '{}' \; -exec cat '{}' \;
find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.mems -exec echo '{}' \; -exec cat '{}' \;
Also, can you set DebugFlags=cgroup in slurm.conf (on all daemons), restart all daemons, re-run your test, then upload the slurmd log file? This should tell us more details about what is actually happening.
# find /sys/fs/cgroup/cpuset/slurm/ -name cgroup.clone_children -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/system/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/cgroup.clone_children
0
# find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/system/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/step_extern/cpuset.cpus
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/cpuset.cpus
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/step_extern/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/uid_100302/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/cpuset.cpus
2-63
# find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.mems -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/system/cpuset.mems
0-3
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/step_extern/cpuset.mems
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/cpuset.mems
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/step_extern/cpuset.mems
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/cpuset.mems
/sys/fs/cgroup/cpuset/slurm/uid_100302/cpuset.mems
0-3
/sys/fs/cgroup/cpuset/slurm/cpuset.mems
0-3
#
Marshall,

I added "DebugFlags=cgroup" to slurm.conf and restarted slurmctld and slurmd. This is all that was displayed:

[2021-09-02T19:37:46.881] slurmd version 21.08.0 started
[2021-09-02T19:37:46.891] slurmd started on Thu, 02 Sep 2021 19:37:46 +0300
[2021-09-02T19:37:54.170] CPUs=64 Boards=1 Sockets=1 Cores=64 Threads=1 Memory=515613 TmpDisk=6000273 Uptime=24322 CPUSpecList=0-1 FeaturesAvail=(null) FeaturesActive=(null)
[2021-09-02T19:38:26.579] [2.extern] Considering each NUMA node as a socket
[2021-09-02T19:38:26.602] [2.extern] error: common_file_write_content: unable to write 8 bytes to cgroup /sys/fs/cgroup/cpuset/slurm/uid_100302/cpuset.cpus: Permission denied
[2021-09-02T19:38:26.626] [2.extern] task/cgroup: _memcg_initialize: job: alloc=8192MB mem.limit=8192MB memsw.limit=16384MB
[2021-09-02T19:38:26.626] [2.extern] task/cgroup: _memcg_initialize: step: alloc=8192MB mem.limit=8192MB memsw.limit=16384MB
[2021-09-02T19:38:26.629] [2.extern] error: _spawn_job_container: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
[2021-09-02T19:38:26.630] [2.extern] done with job
[2021-09-02T19:38:26.635] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No such process
[2021-09-02T19:38:26.636] Could not launch job 2 and not able to requeue it, cancelling job

I must need to do something else?

> I must need to do something else?
I think you set the debug flag correctly, it's likely that it's failing before it hits log statements with that debug flag though. I'm digging into the code a bit more.
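For anyone following the error itself: under the cgroup v1 cpuset controller, writing a cpuset.cpus value that is not a subset of the parent cgroup's cpuset.cpus is rejected, and the error surfaces as EACCES ("Permission denied"), which is what the common_file_write_content message above shows. A minimal C sketch demonstrating that behavior (the path is the example from this thread; run only as root on a disposable test node):

/*
 * Sketch only: attempt to write a cpuset.cpus value wider than the parent
 * cgroup allows. Under cgroup v1 this is expected to fail with EACCES
 * ("Permission denied"), matching the slurmd error above.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path =
		"/sys/fs/cgroup/cpuset/slurm/uid_100302/cpuset.cpus";
	const char *cpus = "0-63"; /* parent (slurm/) only owns 2-63 */

	int fd = open(path, O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, cpus, strlen(cpus)) < 0)
		perror("write"); /* expected: Permission denied */
	close(fd);
	return 0;
}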
Thanks for that info.

I noticed that cpuset.cpus is 2-63 everywhere you showed us. What is the value of cpuset.cpus in /sys/fs/cgroup/cpuset?

cat /sys/fs/cgroup/cpuset/cpuset.cpus

It looks like you have CoreSpecCount (or CpuSpecList) set in the node definition. Can you upload the node definition from the configuration file for the node you are testing? I wonder if Slurm is trying to set all CPUs in cpuset.cpus for a job even though it may not have access to all the CPUs.

Also, just a quick sanity check - is SlurmdUser=root? I see you have it commented out in slurm.conf, and since the default is root it should be root unless you have it set in an included configuration file.

Just so you know, I believe the crux of this issue is the "permission denied" errors when trying to set cpuset.cpus, so that's what I'm trying to solve.

Hi Marshall,

1/
$ cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-63

2/ Node definition:
NodeName=DEFAULT Gres=gpu:a100:4 CpuSpecList=0-1 Feature=cpu_amd_epyc_7713,amd,milan,nolmem,local_200G,local_400G,local_500G,local_950G,gpu_a100,a100 RealMemory=483328 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 Weight=150
NodeName=gpu101-02-r

3/ slurmd runs as root

We're reserving two cores for file system operations - it's a 64-core CPU with Slurm only being able to allocate 62.

Greg,

Can you try removing CpuSpecList=0-1 and run the test again? I just want to see if this works; if it does, then it points to Slurm's new cgroup code not properly handling CpuSpecList.

Hi Marshall,

No go. Removed "CpuSpecList=0-1".

$ srun --nodelist gpu203-23-r --time 00:10:00 --gres gpu:1 --pty /bin/bash -i
srun: job 7 queued and waiting for resources
srun: job 7 has been allocated resources
srun: Force Terminated job 7

[2021-09-08T02:13:57.346] slurmd version 21.08.0 started
[2021-09-08T02:13:57.353] slurmd started on Wed, 08 Sep 2021 02:13:57 +0300
[2021-09-08T02:14:04.631] CPUs=64 Boards=1 Sockets=1 Cores=64 Threads=1 Memory=515613 TmpDisk=6000273 Uptime=480093 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-09-08T02:14:30.427] [4.extern] Considering each NUMA node as a socket
[2021-09-08T02:14:30.442] [4.extern] error: common_file_write_content: unable to write 8 bytes to cgroup /sys/fs/cgroup/cpuset/slurm/uid_100302/cpuset.cpus: Permission denied
[2021-09-08T02:14:30.470] [4.extern] task/cgroup: _memcg_initialize: job: alloc=8192MB mem.limit=8192MB memsw.limit=16384MB
[2021-09-08T02:14:30.470] [4.extern] task/cgroup: _memcg_initialize: step: alloc=8192MB mem.limit=8192MB memsw.limit=16384MB
[2021-09-08T02:14:30.472] [4.extern] error: _spawn_job_container: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
[2021-09-08T02:14:30.473] [4.extern] done with job
[2021-09-08T02:14:30.475] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No such process
[2021-09-08T02:14:30.477] Could not launch job 4 and not able to requeue it, cancelling job

$ scontrol show node gpu203-23-l
NodeName=gpu203-23-l Arch=x86_64 CoresPerSocket=64
   CPUAlloc=0 CPUTot=64 CPULoad=6.01
   AvailableFeatures=cpu_amd_epyc_7713,amd,milan,nolmem,local_200G,local_400G,local_500G,local_950G,gpu_a100,a100
   ActiveFeatures=cpu_amd_epyc_7713,amd,milan,nolmem,local_200G,local_400G,local_500G,local_950G,gpu_a100,a100
   Gres=gpu:a100:4(S:0)
   NodeAddr=gpu203-23-l NodeHostName=gpu203-23-l Version=21.08.0
   OS=Linux 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Apr 8 19:51:47 UTC 2021
   RealMemory=483328 AllocMem=0 FreeMem=479792 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=150 Owner=N/A MCS_label=N/A
   Partitions=ALL,batch,gpu24
   BootTime=2021-09-02T12:52:32 SlurmdStartTime=2021-09-08T02:18:12
   LastBusyTime=2021-09-08T02:22:04
   CfgTRES=cpu=64,mem=472G,billing=64,gres/gpu=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

I suspect that the cgroup cpuset values didn't change. Can you check with the same find command as you previously did?
# find /sys/fs/cgroup/cpuset/slurm/ -name cgroup.clone_children -exec echo '{}' \; -exec cat '{}' \;
If the cpuset values are still 2-63, then you'll need to reboot the nodes (or manually change the values) so the cgroup values can change. After rebooting, can you re-run the test and check the cpuset values again?
Before reboot:
# cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-63
# find /sys/fs/cgroup/cpuset/slurm/ -name cgroup.clone_children -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_8/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_8/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_5/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_5/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_4/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_4/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/cgroup.clone_children
0
gpu203-23-l was rebooted, then a job launched:
$ srun --gres gpu:1 --time 00:10:00 --pty /bin/bash -i
srun: job 10 queued and waiting for resources
srun: job 10 has been allocated resources
$
$ cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-63
[wickhagj@gpu203-23-l ~]$ find /sys/fs/cgroup/cpuset/slurm/ -name cgroup.clone_children -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_10/step_0/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_10/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_10/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/cgroup.clone_children
0
(In reply to Greg Wickham from comment #30)
> gpu203-23-l was rebooted, then a job launched:

Awesome, thanks for testing that! So it looks like the new cgroup code isn't handling CoreSpecCount or CpuSpecList correctly. I was having trouble reproducing this, but this should make it easier. I'll keep you updated on my progress and will try to get a fix in before 21.08.1 is released.

Greg,
Even when I manually remove my cgroup directories and then restart slurmd with CpuSpecList configured, I can't reproduce the errors you were seeing. Can you configure CpuSpecList again, stop Slurm, manually (as root) remove the slurm/ cgroup directories under /sys/fs/cgroup/cpuset, and then restart Slurm? Then can you run a job? Can you also run the find command to print out the cpuset values, and can you have DebugFlags=cgroup and SlurmdDebug=debug3 turned on at the same time?
find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
Hi Marshall,
Rebooted the node; put back the original configuration; ran the job:
$ srun --time 00:10:00 --gres gpu:1 --pty /bin/bash -i
srun: job 13 queued and waiting for resources
srun: job 13 has been allocated resources
srun: Force Terminated job 13
root@gpu203-23-l: ~ # find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/system/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_13/step_extern/cpuset.cpus
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_13/cpuset.cpus
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_12/step_extern/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_12/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/uid_100302/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/cpuset.cpus
2-63
Will attach the slurmd.log
Created attachment 21196 [details]
slurmd.log with Debug3 and Debugflags=cgroup
Hi Greg,

I was afraid that would still happen. I'll keep working on reproducing it.

Was that the entire slurmd log? If not, can you upload the whole slurmd log from when you made the change to slurm.conf until you ran the test? I'm particularly interested in the slurmd startup logs from when slurm.conf was changed to include CpuSpecList again.

Created attachment 21203 [details]
slurmd.log since 2021-09-08T04:02:02.059
Hi Marshall.
Uploaded the full log since rotation yesterday morning. If you need the previous log I can upload that too.
-Greg
Thanks, that log has what I was looking for.

Greg,

I found the problem. In 20.11, when we set cpuset.cpus for the UID directory, we set the job's allocated CPUs, but we also set the CPUs from the slurm cgroup directory. In 21.08, we made a mistake when refactoring this part of the code: instead of setting CPUs from the slurm cgroup directory, we are setting CPUs from the root cgroup directory.

The root cpuset cgroup directory always has *all* the CPUs in the system. When you have CpuSpecList (or CoreSpecCount), the slurm cgroup directory does *not* have all the CPUs in the system. So when we create the UID cpuset cgroup directory and try to set the root CPUs there, we get this permission denied error, because the parent (slurm) cgroup directory doesn't have access to all of those CPUs.

I'm looking into the best way to fix this.

Created attachment 21230 [details]
21.08 v1
Greg,
I'm attaching a patchset that seems to fix the issue for me. Basically, this sets the CPUs from the slurm cpuset directory (instead of the root cpuset directory) in the UID cpuset directory. Can you apply this patch to your 21.08 test system, run a test, and let me know whether or not it works? Even if it succeeds, can you upload the slurmd logs (with debug and the cgroup debugflag) as you've been doing, and can you also run the find command to show the cpuset.cpus values in all the subdirectories? (I don't care about the clone_children values - we know those are correct.)
This patchset has *not* gone through our peer review process, but since I've had some trouble exactly replicating what you are seeing I'm hoping to get some more data from you on your test system. (I can replicate it artificially by manually setting the slurm cgroup to exclude specific CPUs.)
(If you're interested in the patches, the first two patches are adding infrastructure, and the third patch is the actual fix. The third patch is straightforward.)
Thanks!
- Marshall
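To make the shape of the change concrete, here is a minimal C sketch of the inheritance logic Marshall describes: copy cpuset.cpus into the new UID cgroup from the slurm cgroup (which may exclude CpuSpecList cores) rather than from the root cgroup, which always lists every CPU. The paths and the helper are illustrative only, not the actual patch or Slurm's internal API.

/*
 * Illustrative sketch, not the real patch: inherit cpuset.cpus for a new
 * UID cgroup from the slurm cgroup instead of the root cgroup, so the
 * value written is always a subset of the parent and the EACCES error
 * cannot occur.
 */
#include <stdio.h>
#include <string.h>

#define CPUSET_BASE "/sys/fs/cgroup/cpuset"

/* Copy the contents of one cpuset.cpus file into another. */
static int copy_cpuset_cpus(const char *from, const char *to)
{
	char cpus[256] = "";
	FILE *in = fopen(from, "r");

	if (!in)
		return -1;
	if (!fgets(cpus, sizeof(cpus), in)) {
		fclose(in);
		return -1;
	}
	fclose(in);

	FILE *out = fopen(to, "w");
	if (!out)
		return -1;
	int rc = (fputs(cpus, out) == EOF) ? -1 : 0;
	fclose(out);
	return rc;
}

int main(void)
{
	/*
	 * The 21.08 regression sourced from CPUSET_BASE "/cpuset.cpus"
	 * (the root cgroup, always all CPUs). The fix sources from the
	 * slurm cgroup, which respects CpuSpecList.
	 */
	return copy_cpuset_cpus(CPUSET_BASE "/slurm/cpuset.cpus",
				CPUSET_BASE "/slurm/uid_100302/cpuset.cpus");
}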
Hi Marshall,
Success!
$ srun --gres gpu:1 --time 00:10:00 --pty /bin/bash -i
srun: job 14 queued and waiting for resources
srun: job 14 has been allocated resources
gpu203-23-l ~]$
. . and . .
# find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/system/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/cpuset.cpus
2-63
#
Will upload the slurmd.log separately.
Created attachment 21236 [details]
slurmd log since 2021-09-11T04:02:01.830
And a result of the find command with another job running:
# find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_15/step_0/cpuset.cpus
2-5
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_15/step_extern/cpuset.cpus
2-5
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_15/cpuset.cpus
2-5
/sys/fs/cgroup/cpuset/slurm/uid_100302/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/system/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/cpuset.cpus
2-63
#
Thanks Greg! That will be really helpful during our peer review process. I'll keep you updated on our progress in getting a fix in.

Hi Marshall,

Any updates on when this bug fix will be in an official release?

-Greg

The patch I gave you changes plugin ABI, so we aren't going to put that patch in 21.08. I need to figure out a different way to fix it that doesn't change plugin ABI so we can get a fix into 21.08.

Acknowledged.

-greg

Greg,

What's the distro and exact OS version of the compute nodes on this cluster?

Greg,

Can you also post the output of this command?

cat /sys/fs/cgroup/cpuset/cpuset.cpus

Previously we've only ever looked at cpuset.cpus under the slurm/ directory, but I'm interested in what it is at the cpuset/ directory.

Marshall,

# lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-ia32:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.9.2009 (Core)
Release:        7.9.2009
Codename:       Core

# uname -a
Linux gpu101-02-r 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Apr 8 19:51:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

# cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-63

CPU is "AMD EPYC 7713P 64-Core Processor"

-greg

Greg,

I still haven't been able to reproduce this locally - I always see all the CPUs in cpuset.cpus in the slurm/ cgroup directory, so I never get the permission denied errors. I don't understand why I'm not seeing the CpuSpecList CPUs removed from cpuset.cpus in the slurm/ directory, which is what you see.

We have discussed two possible fixes: the patch I gave you and a different possible fix. But both of these options will have to wait until 22.05. It's possible that it won't be able to be fixed for 21.08. Just continue to run with the patch that I've given you, and we will keep you updated on our progress.

- Marshall

Hi Marshall,

Thanks for the update. I'll chat with Ahmed tomorrow and see if we can refactor the configuration / build process to create a minimal bundle (configuration / build) that hopefully can be used to replicate the issue.

-Greg

Marshall,

BTW - have you tested on the same OS release / kernel version?

CentOS 7.9.2009
3.10.0-1160.24.1.el7.x86_64

-greg

(In reply to Greg Wickham from comment #76)
> Marshall,
>
> BTW - have you tested on the same OS release / kernel version?
>
> CentOS 7.9.2009
> 3.10.0-1160.24.1.el7.x86_64
>
> -greg

I did test on CentOS 7, but it was a different version. I haven't had a chance to set up a VM to test that specific version, and I'm not sure whether a VM will have the same behavior as bare metal with regard to cgroups. Also, since I know this is a problem and can clearly see the problem in the code, I wasn't as motivated to set up the VM since I know we need to fix this anyway.

Ok! Thanks! We'll try and find the smallest bundle that has the issue and will report back.

By the way - there was another issue with CpuSpecList (and CoreSpecCount) in 21.08 where they didn't actually constrain slurmd/slurmstepd to the proper CPUs anymore. It was tangential to the bug here, so I opened an internal bug to handle it. We just pushed a fix for it and it will be in 21.08.4.

Hi Marshall,

We now have a better understanding of the cause of this issue. These cgroup errors appear only on nodes mounting WekaFS. We dedicate cores 0 and 1 on clients which mount WekaFS using the "core=" mount option.
https://docs.weka.io/fs/mounting-filesystems

And we exclude these cores from SLURM using "CpuSpecList=0-1".

I tried booting nodes in 21.08 without Weka and jobs can be submitted normally. However, we have many nodes mounting Weka with the same configuration and setup in our 20.11 cluster, and we've never faced such errors there.

Thanks,
Ahmed

(In reply to Ahmed Essam ElMazaty from comment #82)
> Hi Marshall,
>
> We now have a better understanding of the cause of this issue.
> These cgroup errors appear only on nodes mounting WekaFS.
> We dedicate cores 0 and 1 on clients which mount WekaFS using the "core="
> mount option.
> https://docs.weka.io/fs/mounting-filesystems
>
> And we exclude these cores from SLURM using "CpuSpecList=0-1"

Thanks, that's really helpful. Can you run the following commands on a node with the following configurations?

# cat /sys/fs/cgroup/cpuset/cpuset.cpus
# cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus

(1) Run them on a node with Weka configured, with the core= option in Weka, and with CpuSpecList
(2) Run them on a node without Weka configured, but still with CpuSpecList
(3) Run them on a node without Weka configured, and without CpuSpecList

If it's easier to run these on three different nodes that are already configured in these different ways, that is fine.

> I tried booting nodes in 21.08 without Weka and jobs can be submitted
> normally.

Does this node configuration still have CpuSpecList=0-1?

> However, we have many nodes mounting Weka with the same configuration and
> setup in our 20.11 cluster. We've never faced such errors there.

Yes, the bug does not exist in 20.11. The bug was a regression in 21.08 due to refactoring that we did. I identified the bug in 21.08 and provided a fix for you to run locally - Greg told me that the patch I provided works. The problem is that the patch changes plugin ABI, and that's something we don't want to do in a micro (bug fix) release.

Another problem is that I haven't been able to reproduce the bug. But since you said it's caused by setting "Core=" with Weka, I have more suspicions about what may be happening.

When you have the core= option with Weka, are those cores visible to Slurm or any other process? I am not familiar with Weka, but if Weka prevents any other process from using those cores, then you shouldn't need to configure CpuSpecList=0-1 in Slurm. But if Weka doesn't prevent those cores from being used by other processes, then CpuSpecList is needed.

If you aren't sure, could you pass this question along to Weka support - does Weka use cgroups with the core= option? If so, what does Weka do? That will help me know what advice to give you about Slurm's CpuSpecList option.

Hi Marshall,

Thanks for your reply.

(In reply to Marshall Garey from comment #83)
> Thanks, that's really helpful. Can you run the following commands on a node
> with the following configurations?
>
> # cat /sys/fs/cgroup/cpuset/cpuset.cpus
> # cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
>
> (1) Run them on a node with Weka configured, with the core= option in Weka,
> and with CpuSpecList
> (2) Run them on a node without Weka configured, but still with CpuSpecList
> (3) Run them on a node without Weka configured, and without CpuSpecList
>
> If it's easier to run these on three different nodes that are already
> configured in these different ways, that is fine.

We'll test this and let you know the output soon.

> Does this node configuration still have CpuSpecList=0-1?

I tried both with and without "CpuSpecList". Didn't face any issues in either of them as long as Weka wasn't mounted.

> Another problem is that I haven't been able to reproduce the bug. But since
> you said it's caused by setting "Core=" with Weka, I have more suspicions
> about what may be happening.
>
> When you have the core= option with Weka, are those cores visible to Slurm
> or any other process? I am not familiar with Weka, but if Weka prevents any
> other process from using those cores, then you shouldn't need to configure
> CpuSpecList=0-1 in Slurm. But if Weka doesn't prevent those cores from being
> used by other processes, then CpuSpecList is needed.
>
> If you aren't sure, could you pass this question along to Weka support
> - does Weka use cgroups with the core= option? If so, what does Weka do?
> That will help me know what advice to give you about Slurm's CpuSpecList
> option.

Greg has forwarded these questions to Weka support and we'll get back to you with their detailed answers soon. From my previous experience with 20.11, the cores were visible to other processes and to SLURM allocations; that's why we're using CpuSpecList to prevent jobs from landing on these cores.

Thanks,
Ahmed

Hi Marshall,

Response from the WekaIO team is that they do use cgroups; however, to answer "if so, what does Weka do" requires some further digging on their part.

-Greg

Thanks Ahmed and Greg.
At the moment I am most interested in the results of this test - I'm hoping that this will give me enough information that I can reproduce this bug on my own machine:
> Thanks, that's really helpful. Can you run the following commands on a node
> with the following configurations?
>
> # cat /sys/fs/cgroup/cpuset/cpuset.cpus
> # cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
>
> (1) Run them on a node with Weka configured, with the core= option in Weka,
> and with CpuSpecList
> (2) Run them on a node without Weka configured, but still with CpuSpecList
> (3) Run them on a node without Weka configured, and without CpuSpecList
>
> If it's easier to run these on three different nodes that are already
> configured in these different ways, that is fine.
Just to reiterate - this is definitely a Slurm bug, and we have a fix which can definitely go into 22.05, although we are considering other options for that release. We'll be talking more about what we can do for 21.08. I also appreciate Weka support looking at how it uses cgroups; this will help us in the future.
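As an aside on the earlier question of whether the Weka-reserved cores stay visible to other processes: one simple check, unrelated to the Slurm patches themselves, is to print the CPUs a process is actually allowed to run on and see whether cores 0-1 appear. A minimal sketch using the standard Linux affinity API:

/* Sketch: print the CPUs the calling process may run on. Running this
 * inside and outside a Slurm job (or a Weka-managed cgroup) shows whether
 * the reserved cores 0-1 are still schedulable for that process. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t mask;

	if (sched_getaffinity(0, sizeof(mask), &mask)) {
		perror("sched_getaffinity");
		return 1;
	}
	printf("allowed cpus:");
	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &mask))
			printf(" %d", cpu);
	printf("\n");
	return 0;
}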
Hi Marshall,

I've added a 40-core node to our test cluster.

(In reply to Marshall Garey from comment #86)
> > Thanks, that's really helpful. Can you run the following commands on a node
> > with the following configurations?
> >
> > # cat /sys/fs/cgroup/cpuset/cpuset.cpus
> > # cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
> >
> > (1) Run them on a node with Weka configured, with the core= option in Weka,
> > and with CpuSpecList

[mazatyae@slurm-04 ~]$ srun --pty -t 1 bash -l
srun: job 90 queued and waiting for resources
srun: job 90 has been allocated resources
srun: Force Terminated job 90

root@cn605-26-l: ~ # cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-39
root@cn605-26-l: ~ # cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
2-39

> > (2) Run them on a node without Weka configured, but still with CpuSpecList

[mazatyae@slurm-04 ~]$ srun --pty -t 1 bash -l
srun: job 91 queued and waiting for resources
srun: job 91 has been allocated resources
[mazatyae@cn605-26-l ~]$

root@cn605-26-l: ~ # cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-39
root@cn605-26-l: ~ # cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
0-39

> > (3) Run them on a node without Weka configured, and without CpuSpecList

[mazatyae@slurm-04 ~]$ srun --pty -t 1 bash -l
srun: job 92 queued and waiting for resources
srun: job 92 has been allocated resources
[mazatyae@cn605-26-l ~]$

root@cn605-26-l: ~ # cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-39
root@cn605-26-l: ~ # cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
0-39

Thanks,
Ahmed

Hi Marshall,

Any updates on this issue?

Regards,
Ahmed

I recently submitted patches to our review queue to fix this issue in Slurm 21.08 and master (though the patches for each are quite different). Right now the patches are pending review.

Dear Marshall,

As an update, we've received this KB article from WekaIO engineers regarding using SLURM with Weka:
https://support.weka.io/s/article/Using-Slurm-or-another-job-scheduler-with-Weka

On the SLURM config side nothing needed to be changed; only the value of the "isolate_cpusets" parameter in the Weka configuration needed to be changed. However, changing this also did not help, as we're still getting the same error on 21.08.

Best regards,
Ahmed

I can't see that article since I don't have an account with Weka. Regardless, we're still in the review process for my patches to Slurm. Review has slowed down for us in the last few weeks due to holidays.

Hello,

Regarding the Weka article, I've pasted the interesting part in a comment in another bug:
https://bugs.schedmd.com/show_bug.cgi?id=13000#c5

Jean-Baptiste

We've just hit the problem. We work around it using the dedicated_mode=none option in the Weka mount for the moment.

Quick update - we're still in the review process but we've made some progress. I just pinged my colleague who is doing the review to see if we can get this finished.

Greg,

We've gone through several revisions of the patch and now we're at a version which we hope to check into 21.08. Can you test attachment 23267 [details] (the file is named bug12393_2108_v9.patch)? (I'm about to make this patch public again, so you'll get another email notification.)

The functionality of this patch should be the same, but since we haven't found a way to reproduce it locally and we don't want to break something accidentally, we'd appreciate it if you can test this and verify that it fixes the problem. We're thinking about releasing 21.08.6 in mid-February, so the sooner you can test it, the more likely it can make it into 21.08.6 (though still not a guarantee). Otherwise, it may slip to 21.08.7.

Thanks,
- Marshall

Comment on attachment 23267 [details]
21.08 v9
Making this patch public
Hi Marshall!

Using:

$ srun -V
slurm 21.08.5

NodeName=DEFAULT Gres="" CpuSpecList=0-1 Feature=dragon,cpu_intel_gold_6148,skylake,intel,ibex2018,nogpu,nolmem,local_200G,local_400G,local_500G RealMemory=375618 Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Weight=100
NodeName=cn605-26-l

(the node has 40 cores, but 2 cores are reserved for WekaIO)

$ srun -n 38 --time 00:10:00 --pty /bin/bash -i
srun: job 132 queued and waiting for resources
srun: job 132 has been allocated resources

[cn605-26-l ~]$ scontrol show node $(hostname)
NodeName=cn605-26-l Arch=x86_64 CoresPerSocket=20
   CPUAlloc=38 CPUTot=40 CPULoad=0.47
   AvailableFeatures=dragon,cpu_intel_gold_6148,skylake,intel,ibex2018,nogpu,nolmem,local_200G,local_400G,local_500G
   ActiveFeatures=dragon,cpu_intel_gold_6148,skylake,intel,ibex2018,nogpu,nolmem,local_200G,local_400G,local_500G
   Gres=(null)
   NodeAddr=cn605-26-l NodeHostName=cn605-26-l Version=21.08.5
   OS=Linux 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021
   RealMemory=375618 AllocMem=77824 FreeMem=370880 Sockets=2 Boards=1
   CoreSpecCount=2 CPUSpecList=0-1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
   Partitions=batch
   BootTime=2022-02-06T08:48:51 SlurmdStartTime=2022-02-06T08:56:06
   LastBusyTime=2022-02-06T08:57:25
   CfgTRES=cpu=40,mem=375618M,billing=40
   AllocTRES=cpu=38,mem=76G
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

$ exit

$ srun -n 39 --time 00:10:00 --pty /bin/bash -i
srun: error: Unable to allocate resources: Requested node configuration is not available

The patch appears to work. Are there any other specific tests / commands you would like the output of?

-greg

Interestingly:

$ srun --exclusive --time 00:10:00 --pty /bin/bash -i
srun: job 135 queued and waiting for resources
srun: job 135 has been allocated resources
[cn605-26-l ~]$ scontrol show job=135 | grep TRES
   TRES=cpu=40,mem=76G,node=1,billing=40
[cn605-26-l ~]$ squeue -j 135 --Format tres-alloc
TRES_ALLOC
cpu=40,mem=76G,node=
[cn605-26-l ~]$ set | grep SLURM | grep 38
SLURM_CPUS_ON_NODE=38
SLURM_JOB_CPUS_PER_NODE=38

This seems to indicate that the accounting when using '--exclusive' isn't deducting the CpuSpecList cores.

Thanks for testing that, Greg! For the accounting question - can you submit a new bug report about it? I agree that is strange behavior.

Greg,

Don't worry about creating a new bug for the accounting issue. I just created bug 13357 to track this. I made the bug public so that you can view it if you want.

Greg,

We pushed the patch to github ahead of 21.08.6:

commit 5b9f9d3fae97f291a7e5718a5e458a2568051806
Author: Marshall Garey <marshall@schedmd.com>
Date:   Thu Jan 27 22:28:06 2022 +0100

    NEWS for the previous three commits

    Bug 12393

commit d656f6e1df5364b0f088634464546c1e75a1aa37
Author: Marshall Garey <marshall@schedmd.com>
Date:   Fri Dec 3 15:32:15 2021 -0700

    Change variable name to reflect true behavior

    Continuation of the previous commit.

    Bug 12393

commit aedcbf80503d65087698f032063fc11518d36a65
Author: Marshall Garey <marshall@schedmd.com>
Date:   Thu Jan 27 22:27:35 2022 +0100

    Inherit correct limits for the UID cpuset cgroup.

    Fix regression in 21.08.0rc1 where job steps failed to set up the
    cpuset cgroup and thus could not launch on systems that reserved a CPU
    in a cgroup outside of Slurm (for example, on systems with WekaIO). On
    such systems, the slurm cpuset cgroup does not have access to all the
    CPUs. When a job step tried to create the UID cpuset cgroup, it tried
    to inherit the CPUs from the root cpuset cgroup. The root cpuset cgroup
    has access to all the CPUs, but on such a system the slurm cpuset
    cgroup does not. This results in a permission denied error and causes
    job steps to fail. Before 21.08, the UID cpuset cgroup always inherited
    the limits from the slurm cpuset cgroup, not the root cpuset cgroup.

    Bug 12393

commit f67b919f7e7dd73e3d5c6f7383f78c21ed85c445
Author: Felip Moll <felip.moll@schedmd.com>
Date:   Thu Jan 27 17:04:42 2022 +0100

    Keep track of the slurm cgroup

    We already track all the other cgroups representing every node of the
    hierarchy, namely the root, uid, job, step and task cgroups. But the
    slurm cgroup was not kept, and when we needed it we had to create and
    load it again each time. This allows us to get or set constraints from
    it directly.

    Bug 12393

Thanks for reporting this, for being really helpful with testing, and for being really patient as we took quite a while to finally get this in. I'm closing this as resolved/fixed.

*** Ticket 13338 has been marked as a duplicate of this ticket. ***
Created attachment 21093 [details]
Slurmd log (debug3)

Good morning.

Finally testing 21.08 and have an issue. During job setup there is always a failure that seems to be related to cgroups.

$ srun --nodelist gpu101-02-r --rese TESTING --gres gpu:1 --time 00:10:00 --pty /bin/bash -i
srun: job 2549 queued and waiting for resources
srun: job 2549 has been allocated resources
srun: Force Terminated job 2549

...

slurm.conf and log (debug3) from slurmd attached.