Ticket 12393 - Job launch always fails with cgroup error.
Summary: Job launch always fails with cgroup error.
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 21.08.0
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
QA Contact:
URL:
Duplicates: 13338
Depends on:
Blocks:
 
Reported: 2021-08-30 23:08 MDT by Greg Wickham
Modified: 2022-02-24 12:54 MST
CC List: 5 users

See Also:
Site: KAUST
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 21.08.6 22.05.0pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Slurmd log (debug3) (106.24 KB, text/plain)
2021-08-30 23:08 MDT, Greg Wickham
Details
slurm.conf (4.22 KB, text/plain)
2021-08-30 23:11 MDT, Greg Wickham
Details
cgroup.conf (689 bytes, text/x-matlab)
2021-08-31 13:08 MDT, Greg Wickham
Details
gres.conf (4.17 KB, text/plain)
2021-08-31 13:09 MDT, Greg Wickham
Details
slurmd.log with Debug3 and Debugflags=cgroup (77.75 KB, text/plain)
2021-09-09 06:48 MDT, Greg Wickham
Details
slurmd.log since 2021-09-08T04:02:02.059 (9.82 KB, application/x-bzip2)
2021-09-09 10:06 MDT, Greg Wickham
Details
21.08 v1 (16.23 KB, patch)
2021-09-10 16:38 MDT, Marshall Garey
Details | Diff
slurmd log since 2021-09-11T04:02:01.830 (18.86 KB, application/x-bzip2)
2021-09-11 06:10 MDT, Greg Wickham
Details
21.08 v9 (16.15 KB, patch)
2022-02-03 13:23 MST, Danny Auble
Details | Diff

Description Greg Wickham 2021-08-30 23:08:56 MDT
Created attachment 21093 [details]
Slurmd log (debug3)

Good morning.

Finally testing 21.08 and have an issue.

During job setup there is always a failure that seems to be related to cgroups.

$ srun --nodelist gpu101-02-r --rese TESTING --gres gpu:1 --time 00:10:00 --pty /bin/bash -i
srun: job 2549 queued and waiting for resources
srun: job 2549 has been allocated resources
srun: Force Terminated job 2549
...

slurm.conf and log (debug3) from slurmd attached.
Comment 1 Greg Wickham 2021-08-30 23:11:13 MDT
Created attachment 21094 [details]
slurm.conf
Comment 3 Jason Booth 2021-08-31 13:02:04 MDT
Please also attach your cgroup.conf and gres.conf. Did you reboot the nodes after upgrading? Also, in this test system are these real devices you are constraining or fake devices?
Comment 4 Greg Wickham 2021-08-31 13:08:23 MDT
Created attachment 21112 [details]
cgroup.conf
Comment 5 Greg Wickham 2021-08-31 13:09:43 MDT
Created attachment 21113 [details]
gres.conf
Comment 6 Greg Wickham 2021-08-31 13:10:35 MDT
(In reply to Jason Booth from comment #3)
> Please also attach your cgroup.conf and gres.conf. Did you reboot the nodes
> after upgrading? Also, in this test system are these real devices you are
> constraining or fake devices?

Nodes were rebooted for 20.11.0

All real devices.

   -Greg
Comment 7 Greg Wickham 2021-08-31 13:11:13 MDT
(In reply to Greg Wickham from comment #6)
> Nodes were rebooted for 20.11.0

Nodes were rebooted for . . 21.08 . .
Comment 8 Jason Booth 2021-08-31 15:42:11 MDT
Would you also send the output of "<cgroup_path>/cgroup.clone_children"?

I am running multiple slurmd's but here is what it looks like on my system.

> $:/sys/fs/cgroup/memory$ cat slurm_n1/cgroup.clone_children 
> 0
Comment 9 Jason Booth 2021-08-31 15:46:00 MDT
Also, do new jobs spawn with the default of 0 for cgroup.clone_children (inherited)?
Comment 11 Jason Booth 2021-08-31 15:57:48 MDT
Greg - it would also be helpful to know what is under these directories, where /sys/fs/cgroup is your cgroup mount location.



ls "/sys/fs/cgroup/memory"
ls "/sys/fs/cgroup/cpuset/"
ls "/sys/fs/cgroup/devices/"
ls "/sys/fs/cgroup/cpu"

The error comes from setting up each respective cgroup:

https://github.com/SchedMD/slurm/blob/slurm-21-08-0-1/src/plugins/task/cgroup/task_cgroup.c#L156
Comment 12 Greg Wickham 2021-09-01 07:56:13 MDT
Hi Jason,

We have some users on our test cluster so I can't swap between releases. Give me another day to spin up another cluster so I can provide the required information.

   -Greg
Comment 13 Marshall Garey 2021-09-01 15:17:58 MDT
Greg,

I'll be taking over this bug for Jason.

While we're waiting for the information, I just wanted to give you some background information about why we're asking for that info. We refactored our cgroup plugins for 21.08. One of the things that was changed was setting cgroup.clone_children to zero, since there were some problems with it (which look similar to the errors we see in your log) on some systems where clone_children was set to 1. See commit 2bf25fdaf2c20 for more information.

- Marshall
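
If you want a quick way to check that on your side, something along these lines should work (just a sketch, assuming the default /sys/fs/cgroup mount point and the standard slurm/ sub-directories):

find /sys/fs/cgroup/memory/slurm /sys/fs/cgroup/cpuset/slurm -name cgroup.clone_children -exec echo '{}' \; -exec cat '{}' \; 2>/dev/null

With the 21.08 code, each of those files should read 0.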
Comment 15 Greg Wickham 2021-09-02 05:24:11 MDT
Hi Marshall,

On our new test cluster:

$ srun --gres gpu:1 --time 00:10:00 --pty /bin/bash -i
srun: job 1 queued and waiting for resources
srun: job 1 has been allocated resources
srun: Force Terminated job 1

root@gpu203-23-l: ~ # ls "/sys/fs/cgroup/memory"
cgroup.clone_children           memory.kmem.slabinfo                memory.memsw.limit_in_bytes      memory.swappiness
cgroup.event_control            memory.kmem.tcp.failcnt             memory.memsw.max_usage_in_bytes  memory.usage_in_bytes
cgroup.procs                    memory.kmem.tcp.limit_in_bytes      memory.memsw.usage_in_bytes      memory.use_hierarchy
cgroup.sane_behavior            memory.kmem.tcp.max_usage_in_bytes  memory.move_charge_at_immigrate  notify_on_release
memory.failcnt                  memory.kmem.tcp.usage_in_bytes      memory.numa_stat                 release_agent
memory.force_empty              memory.kmem.usage_in_bytes          memory.oom_control               slurm
memory.kmem.failcnt             memory.limit_in_bytes               memory.pressure_level            system.slice
memory.kmem.limit_in_bytes      memory.max_usage_in_bytes           memory.soft_limit_in_bytes       tasks
memory.kmem.max_usage_in_bytes  memory.memsw.failcnt                memory.stat                      user.slice


root@gpu203-23-l: ~ # ls "/sys/fs/cgroup/cpuset/"
cgroup.clone_children  cpuset.cpus            cpuset.memory_migrate           cpuset.mems                      slurm
cgroup.event_control   cpuset.effective_cpus  cpuset.memory_pressure          cpuset.sched_load_balance        system
cgroup.procs           cpuset.effective_mems  cpuset.memory_pressure_enabled  cpuset.sched_relax_domain_level  tasks
cgroup.sane_behavior   cpuset.mem_exclusive   cpuset.memory_spread_page       notify_on_release                weka-client
cpuset.cpu_exclusive   cpuset.mem_hardwall    cpuset.memory_spread_slab       release_agent                    weka-default


root@gpu203-23-l: ~ # ls "/sys/fs/cgroup/devices/"
cgroup.clone_children  cgroup.procs          devices.allow  devices.list       release_agent  system.slice  user.slice
cgroup.event_control   cgroup.sane_behavior  devices.deny   notify_on_release  slurm          tasks

root@gpu203-23-l: ~ # ls "/sys/fs/cgroup/cpu"
cgroup.clone_children  cgroup.sane_behavior  cpuacct.usage_percpu  cpu.rt_period_us   cpu.stat           system.slice
cgroup.event_control   cpuacct.stat          cpu.cfs_period_us     cpu.rt_runtime_us  notify_on_release  tasks
cgroup.procs           cpuacct.usage         cpu.cfs_quota_us      cpu.shares         release_agent      user.slice


root@gpu203-23-l: /sys/fs/cgroup/memory # cat slurm/cgroup.clone_children
0
Comment 18 Marshall Garey 2021-09-02 08:33:28 MDT
Hi Greg,

Thanks for that data. Can you run these commands for me? Most likely these are okay, but I'm just double-checking a couple of things.

find /sys/fs/cgroup/cpuset/slurm/ -name cgroup.clone_children -exec echo '{}' \; -exec cat '{}' \;
find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.mems -exec echo '{}' \; -exec cat '{}' \;


Also, can you set DebugFlags=cgroup in slurm.conf (on all daemons), restart all daemons, re-run your test, then upload the slurmd log file? This should tell us more details about what is actually happening.
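
For reference, the relevant slurm.conf lines would look something like this (a sketch - if you already have a DebugFlags line, add cgroup to the existing comma-separated list instead):

DebugFlags=cgroup
SlurmdDebug=debug3

SlurmdDebug=debug3 is what you already used for the first log you attached, so only the DebugFlags line should be new.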
Comment 19 Greg Wickham 2021-09-02 10:41:40 MDT
# find /sys/fs/cgroup/cpuset/slurm/ -name cgroup.clone_children -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/system/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/cgroup.clone_children
0
# find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/system/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/step_extern/cpuset.cpus

/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/cpuset.cpus

/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/step_extern/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/uid_100302/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/cpuset.cpus
2-63
# find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.mems -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/system/cpuset.mems
0-3
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/step_extern/cpuset.mems

/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/cpuset.mems

/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/step_extern/cpuset.mems

/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/cpuset.mems

/sys/fs/cgroup/cpuset/slurm/uid_100302/cpuset.mems
0-3
/sys/fs/cgroup/cpuset/slurm/cpuset.mems
0-3
#
Comment 20 Greg Wickham 2021-09-02 10:43:45 MDT
Marshall,

I added "DebugFlags=cgroup" to slurm.conf and restarted slurmctld and slurmd.

This is all that was displayed:

[2021-09-02T19:37:46.881] slurmd version 21.08.0 started
[2021-09-02T19:37:46.891] slurmd started on Thu, 02 Sep 2021 19:37:46 +0300
[2021-09-02T19:37:54.170] CPUs=64 Boards=1 Sockets=1 Cores=64 Threads=1 Memory=515613 TmpDisk=6000273 Uptime=24322 CPUSpecList=0-1 FeaturesAvail=(null) FeaturesActive=(null)
[2021-09-02T19:38:26.579] [2.extern] Considering each NUMA node as a socket
[2021-09-02T19:38:26.602] [2.extern] error: common_file_write_content: unable to write 8 bytes to cgroup /sys/fs/cgroup/cpuset/slurm/uid_100302/cpuset.cpus: Permission denied
[2021-09-02T19:38:26.626] [2.extern] task/cgroup: _memcg_initialize: job: alloc=8192MB mem.limit=8192MB memsw.limit=16384MB
[2021-09-02T19:38:26.626] [2.extern] task/cgroup: _memcg_initialize: step: alloc=8192MB mem.limit=8192MB memsw.limit=16384MB
[2021-09-02T19:38:26.629] [2.extern] error: _spawn_job_container: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
[2021-09-02T19:38:26.630] [2.extern] done with job
[2021-09-02T19:38:26.635] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No such process
[2021-09-02T19:38:26.636] Could not launch job 2 and not able to requeue it, cancelling job

I must need to do something else?
Comment 21 Marshall Garey 2021-09-02 11:07:57 MDT
> I must need to do something else?

I think you set the debug flag correctly; it's likely failing before it reaches any log statements guarded by that debug flag, though. I'm digging into the code a bit more.
Comment 24 Marshall Garey 2021-09-07 11:26:59 MDT
Thanks for that info.

I noticed that cpuset.cpus is 2-63 everywhere you showed us.

What is the value of cpuset.cpus in /sys/fs/cgroup/cpuset?

cat /sys/fs/cgroup/cpuset/cpuset.cpus

It looks like you have CoreSpecCount (or CpuSpecList) set in the node definition. Can you upload the node definition from the configuration file for the node you are testing?

I wonder if Slurm is trying to set all CPUs in cpuset.cpus for a job even though it may not have access to all the CPUs.


Also, just a quick sanity check - is SlurmdUser=root? I see you have it commented out in slurm.conf, and since the default is root, it should be root unless you have it set in an included configuration file.

Just so you know, I believe the crux of this issue is the "permission denied" errors when trying to set cpuset.cpus, so that's what I'm trying to solve.
Comment 25 Greg Wickham 2021-09-07 13:04:24 MDT
Hi Marshall,

1/

$ cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-63

2/ Node Definition

NodeName=DEFAULT Gres=gpu:a100:4 CpuSpecList=0-1 Feature=cpu_amd_epyc_7713,amd,milan,nolmem,local_200G,local_400G,local_500G,local_950G,gpu_a100,a100 RealMemory=483328 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1  Weight=150

NodeName=gpu101-02-r

3/ slurmd runs as root

We're reserving two cores for file system operations - it's a 64-core CPU, with Slurm only able to allocate 62.
Comment 26 Marshall Garey 2021-09-07 15:17:54 MDT
Greg,

Can you try removing CpuSpecList=0-1 and run the test again? I just want to see if this works, and if it does then it points to Slurm's new cgroup code not properly handling CpuSpecList.
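
For clarity, the DEFAULT node line from comment 25 would then look roughly like this - everything unchanged except CpuSpecList dropped:

NodeName=DEFAULT Gres=gpu:a100:4 Feature=cpu_amd_epyc_7713,amd,milan,nolmem,local_200G,local_400G,local_500G,local_950G,gpu_a100,a100 RealMemory=483328 Sockets=1 CoresPerSocket=64 ThreadsPerCore=1 Weight=150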
Comment 27 Greg Wickham 2021-09-07 17:27:26 MDT
Hi Marshall,

No go.

Removed "CpuSpecList=0-1"


$ srun --nodelist gpu203-23-r --time 00:10:00 --gres gpu:1 --pty /bin/bash -i
srun: job 7 queued and waiting for resources
srun: job 7 has been allocated resources
srun: Force Terminated job 7


[2021-09-08T02:13:57.346] slurmd version 21.08.0 started
[2021-09-08T02:13:57.353] slurmd started on Wed, 08 Sep 2021 02:13:57 +0300
[2021-09-08T02:14:04.631] CPUs=64 Boards=1 Sockets=1 Cores=64 Threads=1 Memory=515613 TmpDisk=6000273 Uptime=480093 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2021-09-08T02:14:30.427] [4.extern] Considering each NUMA node as a socket
[2021-09-08T02:14:30.442] [4.extern] error: common_file_write_content: unable to write 8 bytes to cgroup /sys/fs/cgroup/cpuset/slurm/uid_100302/cpuset.cpus: Permission denied
[2021-09-08T02:14:30.470] [4.extern] task/cgroup: _memcg_initialize: job: alloc=8192MB mem.limit=8192MB memsw.limit=16384MB
[2021-09-08T02:14:30.470] [4.extern] task/cgroup: _memcg_initialize: step: alloc=8192MB mem.limit=8192MB memsw.limit=16384MB
[2021-09-08T02:14:30.472] [4.extern] error: _spawn_job_container: Failed to invoke task plugins: one of task_p_pre_setuid functions returned error
[2021-09-08T02:14:30.473] [4.extern] done with job
[2021-09-08T02:14:30.475] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No such process
[2021-09-08T02:14:30.477] Could not launch job 4 and not able to requeue it, cancelling job

$ scontrol show node gpu203-23-l
NodeName=gpu203-23-l Arch=x86_64 CoresPerSocket=64 
   CPUAlloc=0 CPUTot=64 CPULoad=6.01
   AvailableFeatures=cpu_amd_epyc_7713,amd,milan,nolmem,local_200G,local_400G,local_500G,local_950G,gpu_a100,a100
   ActiveFeatures=cpu_amd_epyc_7713,amd,milan,nolmem,local_200G,local_400G,local_500G,local_950G,gpu_a100,a100
   Gres=gpu:a100:4(S:0)
   NodeAddr=gpu203-23-l NodeHostName=gpu203-23-l Version=21.08.0
   OS=Linux 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Apr 8 19:51:47 UTC 2021 
   RealMemory=483328 AllocMem=0 FreeMem=479792 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=150 Owner=N/A MCS_label=N/A
   Partitions=ALL,batch,gpu24 
   BootTime=2021-09-02T12:52:32 SlurmdStartTime=2021-09-08T02:18:12
   LastBusyTime=2021-09-08T02:22:04
   CfgTRES=cpu=64,mem=472G,billing=64,gres/gpu=4
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment 28 Marshall Garey 2021-09-08 09:02:25 MDT
I suspect that the cgroup cpuset values didn't change. Can you check with the same find command as you previously did?

# find /sys/fs/cgroup/cpuset/slurm/ -name cgroup.clone_children -exec echo '{}' \; -exec cat '{}' \;

If the cpuset values are still 2-63, then you'll need to reboot the nodes or manually change them so they pick up the new configuration. Then can you re-run the test and check the cpuset values again after the reboot?
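
(If you'd rather not reboot: since the parent of slurm/ is the root cpuset, which still has 0-63, resetting the stale value by hand as root should also work - a sketch only:

echo 0-63 > /sys/fs/cgroup/cpuset/slurm/cpuset.cpus

and likewise for any existing uid_*/ directories underneath it. Rebooting is the cleaner option, though.)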
Comment 29 Greg Wickham 2021-09-08 09:44:14 MDT
Before reboot:

# cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-63

# find /sys/fs/cgroup/cpuset/slurm/ -name cgroup.clone_children -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_8/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_8/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_5/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_5/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_4/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_4/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_2/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_1/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/cgroup.clone_children
0
Comment 30 Greg Wickham 2021-09-08 10:41:36 MDT
gpu203-23-l was rebooted, then a job launched:

$ srun --gres gpu:1 --time 00:10:00 --pty /bin/bash -i
srun: job 10 queued and waiting for resources
srun: job 10 has been allocated resources
$ 


$ cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-63
[wickhagj@gpu203-23-l ~]$ find /sys/fs/cgroup/cpuset/slurm/ -name cgroup.clone_children -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_10/step_0/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_10/step_extern/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_10/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/uid_100302/cgroup.clone_children
0
/sys/fs/cgroup/cpuset/slurm/cgroup.clone_children
0
Comment 31 Marshall Garey 2021-09-08 10:59:03 MDT
(In reply to Greg Wickham from comment #30)
> gpu203-23-l was rebooted, then a job launched:

Awesome, thanks for testing that! So it looks like the new cgroup code isn't handling CoreSpecCount or CpuSpecList correctly.

I was having trouble reproducing this, but this should make it easier. I'll keep you updated on my progress and will try to get in a fix before 21.08.1 is released.
Comment 39 Marshall Garey 2021-09-08 15:35:53 MDT
Greg,

Even when I manually remove my cgroup directories and then restart slurmd with CpuSpecList configured, I can't reproduce the errors you were seeing. Can you configure CpuSpecList again, stop Slurm, manually (as root) remove the slurm/ cgroup directories under /sys/fs/cgroup/cpuset, and then restart Slurm? Then can you run a job, run the find command below to print out the cpuset values, and have DebugFlags=cgroup and SlurmdDebug=debug3 turned on at the same time?

find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
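
If it helps, one way to remove those directories by hand is something like this (a sketch only - cgroup directories can only be removed with rmdir, leaf-first, and only once no tasks are left in them):

find /sys/fs/cgroup/cpuset/slurm -depth -type d -exec rmdir '{}' \;

Run it with slurmd stopped and with no job steps still running on the node.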
Comment 41 Greg Wickham 2021-09-09 06:47:31 MDT
Hi Marshall,

Rebooted the node; put back the original configuration; ran the job:

$ srun --time 00:10:00 --gres gpu:1 --pty /bin/bash -i
srun: job 13 queued and waiting for resources
srun: job 13 has been allocated resources
srun: Force Terminated job 13

root@gpu203-23-l: ~ # find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/system/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_13/step_extern/cpuset.cpus

/sys/fs/cgroup/cpuset/slurm/uid_100302/job_13/cpuset.cpus

/sys/fs/cgroup/cpuset/slurm/uid_100302/job_12/step_extern/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_12/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/uid_100302/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/cpuset.cpus
2-63


Will attach the slurmd.log
Comment 42 Greg Wickham 2021-09-09 06:48:16 MDT
Created attachment 21196 [details]
slurmd.log with Debug3 and Debugflags=cgroup
Comment 43 Marshall Garey 2021-09-09 09:10:04 MDT
Hi Greg,

I was afraid that would still happen. I'll keep working on reproducing it.

Was that slurmd log the entire slurmd log? If not, can you upload the whole slurmd log from when you made the change to slurm.conf until you ran the test? I'm particularly interested in the logs at slurmd startup from when the slurm.conf was changed to include CpuSpecList again.
Comment 44 Greg Wickham 2021-09-09 10:06:53 MDT
Created attachment 21203 [details]
slurmd.log since 2021-09-08T04:02:02.059

Hi Marshall.

Uploaded the full log since rotation yesterday morning. If you need the previous log I can upload that too.

   -Greg
Comment 45 Marshall Garey 2021-09-09 10:15:58 MDT
Thanks, that log has what I was looking for.
Comment 52 Marshall Garey 2021-09-10 11:52:24 MDT
Greg,

I found the problem. In 20.11, when we set cpuset.cpus for the UID directory, we set the job's allocated CPUs, but we also included the CPUs from the slurm cgroup directory.

In 21.08, we made a mistake when refactoring this part of the code: instead of setting the CPUs from the slurm cgroup directory, we are setting the CPUs from the root cgroup directory. The root cpuset cgroup always has *all* the CPUs in the system, but when you have CpuSpecList (or CoreSpecCount), the slurm cgroup directory does *not*. So when we create the cpuset UID cgroup directory and try to set the root CPUs there, we get the permission denied error, because the parent (slurm) cgroup directory doesn't have access to all of those CPUs.

I'm looking into the best way to fix this.
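
To make that concrete with the paths from your node (purely an illustration of the constraint, not something you need to run):

cat /sys/fs/cgroup/cpuset/cpuset.cpus                            # root: 0-63
cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus                      # slurm: 2-63 (CpuSpecList=0-1 removed)
echo 0-63 > /sys/fs/cgroup/cpuset/slurm/uid_100302/cpuset.cpus   # rejected - the parent only has 2-63; this is the "Permission denied" in your slurmd log
echo 2-63 > /sys/fs/cgroup/cpuset/slurm/uid_100302/cpuset.cpus   # accepted - subset of the parent

21.08 is effectively doing the third line; the fix needs to do the fourth.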
Comment 54 Marshall Garey 2021-09-10 16:38:41 MDT
Created attachment 21230 [details]
21.08 v1

Greg,

I'm attaching a patchset that seems to fix the issue for me. Basically, this sets the CPUs from the slurm cpuset directory (instead of the root cpuset directory) in the UID cpuset directory. Can you apply this patch to your 21.08 test system, run a test, and let me know whether or not it works? Even if it succeeds, can you upload the slurmd logs (with debug and the cgroup debugflag) as you've been doing, and can you also run the find command to show the cpuset.cpus values in all the subdirectories? (I don't care about the clone_children values - we know those are correct.)

This patchset has *not* gone through our peer review process, but since I've had some trouble exactly replicating what you are seeing I'm hoping to get some more data from you on your test system. (I can replicate it artificially by manually setting the slurm cgroup to exclude specific CPUs.)

(If you're interested in the patches, the first two patches are adding infrastructure, and the third patch is the actual fix. The third patch is straightforward.)

Thanks!
- Marshall
Comment 56 Greg Wickham 2021-09-11 06:09:28 MDT
Hi Marshall,

Success!

$ srun --gres gpu:1 --time 00:10:00 --pty /bin/bash -i
srun: job 14 queued and waiting for resources
srun: job 14 has been allocated resources
gpu203-23-l ~]$

. . and . . 

# find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/system/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/cpuset.cpus
2-63
# 

Will upload the slurmd.log separately.
Comment 57 Greg Wickham 2021-09-11 06:10:59 MDT
Created attachment 21236 [details]
slurmd log since 2021-09-11T04:02:01.830
Comment 58 Greg Wickham 2021-09-11 06:12:02 MDT
And a result of the find command with another job running:

# find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_15/step_0/cpuset.cpus
2-5
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_15/step_extern/cpuset.cpus
2-5
/sys/fs/cgroup/cpuset/slurm/uid_100302/job_15/cpuset.cpus
2-5
/sys/fs/cgroup/cpuset/slurm/uid_100302/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/system/cpuset.cpus
2-63
/sys/fs/cgroup/cpuset/slurm/cpuset.cpus
2-63
#
Comment 59 Marshall Garey 2021-09-13 09:00:39 MDT
Thanks Greg! That will be really helpful during our peer review process.

I'll keep you updated on our progress in getting a fix in.
Comment 66 Greg Wickham 2021-10-05 21:28:29 MDT
Hi Marshall,

Any updates on when this bug fix will be in an official release?

  -Greg
Comment 67 Marshall Garey 2021-10-06 08:52:49 MDT
The patch I gave you changes the plugin ABI, so we aren't going to put it into 21.08 as-is. I need to figure out a different way to fix it that doesn't change the plugin ABI, so we can get a fix into 21.08.
Comment 68 Greg Wickham 2021-10-06 08:54:39 MDT
Acknowledged.

   -greg
Comment 69 Marshall Garey 2021-10-06 10:03:48 MDT
Greg,

What's the distro and exact OS version of compute nodes on this cluster?
Comment 71 Marshall Garey 2021-10-06 10:36:59 MDT
Greg,

Can you also post the output of this command?

cat /sys/fs/cgroup/cpuset/cpuset.cpus

Previously we've only ever looked at cpuset.cpus under the slurm/ directory, but I'm interested in what it is at the cpuset/ directory.
Comment 72 Greg Wickham 2021-10-06 12:04:14 MDT
Marshall,

# lsb_release -a
LSB Version:	:core-4.1-amd64:core-4.1-ia32:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID:	CentOS
Description:	CentOS Linux release 7.9.2009 (Core)
Release:	7.9.2009
Codename:	Core

# uname -a
Linux gpu101-02-r 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Apr 8 19:51:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

# cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-63

CPU is "AMD EPYC 7713P 64-Core Processor"

   -greg
Comment 73 Marshall Garey 2021-11-02 11:26:21 MDT
Greg,

I still haven't been able to reproduce this locally - I always see all the CPUs in cpuset.cpus in the slurm/ cgroup directory, so I never get the permission denied errors. I don't understand why I'm not seeing the CpuSpecList cpus removed from cpuset.cpus in the slurm/ directory, which is what you see.


We have discussed two possible fixes: the patch I gave you and a different approach. Both of these options will have to wait until 22.05, and it's possible this won't be fixable in 21.08 at all. Just continue to run with the patch that I've given you, and we will keep you updated on our progress.

- Marshall
Comment 75 Greg Wickham 2021-11-02 12:11:55 MDT
Hi Marshall,

Thanks for the update.

I'll chat with Ahmed tomorrow and see if we can refactor the configuration / build process to create a minimal bundle that hopefully can be used to replicate the issue.

   -Greg
Comment 76 Greg Wickham 2021-11-02 12:13:22 MDT
Marshall,

BTW - have you tested on the same OS Release / Kernel Version?

CentOS 7.9.2009
3.10.0-1160.24.1.el7.x86_64

   -greg
Comment 77 Marshall Garey 2021-11-02 13:11:20 MDT
(In reply to Greg Wickham from comment #76)
> Marhsall,
> 
> BTW - have you tested on the same OS Release / Kernel Version?
> 
> CentOS 7.9.2009
> 3.10.0-1160.24.1.el7.x86_64
> 
>    -greg

I did test on CentOS 7, but it was a different version. I haven't had a chance to set up a VM to test that specific version, and I'm not sure whether a VM would behave the same as bare metal with regard to cgroups. Also, since I know this is a problem and can clearly see it in the code, I wasn't as motivated to set up the VM - we need to fix this anyway.
Comment 78 Greg Wickham 2021-11-02 13:18:13 MDT
Ok! Thanks! We'll try and find the smallest bundle that has the issue and will report back.
Comment 81 Marshall Garey 2021-11-12 15:32:41 MST
By the way -

There was another issue with CpuSpecList (and CoreSpecCount) in 21.08 where they didn't actually constrain slurmd/slurmstepd to the proper CPUs anymore. It was tangential to the bug here, so I opened an internal bug to handle it. We just pushed a fix for it and it will be in 21.08.4.
Comment 82 Ahmed Essam ElMazaty 2021-11-14 03:24:47 MST
Hi Marshall,

We now have a better understanding of the cause of this issue.
These cgroup errors appear only on nodes mounting WekaFS.
We dedicate cores 0 and 1 on clients that mount WekaFS, using the "core=" mount option.
https://docs.weka.io/fs/mounting-filesystems

We exclude these cores from Slurm using "CpuSpecList=0-1".

I tried booting nodes in 21.08 without Weka, and jobs can be submitted normally.
However, we have many nodes mounting Weka with the same configuration and setup in our 20.11 cluster, and we've never faced such errors there.

Thanks,
Ahmed
Comment 83 Marshall Garey 2021-11-15 12:09:36 MST
(In reply to Ahmed Essam ElMazaty from comment #82)
> Hi Marshall,
>
> We have now better understanding about the cause of this issue.
> These cgroup errors appear only on nodes mounting WekaFS.
> We dedicate cores 0 and 1 on clients which mount WekaFS using "core=" mount
> option.
> https://docs.weka.io/fs/mounting-filesystems
>
> And we exclude these cores from SLURM using "CpuSpecList=0-1"

Thanks, that's really helpful. Can you run the following commands on a node with the following configurations?

# cat /sys/fs/cgroup/cpuset/cpuset.cpus
# cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus

(1) Run them on a node with Weka configured, with the core= option in Weka, and with CpuSpecList
(2) Run them on a node without Weka configured, but still with CpuSpecList
(3) Run them on a node without Weka configured, and without CpuSpecList

If it's easier to run these on three different nodes that are already configured in these different ways, that is fine.


> I tried booting nodes in 21.08 without Weka and jobs can be submitted
> normally.

Does this node configuration still have CpuSpecList=0-1?


> However We have many nodes mounting Weka with the same configuration and
> setup in our 20.11 cluster. We've never faced such errors there.

Yes, the bug does not exist in 20.11. The bug was a regression in 21.08 due to refactoring that we did. I identified the bug in 21.08 and provided a fix for you to run locally - Greg told me that the patch I provided works. The problem is that the patch changes plugin ABI and that's something we don't want to do in a micro (bug fix) release.

Another problem is that I haven't been able to reproduce the bug. But since you said it's caused by setting "Core=" with Weka, I have more suspicions about what may be happening.

When you have the core= option with Weka, are those cores visible to Slurm or any other process? I am not familiar with Weka, but if Weka prevents any other process from using those cores, then you shouldn't need to configure CpuSpecList=0-1 in Slurm. But if Weka doesn't prevent those cores from being used by other processes, then CpuSpecList is needed.

If you aren't sure, then could you pass this question along to Weka support - does Weka use cgroups with the core= option? If so, what does Weka do? That will help me know what advice to give you about Slurm's CpuSpecList option.
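
One thing you could check in the meantime - and this is a guess on my part, based on the weka-client and weka-default entries in your earlier ls of /sys/fs/cgroup/cpuset/ - is whether Weka creates its own cpuset cgroups for the dedicated cores:

cat /sys/fs/cgroup/cpuset/weka-client/cpuset.cpus
cat /sys/fs/cgroup/cpuset/weka-default/cpuset.cpus

If those exist, the values would show which CPUs Weka claims for itself.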
Comment 84 Ahmed Essam ElMazaty 2021-11-16 01:01:16 MST
Hi Marshall,
Thanks for your reply

(In reply to Marshall Garey from comment #83)
> (In reply to Ahmed Essam ElMazaty from comment #82)

> 
> Thanks, that's really helpful. Can you run the following commands on a node
> with the following configurations?
> 
> # cat /sys/fs/cgroup/cpuset/cpuset.cpus
> # cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
> 
> (1) Run them on a node with Weka configured, with the core= option in Weka,
> and with CpuSpecList
> (2) Run them on a node without Weka configured, but still with CpuSpecList
> (3) Run them on a node without Weka configured, and without CpuSpecList
> 
> If it's easier to run these on three different nodes that are already
> configured in these different ways, that is fine.

We'll test this and let you know the output soon


> Does this node configuration still have CpuSpecList=0-1?
>
I tried both with and without "CpuSpecList", and didn't face any issues in either case as long as Weka wasn't mounted.



> 
> Another problem is that I haven't been able to reproduce the bug. But since
> you said it's caused by setting "Core=" with Weka, I have more suspicions
> about what may be happening.
> 
> When you have the core= option with Weka, are those cores visible to Slurm
> or any other process? I am not familiar with Weka, but if Weka prevents any
> other process from using those cores, then you shouldn't need to configure
> CpuSpecList=0-1 in Slurm. But if Weka doesn't prevent those cores from being
> used by other processes, then CpuSpecList is needed.
> 
> If you aren't sure, then could you pass this question along to Weka support
> - does Weka use cgroups with the core= option? If so, what does Weka do?
> That will help me know what advice to give you about Slurm's CpuSpecList
> option.

Greg has forwarded these questions to Weka support and we'll get back to you with their detailed answers soon. From my previous experience with 20.11, the cores were visible to other processes and to Slurm allocations; that's why we're using CpuSpecList to prevent jobs from landing on those cores.

Thanks,
Ahmed
Comment 85 Greg Wickham 2021-11-16 07:13:59 MST
Hi Marshall,

The response from the WekaIO team is that they do use cgroups; however, answering 'if so, what does Weka do' requires some further digging on their part.

   -Greg
Comment 86 Marshall Garey 2021-11-16 09:18:53 MST
Thanks Ahmed and Greg.

At the moment I am most interested in the results of this test - I'm hoping that this will give me enough information that I can reproduce this bug on my own machine:

 
> Thanks, that's really helpful. Can you run the following commands on a node
> with the following configurations?
> 
> # cat /sys/fs/cgroup/cpuset/cpuset.cpus
> # cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
> 
> (1) Run them on a node with Weka configured, with the core= option in Weka,
> and with CpuSpecList
> (2) Run them on a node without Weka configured, but still with CpuSpecList
> (3) Run them on a node without Weka configured, and without CpuSpecList
> 
> If it's easier to run these on three different nodes that are already
> configured in these different ways, that is fine.


Just to reiterate - this is definitely a Slurm bug, and we have a fix that can definitely go into 22.05, although we are still considering other approaches for that release. We'll also be talking more about what we can do for 21.08. I appreciate Weka support looking into how it uses cgroups; this will help us in the future.
Comment 87 Ahmed Essam ElMazaty 2021-11-16 12:25:59 MST
Hi Marshall,
I've added a 40-core node to our test cluster.
(In reply to Marshall Garey from comment #86)

> > Thanks, that's really helpful. Can you run the following commands on a node
> > with the following configurations?
> > 
> > # cat /sys/fs/cgroup/cpuset/cpuset.cpus
> > # cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
> > 
> > (1) Run them on a node with Weka configured, with the core= option in Weka,
> > and with CpuSpecList
[mazatyae@slurm-04 ~]$ srun --pty -t 1  bash -l
srun: job 90 queued and waiting for resources
srun: job 90 has been allocated resources
srun: Force Terminated job 90

root@cn605-26-l: ~ # cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-39
root@cn605-26-l: ~ #  cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
2-39

> > (2) Run them on a node without Weka configured, but still with CpuSpecList
[mazatyae@slurm-04 ~]$ srun --pty -t 1  bash -l
srun: job 91 queued and waiting for resources
srun: job 91 has been allocated resources
[mazatyae@cn605-26-l ~]$

root@cn605-26-l: ~ # cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-39
root@cn605-26-l: ~ # cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
0-39


> > (3) Run them on a node without Weka configured, and without CpuSpecList

[mazatyae@slurm-04 ~]$ srun --pty -t 1  bash -l
srun: job 92 queued and waiting for resources
srun: job 92 has been allocated resources
[mazatyae@cn605-26-l ~]$ 

root@cn605-26-l: ~ # cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-39
root@cn605-26-l: ~ # cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
0-39


Thanks,
Ahmed
Comment 95 Ahmed Essam ElMazaty 2021-12-14 02:51:24 MST
Hi Marshall,
Any updates on this issue?
Regards,
Ahmed
Comment 96 Marshall Garey 2021-12-14 08:36:19 MST
I recently submitted patches to our review queue to fix this issue in Slurm 21.08 and master (though the patches for each are quite different). So right now the patches are pending review.
Comment 98 Ahmed Essam ElMazaty 2022-01-09 03:58:48 MST
Dear Marshall,

As an update, we've received this KB article from the WekaIO engineers regarding using Slurm with Weka.

https://support.weka.io/s/article/Using-Slurm-or-another-job-scheduler-with-Weka

On the Slurm config side, nothing needed to be changed; only the value of the "isolate_cpusets" parameter in the Weka configuration needed to change.
However, changing this also did not help, as we're still getting the same error on 21.08.

Best regards,
Ahmed
Comment 99 Marshall Garey 2022-01-10 13:10:50 MST
I can't see that article since I don't have an account with Weka. Regardless, we're still in the review process for my patches to Slurm. Review has slowed down for us in the last few weeks due to holidays.
Comment 101 Institut Pasteur HPC Admin 2022-01-11 13:49:03 MST
Hello,

Regarding the Weka article, I've pasted the interesting part in a comment on another bug:

https://bugs.schedmd.com/show_bug.cgi?id=13000#c5

Jean-Baptiste
Comment 102 Institut Pasteur HPC Admin 2022-01-11 14:03:31 MST
We've just hit the problem. For the moment we work around it using the dedicated_mode=none option in the Weka mount.
Comment 103 Marshall Garey 2022-01-26 14:53:23 MST
Quick update - we're still in the review process but we've made some progress. I just pinged my colleague who is doing the review to see if we can get this finished.
Comment 120 Marshall Garey 2022-02-03 14:29:59 MST
Greg,

We've gone through several revisions of the patch and now we're at a version which we hope to check into 21.08. Can you test attachment 23267 [details] (the file is named bug12393_2108_v9.patch)? (I'm about to make this patch public again, so you'll get another email notification.)

The functionality of this patch should be the same as the earlier version, but since we haven't found a way to reproduce the problem locally and we don't want to break something accidentally, we'd appreciate it if you could test this and verify that it fixes the problem. We're thinking about releasing 21.08.6 in mid-February, so the sooner you can test it, the more likely it can make it into 21.08.6 (though still not a guarantee). Otherwise, it may slip to 21.08.7.

Thanks,
- Marshall
Comment 121 Marshall Garey 2022-02-03 14:30:17 MST
Comment on attachment 23267 [details]
21.08 v9

Making this patch public
Comment 122 Greg Wickham 2022-02-05 23:02:45 MST
Hi Marshall!

Using:

$ srun -V
slurm 21.08.5

NodeName=DEFAULT Gres="" CpuSpecList=0-1 Feature=dragon,cpu_intel_gold_6148,skylake,intel,ibex2018,nogpu,nolmem,local_200G,local_400G,local_500G RealMemory=375618 Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 Weight=100
NodeName=cn605-26-l

(the node has 40 cores, but 2 cores are reserved for WekaIO)

$ srun -n 38 --time 00:10:00 --pty /bin/bash -i
srun: job 132 queued and waiting for resources
srun: job 132 has been allocated resources

[cn605-26-l ~]$ scontrol show node $(hostname)
NodeName=cn605-26-l Arch=x86_64 CoresPerSocket=20 
   CPUAlloc=38 CPUTot=40 CPULoad=0.47
   AvailableFeatures=dragon,cpu_intel_gold_6148,skylake,intel,ibex2018,nogpu,nolmem,local_200G,local_400G,local_500G
   ActiveFeatures=dragon,cpu_intel_gold_6148,skylake,intel,ibex2018,nogpu,nolmem,local_200G,local_400G,local_500G
   Gres=(null)
   NodeAddr=cn605-26-l NodeHostName=cn605-26-l Version=21.08.5
   OS=Linux 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021 
   RealMemory=375618 AllocMem=77824 FreeMem=370880 Sockets=2 Boards=1
   CoreSpecCount=2 CPUSpecList=0-1 
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
   Partitions=batch 
   BootTime=2022-02-06T08:48:51 SlurmdStartTime=2022-02-06T08:56:06
   LastBusyTime=2022-02-06T08:57:25
   CfgTRES=cpu=40,mem=375618M,billing=40
   AllocTRES=cpu=38,mem=76G
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
$ exit

$ srun -n 39 --time 00:10:00 --pty /bin/bash -i
srun: error: Unable to allocate resources: Requested node configuration is not available

The patch appears to work.

Are there any other specific tests / commands you would like the output of?

   -greg
Comment 123 Greg Wickham 2022-02-05 23:08:56 MST
Interestingly:

$ srun --exclusive --time 00:10:00 --pty /bin/bash -i
srun: job 135 queued and waiting for resources
srun: job 135 has been allocated resources
[cn605-26-l ~]$ scontrol show job=135 | grep TRES
   TRES=cpu=40,mem=76G,node=1,billing=40
[cn605-26-l ~]$ squeue -j 135 --Format tres-alloc
TRES_ALLOC          
cpu=40,mem=76G,node=
[cn605-26-l ~]$ set | grep SLURM | grep 38
SLURM_CPUS_ON_NODE=38
SLURM_JOB_CPUS_PER_NODE=38

This seems to indicate that the accounting when using '--exclusive' isn't deducting the CpuSpecList cores.
Comment 125 Marshall Garey 2022-02-07 14:20:00 MST
Thanks for testing that, Greg!

For the accounting question - can you submit a new bug report about it? I agree that is strange behavior.
Comment 126 Marshall Garey 2022-02-07 14:47:19 MST
Greg,

Don't worry about creating a new bug for the accounting issue. I just barely created bug 13357 to track this. I made the bug public so that you can view it if you want.
Comment 130 Marshall Garey 2022-02-07 16:54:52 MST
Greg,

We pushed the patch to github ahead of 21.08.6:


commit 5b9f9d3fae97f291a7e5718a5e458a2568051806
Author: Marshall Garey <marshall@schedmd.com>
Date:   Thu Jan 27 22:28:06 2022 +0100

    NEWS for the previous three commits
    
    Bug 12393

commit d656f6e1df5364b0f088634464546c1e75a1aa37
Author: Marshall Garey <marshall@schedmd.com>
Date:   Fri Dec 3 15:32:15 2021 -0700

    Change variable name to reflect true behavior
    
    Continuation of the previous commit.
    
    Bug 12393

commit aedcbf80503d65087698f032063fc11518d36a65
Author: Marshall Garey <marshall@schedmd.com>
Date:   Thu Jan 27 22:27:35 2022 +0100

    Inherit correct limits for the UID cpuset cgroup.
    
    Fix regression in 21.08.0rc1 where job steps failed to setup the cpuset
    cgroup and thus job steps could not launch on systems that reserved a CPU
    in a cgroup outside of Slurm (for example, on systems with WekaIO).
    
    On such systems, the slurm cpuset cgroup does not have access to all the
    CPUs. When a job step tried to create the UID cpuset cgroup, it tried to
    inherit the CPUs from the root cpuset cgroup. The root cpuset cgroup has
    access to all the CPUs, but for this system the slurm cpuset cgroup does
    not have access to all the CPUs. This results in a permission denied
    error and causes job steps to fail.
    
    Before 21.08, the UID cpuset cgroup always inherited the limits from the
    slurm cpuset cgroup, not the root cpuset cgroup.
    
    Bug 12393

commit f67b919f7e7dd73e3d5c6f7383f78c21ed85c445
Author: Felip Moll <felip.moll@schedmd.com>
Date:   Thu Jan 27 17:04:42 2022 +0100

    Keep track of the slurm cgroup
    
    We are already tracking all the other cgroups representing every node of
    the hierarchy, namely root, uid, job, step and task cgroups. But the slurm
    cgroup was not kept and when we needed it we had to create and load it again
    each time. This will allow to get or set constrains from it directly.
    
    Bug 12393
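
Once you're on 21.08.6 (or running with those commits applied), the same find you've been using should again show the uid_*/job_* cpuset.cpus values inherited from slurm/, along the lines of what you saw with the v1 patch in comment 58:

find /sys/fs/cgroup/cpuset/slurm/ -name cpuset.cpus -exec echo '{}' \; -exec cat '{}' \;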



Thanks for reporting this, for being really helpful with testing, and for being really patient as we took quite a while to finally get this in.

I'm closing this as resolved/fixed.
Comment 131 Tim McMullan 2022-02-24 12:54:33 MST
*** Ticket 13338 has been marked as a duplicate of this ticket. ***