Ticket 10089

Summary: cpuset issue - No space left on device
Product: Slurm Reporter: Matt Ezell <ezellma>
Component: slurmd Assignee: Felip Moll <felip.moll>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: bart, nate
Version: 20.11.x   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=9244
https://bugs.schedmd.com/show_bug.cgi?id=12157
Site: ORNL-OLCF
Version Fixed: 20.02.6 20.11.0pre1

Description Matt Ezell 2020-10-28 18:44:43 MDT
I'm testing 20.11.0.0pre1 from ~Friday and ran into a problem setting up cgroups. The uid_# subdirectories don't get 'cpuset.mems' set, which prevents tasks from being placed into the cpuset.

[root@lyra17 ~]# srun -n4 --ntasks-per-gpu=1 /bin/bash -c "env|grep ROC"
slurmstepd: error: Failed to invoke task plugins: task_p_pre_launch error
slurmstepd: error: Failed to invoke task plugins: task_p_pre_launch error
slurmstepd: error: Failed to invoke task plugins: task_p_pre_launch error
slurmstepd: error: Failed to invoke task plugins: task_p_pre_launch error
srun: error: lyra17: tasks 0-3: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=3.0

[2020-10-28T20:27:18.974] [3.extern] Considering each NUMA node as a socket
[2020-10-28T20:27:18.983] [3.extern] error: _file_write_uint32s: write pid 29589 to /sys/fs/cgroup/cpuset/slurm/uid_0/job_3/step_extern/cgroup.procs failed: No space left on device
[2020-10-28T20:27:18.983] [3.extern] error: task_cgroup_cpuset_create: unable to add slurmstepd to cpuset cg '/sys/fs/cgroup/cpuset/slurm/uid_0/job_3/step_extern'
...
[2020-10-28T20:27:41.744] [3.0] error: _file_write_uint32s: write pid 29661 to /sys/fs/cgroup/cpuset/slurm/uid_0/job_3/step_0/cgroup.procs failed: No space left on device
[2020-10-28T20:27:41.744] [3.0] error: task_cgroup_cpuset_create: unable to add slurmstepd to cpuset cg '/sys/fs/cgroup/cpuset/slurm/uid_0/job_3/step_0'
[2020-10-28T20:27:41.745] [3.0] task/cgroup: _memcg_initialize: /slurm/uid_0/job_3: alloc=0MB mem.limit=257740MB memsw.limit=unlimited
[2020-10-28T20:27:41.745] [3.0] task/cgroup: _memcg_initialize: /slurm/uid_0/job_3/step_0: alloc=0MB mem.limit=257740MB memsw.limit=unlimited
[2020-10-28T20:27:41.831] [3.0] error: Failed to invoke task plugins: task_p_pre_launch error
[2020-10-28T20:27:41.831] [3.0] error: Failed to invoke task plugins: task_p_pre_launch error
[2020-10-28T20:27:41.831] [3.0] error: Failed to invoke task plugins: task_p_pre_launch error
[2020-10-28T20:27:41.831] [3.0] error: Failed to invoke task plugins: task_p_pre_launch error
[2020-10-28T20:27:44.000] [3.0] done with job


[root@lyra17 slurm]# cat /sys/fs/cgroup/cpuset/slurm/cpuset.mems
0-7
[root@lyra17 slurm]# cat /sys/fs/cgroup/cpuset/slurm/uid_0/cpuset.mems

[root@lyra17 slurm]# echo $$ > /sys/fs/cgroup/cpuset/slurm/uid_0/tasks
bash: echo: write error: No space left on device
[root@lyra17 slurm]# echo $$ > /sys/fs/cgroup/cpuset/slurm/tasks
[root@lyra17 slurm]# echo '0-7' > /sys/fs/cgroup/cpuset/slurm/uid_0/cpuset.mems
[root@lyra17 slurm]# echo $$ > /sys/fs/cgroup/cpuset/slurm/uid_0/tasks

After I set cpuset.mems for uid_0, subsequent jobs for UID 0 work.

Is this expected behavior?
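For reference, the manual steps above can be scripted. This is only a sketch of the workaround, not Slurm code: `propagate_cpusets` is a made-up helper name, and the only-fill-empty-files behavior mirrors the echo commands above. Dry-run it against a scratch directory before pointing it at a real cpuset hierarchy.

```shell
#!/usr/bin/env bash
# Sketch of the manual workaround above: recursively copy cpuset.mems and
# cpuset.cpus from each cgroup directory into any child whose copy is empty.
# propagate_cpusets is a hypothetical helper, not part of Slurm; point it at
# /sys/fs/cgroup/cpuset/slurm (or a scratch directory to test it safely).
propagate_cpusets() {
    local root="$1" child f
    for child in "$root"/*/; do
        [ -d "$child" ] || continue
        for f in cpuset.mems cpuset.cpus; do
            # Only fill empty child files, like the echo '0-7' step above
            if [ -s "$root/$f" ] && [ ! -s "$child$f" ]; then
                cat "$root/$f" > "$child$f"
            fi
        done
        propagate_cpusets "${child%/}"   # cover uid_*/job_*/step_* levels
    done
}
```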



# cat /etc/slurm/cgroup.conf 
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
TaskAffinity=no
AllowedRAMSpace=95
# grep -i cgroup /etc/slurm/slurm.conf 
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
Comment 1 Nate Rini 2020-10-28 19:39:53 MDT
Matt,

This looks like a dup of bug#9244. The current workaround is to copy the value of cpuset.mems from the parent cgroup into the child cgroups recursively. It appears to be caused by a race condition during startup.

--Nate
Comment 2 Matt Ezell 2020-10-29 06:42:59 MDT
(In reply to Nate Rini from comment #1)
> Matt,
> 
> This looks like a dup of bug#9244. The current workaround is to copy the
> value of cpuset.mems from the parent cgroup into the child cgroups
> recursively. It appears to be caused by a race condition during startup.
> 
> --Nate

I'm not authorized to see that bug. I'm not sure it's a race condition; it may just be how cgroups work in this kernel. There's a parameter called cgroup.clone_children that can impact how cgroups are created. That parameter seems somewhat controversial, as there have been patches to the kernel to remove it. Anyway:

[root@lyra16 slurm]# pwd
/sys/fs/cgroup/cpuset/slurm
[root@lyra16 slurm]# cat cpuset.mems
0-1
[root@lyra16 slurm]# cat cgroup.clone_children 
0
[root@lyra16 slurm]# mkdir matt
[root@lyra16 slurm]# cat matt/cpuset.mems

[root@lyra16 slurm]# echo 1 > cgroup.clone_children 
[root@lyra16 slurm]# mkdir matt2
[root@lyra16 slurm]# cat matt2/cpuset.mems
0-1

So I think the fix is either to have Slurm write 1 into $CPUSETDIR/slurm/cgroup.clone_children (if it exists) at startup, or to make sure it sets cpuset.mems for every subdirectory it creates.
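The first option can be sketched in a few lines. This is a hypothetical illustration only (`enable_clone_children` is a made-up name, and the actual fix belongs in Slurm's task/cgroup plugin, not in shell):

```shell
#!/usr/bin/env bash
# Sketch of the first proposed fix: at startup, turn on clone_children on
# the slurm cpuset root so the kernel copies cpuset.mems/cpuset.cpus into
# every cgroup created underneath it. enable_clone_children is a made-up
# helper name used only for illustration.
enable_clone_children() {
    local flag="$1/cgroup.clone_children"
    # The file only exists on cgroup v1 cpuset hierarchies, so check first
    if [ -f "$flag" ]; then
        echo 1 > "$flag"
    fi
}
```

With clone_children set to 1, new child cgroups inherit cpuset.mems automatically, as the matt2 experiment above shows; without it, Slurm has to write cpuset.mems itself at every level it creates.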
Comment 3 Felip Moll 2020-10-29 13:50:29 MDT
(In reply to Matt Ezell from comment #2)
> (In reply to Nate Rini from comment #1)
> > Matt,
> > 
> > This looks like a dup of bug#9244. The current workaround is to copy the
> > value of cpuset.mems from the parent cgroup into the child cgroups
> > recursively. It appears to be caused by a race condition during startup.
> > 
> > --Nate
> 
> I'm not authorized to see that bug. I'm not sure it's a race condition;
> it may just be how cgroups work in this kernel. There's a parameter called
> cgroup.clone_children that can impact how cgroups are created. That
> parameter seems somewhat controversial, as there have been patches to the
> kernel to remove it. Anyway:
> 
> [root@lyra16 slurm]# pwd
> /sys/fs/cgroup/cpuset/slurm
> [root@lyra16 slurm]# cat cpuset.mems
> 0-1
> [root@lyra16 slurm]# cat cgroup.clone_children 
> 0
> [root@lyra16 slurm]# mkdir matt
> [root@lyra16 slurm]# cat matt/cpuset.mems
> 
> [root@lyra16 slurm]# echo 1 > cgroup.clone_children 
> [root@lyra16 slurm]# mkdir matt2
> [root@lyra16 slurm]# cat matt2/cpuset.mems
> 0-1
> 
> So I think the fix is either to have Slurm write 1 into
> $CPUSETDIR/slurm/cgroup.clone_children (if it exists) at startup, or to
> make sure it sets cpuset.mems for every subdirectory it creates.

That's exactly what the patch I am working on does.
I will let you know when it is reviewed and done.

Can you point me to some reference? I am interested in this information:

> That parameter
> seems somewhat controversial, as there have been patches to the kernel to
> remove it.
Comment 4 Matt Ezell 2020-10-29 13:54:59 MDT
(In reply to Felip Moll from comment #3)
> Can you point me to some reference? I am interested in this information:
> 
> > That parameter
> > seems somewhat controversial, as there have been patches to the kernel to
> > remove it.

https://lists.linuxfoundation.org/pipermail/containers/2012-November/030813.html
https://lwn.net/Articles/547332/

In cgroupsV2 the file does not exist:

https://man7.org/linux/man-pages/man7/cgroups.7.html

> In addition, the cgroup.clone_children file that is employed by the cpuset controller has been removed.
Comment 5 Felip Moll 2020-10-30 07:37:00 MDT
(In reply to Matt Ezell from comment #4)
> (In reply to Felip Moll from comment #3)
> > Can you point me to some reference? I am interested in this information:
> > 
> > > That parameter
> > > seems somewhat controversial, as there have been patches to the kernel to
> > > remove it.
> 
> https://lists.linuxfoundation.org/pipermail/containers/2012-November/030813.html
> https://lwn.net/Articles/547332/
> 
> In cgroupsV2 the file does not exist:
> 
> https://man7.org/linux/man-pages/man7/cgroups.7.html
> 
> > In addition, the cgroup.clone_children file that is employed by the cpuset controller has been removed.

This is a very old discussion, dating back to 2012/13, and they finally decided to keep clone_children for cpuset.

cgroups v2 doesn't have this option, but it is an entirely new system that Slurm doesn't support yet, so it doesn't matter here.

Thanks for your comments
Comment 6 Felip Moll 2020-10-30 12:26:56 MDT
Matt, I have a patch pending for review. 

Will let you know when it is ready.
Comment 7 Felip Moll 2020-11-03 06:14:31 MST
Matt,

A fix has been applied to:

- 20.02.6 commit 666d2eedebac
- 20.11.0pre1 (master) commit cd20c16b169a

Please open a new bug or reopen this one if you still have issues after these patches.

Thanks