Ticket 10089 - cpuset issue - No space left on device
Summary: cpuset issue - No space left on device
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 20.11.x
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
 
Reported: 2020-10-28 18:44 MDT by Matt Ezell
Modified: 2021-08-19 01:47 MDT

Site: ORNL-OLCF
Version Fixed: 20.02.6, 20.11.0pre1


Description Matt Ezell 2020-10-28 18:44:43 MDT
I'm testing 20.11.0pre1 from ~Friday and ran into a problem setting up cgroups. The uid_# subdirectories don't get 'cpuset.mems' set, which prevents tasks from being added to the cpuset.

[root@lyra17 ~]# srun -n4 --ntasks-per-gpu=1 /bin/bash -c "env|grep ROC"
slurmstepd: error: Failed to invoke task plugins: task_p_pre_launch error
slurmstepd: error: Failed to invoke task plugins: task_p_pre_launch error
slurmstepd: error: Failed to invoke task plugins: task_p_pre_launch error
slurmstepd: error: Failed to invoke task plugins: task_p_pre_launch error
srun: error: lyra17: tasks 0-3: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=3.0

[2020-10-28T20:27:18.974] [3.extern] Considering each NUMA node as a socket
[2020-10-28T20:27:18.983] [3.extern] error: _file_write_uint32s: write pid 29589 to /sys/fs/cgroup/cpuset/slurm/uid_0/job_3/step_extern/cgroup.procs failed: No space left on device
[2020-10-28T20:27:18.983] [3.extern] error: task_cgroup_cpuset_create: unable to add slurmstepd to cpuset cg '/sys/fs/cgroup/cpuset/slurm/uid_0/job_3/step_extern'
...
[2020-10-28T20:27:41.744] [3.0] error: _file_write_uint32s: write pid 29661 to /sys/fs/cgroup/cpuset/slurm/uid_0/job_3/step_0/cgroup.procs failed: No space left on device
[2020-10-28T20:27:41.744] [3.0] error: task_cgroup_cpuset_create: unable to add slurmstepd to cpuset cg '/sys/fs/cgroup/cpuset/slurm/uid_0/job_3/step_0'
[2020-10-28T20:27:41.745] [3.0] task/cgroup: _memcg_initialize: /slurm/uid_0/job_3: alloc=0MB mem.limit=257740MB memsw.limit=unlimited
[2020-10-28T20:27:41.745] [3.0] task/cgroup: _memcg_initialize: /slurm/uid_0/job_3/step_0: alloc=0MB mem.limit=257740MB memsw.limit=unlimited
[2020-10-28T20:27:41.831] [3.0] error: Failed to invoke task plugins: task_p_pre_launch error
[2020-10-28T20:27:41.831] [3.0] error: Failed to invoke task plugins: task_p_pre_launch error
[2020-10-28T20:27:41.831] [3.0] error: Failed to invoke task plugins: task_p_pre_launch error
[2020-10-28T20:27:41.831] [3.0] error: Failed to invoke task plugins: task_p_pre_launch error
[2020-10-28T20:27:44.000] [3.0] done with job


[root@lyra17 slurm]# cat /sys/fs/cgroup/cpuset/slurm/cpuset.mems
0-7
[root@lyra17 slurm]# cat /sys/fs/cgroup/cpuset/slurm/uid_0/cpuset.mems

[root@lyra17 slurm]# echo $$ > /sys/fs/cgroup/cpuset/slurm/uid_0/tasks
bash: echo: write error: No space left on device
[root@lyra17 slurm]# echo $$ > /sys/fs/cgroup/cpuset/slurm/tasks
[root@lyra17 slurm]# echo '0-7' > /sys/fs/cgroup/cpuset/slurm/uid_0/cpuset.mems
[root@lyra17 slurm]# echo $$ > /sys/fs/cgroup/cpuset/slurm/uid_0/tasks

After I set cpuset.mems for uid_0, subsequent jobs for UID 0 work.

Is this expected behavior?



# cat /etc/slurm/cgroup.conf 
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
TaskAffinity=no
AllowedRAMSpace=95
# grep -i cgroup /etc/slurm/slurm.conf 
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
Comment 1 Nate Rini 2020-10-28 19:39:53 MDT
Matt,

This looks like a dup of bug#9244. The current workaround is to copy the value of cpuset.mems from the parent cgroup into each child recursively. It appears to be caused by a race condition during startup.

--Nate
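[Editor's note] The recursive workaround described above can be sketched as a small shell helper. The function name is hypothetical, and on a live node the base path would be something like /sys/fs/cgroup/cpuset/slurm; this is an illustration, not the procedure SchedMD recommended:

```shell
#!/bin/sh
# Sketch of the workaround from comment 1 (function name hypothetical):
# walk a cpuset hierarchy top-down and copy cpuset.mems from each
# parent into any child the kernel created with an empty value.
propagate_mems() {
    base="$1"
    # Sorting the paths guarantees parents are processed before children,
    # so each child copies an already-filled value.
    find "$base" -mindepth 1 -type d | sort | while read -r d; do
        # -s is true when the file exists and is non-empty
        if [ ! -s "$d/cpuset.mems" ]; then
            cat "$(dirname "$d")/cpuset.mems" > "$d/cpuset.mems"
        fi
    done
}
```

For example, `propagate_mems /sys/fs/cgroup/cpuset/slurm` would fill in uid_*, job_*, and step_* directories in one pass.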
Comment 2 Matt Ezell 2020-10-29 06:42:59 MDT
(In reply to Nate Rini from comment #1)
> Matt,
> 
> This looks like a dup of bug#9244. The current work around is to place value
> of cpuset.mems from the parent cgroup into the child recursively. It appears
> to be caused by a race condition during startup.
> 
> --Nate

I'm not authorized to see that bug. I'm not sure it's a race condition; it's just how cgroups work in this kernel. There's a parameter called cgroup.clone_children that can affect how cgroups are created. That parameter seems somewhat controversial, as there have been patches to the kernel to remove it. Anyway:

[root@lyra16 slurm]# pwd
/sys/fs/cgroup/cpuset/slurm
[root@lyra16 slurm]# cat cpuset.mems
0-1
[root@lyra16 slurm]# cat cgroup.clone_children 
0
[root@lyra16 slurm]# mkdir matt
[root@lyra16 slurm]# cat matt/cpuset.mems

[root@lyra16 slurm]# echo 1 > cgroup.clone_children 
[root@lyra16 slurm]# mkdir matt2
[root@lyra16 slurm]# cat matt2/cpuset.mems
0-1

So I think the fix is either to have Slurm write 1 into $CPUSETDIR/slurm/cgroup.clone_children (if it exists) at startup, or to make sure it sets cpuset.mems for every subdirectory it creates.
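[Editor's note] The two proposed fixes can be sketched as shell helpers. The function names, and the explicit handling of cpuset.cpus alongside cpuset.mems, are assumptions for illustration, not the actual Slurm patch:

```shell
#!/bin/sh
# Option 1 (sketch): enable clone_children on the Slurm cpuset root so
# that new child cgroups inherit cpuset.mems/cpuset.cpus automatically.
enable_clone_children() {
    root="$1"
    # The file only exists for cgroup v1; skip silently otherwise.
    if [ -f "$root/cgroup.clone_children" ]; then
        echo 1 > "$root/cgroup.clone_children"
    fi
}

# Option 2 (sketch): after creating a child cgroup, explicitly
# initialize any empty cpuset files from the parent directory.
init_cpuset_child() {
    child="$1"
    parent=$(dirname "$child")
    for f in cpuset.mems cpuset.cpus; do
        [ -s "$child/$f" ] || cat "$parent/$f" > "$child/$f"
    done
}
```

Option 1 is a one-time setting but depends on a file cgroup v2 no longer has; option 2 is more work per mkdir but portable across kernels that still lack inherited values.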
Comment 3 Felip Moll 2020-10-29 13:50:29 MDT
(In reply to Matt Ezell from comment #2)
> (In reply to Nate Rini from comment #1)
> > Matt,
> > 
> > This looks like a dup of bug#9244. The current work around is to place value
> > of cpuset.mems from the parent cgroup into the child recursively. It appears
> > to be caused by a race condition during startup.
> > 
> > --Nate
> 
> I'm not authorized to see that bug.  I'm not sure it's a race condition,
> just how cgroups works in this kernel. There's a parameter called
> cgroup.clone_children can impact how cgroups are created. That parameter
> seems somewhat controversial, as there have been patches to the kernel to
> remove it. Anyway:
> 
> [root@lyra16 slurm]# pwd
> /sys/fs/cgroup/cpuset/slurm
> [root@lyra16 slurm]# cat cpuset.mems
> 0-1
> [root@lyra16 slurm]# cat cgroup.clone_children 
> 0
> [root@lyra16 slurm]# mkdir matt
> [root@lyra16 slurm]# cat matt/cpuset.mems
> 
> [root@lyra16 slurm]# echo 1 > cgroup.clone_children 
> [root@lyra16 slurm]# mkdir matt2
> [root@lyra16 slurm]# cat matt2/cpuset.mems
> 0-1
> 
> So I think the fix is either have Slurm write 1 into
> $CPUSETDIR/slurm/cgroup.clone_children (if it exists) at startup, or to make
> sure to set cpuset.mems for every subdirectory it creates.

That's exactly what the patch I am working on does.
I will let you know when it is reviewed and done.

Can you point me to some reference? I am interested in this information:

> That parameter
> seems somewhat controversial, as there have been patches to the kernel to
> remove it.
Comment 4 Matt Ezell 2020-10-29 13:54:59 MDT
(In reply to Felip Moll from comment #3)
> Can you point me to some reference? I am interested in this information:
> 
> > That parameter
> > seems somewhat controversial, as there have been patches to the kernel to
> > remove it.

https://lists.linuxfoundation.org/pipermail/containers/2012-November/030813.html
https://lwn.net/Articles/547332/

In cgroups v2 the file does not exist:

https://man7.org/linux/man-pages/man7/cgroups.7.html

> In addition, the cgroup.clone_children file that is employed by the cpuset controller has been removed.
Comment 5 Felip Moll 2020-10-30 07:37:00 MDT
(In reply to Matt Ezell from comment #4)
> (In reply to Felip Moll from comment #3)
> > Can you point me to some reference? I am interested in this information:
> > 
> > > That parameter
> > > seems somewhat controversial, as there have been patches to the kernel to
> > > remove it.
> 
> https://lists.linuxfoundation.org/pipermail/containers/2012-November/030813.html
> https://lwn.net/Articles/547332/
> 
> In cgroupsV2 the file does not exist:
> 
> https://man7.org/linux/man-pages/man7/cgroups.7.html
> 
> > In addition, the cgroup.clone_children file that is employed by the cpuset controller has been removed.

This is a very old discussion, dating back to 2012/13, and they finally decided to keep clone_children for cpuset.

cgroups v2 doesn't have this option, but it is an entirely new system that Slurm doesn't support yet, so it doesn't matter here.

Thanks for your comments.
Comment 6 Felip Moll 2020-10-30 12:26:56 MDT
Matt, I have a patch pending review.

Will let you know when it is ready.
Comment 7 Felip Moll 2020-11-03 06:14:31 MST
Matt,

A fix has been applied to:

- 20.02.6 commit 666d2eedebac
- 20.11.0pre1 (master) commit cd20c16b169a

Please open a new ticket or reopen this one if you still have issues after these patches.

Thanks