| Summary: | cpuset issue - No space left on device | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Matt Ezell <ezellma> |
| Component: | slurmd | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | bart, nate |
| Version: | 20.11.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9244 https://bugs.schedmd.com/show_bug.cgi?id=12157 | ||
| Site: | ORNL-OLCF | Alineos Sites: | --- |
| Version Fixed: | 20.02.6 20.11.0pre1 | Target Release: | --- |
Description
Matt Ezell
2020-10-28 18:44:43 MDT
Matt,

This looks like a duplicate of bug 9244. The current workaround is to copy the value of cpuset.mems from the parent cgroup into each child, recursively. It appears to be caused by a race condition during startup.

--Nate

(In reply to Nate Rini from comment #1)
> Matt,
>
> This looks like a dup of bug#9244. The current work around is to place value
> of cpuset.mems from the parent cgroup into the child recursively. It appears
> to be caused by a race condition during startup.
>
> --Nate

I'm not authorized to see that bug. I'm not sure it's a race condition; it may just be how cgroups work in this kernel. There's a parameter called cgroup.clone_children that can affect how cgroups are created. That parameter seems somewhat controversial, as there have been patches to the kernel to remove it. Anyway:

    [root@lyra16 slurm]# pwd
    /sys/fs/cgroup/cpuset/slurm
    [root@lyra16 slurm]# cat cpuset.mems
    0-1
    [root@lyra16 slurm]# cat cgroup.clone_children
    0
    [root@lyra16 slurm]# mkdir matt
    [root@lyra16 slurm]# cat matt/cpuset.mems

    [root@lyra16 slurm]# echo 1 > cgroup.clone_children
    [root@lyra16 slurm]# mkdir matt2
    [root@lyra16 slurm]# cat matt2/cpuset.mems
    0-1

So I think the fix is either to have Slurm write 1 into $CPUSETDIR/slurm/cgroup.clone_children (if it exists) at startup, or to make sure to set cpuset.mems for every subdirectory it creates.

(In reply to Matt Ezell from comment #2)
> So I think the fix is either have Slurm write 1 into
> $CPUSETDIR/slurm/cgroup.clone_children (if it exists) at startup, or to make
> sure to set cpuset.mems for every subdirectory it creates.

That's exactly what the patch I am working on does. I will let you know when it is reviewed and done.

Can you point me to some reference? I am interested in this information:

> That parameter
> seems somewhat controversial, as there have been patches to the kernel to
> remove it.

(In reply to Felip Moll from comment #3)
> Can you point me to some reference? I am interested in this information:
>
> > That parameter
> > seems somewhat controversial, as there have been patches to the kernel to
> > remove it.

https://lists.linuxfoundation.org/pipermail/containers/2012-November/030813.html
https://lwn.net/Articles/547332/

In cgroups v2 the file does not exist:

https://man7.org/linux/man-pages/man7/cgroups.7.html

> In addition, the cgroup.clone_children file that is employed by the cpuset controller has been removed.

(In reply to Matt Ezell from comment #4)
> https://lists.linuxfoundation.org/pipermail/containers/2012-November/030813.html
> https://lwn.net/Articles/547332/
>
> In cgroupsV2 the file does not exist:
>
> https://man7.org/linux/man-pages/man7/cgroups.7.html

This is a very old discussion, dating back to 2012/13, and in the end they decided to keep clone_children for cpuset. cgroups v2 doesn't have this option, but it is an entirely new system that Slurm doesn't support yet, so it doesn't matter here.

Thanks for your comments, Matt. I have a patch pending review and will let you know when it is ready.

Matt,

A fix has been applied to:

- 20.02.6: commit 666d2eedebac
- 20.11.0pre1 (master): commit cd20c16b169a

Please open a new bug or reopen this one if you still have issues after these patches.

Thanks
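
For anyone who hits this before upgrading, the two options discussed in this thread can be sketched in shell roughly as follows. This is only an illustration of the workaround, not the code from the Slurm commits above. The function names `enable_clone_children` and `propagate_cpuset_mems` are hypothetical, and the sketch assumes a cgroups v1 cpuset hierarchy such as the /sys/fs/cgroup/cpuset/slurm directory shown earlier.

```shell
#!/usr/bin/env bash
# Sketch of the two workarounds discussed in this bug.
# Function names are hypothetical; pass the cpuset directory as the argument,
# for example /sys/fs/cgroup/cpuset/slurm.

# Option 1: ask the kernel to copy the parent's cpuset settings into
# child cgroups created from now on. Does nothing when the
# cgroup.clone_children file is absent (for example under cgroups v2).
enable_clone_children() {
    local dir="$1"
    if [ -e "$dir/cgroup.clone_children" ]; then
        echo 1 > "$dir/cgroup.clone_children"
    fi
}

# Option 2: recursively copy the parent's cpuset.mems into every child
# directory whose cpuset.mems is empty. Children that already have a
# value are left untouched.
propagate_cpuset_mems() {
    local parent="$1" value child
    [ -r "$parent/cpuset.mems" ] || return 0
    value="$(cat "$parent/cpuset.mems")"
    [ -n "$value" ] || return 0      # nothing to copy from an empty parent
    for child in "$parent"/*/; do
        [ -d "$child" ] || continue  # glob matched nothing
        child="${child%/}"
        if [ -z "$(cat "$child/cpuset.mems" 2>/dev/null)" ]; then
            echo "$value" > "$child/cpuset.mems"
        fi
        propagate_cpuset_mems "$child"
    done
}
```

As root on an affected node, one would source this and call `propagate_cpuset_mems /sys/fs/cgroup/cpuset/slurm` to repair existing directories, and `enable_clone_children /sys/fs/cgroup/cpuset/slurm` so that newly created ones inherit the value. Note that option 1 only affects cgroups created after the flag is set, which matches the matt and matt2 directories in the session above.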