Ticket 16713

Summary: Add support for running in slurmd and slurmstepd in a cgroup namespace
Product: Slurm Reporter: Urban Borštnik <urban.borstnik>
Component: slurmd    Assignee: Felip Moll <felip.moll>
Status: OPEN --- QA Contact:
Severity: C - Contributions    
Priority: --- CC: ben, felip.moll, nate
Version: 23.02.1   
Hardware: Linux   
OS: Linux   
Attachments: Patchset for cgroup namespace support.
Patchset for cgroup namespace support.
bug16713_test.patch

Description Urban Borštnik 2023-05-12 00:56:56 MDT
Created attachment 30243 [details]
Patchset for cgroup namespace support.

When running the slurmd and slurmstepd daemons in containers for which the container runtime sets up cgroup namespaces (such as on Kubernetes), then there is a discrepancy between the path on which the cgroupfs is mounted within the container (i.e., `/sys/fs/cgroup/init.scope`) vs. what the `/proc/*/cgroup` paths show, which contain a relative path to the root of the cgroup namespace, e.g., `/../../../../cont1/init.scope` for PID 1.

This patchset detects that slurmd & slurmstepd run in a cgroup namespace with cgroup/v2, determines the relative path to the namespace root, and strips it from the paths read from /proc/*/cgroup. I have attempted to keep the changes minimally invasive, with no impact when cgroup namespaces are not in use.
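The detection relies on the fourth ("root") field of the cgroup2 entry in /proc/PID/mountinfo. A rough Python sketch of that parsing (the patchset itself implements this in Slurm's C code; the function name here is illustrative):

```python
def cgroup_ns_rel_root(mountinfo_text):
    """Return the 'root' field (4th) of the cgroup2 mount from a
    /proc/PID/mountinfo dump, e.g. '/../../../../cnode01' inside a
    cgroup namespace, or '/' in the root namespace."""
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # mountinfo: ID parent major:minor root mountpoint opts [optional...] - fstype src superopts
        if "-" in fields and fields[fields.index("-") + 1] == "cgroup2":
            return fields[3]
    return None

# Example line taken from this report:
line = ("3249 3220 0:28 /../../../../cnode01 /sys/fs/cgroup "
        "rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw")
print(cgroup_ns_rel_root(line))  # -> /../../../../cnode01
```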

Four patches make up the patchset:

1. d09f40de53 Get the relative path to the cgroup namespace root.
Add a function that reads /proc/PID/mountinfo to get the relative path to the root of the cgroup namespace.

2. 166e0d10a1 Return self cgroup path relative to the namespace mount.
Removes the relative path to the root of the cgroup namespace for *self* and appends the remainder to `slurm_cgroup_conf.cgroup_mountpoint`.

3. 7880f57102 Return init cgroup path relative to the namespace mount.
Removes the relative path to the root of the cgroup namespace for the *init* process (PID 1) and appends the remainder to `slurm_cgroup_conf.cgroup_mountpoint`.

4. 93ed0520d8 Add copyright notice to changed file.

I have developed these against Slurm v23.02.1 (tag slurm-23-02-1-1), but they still apply cleanly to slurm-23-02-2-1 and master. They build upon patch bdd6102d08 (Make cgroup/v2 to work with containerized cgroups), extending it to work in a cgroup namespace.

Thank you for considering this patchset for inclusion in Slurm!

With kind regards,
Urban
Comment 4 Felip Moll 2023-05-17 06:42:32 MDT
Created attachment 30332 [details]
Patchset for cgroup namespace support.

Attaching a refactored version of your patch so we can study it better.

Please in the future use:

git format-patch --stdout > bugxxxx.patch


it is then much easier for us to apply and study it :)


-- 

We're discussing it internally and will come back to you asap.
Comment 8 Felip Moll 2023-05-17 09:58:21 MDT
(In reply to Urban Borštnik from comment #0)
> Created attachment 30243 [details]
> Patchset for cgroup namespace support.
> 
> When running the slurmd and slurmstepd daemons in containers for which the
> container runtime sets up cgroup namespaces (such as on Kubernetes), then
> there is a discrepancy between the path on which the cgroupfs is mounted
> within the container (i.e., `/sys/fs/cgroup/init.scope`) vs. what the
> `/proc/*/cgroup` paths show, which contain a relative path to the root of
> the cgroup namespace, e.g., `/../../../../cont1/init.scope` for PID 1.

Hello!,

I am failing to understand why it is a problem. Slurm (master after commit bdd6102d083163a20760) will read /proc/self/cgroup and append this path to the CgroupMountpoint, so for example if you're in "cont1/init.scope" and /proc/self/cgroup shows 0::/cont1/init.scope, then Slurm will use /sys/fs/cgroup/cont1/init.scope, and thus it will work in the correct cgroup.

In my tests, I don't see "/../../../../" in /proc/self/cgroup (I only see it in mountinfo) even if I run from inside a cgroup namespace:


[root@llagosti ~]# unshare --pid --cgroup --user --fork /bin/bash -l
[nobody@llagosti ~]$ cat /proc/self/cgroup 
0::/
[nobody@llagosti ~]$ cat /proc/self/mountinfo |grep cgro
29 23 0:26 /../../../../../.. /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:4 - cgroup2 cgroup2 rw

Another, different test is when we start from within a specific slice, where we're not in the root cgroup anymore. For example, in my Docker test installation I start the container under a docker.slice in the host system. From the container I can see where I am; Slurm takes this path and works from /sys/fs/cgroup/docker.slice/docker-c071e45d.....scope/init.scope

[root@mgmtnode /]# cat /proc/self/cgroup 
0::/docker.slice/docker-c071e4d672dccd0804b2f5043b3bed9f80eb2bc915992179651aa049c572535c.scope/init.scope

[root@mgmtnode /]# cat /proc/self/mountinfo |grep cgroup
2714 2697 0:26 / /sys/fs/cgroup ro,relatime - cgroup2 cgroup2 rw
2718 2714 0:26 /docker.slice /sys/fs/cgroup/docker.slice rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw

Basically we're building an absolute path from what's in /proc/self/cgroup and we work from there.
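That current behavior can be sketched roughly like this (Python for brevity; `build_cgroup_root` is an illustrative name, not Slurm's actual function):

```python
def build_cgroup_root(mountpoint, proc_self_cgroup):
    """Concatenate the CgroupMountpoint with the path from the '0::'
    line of /proc/self/cgroup (simplified sketch of current master)."""
    for line in proc_self_cgroup.splitlines():
        if line.startswith("0::"):
            return mountpoint.rstrip("/") + line[3:]
    return mountpoint

print(build_cgroup_root("/sys/fs/cgroup", "0::/cont1/init.scope"))
# -> /sys/fs/cgroup/cont1/init.scope
```

Note that if the `0::` line contains `../` entries (as inside a cgroup namespace), this concatenation produces a path like `/sys/fs/cgroup/../../../../cont1/init.scope`.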
Can you explain a little more what the real issue is with the current approach in master?

Thanks
Comment 10 Felip Moll 2023-05-17 11:39:26 MDT
Ah, I think I understand the real problem now.

We're reading /proc/1/cgroup path in _get_init_cg_path() to get the cgroup root directory, but in the case of a cgroup namespace it looks like this:

]# cat /proc/1/cgroup 
0::/../../../../../../init.scope

In the case of docker, started from a slice:

"0::/docker.slice/docker-<some UUID>.scope/init.scope"

In the case of a normal system:

]$ cat /proc/1/cgroup
0::/init.scope

We use this string to form the root like this:

/sys/fs/cgroup/</proc/1/cgroup 0:: line without init.scope>


Question 1: Why, from inside a cgroup namespace, do we see so many ../../../../ entries in "cat /proc/1/cgroup"?
Question 2: Shouldn't it be as simple as ignoring the "../../../.." in this function?
Comment 11 Felip Moll 2023-05-17 11:45:33 MDT
> Question 1: Why from inside a cgroup namespace we have so many ../../../../
> in "cat /proc/1/cgroup"?

This is explained here.

https://man7.org/linux/man-pages/man7/cgroup_namespaces.7.html

When reading the cgroup memberships of a "target" process from
       /proc/[pid]/cgroup, the pathname shown in the third field of each
       record will be relative to the reading process's root directory
       for the corresponding cgroup hierarchy.  If the cgroup directory
       of the target process lies outside the root directory of the
       reading process's cgroup namespace, then the pathname will show
       ../ entries for each ancestor level in the cgroup hierarchy.


> Question 2: Shouldn't it be as simple as ignoring the "../../../.." in this
> function?

I think that will be the fix.


Let me study a bit more about this case.
Comment 12 Urban Borštnik 2023-05-17 14:09:19 MDT
Thank you for the detailed analysis!

My initial approach was also to drop any parent paths (that is, "../../../.."). However, the relative path may contain a name from the root namespace following the parent traversal, in the example case "cont1" (../../../../cont1). Keeping it would make the container try to access its cgroup with the "cont1" component included, for example as /sys/fs/cgroup/cont1/system.slice instead of /sys/fs/cgroup/system.slice, and the former does not exist inside the container.

Reading the mountpoint from /proc/*/mountinfo ensures that we remove the whole prefix.
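A minimal sketch of the stripping step (Python for brevity; the real change lives in Slurm's C cgroup/v2 code, and the helper name is illustrative):

```python
def strip_ns_prefix(cgroup_path, ns_root):
    """Strip the namespace-root prefix (the 'root' field of the cgroup2
    mountinfo entry, e.g. '/../../../../cont1') from a /proc/*/cgroup
    path, yielding the path relative to the in-container mount."""
    if ns_root != "/" and cgroup_path.startswith(ns_root):
        return cgroup_path[len(ns_root):] or "/"
    return cgroup_path

p = strip_ns_prefix("/../../../../cont1/system.slice", "/../../../../cont1")
print(p)  # -> /system.slice
```

Dropping only the leading "../" components instead would leave "/cont1/system.slice", which is exactly the failure mode described above.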

As a specific example without the patches applied, one container running slurmd/slurmstepd has /sys/fs/cgroup/cnode01 mounted from the root namespace onto /sys/fs/cgroup. The host has this view: 

/sys/fs/cgroup/cnode01
/sys/fs/cgroup/cnode01/init.scope
/sys/fs/cgroup/cnode01/system.slice
/sys/fs/cgroup/cnode01/system.slice/hpc-slurm.service
/sys/fs/cgroup/cnode01/system.slice/hpc-munge.service
/sys/fs/cgroup/cnode01/system.slice/dbus-broker.service
/sys/fs/cgroup/cnode01/system.slice/system-modprobe.slice
/sys/fs/cgroup/cnode01/system.slice/systemd-journald.service
/sys/fs/cgroup/cnode01/system.slice/systemd-hostnamed.service
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_22
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_22/step_batch
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_22/step_batch/slurm
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_22/step_batch/user
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_22/step_batch/user/task_0
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_22/step_extern
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_22/step_extern/slurm
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_22/step_extern/user
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_22/step_extern/user/task_0
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/system
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_23
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_23/step_batch
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_23/step_batch/slurm
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_23/step_batch/user
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_23/step_batch/user/task_0
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_23/step_extern
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_23/step_extern/slurm
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_23/step_extern/user
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/job_23/step_extern/user/task_0

and init's /proc/1/cgroup shows
0::/../../../../cnode01/init.scope

while slurmd's /proc/self/cgroup shows
0::/../../../../cnode01/system.slice/hpc-slurm.service
before it aborts:
slurmd: error: unable to open '/sys/fs/cgroup/../../../../cnode01/system.slice/hpc-slurm.service/cgroup.contro>)
Comment 13 Urban Borštnik 2023-05-17 14:21:08 MDT
(In reply to Urban Borštnik from comment #12)
>[...]
> As a specific example without the patches applied, one container running

Sorry for the confusion: this listing is obviously from a working system with the patches applied.

>[..]
> before it aborts:
> slurmd: error: unable to open
> '/sys/fs/cgroup/../../../../cnode01/system.slice/hpc-slurm.service/cgroup.
> contro>)

This error message happens if the patches are not applied.
Comment 15 Felip Moll 2023-05-18 10:33:54 MDT
Created attachment 30359 [details]
bug16713_test.patch

I get the same behavior with this patch, but I am not convinced by either of our approaches.

I understand the root cgroup must be remounted in the container, because from there we have no permission to write to the outside cgroups.

I am also experimenting with the combination of PID namespaces and cgroups. When we are in a PID namespace container and we try to write to the cgroup, it will try to put a "virtualized" PID number into the cgroup, which doesn't really exist for the kernel. It is another topic, but I just wanted to mention it.

Can you try this patch in your test environment?
Comment 17 Urban Borštnik 2023-05-19 04:38:55 MDT
> I get the same behavior with this patch, but I am not convinced of any of
> our both approaches.

This patch fails for me:

May 19 09:10:39 cnode01 slurmd[53]: error: unable to open '/sys/fs/cgroup/cnode01/system.slice/hpc-slurm.service/cgroup.controllers' for reading : No such file or directory
May 19 09:10:39 cnode01 slurmd[53]: error: cannot read /sys/fs/cgroup/cnode01/system.slice/hpc-slurm.service/cgroup.controllers: No such file or directory

While this patch does not include the debugging outputs, I know why this fails: slurmd wants to read "its"
/sys/fs/cgroup/cnode01/system.slice/hpc-slurm.service/cgroup.controllers
file, which in the root namespace is then
/sys/fs/cgroup/cnode01/cnode01/system.slice/hpc-slurm.service/cgroup.controllers
                       ^^^^^^^
which does not exist (because of the doubled cnode01). The need to remove this cnode01 from the path, along with the ../../../.., is why it is necessary to find the entire relative path to the root of the namespace, which can be read from /proc/self/mountinfo.

For completeness, here is the cnode01 path in the root cgroup namespace:

/sys/fs/cgroup/cnode01
/sys/fs/cgroup/cnode01/init.scope
/sys/fs/cgroup/cnode01/system.slice
/sys/fs/cgroup/cnode01/system.slice/dbus-broker.service
/sys/fs/cgroup/cnode01/system.slice/system-modprobe.slice
/sys/fs/cgroup/cnode01/system.slice/systemd-journald.service
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope
/sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/system
/sys/fs/cgroup/cnode01/system.slice/hpc-munge.service

and the PID 1 cgroup and mountinfo files:

[root@cnode01 /]# cat /proc/1/cgroup 
0::/../../../../cnode01/init.scope
[root@cnode01 /]# grep cgroup /proc/1/mountinfo 
3249 3220 0:28 /../../../../cnode01 /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw


> I understand the root cgroup must be remounted in the container, because
> from there we have no permission to write to the outside cgroups.

As an aside, I would also like to add that the reason I mount /sys/fs/cgroup/cnode01 onto /sys/fs/cgroup inside the container is to make it possible to run multiple such containers on the same node (cnode01, cnode02, …), with each getting its own cgroup namespace. I think the proper solution that plays nicely with the host's systemd would be to use the container's assigned cgroup path, for example,
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poda086455f_f1b8_4468_8267_6c181e3e5138.slice/cri-containerd-dacb00f79ed2c7c90a8c21b02952b89a244cd221e9be416dba8df70811c615b3.scope
I just have not gotten that far yet.

> Also I am experimenting also with the combination of pid namespaces and
> cgroup. When we are in a pid namespace container and we try to write the
> cgroup, it will try to put a "virtualized" pid number into the cgroup, which
> doesn't really exist for the kernel. It is another topic but just wanted to
> comment.

The containers we run use PID namespaces, and I have not found this to be an issue, unless I'm missing something.

Also, in the cgroup namespace scenario, I believe that reading only /proc/self/cgroup is sufficient (i.e., there is no need to read /proc/1/cgroup), because we assume that self's cgroup namespace is mounted directly on /sys/fs/cgroup and that init.scope is at that level: there is no need to find out where PID 1's init.scope is. I just wanted to keep the patches minimally invasive so they are easier to follow and don't interfere with the common case.

Another proper thing to do would be to ignore slurm_cgroup_conf.cgroup_mountpoint and read this value from /proc/self/mountinfo.

I would be happy to prepare these features, too.

> 
> Can you try this patch in your test environment?
Comment 18 Felip Moll 2023-05-19 09:28:38 MDT
(In reply to Urban Borštnik from comment #17)
> > I get the same behavior with this patch, but I am not convinced of any of
> > our both approaches.
> 
> This patch fails for me:
> 
> May 19 09:10:39 cnode01 slurmd[53]: error: unable to open
> '/sys/fs/cgroup/cnode01/system.slice/hpc-slurm.service/cgroup.controllers'
> for reading : No such file or directory
> May 19 09:10:39 cnode01 slurmd[53]: error: cannot read
> /sys/fs/cgroup/cnode01/system.slice/hpc-slurm.service/cgroup.controllers: No
> such file or directory
> 
> While this patch does not have of the debugging outputs, I know why this
> fails: slurmd wants to read "its"
> /sys/fs/cgroup/cnode01/system.slice/hpc-slurm.service/cgroup.controllers
> file, which in the root namespace is then
> /sys/fs/cgroup/cnode01/cnode01/system.slice/hpc-slurm.service/cgroup.
> controllers
>                        ^^^^^^^

But the error message doesn't show a double 'cnode01'; do you see it anywhere? If you don't, I think the error has a different cause:

> /sys/fs/cgroup/cnode01/system.slice/hpc-slurm.service/cgroup.controllers: No such file or directory

This error can also happen when a process which is in a cgroup namespace tries to write to something outside its allowed namespace.

[root@llagosti tmp]# unshare --cgroup --mount --fork /bin/bash -l
[root@llagosti tmp]# cat /proc/self/cgroup 
0::/
[root@llagosti tmp]# cat /proc/self/mountinfo|grep cgroup
741 738 0:26 /../../../../../.. /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot
[root@llagosti tmp]# ls /sys/fs/cgroup/system.slice/cgroup.procs
/sys/fs/cgroup/system.slice/cgroup.procs
[root@llagosti tmp]# echo $$ > /sys/fs/cgroup/system.slice/cgroup.procs 
bash: echo: write error: No such file or directory

The operation returns ENOENT (as described in the cgroup namespaces docs), even if the file exists, when we try to write to a cgroup outside our namespace.

Can you confirm which situation is happening for you? I don't have the 'cnode01' directory anyway. I will do more testing.

--

> > I understand the root cgroup must be remounted in the container, because
> > from there we have no permission to write to the outside cgroups.
> 
> As a aside, I would also like to add that the reason I mount
> /sys/fs/cgroup/cnode01 onto /sys/fs/cgroup inside the container is to have
> the possibility to run multiple such containers on the same node (cnode01,
> cnode02, …) with each getting its own cgroup namespace. I think the proper
> solution that plays nice with the host's systemd would be to use the
> container's assigned cgroup path, for example,
> /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-
> poda086455f_f1b8_4468_8267_6c181e3e5138.slice/cri-containerd-
> dacb00f79ed2c7c90a8c21b02952b89a244cd221e9be416dba8df70811c615b3.scope
> I just have not gotten that far yet.

I agree. How do you get it to use 'cnode01' now?
 

> The containers we run use PID namespaces I have not found that this to be an
> issue unless I'm missing something.

Dismiss my comment; I did more testing. I checked that from a container you can write the virtualized PID into cgroups: from the host perspective the PID written is the real one, while from the container perspective it is the virtualized one. So all good here.

> Also, in the cgroup namespace scenario, I believe that reading only
> /proc/self/cgroup is sufficient (i.e., no need for reading /proc/1/cgroup)
> because we assume that self's cgroup namespace is mounted directly on
> /sys/fs/cgroup and init.scope is at that level: there's no need to find out
> where PID 1's init.scope is. I just wanted to keep the patches minimally
> invasive so they are easier to follow and to ensure that they don't
> interfere with the common case.

Yep, possibly. Reading /proc/1/cgroup was needed for a container which started a systemd under a slice.

> Another proper thing to do would be to ignore
> slurm_cgroup_conf.cgroup_mountpoint and read this value from
> /proc/self/mountinfo.

That could be an option, yes. I guess you mean the fifth field:

741 738 0:26 /../../../../../.. /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot


> I would be happy to prepare these features, too.

Let me do more testing first; I want to have all the cases clear.
I will also install a Kubernetes setup to run some tests.
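For reference, reading that fifth field could look like this sketch (Python for brevity; `cgroup2_mountpoint` is a hypothetical helper, not Slurm's actual API):

```python
def cgroup2_mountpoint(mountinfo_text):
    """Return the mount point (5th field) of the cgroup2 entry in
    /proc/self/mountinfo, as an alternative to relying on the
    configured CgroupMountpoint."""
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if "-" in fields and fields[fields.index("-") + 1] == "cgroup2":
            return fields[4]
    return None

line = ("741 738 0:26 /../../../../../.. /sys/fs/cgroup "
        "rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 "
        "rw,nsdelegate,memory_recursiveprot")
print(cgroup2_mountpoint(line))  # -> /sys/fs/cgroup
```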
Comment 19 Felip Moll 2023-05-19 10:44:03 MDT
> /sys/fs/cgroup/cnode01
> /sys/fs/cgroup/cnode01/init.scope
> /sys/fs/cgroup/cnode01/system.slice
> /sys/fs/cgroup/cnode01/system.slice/dbus-broker.service
> /sys/fs/cgroup/cnode01/system.slice/system-modprobe.slice
> /sys/fs/cgroup/cnode01/system.slice/systemd-journald.service
> /sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope
> /sys/fs/cgroup/cnode01/system.slice/slurmstepd.scope/system
> /sys/fs/cgroup/cnode01/system.slice/hpc-munge.service

> As a aside, I would also like to add that the reason I mount
> /sys/fs/cgroup/cnode01 onto /sys/fs/cgroup inside the container is to have
> the possibility to run multiple such containers on the same node (cnode01,
> cnode02, …) with each getting its own cgroup namespace. I think the proper
> solution that plays nice with the host's systemd would be to use the
> container's assigned cgroup path, for example,
> /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-
> poda086455f_f1b8_4468_8267_6c181e3e5138.slice/cri-containerd-
> dacb00f79ed2c7c90a8c21b02952b89a244cd221e9be416dba8df70811c615b3.scope
> I just have not gotten that far yet.

Should I assume you're starting systemd as the pid 1 in the container *after* mounting /sys/fs/cgroup/cnode01 to /sys/fs/cgroup inside the container?
Comment 20 Urban Borštnik 2023-05-22 07:39:01 MDT
(In reply to Felip Moll from comment #19)

> Should I assume you're starting systemd as the pid 1 in the container
> *after* mounting /sys/fs/cgroup/cnode01 to /sys/fs/cgroup inside the
> container?

Yes. I'm not certain how and when the cgroup namespace is created: before or after the mount. Based on the tests shown below, it looks like the mount is created after the cgroup namespace for the container, which would explain how the "cnode01" component ends up in the relative path. If no mount is made, then the /sys/fs/cgroup hierarchy is read-only (though otherwise it looks like the Docker-type case in cgroup_v2.c). The only solution I have found to work so far is to do the mount.

If I start the container without systemd, the view is somewhat different:

On host (root cgroup namespace), both cases:

root@k0s:~# cat /proc/1/cgroup
0::/init.scope
root@k0s:~# grep cgroup /proc/1/mountinfo 
33 24 0:28 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:9 - cgroup2 cgroup2 rw
root@k0s:~# ls -l /proc/1/ns/cgroup 
lrwxrwxrwx 1 root root 0 Apr 27 14:16 /proc/1/ns/cgroup -> 'cgroup:[4026531835]'

I. Bash case: Container in a pod in k8s, /bin/sh as PID 1, mount /sys/fs/cgroup/cnode01 onto /sys/fs/cgroup.

View from host (root NS) of PID 300706 (-> PID 1 in container)

root@k0s:~# cat /proc/300706/cgroup
0::/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5f4349aa_e5dc_4335_80f8_778bffad0d5c.slice/cri-containerd-441037d3a8b9d16b651c551b0e6743b151b82921bc6bb2cb697f17ed7c932696.scope
root@k0s:~# grep cgroup /proc/300706/mountinfo
3526 3512 0:28 /cnode01 /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw
root@k0s:~# ls -l /proc/300706/ns/cgroup
lrwxrwxrwx 1 root root 0 May 19 12:46 /proc/300706/ns/cgroup -> 'cgroup:[4026533793]'

root@k0s:~# stat --format=%i /sys/fs/cgroup/cnode01
6664
root@k0s:~# stat --format=%i /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod5f4349aa_e5dc_4335_80f8_778bffad0d5c.slice/cri-containerd-441037d3a8b9d16b651c551b0e6743b151b82921bc6bb2cb697f17ed7c932696.scope
45805


II. Systemd case: Container in a pod in k8s, systemd as PID 1, mount /sys/fs/cgroup/cnode01 onto /sys/fs/cgroup.

View from host (root NS) of PID 305483 (-> PID 1 in container)

root@k0s:~# cat /proc/305483/cgroup
0::/cnode01/init.scope
root@k0s:~# grep cgroup /proc/305483/mountinfo 
3526 3512 0:28 /cnode01 /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw
root@k0s:~# ls -l /proc/305483/ns/cgroup
lrwxrwxrwx 1 root root 0 May 22 09:03 /proc/305483/ns/cgroup -> 'cgroup:[4026533793]'
root@k0s:~# stat --format=%i /sys/fs/cgroup/cnode01
6664

View from container:

[root@cnode01 /]# cat /proc/1/cgroup
0::/../../../../cnode01/init.scope
[root@cnode01 /]# grep cgroup /proc/1/mountinfo
3526 3512 0:28 /../../../../cnode01 /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw
[root@cnode01 /]# ls -l /proc/1/ns/cgroup
lrwxrwxrwx 1 root root 0 May 22 06:58 /proc/1/ns/cgroup -> 'cgroup:[4026533793]'
[root@cnode01 /]# stat --format=%i /sys/fs/cgroup
6664


> (In reply to Urban Borštnik from comment #17)
> > > I get the same behavior with this patch, but I am not convinced of any of
> > > our both approaches.
> > 
> > This patch fails for me:
> > 
> > May 19 09:10:39 cnode01 slurmd[53]: error: unable to open
> > '/sys/fs/cgroup/cnode01/system.slice/hpc-slurm.service/cgroup.controllers'
> > for reading : No such file or directory
> > May 19 09:10:39 cnode01 slurmd[53]: error: cannot read
> > /sys/fs/cgroup/cnode01/system.slice/hpc-slurm.service/cgroup.controllers: No
> > such file or directory
> > 
> > While this patch does not have of the debugging outputs, I know why this
> > fails: slurmd wants to read "its"
> > /sys/fs/cgroup/cnode01/system.slice/hpc-slurm.service/cgroup.controllers
> > file, which in the root namespace is then
> > /sys/fs/cgroup/cnode01/cnode01/system.slice/hpc-slurm.service/cgroup.
> > controllers
> >                        ^^^^^^^
> 
> But the error message doesn't show a double 'cnode01', do you see it anywhere? If you don't I think the error is caused by another thing:
> 
> > /sys/fs/cgroup/cnode01/system.slice/hpc-slurm.service/cgroup.controllers: No such file or directory
> 
> This error can also happen when a process which is in a cgroup namespace tries to write to something outside its allowed namespace.
> 
> [root@llagosti tmp]# unshare --cgroup --mount --fork /bin/bash -l
> [root@llagosti tmp]# cat /proc/self/cgroup 
> 0::/
> [root@llagosti tmp]# cat /proc/self/mountinfo|grep cgroup
> 741 738 0:26 /../../../../../.. /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot
> [root@llagosti tmp]# ls /sys/fs/cgroup/system.slice/cgroup.procs
> /sys/fs/cgroup/system.slice/cgroup.procs
> [root@llagosti tmp]# echo $$ > /sys/fs/cgroup/system.slice/cgroup.procs 
> bash: echo: write error: No such file or directory
> 
> the operation returns ENOENT (as described in cgroup2 namespace docs) even if the file exists when we try to write to a cgroup outside our namespace.

> Can you confirm which situation is happening for you? I don't have the 'cnode01' directory anyway. I will do more testing.

It's the extra cnode01. Without the cnode01 there is no error.

> > > I understand the root cgroup must be remounted in the container, because
> > > from there we have no permission to write to the outside cgroups.
> > 
> > As a aside, I would also like to add that the reason I mount
> > /sys/fs/cgroup/cnode01 onto /sys/fs/cgroup inside the container is to have
> > the possibility to run multiple such containers on the same node (cnode01,
> > cnode02, …) with each getting its own cgroup namespace. I think the proper
> > solution that plays nice with the host's systemd would be to use the
> > container's assigned cgroup path, for example,
> > /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-
> > poda086455f_f1b8_4468_8267_6c181e3e5138.slice/cri-containerd-
> > dacb00f79ed2c7c90a8c21b02952b89a244cd221e9be416dba8df70811c615b3.scope
> > I just have not gotten that far yet.
> 
> I agree. How do you choose it to use 'cnode01' now?

I have a "container cnode" definition for a Kubernetes pod, which mounts a host cgroup directory named after the pod (cnode01 here, but it can be anything) onto `/sys/fs/cgroup` in the container. This way, multiple pods can be run per node. The mounts use Kubernetes's standard volume mount definitions. The relevant part of the deployment is:

spec.template.spec.containers.volumeMounts:
 - name: cgroup
   mountPath: "/sys/fs/cgroup"
spec.template.spec.volumes:
 - name: cgroup
   hostPath: /sys/fs/cgroup/{{ $nodeName }}

I test locally on k0s, which uses containerd as the container runtime.
Comment 21 Felip Moll 2023-05-22 09:54:13 MDT
(In reply to Urban Borštnik from comment #20)
> (In reply to Felip Moll from comment #19)
> 
> > Should I assume you're starting systemd as the pid 1 in the container
> > *after* mounting /sys/fs/cgroup/cnode01 to /sys/fs/cgroup inside the
> > container?
> 
> Yes. I'm not certain how and when the cgroup namespace is created: before or
> after the mount. Based on the tests shown below it looks like the mount is
> created after the cgroup namespace for the container is created, which would
> explain where the "cnode01" comes into the relative path. If no mount is
> made, then the /sys/fs/cgroup hierarchy is read-only (though otherwise it
> looks like the Docker-type case in cgroup_v2.c). The only solution that I
> have found to work so far is to do the mount.
> 

Thanks, your last post makes sense. It is somewhat like the tests I was doing in docker.

I am also curious to know how and when the mounts are performed.

If you agree, I will install a Kubernetes setup with a configuration similar to yours and do some testing. I am in favour of making it work in all cases, but as you will understand, I need to set up a test environment for that.

Will come back to you as soon as I have more feedback.

How urgent is this matter for you?
Comment 22 Urban Borštnik 2023-05-23 07:24:58 MDT
(In reply to Felip Moll from comment #21)
> (In reply to Urban Borštnik from comment #20)
> > (In reply to Felip Moll from comment #19)
> > 
> > > Should I assume you're starting systemd as the pid 1 in the container
> > > *after* mounting /sys/fs/cgroup/cnode01 to /sys/fs/cgroup inside the
> > > container?
> > 
> > Yes. I'm not certain how and when the cgroup namespace is created: before or
> > after the mount. Based on the tests shown below it looks like the mount is
> > created after the cgroup namespace for the container is created, which would
> > explain where the "cnode01" comes into the relative path. If no mount is
> > made, then the /sys/fs/cgroup hierarchy is read-only (though otherwise it
> > looks like the Docker-type case in cgroup_v2.c). The only solution that I
> > have found to work so far is to do the mount.
> > 
> 
> Thanks, your last post makes sense. It is somewhat like the tests I was
> doing in docker.
> 
> I am also curious on knowing how and when the mounts are performed.
> 
> If you agree I will install a kubernetes setup with a configuration similar
> to yours and will do testing. I am in favour of making it work in all the
> cases, but as you will undersand I need to setup a test environment for that.
> 
> Will come back to you as soon as I have more feedback.

Great to hear this! Let me know if I can help in any way. 

> How urgent is this matter for you?

This is not urgent for us at this time, since my patches provide a working solution.