| Summary: | slurmd cgroup functionality doesn't work inside an LXC container | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Alun Jones <auj> |
| Component: | slurmd | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | ||
| Version: | 17.11.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Attachments: | Change xcgroup.c to use notify_on_release rather than release_agent as successful cgroup creation flag. | ||
I wonder whether bug https://bugs.schedmd.com/show_bug.cgi?id=5626 is related. The reporter doesn't say that it's running inside a container, but the symptoms are very similar.
Created attachment 9618: Change xcgroup.c to use notify_on_release rather than release_agent as successful cgroup creation flag.

Hi,

If I'm reading the sources correctly, src/slurmd/common/xcgroup.c uses the presence of the "release_agent" file in a cgroup to determine whether that cgroup has been created successfully. It looks like that's the only use that's made of that file.

For various reasons, I would like to be able to run slurmd inside an LXC container and to use cgroup functionality to manage access to a pair of GPUs. However, LXC hides the release_agent file, which means that slurmd won't start in this case. If I point the test at the "notify_on_release" file (which *is* present inside LXC), everything works.

If my guess is right (that release_agent is only used as a test of successful cgroup creation), then the change above would be sufficient to allow slurmd to run inside LXC.

I'm attaching a "git diff" against the current version of Slurm. Hope it's appropriate and that you'll accept it.
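
For illustration, the kind of check described above can be sketched as follows. This is a minimal sketch and not the actual code from xcgroup.c or the attached diff: the function names `cgroup_marker_exists` and `cgroup_looks_created` are hypothetical, and it assumes the test amounts to nothing more than checking that a named file exists inside the cgroup directory.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper (not part of Slurm): report whether the file
 * `marker` exists inside the directory `cgroup_dir`.
 */
static bool cgroup_marker_exists(const char *cgroup_dir, const char *marker)
{
    char path[4096];
    struct stat st;

    /* Build "<cgroup_dir>/<marker>" and reject paths that would be truncated. */
    int n = snprintf(path, sizeof(path), "%s/%s", cgroup_dir, marker);
    if (n < 0 || (size_t)n >= sizeof(path))
        return false;

    /* stat() returns 0 only if the path exists and is reachable. */
    return stat(path, &st) == 0;
}

/*
 * Hypothetical wrapper: treat the cgroup as successfully created when
 * "notify_on_release" is present, instead of testing for "release_agent",
 * which is hidden inside an LXC container according to this report.
 */
static bool cgroup_looks_created(const char *cgroup_dir)
{
    return cgroup_marker_exists(cgroup_dir, "notify_on_release");
}
```

With a check of this shape, a directory that contains notify_on_release but no release_agent still counts as a created cgroup, which is the situation described for LXC.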