| Summary: | slurmd cgroup functionality doesn't work inside an LXC container | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Alun Jones <auj> |
| Component: | slurmd | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | ||
| Version: | 17.11.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Attachments: | Change xcgroup.c to use notify_on_release rather than release_agent as successful cgroup creation flag. | ||
I wonder whether bug https://bugs.schedmd.com/show_bug.cgi?id=5626 is related. The reporter doesn't say that it's running inside a container, but the symptoms are very similar.
Created attachment 9618
Change xcgroup.c to use notify_on_release rather than release_agent as successful cgroup creation flag.

Hi,

If I'm reading the sources correctly, src/slurmd/common/xcgroup.c uses the presence of the "release_agent" file in a cgroup to determine whether that cgroup has been created successfully. That looks like the only use made of that file.

For various reasons, I would like to run slurmd inside an LXC container and use cgroup functionality to manage access to a pair of GPUs. However, LXC hides the release_agent file, so slurmd won't start in this case. If I point the test at the "notify_on_release" file (which *is* present inside LXC), everything works.

If my guess is right (that release_agent is only used as a test of successful cgroup creation), then the change above would be sufficient to allow slurmd to run inside LXC.

I'm attaching a "git diff" against the current version of Slurm. I hope it's appropriate and that you'll accept it.