Ticket 6722

Summary: slurmd cgroup functionality doesn't work inside an LXC container
Product: Slurm Reporter: Alun Jones <auj>
Component: slurmd Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 17.11.2   
Hardware: Linux   
OS: Linux   
Site: -Other- Slinky Site: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Attachments: Change xcgroup.c to use notify_on_release rather than release_agent as successful cgroup creation flag.

Description Alun Jones 2019-03-19 09:07:35 MDT
Created attachment 9618
Change xcgroup.c to use notify_on_release rather than release_agent as successful cgroup creation flag.

Hi,

If I'm reading the sources correctly, src/slurmd/common/xcgroup.c uses the presence of the "release_agent" file in a cgroup to determine whether that cgroup has been created successfully. That appears to be the only use made of that file.

For various reasons, I would like to be able to run slurmd inside an LXC container and to use cgroup functionality to manage access to a pair of GPUs.
However, LXC hides the release_agent file, which means that slurmd won't start in this case.

If I point the test at the "notify_on_release" file (which *is* present inside LXC), everything works.

If my guess is right (that release_agent is only used as a test of successful cgroup creation) then the change above would be sufficient to allow slurmd to run inside LXC.

I'm attaching a "git diff" against the current version of Slurm. Hope it's appropriate and that you'll accept it.
Comment 1 Alun Jones 2019-03-19 09:30:19 MDT
I wonder whether bug https://bugs.schedmd.com/show_bug.cgi?id=5626 is related. The reporter doesn't say that slurmd is running inside a container, but the symptoms are very similar.