Ticket 6722 - slurmd cgroup functionality doesn't work inside an LXC container
Summary: slurmd cgroup functionality doesn't work inside an LXC container
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.11.2
Hardware: Linux Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-03-19 09:07 MDT by Alun Jones
Modified: 2019-03-19 09:30 MDT

See Also:
Site: -Other-


Attachments
Change xcgroup.c to use notify_on_release rather than release_agent as successful cgroup creation flag. (482 bytes, patch)
2019-03-19 09:07 MDT, Alun Jones

Description Alun Jones 2019-03-19 09:07:35 MDT
Created attachment 9618

Hi,

If I'm reading the sources correctly, src/slurmd/common/xcgroup.c uses the presence of the "release_agent" file in a cgroup to determine whether that cgroup has been created successfully. As far as I can tell, that is the only use made of that file.

For various reasons, I would like to be able to run slurmd inside an LXC container and to use cgroup functionality to manage access to a pair of GPUs.
However, LXC hides the release_agent file, which means that slurmd won't start in this case.

If I point the test at the "notify_on_release" file (which *is* present inside LXC), everything works.

If my guess is right (that release_agent is only used as a test of successful cgroup creation) then the change above would be sufficient to allow slurmd to run inside LXC.

I'm attaching a "git diff" against the current version of Slurm. Hope it's appropriate and that you'll accept it.
Comment 1 Alun Jones 2019-03-19 09:30:19 MDT
I wonder whether bug https://bugs.schedmd.com/show_bug.cgi?id=5626 is related. The reporter doesn't say that it's running inside a container, but the symptoms are very similar.