Summary: | slurmd can't find memory cgroup controller in spite of being enabled | ||
---|---|---|---|
Product: | Slurm | Reporter: | foufou33 |
Component: | Other | Assignee: | Jacob Jenson <jacob> |
Status: | RESOLVED INVALID | QA Contact: | |
Severity: | 6 - No support contract | ||
Priority: | --- | CC: | foufou33 |
Version: | 22.05.5 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | -Other- | Alineos Sites: | --- |
Attachments: |
patch to add debug output
more log
patch |
Created attachment 27518
more log
The rest of the log generated by my patch.
Created attachment 27566
patch
Adding '\n' to the strtok delimiters seems to fix the problem.
|
Created attachment 27517
patch to add debug output

I was playing around with Slurm 22.05.5.1 and cgroup v2 when I noticed that slurmd complained about the memory cgroup controller not being enabled, as shown here:

[2022-10-28T23:22:09.434] error: Controller memory is not enabled!
[2022-10-28T23:22:09.434] Resource spec: Reserved abstract CPU IDs: 60-63
[2022-10-28T23:22:09.434] Resource spec: Reserved machine CPU IDs: 30-31,62-63
[2022-10-28T23:22:09.434] error: memory cgroup controller is not available.
[2022-10-28T23:22:09.434] error: Resource spec: unable to initialize system memory cgroup
[2022-10-28T23:22:09.434] error: Resource spec: system cgroup memory limit disabled
[2022-10-28T23:22:09.442] cred/munge: init: Munge credential signature plugin loaded

Yet the controller is listed as available:

# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory
#

I added a few debug statements to _get_controllers (src/plugins/cgroup/v2/cgroup_v2.c, attached patch). A partial result:

[2022-10-29T00:14:37.275] debug2: cgroup/v2: _get_controllers: _get_controllers: controller: (cpu) not found
[2022-10-29T00:14:37.275] debug2: cgroup/v2: _get_controllers: _get_controllers: trying controller: (memory )
[2022-10-29T00:14:37.275] debug2: cgroup/v2: _get_controllers: _get_controllers: comparing with: (freezer)
[2022-10-29T00:14:37.275] debug2: cgroup/v2: _get_controllers: _get_controllers: controller: (freezer) not found
[2022-10-29T00:14:37.275] debug2: cgroup/v2: _get_controllers: _get_controllers: comparing with: (cpuset)

It was comparing 'memory\n' (as read from /sys/fs/cgroup/cgroup.controllers) with 'memory' (as stored in ctl_names), so the match failed. The \n at the end of the content of /sys/fs/cgroup/cgroup.controllers should be removed or ignored.
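
To illustrate the mismatch described above, here is a minimal standalone sketch. It is not the code from cgroup_v2.c: the helper name controller_listed and its delims parameter are hypothetical and exist only for this example. It splits the file contents with strtok and compares each token against the wanted controller name, so the effect of leaving '\n' out of the delimiter set can be seen directly.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (an assumption for illustration, not a Slurm function).
 * Returns true if `name` appears as a whole token in `contents` when the
 * string is split on the characters in `delims`.
 */
bool controller_listed(const char *contents, const char *name,
                       const char *delims)
{
    bool found = false;

    /* strtok modifies its input, so work on a private copy. */
    char *copy = malloc(strlen(contents) + 1);
    if (!copy)
        return false;
    strcpy(copy, contents);

    for (char *tok = strtok(copy, delims); tok; tok = strtok(NULL, delims)) {
        /* With delims = " ", the last token keeps its trailing '\n'. */
        if (strcmp(tok, name) == 0) {
            found = true;
            break;
        }
    }

    free(copy);
    return found;
}
```

Given the contents "cpuset cpu io memory\n", calling this helper with " " as the delimiter set does not find "memory", because the last token is "memory\n". With " \n" as the delimiter set the newline is consumed as a separator and "memory" matches, which is consistent with the behaviour reported for the attached patch.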