Ticket 15314

Summary: slurmd can't find memory cgroup controller in spite of it being enabled
Product: Slurm Reporter: foufou33
Component: Other Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID QA Contact:
Severity: 6 - No support contract    
Priority: --- CC: foufou33
Version: 22.05.5   
Hardware: Linux   
OS: Linux   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: patch to add debug output
more log
patch

Description foufou33 2022-10-28 22:44:22 MDT
Created attachment 27517 [details]
patch to add debug output

I was experimenting with Slurm 22.05.5.1 and cgroup v2 when I noticed that slurmd complained about the memory cgroup controller not being enabled, as shown here:

[2022-10-28T23:22:09.434] error: Controller memory is not enabled!
[2022-10-28T23:22:09.434] Resource spec: Reserved abstract CPU IDs: 60-63
[2022-10-28T23:22:09.434] Resource spec: Reserved machine CPU IDs: 30-31,62-63
[2022-10-28T23:22:09.434] error: memory cgroup controller is not available.
[2022-10-28T23:22:09.434] error: Resource spec: unable to initialize system memory cgroup
[2022-10-28T23:22:09.434] error: Resource spec: system cgroup memory limit disabled
[2022-10-28T23:22:09.442] cred/munge: init: Munge credential signature plugin loaded
 

# cat /sys/fs/cgroup/cgroup.controllers 
cpuset cpu io memory
#

I added a few debug statements to _get_controllers (src/plugins/cgroup/v2/cgroup_v2.c; see the attached patch).


The (partial) result:

[2022-10-29T00:14:37.275] debug2: cgroup/v2: _get_controllers: _get_controllers: controller: (cpu) not found
[2022-10-29T00:14:37.275] debug2: cgroup/v2: _get_controllers: _get_controllers: trying controller: (memory
)
[2022-10-29T00:14:37.275] debug2: cgroup/v2: _get_controllers: _get_controllers: comparing with: (freezer)
[2022-10-29T00:14:37.275] debug2: cgroup/v2: _get_controllers: _get_controllers: controller: (freezer) not found
[2022-10-29T00:14:37.275] debug2: cgroup/v2: _get_controllers: _get_controllers: comparing with: (cpuset)


It was comparing 'memory\n' (as read from /sys/fs/cgroup/cgroup.controllers) with 'memory' (as stored in ctl_names), so the match failed.

The trailing \n in the contents of /sys/fs/cgroup/cgroup.controllers should be stripped or ignored.
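
To make the mismatch concrete, here is a minimal standalone sketch, not the actual Slurm code. The helper name controller_listed and its delims parameter are hypothetical; the input string is the one shown in the cat output above, with the trailing newline that the file contains.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, not Slurm code: report whether `name` appears as a
 * token in `contents` when the text is split on the characters in `delims`.
 */
static bool controller_listed(const char *contents, const char *name,
                              const char *delims)
{
    char buf[256];
    char *save_ptr = NULL;

    assert(contents && name && delims);

    /* strtok_r modifies its input, so tokenize a bounded private copy. */
    strncpy(buf, contents, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (char *tok = strtok_r(buf, delims, &save_ptr); tok;
         tok = strtok_r(NULL, delims, &save_ptr)) {
        /* Exact comparison: "memory\n" is not equal to "memory". */
        if (strcmp(tok, name) == 0)
            return true;
    }
    return false;
}
```

With " " as the only delimiter, controller_listed("cpuset cpu io memory\n", "memory", " ") returns false, because the last token keeps its newline. Controllers earlier in the line, such as "cpu", still match, which is consistent with only the last controller in the file being reported as missing.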
Comment 1 foufou33 2022-10-28 22:45:09 MDT
Created attachment 27518 [details]
more log

The rest of the log generated by my debug patch.
Comment 2 foufou33 2022-11-03 00:45:18 MDT
Created attachment 27566 [details]
patch

Adding '\n' to the strtok delimiters seems to fix the problem.
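
The idea can be sketched as follows. This is an illustration under stated assumptions, not the attached patch: the table example_ctl_names and the function parse_controllers are hypothetical stand-ins, and the real ctl_names table in cgroup_v2.c may list different controllers in a different order.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative table only; the real ctl_names in cgroup_v2.c may differ. */
static const char *const example_ctl_names[] = {
    "cpuset", "cpu", "io", "memory"
};
#define EXAMPLE_CTL_CNT \
    (sizeof(example_ctl_names) / sizeof(example_ctl_names[0]))

/*
 * Hypothetical helper: return a bitmap with bit i set when
 * example_ctl_names[i] appears in `contents`. Splitting on both space and
 * newline means the trailing '\n' never ends up inside a token.
 */
static unsigned int parse_controllers(const char *contents)
{
    char buf[256];
    char *save_ptr = NULL;
    unsigned int bitmap = 0;

    assert(contents);

    /* strtok_r modifies its input, so tokenize a bounded private copy. */
    strncpy(buf, contents, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (char *tok = strtok_r(buf, " \n", &save_ptr); tok;
         tok = strtok_r(NULL, " \n", &save_ptr)) {
        for (size_t i = 0; i < EXAMPLE_CTL_CNT; i++) {
            if (strcmp(tok, example_ctl_names[i]) == 0)
                bitmap |= 1u << i;
        }
    }
    return bitmap;
}
```

For the input "cpuset cpu io memory\n" this sets all four bits, including the one for memory. Treating the newline as a delimiter also avoids modifying the buffer separately to strip it.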