Ticket 16680

Summary: Slurmd not starting due to "cannot find cgroup plugin for cgroup/v2"
Product: Slurm
Reporter: Will Dennis <wdennis>
Component: slurmd
Assignee: Benjamin Witham <benjamin.witham>
Status: RESOLVED INFOGIVEN
Severity: 2 - High Impact    
Priority: ---    
Version: 22.05.8   
Hardware: Linux   
OS: Linux   
Site: NEC Labs
Linux Distro: Ubuntu
Attachments: cgroup.conf file
cgroup_allowed_devices_file.conf file
config.log from build directory

Description Will Dennis 2023-05-09 09:31:54 MDT
I am trying to add three new nodes running Ubuntu 22.04 to my existing Slurm cluster (other nodes in this cluster currently running Ubuntu 18.04.) I configured and built Slurm in my usual way, but when I try to start slurmd on the new nodes, I get the following error:

slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm/cgroup_v2.so
slurmd: debug4: /usr/lib/x86_64-linux-gnu/slurm/cgroup_v2.so: Does not exist or not a regular file.
slurmd: error: Couldn't find the specified plugin name for cgroup/v2 looking at all files
slurmd: debug3: plugin_peek->_verify_syms: found Slurm plugin name:Cgroup v1 plugin type:cgroup/v1 version:0x160508
slurmd: error: cannot find cgroup plugin for cgroup/v2
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed

I did verify that no "cgroup_v2.so" was built (although I do have a "cgroup_v1.so" there).

I have both a "cgroup.conf" and a "cgroup_allowed_devices_file.conf" file on the controller (we are running "configless"); I will attach these to the ticket.

Please help me to resolve this slurmd startup issue on these new nodes.
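
A quick way to confirm which cgroup plugins were actually built, and which cgroup hierarchy the node mounts, might look like the sketch below. The function names are made up for illustration. The default plugin path is the one from the slurmd log above, and `stat -fc %T` assumes GNU coreutils.

```shell
#!/bin/sh
# Report which Slurm cgroup plugins exist in a plugin directory.
# Default path is taken from the slurmd log in this ticket; pass another as $1.
list_cgroup_plugins() {
    plugin_dir="${1:-/usr/lib/x86_64-linux-gnu/slurm}"
    for v in v1 v2; do
        if [ -f "$plugin_dir/cgroup_$v.so" ]; then
            echo "cgroup/$v: present"
        else
            echo "cgroup/$v: missing"
        fi
    done
}

# Print the filesystem type of the cgroup mount point.
# "cgroup2fs" indicates a unified (v2) hierarchy.
cgroup_fs_type() {
    stat -fc %T "${1:-/sys/fs/cgroup}"
}
```

Running `list_cgroup_plugins` and `cgroup_fs_type` on an affected node shows at a glance whether the build and the host disagree about the cgroup version.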
Comment 1 Will Dennis 2023-05-09 09:34:19 MDT
Created attachment 30177 [details]
cgroup.conf file
Comment 2 Will Dennis 2023-05-09 09:35:00 MDT
Created attachment 30178 [details]
cgroup_allowed_devices_file.conf file
Comment 3 Benjamin Witham 2023-05-09 13:14:11 MDT
Hello Will,

Try adding this line to your cgroup.conf file.

> CgroupPlugin=cgroup/v1

Let me know if this works for you.
Comment 5 Benjamin Witham 2023-05-09 14:19:38 MDT
Hello Will,

Where did you compile slurm for the nodes? Did you install from source or did you use RPMs?

Also, could we get your config.log?
Comment 6 Will Dennis 2023-05-09 14:31:11 MDT
It is compiled from source, not installed from generated RPMs...
I will attach the file you ask for, but - where would I look for this? In the source compilation directory?
Comment 7 Will Dennis 2023-05-09 15:03:02 MDT
Adding the "CgroupPlugin=cgroup/v1" line and trying to start slurmd did not work; I get the same error as before.
Comment 8 Will Dennis 2023-05-09 15:05:28 MDT
Created attachment 30193 [details]
config.log from build directory
Comment 9 Will Dennis 2023-05-10 11:55:14 MDT
I'm afraid I'll have to ask for the priority to be increased on this ticket; it is blocking use of some new, expensive GPU servers, and the research groups are clamoring for access...
Comment 10 Benjamin Witham 2023-05-10 12:15:17 MDT
Hello Will, 

Looking through your config.log, I see that no dbus-1 package was found. Could you run this command to check whether dbus is installed on your nodes?

> apt list --installed | grep dbus-1

Please send us the output.
Comment 11 Benjamin Witham 2023-05-10 12:17:54 MDT
Hello Will,

Please also send us the output of this command.

> mount | grep -i cgroup
Comment 12 Will Dennis 2023-05-10 12:19:12 MDT
root@cipr-gpu05:~# apt list --installed | grep dbus-1

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libdbus-1-3/jammy-updates,jammy-security,now 1.12.20-2ubuntu4.1 amd64 [installed,automatic]
root@cipr-gpu05:~#
Comment 13 Will Dennis 2023-05-10 12:19:46 MDT
root@cipr-gpu05:~# mount | grep -i cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
root@cipr-gpu05:~#
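
For checking several nodes, the `mount` output can be classified with a small filter. This is only a sketch: `cgroup_mode` is a hypothetical helper name, and it assumes the usual `SOURCE on TARGET type FSTYPE (OPTIONS)` layout of Linux `mount` output with no spaces in the mount point.

```shell
#!/bin/sh
# Classify the cgroup layout from `mount` output read on stdin.
# Prints v2 (unified), v1 (legacy), hybrid (both mounted), or none.
cgroup_mode() {
    awk '
        $5 == "cgroup2" { v2 = 1 }   # field 5 is the filesystem type
        $5 == "cgroup"  { v1 = 1 }
        END {
            if (v1 && v2) print "hybrid"
            else if (v2)  print "v2"
            else if (v1)  print "v1"
            else          print "none"
        }'
}
```

Usage: `mount | cgroup_mode`. The output above, with a single cgroup2 mount on /sys/fs/cgroup, falls into the v2 case.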
Comment 14 Will Dennis 2023-05-10 12:34:53 MDT
Taking a look (for the first time) at the config.log from the build that does not work, I see the log says:

configure:24611: checking for dbus-1
configure:24618: $PKG_CONFIG --exists --print-errors "dbus-1"
Package dbus-1 was not found in the pkg-config search path.
Perhaps you should add the directory containing `dbus-1.pc'
to the PKG_CONFIG_PATH environment variable
No package 'dbus-1' found
configure:24621: $? = 1
configure:24635: $PKG_CONFIG --exists --print-errors "dbus-1"
Package dbus-1 was not found in the pkg-config search path.
Perhaps you should add the directory containing `dbus-1.pc'
to the PKG_CONFIG_PATH environment variable
No package 'dbus-1' found
configure:24638: $? = 1
configure:24652: result: no
No package 'dbus-1' found
configure:24689: WARNING: unable to link against dbus-1 libraries required for cgroup/v2

So I went looking for the "dbus-1" development (header) package for Ubuntu 22.04, and found this:
https://ubuntu.pkgs.org/22.04/ubuntu-main-arm64/libdbus-1-dev_1.12.20-2ubuntu4_arm64.deb.html

So after installing the "libdbus-1-dev" package, and running a reconfigure & rebuild, voila, it works!

slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm/proctrack_cgroup.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Process tracking via linux cgroup freezer subsystem type:proctrack/cgroup version:0x160508
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm/task_cgroup.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Tasks containment cgroup plugin type:task/cgroup version:0x160508
slurmd: debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
slurmd: debug3: Success.
[...]
slurmd: debug3: slurmd initialization successful
slurmd: slurmd version 22.05.8 started
slurmd: debug3: finished daemonize

So thanks for the point in the right direction!
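
The diagnosis and fix from this comment could be scripted roughly as follows. This is a sketch under assumptions: both function names are hypothetical, `rebuild_with_dbus` must be run from the Slurm source tree, and the configure options are passed through because the ticket does not say which ones the site uses.

```shell
#!/bin/sh
# Return 0 if a Slurm config.log shows cgroup/v2 was skipped for lack of dbus-1.
cgroup_v2_skipped() {
    grep -q 'unable to link against dbus-1 libraries required for cgroup/v2' "$1"
}

# Install the dbus-1 development package, then reconfigure and rebuild.
# Defined only; nothing runs until the function is called explicitly.
# "$@" carries the site's usual configure options (not listed in the ticket).
rebuild_with_dbus() {
    sudo apt-get install -y libdbus-1-dev || return 1
    pkg-config --exists dbus-1 || return 1   # same check that configure logged
    ./configure "$@" && make && sudo make install
}
```

For example, `cgroup_v2_skipped config.log && echo "cgroup/v2 was not built"` flags an affected build directory before any slurmd is started.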
Comment 15 Will Dennis 2023-05-10 12:40:02 MDT
One more question, if I may: in my current cgroup.conf file I had to comment out the "TaskAffinity=no" line. I still have the "AllowedDevicesFile=..." line, and I see I'm getting this warning:

WARNING: AllowedDevicesFile option is obsolete, please remove it from your config.

Can you please advise what my cgroup.conf file should have in it for 22.05?

Thanks
Comment 16 Benjamin Witham 2023-05-10 12:58:23 MDT
Hello Will,

I'm glad to hear you found your solution!

The AllowedDevicesFile option is no longer needed, as slurmstepd now creates a BPF program dynamically using the devices outlined in your gres.conf.
Comment 17 Benjamin Witham 2023-05-10 12:59:28 MDT
It would be best to remove that line.

> https://slurm.schedmd.com/cgroup_v2.html#ebpf_controller
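
One possible way to retire the two options mentioned in this ticket is to comment them out rather than delete them. The helper name below is hypothetical, and the match is case-sensitive on the exact spellings used above. Since the site runs configless, the file to edit would be the copy on the controller.

```shell
#!/bin/sh
# Comment out the cgroup.conf options this ticket reports as rejected on 22.05.
# The original file is kept alongside with a .bak suffix.
comment_out_obsolete() {
    sed -i.bak -E 's/^[[:space:]]*(AllowedDevicesFile|TaskAffinity)[[:space:]]*=/#&/' "$1"
}
```

Usage: `comment_out_obsolete /path/to/cgroup.conf`, then compare the result against the `.bak` copy before distributing it.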
Comment 18 Will Dennis 2023-05-10 18:41:13 MDT
When I edit cgroup.conf, do I have to scontrol reconfigure thereafter?
Comment 19 Jason Booth 2023-05-10 19:09:03 MDT
You will need to restart your slurmds. cgroup.conf is used on the compute nodes, so there is no need to restart the controller.
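
A cautious way to do this across several nodes is to generate the restart commands first and review them. The sketch assumes slurmd is managed as a systemd unit named "slurmd" and that the nodes accept root ssh logins; neither is confirmed in this ticket, and `restart_cmds` is a made-up helper name.

```shell
#!/bin/sh
# Print (do not run) one slurmd restart command per node name given.
# Assumes a systemd unit called "slurmd" and root ssh access to each node.
restart_cmds() {
    for node in "$@"; do
        printf 'ssh root@%s systemctl restart slurmd\n' "$node"
    done
}
```

Usage: `restart_cmds cipr-gpu05` prints the command for that node; piping the output to `sh` would execute it.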
Comment 20 Will Dennis 2023-05-15 20:56:51 MDT
Closing this one now.