I am trying to add three new nodes running Ubuntu 22.04 to my existing Slurm cluster (the other nodes in this cluster currently run Ubuntu 18.04). I configured and built Slurm in my usual way, but when I try to start slurmd on the new nodes, I get the following error:

slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm/cgroup_v2.so
slurmd: debug4: /usr/lib/x86_64-linux-gnu/slurm/cgroup_v2.so: Does not exist or not a regular file.
slurmd: error: Couldn't find the specified plugin name for cgroup/v2 looking at all files
slurmd: debug3: plugin_peek->_verify_syms: found Slurm plugin name:Cgroup v1 plugin type:cgroup/v1 version:0x160508
slurmd: error: cannot find cgroup plugin for cgroup/v2
slurmd: error: cannot create cgroup context for cgroup/v2
slurmd: error: Unable to initialize cgroup plugin
slurmd: error: slurmd initialization failed

I did verify that no "cgroup_v2.so" was built (although I do have a "cgroup_v1.so" there). I have both a "cgroup.conf" and a "cgroup_allowed_devices_file.conf" file on the controller (we are running "configless"); I will attach these to the ticket.

Please help me to resolve this slurmd startup issue on these new nodes.
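The check described above (whether "cgroup_v2.so" was built alongside "cgroup_v1.so") can be sketched as a small shell helper. This is only an illustration: the function name check_cgroup_plugins is hypothetical, and the default directory is simply the plugin path that appears in the slurmd log, which may differ on other builds.

```shell
# Report which cgroup plugins exist in a Slurm plugin directory.
# Assumption: the default directory is the one printed in the slurmd log;
# pass a different directory as $1 if your build installs elsewhere.
check_cgroup_plugins() {
    plugin_dir="${1:-/usr/lib/x86_64-linux-gnu/slurm}"
    for name in cgroup_v1 cgroup_v2; do
        if [ -f "$plugin_dir/$name.so" ]; then
            echo "$name: present"
        else
            echo "$name: missing"
        fi
    done
}
```

Running check_cgroup_plugins with no argument on an affected node would be expected to report cgroup_v2 as missing, matching the "Does not exist or not a regular file" message.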
Created attachment 30177 [details] cgroup.conf file
Created attachment 30178 [details] cgroup_allowed_devices_file.conf file
Hello Will,
Try adding this line to your cgroup.conf file.
> CgroupPlugin=cgroup/v1
Let me know if this works for you.
Hello Will, Where did you compile slurm for the nodes? Did you install from source or did you use RPMs? Also, could we get your config.log?
It is compiled from source, not installed from generated RPMs... I will attach the file you ask for, but - where would I look for this? In the source compilation directory?
Adding this line and trying to start slurmd did not work; I get the same error as before.
Created attachment 30193 [details] config.log from build directory
I'm afraid I'll have to ask for the priority to be increased on this ticket; it is blocking use of some new, expensive GPU servers, and the research groups are clamoring for access...
Hello Will,
Looking through your logs I see that it says that no dbus-1 package is found. Could you run this command to check if you have dbus installed on your nodes?
> apt list --installed | grep dbus-1
Please send us the output.
Hello Will,
Please also send us the output of this command.
> mount | grep -i cgroup
root@cipr-gpu05:~# apt list --installed | grep dbus-1
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
libdbus-1-3/jammy-updates,jammy-security,now 1.12.20-2ubuntu4.1 amd64 [installed,automatic]
root@cipr-gpu05:~#
root@cipr-gpu05:~# mount | grep -i cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
root@cipr-gpu05:~#
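The mount output above shows only a cgroup2 filesystem, which is consistent with slurmd looking for the cgroup/v2 plugin on this node. As a rough sketch, the same reading of the output can be automated. The function name cgroup_mode and its four labels are made up for illustration, and it assumes the standard "device on mountpoint type fstype (options)" layout with no spaces in the mount point.

```shell
# Classify the cgroup setup from `mount` output read on stdin.
# Prints "v2" (only cgroup2 mounts), "v1" (only cgroup mounts),
# "hybrid" (both kinds), or "none".
# Assumption: field 5 is the filesystem type, as in
#   cgroup2 on /sys/fs/cgroup type cgroup2 (rw,...)
cgroup_mode() {
    awk '
        $5 == "cgroup2" { v2 = 1 }
        $5 == "cgroup"  { v1 = 1 }
        END {
            if (v1 && v2) print "hybrid"
            else if (v2)  print "v2"
            else if (v1)  print "v1"
            else          print "none"
        }'
}
```

A possible invocation is `mount | cgroup_mode`.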
Taking a look (for the first time) at the config.log from the build that does not work, I see the log says:

configure:24611: checking for dbus-1
configure:24618: $PKG_CONFIG --exists --print-errors "dbus-1"
Package dbus-1 was not found in the pkg-config search path.
Perhaps you should add the directory containing `dbus-1.pc'
to the PKG_CONFIG_PATH environment variable
No package 'dbus-1' found
configure:24621: $? = 1
configure:24635: $PKG_CONFIG --exists --print-errors "dbus-1"
Package dbus-1 was not found in the pkg-config search path.
Perhaps you should add the directory containing `dbus-1.pc'
to the PKG_CONFIG_PATH environment variable
No package 'dbus-1' found
configure:24638: $? = 1
configure:24652: result: no
No package 'dbus-1' found
configure:24689: WARNING: unable to link against dbus-1 libraries required for cgroup/v2

So I went looking for "dbus-1" header package for U22.04, and found this:

https://ubuntu.pkgs.org/22.04/ubuntu-main-arm64/libdbus-1-dev_1.12.20-2ubuntu4_arm64.deb.html

So after installing the "libdbus-1-dev" package, and running a reconfigure & rebuild, voila, it works!

slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm/proctrack_cgroup.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Process tracking via linux cgroup freezer subsystem type:proctrack/cgroup version:0x160508
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm/task_cgroup.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Tasks containment cgroup plugin type:task/cgroup version:0x160508
slurmd: debug: task/cgroup: init: Tasks containment cgroup plugin loaded
slurmd: debug3: Success.
[...]
slurmd: debug3: slurmd initialization successful
slurmd: slurmd version 22.05.8 started
slurmd: debug3: finished daemonize

So thanks for the point in the right direction!
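For anyone hitting the same problem, the diagnosis above boils down to finding the WARNING lines that configure wrote into config.log. A minimal sketch follows; the function name configure_warnings is hypothetical, and the pattern assumes warnings are logged in the "configure:NNNNN: WARNING: ..." form seen in this log.

```shell
# Print the WARNING lines that configure recorded in a config.log.
# Assumption: $1 is the path to config.log; it defaults to ./config.log,
# i.e. the function is run from the build directory.
# Exits non-zero (grep's status) when no warning lines are found.
configure_warnings() {
    grep -E '^configure:[0-9]+: WARNING:' "${1:-config.log}"
}
```

On the failing build in this ticket, this would surface the "unable to link against dbus-1 libraries required for cgroup/v2" line. The fix that worked here was installing the "libdbus-1-dev" package, then reconfiguring and rebuilding.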
One more question, if I could. In my current cgroup.conf file, I had to comment out the "TaskAffinity=no" line, but I still have the "AllowedDevicesFile=..." line, and I see I'm getting this warning:
WARNING: AllowedDevicesFile option is obsolete, please remove it from your config.
Can you please advise what my cgroup.conf file should have in it for 22.05?
Thanks
Hello Will,
I'm glad to hear you found your solution! The AllowedDevicesFile option is no longer needed, as slurmstepd now creates a BPF program dynamically using the devices outlined in your gres.conf.
It would be best to remove that line. > https://slurm.schedmd.com/cgroup_v2.html#ebpf_controller
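As a sketch of the cleanup discussed above, the two options this ticket ran into (TaskAffinity, which had to be commented out, and AllowedDevicesFile, which is reported as obsolete) can be commented out with sed. The function name disable_obsolete_cgroup_opts is made up for illustration. It prints to stdout rather than editing in place, and it matches the option names case-sensitively as written in this thread, so review the result before replacing the real file.

```shell
# Print a copy of a cgroup.conf with AllowedDevicesFile and TaskAffinity
# lines commented out. $1 is the path to the file; the file itself is
# not modified. Matching is case-sensitive (an assumption of this sketch).
disable_obsolete_cgroup_opts() {
    sed -E 's/^[[:space:]]*(AllowedDevicesFile|TaskAffinity)[[:space:]]*=/#&/' "$1"
}
```

One possible use is `disable_obsolete_cgroup_opts cgroup.conf > cgroup.conf.new`, followed by a diff against the original.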
When I edit cgroup.conf, do I have to scontrol reconfigure thereafter?
You will need to restart your slurmds. cgroup.conf is used on the compute nodes, so there is no need to restart the controller.
Closing this one now.