Ticket 18623 - slurmstepd fails to create cgroup with kernel 5.15.0-91-generic
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 22.05.9
Hardware: Linux Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2024-01-09 10:04 MST by Tim Schneider
Modified: 2024-01-23 10:51 MST
Linux Distro: Ubuntu
Machine Name: 22.04

Description Tim Schneider 2024-01-09 10:04:38 MST
Hi,

I am using SLURM 22.05.9 on a small compute cluster. Since I updated two of our nodes, I get the following error when launching a job:

slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).
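The error message points at the locked-memory limit (MEMLOCK). As a minimal sketch of how that limit could be checked, the Python snippet below parses the "Max locked memory" row of a /proc/<pid>/limits file, which is where Linux reports the limits of a running process. The function names are made up for illustration, and which process to inspect (for example the slurmd daemon that spawns slurmstepd) is an assumption, not something stated in this report.

```python
def parse_max_locked_memory(limits_text):
    """Return (soft, hard) from the 'Max locked memory' row, or None.

    Values are returned as strings because the kernel prints either a
    byte count or the word 'unlimited'.
    """
    label = "Max locked memory"
    for line in limits_text.splitlines():
        if line.startswith(label):
            fields = line[len(label):].split()
            if len(fields) >= 2:
                return fields[0], fields[1]
    return None


def read_max_locked_memory(pid="self"):
    """Read /proc/<pid>/limits; 'self' means the current process."""
    with open(f"/proc/{pid}/limits", encoding="ascii") as handle:
        return parse_max_locked_memory(handle.read())


if __name__ == "__main__":
    try:
        print("MEMLOCK (soft, hard):", read_max_locked_memory())
    except OSError as exc:
        # /proc is Linux-specific; report the failure instead of crashing.
        print("could not read limits:", exc)
```

Passing the PID of the daemon instead of "self" shows the limit that its child processes inherit. A result of "unlimited" would suggest that MEMLOCK itself is not what the BPF load is running into.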

Also, the cgroups do not seem to work properly anymore on these nodes: I can see all GPUs even if I do not request any, which is not the case on the other nodes.

One difference I found between the updated nodes and the original nodes (both are Ubuntu 22.04) is the kernel version, which is "5.15.0-89-generic #99-Ubuntu SMP" on the functioning nodes and "5.15.0-91-generic #101-Ubuntu SMP" on the updated nodes. I could not figure out how to install the exact earlier kernel version (5.15.0-89) on the updated nodes, but I noticed that when I install 5.15.0 with this tool: https://github.com/pimlie/ubuntu-mainline-kernel.sh, the error message disappears. So the problem seems to have been introduced specifically by the -91 kernel update.
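To see which nodes are on which kernel, a small check along these lines could be run on each node. This is only a sketch: the table holds nothing more than the two release strings reported above, the status labels and function name are hypothetical, and any other kernel (including mainline builds, whose release strings look different) is reported as unknown rather than guessed at.

```python
import platform

# Release strings taken from this report; the status labels are illustrative.
REPORTED_KERNELS = {
    "5.15.0-89-generic": "ok (cgroup setup works per this report)",
    "5.15.0-91-generic": "affected (BPF load error per this report)",
}


def classify_kernel(release):
    """Map a kernel release string to the status observed in this report."""
    return REPORTED_KERNELS.get(release, "unknown (not covered by this report)")


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r`.
    current = platform.release()
    print(f"{current}: {classify_kernel(current)}")
```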

I am not sure how to debug this issue further, but I am happy to provide more information if needed.

Best,

Tim
Comment 1 Stefan 2024-01-23 07:36:25 MST
I think this is a regression in the 5.15 kernel. I opened a bug at https://bugs.launchpad.net/bugs/2050098