Hi,

I am using SLURM 22.05.9 on a small compute cluster. Since I updated two of our nodes, I get the following error when launching a job on them:

slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

The cgroups also no longer seem to work properly on those nodes: I can see all GPUs even if I do not request any, which is not the case on the other nodes.

The one difference I found between the updated nodes and the original nodes (all are Ubuntu 22.04) is the kernel version: "5.15.0-89-generic #99-Ubuntu SMP" on the working nodes and "5.15.0-91-generic #101-Ubuntu SMP" on the updated nodes.

I could not figure out how to install exactly the -89 kernel on the updated nodes, but I noticed that when I install kernel 5.15.0 with this tool: https://github.com/pimlie/ubuntu-mainline-kernel.sh, the error message disappears. So the problem seems to have been introduced by the -91 update specifically.

I am not sure how to debug this issue further, but I am happy to provide more information if needed.

Best,
Tim
I think this is a regression in the 5.15 kernel. I opened a bug at https://bugs.launchpad.net/bugs/2050098
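
In case it helps anyone else compare a working node with a broken one, here is a minimal sketch that prints the two values mentioned above: the running kernel release and the MEMLOCK limit. The helper names (format_limit, node_report) are made up for this example. Note that it reports the limit of the process that runs it, which is not necessarily the limit slurmstepd runs with, so it is probably most useful when launched inside a job on each node.

```python
import platform
import resource


def format_limit(value):
    # RLIM_INFINITY is how the kernel reports "no limit".
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def node_report():
    # Soft and hard MEMLOCK limits (in bytes) of the *current* process.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return {
        "kernel": platform.release(),  # e.g. the string shown by `uname -r`
        "memlock_soft": format_limit(soft),
        "memlock_hard": format_limit(hard),
    }


if __name__ == "__main__":
    for key, value in node_report().items():
        print(f"{key}: {value}")
```

If the limits are identical on both kinds of nodes and only the kernel release differs, that would be consistent with a kernel regression rather than a configuration problem.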