Created attachment 27634 [details]
Slurm Configuration

Hello SchedMD,

We are seeing errors like the following from several jobs:

slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_794240/job_64633746/step_batch/cgroup.procs: No such file or directory

Any idea what could be causing these?

Thanks,
Steve
(In reply to Steve Ford from comment #0)
> Created attachment 27634 [details]
> Slurm Configuration
>
> Hello SchedMD,
>
> We are seeing errors like the following from several jobs:
>
> slurmstepd: error: _cgroup_procs_check: failed on path
> /sys/fs/cgroup/freezer/slurm/uid_794240/job_64633746/step_batch/cgroup.procs:
> No such file or directory
>
> Any idea what could be causing these?
>
> Thanks,
> Steve

Hi Steve,

Can you please upload the slurmd logs with debug2 and the CGROUP debug flag activated? This can happen on very short jobs on a system where the kernel takes a moment to create the cgroup, and I need to see exactly when the error happens. Is it reproducible?

It could be similar to bug 14293.
Created attachment 27648 [details] Slurmd log for job 64137494
Arghh, the log shows the error for a step that was running prior to setting the new log level, so it did not log what I expected.

[2022-11-08T12:20:42.797] debug2: container signal 15 to StepId=64137494.extern
[2022-11-08T12:20:42.797] [64137494.batch] error: Detected 1 oom-kill event(s) in StepId=64137494.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2022-11-08T12:20:42.798] debug2: container signal 15 to StepId=64137494.batch
[2022-11-08T12:20:42.798] [64137494.batch] error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1035122/job_64137494/step_batch/cgroup.procs: No such file or directory

I see this error occurs close to a signal 15 and an OOM event. Have you reproduced it again with new steps after increasing the debug level and flags?
Created attachment 27688 [details] slurmd logs for job 64884268
Do you have Delegate=yes in systemd's slurmd unit file? Do you have Weka or Bright in the system? Can I see a "cat /proc/mounts" on an affected node?

Thanks
Felip,

We are using Delegate=Yes in our slurmd unit file. We do have Weka available on our system as a software module, but it does not appear to be in use by jobs throwing this error.

Here is /proc/mounts on a node where this error occurred:

sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
devtmpfs /dev devtmpfs rw,nosuid,size=528096092k,nr_inodes=132024023,mode=755 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio,net_cls 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
/dev/mapper/vg_system-system_root / ext4 rw,relatime,data=ordered 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=35,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=94700 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/sda1 /boot ext4 rw,relatime,data=ordered 0 0
/dev/mapper/vg_system-system_puppet /opt/puppetlabs xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/vg_system-system_tmp /tmp xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/vg_system-system_var /var xfs rw,relatime,attr2,inode64,noquota 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
fs-07.i:/zsrv/el7optmodules /opt/modules nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.12.129,local_lock=none,addr=192.168.0.93 0 0
fs-07.i:/zsrv/el7optsoftware /opt/software nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.12.129,local_lock=none,addr=192.168.0.93 0 0
192.168.1.40:/mnt/nfs/crash /var/crash nfs rw,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.40,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=192.168.1.40 0 0
/etc/auto.cvmfs /cvmfs autofs rw,relatime,fd=5,pgrp=16610,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=185026 0 0
ufs18 /mnt/ufs18 gpfs rw,relatime 0 0
gs21 /mnt/gs21 gpfs rw,relatime 0 0
Hi, the error is harmless, but still something that probably needs to be fixed. It seems there's a race condition: the freezer cgroup is destroyed while we are still serving signal RPCs from an external source (slurmd, srun, scancel, ...). We need to lock the cgroup while destroying it and not let any other thread try to read it in the meantime or afterwards.

There's a related commit 26e96df68aa97, but in theory it has been included since 21.08.7 and may cover a slightly different situation. I will let you know when I have more conclusions, but for the moment there's no need to worry too much about it.
Hi Steve, can you please confirm that *all* your slurmds are running version 21.08.7 or later?
Hi Steve,

First of all, sorry for taking so long to reply. Besides having higher-priority bugs (this one occurred only under a specific condition and was not harmful), I had not found the issue until now, and I had a hard time reproducing it in newer versions.

I have finally found the cause of this issue and am already working on a solution. This turned out to be a duplicate of bug 14293. If you don't mind, I am closing this one since it was reported later, and we will continue in bug 14293. I briefly explain the situation there, in https://bugs.schedmd.com/show_bug.cgi?id=14293#c35

I was already suspecting a signal arriving while we were removing the cgroup, but hadn't found exactly why it was causing issues. Now I have.

Thanks for your understanding and patience.

*** This ticket has been marked as a duplicate of ticket 14293 ***