This could be related to a previous issue: https://bugs.schedmd.com/show_bug.cgi?id=14191

This shows up only with a simple "srun" that produces output, e.g.:

$ srun uname -a
Linux node002 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2627803/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2627803/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2627803/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2627803/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2627803/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2627803/step_0/cgroup.procs'

or

$ srun echo foobar
foobar
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2627804/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2627804/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2627804/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2627804/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2627804/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2627804/step_0/cgroup.procs'

Interactive jobs are OK:

[myname@login ~]$ srun --pty /bin/bash
[myname@node002 ~]$ echo $SLURM_JOB_ID
2627808

The mount seems to be there:
[myname@node002 ~]$ sudo grep cgroup /proc/mounts | grep freezer
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0

And the mount exists for that interactive shell job:

drwxr-xr-x 4 root root 0 Jun 9 14:47 /sys/fs/cgroup/freezer/slurm/uid_1002/job_2627808/

Maybe it's a duration thing. It works fine (i.e. no cgroup error) if I do this:

$ srun sleep 10 && uname -a
Linux picotte001 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

FWIW, "srun uname -a" worked as expected in 20.02 (the previous version we were running).

Cheers,
Dave
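One way to check the duration theory is to poll the freezer hierarchy while a short job runs and see whether the step directory ever becomes visible. This is a rough diagnostic sketch, not from the bug report: it assumes a v1 freezer mount at /sys/fs/cgroup/freezer, and the watch_for_dir helper name is made up for illustration.

```shell
#!/bin/sh
# watch_for_dir DIR PATTERN SECONDS
# Repeatedly scan DIR for paths matching PATTERN for SECONDS seconds,
# then print the unique set of paths that were ever observed.
watch_for_dir() {
    dir="$1"; pat="$2"; end=$(( $(date +%s) + $3 ))
    while [ "$(date +%s)" -lt "$end" ]; do
        find "$dir" -maxdepth 4 -name "$pat" 2>/dev/null
        sleep 0.1   # fractional sleep assumes GNU sleep
    done | sort -u
}
```

Run on a compute node (as root) alongside a short job, e.g. `watch_for_dir /sys/fs/cgroup/freezer/slurm 'step_*' 5 & srun uname -a; wait` — if the step directory never shows up in the output even though slurmstepd tried to read it, that points at the visibility race rather than a missing mount.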
Hi David,

Yes, this seems to be a continuation of bug 14191. As Tim commented there, he was not certain how it was fixed.

Can I ask you for a new slurmd log with the CGROUP debug flag enabled? I know you provided one in the other bug, but I want to start fresh. Also a complete "cat /proc/mounts".

I also suggest turning off CgroupAutomount=yes; this can get in the middle and hide OS configuration issues.

I also saw you had libcgroup-tools installed. Can you also check "systemctl status cgconfig.service" and attach your /etc/cgconfig.conf file?

Related to this last point, there's a daemon called uresourced that can make a mess of the controllers. I don't think that is the case here, but it would be good if you could check for the existence of this daemon on the nodes.

Thanks
Created attachment 25435 [details] /var/log/slurmd on compute node node002
Hi, Felip:

I turned on the debug flag in slurm.conf:

DebugFlags=cgroup

Then I first ran an interactive shell, doing just "echo hello" once it started, and then exited. And then I ran "srun uname -a". The /var/log/slurmd is attached. This node was just rebooted about 6 hrs ago (~10:00 local time).

/proc/mounts:

proc /proc proc rw,nosuid,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=98292044k,nr_inodes=24573011,mode=755 0 0
tmpfs /run tmpfs rw,relatime 0 0
/dev/sda1 / xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
bpf /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=35,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=37042 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime,pagesize=2M 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/sda6 /local xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
/dev/sda2 /var xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
/dev/sda3 /tmp xfs rw,nosuid,nodev,noatime,nodiratime,attr2,inode64,noquota 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
tracefs /sys/kernel/debug/tracing tracefs rw,relatime 0 0
beegfs_nodev /beegfs beegfs rw,relatime,cfgFile=/etc/beegfs/beegfs-client.conf 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/sysadmin /ifs/sysadmin nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.36,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.36 0 0
master:/cm/shared /cm/shared nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.25.128.1,mountvers=3,mountport=4002,mountproto=udp,local_lock=none,addr=172.25.128.1 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/opt_spack /ifs/opt_spack nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.33,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.33 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/groups /ifs/groups nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.35,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.35 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/opt /ifs/opt nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.37,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.37 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/home /home nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.43,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.43 0 0
tmpfs /run/user/0 tmpfs rw,nosuid,nodev,relatime,size=19665704k,mode=700 0 0

On node002:

$ sudo systemctl status cgconfig.service
● cgconfig.service - Control Group configuration service
   Loaded: loaded (/usr/lib/systemd/system/cgconfig.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

(In reply to Felip Moll from comment #1)
> [...]
(In reply to David Chin from comment #3)
> [...]

Hi David,

Your mountpoints seem correct and very close to mine.
The only difference I see is that for /sys/fs/cgroup/systemd you have a release agent set, but this should not affect the other mountpoints.

I looked at your log file, but it doesn't seem to have caught the DebugFlags=CGROUP setting; I don't see anything beyond "info"-level information. Can you please do it again, but now also adding SlurmdDebug=debug2? You should see lines prefixed with "CGROUP" and "debug2" in the logs if the setting is active. E.g.:

[2022-03-16T19:14:05.367] [1.interactive] cgroup/v1: _set_uint32_param: CGROUP: parameter 'cgroup.procs' set to '217003' for '/sys/fs/cgroup/cpuacct/slurm_gamba1/uid_1000/job_1/step_interactive/task_0'

So:

1. Enable DebugFlags=CGROUP and SlurmdDebug=debug2 in slurm.conf.
2. Do an "scontrol reconfig" or restart the slurmd.
3. Note the time.
4. Run the jobs which fail.
5. Send back the new logs, and also the output of 'dmesg -T' on the node.
6. Send also the slurmctld log.

I cannot do much with the current logs; I will be waiting for your input.
Hi, Felip:

I added this to slurm.conf and it seemed to work:

DebugFlags=CGROUP
SlurmdDebug=debug2

I think I used "cgroup" before, i.e. all lower case.

Then I ran two test jobs on node074:

1) srun -w node074 --pty /bin/bash -- then did "echo foobar", waited about 10 seconds, and exited the job shell. No cgroup error messages were printed.

2) srun -w node074 uname -a
Linux node074 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs'

Will attach the log after this comment.

Thanks,
Dave
Created attachment 25456 [details] /var/log/slurmd on node074
Hi, Felip:

Any news?

Thanks,
Dave
David,

I had prepared a response some days ago, but I was not entirely sure of its correctness. I was basically identifying this as a timing issue, as you already suggested.

I reviewed your whole log sequence step by step, and everything happens within some microseconds. The cgroup filesystem (especially v1) has been shown to be a bit slow at times, and we don't wait for the cgroup to show up before continuing, since we expect that the return from mkdir already guarantees its existence. But consider your tests and this log:

[2022-06-10T09:58:31.597] [2628733.0] cgroup/v1: xcgroup_wait_pid_moved: CGROUP: Took 2072 checks before stepd pid 29852 was removed from the cpuset cgroup.
...
[2022-06-10T09:58:31.624] [2628733.extern] cgroup/v1: xcgroup_wait_pid_moved: CGROUP: Took 3502 checks before stepd pid 29844 was removed from the cpuset cgroup.

This means your cgroup is a bit slow: it takes 2000-3500 iterations to wait for a pid to be removed from a cgroup controller.

I am wondering whether it is possible that mkdir returns, but the cgroup is not yet visible to the user even though the internal kernel structure has been created. There was one opposite situation, where removal took time to complete after an rmdir, and we had to add an active wait for the directory to disappear.

I discarded a plugin race, since the root cgroup is locked during creation and removal operations. We fixed that race possibility in 21.08.7+, commits 26e96df68a..91bd26c481 (bug 13423).

In any case, I could not reproduce this; I am not sure how to make my cgroup slower without modifying my kernel. Would you be able to try a test patch if I add an active poll or a wait after the directory creation?

Also one other question... after this error, are the directories from the error still in the cgroup filesystem?
[2022-06-10T09:58:31.584] [2628733.0] error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs: No such file or directory
[2022-06-10T09:58:31.584] [2628733.0] error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs'

Thanks, and sorry for the delay in my response.
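The "active poll after the directory creation" idea can be sketched in a few lines of shell. This is purely illustrative (Slurm implements the real logic in C inside the cgroup plugin); the mkdir_and_wait name, the retry limit, and the sleep interval are all assumptions, and fractional sleep requires GNU sleep.

```shell
#!/bin/sh
# mkdir_and_wait DIR [MAX_TRIES]
# Create DIR, then actively poll until DIR/cgroup.procs becomes visible
# or MAX_TRIES is exhausted. On a v1 cgroupfs the kernel populates
# cgroup.procs when the directory appears; on a plain filesystem this
# will time out, which models the race being discussed.
mkdir_and_wait() {
    dir="$1"; max="${2:-1000}"; tries=0
    mkdir -p "$dir" || return 1
    while [ ! -e "$dir/cgroup.procs" ] && [ "$tries" -lt "$max" ]; do
        tries=$((tries + 1))
        sleep 0.001
    done
    [ -e "$dir/cgroup.procs" ]   # succeed only once the file is visible
}
```

The point of the wait is that a successful return from mkdir is treated as a guarantee today; if the kernel structure exists but is not yet visible through the filesystem, only a poll like this (analogous to the existing xcgroup_wait_pid_moved loop) would close the window.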
But my hypothesis is not 100% reliable, because for cpuset, for example, it works just fine.

Another idea I am verifying is whether this can happen before the freezer has been initialized:

[2022-06-10T09:58:31.584] [2628733.0] debug: task/affinity: task_p_post_term: affinity StepId=2628733.0, task 0
[2022-06-10T09:58:31.584] [2628733.0] error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs: No such file or directory
[2022-06-10T09:58:31.584] [2628733.0] error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs'
Dave,

[2022-05-28T08:38:00.936] Considering each NUMA node as a socket
[2022-05-28T08:38:00.936] Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
[2022-05-28T08:38:00.937] Considering each NUMA node as a socket
[2022-05-28T08:38:00.939] error: unable to mount freezer cgroup namespace: Device or resource busy
[2022-05-28T08:38:00.939] error: unable to create freezer cgroup namespace
[2022-05-28T08:38:00.939] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2022-05-28T08:38:00.939] error: cannot create proctrack context for proctrack/cgroup
[2022-05-28T08:38:00.939] error: slurmd initialization failed
[2022-05-28T08:38:30.970] Considering each NUMA node as a socket
[2022-05-28T08:38:30.970] Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
[2022-05-28T08:38:30.971] Considering each NUMA node as a socket
[2022-05-28T08:38:30.974] error: unable to mount freezer cgroup namespace: Device or resource busy
[2022-05-28T08:38:30.974] error: unable to create freezer cgroup namespace
[2022-05-28T08:38:30.974] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2022-05-28T08:38:30.974] error: cannot create proctrack context for proctrack/cgroup
[2022-05-28T08:38:30.974] error: slurmd initialization failed
[2022-05-28T11:37:05.164] Considering each NUMA node as a socket

Can you tell me about these errors in your log? Why was the freezer cgroup not mounted at these times? Do you know what happened? This could be related.

Can you upload your slurm.conf and cgroup.conf?

Just to rule out some issues: do you use Bright on your cluster?
Created attachment 25932 [details] slurm.conf
Created attachment 25933 [details] cgroup.conf
Yes, we do use Bright Cluster Manager.

> Also one other question... after this error, are the directories of the error still in the cgroup filesystem?
>
> [2022-06-10T09:58:31.584] [2628733.0] error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs: No such file or directory
> [2022-06-10T09:58:31.584] [2628733.0] error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs'

No, the directories are not there.

The freezer cgroup directory seems to be mounted:

[root@node051]# ls -l /sys/fs/cgroup/freezer
total 0
-rw-r--r-- 1 root root 0 Jul 20 11:35 cgroup.clone_children
-rw-r--r-- 1 root root 0 Jul 18 16:16 cgroup.procs
-r--r--r-- 1 root root 0 Jul 20 11:35 cgroup.sane_behavior
-rw-r--r-- 1 root root 0 Jul 20 11:35 notify_on_release
-rw-r--r-- 1 root root 0 Jul 20 11:35 release_agent
drwxr-xr-x 6 root root 0 Jul 20 11:31 slurm/
-rw-r--r-- 1 root root 0 Jul 18 15:04 tasks
Hi, Felip: Feel free to lower the importance to "minor issue" since this hasn't impacted any users' actual jobs. Regards, Dave
It is interesting how Bright configures things. It sets swap parameters with ConstrainSwapSpace=no, and it enables *KMemSpace, which is not very reliable in certain situations.

I suggest this cleanup:

---cgroup.conf---
TaskAffinity=no
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=yes
----

In any case, this is not the cause of the issue.

I cannot reproduce it; I am not able to get such a small window between the creation and finish of a job, and that does indeed seem related to the problem.

From comment 12: can you tell me about these errors in your log? Why was the freezer cgroup not mounted at those times? Do you know what happened?
(In reply to Felip Moll from comment #17)
> [...]
>
> From comment 12: Can you tell me about these errors in your log? Why was
> the freezer cgroup not mounted at those times? Do you know what happened?

Well, that cgroup.conf could have been me. I was trying to figure out how to constrain some classes of jobs using cgroups, and I was tweaking those params.
(In reply to Felip Moll from comment #12)
> Can you tell me about these errors in your log? Why was the freezer cgroup
> not mounted at these times? Do you know what happened?

Not sure what's going on there. I just rebooted node002 for some unrelated reason. Its current /var/log/slurmd does not show any of those errors.
It does have multiple jobs running on it right now.

Trying to force it to happen:

$ srun -w node002 uname -a
srun: job 3477807 queued and waiting for resources
srun: job 3477807 has been allocated resources
Linux node002 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs'

That produces these lines in /var/log/slurmd:

[2022-07-25T11:18:43.756] launch task StepId=3477807.0 request from UID:1002 GID:1002 HOST:172.25.128.11 PORT:46924
[2022-07-25T11:18:43.757] task/affinity: lllp_distribution: JobId=3477807 implicit auto binding: cores,one_thread, dist 8192
[2022-07-25T11:18:43.757] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2022-07-25T11:18:43.757] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [3477807]: mask_cpu,one_thread, 0x000400000000
[2022-07-25T11:18:43.764] [3477807.0] Considering each NUMA node as a socket
[2022-07-25T11:18:43.770] [3477807.0] task/cgroup: _memcg_initialize: job: alloc=3900MB mem.limit=3900MB memsw.limit=unlimited
[2022-07-25T11:18:43.770] [3477807.0] task/cgroup: _memcg_initialize: step: alloc=3900MB mem.limit=3900MB memsw.limit=unlimited
[2022-07-25T11:18:43.839] [3477807.0] error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs: No such file or directory
[2022-07-25T11:18:43.839] [3477807.0] error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs'
[2022-07-25T11:18:43.839] [3477807.0] error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs: No such file or directory
[2022-07-25T11:18:43.839] [3477807.0] error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs'
[2022-07-25T11:18:43.839] [3477807.0] error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs: No such file or directory
[2022-07-25T11:18:43.839] [3477807.0] error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs'
[2022-07-25T11:18:43.860] [3477807.0] done with job
[2022-07-25T11:18:43.880] [3477807.extern] done with job

But the freezer directory does seem to exist:

[root@node002]# ll -d /sys/fs/cgroup/freezer
dr-xr-xr-x 3 root root 0 Jul 25 10:51 /sys/fs/cgroup/freezer/
(In reply to David Chin from comment #19)
> [...]
>
> But the freezer directory does seem to exist:
>
> [root@node002]# ll -d /sys/fs/cgroup/freezer
> dr-xr-xr-x 3 root root 0 Jul 25 10:51 /sys/fs/cgroup/freezer/

If I provide a debug patch, can you apply and test it on a node? I understand this is a production machine, so it would be enough to isolate one node, apply the patch to that slurmd, and reproduce the error there.

If this is possible, I will prepare it.
(In reply to Felip Moll from comment #20)
> If I provide a debug patch can you apply and test it on a node? I understand
> this is a production machine so it would be enough to isolate one node,
> apply the patch to that slurmd, and reproduce the error there.
>
> If this is possible I would prepare it.

Sure, I can set something up. It may take me a few days to put it in place.
(In reply to David Chin from comment #21)
> (In reply to Felip Moll from comment #20)
> > If I provide a debug patch can you apply and test it on a node? I understand
> > this is a production machine so it would be enough to isolate one node,
> > apply the patch to that slurmd, and reproduce the error there.
> >
> > If this is possible I would prepare it.
>
> Sure, I can set something up. It may take me some number of days to put it
> in place.

Hi David,

I was able to trigger a similar issue to yours by stressing Slurm, but I still haven't found the cause. I will keep you posted in the coming weeks. In any case, the error is informative only and should not cause any harm.

Sorry for the delay.
(In reply to David Chin from comment #19)
> [...]

David, first of all sorry for the late reply; I've had other priorities these past weeks. I also looked into this again, but I don't have enough information to confirm my initial suspicions.

Is it possible for you to reproduce the issue again with the slurmd debug level increased to debug3 and the following debug flags added?

SlurmdDebug=debug3
DebugFlags=Cgroup,JobAccountGather,Profile,Protocol,Steps

Also, when you run srun, can you run it with "-vvv"?

srun -vvv /bin/true

I will need the full slurmd log, the slurmctld log, and the srun output.

Thanks!
Hi David,

Do you have any input for me regarding my last comment 24?

Thanks
(In reply to Felip Moll from comment #25)
> Hi David,
>
> Do you have any input for me regarding to my last comment 24?
>
> Thanks

Hi Felip:

I'm aiming to get to this by the end of this week.

Regards,
Dave
(In reply to Felip Moll from comment #25) > Hi David, > > Do you have any input for me regarding to my last comment 24? > > Thanks Hi Felip: Here's the output: $ srun -vvv /bin/true srun: defined options srun: -------------------- -------------------- srun: verbose : 3 srun: -------------------- -------------------- srun: end of defined options srun: debug: propagating RLIMIT_CPU=18446744073709551615 srun: debug: propagating RLIMIT_FSIZE=18446744073709551615 srun: debug: propagating RLIMIT_DATA=18446744073709551615 srun: debug: propagating RLIMIT_STACK=18446744073709551615 srun: debug: propagating RLIMIT_CORE=0 srun: debug: propagating RLIMIT_RSS=18446744073709551615 srun: debug: propagating RLIMIT_NPROC=1541534 srun: debug: propagating RLIMIT_NOFILE=131072 srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615 srun: debug: propagating RLIMIT_AS=18446744073709551615 srun: debug: propagating SLURM_PRIO_PROCESS=0 srun: debug: propagating UMASK=0002 srun: debug2: srun PMI messages to port=45223 srun: debug: Entering slurm_allocation_msg_thr_create() srun: debug: port from net_stream_listen is 35795 srun: debug: Entering _msg_thr_internal srun: debug: auth/munge: init: Munge authentication plugin loaded srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index) srun: debug: Waited 0.100000 sec and still waiting: next sleep for 0.200000 sec srun: Nodes node001 are ready for job srun: jobid 3754978: nodes(1):`node001', cpu counts: 1(x1) srun: debug2: creating job with 1 tasks srun: debug: requesting job 3754978, user 1002, nodes 1 including ((null)) srun: debug: cpus 1, tasks 1, name true, relative 65534 srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type) srun: debug: Entering slurm_step_launch srun: debug: mpi type = (null) srun: debug: mpi/none: p_mpi_hook_client_prelaunch: Using mpi/none srun: debug: Entering _msg_thr_create() srun: debug: initialized stdio listening socket, port 42079 srun: debug: Started IO server thread 
(23456195376896) srun: debug: Entering _launch_tasks srun: launching StepId=3754978.0 on host node001, 1 tasks: 0 srun: route/default: init: route default plugin loaded srun: debug2: Tree head got back 0 looking for 1 srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: Tree head got back 1 srun: debug: launch returned msg_rc=0 err=0 type=8001 srun: debug2: Activity on IO listening socket 21 srun: debug2: Entering io_init_msg_read_from_fd srun: debug2: Leaving io_init_msg_read_from_fd srun: debug2: Entering io_init_msg_validate srun: debug2: Leaving io_init_msg_validate srun: debug2: Validated IO connection from 172.25.128.101:38610, node rank 0, sd=22 srun: debug2: eio_message_socket_accept: got message connection from 172.25.128.101:42670 23 srun: debug2: received task launch srun: launch/slurm: _task_start: Node node001, 1 tasks started srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: Entering _file_write slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3754978/step_0/cgroup.procs: No such file or directory slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3754978/step_0/cgroup.procs' slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3754978/step_0/cgroup.procs: No such file or directory slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3754978/step_0/cgroup.procs' slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3754978/step_0/cgroup.procs: No such file or directory slurmstepd: error: unable to read 
'/sys/fs/cgroup/freezer/slurm/uid_1002/job_3754978/step_0/cgroup.procs' srun: debug2: Leaving _file_write srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: eio_message_socket_accept: got message connection from 172.25.128.101:42672 22 srun: debug2: received task exit srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=3754978.0 (status=0x0000). srun: launch/slurm: _task_finish: node001: task 0: Completed srun: debug: task 0 done srun: debug2: false, shutdown srun: debug2: false, shutdown srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: false, shutdown srun: debug: IO thread exiting srun: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread srun: debug2: false, shutdown $ echo $? 0 Sleeping first: $ srun -vvv sleep 10 && /bin/true srun: defined options srun: -------------------- -------------------- srun: verbose : 3 srun: -------------------- -------------------- srun: end of defined options srun: debug: propagating RLIMIT_CPU=18446744073709551615 srun: debug: propagating RLIMIT_FSIZE=18446744073709551615 srun: debug: propagating RLIMIT_DATA=18446744073709551615 srun: debug: propagating RLIMIT_STACK=18446744073709551615 srun: debug: propagating RLIMIT_CORE=0 srun: debug: propagating RLIMIT_RSS=18446744073709551615 srun: debug: propagating RLIMIT_NPROC=1541534 srun: debug: propagating RLIMIT_NOFILE=131072 srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615 srun: debug: propagating RLIMIT_AS=18446744073709551615 srun: debug: propagating SLURM_PRIO_PROCESS=0 srun: debug: propagating UMASK=0002 srun: debug2: srun PMI messages to port=38777 srun: debug: Entering slurm_allocation_msg_thr_create() srun: debug: port from net_stream_listen is 35681 srun: debug: 
Entering _msg_thr_internal srun: debug: auth/munge: init: Munge authentication plugin loaded srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index) srun: debug: Waited 0.100000 sec and still waiting: next sleep for 0.200000 sec srun: Nodes node001 are ready for job srun: jobid 3754982: nodes(1):`node001', cpu counts: 1(x1) srun: debug2: creating job with 1 tasks srun: debug: requesting job 3754982, user 1002, nodes 1 including ((null)) srun: debug: cpus 1, tasks 1, name sleep, relative 65534 srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type) srun: debug: Entering slurm_step_launch srun: debug: mpi type = (null) srun: debug: mpi/none: p_mpi_hook_client_prelaunch: Using mpi/none srun: debug: Entering _msg_thr_create() srun: debug: initialized stdio listening socket, port 35551 srun: debug: Started IO server thread (23456195376896) srun: debug: Entering _launch_tasks srun: launching StepId=3754982.0 on host node001, 1 tasks: 0 srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: route/default: init: route default plugin loaded srun: debug2: Tree head got back 0 looking for 1 srun: debug2: Tree head got back 1 srun: debug: launch returned msg_rc=0 err=0 type=8001 srun: debug2: Activity on IO listening socket 21 srun: debug2: Entering io_init_msg_read_from_fd srun: debug2: Leaving io_init_msg_read_from_fd srun: debug2: Entering io_init_msg_validate srun: debug2: Leaving io_init_msg_validate srun: debug2: Validated IO connection from 172.25.128.101:39766, node rank 0, sd=22 srun: debug2: eio_message_socket_accept: got message connection from 172.25.128.101:37374 23 srun: debug2: received task launch srun: launch/slurm: _task_start: Node node001, 1 tasks started srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called 
_file_writable srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: eio_message_socket_accept: got message connection from 172.25.128.101:37380 22 srun: debug2: received task exit srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=3754982.0 (status=0x0000). srun: launch/slurm: _task_finish: node001: task 0: Completed srun: debug: task 0 done srun: debug2: false, shutdown srun: debug2: false, shutdown srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: false, shutdown srun: debug: IO thread exiting srun: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread srun: debug2: false, shutdown srun: debug: Leaving _msg_thr_internal $ echo $? 0
> Is it possible for you to reproduce again the issue but increasing the
> slurmd debug level to debug3, and add the debug flags:
> CGROUP,JobAccountGather,Profile,Protocol,Steps
>
> SlurmdDebug=debug3
> Debugflags=Cgroup,JobAccountGather,Profile,Protocol,Steps
>
> Also when you run srun, can you run with "-vvv" ?
>
> srun -vvv /bin/true
>
> I will need the full slurmd log, slurmctld and srun output
>
> Thanks!

Thanks David, but I still need the slurmd and slurmctld logs from a run made after changing the debug settings in slurm.conf and restarting slurmd.
Created attachment 27133 [details] Debugging output 2022-10-05 Here's the output from "srun", the appropriate section of /var/log/slurmd on the compute nodes that ran the 2 trial jobs, and the appropriate sections of /var/log/slurmctld on the head node. grep for "^localtime" to see the 2 separate tests: 1) srun -vvv /bin/true 2) srun -vvv sleep 10 && /bin/true
> grep for "^localtime" to see the 2 separate tests:
>
> 1) srun -vvv /bin/true
>
> 2) srun -vvv sleep 10 && /bin/true

David,

I've found this issue in other bugs, but it is very difficult to reproduce, and we haven't found the race yet. I just wanted to give you an update.

(Btw, test #2 actually runs two separate commands: first "srun -vvv sleep 10", and then "/bin/true". It does not run "sleep 10 && /bin/true" inside srun, because && is interpreted by bash first and splits the command line in two.)
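The parenthetical above can be demonstrated with a stand-in for srun (the `fake_srun` function below is invented purely for illustration; it just echoes whatever arguments bash hands it):

```shell
# fake_srun: a hypothetical stand-in for srun that prints the arguments
# it actually receives after bash has parsed the command line.
fake_srun() { printf 'fake_srun got: %s\n' "$*"; }

# bash splits on && before srun ever runs: srun only sees "sleep 10",
# and /bin/true runs locally as a second command afterwards.
fake_srun sleep 10 && /bin/true
# → fake_srun got: sleep 10

# To run the whole compound command inside the job step, quote it through
# a shell so && is passed along as part of a single argument:
fake_srun bash -c 'sleep 10 && /bin/true'
# → fake_srun got: bash -c sleep 10 && /bin/true
```

The same quoting trick applies to the real srun: `srun bash -c 'sleep 10 && uname -a'` runs both commands inside the allocated step.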
David, can you please confirm that *all* of your slurmd daemons are running version 21.08.7 or later?
(In reply to Felip Moll from comment #32) > David, can you please confirm that *all* your Slurmd's are running at > versions superior or equal to 21.08.7 ? Hi Felip: All nodes are running the Bright-provided package slurm21.08-client-21.08.8-100526_cm9.0_58aab6cfd1.x86_64 Doing "strings(1)" on the executable finds: SLURM_VERSION_STRING "21.08.8-2"
David,

First of all, sorry for not having replied in so long. Besides having higher-priority bugs (this one occurred only under a specific condition and was not harmful), I hadn't found the issue until now, and I really had a hard time reproducing it in newer versions.

But today, thanks to another bug, I was investigating a part of the code that gave me some hints about your issue. Although my initial thoughts were wrong, I found another path that led me to see that we have a race condition with the RPC handler thread in a stepd.

For several reasons we assumed that only one thread in the slurmstepd ever accessed the cgroup plugin, but that's false. The main slurmstepd thread manages most of the cgroup work, but in one particular situation, when a job ends very quickly, we may hit for a microsecond a window where we are removing the cgroup but still haven't set the variable which says we've done cleanup. At that moment, if an RPC comes in requesting the list of pids, it will get an error.

I already have a solution for this issue, which I will implement in the following days, and I will let you know.

Thanks for your patience and understanding.
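The race described above can be sketched in miniature. All names below are invented for illustration (this is not Slurm code); the point is the ordering: if cleanup removes the step directory *before* recording that cleanup happened, a concurrently arriving "list pids" request sees "not cleaned up yet" and then fails with ENOENT, which is exactly the harmless-but-noisy error in this ticket:

```shell
# Set up a fake per-step directory standing in for the freezer cgroup.
workdir=$(mktemp -d)
mkdir -p "$workdir/step_0"
echo 12345 > "$workdir/step_0/cgroup.procs"

cleanup_step() {
    # Fixed ordering: mark the step as cleaned up, THEN remove it, so a
    # reader never sees "alive" for an already-removed directory. In the
    # real multithreaded slurmstepd both actions must also sit under one
    # mutex shared with the RPC handler.
    touch "$workdir/cleaned"
    rm -rf "$workdir/step_0"
}

rpc_list_pids() {
    if [ -e "$workdir/cleaned" ]; then
        echo "step already gone"            # clean answer, nothing to log
    else
        cat "$workdir/step_0/cgroup.procs"  # directory guaranteed present
    fi
}

rpc_list_pids    # → 12345
cleanup_step
rpc_list_pids    # → step already gone
```

With the buggy ordering (remove first, mark second), a request landing between the two lines would attempt the `cat` and fail with "No such file or directory", matching the slurmstepd errors above.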
*** Ticket 15370 has been marked as a duplicate of this ticket. ***
I had problems reproducing your issue in 22.05, and after deep investigation I realized that we fixed this in bug 13768.

We were calling proctrack_g_destroy, which destroys the cgroup, too early, so the patch below moved this call inside stepd_cleanup, the final function called when the step is ending. stepd_cleanup first terminates the IO thread, which is responsible for receiving RPCs, and only then destroys the cgroup. Before that patch, the order was incorrect.

In summary, you should upgrade to 22.05 to make this error disappear.

commit 1ddef9a0dd896c6777daa3dbb68a405d076c45c6
Author: Dominik Bartkiewicz <bart@schedmd.com>
Date:   Thu Apr 14 17:22:42 2022 +0200

    Make proctrack_p_wait to really wait on cgroup emptiness

    We were relying on the destruction of a cgroup to check for it being
    empty. This is not ok, because if there are other cgroup plugins loaded
    we do not want to try destroy the cgroup to not race with others.
    Also, multiple signals sent to a step can cause wait to be called
    multiple times, so the destroy must be avoided. Move it outside
    stepd_cleanup. This function is only called once per step at
    termination.

    Bug 13768

    Signed-off-by: Felip Moll <felip.moll@schedmd.com>
    Signed-off-by: Oriol Vilarrubi <jvilarru@schedmd.com>

Thanks for your patience, and sorry again for the delay in understanding why it was happening. At some point I was confused about versions and really did think this was a different issue.

I am closing the bug now.

*** This ticket has been marked as a duplicate of ticket 13768 ***
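The teardown ordering that the patch enforces can be sketched as follows. The function names and bodies are invented stubs, not Slurm's actual symbols; only the call order mirrors the description of stepd_cleanup above:

```shell
# Invented stubs illustrating the teardown order fixed in bug 13768:
# the IO/RPC thread must be stopped BEFORE the cgroup is destroyed, so
# no RPC can race with the removal of the step's cgroup directories.
stop_io_thread()    { echo "io thread stopped (no more RPCs)"; }
proctrack_destroy() { echo "cgroup destroyed"; }

stepd_cleanup() {
    stop_io_thread     # 1. RPC source terminated first
    proctrack_destroy  # 2. only then is removing the cgroup race-free
}

stepd_cleanup
```

Reversing the two calls inside stepd_cleanup recreates the window in which an incoming RPC could observe a half-removed cgroup, which is the race this ticket reported.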