Created attachment 25258 [details]
Current slurm.conf

We just updated from Slurm 20.02.7 to 21.08.8-2 via Bright Cluster Manager.

OS: RHEL 8.1, kernel 4.18.0-147.el8
libcgroup: libcgroup-0.41-19.el8.x86_64, libcgroup-tools-0.41-19.el8.x86_64

Since the upgrade, there have been multiple cgroup-related error messages.

1) This job completed and its results and outputs seemed to be unaffected:

slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/memory/slurm/uid_1562/job_2596447/step_batch/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/memory/slurm/uid_1562/job_2596447/step_batch/cgroup.procs'

2) This is a job array - the first 10 tasks completed and produced outputs, but the next 10 tasks did not start up:

slurmstepd: error: error from open of cgroup '/sys/fs/cgroup/memory/slurm/uid_1447/job_2598870/step_batch' : No such file or directory
slurmstepd: error: xcgroup_lock error: No such file or directory
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs'
slurmstepd: error: problem with oom_pipe[0]
slurmstepd: fatal: cgroup_v1.c:1352 _oom_event_monitor: pthread_mutex_lock(): Invalid argument

Our slurm.conf is attached.

Thanks,
Dave
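For reference, a quick way to see whether the per-step cgroup files slurmstepd is complaining about actually exist on the node. This is only an illustrative sketch; the uid/job numbers are copied from the errors above, and on a healthy node during the job's lifetime both files should be readable:

```shell
# Check the two controller paths named in the errors above.
# "gone" is expected after the step's cgroup has been torn down.
for ctl in memory freezer; do
  p="/sys/fs/cgroup/$ctl/slurm/uid_1562/job_2596447/step_batch/cgroup.procs"
  if [ -r "$p" ]; then
    echo "present: $p"
  else
    echo "gone: $p"
  fi
done
```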
Here is one job which appears as "running":

JobID           JobName     User     Partition  NodeList  Elapsed   State    ExitCode  ReqMem  MaxRSS  MaxVMSize  AllocTRES
--------------- ----------  -------  ---------  --------  --------  -------  --------  ------  ------  ---------  --------------------------------
2601625         rmsduni1.+  abcdefg  def        node006   05:19:15  RUNNING  0:0       160G                       billing=48,cpu=48,node=1
2601625.batch   batch                           node006   05:19:15  RUNNING  0:0                                  cpu=48,mem=0,node=1
2601625.extern  extern                          node006   05:19:15  RUNNING  0:0                                  billing=48,cpu=48,node=1

BUT there are no processes owned by that user running on that node.
(In reply to David Chin from comment #0)
> We just updated from Slurm 20.02.7 to 21.08.8-2 via Bright Cluster Manager.

There were significant improvements to the cgroup code to catch and handle more issues. Looks like they are at least catching issues.

> 1) This job completed and its results and outputs seemed to be unaffected:
>
> slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/memory/slurm/uid_1562/job_2596447/step_batch/cgroup.procs: No such file or directory
> slurmstepd: error: unable to read '/sys/fs/cgroup/memory/slurm/uid_1562/job_2596447/step_batch/cgroup.procs'
>
> 2) This is a job array - first 10 tasks completed and produced outputs, but
> the next 10 tasks did not start up:
>
> slurmstepd: error: error from open of cgroup '/sys/fs/cgroup/memory/slurm/uid_1447/job_2598870/step_batch' : No such file or directory
> slurmstepd: error: xcgroup_lock error: No such file or directory
> slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs: No such file or directory
> slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs'
> (the two errors above repeat twice more)
> slurmstepd: error: problem with oom_pipe[0]
> slurmstepd: fatal: cgroup_v1.c:1352 _oom_event_monitor: pthread_mutex_lock(): Invalid argument

Please attach the slurmd log from one of the nodes with the failed jobs.
Created attachment 25259 [details]
/var/log/slurmd from job 2596447

/var/log/slurmd from the first item mentioned in this issue
Created attachment 25260 [details]
/var/log/slurmd from job 2598870

/var/log/slurmd from case 2, job 2598870 (i.e. 2598869_1)
(In reply to David Chin from comment #2)
> Here is one job which appears as "running":
>
> JobID           JobName     User     Partition  NodeList  Elapsed   State    ExitCode  ReqMem  AllocTRES
> 2601625         rmsduni1.+  abcdefg  def        node006   05:19:15  RUNNING  0:0       160G    billing=48,cpu=48,node=1
> 2601625.batch   batch                           node006   05:19:15  RUNNING  0:0               cpu=48,mem=0,node=1
> 2601625.extern  extern                          node006   05:19:15  RUNNING  0:0               billing=48,cpu=48,node=1
>
> BUT there are no processes owned by that user running on that node.

Please also attach the slurmd log for this node.

If possible, please activate the 'debugflags=cgroup' flag in slurm.conf on the nodes exhibiting this issue. A slurmd restart will be required to activate it.
Created attachment 25261 [details]
/var/log/slurmd from job 2601625

/var/log/slurmd from the job in comment 2
Created attachment 25262 [details]
/var/log/slurmd from job 2402466

/var/log/slurmd from job 2402466 with cgroups debugging turned on
Job 2402466 array job:

#!/bin/bash
#SBATCH -p def
#SBATCH -t 0:15:00
#SBATCH --mem-per-cpu=2G
#SBATCH --nodes=1
#SBATCH --nodelist=node014
#SBATCH --cpus-per-task=1
#SBATCH --array=1-200

module load gcc/9.2.0
module load picotte-openmpi/gcc/4.1.0

sleep 30

env | grep SLURM | sort

OUTDIR=/beegfs/scratch/dwc62/array_test
if [[ ! -d $OUTDIR ]]
then
    mkdir $OUTDIR
fi

echo TESTING 123 $SLURM_JOB_ID $SLURM_ARRAY_JOB_ID $SLURM_ARRAY_TASK_ID > ${OUTDIR}/foobar_${SLURM_JOB_ID}.txt

The first N jobs seemed to complete successfully, i.e. the output files ${OUTDIR}/foobar_${SLURM_JOB_ID}.txt were produced with the correct output. However, job tasks still appeared as "running", and these cgroup errors appeared in slurm-2602466_N.out:

slurmstepd: error: error from open of cgroup '/sys/fs/cgroup/memory/slurm/uid_1002/job_2602467/step_batch' : No such file or directory
slurmstepd: error: xcgroup_lock error: No such file or directory
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2602467/step_batch/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2602467/step_batch/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2602467/step_batch/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2602467/step_batch/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2602467/step_batch/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2602467/step_batch/cgroup.procs'
slurmstepd: error: problem with oom_pipe[0]
slurmstepd: fatal: cgroup_v1.c:1352 _oom_event_monitor: pthread_mutex_lock(): Invalid argument

On that node, no processes owned by the submitter of the job were running.
Waiting for info from user for me to reproduce what happened in job 2601625
Thanks for all this information, I'm looking into these now.

Can you also attach the output of
> cat /proc/mounts
from node014?

Thanks!
--Tim
(In reply to David Chin from comment #11)
> Waiting for info from user for me to reproduce what happened in job 2601625

What happened in job 2601625 is similar to what happened in the test array job 2402466: the single-task job completed successfully but appeared to remain running. In the array job case, that prevented new tasks from starting up. The /var/log/slurmd with debugging turned on for job 2402466 is already attached.
On node014, "cat /proc/mounts":

proc /proc proc rw,nosuid,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=98291980k,nr_inodes=24572995,mode=755 0 0
tmpfs /run tmpfs rw,relatime 0 0
/dev/sda1 / xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
bpf /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0
cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,blkio,memory,freezer 0 0
cgroup /sys/fs/cgroup/net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,nosuid,nodev,noexec,relatime,cpu 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=36,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=31133 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime,pagesize=2M 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/sda2 /var xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
/dev/sda6 /local xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
/dev/sda3 /tmp xfs rw,nosuid,nodev,noatime,nodiratime,attr2,inode64,noquota 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
tracefs /sys/kernel/debug/tracing tracefs rw,relatime 0 0
beegfs_nodev /beegfs beegfs rw,relatime,cfgFile=/etc/beegfs/beegfs-client.conf 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/groups /ifs/groups nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.37,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.37 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/opt /ifs/opt nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.42,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.42 0 0
master:/cm/shared /cm/shared nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.25.128.1,mountvers=3,mountport=4002,mountproto=udp,local_lock=none,addr=172.25.128.1 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/opt_spack /ifs/opt_spack nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.40,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.40 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/home /home nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.39,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.39 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/sysadmin /ifs/sysadmin nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.32,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.32 0 0
tmpfs /run/user/0 tmpfs rw,nosuid,nodev,relatime,size=19665692k,mode=700 0 0
(In reply to David Chin from comment #14)
> On node014 "cat /proc/mounts"

Thank you!

> cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,blkio,memory,freezer 0 0

This line from /proc/mounts suggests that you are using the "JoinControllers" option in systemd, which isn't well supported by systemd and causes problems with Slurm. In 20.02 a lot of the resulting issues were ignored, but in 21.08 they cause errors. I'm a little surprised it's there, since my understanding was that this was deprecated in newer versions of Bright.

You will most likely find a line like:

> JoinControllers = blkio,cpuacct,memory,freezer

in /etc/systemd/system.conf.

Please comment that line out if it is there. If not, please let me know and we can dig into where it might be. The node will have to be rebooted to apply the change. I'd suggest trying it on a couple of nodes first to make sure everything is happy.

Thanks!
--Tim
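A joined v1 mount like this is easy to spot mechanically. The sketch below runs against a sample /proc/mounts line (copied from the output above) rather than the live file; it flags any cgroup mount whose mountpoint names more than one controller:

```shell
# Sample line from the node's /proc/mounts; on a node you would read
# /proc/mounts itself. Note: cpu,cpuacct and net_cls,net_prio are joined
# by default on RHEL 8, so those two pairs would be expected matches.
sample='cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,blkio,memory,freezer 0 0'
echo "$sample" | awk '$1 == "cgroup" && $2 ~ /,/ { print "joined controllers at " $2 }'
```

On an affected node, the same awk filter over `/proc/mounts` would print the unusual four-way joined mountpoint seen here.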
(In reply to Tim McMullan from comment #15)
> Please comment that line out if it is there. If not, please let me know and
> we can dig into where it might be.

Hi, Tim:

We are running Bright 9.0.

Let me double check with Bright that commenting that out is OK, since Bright controls a bunch of cgroup configs, too - and if not, whether there's a workaround or config change to be made.

Thanks,
Dave
(In reply to David Chin from comment #16)
> We are running Bright 9.0.
>
> Let me double check with Bright that commenting that out is OK since Bright
> controls a bunch of cgroup configs, too.

Sounds good! I found the bug where we first started talking about this:
https://bugs.schedmd.com/show_bug.cgi?id=7536#c29

From https://bugs.schedmd.com/show_bug.cgi?id=7536#c25 it looks like Bright changed this in Bright 9.1, so if you are on 9.0 that would explain why it's there.

Thanks!
--Tim
Hi, Tim:

I commented out the JoinControllers line in /etc/systemd/system.conf, but that caused slurmd to fail immediately on startup:

● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─99-cmd.conf
   Active: failed (Result: exit-code) since Fri 2022-05-27 15:57:56 EDT; 21s ago
  Process: 11784 ExecStart=/cm/shared/apps/slurm/21.08.8/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 11784 (code=exited, status=1/FAILURE)

May 27 15:57:56 node074 systemd[1]: Started Slurm node daemon.
May 27 15:57:56 node074 slurmd[11784]: slurmd: error: AccountingStorageTRES 1 specified more than once, latest value used
May 27 15:57:56 node074 slurmd[11784]: error: AccountingStorageTRES 1 specified more than once, latest value used
May 27 15:57:56 node074 slurmd[11784]: slurmd: Considering each NUMA node as a socket
May 27 15:57:56 node074 slurmd[11784]: slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
May 27 15:57:56 node074 slurmd[11784]: slurmd: Considering each NUMA node as a socket
May 27 15:57:56 node074 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
May 27 15:57:56 node074 systemd[1]: slurmd.service: Failed with result 'exit-code'.
And /var/log/slurmd:

[2022-05-27T15:58:26.488] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602633/slurm_script
[2022-05-27T15:58:26.488] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602628/slurm_script
[2022-05-27T15:58:26.488] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602634/slurm_script
[2022-05-27T15:58:26.488] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602629/slurm_script
[2022-05-27T15:58:26.488] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602630/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602650/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602640/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602645/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602647/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602648/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602643/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602644/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602652/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602649/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602642/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602646/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602638/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602651/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602655/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602653/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602654/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602656/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602657/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602658/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602659/slurm_script
[2022-05-27T15:58:26.517] Considering each NUMA node as a socket
[2022-05-27T15:58:26.518] Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
[2022-05-27T15:58:26.518] Considering each NUMA node as a socket
[2022-05-27T15:58:26.521] error: cgroup namespace 'freezer' not mounted. aborting
[2022-05-27T15:58:26.521] error: unable to create freezer cgroup namespace
[2022-05-27T15:58:26.521] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2022-05-27T15:58:26.521] error: cannot create proctrack context for proctrack/cgroup
[2022-05-27T15:58:26.521] error: slurmd initialization failed
[2022-05-27T15:58:56.556] Considering each NUMA node as a socket
[2022-05-27T15:58:56.556] Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
[2022-05-27T15:58:56.557] Considering each NUMA node as a socket
[2022-05-27T15:58:56.559] error: cgroup namespace 'freezer' not mounted. aborting
[2022-05-27T15:58:56.559] error: unable to create freezer cgroup namespace
[2022-05-27T15:58:56.559] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2022-05-27T15:58:56.559] error: cannot create proctrack context for proctrack/cgroup
[2022-05-27T15:58:56.559] error: slurmd initialization failed
[2022-05-27T15:59:26.598] Considering each NUMA node as a socket
[2022-05-27T15:59:26.599] Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
[2022-05-27T15:59:26.599] Considering each NUMA node as a socket
[2022-05-27T15:59:26.602] error: cgroup namespace 'freezer' not mounted. aborting
[2022-05-27T15:59:26.602] error: unable to create freezer cgroup namespace
[2022-05-27T15:59:26.602] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2022-05-27T15:59:26.602] error: cannot create proctrack context for proctrack/cgroup
[2022-05-27T15:59:26.602] error: slurmd initialization failed
Can you attach your cgroup.conf file?
(In reply to Tim McMullan from comment #19)
> Can you attach your cgroup.conf file?

# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=no
TaskAffinity=no
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=yes
ConstrainKmemSpace=yes
AllowedRamSpace=100.00
AllowedSwapSpace=20.00
MinKmemSpace=30
MaxKmemPercent=100.00
MemorySwappiness=100
MaxRAMPercent=100.00
MaxSwapPercent=100.00
MinRAMSpace=30
# END AUTOGENERATED SECTION -- DO NOT REMOVE
I'm not quite sure why the freezer cgroup doesn't exist for you on boot, but changing "CgroupAutomount=no" to "CgroupAutomount=yes" should let slurm mount it at startup. If that is something you can change, can you give it a try?
(In reply to Tim McMullan from comment #21)
> I'm not quite sure why the freezer cgroup doesn't exist for you on boot, but
> changing "CgroupAutomount=no" to "CgroupAutomount=yes" should let slurm
> mount it at startup.
>
> If that is something you can change, can you give it a try?

Changed it, and rebooted a few nodes:

# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
TaskAffinity=no
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=yes
ConstrainKmemSpace=yes
AllowedRamSpace=100.00
AllowedSwapSpace=20.00
MinKmemSpace=30
MaxKmemPercent=100.00
MemorySwappiness=100
MaxRAMPercent=100.00
MaxSwapPercent=100.00
MinRAMSpace=30
# END AUTOGENERATED SECTION -- DO NOT REMOVE

but still no go:

[2022-05-27T16:29:37.352] Considering each NUMA node as a socket
[2022-05-27T16:29:37.353] Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
[2022-05-27T16:29:37.353] Considering each NUMA node as a socket
[2022-05-27T16:29:37.356] error: unable to mount freezer cgroup namespace: Device or resource busy
[2022-05-27T16:29:37.356] error: unable to create freezer cgroup namespace
[2022-05-27T16:29:37.356] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2022-05-27T16:29:37.356] error: cannot create proctrack context for proctrack/cgroup
[2022-05-27T16:29:37.356] error: slurmd initialization failed
That's very interesting.

Can you attach cat /proc/mounts again, but this time from a system that has JoinControllers commented out and hasn't had slurmd try to start?

Thanks!
--Tim
Also, since you have libcgroup-tools installed, can you check "systemctl status cgconfig.service" and attach your /etc/cgconfig.conf file?
(In reply to Tim McMullan from comment #23)
> Can you attach cat /proc/mounts again, but this time from a system that has
> JoinControllers commented out and hasn't had the slurmd try to start?

I can't easily disable slurmd on a single node. Configs are done by categories of nodes, and assigning the "slurmclient" role to a category of nodes adds the slurmd service to those nodes.

On a system with JoinControllers commented out, /proc/mounts is:

proc /proc proc rw,nosuid,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=98291980k,nr_inodes=24572995,mode=755 0 0
tmpfs /run tmpfs rw,relatime 0 0
/dev/sda1 / xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
bpf /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=33,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=39192 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime,pagesize=2M 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/sda3 /tmp xfs rw,nosuid,nodev,noatime,nodiratime,attr2,inode64,noquota 0 0
/dev/sda2 /var xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
/dev/sda6 /local xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
tracefs /sys/kernel/debug/tracing tracefs rw,relatime 0 0
beegfs_nodev /beegfs beegfs rw,relatime,cfgFile=/etc/beegfs/beegfs-client.conf 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/opt_spack /ifs/opt_spack nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.39,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.39 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/home /home nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.38,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.38 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/sysadmin /ifs/sysadmin nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.32,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.32 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/groups /ifs/groups nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.34,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.34 0 0
master:/cm/shared /cm/shared nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.25.128.1,mountvers=3,mountport=4002,mountproto=udp,local_lock=none,addr=172.25.128.1 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/opt /ifs/opt nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.33,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.33 0 0
tmpfs /run/user/0 tmpfs rw,nosuid,nodev,relatime,size=19665692k,mode=700 0 0

- systemctl status cgconfig.service

● cgconfig.service - Control Group configuration service
   Loaded: loaded (/usr/lib/systemd/system/cgconfig.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

- /etc/cgconfig.conf

[ ... all lines are commented out ... ]
(In reply to David Chin from comment #25)
> I can't easily disable slurmd on a single node. Configs are done by
> categories of nodes, the assigning the "slurmclient" role to a category of
> nodes adds the slurmd service to those nodes.

Ok, noted!

The mount output is missing a handful of mount points I'd expect to be there; something must be disabling them or preventing them from mounting. I'm not sure if Bright has anything else going on here that might be tweaking the behavior.

Are there entries in /etc/fstab for the cgroup controllers? It looks like the libcgroup-tools related service isn't a factor in this instance.

Can you also provide the output of

> grep CGROUP /boot/config-$(uname -r)
(In reply to Tim McMullan from comment #26)
> The mount output is missing a handful of mount points I'd expect to be
> there, something must be disabling those or preventing them from mounting.
>
> Can you also provide the output of
> > grep CGROUP /boot/config-$(uname -r)

Can you provide a list of the mountpoints that are expected but not there in /proc/mounts? I'll contact Bright about them.

As for the "grep CGROUP ...":

CONFIG_CGROUPS=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
# CONFIG_BLK_CGROUP_IOLATENCY is not set
CONFIG_NETFILTER_XT_MATCH_CGROUP=m
CONFIG_NET_CLS_CGROUP=y
CONFIG_CGROUP_NET_PRIO=y
CONFIG_CGROUP_NET_CLASSID=y
These are the missing mounts:

cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0

Note that they are exactly the mounts that were listed in JoinControllers. On my fairly vanilla 8.6 install, those plus the cgroup mounts in your output exist on boot without further configuration.

Based on the grep, they are all enabled in the kernel (which makes sense), so I'm thinking something must be preventing them from being mounted at boot time.
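A check like the one below can enumerate which of the expected v1 controllers are absent. This is a sketch run against a small sample snapshot (only devices and cpuset mounted); on a node you would substitute the contents of /proc/mounts:

```shell
# Sample /proc/mounts snapshot for illustration; on a real node use
# mounts="$(cat /proc/mounts)" instead.
mounts='cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0'

# Report each controller that has no cgroup mount under /sys/fs/cgroup.
for c in freezer memory blkio devices cpuset; do
  echo "$mounts" | grep -q "^cgroup /sys/fs/cgroup/[^ ]*$c" || echo "missing: $c"
done
```

Against this sample it reports freezer, memory, and blkio as missing, mirroring the situation described above (modulo cpu,cpuacct, which the pattern also handles since the controller name may appear inside a joined mountpoint).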
Thanks, Tim. I've forwarded that info to Bright.

I'm off for the long weekend, and we'll pick up next week.

Have a good holiday weekend.

Dave
(In reply to David Chin from comment #29) > Thanks, Tim. I've forwarded that info to Bright. > > I'm off for the long weekend, and we'll pick up next week. > > Have a good holiday weekend. > > Dave Sounds good! Thank you, have a great weekend! --Tim
Couldn't stay away. I installed a VirtualBox VM with a fresh RHEL 8.1, and it has the mounts:

tmpfs /sys/fs/cgroup tmpfs ro,seclabel,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,seclabel,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,seclabel,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,seclabel,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,seclabel,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,seclabel,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,seclabel,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,seclabel,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,seclabel,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,seclabel,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,seclabel,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,seclabel,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,seclabel,nosuid,nodev,noexec,relatime,devices 0 0

I also found a similar issue elsewhere: https://github.com/lxc/lxc/issues/4072

Before trying to build a new OS image, I tried adding the kernel boot parameter "systemd.unified_cgroup_hierarchy=0" to the current one, but that did not change anything, which led me to think that was the default. So I then changed it to "systemd.unified_cgroup_hierarchy=1".
That allowed slurmd to start and stay up:

● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─99-cmd.conf
   Active: active (running) since Sat 2022-05-28 00:35:25 EDT; 12s ago
 Main PID: 8331 (slurmd)
    Tasks: 1
   Memory: 12.9M
   CGroup: /system.slice/slurmd.service
           └─8331 /cm/shared/apps/slurm/21.08.8/sbin/slurmd -D -s

May 28 00:35:25 node074 systemd[1]: Started Slurm node daemon.
May 28 00:35:25 node074 slurmd[8331]: slurmd: error: AccountingStorageTRES 1 specified more than once, latest value used
May 28 00:35:25 node074 slurmd[8331]: error: AccountingStorageTRES 1 specified more than once, latest value used
May 28 00:35:26 node074 slurmd[8331]: slurmd: Considering each NUMA node as a socket
May 28 00:35:26 node074 slurmd[8331]: slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
May 28 00:35:26 node074 slurmd[8331]: slurmd: Considering each NUMA node as a socket
May 28 00:35:26 node074 slurmd[8331]: slurmd: slurmd version 21.08.8-2 started
May 28 00:35:26 node074 slurmd[8331]: slurmd: slurmd started on Sat, 28 May 2022 00:35:26 -0400

$ cat /proc/mounts | grep cgroup
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0

And the freezer mount has appeared.
Submitting the trivial job array (echo some env vars and write them to a file) -- the jobs don't even start running:

$ squeue --me
2602719_1 def tstarr_1node dwc62 urcfadmprj PD 0:00 15:00 1 1 2G (launch failed requeued held)
2602719_2 def tstarr_1node dwc62 urcfadmprj PD 0:00 15:00 1 1 2G (launch failed requeued held)
2602719_3 def tstarr_1node dwc62 urcfadmprj PD 0:00 15:00 1 1 2G (launch failed requeued held)
2602719_29 def tstarr_1node dwc62 urcfadmprj PD 0:00 15:00 1 1 2G (launch failed requeued held)

Attaching /var/log/slurmd from this node (slurmd_node074.txt).
Created attachment 25267 [details] /var/log/slurmd from node where kernel param "systemd.unified_cgroup_hierarchy=1" was added /var/log/slurmd from node where kernel param "systemd.unified_cgroup_hierarchy=1" was added. slurmd started and remained running, but jobs failed to start running.
I reverted one node to the OS image from before the Slurm upgrade, and these are the mounts:

tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,blkio,memory,freezer 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,nosuid,nodev,noexec,relatime,cpu 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0

Doing "ls -l /sys/fs/cgroup" shows the missing mounts as symlinks to a combined mount "blkio,cpuacct,memory,freezer":

total 0
lrwxrwxrwx 1 root root 28 May 28 09:09 blkio -> blkio,cpuacct,memory,freezer/
dr-xr-xr-x 4 root root  0 May 28 09:09 blkio,cpuacct,memory,freezer/
dr-xr-xr-x 4 root root  0 May 28 09:09 cpu/
lrwxrwxrwx 1 root root 28 May 28 09:09 cpuacct -> blkio,cpuacct,memory,freezer/
dr-xr-xr-x 2 root root  0 May 28 09:09 cpuset/
dr-xr-xr-x 4 root root  0 May 28 09:09 devices/
lrwxrwxrwx 1 root root 28 May 28 09:09 freezer -> blkio,cpuacct,memory,freezer/
dr-xr-xr-x 2 root root  0 May 28 09:09 hugetlb/
lrwxrwxrwx 1 root root 28 May 28 09:09 memory -> blkio,cpuacct,memory,freezer/
dr-xr-xr-x 2 root root  0 May 28 09:09 net_cls/
dr-xr-xr-x 2 root root  0 May 28 09:09 net_prio/
dr-xr-xr-x 2 root root  0 May 28 09:09 perf_event/
dr-xr-xr-x 4 root root  0 May 28 09:09 pids/
dr-xr-xr-x 2 root root  0 May 28 09:09 rdma/
dr-xr-xr-x 5 root root  0 May 28 09:06 systemd/

And reverting to the OS image from when our cluster was first installed gives the same (i.e., missing freezer, memory, blkio, and "cpu,cpuacct"):

tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,nosuid,nodev,noexec,relatime,cpu 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,blkio,memory,freezer 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0

And "ls -l /sys/fs/cgroup" gives:

lrwxrwxrwx 1 root root 28 May 28 08:54 blkio -> blkio,cpuacct,memory,freezer
dr-xr-xr-x 4 root root  0 May 28 08:54 blkio,cpuacct,memory,freezer
dr-xr-xr-x 2 root root  0 May 28 08:54 cpu
lrwxrwxrwx 1 root root 28 May 28 08:54 cpuacct -> blkio,cpuacct,memory,freezer
dr-xr-xr-x 2 root root  0 May 28 08:54 cpuset
dr-xr-xr-x 4 root root  0 May 28 08:54 devices
lrwxrwxrwx 1 root root 28 May 28 08:54 freezer -> blkio,cpuacct,memory,freezer
dr-xr-xr-x 2 root root  0 May 28 08:54 hugetlb
lrwxrwxrwx 1 root root 28 May 28 08:54 memory -> blkio,cpuacct,memory,freezer
dr-xr-xr-x 2 root root  0 May 28 08:54 net_cls
dr-xr-xr-x 2 root root  0 May 28 08:54 net_prio
dr-xr-xr-x 2 root root  0 May 28 08:54 perf_event
dr-xr-xr-x 4 root root  0 May 28 08:54 pids
dr-xr-xr-x 2 root root  0 May 28 08:54 rdma
dr-xr-xr-x 5 root root  0 May 28 08:52 systemd

These differ from a fresh RHEL 8.1 install, which has individual mounts:

total 0
dr-xr-xr-x. 2 root root  0 May 28 00:07 blkio
lrwxrwxrwx. 1 root root 11 May 28 00:07 cpu -> cpu,cpuacct
lrwxrwxrwx. 1 root root 11 May 28 00:07 cpuacct -> cpu,cpuacct
dr-xr-xr-x. 2 root root  0 May 28 00:07 cpu,cpuacct
dr-xr-xr-x. 2 root root  0 May 28 00:07 cpuset
dr-xr-xr-x. 4 root root  0 May 28 00:07 devices
dr-xr-xr-x. 2 root root  0 May 28 00:07 freezer
dr-xr-xr-x. 2 root root  0 May 28 00:07 hugetlb
dr-xr-xr-x. 5 root root  0 May 28 00:07 memory
lrwxrwxrwx. 1 root root 16 May 28 00:07 net_cls -> net_cls,net_prio
dr-xr-xr-x. 2 root root  0 May 28 00:07 net_cls,net_prio
lrwxrwxrwx. 1 root root 16 May 28 00:07 net_prio -> net_cls,net_prio
dr-xr-xr-x. 2 root root  0 May 28 00:07 perf_event
dr-xr-xr-x. 5 root root  0 May 28 00:07 pids
dr-xr-xr-x. 2 root root  0 May 28 00:07 rdma
dr-xr-xr-x. 6 root root  0 May 28 00:07 systemd
Oh, I get it. The combined mount is probably due to the "JoinControllers" line.
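For reference, the combined "blkio,cpuacct,memory,freezer" hierarchy is the signature of systemd's JoinControllers= setting in /etc/systemd/system.conf, which mounts the listed cgroup v1 controllers into one joint hierarchy (the option exists in the systemd shipped with RHEL 8; later systemd releases dropped it). The exact line on these nodes would come from the Bright-managed image; a hypothetical line producing that mount would look like:

```
# /etc/systemd/system.conf -- illustrative only; the actual setting on
# these nodes would come from the Bright-managed image
JoinControllers=cpuacct,blkio,memory,freezer
```

Removing or commenting out such a line (and rebooting) should restore the per-controller mounts that Slurm 21.08's cgroup/v1 code expects.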
(In reply to David Chin from comment #32)
> Created attachment 25267 [details]
> /var/log/slurmd from node where kernel param
> "systemd.unified_cgroup_hierarchy=1" was added
>
> /var/log/slurmd from node where kernel param
> "systemd.unified_cgroup_hierarchy=1" was added. slurmd started and remained
> running, but jobs failed to start running.

This option turns on cgroup/v2, which isn't supported by 21.08 (but is with 22.05!), so I would expect it to fail in this case. It's interesting that the fresh install on your cluster seems to differ from a new fresh install.

Something we could try is adding "cgroup_enable=memory swapaccount=1" to the kernel command line. It's required on older Debian/Ubuntu installs, but might enable the memory cgroup in your environment. If it does, we could add additional parameters for the other missing controllers.

You would need to add them to the end of GRUB_CMDLINE_LINUX in /etc/sysconfig/grub, then update the grub config and reboot to see if it makes a difference.
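The grub change described above can be sketched as a script. The file path and sed pattern are RHEL 8 assumptions (and a Bright-managed image may regenerate the grub defaults file), so verify against an actual node before applying; the sketch below operates on a scratch copy with a made-up existing command line:

```shell
# Work on a scratch copy for illustration; on a real RHEL 8 node the file
# is /etc/sysconfig/grub (a symlink to /etc/default/grub).  The existing
# GRUB_CMDLINE_LINUX value here is invented for the example.
cat > /tmp/grub.sample <<'EOF'
GRUB_TIMEOUT=5
GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet"
EOF

# Append the extra parameters inside the quoted GRUB_CMDLINE_LINUX value.
sed -i 's/^\(GRUB_CMDLINE_LINUX=".*\)"/\1 cgroup_enable=memory swapaccount=1"/' /tmp/grub.sample
grep GRUB_CMDLINE_LINUX /tmp/grub.sample

# After editing the real file, regenerate the config and reboot:
#   grub2-mkconfig -o /boot/grub2/grub.cfg   # BIOS boot; the EFI path differs
#   reboot
```

On a Bright cluster the equivalent change may also be settable through the node category's kernel parameters rather than by editing grub directly.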
Bright support is offline till Monday. My next move is probably creating a new OS image from a fresh RHEL 8.1, but that requires me to be onsite. So I'm really done for the weekend now.

Thanks for the help.

Cheers,
Dave
Created attachment 25268 [details] /var/log/slurmd after adding "cgroup_enable=memory swapaccount=1" to kernel options /var/log/slurmd after adding "cgroup_enable=memory swapaccount=1" to kernel options. Shows a user job 2602819 that was cancelled so that I could run a test job array 2602820 (96 tasks).
It seems that things got less bad after the restart around here: > [2022-05-28T11:37:05.176] slurmd version 21.08.8-2 started > [2022-05-28T11:37:05.179] slurmd started on Sat, 28 May 2022 11:37:05 -0400 It would be good to see what "cat /proc/mounts | grep cgroup" looks like now.
(In reply to Tim McMullan from comment #38)
> It seems that things got less bad after the restart around here:
>
> > [2022-05-28T11:37:05.176] slurmd version 21.08.8-2 started
> > [2022-05-28T11:37:05.179] slurmd started on Sat, 28 May 2022 11:37:05 -0400
>
> It would be good to see what "cat /proc/mounts | grep cgroup" looks like now.

On a non-GPU node:

cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0

and on a GPU node:

cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
My test job array seemed to complete OK, without producing any of the cgroup messages we saw before. I've contacted several users who had issues with their jobs, and I'm waiting to hear back. I'm optimistic.
(In reply to David Chin from comment #40) > My test job array seemed to complete OK, without producing any of the cgroup > messages we saw before. > > I've contacted several users who had issues with their jobs, and I'm waiting > to hear back. I'm optimistic. Thanks for the update on this David! Those mounts are matching what I would expect them to look like, so hopefully we've got it fixed! Let me know what you hear from the other users! Thanks, --Tim
5 out of the 6 users who reported the cgroup-related errors/warnings have said they are no longer seeing the issues. I'm fairly optimistic the remaining one will report success, but it may be a few days before they can try.
(In reply to David Chin from comment #43) > 5 out of the 6 users who reported the cgroups-related errors/warnings have > said they are no longer seeing the issues they saw. I'm fairly optimistic > the remaining one will report succss, but it may be a few days before they > can try. OK that's great! What of the things we chatted about is currently being used? Is this just the kernel command line tweak, new image, etc? Thanks! --Tim
(In reply to Tim McMullan from comment #44) > ... > OK that's great! What of the things we chatted about is currently being > used? Is this just the kernel command line tweak, new image, etc? > > Thanks! > --Tim Only the kernel command line tweak. That was all it took. --Dave
(In reply to David Chin from comment #45)
>
> Only the kernel command line tweak. That was all it took.
>
> --Dave

That's very interesting. I was mostly expecting the memory cgroup to reappear but the rest to still be absent. I'm glad this seems to have resolved it for you, but I'm not totally certain why it fixed it. If you can, I would still consider rolling a new image to see if you can create one that works without the kernel command line tweak, since you saw it's not necessary on a fresh build.

Since it appears to be fixed, and the last tester is a few days out from being able to test, are you OK with me resolving this for now? If the issue crops back up, you can re-open this ticket or open a new one. I can also leave this open and wait for the last user if you're more comfortable with that, but in that case I'd like to reduce the severity since it now appears to be working.

Thanks!
--Tim
(In reply to Tim McMullan from comment #46)
> That's very interesting. I was mostly expecting the memory cgroup to
> reappear but the rest to still be absent. I'm glad this seems to have
> resolved it for you, but I'm not totally certain why it fixed it. If you
> can, I would still consider rolling a new image to see if you can create
> one that works without the kernel command line tweak, since you saw it's
> not necessary on a fresh build.
>
> Since it appears to be fixed, and the last tester is a few days out from
> being able to test, are you OK with me resolving this for now? If the
> issue crops back up, you can re-open this ticket or open a new one. I can
> also leave this open and wait for the last user if you're more comfortable
> with that, but in that case I'd like to reduce the severity since it now
> appears to be working.

Yes, please go ahead and close this ticket out. If the last user has issues, we can re-open.

I'll revisit building a new OS image at some future date.

Thanks again for all your help.

--Dave
Sounds good, Thanks Dave!