Ticket 14191 - xcgroup_lock error: No such file or directory
Summary: xcgroup_lock error: No such file or directory
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 21.08.8
Hardware: Linux
Severity: 2 - High Impact
Assignee: Tim McMullan
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-05-27 11:44 MDT by David Chin
Modified: 2023-04-26 04:03 MDT

See Also:
Site: Drexel
Linux Distro: RHEL
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Current slurm.conf (8.00 KB, text/plain)
2022-05-27 11:44 MDT, David Chin
Details
/var/log/slurmd from job 2596447 (84.36 KB, text/plain)
2022-05-27 12:26 MDT, David Chin
Details
/var/log/slurmd from job 2598870 (3.85 MB, text/plain)
2022-05-27 12:27 MDT, David Chin
Details
/var/log/slurmd from job 2601625 (7.10 MB, text/plain)
2022-05-27 12:29 MDT, David Chin
Details
/var/log/slurmd from job 2402466 with cgroups debugging turned on (272.31 KB, text/plain)
2022-05-27 12:42 MDT, David Chin
Details
/var/log/slurmd from node where kernel param "systemd.unified_cgroup_hierarchy=1" was added (2.85 MB, text/plain)
2022-05-27 22:46 MDT, David Chin
Details
/var/log/slurmd after adding "cgroup_enable=memory swapaccount=1" to kernel options (3.58 MB, text/plain)
2022-05-28 09:48 MDT, David Chin
Details

Description David Chin 2022-05-27 11:44:35 MDT
Created attachment 25258 [details]
Current slurm.conf

We just updated from Slurm 20.02.7 to 21.08.8-2 via Bright Cluster Manager.

OS: RHEL 8.1 kernel 4.18.0-147.el8
libcgroup: libcgroup-0.41-19.el8.x86_64 libcgroup-tools-0.41-19.el8.x86_64

Since the upgrade, there have been multiple cgroup-related error messages.

1) This job completed and its results and outputs seemed to be unaffected:

slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/memory/slurm/uid_1562/job_2596447/step_batch/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/memory/slurm/uid_1562/job_2596447/step_batch/cgroup.procs'

2) This is a job array - first 10 tasks completed and produced outputs, but the next 10 tasks did not start up:

slurmstepd: error: error from open of cgroup '/sys/fs/cgroup/memory/slurm/uid_1447/job_2598870/step_batch' : No such file or directory
slurmstepd: error: xcgroup_lock error: No such file or directory
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs'
slurmstepd: error: problem with oom_pipe[0]
slurmstepd: fatal: cgroup_v1.c:1352 _oom_event_monitor: pthread_mutex_lock(): Invalid argument


Our slurm.conf is attached.

Thanks,
    Dave
Comment 2 David Chin 2022-05-27 12:07:19 MDT
Here is one job which appears as "running":

               JobID    JobName      User  Partition        NodeList    Elapsed      State ExitCode     ReqMem     MaxRSS  MaxVMSize                        AllocTRES
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- ---------- --------------------------------
             2601625 rmsduni1.+   abcdefg        def         node006   05:19:15    RUNNING      0:0       160G                               billing=48,cpu=48,node=1
       2601625.batch      batch                              node006   05:19:15    RUNNING      0:0                                               cpu=48,mem=0,node=1
      2601625.extern     extern                              node006   05:19:15    RUNNING      0:0                                          billing=48,cpu=48,node=1

BUT there are no processes owned by that user running on that node.
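(For reference, a couple of checks an admin might run on the node to confirm a step is orphaned — a sketch only; the username and paths below are illustrative, not taken from this job:)

```shell
# List processes on the node owned by a given user (the job's owner);
# an empty result while sacct still reports RUNNING suggests an orphaned step.
user="$(id -un)"   # substitute the job owner's username here
ps -u "$user" -o pid,stat,etime,cmd
# Does the job's freezer cgroup hierarchy still exist on this node?
ls /sys/fs/cgroup/freezer/slurm/ 2>/dev/null || echo "no slurm freezer tree"
```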
Comment 4 Nate Rini 2022-05-27 12:18:19 MDT
(In reply to David Chin from comment #0)
> We just updated from Slurm 20.02.7 to 21.08.8-2 via Bright Cluster Manager.
There were significant improvements to the cgroup code to catch and handle more issues. Looks like they are at least catching issues.
 
> 1) This job completed and its results and outputs seemed to be unaffected:
> 
> slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/memory/slurm/uid_1562/job_2596447/step_batch/cgroup.procs: No such file or directory
> slurmstepd: error: unable to read '/sys/fs/cgroup/memory/slurm/uid_1562/job_2596447/step_batch/cgroup.procs'
> 
> 2) This is a job array - first 10 tasks completed and produced outputs, but
> the next 10 tasks did not start up:
> 
> slurmstepd: error: error from open of cgroup '/sys/fs/cgroup/memory/slurm/uid_1447/job_2598870/step_batch' : No such file or directory
> slurmstepd: error: xcgroup_lock error: No such file or directory
> slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs: No such file or directory
> slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs'
> slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs: No such file or directory
> slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs'
> slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs: No such file or directory
> slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1447/job_2598870/step_batch/cgroup.procs'
> slurmstepd: error: problem with oom_pipe[0]
> slurmstepd: fatal: cgroup_v1.c:1352 _oom_event_monitor: pthread_mutex_lock(): Invalid argument

Please attach the slurmd log from one of the nodes with the failed jobs.
Comment 5 David Chin 2022-05-27 12:26:01 MDT
Created attachment 25259 [details]
/var/log/slurmd from job 2596447

/var/log/slurmd from the first item mentioned in this issue
Comment 6 David Chin 2022-05-27 12:27:43 MDT
Created attachment 25260 [details]
/var/log/slurmd from job 2598870

/var/log/slurmd from case 2 job 2598870 (i.e. 2598869_1)
Comment 7 Nate Rini 2022-05-27 12:28:42 MDT
(In reply to David Chin from comment #2)
> Here is one job which appears as "running":
> 
>                JobID    JobName      User  Partition        NodeList    Elapsed      State ExitCode     ReqMem     MaxRSS  MaxVMSize                        AllocTRES
> -------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- ---------- --------------------------------
>              2601625 rmsduni1.+   abcdefg        def         node006   05:19:15    RUNNING      0:0       160G                               billing=48,cpu=48,node=1
>        2601625.batch      batch                              node006   05:19:15    RUNNING      0:0                                               cpu=48,mem=0,node=1
>       2601625.extern     extern                              node006   05:19:15    RUNNING      0:0                                          billing=48,cpu=48,node=1
> 
> BUT there are no processes owned by that user running on that node.

Please also attach slurmd log for this node. If possible, please activate the 'debugflags=cgroup' flag in slurm.conf on the nodes exhibiting this issue. A slurmd restart will be required to activate it.
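(A sketch of how that flag can be enabled — exact debug levels are a matter of taste, and on a Bright-managed cluster the config file may be regenerated:)

```shell
# In slurm.conf, then restart slurmd on the affected nodes:
#   DebugFlags=Cgroup
# Or toggled at runtime from the controller, without editing slurm.conf:
#   scontrol setdebugflags +cgroup
```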
Comment 8 David Chin 2022-05-27 12:29:40 MDT
Created attachment 25261 [details]
/var/log/slurmd from job 2601625

/var/log/slurmd from job in comment 2
Comment 9 David Chin 2022-05-27 12:42:18 MDT
Created attachment 25262 [details]
/var/log/slurmd from job 2402466 with cgroups debugging turned on
Comment 10 David Chin 2022-05-27 12:45:08 MDT
Job 2402466 array job:

#!/bin/bash
#SBATCH -p def
#SBATCH -t 0:15:00
#SBATCH --mem-per-cpu=2G
#SBATCH --nodes=1
#SBATCH --nodelist=node014
#SBATCH --cpus-per-task=1
#SBATCH --array=1-200

module load gcc/9.2.0
module load picotte-openmpi/gcc/4.1.0

sleep 30

env | grep SLURM | sort

OUTDIR=/beegfs/scratch/dwc62/array_test
if [[ ! -d $OUTDIR ]]
then
    mkdir $OUTDIR
fi

echo TESTING 123 $SLURM_JOB_ID $SLURM_ARRAY_JOB_ID $SLURM_ARRAY_TASK_ID > ${OUTDIR}/foobar_${SLURM_JOB_ID}.txt



The first N array tasks seemed to complete successfully, i.e. the output files ${OUTDIR}/foobar_${SLURM_JOB_ID}.txt were produced with the correct contents. However, the job tasks still appeared as "running", and these cgroup errors appeared in slurm-2602466_N.out:

slurmstepd: error: error from open of cgroup '/sys/fs/cgroup/memory/slurm/uid_1002/job_2602467/step_batch' : No such file or directory
slurmstepd: error: xcgroup_lock error: No such file or directory
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2602467/step_batch/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2602467/step_batch/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2602467/step_batch/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2602467/step_batch/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2602467/step_batch/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2602467/step_batch/cgroup.procs'
slurmstepd: error: problem with oom_pipe[0]
slurmstepd: fatal: cgroup_v1.c:1352 _oom_event_monitor: pthread_mutex_lock(): Invalid argument

On that node, no processes owned by the submitter of the job were running.
Comment 11 David Chin 2022-05-27 12:46:44 MDT
Waiting for info from the user so I can reproduce what happened in job 2601625.
Comment 12 Tim McMullan 2022-05-27 12:53:34 MDT
Thanks for all this information, I'm looking into these now.

Can you also attach the output of

> cat /proc/mounts

from node014?

Thanks!
--Tim
Comment 13 David Chin 2022-05-27 13:03:12 MDT
(In reply to David Chin from comment #11)
> Waiting for info from user for me to reproduce what happened in job 2601625

What happened in job 2601625 is similar to what happened in the test array job 2402466, i.e. the single-task job completed successfully but appeared to remain running. In the array job case, that prevented new tasks from starting up.

The /var/log/slurmd with debugging turned on for job 2402466 is already attached.
Comment 14 David Chin 2022-05-27 13:03:34 MDT
On node014 "cat /proc/mounts"

proc /proc proc rw,nosuid,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=98291980k,nr_inodes=24572995,mode=755 0 0
tmpfs /run tmpfs rw,relatime 0 0
/dev/sda1 / xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
bpf /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0
cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,blkio,memory,freezer 0 0
cgroup /sys/fs/cgroup/net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,nosuid,nodev,noexec,relatime,cpu 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=36,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=31133 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime,pagesize=2M 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/sda2 /var xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
/dev/sda6 /local xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
/dev/sda3 /tmp xfs rw,nosuid,nodev,noatime,nodiratime,attr2,inode64,noquota 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
tracefs /sys/kernel/debug/tracing tracefs rw,relatime 0 0
beegfs_nodev /beegfs beegfs rw,relatime,cfgFile=/etc/beegfs/beegfs-client.conf 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/groups /ifs/groups nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.37,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.37 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/opt /ifs/opt nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.42,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.42 0 0
master:/cm/shared /cm/shared nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.25.128.1,mountvers=3,mountport=4002,mountproto=udp,local_lock=none,addr=172.25.128.1 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/opt_spack /ifs/opt_spack nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.40,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.40 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/home /home nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.39,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.39 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/sysadmin /ifs/sysadmin nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.32,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.32 0 0
tmpfs /run/user/0 tmpfs rw,nosuid,nodev,relatime,size=19665692k,mode=700 0 0
Comment 15 Tim McMullan 2022-05-27 13:16:25 MDT
(In reply to David Chin from comment #14)
> On node014 "cat /proc/mounts"

Thank you!

> cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,blkio,memory,freezer 0 0

This line from /proc/mounts suggests that you are using the "JoinControllers" option, which isn't well supported by systemd and causes problems with Slurm.  In 20.02 a lot of these issues were ignored, but in 21.08 they cause errors.  I'm a little surprised it's there, since my understanding was that this was deprecated in newer versions of Bright.

You will most likely find a line like:
> JoinControllers = blkio,cpuacct,memory,freezer
in /etc/systemd/system.conf

Please comment that line out if it is there.  If not, please let me know and we can dig into where it might be.  The node will have to be rebooted to apply the change.  I'd suggest trying it on a couple of nodes first to make sure everything is happy.

Thanks!
--Tim
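(A minimal sketch of the suggested edit, demonstrated on a scratch file rather than the live /etc/systemd/system.conf — the sed pattern is mine, not from this ticket, and on a Bright-managed node the change may need to go into the software image instead:)

```shell
# Demonstrate the edit on a scratch copy; apply the same sed to the real
# /etc/systemd/system.conf (or the node's software image) once verified.
cat > /tmp/system.conf.demo <<'EOF'
[Manager]
JoinControllers = blkio,cpuacct,memory,freezer
EOF
sed -i 's/^[[:space:]]*JoinControllers/#&/' /tmp/system.conf.demo
grep JoinControllers /tmp/system.conf.demo   # the line is now commented out
```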
Comment 16 David Chin 2022-05-27 13:21:48 MDT
Hi, Tim:

We are running Bright 9.0. 

Let me double check with Bright that commenting that out is OK since Bright controls a bunch of cgroup configs, too. And if not, if there's a workaround or config change to be made.

Thanks,
   Dave

(In reply to Tim McMullan from comment #15)
> (In reply to David Chin from comment #14)
> > On node014 "cat /proc/mounts"
> 
> Thank you!
> 
> > cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,blkio,memory,freezer 0 0
> 
> This line from /proc/mounts suggests that you are using the
> "JoinControllers" option, which isn't well supported by systemd and causes
> problems with Slurm.  In 20.02 a lot of these issues were ignored, but in
> 21.08 they cause errors.  I'm a little surprised it's there, since my
> understanding was that this was deprecated in newer versions of Bright.
> 
> You will most likely find a line like:
> > JoinControllers = blkio,cpuacct,memory,freezer
> in /etc/systemd/system.conf
> 
> Please comment that line out if it is there.  If not, please let me know and
> we can dig into where it might be.  The node will have to be rebooted to
> apply the change.  I'd suggest trying it on a couple of nodes first to make
> sure everything is happy.
> 
> Thanks!
> --Tim
Comment 17 Tim McMullan 2022-05-27 13:29:28 MDT
(In reply to David Chin from comment #16)
> Hi, Tim:
> 
> We are running Bright 9.0. 
> 
> Let me double check with Bright that commenting that out is OK since Bright
> controls a bunch of cgroup configs, too. And if not, if there's a workaround
> or config change to be made.
> 
> Thanks,
>    Dave

Sounds good!  I found the bug where we first started talking about this:
https://bugs.schedmd.com/show_bug.cgi?id=7536#c29

From https://bugs.schedmd.com/show_bug.cgi?id=7536#c25 it looks like Bright changed this in Bright 9.1, so if you are on 9.0 that would explain why it's there.

Thanks!
--Tim
Comment 18 David Chin 2022-05-27 14:00:09 MDT
Hi, Tim:

I commented out the JoinControllers line in /etc/systemd/system.conf, but that caused slurmd to fail immediately on startup:

● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─99-cmd.conf
   Active: failed (Result: exit-code) since Fri 2022-05-27 15:57:56 EDT; 21s ago
  Process: 11784 ExecStart=/cm/shared/apps/slurm/21.08.8/sbin/slurmd -D -s $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 11784 (code=exited, status=1/FAILURE)

May 27 15:57:56 node074 systemd[1]: Started Slurm node daemon.
May 27 15:57:56 node074 slurmd[11784]: slurmd: error: AccountingStorageTRES 1 specified more than once, latest value used
May 27 15:57:56 node074 slurmd[11784]: error: AccountingStorageTRES 1 specified more than once, latest value used
May 27 15:57:56 node074 slurmd[11784]: slurmd: Considering each NUMA node as a socket
May 27 15:57:56 node074 slurmd[11784]: slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
May 27 15:57:56 node074 slurmd[11784]: slurmd: Considering each NUMA node as a socket
May 27 15:57:56 node074 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
May 27 15:57:56 node074 systemd[1]: slurmd.service: Failed with result 'exit-code'.

And /var/log/slurmd:

[2022-05-27T15:58:26.488] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602633/slurm_script
[2022-05-27T15:58:26.488] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602628/slurm_script
[2022-05-27T15:58:26.488] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602634/slurm_script
[2022-05-27T15:58:26.488] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602629/slurm_script
[2022-05-27T15:58:26.488] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602630/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602650/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602640/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602645/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602647/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602648/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602643/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602644/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602652/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602649/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602642/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602646/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602638/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602651/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602655/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602653/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602654/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602656/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602657/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602658/slurm_script
[2022-05-27T15:58:26.489] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job2602659/slurm_script
[2022-05-27T15:58:26.517] Considering each NUMA node as a socket
[2022-05-27T15:58:26.518] Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
[2022-05-27T15:58:26.518] Considering each NUMA node as a socket
[2022-05-27T15:58:26.521] error: cgroup namespace 'freezer' not mounted. aborting
[2022-05-27T15:58:26.521] error: unable to create freezer cgroup namespace
[2022-05-27T15:58:26.521] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2022-05-27T15:58:26.521] error: cannot create proctrack context for proctrack/cgroup
[2022-05-27T15:58:26.521] error: slurmd initialization failed
[2022-05-27T15:58:56.556] Considering each NUMA node as a socket
[2022-05-27T15:58:56.556] Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
[2022-05-27T15:58:56.557] Considering each NUMA node as a socket
[2022-05-27T15:58:56.559] error: cgroup namespace 'freezer' not mounted. aborting
[2022-05-27T15:58:56.559] error: unable to create freezer cgroup namespace
[2022-05-27T15:58:56.559] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2022-05-27T15:58:56.559] error: cannot create proctrack context for proctrack/cgroup
[2022-05-27T15:58:56.559] error: slurmd initialization failed
[2022-05-27T15:59:26.598] Considering each NUMA node as a socket
[2022-05-27T15:59:26.599] Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
[2022-05-27T15:59:26.599] Considering each NUMA node as a socket
[2022-05-27T15:59:26.602] error: cgroup namespace 'freezer' not mounted. aborting
[2022-05-27T15:59:26.602] error: unable to create freezer cgroup namespace
[2022-05-27T15:59:26.602] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2022-05-27T15:59:26.602] error: cannot create proctrack context for proctrack/cgroup
[2022-05-27T15:59:26.602] error: slurmd initialization failed
Comment 19 Tim McMullan 2022-05-27 14:04:29 MDT
Can you attach your cgroup.conf file?
Comment 20 David Chin 2022-05-27 14:06:22 MDT
(In reply to Tim McMullan from comment #19)
> Can you attach your cgroup.conf file?

# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=no
TaskAffinity=no
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=yes
ConstrainKmemSpace=yes
AllowedRamSpace=100.00
AllowedSwapSpace=20.00
MinKmemSpace=30
MaxKmemPercent=100.00
MemorySwappiness=100
MaxRAMPercent=100.00
MaxSwapPercent=100.00
MinRAMSpace=30
# END AUTOGENERATED SECTION   -- DO NOT REMOVE
Comment 21 Tim McMullan 2022-05-27 14:14:40 MDT
I'm not quite sure why the freezer cgroup doesn't exist for you on boot, but changing "CgroupAutomount=no" to "CgroupAutomount=yes" should let Slurm mount it at startup.

If that is something you can change, can you give it a try?
Comment 22 David Chin 2022-05-27 14:33:06 MDT
(In reply to Tim McMullan from comment #21)
> I'm not quite sure why the freezer cgroup doesn't exist for you on boot, but
> changing "CgroupAutomount=no" to "CgroupAutomount=yes" should let slurm
> mount it at startup.
> 
> If that is something you can change, can you give it a try?

Changed it, and rebooted a few nodes:

# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
TaskAffinity=no
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=yes
ConstrainKmemSpace=yes
AllowedRamSpace=100.00
AllowedSwapSpace=20.00
MinKmemSpace=30
MaxKmemPercent=100.00
MemorySwappiness=100
MaxRAMPercent=100.00
MaxSwapPercent=100.00
MinRAMSpace=30
# END AUTOGENERATED SECTION   -- DO NOT REMOVE


but still no go:

[2022-05-27T16:29:37.352] Considering each NUMA node as a socket
[2022-05-27T16:29:37.353] Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
[2022-05-27T16:29:37.353] Considering each NUMA node as a socket
[2022-05-27T16:29:37.356] error: unable to mount freezer cgroup namespace: Device or resource busy
[2022-05-27T16:29:37.356] error: unable to create freezer cgroup namespace
[2022-05-27T16:29:37.356] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2022-05-27T16:29:37.356] error: cannot create proctrack context for proctrack/cgroup
[2022-05-27T16:29:37.356] error: slurmd initialization failed
Comment 23 Tim McMullan 2022-05-27 14:35:16 MDT
That's very interesting.

Can you attach the output of "cat /proc/mounts" again, but this time from a system that has JoinControllers commented out and hasn't had slurmd try to start?

Thanks!
--Tim
Comment 24 Tim McMullan 2022-05-27 14:46:33 MDT
Also, since you have libcgroup-tools installed, can you check "systemctl status cgconfig.service" and attach your /etc/cgconfig.conf file?
Comment 25 David Chin 2022-05-27 14:49:15 MDT
(In reply to Tim McMullan from comment #23)
> Thats very interesting.
> 
> Can you attach cat /proc/mounts again, but this time from a system that has
> JoinControllers commented out and hasn't had the slurmd try to start?
> 
> Thanks!
> --Tim


I can't easily disable slurmd on a single node. Configs are done by categories of nodes, and assigning the "slurmclient" role to a category of nodes adds the slurmd service to those nodes.

On a system with JoinControllers commented out, /proc/mounts is:

proc /proc proc rw,nosuid,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=98291980k,nr_inodes=24572995,mode=755 0 0
tmpfs /run tmpfs rw,relatime 0 0
/dev/sda1 / xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
bpf /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=33,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=39192 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime,pagesize=2M 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/sda3 /tmp xfs rw,nosuid,nodev,noatime,nodiratime,attr2,inode64,noquota 0 0
/dev/sda2 /var xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
/dev/sda6 /local xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
tracefs /sys/kernel/debug/tracing tracefs rw,relatime 0 0
beegfs_nodev /beegfs beegfs rw,relatime,cfgFile=/etc/beegfs/beegfs-client.conf 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/opt_spack /ifs/opt_spack nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.39,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.39 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/home /home nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.38,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.38 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/sysadmin /ifs/sysadmin nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.32,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.32 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/groups /ifs/groups nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.34,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.34 0 0
master:/cm/shared /cm/shared nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.25.128.1,mountvers=3,mountport=4002,mountproto=udp,local_lock=none,addr=172.25.128.1 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/opt /ifs/opt nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.33,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.33 0 0
tmpfs /run/user/0 tmpfs rw,nosuid,nodev,relatime,size=19665692k,mode=700 0 0


- systemctl status cgconfig.service
● cgconfig.service - Control Group configuration service
   Loaded: loaded (/usr/lib/systemd/system/cgconfig.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

- /etc/cgconfig.conf
[ ... all lines are commented out ...]
Comment 26 Tim McMullan 2022-05-27 15:09:56 MDT
(In reply to David Chin from comment #25)
> I can't easily disable slurmd on a single node. Configs are done by
> categories of nodes, the assigning the "slurmclient" role to a category of
> nodes adds the slurmd service to those nodes.
 Ok, noted!

The mount output is missing a handful of mount points I'd expect to be there, something must be disabling those or preventing them from mounting.  I'm not sure if Bright has anything else going on here that might be tweaking the behavior.

Are there entries in /etc/fstab for the cgroup controllers?  It looks like the libcgroup-tools-related service isn't a factor in this instance.

Can you also provide the output of 
> grep CGROUP /boot/config-$(uname -r)
Comment 27 David Chin 2022-05-27 15:48:24 MDT
(In reply to Tim McMullan from comment #26)
> (In reply to David Chin from comment #25)
> > I can't easily disable slurmd on a single node. Configs are done by
> > categories of nodes, the assigning the "slurmclient" role to a category of
> > nodes adds the slurmd service to those nodes.
>  Ok, noted!
> 
> The mount output is missing a handful of mount points I'd expect to be
> there, something must be disabling those or preventing them from mounting. 
> I'm not sure if Bright has anything else going on here that might be
> tweaking the behavior.
> 
> Are there entries in /etc/fstab for the cgroup controllers?  It looks like
> the libcgroup-tools-related service isn't a factor in this instance.
> 
> Can you also provide the output of 
> > grep CGROUP /boot/config-$(uname -r)

Can you provide a list of mountpoints that are expected but not there in /proc/mounts? I'll contact Bright about them. 

As for the "grep CGROUP ...":

CONFIG_CGROUPS=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_BPF=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
# CONFIG_BLK_CGROUP_IOLATENCY is not set
CONFIG_NETFILTER_XT_MATCH_CGROUP=m
CONFIG_NET_CLS_CGROUP=y
CONFIG_CGROUP_NET_PRIO=y
CONFIG_CGROUP_NET_CLASSID=y
Comment 28 Tim McMullan 2022-05-27 16:00:48 MDT
These are the missing mounts:

cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0

Note that these are the mounts that were listed in JoinControllers.

On my fairly vanilla RHEL 8.6 install, those plus the cgroup mounts in your output all exist on boot without further configuration.

Based on the grep, it appears they are all enabled in the kernel (which makes sense), so I'm thinking something must be preventing them from being mounted at boot time.
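To compare a node's mounts against that list, a small helper like this can flag what is absent (hypothetical helper, not from the ticket; the controller names are taken from the missing-mount list above, and the sample input is a trimmed, illustrative snapshot rather than this cluster's full output):

```shell
# Hypothetical helper: read /proc/mounts-style text on stdin and report
# which of the cgroup v1 controllers Slurm 21.08 relies on are not
# mounted under /sys/fs/cgroup.
missing_controllers() {
    mounts=$(cat)
    for ctrl in freezer memory blkio cpu,cpuacct; do
        case "$mounts" in
            *"/sys/fs/cgroup/${ctrl} "*) ;;          # controller is mounted
            *) echo "missing: ${ctrl}" ;;            # controller is absent
        esac
    done
}

# Example with a trimmed snapshot resembling the affected nodes,
# where only some controllers are present:
printf '%s\n' \
  'cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0' \
  'cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0' \
  | missing_controllers
# prints: missing: freezer, missing: blkio, missing: cpu,cpuacct
```

On a live node the same function can be fed the real data with `missing_controllers < /proc/mounts`.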
Comment 29 David Chin 2022-05-27 16:07:45 MDT
Thanks, Tim. I've forwarded that info to Bright.

I'm off for the long weekend, and we'll pick up next week.

Have a good holiday weekend.

Dave
Comment 30 Tim McMullan 2022-05-27 16:13:55 MDT
(In reply to David Chin from comment #29)
> Thanks, Tim. I've forwarded that info to Bright.
> 
> I'm off for the long weekend, and we'll pick up next week.
> 
> Have a good holiday weekend.
> 
> Dave

Sounds good!

Thank you, have a great weekend!
--Tim
Comment 31 David Chin 2022-05-27 22:42:35 MDT
Couldn't stay away.

Installed a VirtualBox VM with a fresh RHEL 8.1, and it has the mounts:

tmpfs /sys/fs/cgroup tmpfs ro,seclabel,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,seclabel,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,seclabel,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,seclabel,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,seclabel,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,seclabel,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,seclabel,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,seclabel,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,seclabel,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,seclabel,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,seclabel,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,seclabel,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,seclabel,nosuid,nodev,noexec,relatime,devices 0 0


I also found a similar issue elsewhere: https://github.com/lxc/lxc/issues/4072 

Before trying to build a new OS image, I tried adding the kernel boot parameter "systemd.unified_cgroup_hierarchy=0" to the current one, but that did not change anything, which led me to think that was already the default.

So, I then changed that to "systemd.unified_cgroup_hierarchy=1". That allowed slurmd to start and stay up:

● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─99-cmd.conf
   Active: active (running) since Sat 2022-05-28 00:35:25 EDT; 12s ago
 Main PID: 8331 (slurmd)
    Tasks: 1
   Memory: 12.9M
   CGroup: /system.slice/slurmd.service
           └─8331 /cm/shared/apps/slurm/21.08.8/sbin/slurmd -D -s

May 28 00:35:25 node074 systemd[1]: Started Slurm node daemon.
May 28 00:35:25 node074 slurmd[8331]: slurmd: error: AccountingStorageTRES 1 specified more than once, latest value used
May 28 00:35:25 node074 slurmd[8331]: error: AccountingStorageTRES 1 specified more than once, latest value used
May 28 00:35:26 node074 slurmd[8331]: slurmd: Considering each NUMA node as a socket
May 28 00:35:26 node074 slurmd[8331]: slurmd: Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
May 28 00:35:26 node074 slurmd[8331]: slurmd: Considering each NUMA node as a socket
May 28 00:35:26 node074 slurmd[8331]: slurmd: slurmd version 21.08.8-2 started
May 28 00:35:26 node074 slurmd[8331]: slurmd: slurmd started on Sat, 28 May 2022 00:35:26 -0400

$ cat /proc/mounts | grep cgroup

cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0

And the freezer mount has appeared.
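A quick way to confirm which cgroup hierarchy ended up active after a kernel-parameter change like this (assuming the standard /sys/fs/cgroup mount point) is to check the filesystem type of the cgroup root:

```shell
# Filesystem type of the cgroup root tells you which hierarchy is in use:
#   cgroup2fs -> unified hierarchy (cgroup v2), not supported until Slurm 22.05
#   tmpfs     -> legacy v1 layout, which is what Slurm 21.08 expects
stat -fc %T /sys/fs/cgroup
```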

Submitting a trivial job array (echoing some env vars to a file) -- the jobs don't even start running:

$ squeue --me
       2602719_1  def tstarr_1node    dwc62   urcfadmprj PD       0:00       15:00   1    1       2G (launch failed requeued held)
       2602719_2  def tstarr_1node    dwc62   urcfadmprj PD       0:00       15:00   1    1       2G (launch failed requeued held)
       2602719_3  def tstarr_1node    dwc62   urcfadmprj PD       0:00       15:00   1    1       2G (launch failed requeued held)
      2602719_29  def tstarr_1node    dwc62   urcfadmprj PD       0:00       15:00   1    1       2G (launch failed requeued held)

Attaching /var/log/slurmd from this node (slurmd_node074.txt)
Comment 32 David Chin 2022-05-27 22:46:09 MDT
Created attachment 25267 [details]
/var/log/slurmd from node where kernel param "systemd.unified_cgroup_hierarchy=1" was added

/var/log/slurmd from node where kernel param "systemd.unified_cgroup_hierarchy=1" was added. slurmd started and remained running, but jobs failed to start running.
Comment 33 David Chin 2022-05-28 07:20:46 MDT
I reverted one node to the OS image from before the Slurm upgrade and these are the mounts. 

tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,blkio,memory,freezer 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,nosuid,nodev,noexec,relatime,cpu 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0


Doing "ls -l /sys/fs/cgroup" shows the missing mounts as soft links to a combined mount "blkio,cpuacct,memory,freezer":

total 0
lrwxrwxrwx 1 root root 28 May 28 09:09 blkio -> blkio,cpuacct,memory,freezer/
dr-xr-xr-x 4 root root  0 May 28 09:09 blkio,cpuacct,memory,freezer/
dr-xr-xr-x 4 root root  0 May 28 09:09 cpu/
lrwxrwxrwx 1 root root 28 May 28 09:09 cpuacct -> blkio,cpuacct,memory,freezer/
dr-xr-xr-x 2 root root  0 May 28 09:09 cpuset/
dr-xr-xr-x 4 root root  0 May 28 09:09 devices/
lrwxrwxrwx 1 root root 28 May 28 09:09 freezer -> blkio,cpuacct,memory,freezer/
dr-xr-xr-x 2 root root  0 May 28 09:09 hugetlb/
lrwxrwxrwx 1 root root 28 May 28 09:09 memory -> blkio,cpuacct,memory,freezer/
dr-xr-xr-x 2 root root  0 May 28 09:09 net_cls/
dr-xr-xr-x 2 root root  0 May 28 09:09 net_prio/
dr-xr-xr-x 2 root root  0 May 28 09:09 perf_event/
dr-xr-xr-x 4 root root  0 May 28 09:09 pids/
dr-xr-xr-x 2 root root  0 May 28 09:09 rdma/
dr-xr-xr-x 5 root root  0 May 28 09:06 systemd/



And reverting to the OS image from when our cluster was first installed gives the same (i.e. missing freezer, memory, blkio, "cpu,cpuacct"). 

tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,nosuid,nodev,noexec,relatime,cpu 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/blkio,cpuacct,memory,freezer cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,blkio,memory,freezer 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,nosuid,nodev,noexec,relatime,net_cls 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0

And "ls -l /sys/fs/cgroup" gives

lrwxrwxrwx 1 root root 28 May 28 08:54 blkio -> blkio,cpuacct,memory,freezer
dr-xr-xr-x 4 root root  0 May 28 08:54 blkio,cpuacct,memory,freezer
dr-xr-xr-x 2 root root  0 May 28 08:54 cpu
lrwxrwxrwx 1 root root 28 May 28 08:54 cpuacct -> blkio,cpuacct,memory,freezer
dr-xr-xr-x 2 root root  0 May 28 08:54 cpuset
dr-xr-xr-x 4 root root  0 May 28 08:54 devices
lrwxrwxrwx 1 root root 28 May 28 08:54 freezer -> blkio,cpuacct,memory,freezer
dr-xr-xr-x 2 root root  0 May 28 08:54 hugetlb
lrwxrwxrwx 1 root root 28 May 28 08:54 memory -> blkio,cpuacct,memory,freezer
dr-xr-xr-x 2 root root  0 May 28 08:54 net_cls
dr-xr-xr-x 2 root root  0 May 28 08:54 net_prio
dr-xr-xr-x 2 root root  0 May 28 08:54 perf_event
dr-xr-xr-x 4 root root  0 May 28 08:54 pids
dr-xr-xr-x 2 root root  0 May 28 08:54 rdma
dr-xr-xr-x 5 root root  0 May 28 08:52 systemd

These differ from a fresh RHEL 8.1 install, which has individual mounts:
total 0
dr-xr-xr-x. 2 root root  0 May 28 00:07 blkio
lrwxrwxrwx. 1 root root 11 May 28 00:07 cpu -> cpu,cpuacct
lrwxrwxrwx. 1 root root 11 May 28 00:07 cpuacct -> cpu,cpuacct
dr-xr-xr-x. 2 root root  0 May 28 00:07 cpu,cpuacct
dr-xr-xr-x. 2 root root  0 May 28 00:07 cpuset
dr-xr-xr-x. 4 root root  0 May 28 00:07 devices
dr-xr-xr-x. 2 root root  0 May 28 00:07 freezer
dr-xr-xr-x. 2 root root  0 May 28 00:07 hugetlb
dr-xr-xr-x. 5 root root  0 May 28 00:07 memory
lrwxrwxrwx. 1 root root 16 May 28 00:07 net_cls -> net_cls,net_prio
dr-xr-xr-x. 2 root root  0 May 28 00:07 net_cls,net_prio
lrwxrwxrwx. 1 root root 16 May 28 00:07 net_prio -> net_cls,net_prio
dr-xr-xr-x. 2 root root  0 May 28 00:07 perf_event
dr-xr-xr-x. 5 root root  0 May 28 00:07 pids
dr-xr-xr-x. 2 root root  0 May 28 00:07 rdma
dr-xr-xr-x. 6 root root  0 May 28 00:07 systemd
Comment 34 David Chin 2022-05-28 07:26:43 MDT
Oh, I get it. The combined mount is probably due to the "JoinControllers" line.
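For reference, JoinControllers lives in /etc/systemd/system.conf (it was removed in systemd 240 and later). The exact line on these nodes isn't shown in the ticket; a hypothetical example that would produce the combined mount seen above is:

```ini
# /etc/systemd/system.conf (systemd < 240)
# Hypothetical example matching the combined "blkio,cpuacct,memory,freezer" mount:
JoinControllers=blkio,cpuacct,memory,freezer
```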
Comment 35 Tim McMullan 2022-05-28 07:39:33 MDT
(In reply to David Chin from comment #32)
> Created attachment 25267 [details]
> /var/log/slurmd from node where kernel param
> "systemd.unified_cgroup_hierarchy=1" was added
> 
> /var/log/slurmd from node where kernel param
> "systemd.unified_cgroup_hierarchy=1" was added. slurmd started and remained
> running, but jobs failed to start running.

This option turns on cgroup/v2, which isn't supported by 21.08 (but is by 22.05!), so I would expect it to fail in this case.

It's interesting that the OS image on your cluster seems to differ from a fresh install.  Something we could try is adding "cgroup_enable=memory swapaccount=1" to the kernel command line.  It's required on older Debian/Ubuntu installs, but might enable the memory cgroup in your environment.  If it does, we could add additional ones for the other missing controllers.

You would need to add them to the end of GRUB_CMDLINE_LINUX in /etc/sysconfig/grub, then update the grub config and reboot to see if it makes a difference.
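That edit can be sketched as below. The append step is demonstrated on a synthetic file so it can be reviewed first; the `add_kernel_opts` helper and the sample GRUB_CMDLINE_LINUX contents are illustrative assumptions, not taken from the ticket. On the real node you would apply it (as root) to /etc/sysconfig/grub, then regenerate the grub config and reboot:

```shell
# Hypothetical helper: append options inside the quotes of GRUB_CMDLINE_LINUX.
add_kernel_opts() {
    # $1 = grub defaults file, $2 = options to append
    sed -i "s/^\\(GRUB_CMDLINE_LINUX=\".*\\)\"\$/\\1 $2\"/" "$1"
}

# Demonstration on a synthetic copy (sample contents are made up):
printf 'GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet"\n' > /tmp/grub.demo
add_kernel_opts /tmp/grub.demo 'cgroup_enable=memory swapaccount=1'
cat /tmp/grub.demo
# prints: GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet cgroup_enable=memory swapaccount=1"

# On the real node (as root, RHEL 8 paths):
#   add_kernel_opts /etc/sysconfig/grub 'cgroup_enable=memory swapaccount=1'
#   grub2-mkconfig -o /boot/grub2/grub.cfg
#   reboot
```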
Comment 36 David Chin 2022-05-28 08:42:48 MDT
Bright support is offline till Monday. 

My next move is probably creating a new OS image from a fresh RHEL 8.1, but that requires me to be onsite. So, I'm really done for the weekend, now.

Thanks for the help.

Cheers,
    Dave
Comment 37 David Chin 2022-05-28 09:48:22 MDT
Created attachment 25268 [details]
/var/log/slurmd after adding "cgroup_enable=memory swapaccount=1" to kernel options

/var/log/slurmd after adding "cgroup_enable=memory swapaccount=1" to kernel options.

Shows a user job 2602819 that was cancelled so that I could run a test job array 2602820 (96 tasks).
Comment 38 Tim McMullan 2022-05-31 05:55:08 MDT
It seems that things got less bad after the restart around here:

> [2022-05-28T11:37:05.176] slurmd version 21.08.8-2 started
> [2022-05-28T11:37:05.179] slurmd started on Sat, 28 May 2022 11:37:05 -0400

It would be good to see what "cat /proc/mounts | grep cgroup" looks like now.
Comment 39 David Chin 2022-05-31 09:18:19 MDT
(In reply to Tim McMullan from comment #38)
> It seems that things got less bad after the restart around here:
> 
> > [2022-05-28T11:37:05.176] slurmd version 21.08.8-2 started
> > [2022-05-28T11:37:05.179] slurmd started on Sat, 28 May 2022 11:37:05 -0400
> 
> It would be good to see what "cat /proc/mounts | grep cgroup" looks like now.

On a non-GPU node:

cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0


and a GPU node:

cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
Comment 40 David Chin 2022-05-31 09:32:54 MDT
My test job array seemed to complete OK, without producing any of the cgroup messages we saw before.

I've contacted several users who had issues with their jobs, and I'm waiting to hear back. I'm optimistic.
Comment 41 Tim McMullan 2022-05-31 09:41:05 MDT
(In reply to David Chin from comment #40)
> My test job array seemed to complete OK, without producing any of the cgroup
> messages we saw before.
> 
> I've contacted several users who had issues with their jobs, and I'm waiting
> to hear back. I'm optimistic.

Thanks for the update on this David!

Those mounts are matching what I would expect them to look like, so hopefully we've got it fixed!

Let me know what you hear from the other users!

Thanks,
--Tim
Comment 43 David Chin 2022-05-31 15:43:00 MDT
5 out of the 6 users who reported the cgroups-related errors/warnings have said they are no longer seeing those issues. I'm fairly optimistic the remaining one will report success, but it may be a few days before they can try.
Comment 44 Tim McMullan 2022-06-01 05:43:30 MDT
(In reply to David Chin from comment #43)
> 5 out of the 6 users who reported the cgroups-related errors/warnings have
> said they are no longer seeing the issues they saw. I'm fairly optimistic
> the remaining one will report succss, but it may be a few days before they
> can try.

OK, that's great!  Which of the things we chatted about is currently being used?  Is this just the kernel command line tweak, a new image, etc.?

Thanks!
--Tim
Comment 45 David Chin 2022-06-01 09:24:56 MDT
(In reply to Tim McMullan from comment #44)
> ... 
> OK, that's great!  Which of the things we chatted about is currently being
> used?  Is this just the kernel command line tweak, a new image, etc.?
> 
> Thanks!
> --Tim

Only the kernel command line tweak. That was all it took.

--Dave
Comment 46 Tim McMullan 2022-06-01 11:34:58 MDT
(In reply to David Chin from comment #45)
> 
> Only the kernel command line tweak. That was all it took.
> 
> --Dave

That's very interesting.  I was mostly expecting the memory cgroup to reappear but the rest to still be absent.  I'm glad this seems to have resolved it for you, but I'm not totally certain why this fixed it.  If you can, I would still consider rolling a new image to see if you can create one that works without the kernel command line tweak, since you were able to see it's not necessary on a fresh build.

Since it appears to be fixed, though, and the last tester is a few days out from being able to test, are you OK with me resolving this for now? If the issue crops back up, you can re-open it or open a new one.  I can also leave this open and wait for the last user if you are more comfortable with that, but in that case I'd like to reduce the severity since it now appears to be working.

Thanks!
--Tim
Comment 47 David Chin 2022-06-01 12:53:07 MDT
(In reply to Tim McMullan from comment #46)
> (In reply to David Chin from comment #45)
> > 
> > Only the kernel command line tweak. That was all it took.
> > 
> > --Dave
> 
> Thats very interesting.  I was mostly expecting the memory cgroup to
> reappear, but the rest still be absent.  I'm glad this seems to have
> resolved it for you, but I'm not totally certain why this fixed it.  If you
> can, I would still consider rolling a new image to see if you can create one
> where this works without the kernel command line tweak since you were able
> to see its not necessary on a fresh build.
> 
> Since it appears to be fixed though, and the last tester is a few days out
> from being able to test, are you OK with me resolving this for now and if
> the issue crops back up again you can re-open it or open a new one?  I can
> also leave this open and wait for the last user if you are more comfortable
> with that, but in that case I'd like to reduce the severity since it now
> appears to be working.
> 
> Thanks!
> --Tim

Yes, please go ahead and close this ticket out. If the last user has issues, we can re-open.

I'll revisit building a new OS image for some future date.

Thanks again for all your help.

--Dave
Comment 48 Tim McMullan 2022-06-01 13:22:30 MDT
Sounds good, Thanks Dave!