This could be related to a previous issue: https://bugs.schedmd.com/show_bug.cgi?id=14191

This shows up only with a simple "srun" that produces output, e.g.:

$ srun uname -a
Linux node002 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2627803/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2627803/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2627803/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2627803/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2627803/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2627803/step_0/cgroup.procs'

or

$ srun echo foobar
foobar
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2627804/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2627804/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2627804/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2627804/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2627804/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2627804/step_0/cgroup.procs'

Interactive jobs are OK:

[myname@login ~]$ srun --pty /bin/bash
[myname@node002 ~]$ echo $SLURM_JOB_ID
2627808

The mount seems to be there:
[myname@node002 ~]$ sudo grep cgroup /proc/mounts | grep freezer
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0

And the mount exists for that interactive shell job:

drwxr-xr-x 4 root root 0 Jun 9 14:47 /sys/fs/cgroup/freezer/slurm/uid_1002/job_2627808/

Maybe it's a duration thing. It works fine (i.e. no cgroup error) if I do this:

$ srun sleep 10 && uname -a
Linux picotte001 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

FWIW, "srun uname -a" worked as expected in 20.02 (the previous version we were running).

Cheers,
Dave
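One way to check the duration theory is to poll the freezer hierarchy while a short job runs and see whether the step directory ever becomes visible. This is a rough diagnostic sketch, not from the bug report: it assumes a v1 freezer mount at /sys/fs/cgroup/freezer, and the watch_for_dir helper name is made up for illustration.

```shell
#!/bin/sh
# watch_for_dir DIR PATTERN SECONDS
# Repeatedly scan DIR for paths matching PATTERN for SECONDS seconds,
# then print the unique set of paths that were ever observed.
watch_for_dir() {
    dir="$1"; pat="$2"; end=$(( $(date +%s) + $3 ))
    while [ "$(date +%s)" -lt "$end" ]; do
        find "$dir" -maxdepth 4 -name "$pat" 2>/dev/null
        sleep 0.1   # fractional sleep assumes GNU sleep
    done | sort -u
}
```

Run on a compute node (as root) alongside a short job, e.g. `watch_for_dir /sys/fs/cgroup/freezer/slurm 'step_*' 5 & srun uname -a; wait` — if the step directory never shows up in the output even though slurmstepd tried to read it, that points at the visibility race rather than a missing mount.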
Hi David,

Yes, this seems to be a continuation of bug 14191. As Tim commented there, he was not certain how it was fixed.

Can I ask you for a new slurmd log with the CGROUP debug flag enabled? I know you provided one in the other bug, but I want to start fresh. Also a complete "cat /proc/mounts".

I also suggest turning off CgroupAutomount=yes; this can get in the middle and hide OS configuration issues.

I also saw you had libcgroup-tools installed. Can you also check "systemctl status cgconfig.service" and attach your /etc/cgconfig.conf file?

Related to this last point, there's a daemon called uresourced that can make a mess of the controllers. I don't think that is the case here, but it would be good if you could check for the existence of this daemon on the nodes.

Thanks
Created attachment 25435 [details] /var/log/slurmd on compute node node002
Hi, Felip:

I turned on the debug flag in slurm.conf:

DebugFlags=cgroup

Then I first ran an interactive shell, doing just "echo hello" once it started, and then exited. And then I ran "srun uname -a". The /var/log/slurmd is attached. This node was just rebooted about 6 hrs ago (~10:00 local time).

/proc/mounts:

proc /proc proc rw,nosuid,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=98292044k,nr_inodes=24573011,mode=755 0 0
tmpfs /run tmpfs rw,relatime 0 0
/dev/sda1 / xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
bpf /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=35,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=37042 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime,pagesize=2M 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/sda6 /local xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
/dev/sda2 /var xfs rw,noatime,nodiratime,attr2,inode64,noquota 0 0
/dev/sda3 /tmp xfs rw,nosuid,nodev,noatime,nodiratime,attr2,inode64,noquota 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
tracefs /sys/kernel/debug/tracing tracefs rw,relatime 0 0
beegfs_nodev /beegfs beegfs rw,relatime,cfgFile=/etc/beegfs/beegfs-client.conf 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/sysadmin /ifs/sysadmin nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.36,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.36 0 0
master:/cm/shared /cm/shared nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.25.128.1,mountvers=3,mountport=4002,mountproto=udp,local_lock=none,addr=172.25.128.1 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/opt_spack /ifs/opt_spack nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.33,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.33 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/groups /ifs/groups nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.35,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.35 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/opt /ifs/opt nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.37,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.37 0 0
baran.cm.cluster:/ifs/baran/hpc-zone/home /home nfs rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=5,sec=sys,mountaddr=172.25.128.43,mountvers=3,mountport=300,mountproto=tcp,local_lock=none,addr=172.25.128.43 0 0
tmpfs /run/user/0 tmpfs rw,nosuid,nodev,relatime,size=19665704k,mode=700 0 0

On node002:

$ sudo systemctl status cgconfig.service
● cgconfig.service - Control Group configuration service
   Loaded: loaded (/usr/lib/systemd/system/cgconfig.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

(In reply to Felip Moll from comment #1)
> [...]
(In reply to David Chin from comment #3)
> [...]

Hi David,

Your mountpoints seem correct and very close to mine.
The only difference I see is that for /sys/fs/cgroup/systemd you have a release agent set, but this should not affect the other mountpoints.

I looked at your log file, but it doesn't seem to have caught the DebugFlags=CGROUP setting; I don't see anything beyond "info"-level information. Can you please do it again, but now also adding SlurmdDebug=debug2? You should see lines prefixed with "CGROUP" and "debug2" in the logs if the setting is active. E.g.:

[2022-03-16T19:14:05.367] [1.interactive] cgroup/v1: _set_uint32_param: CGROUP: parameter 'cgroup.procs' set to '217003' for '/sys/fs/cgroup/cpuacct/slurm_gamba1/uid_1000/job_1/step_interactive/task_0'

So:

1. Enable DebugFlags=CGROUP and SlurmdDebug=debug2 in slurm.conf.
2. Do an "scontrol reconfig" or restart the slurmd.
3. Note the time.
4. Run the jobs which fail.
5. Send back the new logs, and also the output of 'dmesg -T' on the node.
6. Send also the slurmctld log.

I cannot do much with the current logs; I will be waiting for your input.
Hi, Felip:

I added this to slurm.conf and it seemed to work:

DebugFlags=CGROUP
SlurmdDebug=debug2

I think I used "cgroup" before, i.e. all lower case.

Then I ran two test jobs on node074:

1) srun -w node074 --pty /bin/bash -- then did "echo foobar", waited about 10 seconds, and exited the job shell. No cgroup error messages were printed.

2) srun -w node074 uname -a
Linux node074 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs'

Will attach the log after this comment.

Thanks,
Dave
Created attachment 25456 [details] /var/log/slurmd on node074
Hi, Felip:

Any news?

Thanks,
Dave
David,

I had prepared a response some days ago, but I was not entirely sure of its correctness. I was basically identifying this as a timing issue, as you already suggested.

I reviewed your whole log sequence step by step, and everything happens within some microseconds. The cgroup filesystem (especially v1) has been shown to be a bit slow at times, and we don't wait for the cgroup to show up before continuing, since we expect that the return from mkdir already guarantees its existence. But consider your tests and this log:

[2022-06-10T09:58:31.597] [2628733.0] cgroup/v1: xcgroup_wait_pid_moved: CGROUP: Took 2072 checks before stepd pid 29852 was removed from the cpuset cgroup.
...
[2022-06-10T09:58:31.624] [2628733.extern] cgroup/v1: xcgroup_wait_pid_moved: CGROUP: Took 3502 checks before stepd pid 29844 was removed from the cpuset cgroup.

This means your cgroup is a bit slow: it takes 2000-3500 iterations to wait for a pid to be removed from a cgroup controller.

I am wondering whether it is possible that mkdir returns, but the cgroup is not yet visible to the user even though the internal kernel structure has been created. There was one opposite situation, where removal took time to complete after an rmdir, and we had to add an active wait for the directory to disappear.

I discarded a plugin race, since the root cgroup is locked during creation and removal operations. We fixed that race possibility in 21.08.7+, commits 26e96df68a..91bd26c481 (bug 13423).

In any case, I could not reproduce this; I am not sure how to make my cgroup slower without modifying my kernel. Would you be able to try a test patch if I add an active poll or a wait after the directory creation?

Also one other question... after this error, are the directories from the error still in the cgroup filesystem?
[2022-06-10T09:58:31.584] [2628733.0] error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs: No such file or directory
[2022-06-10T09:58:31.584] [2628733.0] error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs'

Thanks, and sorry for the delay in my response.
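The "active poll after the directory creation" idea can be sketched in a few lines of shell. This is purely illustrative (Slurm implements the real logic in C inside the cgroup plugin); the mkdir_and_wait name, the retry limit, and the sleep interval are all assumptions, and fractional sleep requires GNU sleep.

```shell
#!/bin/sh
# mkdir_and_wait DIR [MAX_TRIES]
# Create DIR, then actively poll until DIR/cgroup.procs becomes visible
# or MAX_TRIES is exhausted. On a v1 cgroupfs the kernel populates
# cgroup.procs when the directory appears; on a plain filesystem this
# will time out, which models the race being discussed.
mkdir_and_wait() {
    dir="$1"; max="${2:-1000}"; tries=0
    mkdir -p "$dir" || return 1
    while [ ! -e "$dir/cgroup.procs" ] && [ "$tries" -lt "$max" ]; do
        tries=$((tries + 1))
        sleep 0.001
    done
    [ -e "$dir/cgroup.procs" ]   # succeed only once the file is visible
}
```

The point of the wait is that a successful return from mkdir is treated as a guarantee today; if the kernel structure exists but is not yet visible through the filesystem, only a poll like this (analogous to the existing xcgroup_wait_pid_moved loop) would close the window.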
But my hypothesis is not 100% reliable, because for cpuset, for example, it works just fine.

Another idea I am verifying is whether this can happen before the freezer has been initialized:

[2022-06-10T09:58:31.584] [2628733.0] debug: task/affinity: task_p_post_term: affinity StepId=2628733.0, task 0
[2022-06-10T09:58:31.584] [2628733.0] error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs: No such file or directory
[2022-06-10T09:58:31.584] [2628733.0] error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs'
Dave,

[2022-05-28T08:38:00.936] Considering each NUMA node as a socket
[2022-05-28T08:38:00.936] Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
[2022-05-28T08:38:00.937] Considering each NUMA node as a socket
[2022-05-28T08:38:00.939] error: unable to mount freezer cgroup namespace: Device or resource busy
[2022-05-28T08:38:00.939] error: unable to create freezer cgroup namespace
[2022-05-28T08:38:00.939] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2022-05-28T08:38:00.939] error: cannot create proctrack context for proctrack/cgroup
[2022-05-28T08:38:00.939] error: slurmd initialization failed
[2022-05-28T08:38:30.970] Considering each NUMA node as a socket
[2022-05-28T08:38:30.970] Node reconfigured socket/core boundaries SocketsPerBoard=2:4(hw) CoresPerSocket=24:12(hw)
[2022-05-28T08:38:30.971] Considering each NUMA node as a socket
[2022-05-28T08:38:30.974] error: unable to mount freezer cgroup namespace: Device or resource busy
[2022-05-28T08:38:30.974] error: unable to create freezer cgroup namespace
[2022-05-28T08:38:30.974] error: Couldn't load specified plugin name for proctrack/cgroup: Plugin init() callback failed
[2022-05-28T08:38:30.974] error: cannot create proctrack context for proctrack/cgroup
[2022-05-28T08:38:30.974] error: slurmd initialization failed
[2022-05-28T11:37:05.164] Considering each NUMA node as a socket

Can you tell me about these errors in your log? Why was the freezer cgroup not mounted at these times? Do you know what happened? This could be related.

Can you upload your slurm.conf and cgroup.conf?

Just to rule out some issues: do you use Bright on your cluster?
Created attachment 25932 [details] slurm.conf
Created attachment 25933 [details] cgroup.conf
Yes, we do use Bright Cluster Manager.

> Also one other question... after this error, are the directories of the error still in the cgroup filesystem?
>
> [2022-06-10T09:58:31.584] [2628733.0] error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs: No such file or directory
> [2022-06-10T09:58:31.584] [2628733.0] error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_2628733/step_0/cgroup.procs'

No, the directories are not there.

The freezer cgroup directory seems to be mounted:

[root@node051]# ls -l /sys/fs/cgroup/freezer
total 0
-rw-r--r-- 1 root root 0 Jul 20 11:35 cgroup.clone_children
-rw-r--r-- 1 root root 0 Jul 18 16:16 cgroup.procs
-r--r--r-- 1 root root 0 Jul 20 11:35 cgroup.sane_behavior
-rw-r--r-- 1 root root 0 Jul 20 11:35 notify_on_release
-rw-r--r-- 1 root root 0 Jul 20 11:35 release_agent
drwxr-xr-x 6 root root 0 Jul 20 11:31 slurm/
-rw-r--r-- 1 root root 0 Jul 18 15:04 tasks
Hi, Felip: Feel free to lower the importance to "minor issue" since this hasn't impacted any users' actual jobs. Regards, Dave
It is interesting how Bright configures things. It sets swap parameters with ConstrainSwapSpace=no, and it enables *KMemSpace, which is not very reliable in certain situations.

I suggest this cleanup:

---cgroup.conf---
TaskAffinity=no
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=yes
----

In any case, this is not the cause of the issue.

I cannot reproduce it; I am not able to get such a small window between the creation and finish of a job, and that does indeed seem related to the problem.

From comment 12: can you tell me about these errors in your log? Why was the freezer cgroup not mounted at those times? Do you know what happened?
(In reply to Felip Moll from comment #17)
> [...]
>
> From comment 12: Can you tell me about these errors in your log? Why was
> the freezer cgroup not mounted at those times? Do you know what happened?

Well, that cgroup.conf could have been me. I was trying to figure out how to constrain some classes of jobs using cgroups, and I was tweaking those params.
(In reply to Felip Moll from comment #12)
> Can you tell me about these errors in your log? Why was the freezer cgroup
> not mounted at these times? Do you know what happened?

Not sure what's going on there. I just rebooted node002 for some unrelated reason. Its current /var/log/slurmd does not show any of those errors.
It does have multiple jobs running on it right now.

Trying to force it to happen:

$ srun -w node002 uname -a
srun: job 3477807 queued and waiting for resources
srun: job 3477807 has been allocated resources
Linux node002 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs'
slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs: No such file or directory
slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs'

That produces these lines in /var/log/slurmd:

[2022-07-25T11:18:43.756] launch task StepId=3477807.0 request from UID:1002 GID:1002 HOST:172.25.128.11 PORT:46924
[2022-07-25T11:18:43.757] task/affinity: lllp_distribution: JobId=3477807 implicit auto binding: cores,one_thread, dist 8192
[2022-07-25T11:18:43.757] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2022-07-25T11:18:43.757] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [3477807]: mask_cpu,one_thread, 0x000400000000
[2022-07-25T11:18:43.764] [3477807.0] Considering each NUMA node as a socket
[2022-07-25T11:18:43.770] [3477807.0] task/cgroup: _memcg_initialize: job: alloc=3900MB mem.limit=3900MB memsw.limit=unlimited
[2022-07-25T11:18:43.770] [3477807.0] task/cgroup: _memcg_initialize: step: alloc=3900MB mem.limit=3900MB memsw.limit=unlimited
[2022-07-25T11:18:43.839] [3477807.0] error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs: No such file or directory
[2022-07-25T11:18:43.839] [3477807.0] error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs'
[2022-07-25T11:18:43.839] [3477807.0] error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs: No such file or directory
[2022-07-25T11:18:43.839] [3477807.0] error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs'
[2022-07-25T11:18:43.839] [3477807.0] error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs: No such file or directory
[2022-07-25T11:18:43.839] [3477807.0] error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3477807/step_0/cgroup.procs'
[2022-07-25T11:18:43.860] [3477807.0] done with job
[2022-07-25T11:18:43.880] [3477807.extern] done with job

But the freezer directory does seem to exist:

[root@node002]# ll -d /sys/fs/cgroup/freezer
dr-xr-xr-x 3 root root 0 Jul 25 10:51 /sys/fs/cgroup/freezer/
(In reply to David Chin from comment #19)
> [...]
>
> But the freezer directory does seem to exist:
>
> [root@node002]# ll -d /sys/fs/cgroup/freezer
> dr-xr-xr-x 3 root root 0 Jul 25 10:51 /sys/fs/cgroup/freezer/

If I provide a debug patch, can you apply and test it on a node? I understand this is a production machine, so it would be enough to isolate one node, apply the patch to that slurmd, and reproduce the error there.

If this is possible, I will prepare it.
(In reply to Felip Moll from comment #20)
> If I provide a debug patch can you apply and test it on a node? I understand
> this is a production machine so it would be enough to isolate one node,
> apply the patch to that slurmd, and reproduce the error there.
>
> If this is possible I would prepare it.

Sure, I can set something up. It may take me a few days to put it in place.
(In reply to David Chin from comment #21)
> (In reply to Felip Moll from comment #20)
> > If I provide a debug patch can you apply and test it on a node? I understand
> > this is a production machine so it would be enough to isolate one node,
> > apply the patch to that slurmd, and reproduce the error there.
> >
> > If this is possible I would prepare it.
>
> Sure, I can set something up. It may take me some number of days to put it
> in place.

Hi David,

I was able to trigger a similar issue to yours by stressing Slurm, but I still haven't found the cause. I will keep you posted in the coming weeks. In any case, the error is informative only and should not cause any harm.

Sorry for the delay.
(In reply to David Chin from comment #19)
> [...]

David, first of all sorry for the late reply; I've had other priorities these past weeks. I also looked into this again, but I don't have enough information to confirm my initial suspicions.

Is it possible for you to reproduce the issue again with the slurmd debug level increased to debug3 and the following debug flags added?

SlurmdDebug=debug3
DebugFlags=Cgroup,JobAccountGather,Profile,Protocol,Steps

Also, when you run srun, can you run it with "-vvv"?

srun -vvv /bin/true

I will need the full slurmd log, the slurmctld log, and the srun output.

Thanks!
Hi David,

Do you have any input for me regarding my last comment 24?

Thanks
(In reply to Felip Moll from comment #25)
> Hi David,
>
> Do you have any input for me regarding to my last comment 24?
>
> Thanks

Hi Felip:

I'm aiming to get to this by the end of this week.

Regards,
Dave
(In reply to Felip Moll from comment #25) > Hi David, > > Do you have any input for me regarding to my last comment 24? > > Thanks Hi Felip: Here's the output: $ srun -vvv /bin/true srun: defined options srun: -------------------- -------------------- srun: verbose : 3 srun: -------------------- -------------------- srun: end of defined options srun: debug: propagating RLIMIT_CPU=18446744073709551615 srun: debug: propagating RLIMIT_FSIZE=18446744073709551615 srun: debug: propagating RLIMIT_DATA=18446744073709551615 srun: debug: propagating RLIMIT_STACK=18446744073709551615 srun: debug: propagating RLIMIT_CORE=0 srun: debug: propagating RLIMIT_RSS=18446744073709551615 srun: debug: propagating RLIMIT_NPROC=1541534 srun: debug: propagating RLIMIT_NOFILE=131072 srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615 srun: debug: propagating RLIMIT_AS=18446744073709551615 srun: debug: propagating SLURM_PRIO_PROCESS=0 srun: debug: propagating UMASK=0002 srun: debug2: srun PMI messages to port=45223 srun: debug: Entering slurm_allocation_msg_thr_create() srun: debug: port from net_stream_listen is 35795 srun: debug: Entering _msg_thr_internal srun: debug: auth/munge: init: Munge authentication plugin loaded srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index) srun: debug: Waited 0.100000 sec and still waiting: next sleep for 0.200000 sec srun: Nodes node001 are ready for job srun: jobid 3754978: nodes(1):`node001', cpu counts: 1(x1) srun: debug2: creating job with 1 tasks srun: debug: requesting job 3754978, user 1002, nodes 1 including ((null)) srun: debug: cpus 1, tasks 1, name true, relative 65534 srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type) srun: debug: Entering slurm_step_launch srun: debug: mpi type = (null) srun: debug: mpi/none: p_mpi_hook_client_prelaunch: Using mpi/none srun: debug: Entering _msg_thr_create() srun: debug: initialized stdio listening socket, port 42079 srun: debug: Started IO server thread 
(23456195376896) srun: debug: Entering _launch_tasks srun: launching StepId=3754978.0 on host node001, 1 tasks: 0 srun: route/default: init: route default plugin loaded srun: debug2: Tree head got back 0 looking for 1 srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: Tree head got back 1 srun: debug: launch returned msg_rc=0 err=0 type=8001 srun: debug2: Activity on IO listening socket 21 srun: debug2: Entering io_init_msg_read_from_fd srun: debug2: Leaving io_init_msg_read_from_fd srun: debug2: Entering io_init_msg_validate srun: debug2: Leaving io_init_msg_validate srun: debug2: Validated IO connection from 172.25.128.101:38610, node rank 0, sd=22 srun: debug2: eio_message_socket_accept: got message connection from 172.25.128.101:42670 23 srun: debug2: received task launch srun: launch/slurm: _task_start: Node node001, 1 tasks started srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: Entering _file_write slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3754978/step_0/cgroup.procs: No such file or directory slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3754978/step_0/cgroup.procs' slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3754978/step_0/cgroup.procs: No such file or directory slurmstepd: error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_1002/job_3754978/step_0/cgroup.procs' slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1002/job_3754978/step_0/cgroup.procs: No such file or directory slurmstepd: error: unable to read 
'/sys/fs/cgroup/freezer/slurm/uid_1002/job_3754978/step_0/cgroup.procs' srun: debug2: Leaving _file_write srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: eio_message_socket_accept: got message connection from 172.25.128.101:42672 22 srun: debug2: received task exit srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=3754978.0 (status=0x0000). srun: launch/slurm: _task_finish: node001: task 0: Completed srun: debug: task 0 done srun: debug2: false, shutdown srun: debug2: false, shutdown srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: false, shutdown srun: debug: IO thread exiting srun: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread srun: debug2: false, shutdown $ echo $? 0 Sleeping first: $ srun -vvv sleep 10 && /bin/true srun: defined options srun: -------------------- -------------------- srun: verbose : 3 srun: -------------------- -------------------- srun: end of defined options srun: debug: propagating RLIMIT_CPU=18446744073709551615 srun: debug: propagating RLIMIT_FSIZE=18446744073709551615 srun: debug: propagating RLIMIT_DATA=18446744073709551615 srun: debug: propagating RLIMIT_STACK=18446744073709551615 srun: debug: propagating RLIMIT_CORE=0 srun: debug: propagating RLIMIT_RSS=18446744073709551615 srun: debug: propagating RLIMIT_NPROC=1541534 srun: debug: propagating RLIMIT_NOFILE=131072 srun: debug: propagating RLIMIT_MEMLOCK=18446744073709551615 srun: debug: propagating RLIMIT_AS=18446744073709551615 srun: debug: propagating SLURM_PRIO_PROCESS=0 srun: debug: propagating UMASK=0002 srun: debug2: srun PMI messages to port=38777 srun: debug: Entering slurm_allocation_msg_thr_create() srun: debug: port from net_stream_listen is 35681 srun: debug: 
Entering _msg_thr_internal srun: debug: auth/munge: init: Munge authentication plugin loaded srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index) srun: debug: Waited 0.100000 sec and still waiting: next sleep for 0.200000 sec srun: Nodes node001 are ready for job srun: jobid 3754982: nodes(1):`node001', cpu counts: 1(x1) srun: debug2: creating job with 1 tasks srun: debug: requesting job 3754982, user 1002, nodes 1 including ((null)) srun: debug: cpus 1, tasks 1, name sleep, relative 65534 srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type) srun: debug: Entering slurm_step_launch srun: debug: mpi type = (null) srun: debug: mpi/none: p_mpi_hook_client_prelaunch: Using mpi/none srun: debug: Entering _msg_thr_create() srun: debug: initialized stdio listening socket, port 35551 srun: debug: Started IO server thread (23456195376896) srun: debug: Entering _launch_tasks srun: launching StepId=3754982.0 on host node001, 1 tasks: 0 srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: route/default: init: route default plugin loaded srun: debug2: Tree head got back 0 looking for 1 srun: debug2: Tree head got back 1 srun: debug: launch returned msg_rc=0 err=0 type=8001 srun: debug2: Activity on IO listening socket 21 srun: debug2: Entering io_init_msg_read_from_fd srun: debug2: Leaving io_init_msg_read_from_fd srun: debug2: Entering io_init_msg_validate srun: debug2: Leaving io_init_msg_validate srun: debug2: Validated IO connection from 172.25.128.101:39766, node rank 0, sd=22 srun: debug2: eio_message_socket_accept: got message connection from 172.25.128.101:37374 23 srun: debug2: received task launch srun: launch/slurm: _task_start: Node node001, 1 tasks started srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called 
_file_writable srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: eio_message_socket_accept: got message connection from 172.25.128.101:37380 22 srun: debug2: received task exit srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=3754982.0 (status=0x0000). srun: launch/slurm: _task_finish: node001: task 0: Completed srun: debug: task 0 done srun: debug2: false, shutdown srun: debug2: false, shutdown srun: debug2: Called _file_readable srun: debug2: Called _file_writable srun: debug2: Called _file_writable srun: debug2: false, shutdown srun: debug: IO thread exiting srun: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread srun: debug2: false, shutdown srun: debug: Leaving _msg_thr_internal $ echo $? 0
> Is it possible for you to reproduce again the issue but increasing the
> slurmd debug level to debug3, and add the debug flags:
> CGROUP,JobAccountGather,Profile,Protocol,Steps
>
> SlurmdDebug=debug3
> Debugflags=Cgroup,JobAccountGather,Profile,Protocol,Steps
>
> Also when you run srun, can you run with "-vvv" ?
>
> srun -vvv /bin/true
>
> I will need the full slurmd log, slurmctld and srun output
>
> Thanks!

Thanks David, but I still need the slurmd and slurmctld logs from a run made after changing the debug settings in slurm.conf and restarting slurmd.
Created attachment 27133 [details] Debugging output 2022-10-05 Here's the output from "srun", the appropriate section of /var/log/slurmd on the compute nodes that ran the 2 trial jobs, and the appropriate sections of /var/log/slurmctld on the head node. grep for "^localtime" to see the 2 separate tests: 1) srun -vvv /bin/true 2) srun -vvv sleep 10 && /bin/true
> grep for "^localtime" to see the 2 separate tests:
>
> 1) srun -vvv /bin/true
>
> 2) srun -vvv sleep 10 && /bin/true

David,

I've found this issue in other bugs, but it is very difficult to reproduce, and we haven't found the race yet. I just wanted to give you an update.

(Btw, test #2 actually runs two separate commands: first "srun -vvv sleep 10", and then "/bin/true". It does not run "sleep 10 && /bin/true" inside srun, because && is interpreted by bash first and splits the command line in two.)
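The parenthetical above can be demonstrated with a stand-in for srun (the `fake_srun` function below is invented purely for illustration; it just echoes whatever arguments bash hands it):

```shell
# fake_srun: a hypothetical stand-in for srun that prints the arguments
# it actually receives after bash has parsed the command line.
fake_srun() { printf 'fake_srun got: %s\n' "$*"; }

# bash splits on && before srun ever runs: srun only sees "sleep 10",
# and /bin/true runs locally as a second command afterwards.
fake_srun sleep 10 && /bin/true
# → fake_srun got: sleep 10

# To run the whole compound command inside the job step, quote it through
# a shell so && is passed along as part of a single argument:
fake_srun bash -c 'sleep 10 && /bin/true'
# → fake_srun got: bash -c sleep 10 && /bin/true
```

The same quoting trick applies to the real srun: `srun bash -c 'sleep 10 && uname -a'` runs both commands inside the allocated step.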
David, can you please confirm that *all* of your slurmd daemons are running version 21.08.7 or later?
(In reply to Felip Moll from comment #32) > David, can you please confirm that *all* your Slurmd's are running at > versions superior or equal to 21.08.7 ? Hi Felip: All nodes are running the Bright-provided package slurm21.08-client-21.08.8-100526_cm9.0_58aab6cfd1.x86_64 Doing "strings(1)" on the executable finds: SLURM_VERSION_STRING "21.08.8-2"
David,

First of all, sorry for not having replied in so long. Besides having higher-priority bugs (this one occurred only under a specific condition and was not harmful), I hadn't found the issue until now, and I really had a hard time reproducing it in newer versions.

But today, thanks to another bug, I was investigating a part of the code that gave me some hints about your issue. Although my initial thoughts were wrong, I found another path that led me to see that we have a race condition with the RPC handler thread in a stepd.

For several reasons we assumed that only one thread in the slurmstepd ever accessed the cgroup plugin, but that's false. The main slurmstepd thread manages most of the cgroup work, but in one particular situation, when a job ends very quickly, we may hit for a microsecond a window where we are removing the cgroup but still haven't set the variable which says we've done cleanup. At that moment, if an RPC comes in requesting the list of pids, it will get an error.

I already have a solution for this issue, which I will implement in the following days, and I will let you know.

Thanks for your patience and understanding.
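The race described above can be sketched in miniature. All names below are invented for illustration (this is not Slurm code); the point is the ordering: if cleanup removes the step directory *before* recording that cleanup happened, a concurrently arriving "list pids" request sees "not cleaned up yet" and then fails with ENOENT, which is exactly the harmless-but-noisy error in this ticket:

```shell
# Set up a fake per-step directory standing in for the freezer cgroup.
workdir=$(mktemp -d)
mkdir -p "$workdir/step_0"
echo 12345 > "$workdir/step_0/cgroup.procs"

cleanup_step() {
    # Fixed ordering: mark the step as cleaned up, THEN remove it, so a
    # reader never sees "alive" for an already-removed directory. In the
    # real multithreaded slurmstepd both actions must also sit under one
    # mutex shared with the RPC handler.
    touch "$workdir/cleaned"
    rm -rf "$workdir/step_0"
}

rpc_list_pids() {
    if [ -e "$workdir/cleaned" ]; then
        echo "step already gone"            # clean answer, nothing to log
    else
        cat "$workdir/step_0/cgroup.procs"  # directory guaranteed present
    fi
}

rpc_list_pids    # → 12345
cleanup_step
rpc_list_pids    # → step already gone
```

With the buggy ordering (remove first, mark second), a request landing between the two lines would attempt the `cat` and fail with "No such file or directory", matching the slurmstepd errors above.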
*** Ticket 15370 has been marked as a duplicate of this ticket. ***
I had problems reproducing your issue in 22.05, and after deep investigation I realized that we fixed this in bug 13768.

We were calling proctrack_g_destroy, which destroys the cgroup, too early, so the patch below moved this call inside stepd_cleanup, the final function called when the step is ending. stepd_cleanup first terminates the IO thread, which is responsible for receiving RPCs, and only then destroys the cgroup. Before that patch, the order was incorrect.

In summary, you should upgrade to 22.05 to make this error disappear.

commit 1ddef9a0dd896c6777daa3dbb68a405d076c45c6
Author: Dominik Bartkiewicz <bart@schedmd.com>
Date:   Thu Apr 14 17:22:42 2022 +0200

    Make proctrack_p_wait to really wait on cgroup emptiness

    We were relying on the destruction of a cgroup to check for it being
    empty. This is not ok, because if there are other cgroup plugins loaded
    we do not want to try destroy the cgroup to not race with others.
    Also, multiple signals sent to a step can cause wait to be called
    multiple times, so the destroy must be avoided. Move it outside
    stepd_cleanup. This function is only called once per step at
    termination.

    Bug 13768

    Signed-off-by: Felip Moll <felip.moll@schedmd.com>
    Signed-off-by: Oriol Vilarrubi <jvilarru@schedmd.com>

Thanks for your patience, and sorry again for the delay in understanding why it was happening. At some point I was confused about versions and really did think this was a different issue.

I am closing the bug now.

*** This ticket has been marked as a duplicate of ticket 13768 ***
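The teardown ordering that the patch enforces can be sketched as follows. The function names and bodies are invented stubs, not Slurm's actual symbols; only the call order mirrors the description of stepd_cleanup above:

```shell
# Invented stubs illustrating the teardown order fixed in bug 13768:
# the IO/RPC thread must be stopped BEFORE the cgroup is destroyed, so
# no RPC can race with the removal of the step's cgroup directories.
stop_io_thread()    { echo "io thread stopped (no more RPCs)"; }
proctrack_destroy() { echo "cgroup destroyed"; }

stepd_cleanup() {
    stop_io_thread     # 1. RPC source terminated first
    proctrack_destroy  # 2. only then is removing the cgroup race-free
}

stepd_cleanup
```

Reversing the two calls inside stepd_cleanup recreates the window in which an incoming RPC could observe a half-removed cgroup, which is the race this ticket reported.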