Ticket 15370 - Cgroup errors from slurmstepd
Summary: Cgroup errors from slurmstepd
Status: RESOLVED DUPLICATE of ticket 14293
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 21.08.8
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-11-07 13:37 MST by Steve Ford
Modified: 2023-02-01 15:11 MST
CC List: 1 user

See Also:
Site: MSU
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Slurm Configuration (12.08 KB, application/x-compressed)
2022-11-07 13:37 MST, Steve Ford
Details
Slurmd log for job 64137494 (167.81 KB, application/x-gzip)
2022-11-08 10:29 MST, Steve Ford
Details
slurmd logs for job 64884268 (52.69 MB, application/x-gzip)
2022-11-10 09:12 MST, Steve Ford
Details

Description Steve Ford 2022-11-07 13:37:22 MST
Created attachment 27634 [details]
Slurm Configuration

Hello SchedMD,

We are seeing errors like the following from several jobs:

slurmstepd: error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_794240/job_64633746/step_batch/cgroup.procs: No such file or directory

Any idea what could be causing these?

Thanks,
Steve
Comment 1 Felip Moll 2022-11-08 01:23:36 MST
(In reply to Steve Ford from comment #0)
> We are seeing errors like the following from several jobs:
> 
> slurmstepd: error: _cgroup_procs_check: failed on path
> /sys/fs/cgroup/freezer/slurm/uid_794240/job_64633746/step_batch/cgroup.procs:
> No such file or directory

Hi Steve,

Can you please upload the slurmd logs with debug2 and the CGROUP debug flag activated?

This can happen with very short jobs on a system where the kernel takes a moment to create the cgroup. I need to see exactly when this error happens.
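For reference, the requested logging can be turned on in slurm.conf on the affected nodes (a sketch; the parameter names are standard, but adjust values to your site):

```conf
# slurm.conf -- raise slurmd verbosity and enable cgroup tracing
SlurmdDebug=debug2
DebugFlags=Cgroup
```

After changing these, restart slurmd (or run "scontrol reconfigure") so the new level applies to newly launched steps.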

Is it reproducible? It could be similar to bug 14293.
Comment 2 Steve Ford 2022-11-08 10:29:07 MST
Created attachment 27648 [details]
Slurmd log for job 64137494
Comment 3 Felip Moll 2022-11-10 08:59:46 MST
Argh, the log shows the error for a step that was running prior to setting the new log level, so it did not capture what I expected.

[2022-11-08T12:20:42.797] debug2: container signal 15 to StepId=64137494.extern
[2022-11-08T12:20:42.797] [64137494.batch] error: Detected 1 oom-kill event(s) in StepId=64137494.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2022-11-08T12:20:42.798] debug2: container signal 15 to StepId=64137494.batch
[2022-11-08T12:20:42.798] [64137494.batch] error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_1035122/job_64137494/step_batch/cgroup.procs: No such file or directory

I see this error is close to a signal 15 and an OOM.

Have you reproduced it again with new steps after increasing the debug level + flags?
Comment 4 Steve Ford 2022-11-10 09:12:10 MST
Created attachment 27688 [details]
slurmd logs for job 64884268
Comment 5 Felip Moll 2022-11-11 02:59:24 MST
Do you have Delegate=yes in the systemd's slurmd unit file?

Do you have weka or Bright in the system?

Can I see a "cat /proc/mounts" on this affected node?

Thanks
Comment 6 Steve Ford 2022-11-14 07:48:58 MST
Felip,

We are using Delegate=Yes in our slurmd unit file.
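For reference, that setting typically lives in a systemd drop-in for the slurmd unit; an illustrative fragment (the path is an assumption, not taken from this ticket):

```conf
# /etc/systemd/system/slurmd.service.d/override.conf (illustrative path)
[Service]
Delegate=yes
```

Delegate=yes tells systemd not to manage cgroups underneath slurmd's own cgroup, which avoids systemd and Slurm fighting over the same hierarchy.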

We do have Weka available on our system as a software module but it does not appear to be in use by jobs throwing this error.

Here is /proc/mounts on a node where this error occurred:

sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
devtmpfs /dev devtmpfs rw,nosuid,size=528096092k,nr_inodes=132024023,mode=755 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio,net_cls 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
configfs /sys/kernel/config configfs rw,relatime 0 0
/dev/mapper/vg_system-system_root / ext4 rw,relatime,data=ordered 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
mqueue /dev/mqueue mqueue rw,relatime 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=35,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=94700 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/sda1 /boot ext4 rw,relatime,data=ordered 0 0
/dev/mapper/vg_system-system_puppet /opt/puppetlabs xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/vg_system-system_tmp /tmp xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/vg_system-system_var /var xfs rw,relatime,attr2,inode64,noquota 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
fs-07.i:/zsrv/el7optmodules /opt/modules nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.12.129,local_lock=none,addr=192.168.0.93 0 0
fs-07.i:/zsrv/el7optsoftware /opt/software nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.12.129,local_lock=none,addr=192.168.0.93 0 0
192.168.1.40:/mnt/nfs/crash /var/crash nfs rw,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.40,mountvers=3,mountport=20048,mountproto=udp,local_lock=none,addr=192.168.1.40 0 0
/etc/auto.cvmfs /cvmfs autofs rw,relatime,fd=5,pgrp=16610,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=185026 0 0
ufs18 /mnt/ufs18 gpfs rw,relatime 0 0
gs21 /mnt/gs21 gpfs rw,relatime 0 0
Comment 9 Felip Moll 2022-11-17 11:06:01 MST
Hi, the error is harmless, but still something that probably needs to be fixed.

It seems there's a race condition: the freezer cgroup is destroyed while we are still serving signal RPCs from an external source (slurmd, srun, scancel, ...). We need to lock the cgroup while destroying it and not let any other thread try to read it in the meantime.
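A minimal sketch of that locking idea, assuming a hypothetical step-cgroup structure (these names are illustrative, not Slurm's actual internals): one thread tears down the step's freezer cgroup while another, serving a late signal RPC, still tries to read .../step_batch/cgroup.procs; serializing both sides on a mutex plus a "destroyed" flag turns the ENOENT failure into a clean no-op.

```c
/* Illustrative sketch only -- not Slurm's real code. */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct step_cgroup {
	pthread_mutex_t lock;
	bool destroyed;		/* set once the kernel directory is removed */
	const char *procs_path;	/* e.g. .../step_batch/cgroup.procs */
};

/* Signal-RPC path: skip the read if the cgroup is already gone. */
int cgroup_procs_check(struct step_cgroup *cg)
{
	int rc = 0;

	pthread_mutex_lock(&cg->lock);
	if (cg->destroyed)
		rc = -ESRCH;	/* step already torn down: not an error */
	else
		printf("reading %s\n", cg->procs_path); /* real code would
							 * open/read here */
	pthread_mutex_unlock(&cg->lock);
	return rc;
}

/* Teardown path: mark the cgroup destroyed under the same lock, so no
 * concurrent reader can race past the check above. */
void cgroup_destroy(struct step_cgroup *cg)
{
	pthread_mutex_lock(&cg->lock);
	cg->destroyed = true;	/* rmdir() of the hierarchy would go here */
	pthread_mutex_unlock(&cg->lock);
}
```

With this pattern, a scancel RPC that arrives after cgroup_destroy() gets -ESRCH back instead of failing on a missing cgroup.procs path.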

There's a related commit, 26e96df68aa97, but it has been in since 21.08.7 and may cover a slightly different situation.

I will let you know when I have more conclusions, but for the moment there's no need to worry too much about it.
Comment 12 Felip Moll 2022-11-24 11:43:21 MST
Hi Steve, can you please confirm that *all* your slurmd's are running versions greater than or equal to 21.08.7?
Comment 13 Felip Moll 2023-02-01 15:11:50 MST
Hi Steve,

First of all, sorry for taking so long to reply. Besides having higher-priority bugs (this one only occurred under a specific condition and was not harmful), I did not find the issue until now, and I had a hard time reproducing it on newer versions.

I have finally found the cause of this issue and am already working on a solution. It turned out to be a duplicate of bug 14293. If you don't mind, I am closing this one since it was reported later, and we will continue in bug 14293.

I briefly explain the situation there, in https://bugs.schedmd.com/show_bug.cgi?id=14293#c35

I already suspected a signal arriving while we were removing the cgroup, but I couldn't find exactly why it was causing issues. Now I've found it.

Thanks for your understanding and patience.

*** This ticket has been marked as a duplicate of ticket 14293 ***