Ticket 7841

Summary: jobacct_gather plugin fails to remove cgroups (Device or resource busy)
Product: Slurm Reporter: Mazen Al-Hagri <mazen.alhagri>
Component: slurmstepd Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID QA Contact:
Severity: 6 - No support contract    
Priority: --- CC: robert.r.vernon
Version: 19.05.2   
Hardware: Linux   
OS: Linux   
Site: -Other-
Attachments: slurm.conf

Description Mazen Al-Hagri 2019-10-01 03:16:03 MDT
Created attachment 11755 [details]
slurm.conf

Hi,

Upon successful completion of an MPI job, slurmstepd fails to clean up the job's cgroups, and consequently some nodes get drained.

Logs from slurmctld:

> slurmstepd: error: *** JOB 21 STEPD TERMINATED ON node001 AT 2019-09-20T15:34:50 DUE TO JOB NOT ENDING WITH SIGNALS ***

Debug output of slurmd:

> # Logs from node001:/var/log/slurmd
> [2019-09-20T15:34:29.355] [21.batch] debug3: xcgroup_set_uint32_param: parameter 'tasks' set to '6034' for '/sys/fs/cgroup/cpuacct'
> [2019-09-20T15:34:29.356] [21.batch] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_21/step_batch/task_0): Device or resource busy
> [2019-09-20T15:34:29.356] [21.batch] debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_1001/job_21/step_batch/task_0 Device or resource busy
> [2019-09-20T15:34:29.356] [21.batch] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_21/step_batch): Device or resource busy
> [2019-09-20T15:34:29.356] [21.batch] debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct Device or resource busy
> [2019-09-20T15:34:29.356] [21.batch] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_21): Device or resource busy
> [2019-09-20T15:34:29.356] [21.batch] debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_1001/job_21 Device or resource busy
> ...

The issue boils down to slurmstepd being unable to remove the job's step cgroup upon completion:

> [root@node001 ~]# rmdir /sys/fs/cgroup/freezer/slurm/uid_1001/job_21/step_batch
> rmdir: failed to remove ‘/sys/fs/cgroup/freezer/slurm/uid_1001/job_21/step_batch’: Device or resource busy

However, removing the `task_0` subdirectory first, and then the `step_batch` directory, works:

> [root@node001 ~]# rmdir /sys/fs/cgroup/freezer/slurm/uid_1001/job_21/step_batch/task_0/
> [root@node001 ~]# rmdir /sys/fs/cgroup/freezer/slurm/uid_1001/job_21/step_batch
> [root@node001 ~]# 
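The manual workaround above generalizes to removing the cgroup tree leaf-first, since rmdir(2) on a cgroup directory fails with EBUSY while child cgroups still exist. The following is only an illustrative sketch of that ordering, not Slurm's actual cleanup code; it mimics the ticket's hierarchy under a temporary directory rather than a real cgroup mount:

```shell
# Recreate the ticket's directory layout under a scratch root.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/slurm/uid_1001/job_21/step_batch/task_0"

# 'find -depth' emits children before their parents, so each rmdir
# sees an already-empty directory: task_0 goes first, then step_batch,
# then job_21, and so on up the tree.
find "$ROOT/slurm" -depth -type d -exec rmdir {} \;
```

On a real node the same depth-first order is what makes the manual cleanup succeed where removing `step_batch` directly fails.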

I couldn't figure out the root cause, though. We've only seen this issue with MPI jobs (OpenMPI 3).

A copy of slurm.conf is attached.

(The issue is reproducible with both Slurm 18.08.4 and 19.05.2.)
Comment 1 Jacob Jenson 2019-10-01 09:16:22 MDT
Mazen,

Is this request for a system that has Slurm support with SchedMD? Or is this more of a question from internal testing? Typically SchedMD only provides support to sites/systems with support contracts. 

Thanks,
Jacob