Ticket 7841

Summary: jobacct_gather plugin fails to remove cgroups (Device or resource busy)
Product: Slurm Reporter: Mazen Al-Hagri <mazen.alhagri>
Component: slurmstepd Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID QA Contact:
Severity: 6 - No support contract    
Priority: --- CC: robert.r.vernon
Version: 19.05.2   
Hardware: Linux   
OS: Linux   
Site: -Other-
Attachments: slurm.conf

Description Mazen Al-Hagri 2019-10-01 03:16:03 MDT
Created attachment 11755 [details]
slurm.conf

Hi,

Upon successful completion of an MPI job, slurmstepd fails to clean up the job's cgroups, and consequently some nodes get drained.

Logs from slurmctld:

> slurmstepd: error: *** JOB 21 STEPD TERMINATED ON node001 AT 2019-09-20T15:34:50 DUE TO JOB NOT ENDING WITH SIGNALS ***

Debug output of slurmd:

> # Logs from node001:/var/log/slurmd
> [2019-09-20T15:34:29.355] [21.batch] debug3: xcgroup_set_uint32_param: parameter 'tasks' set to '6034' for '/sys/fs/cgroup/cpuacct'
> [2019-09-20T15:34:29.356] [21.batch] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_21/step_batch/task_0): Device or resource busy
> [2019-09-20T15:34:29.356] [21.batch] debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_1001/job_21/step_batch/task_0 Device or resource busy
> [2019-09-20T15:34:29.356] [21.batch] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_21/step_batch): Device or resource busy
> [2019-09-20T15:34:29.356] [21.batch] debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct Device or resource busy
> [2019-09-20T15:34:29.356] [21.batch] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_21): Device or resource busy
> [2019-09-20T15:34:29.356] [21.batch] debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_1001/job_21 Device or resource busy
> ...

The issue boils down to slurmstepd being unable to remove the job's step cgroup upon completion:

> [root@node001 ~]# rmdir /sys/fs/cgroup/freezer/slurm/uid_1001/job_21/step_batch
> rmdir: failed to remove ‘/sys/fs/cgroup/freezer/slurm/uid_1001/job_21/step_batch’: Device or resource busy

However, removing the `task_0` subdirectory first, and then the `step_batch` directory, works:

> [root@node001 ~]# rmdir /sys/fs/cgroup/freezer/slurm/uid_1001/job_21/step_batch/task_0/
> [root@node001 ~]# rmdir /sys/fs/cgroup/freezer/slurm/uid_1001/job_21/step_batch
> [root@node001 ~]# 
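The manual workaround above generalizes to removing the cgroup tree leaf-first, since rmdir(2) on a cgroup directory fails with EBUSY while child cgroups still exist. The following is only an illustrative sketch of that ordering, not Slurm's actual cleanup code; it mimics the ticket's hierarchy under a temporary directory rather than a real cgroup mount:

```shell
# Recreate the ticket's directory layout under a scratch root.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/slurm/uid_1001/job_21/step_batch/task_0"

# 'find -depth' emits children before their parents, so each rmdir
# sees an already-empty directory: task_0 goes first, then step_batch,
# then job_21, and so on up the tree.
find "$ROOT/slurm" -depth -type d -exec rmdir {} \;
```

On a real node the same depth-first order is what makes the manual cleanup succeed where removing `step_batch` directly fails.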

I couldn't figure out the root cause, though. We've only seen this issue with MPI jobs (OpenMPI 3).

A copy of slurm.conf is attached.

(The issue is reproducible with both Slurm 18.08.4 and 19.05.2.)
Comment 1 Jacob Jenson 2019-10-01 09:16:22 MDT
Mazen,

Is this request for a system that has Slurm support with SchedMD? Or is this more of a question from internal testing? Typically SchedMD only provides support to sites/systems with support contracts. 

Thanks,
Jacob