Created attachment 11755 [details]
slurm.conf

Hi,

Upon successful completion of an MPI job, slurmstepd fails to clean up the job's cgroups, and consequently (some) nodes get drained.

Log from slurmctld:

> slurmstepd: error: *** JOB 21 STEPD TERMINATED ON node001 AT 2019-09-20T15:34:50 DUE TO JOB NOT ENDING WITH SIGNALS ***

Debug output of slurmd (from node001:/var/log/slurmd):

> [2019-09-20T15:34:29.355] [21.batch] debug3: xcgroup_set_uint32_param: parameter 'tasks' set to '6034' for '/sys/fs/cgroup/cpuacct'
> [2019-09-20T15:34:29.356] [21.batch] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_21/step_batch/task_0): Device or resource busy
> [2019-09-20T15:34:29.356] [21.batch] debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_1001/job_21/step_batch/task_0 Device or resource busy
> [2019-09-20T15:34:29.356] [21.batch] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_21/step_batch): Device or resource busy
> [2019-09-20T15:34:29.356] [21.batch] debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct Device or resource busy
> [2019-09-20T15:34:29.356] [21.batch] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_1001/job_21): Device or resource busy
> [2019-09-20T15:34:29.356] [21.batch] debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_1001/job_21 Device or resource busy
> ...
The issue boils down to not being able to remove the job step's cgroup upon completion:

> [root@node001 ~]# rmdir /sys/fs/cgroup/freezer/slurm/uid_1001/job_21/step_batch
> rmdir: failed to remove ‘/sys/fs/cgroup/freezer/slurm/uid_1001/job_21/step_batch’: Device or resource busy

However, removing the `task_0` subdirectory first, then the step directory, works:

> [root@node001 ~]# rmdir /sys/fs/cgroup/freezer/slurm/uid_1001/job_21/step_batch/task_0/
> [root@node001 ~]# rmdir /sys/fs/cgroup/freezer/slurm/uid_1001/job_21/step_batch
> [root@node001 ~]#

I couldn't figure out the root cause, though. We've only seen this issue with MPI jobs (openmpi-3). A copy of slurm.conf is attached. (The issue was reproducible in Slurm 18.08.4 and 19.05.2.)
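For reference, the manual workaround above generalizes to a bottom-up removal of the whole cgroup subtree: `rmdir` on a cgroup directory only succeeds once all of its child directories are gone, so deleting depth-first (deepest entries first) sidesteps the "Device or resource busy" error. A minimal sketch, using a stand-in directory under `/tmp` rather than a real cgroup mount (the paths below are illustrative, not from an actual node):

```shell
#!/bin/sh
# Stand-in for a cgroup hierarchy such as /sys/fs/cgroup/freezer/slurm
CG_ROOT="/tmp/cgdemo"
mkdir -p "$CG_ROOT/uid_1001/job_21/step_batch/task_0"

# -depth makes find emit children before their parents, so each
# rmdir only ever runs on an already-empty directory.
find "$CG_ROOT" -depth -type d -exec rmdir {} \;
```

On a real freezer hierarchy this would correspond to removing `task_0` before `step_batch`, `step_batch` before `job_21`, and so on, which matches the order that worked by hand above.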
Mazen,

Is this request for a system that has Slurm support with SchedMD? Or is this more of a question from internal testing? Typically SchedMD only provides support to sites/systems with support contracts.

Thanks,
Jacob