Ticket 13849

Summary: Job termination trouble
Product: Slurm Reporter: Josko Plazonic <plazonic>
Component: slurmstepd Assignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: bart, csc-slurm-tickets, kilian, marshall
Version: 21.08.6   
Hardware: Linux   
OS: Linux   
Site: Princeton (PICSciE) Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Josko Plazonic 2022-04-14 10:22:38 MDT
Quite often we've been getting nodes stuck in the drain state with "Kill task failed" on this one small cluster (which is why it's a problem: with only 2 compute nodes, when one or both get offlined, users tend to notice and complain).

In slurmd log I see:

[2022-04-13T17:53:24.010] [30032.batch] error: *** JOB 30032 ON verde-m18n2 CANCELLED AT 2022-04-13T17:53:24 DUE TO TIME LIMIT ***
[2022-04-13T17:53:24.011] [30032.extern] error: _cgroup_procs_check: failed on path /sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/step_extern/cgroup.procs: No such file or directory
[2022-04-13T17:53:24.011] [30032.extern] error: unable to read '/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/step_extern/cgroup.procs'
[2022-04-13T17:54:25.000] [30032.batch] error: *** JOB 30032 STEPD TERMINATED ON verde-m18n2 AT 2022-04-13T17:54:24 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2022-04-13T17:54:27.030] [30032.batch] error: Unable to destroy container 2091180 in cgroup plugin, giving up after 63 sec

and on the slurmctld side:

[2022-04-13T17:54:25.002] error: slurm_msg_sendto: address:port=10.33.18.33:43482 msg_type=8001: No error
[2022-04-13T17:54:25.002] error: slurmd error running JobId=30032 on node(s)=verde-m18n2: Kill task failed

There is nothing particularly special about this job (slightly simplified):

=============================================================
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=32G
#SBATCH --time=168:00:00

module purge
module load anaconda3/2021.11
conda activate xxxxxxx

python xxxxx.py abcd
=============================================================

The job didn't use much memory (roughly 11.82 GB according to Slurm's stats).

The cgroup in question is still there, with step_extern missing:

/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/
/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/cgroup.procs
/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/step_batch
/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/step_batch/cgroup.procs
/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/step_batch/freezer.self_freezing
/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/step_batch/tasks
/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/step_batch/freezer.parent_freezing
/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/step_batch/freezer.state
/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/step_batch/notify_on_release
/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/step_batch/cgroup.clone_children
/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/freezer.self_freezing
/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/tasks
/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/freezer.parent_freezing
/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/freezer.state
/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/notify_on_release
/sys/fs/cgroup/freezer/slurm/uid_267114/job_30032/cgroup.clone_children


The user in question also logged in to that node while this job and a few others of theirs were running, so step_extern probably got used (we have "pam_slurm_adopt.so action_no_jobs=ignore join_container=true").
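For context, that option string lives on the pam_slurm_adopt line of the node's PAM stack; ours looks roughly like this (the file location varies by distro, so treat the path as an assumption):

```text
# /etc/pam.d/sshd (or your distro's equivalent sshd PAM stack)
account    required    pam_slurm_adopt.so action_no_jobs=ignore join_container=true
```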

Is there anything that can explain this, and any way to prevent it?

Thanks!
Comment 1 Jason Booth 2022-04-15 11:49:14 MDT
Would you please attach your slurm.conf and cgroup.conf, and let us know which flavor of Linux your site uses for compute nodes?

"Kill task failed" normally happens when a task cannot be killed even with SIGKILL (signal 9). Such tasks are often hanging on I/O. Slurm provides a pair of options to gather metrics after a timeout period: UnkillableStepProgram and UnkillableStepTimeout. The program is something your site defines, typically used to gather output from, say, dmesg or mounts, or to check whether any process is hung in the defunct state. Nodes that hit these errors should be considered unclean and rebooted.

https://slurm.schedmd.com/slurm.conf.html#SECTION_UNKILLABLE-STEP-PROGRAM-SCRIPT

https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram
https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepTimeout
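For illustration, an UnkillableStepProgram can be as simple as a diagnostics-gathering shell script like the one below. This is a sketch, not anything Slurm ships: the log location and the exact commands are site assumptions (here the log defaults to a temp file so the sketch runs anywhere).

```shell
#!/bin/sh
# Sketch of a site-defined UnkillableStepProgram: Slurm runs it when a step
# survives SIGKILL past UnkillableStepTimeout. It only gathers diagnostics.
# Log path is an assumption; override with SLURM_UNKILLABLE_LOG.
LOG=${SLURM_UNKILLABLE_LOG:-$(mktemp)}
{
  echo "=== unkillable step diagnostics: $(date) ==="
  echo "--- kernel ring buffer (may require root) ---"
  dmesg 2>/dev/null | tail -n 50
  echo "--- mounts ---"
  mount 2>/dev/null || cat /proc/mounts
  echo "--- processes in uninterruptible (D) or zombie (Z) state ---"
  ps axo pid,stat,wchan:32,comm 2>/dev/null | awk 'NR==1 || $2 ~ /^[DZ]/'
} >> "$LOG" 2>&1
echo "diagnostics written to $LOG"
```

You would then point slurm.conf at it, e.g. UnkillableStepProgram=/usr/local/sbin/unkillable_diag.sh (a hypothetical path), with an UnkillableStepTimeout large enough for your I/O stack.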
Comment 2 Josko Plazonic 2022-04-18 08:26:02 MDT
I am not sure that what you are saying makes much sense in this context. slurmstepd is failing to read the cgroup.procs file because it does not exist at that point; no extra timeout will help here. If anything, this looks like some kind of timing/synchronization problem, and it looks to me like it should treat this case as a non-fatal error: if the file is missing, then that cgroup was probably destroyed in the meantime.
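To illustrate the point: a teardown path that treats a vanished cgroup.procs as "already empty" rather than as a fatal error would look roughly like this. This is a sketch in shell against a fake cgroup tree, not Slurm's actual code; the function name is made up.

```shell
#!/bin/sh
# Illustrative sketch (not Slurm source): list a cgroup's member PIDs,
# treating a missing cgroup.procs as "cgroup already destroyed, no procs"
# instead of an error.
cgroup_procs() {
    procs_file="$1/cgroup.procs"
    if [ ! -e "$procs_file" ]; then
        # The cgroup vanished between check and read: nothing left to
        # kill, so report success with no PIDs.
        return 0
    fi
    cat "$procs_file" 2>/dev/null || true
}

# Demo against a fake cgroup directory:
fake=$(mktemp -d)/step_extern
mkdir -p "$fake"
printf '1234\n5678\n' > "$fake/cgroup.procs"
cgroup_procs "$fake"        # prints 1234 and 5678
rm -rf "$fake"
cgroup_procs "$fake" && echo "missing cgroup treated as empty"
```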
Comment 5 Jason Booth 2022-04-18 12:58:00 MDT
> I am not sure that what you are saying makes much sense in this context. It is 
> failing to read cgroup.procs file - it does not exist at that point. No extra 
> timeout will help here. If anything this looks like some kind of a 
> timing/synchronization problem and it looks to me like it should treat this case 
> as a non fatal error. If the file is missing then that cgroup was probably 
> destroyed in the meantime.

There are a few recent fixes that may address the "timing/synchronization problem" and I will have Carlos look into this for you.
Comment 6 Carlos Tripiana Montes 2022-04-21 06:33:02 MDT
Josko,

I have a couple of potential reasons that could have caused this:

https://github.com/SchedMD/slurm/commit/1ddef9a0dd8
https://github.com/schedMD/slurm/commit/91bd26c4817

I'd encourage you to cherry-pick those and see whether the issue is still reproducible with them applied.
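In case it helps, the cherry-pick mechanics look like this. The throwaway repo below only demonstrates the flow; in your real 21.08 source tree you would simply run `git cherry-pick 1ddef9a0dd8` and `git cherry-pick 91bd26c4817` on your build branch.

```shell
#!/bin/sh
# Toy demo of cherry-picking a fix commit from one branch onto another.
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.email demo@example.com && git config user.name demo
echo base > file.c && git add file.c && git commit -qm "base"
git checkout -qb fix-branch
echo fixed > file.c && git commit -qam "fix: cgroup teardown race"
fix=$(git rev-parse HEAD)
git checkout -q -                     # back to the original branch
git cherry-pick "$fix" >/dev/null     # same flow as: git cherry-pick 1ddef9a0dd8
cat file.c                            # prints "fixed"
```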

If you still have the same issue, please let me know and we'll request additional information from you so we can better understand this flaw (if it is indeed new).

Cheers,
Carlos.
Comment 7 Carlos Tripiana Montes 2022-04-27 07:08:11 MDT
Josko,

Have you had some time to take a look at this?

Thanks,
Carlos.
Comment 8 Carlos Tripiana Montes 2022-05-12 01:19:07 MDT
Josko,

We are closing the issue as "info given" for now, assuming the information provided in Comment 6 was enough.

Please, let us know if you need further assistance.

Regards,
Carlos.
Comment 9 Carlos Tripiana Montes 2022-05-13 01:04:44 MDT
Closing now. Please reopen if needed.