Summary: | Job steps left running if proctrack_p_wait is unable to remove cgroup | ||
---|---|---|---|
Product: | Slurm | Reporter: | David Gloe <david.gloe> |
Component: | slurmd | Assignee: | Felip Moll <felip.moll> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | csamuel, felip.moll, james.nordby, lyeager |
Version: | 20.02.5 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | CRAY | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | Cray Internal |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | | CLE Version: | |
Version Fixed: | 21.08.0 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments: | bug11103 20.11 patch |
Description
David Gloe
2021-03-16 12:53:58 MDT
(In reply to David Gloe from comment #0)
> During a recent 48 hour endurance run on a Cray Internal system, we found
> many compute nodes with processes still running from jobs that had already
> completed. Investigation showed that the job step's cgroup still existed,
> and these error messages were seen in the slurmd log:
>
> proctrack_p_wait: Unable to destroy container 66624 in cgroup plugin, giving
> up after 128 sec
> eio_handle_mainloop: Abandoning IO 60 secs after job shutdown initiated
>
> I believe that this sequence of events caused this:
>
> 1. Lustre connectivity goes down
> 2. Application process attempts to write to Lustre and hangs in the kernel
> 3. Job gets canceled, proctrack/cgroup attempts to kill processes
> 4. proctrack/cgroup gives up killing processes and removing cgroup
> 5. Lustre connectivity is restored, processes continue running
>
> I think we could improve this situation by making the proctrack_p_wait
> timeout configurable, and/or draining the node if proctrack_p_wait is unable
> to kill the processes.

proctrack_p_wait sleeps several times waiting for the processes to end. It starts by sleeping 1 second on the first try, then 2, then 4, 8, 16, 32 and 64, for a total wait time of 126 seconds.

If you have UnkillableStepTimeout set to a value less than 126, you will get the desired effect of draining the node. If it is bigger, then you will see this situation. That's because step_terminate_monitor waits for the configured time before draining the node, and if proctrack ends earlier, the node won't be drained. See _monitor() in step_terminate_monitor.c. You can also set UnkillableStepProgram to take further action when this situation happens.

But I see your point: the proctrack_p_wait timeout in the cgroup code is hardcoded, so if UnkillableStepTimeout is greater than 126 it won't behave as one would expect. What is your UnkillableStepTimeout?

> proctrack_p_wait sleeps several times waiting for the processes to end. It
> starts sleeping 1 second in the first try, then 2, then 4, 8, 16, 32 and 64.
> That makes a total wait time of 126 seconds.
Sorry, 127 seconds is what I meant in my last comment.
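Roughly, the behavior being described is a back-off loop along these lines. This is a simplified, hypothetical sketch for illustration only; the function and variable names are made up and this is not the actual proctrack/cgroup source.

```c
/* Minimal, self-contained sketch of the hardcoded back-off described above
 * (hypothetical names; not the actual Slurm proctrack/cgroup code). The delay
 * doubles on every retry, so the total wait is capped by a fixed constant
 * rather than by UnkillableStepTimeout. */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for "are there still PIDs in the step's cgroup?" */
static bool container_has_tasks(void)
{
    return true;  /* pretend the hung task never exits, as with a D-state process */
}

static int wait_for_container(void)
{
    int delay = 1;   /* 1 + 2 + 4 + ... + 64 = 127 seconds total */
    int waited = 0;

    while (container_has_tasks()) {
        if (delay > 64) {            /* hardcoded cutoff, independent of slurm.conf */
            fprintf(stderr, "giving up after %d sec\n", waited);
            return -1;               /* step_terminate_monitor may still be waiting */
        }
        sleep(delay);
        waited += delay;
        delay *= 2;
    }
    return 0;
}

int main(void)
{
    return wait_for_container() ? 1 : 0;
}
```

The key point of the sketch is that the loop's cutoff is a constant, so whenever UnkillableStepTimeout is configured larger than that constant, proctrack gives up first and the monitor never sees an unkillable step.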
We have UnkillableStepTimeout set to 3600. We can try reducing the UnkillableStepTimeout to below 127 and see if that helps.

(In reply to Felip Moll from comment #2)
> If you have the UnkillableStepTimeout set to a value less than 126, you will
> get the desired effect of draining the node. If it is bigger, then you will
> see this situation.

Yikes! I'm fixing this on our clusters ASAP. The solution recommended by David is fine, but please also update slurm.conf(5) and/or add a warning in the slurmctld startup log about this.

(In reply to Luke Yeager from comment #5)
> (In reply to Felip Moll from comment #2)
> > If you have the UnkillableStepTimeout set to a value less than 126, you will
> > get the desired effect of draining the node. If it is bigger, then you will
> > see this situation.
> Yikes! I'm fixing this on our clusters ASAP. The solution recommended by
> David is fine, but please also update slurm.conf(5) and/or add a warning in
> the slurmctld startup log about this.

Yep, that's something I will need to fix/improve. David, let me know if changing that value fixed things for you.

Actually, I can't reproduce this. For me, the unkillable step results in the node being drained with "Kill task failed" as expected.
Relevant config:
> $ scontrol show config
> ProctrackType = proctrack/cgroup
> SelectType = select/cons_tres
> SLURM_VERSION = 20.11.5
> UnkillableStepTimeout = 180 sec
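For reference, the interim workaround discussed above amounts to keeping UnkillableStepTimeout in slurm.conf below the hardcoded proctrack wait, along the lines of the snippet below. The value shown is illustrative only, not a recommendation from this thread:

> ProctrackType=proctrack/cgroup
> UnkillableStepTimeout=120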
Luke, could you check the state of the node between the 128 seconds of proctrack_p_wait and the UnkillableStepTimeout? The behavior we're seeing is that the node goes back to idle state so that more jobs run on that node.

(In reply to David Gloe from comment #8)
> Luke, could you check the state of the node between the 128 seconds of
> proctrack_p_wait and the UnkillableStepTimeout? The behavior we're seeing is
> that the node goes back to idle state so that more jobs run on that node.

I increased the timeout from 180 sec to 600 sec and now I can see the issue that David is reporting. I'm using a slightly modified version of this trick to simulate a hung process: https://unix.stackexchange.com/a/305640.

> Mar 24 09:12:37 compute-node-735 slurm-prolog[3908899]: START user=lyeager job=104002
> Mar 24 09:12:44 compute-node-735 slurm-prolog[3909126]: END user=lyeager job=104002 after 7 seconds
> Mar 24 09:12:44 compute-node-735 slurmd[3908816]: prolog for job 104002 ran for 7 seconds
> Mar 24 09:12:46 compute-node-735 slurmd[3908816]: StepId=104002.interactive job_mem:0MB step_mem:0MB
> Mar 24 09:12:57 compute-node-735 kernel: <1>Inserting memory module
> Mar 24 09:13:06 compute-node-735 kernel: <1>going to sleep
> Mar 24 09:13:19 compute-node-735 slurmstepd[3909135]: error: *** STEP 104002.interactive ON compute-node-735 CANCELLED AT 2021-03-24T09:13:19 ***
> Mar 24 09:13:19 compute-node-735 slurmstepd[3909128]: done with job
> Mar 24 09:15:45 compute-node-735 kernel: INFO: task cat:3909163 blocked for more than 120 seconds.
> Mar 24 09:17:46 compute-node-735 kernel: INFO: task cat:3909163 blocked for more than 241 seconds.
> Mar 24 09:18:04 compute-node-735 slurmstepd[3909135]: error: proctrack_p_wait: Unable to destroy container 3909135 in cgroup plugin, giving up after 128 sec
> Mar 24 09:19:04 compute-node-735 slurmstepd[3909135]: error: eio_handle_mainloop: Abandoning IO 60 secs after job shutdown initiated
> Mar 24 09:19:06 compute-node-735 slurmstepd[3909135]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
> Mar 24 09:19:06 compute-node-735 slurmstepd[3909135]: done with job
> Mar 24 09:19:09 compute-node-735 slurm-epilog[3909392]: START user=lyeager job=104002
> Mar 24 09:19:12 compute-node-735 slurm-epilog[3909820]: END user=lyeager job=104002 after 3 seconds
> Mar 24 09:19:47 compute-node-735 kernel: INFO: task cat:3909163 blocked for more than 362 seconds.

And indeed, the node is not drained anymore with this sequence of events. That's not good. Fortunately, we already have an extra paranoid check for stray processes in our epilog (left over from some previous fire-fighting), so we are able to catch this. And we also haven't seen it in the wild yet. But thanks to David for reporting this! Good to know about!

Note that I'm seeing nearly 5 minutes between the cancellation of the job and the "Unable to destroy container" message. So it seems the threshold is actually higher than 127 seconds for some reason.

> Note that I'm seeing nearly 5 minutes between the cancellation of the job and
> the "Unable to destroy container" message. So it seems the threshold is
> actually higher than 127 seconds for some reason.
I missed that the sleep is done at the start of the loop, so it is not 127 but 256 seconds (the last wait is useless).
I am preparing a fix for this.
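As background for the reproducer log above: the "hung process" trick Luke references is essentially a kernel module whose proc read handler sleeps uninterruptibly, so a `cat` of the proc file sits in D state and ignores SIGKILL. The following is a rough, hypothetical sketch assuming a pre-5.6 kernel (file_operations-based proc API); it is not the exact module from the linked answer, and all names are made up.

```c
/* Hypothetical sketch of an "unkillable process" helper: a proc file whose
 * read handler sleeps uninterruptibly, leaving the reader in D state for the
 * duration. Assumes a pre-5.6 kernel; not the module from the linked answer. */
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/fs.h>
#include <linux/errno.h>
#include <linux/delay.h>

static ssize_t hang_read(struct file *file, char __user *buf,
                         size_t count, loff_t *ppos)
{
    pr_info("going to sleep\n");
    msleep(5 * 60 * 1000);   /* msleep() is uninterruptible: reader sits in D state */
    return 0;                /* EOF once the sleep finally ends */
}

static const struct file_operations hang_fops = {
    .owner = THIS_MODULE,
    .read  = hang_read,
};

static int __init hang_init(void)
{
    pr_info("Inserting memory module\n");
    return proc_create("hangme", 0444, NULL, &hang_fops) ? 0 : -ENOMEM;
}

static void __exit hang_exit(void)
{
    remove_proc_entry("hangme", NULL);
}

module_init(hang_init);
module_exit(hang_exit);
MODULE_LICENSE("GPL");
```

Built against matching kernel headers, `insmod` of such a module followed by a `cat` of its proc file from inside a job step gives a process that cannot be killed until the sleep expires, which is what produces the "task cat ... blocked" messages in the log.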
Just curious - how's that fix going?

Created attachment 20443 [details]
bug11103 20.11 patch

Hi,

Sorry for not having updated you before. There has been a fix pending review for some time, but with 21.08 approaching and other critical issues, the QA team hasn't pushed it in yet. It will probably be part of 21.08 but not fixed in 20.11.

In case you're really interested in the fix, I'm handing over this standalone patch. It applies cleanly to 20.11; if you need instructions on how to apply it, let me know. I will let you know when the fix is officially in our production release.

Hi David,

First of all, sorry for the delay in resolving this bug. A fix has finally been included in the latest 21.08.0 release, so it will be in from now on. The change applies to the cgroup and pgid plugins.

There is a small change in logic (which should not cause any negative impact): the delay between retries is now at most 32 seconds, since it made no sense to wait up to 127 seconds between retries. Also, the final SIGKILL, whose status was not checked, won't run if the timeout is reached.

commit dbb9fa27b6d91b0bbbc5b28267bd1dc135f9b353
Author:     Felip Moll <felip.moll@schedmd.com>
AuthorDate: Tue May 18 14:36:55 2021 +0200
Commit:     Tim Wickberg <tim@schedmd.com>
CommitDate: Wed Aug 25 09:22:20 2021 -0600

    Wait UnkillableStepTimeout in proctrack_g_wait()

    There was a hardcoded value of 256 seconds in proctrack_g_wait() in the
    pgid and cgroup plugins that did not match the wait of the
    step_terminate_monitor. That caused values of UnkillableStepTimeout
    larger than 256 to not drain the node when an unkillable step existed,
    because proctrack finished before this 256 second timeout and the
    monitor was terminated.

    With this fix, proctrack_g_wait() will wait UnkillableStepTimeout
    instead. Also change the delay between tests to a maximum of 32
    seconds, instead of exponentially increasing it every time.

    Bug 11103

Thanks for your patience. I am closing this as resolved.
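As a rough sketch of the behavior the commit message describes (hypothetical code, not the actual patch): the wait is now bounded by UnkillableStepTimeout rather than a hardcoded total, and the per-retry delay is capped at 32 seconds.

```c
/* Sketch of the fixed behavior described in the commit message above
 * (hypothetical names; not the actual patch). The total wait is bounded by
 * UnkillableStepTimeout and the per-retry delay is capped at 32 seconds. */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static bool container_has_tasks(void)
{
    return true;  /* pretend the hung task never exits */
}

static int wait_for_container(int unkillable_step_timeout)
{
    int delay = 1;
    int waited = 0;

    while (container_has_tasks()) {
        if (waited >= unkillable_step_timeout) {
            /* Times out in step with step_terminate_monitor, so the node is
             * drained with "Kill task failed" as expected. */
            fprintf(stderr, "giving up after %d sec\n", waited);
            return -1;
        }
        sleep(delay);
        waited += delay;
        if (delay < 32)
            delay *= 2;      /* back off, but never wait more than 32 s per retry */
    }
    return 0;
}

int main(void)
{
    return wait_for_container(180 /* e.g. UnkillableStepTimeout from slurm.conf */) ? 1 : 0;
}
```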