Ticket 20468

Summary: RFE: do not immediately SIGKILL all processes in a task when the main process exits
Product: Slurm Reporter: Felix Abecassis <fabecassis>
Component: slurmstepd Assignee: Ben Glines <ben.glines>
Status: RESOLVED FIXED QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: ben.glines, bnabong, jbernauer, lyeager, tim
Version: 24.11.x   
Hardware: Linux   
OS: Linux   
Site: NVIDIA (PSLA) Slinky Site: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed: 25.05.0rc1
Target Release: --- DevPrio: ---

Description Felix Abecassis 2024-07-22 19:12:15 MDT
From what I have noticed experimentally and also by looking at the code in src/slurmctld/job_mgr.c, Slurm will immediately send a SIGKILL to all remaining processes when all the tasks on a node are considered finished. And a task is considered finished when the main PID, the one that was directly spawned by slurmstepd, exits.

$ time srun bash -c 'bash -c "sleep 21 & kill -9 \$$" ; sleep 3s ; pgrep -a sleep'
/usr/bin/bash: line 1: 710924 Killed                  bash -c "sleep 21 & kill -9 \$$"
710907 sleep 100000000
710925 sleep 21

real    0m4.149s
user    0m0.003s
sys     0m0.003s

And from the slurmstepd log:
[1342.0] debug2: proctrack/cgroup: proctrack_p_signal: sending process 710925 (inherited_task) signal 9


We have a few use cases where some of the subprocesses of the task serve an important purpose (for example, in a "sidecar container" type of pattern). We would like the descendants of the task not to be terminated so forcefully, so that they have a chance to clean up and to notify other nodes in the job that the application terminated.

There are probably a few ways to enable this feature. For example, I could think of the following:
- There could be a configuration option similar in behavior to GraceTime, but for subprocesses in a task: they would be sent SIGTERM and then SIGKILL after N seconds.

- There could be a new "srun" flag that waits until *all* processes in a task exit before marking the task as complete. I looked at this and it seems more challenging to implement: you need PR_SET_CHILD_SUBREAPER to be able to reap grandchild processes (and avoid having them reparented to init), but when you reap a reparented process, you need to be able to look up which local task it belonged to before it was reparented to slurmstepd.
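
As a rough illustration of the first option above (SIGTERM, then SIGKILL after N seconds), here is a minimal shell sketch. It is not Slurm code: `term_then_kill` is a hypothetical helper name, and the one-second polling interval is an assumption made for the example.

```shell
# Hypothetical helper (not part of Slurm): send SIGTERM, wait up to
# "grace" seconds for the process to exit, then escalate to SIGKILL.
term_then_kill() {
    pid=$1
    grace=$2
    kill -TERM "$pid" 2>/dev/null || return 0   # process is already gone
    waited=0
    while [ "$waited" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0  # exited within the grace period
        sleep 1
        waited=$((waited + 1))
    done
    kill -KILL "$pid" 2>/dev/null               # still alive: force termination
    return 0
}
```

Calling `term_then_kill "$pid" 10` would give the process up to 10 seconds to clean up before it is killed.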

We are open to other suggestions, of course.
Comment 1 Tim Wickberg 2024-07-23 17:15:50 MDT
Re-tagging as a potential enhancement request.

Is this something you're imagining only needing on explicitly launched steps with srun, or might also want as part of the batch step as well?

Using PR_SET_CHILD_SUBREAPER is certainly trickier, but also may be worth exploring on its own anyways.
Comment 2 Felix Abecassis 2024-07-23 18:08:52 MDT
Right now our use case is only for steps launched through srun. But we believe supporting processes in the sbatch step could be helpful too, and this way the behavior would be consistent between the sbatch step and srun steps.
Comment 3 Ben Glines 2025-05-02 16:41:38 MDT
The '--wait-for-children' option has been added to srun and will be available in 25.05. The option leverages certain cgroup features that make it possible to wait until all processes are done before tasks (and subsequently the step) finish.

Please let me know if you have any questions.

See commits here: https://github.com/SchedMD/slurm/compare/f3a3d87f3feb...0a25b5ffd297.
Comment 4 Ben Glines 2025-05-09 09:56:06 MDT
Please reopen this ticket or submit a new ticket if you run into any issues. Closing now.
Comment 5 Felix Abecassis 2025-06-26 17:13:18 MDT
Re-opening, as I would like to discuss this aspect of the feature: https://slurm.schedmd.com/srun.html#OPT_wait-for-children

> Note that if the parent process exits with a non-zero exit code, the task will end regardless of whether there are still children processes running.

After discussing with our internal users that requested this feature, we would like to discuss whether this constraint can be relaxed.

The use case is to have children processes monitor the parent process, and to give those children time to notify other nodes that a task crashed. For that to work, the children should not be killed immediately.
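
To make the monitoring pattern above concrete, here is a minimal shell sketch of such a child process. It is an illustration only: `watch_parent` is a hypothetical name, the notification command is a placeholder, and polling with `kill -0` is an assumption about how a sidecar might detect that the parent is gone. It cannot tell a crash from a clean exit.

```shell
# Hypothetical sidecar (not Slurm code): poll until the watched PID is gone,
# then run the notification command passed as the remaining arguments.
watch_parent() {
    watched_pid=$1
    shift
    while kill -0 "$watched_pid" 2>/dev/null; do
        sleep 1
    done
    "$@"    # placeholder for "notify the other nodes"
}
```

A task script could start it in the background with something like `watch_parent $$ ./notify_peers.sh &`, where `./notify_peers.sh` is a hypothetical script. The watcher only gets to run the notification if it is not killed as soon as the parent exits, which is the point of the request.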

The suggestion from a user, and I think it makes sense, is to have a special case when both --no-kill and --wait-for-children are set. In this case we would wait for all children processes even if the exit code was non-zero for the parent.

If --no-kill is not set, then we would keep the current behavior.

What do you think?
Comment 6 Ben Glines 2025-07-03 15:44:33 MDT
Hi Felix,

We're currently discussing this internally. I'll get back to you ASAP on how we want to handle this.
Comment 9 Tim Wickberg 2025-07-07 15:29:33 MDT
Hey Felix -

I'm not sure I want to overload --no-kill in this way, that flag has a lot of other impacts that aren't necessarily directly tied here.

It looks like --kill-on-bad-exit=0 would be suitable. However, in my quick testing it seems like that's only respected client-side right now, and we'd need to adjust some code paths to suit. I'll ask Ben to take a look at whether we could patch that in, or if this might also require some RPC changes to support.

- Tim
Comment 11 Felix Abecassis 2025-08-01 10:11:44 MDT
Hello Tim,

Do you have any update on this? Regarding modifying the behavior when --kill-on-bad-exit=0 is used.

Thanks
Comment 12 Ben Glines 2025-08-04 16:01:47 MDT
(In reply to Felix Abecassis from comment #11)
> Hello Tim,
> 
> Do you have any update on this? Regarding modifying the behavior when
> --kill-on-bad-exit=0 is used.
> 
> Thanks

We have a patch set that uses --kill-on-bad-exit to change the behavior of --wait-for-children with regard to the main process exit: if --kill-on-bad-exit=0, a non-zero exit code from the main process is ignored, and if --kill-on-bad-exit=1, the task is ended when the main process ends with a non-zero exit code. The patch set is currently under review and I'll let you know about any status updates on it.
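
The rule described above can be summarized as a small decision function. This is only a model of the behavior stated in this comment, assuming --wait-for-children is set; it is not taken from the Slurm source, and `on_main_exit` is a hypothetical name.

```shell
# Model of the described behavior with --wait-for-children set
# (hypothetical helper, not Slurm source code).
# Usage: on_main_exit <main process exit code> <kill-on-bad-exit value: 0 or 1>
on_main_exit() {
    exit_code=$1
    kill_on_bad_exit=$2
    if [ "$exit_code" -ne 0 ] && [ "$kill_on_bad_exit" -eq 1 ]; then
        echo end-task             # non-zero exit ends the task right away
    else
        echo wait-for-children    # keep waiting for remaining processes
    fi
}
```

For example, a step launched with `srun --wait-for-children --kill-on-bad-exit=0` corresponds to the second argument being 0, so `on_main_exit 1 0` prints `wait-for-children`.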

Please let me know if you have any questions or concerns.
Comment 13 Felix Abecassis 2025-08-05 08:41:19 MDT
Thanks Ben!

Are you planning to add the patchset to 25.05 or just to master/25.11?
Comment 14 Ben Glines 2025-08-05 09:11:01 MDT
(In reply to Felix Abecassis from comment #13)
> Thanks Ben!
> 
> Are you planning to add the patchset to 25.05 or just to master/25.11?

We are considering adding it to 25.05 as well, but I can't guarantee that yet. I'll let you know what is decided during the review process.
Comment 15 Ben Glines 2025-08-05 14:33:10 MDT
We've added this change to --wait-for-children to 25.11, as well as 25.05. The earliest 25.05 version to see this change should be 25.05.2. See the following commits for each respective version:

25.11:
https://github.com/SchedMD/slurm/compare/d1ebc21f15b...5f191368bb7

25.05:
https://github.com/SchedMD/slurm/compare/d0bb9b329ef...66289192195

Let me know if you have any questions.
Comment 16 Ben Glines 2025-09-02 09:30:02 MDT
Closing.