From what I have observed experimentally, and from reading the code in src/slurmctld/job_mgr.c, Slurm immediately sends SIGKILL to all remaining processes once all the tasks on a node are considered finished. A task is considered finished when its main PID, the one spawned directly by slurmstepd, exits.

$ time srun bash -c 'bash -c "sleep 21 & kill -9 \$$" ; sleep 3s ; pgrep -a sleep'
/usr/bin/bash: line 1: 710924 Killed bash -c "sleep 21 & kill -9 \$$"
710907 sleep 100000000
710925 sleep 21

real 0m4.149s
user 0m0.003s
sys 0m0.003s

And from the slurmstepd log:

[1342.0] debug2: proctrack/cgroup: proctrack_p_signal: sending process 710925 (inherited_task) signal 9

We have a few use cases where some subprocesses of the task serve an important purpose (for example in a "sidecar container" type of pattern). We would like the descendants of the task not to be terminated so forcefully, so that they have a chance to clean up and to notify other nodes in the job that the application terminated.

There are probably a few ways to enable this feature. For example, I could think of the following:

- There could be a configuration option similar in behavior to GraceTime, but for subprocesses in a task: they would be sent SIGTERM and then SIGKILL after N seconds.
- There could be a new "srun" flag that waits until *all* processes in a task exit before marking the task as complete. I looked at this and it seems more challenging to implement: you need PR_SET_CHILD_SUBREAPER to be able to reap grandchild processes (and avoid having them reparented to init), but when you reap a reparented process you need to be able to look up which local task it belonged to before it was reparented to slurmstepd.

We are open to other suggestions, of course.
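To make the first option concrete, here is a minimal sketch of the "SIGTERM, then SIGKILL after N seconds" pattern. It is written in Python only for illustration and is not how slurmstepd signals processes; the function name terminate_with_grace and its grace_seconds parameter are hypothetical.

```python
import signal
import subprocess


def terminate_with_grace(proc: subprocess.Popen, grace_seconds: float) -> int:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns the child's return code. On POSIX, subprocess reports a child
    killed by signal N as the negative value -N.
    """
    if proc.poll() is not None:
        # The process already exited, so there is nothing to signal.
        return proc.returncode

    # Polite request first: the process may run its cleanup handlers.
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # Grace period expired: force termination (SIGKILL on POSIX).
        proc.kill()
        return proc.wait()
```

A subprocess that handles SIGTERM gets up to grace_seconds to notify peers and exit on its own; one that ignores the signal is still killed once the grace period expires.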
Re-tagging as a potential enhancement request. Do you imagine needing this only on steps explicitly launched with srun, or might you also want it as part of the batch step? Using PR_SET_CHILD_SUBREAPER is certainly trickier, but may also be worth exploring on its own anyway.
Right now our use case is only for steps launched through srun. But we believe supporting processes in the sbatch step could be helpful too, and the behavior would then be consistent between the sbatch step and srun steps.
The '--wait-for-children' option has been added to srun and will be available in 25.05. The option leverages certain cgroup features that make it possible to wait until all processes are done before tasks (and subsequently the step) are considered finished. Please let me know if you have any questions. See commits here: https://github.com/SchedMD/slurm/compare/f3a3d87f3feb...0a25b5ffd297.
Please reopen this ticket or submit a new ticket if you run into any issues. Closing now.
Re-opening, as I would like to discuss this aspect of the feature:
https://slurm.schedmd.com/srun.html#OPT_wait-for-children

> Note that if the parent process exits with a non-zero exit code, the task will end regardless of whether there are still children processes running.

After discussing with the internal users who requested this feature, we would like to ask whether this constraint can be relaxed. The use case is to have children processes monitor the parent process, and to give those children time to notify other nodes that a task crashed, so they should not be killed immediately.

A user suggested, and I think it makes sense, a special case for when both --no-kill and --wait-for-children are set: in that case we would wait for all children processes even if the parent's exit code was non-zero. If --no-kill is not set, we would keep the current behavior.

What do you think?
Hi Felix, We're currently discussing this internally. I'll get back to you ASAP on how we want to handle this.
Hey Felix - I'm not sure I want to overload --no-kill in this way; that flag has a lot of other impacts that aren't directly tied to this. It looks like --kill-on-bad-exit=0 would be suitable. In my quick testing, though, it seems that option is only respected client-side right now, and we'd need to adjust some code paths to suit. I'll ask Ben to take a look at whether we could patch that in, or whether this might also require some RPC changes to support.

- Tim
Hello Tim, Do you have any update on this? Regarding modifying the behavior when --kill-on-bad-exit=0 is used. Thanks
(In reply to Felix Abecassis from comment #11)
> Hello Tim,
>
> Do you have any update on this? Regarding modifying the behavior when
> --kill-on-bad-exit=0 is used.
>
> Thanks

We have a patch set that uses --kill-on-bad-exit to change how --wait-for-children treats the exit of the main process: with --kill-on-bad-exit=0, a non-zero exit code from the main process is ignored; with --kill-on-bad-exit=1, the task ends when the main process exits with a non-zero exit code.

The patch set is currently under review, and I'll let you know about any status updates on it. Please let me know if you have any questions or concerns.
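As a reading aid, the rule described above can be modeled as a small predicate. This is only a sketch of the behavior as stated in this ticket, not Slurm code; the function and parameter names are hypothetical, and it assumes --wait-for-children is in effect.

```python
from typing import Optional


def task_should_end(main_exit_code: Optional[int],
                    children_running: bool,
                    kill_on_bad_exit: bool) -> bool:
    """Model of when a task ends under --wait-for-children.

    main_exit_code is None while the main process is still running.
    kill_on_bad_exit stands for --kill-on-bad-exit=1 (True) or =0 (False).
    """
    if main_exit_code is None:
        # The main process has not exited yet, so the task is still live.
        return False
    if not children_running:
        # Main process exited and nothing else is left to wait for.
        return True
    # Children are still running: end the task early only when the main
    # process failed and --kill-on-bad-exit=1 was requested.
    return main_exit_code != 0 and kill_on_bad_exit
```

For example, a main process that exits with code 1 while a sidecar is still running ends the task with --kill-on-bad-exit=1, but not with --kill-on-bad-exit=0.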
Thanks Ben! Are you planning to add the patchset to 25.05 or just to master/25.11?
(In reply to Felix Abecassis from comment #13)
> Thanks Ben!
>
> Are you planning to add the patchset to 25.05 or just to master/25.11?

We are considering adding it to 25.05 as well, but I can't guarantee that yet. I'll let you know what is decided during the review process.
We've added this change to --wait-for-children to 25.11, as well as 25.05. The earliest 25.05 version to see this change should be 25.05.2. See the following commits for each respective version:

25.11: https://github.com/SchedMD/slurm/compare/d1ebc21f15b...5f191368bb7
25.05: https://github.com/SchedMD/slurm/compare/d0bb9b329ef...66289192195

Let me know if you have any questions.
Closing.