| Summary: | RFE: pam_finish should be called after spank_fini | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Felix Abecassis <fabecassis> |
| Component: | slurmstepd | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | CC: | bnabong, jbernauer, lyeager, tim |
| Version: | 24.05.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NVIDIA (PSLA) | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | 25.05.0rc1 | |
| Target Release: | 25.05 | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | v1 | ||
|
Description
Felix Abecassis
2024-03-19 20:52:15 MDT
Hi Felix, I'm investigating this now and will get back to you soon. Making the bug public again. As we discussed in a meeting, I ended up modifying pyxis to do the container cleanup in slurm_spank_task_exit to sidestep the issue: https://github.com/NVIDIA/pyxis/commit/f23bafb20fa971d36beaba37bd5c77953d49c073 It looks like a good solution for this problem, so closing this bug. I'm reopening this bug, I found out that the change I introduced in pyxis is causing errors in some cases where there is a fairly complex processes hierarchy. It's actually related to the behavior I describe in my other RFE https://support.schedmd.com/show_bug.cgi?id=20468: I now realize that I cannot remove the container filesystem in the last call to slurm_spank_task_exit(), because it just means that the main PID has exited for each local task in the job step, but there could still be children that are currently executing at this point. The SIGKILL to the remaining children is sent *after* all calls to slurm_spank_task_exit() completed, so I really need to have the container cleanup in slurm_spank_exit() instead, like it used to be. So please reconsider my original request in this bug. Adding another SPANK callback between task_exit() and exit() would also work, if this callback guarantees that all PIDs were removed *and* if it's called before pam_finish(). Also, while RFE https://support.schedmd.com/show_bug.cgi?id=20468 could actually help fix this problem by waiting on all children, it won't be the default behavior, and I would like to have a solution that works in all cases. Re-targeting as a potential dev project. I can't safely move the existing hooks around - there may well be plugins reliant on the existing sequence. Instead what I'd discussed with Felix is potentially adding an additional SPANK hook in the desired location. That would be fine for us. Here are the requirements for a new SPANK callback: * Has to be after all processes exited (hence after the SIGKILL of all leftover tasks), that's not the case for task_exit as it's only the exit of the parent PID of the task. * Has to be synchronous with regard to the step ending. slurmstepd exit() is executed asynchronously so there is a potential race condition with the execution of the next step on the same node. * Has to be before pam_finish is called in order to be able to access the filesystem on behalf of the user. Created attachment 40496 [details]
v1
Felix -
Can you check if the attached change would solve your problems? It's the same location a new API call would land if we continue down that path.
I've been staring at this one last time, and I'm willing to concede that the number of sites mixing - and cross-tying - SPANK plus PAM stacks together is close to zero, and this is probably as low a risk change as we can make to SPANK.
I'd still want to propose this adjustment as NRE - our main risk is in needing to further adapt if other users discover issues on upgrading to 25.05, and I'd still need to have someone audit our cache of public SPANK plugins for potential problems before landing this.
- Tim
I couldn't fully test the patch it in the same environment than the one where we noticed the initial problem (because it's a complex setup). But yes I think this patch would work.
> It's the same location a new API call would land if we continue down that path.
I wouldn't be against introducing the new API call anyway as it would make the container cleanup / container export cleaner.
What I currently have:
- task_exit: container cleanup + container save (--container-save)
With this patch I will go back to doing the following:
- task_exit: container save
- exit: container cleanup
However the problem of having children processes still executing after the last call to task_exit is also a problem for container save (but less critical than for container cleanup). And container save can't be in exit() because the exit() is async so this would be a race condition: the second srun would run concurrently to ubuntu.sqsh being created if I moved the save to exit():
$ srun --container-image=ubuntu --container-save=ubuntu.sqsh true
$ srun --container-image=ubuntu.sqsh ...
Hence a new API call would allow to perform safe cleanup / container export when all PIDs of the task exited.
(In reply to Felix Abecassis from comment #9) > That would be fine for us. > > Here are the requirements for a new SPANK callback: > * Has to be after all processes exited (hence after the SIGKILL of all > leftover tasks), that's not the case for task_exit as it's only the exit of > the parent PID of the task. > * Has to be synchronous with regard to the step ending. slurmstepd exit() is > executed asynchronously so there is a potential race condition with the > execution of the next step on the same node. > * Has to be before pam_finish is called in order to be able to access the > filesystem on behalf of the user. I'd missed the implication of your request that the new hook be added "after all processes exited" but "before pam_finish()". The patch I'd proposed doesn't necessarily accomplish that - that's the same location I'd have added the new callback. But process cleanup isn't necessarily complete until after proctrack_g_destroy(). The implications of moving both spank_fini() and pam_finish() that far back are a lot harder to sort through, and would require a lot more research before we could commit to anything. If switching the order of pam_finish() vs. spank_fini() is still helpful, that's something we can do for 25.05. If a new SPANK callback is introduced between task_exit() and exit(), the calls to pam_finish and spank_fini don't need to be moved at all. It's in the pyxis plugin that I would move the logic around.
> If switching the order of pam_finish() vs. spank_fini() is still helpful
Yes the patch is still useful for 25.05, as it would allow us to fix the problem initially reported here. My point in the previous comment is that I think there is still room for a new callback executed when all tasks are guaranteed to be finished.
I've pushed the re-ordering of pam_finish() vs spank_fini() to master. It'll be included in 25.05 when released. (https://github.com/SchedMD/slurm/commit/8a9e4cec90) As I'd mentioned, the requirement that an additional SPANK hook exist before pam_fini() but after all tasks have completed would require deferring pam_finish() much later in the process cleanup which has unclear ramifications, and is not something I'd like to consider at this point in time. - Tim |