Ticket 19362

Summary:	RFE: pam_finish should be called after spank_fini
Product:	Slurm	Reporter:	Felix Abecassis <fabecassis>
Component:	slurmstepd	Assignee:	Tim Wickberg <tim>
Status:	RESOLVED FIXED	QA Contact:
Severity:	5 - Enhancement
Priority:	---	CC:	bnabong, jbernauer, lyeager, tim
Version:	24.05.3
Hardware:	Linux
OS:	Linux
Site:	NVIDIA (PSLA)	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	25.05.0rc1
Target Release:	25.05	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	v1

Description Felix Abecassis 2024-03-19 20:52:15 MDT

Filing on behalf of my teammate Aurelien Degremont.

In job_manager(), slurmstepd is calling pam_finish() before spank_fini() triggering a situation where PAM could be revoking security credentials that could be needed by spank plugins.

Likewise what commit 7b55fb8 (https://github.com/SchedMD/slurm/commit/7b55fb8) did, the pam_finish() calls should be done after any use cases for security contexts, like doing I/O on secured filesystems, have been handled. That commit moved the call to pam_finish() later in job_manager however, it did not move late enough to encompass the spank_fini() call. In the current situation, SPANK plugins could not do I/O calls requiring user credentials.

Comment 1 Ben Glines 2024-03-20 12:07:57 MDT

Hi Felix,

I'm investigating this now and will get back to you soon.

Comment 3 Felix Abecassis 2024-03-21 10:45:22 MDT

Making the bug public again.

Comment 6 Felix Abecassis 2024-04-26 10:22:35 MDT

As we discussed in a meeting, I ended up modifying pyxis to do the container cleanup in slurm_spank_task_exit to sidestep the issue:
https://github.com/NVIDIA/pyxis/commit/f23bafb20fa971d36beaba37bd5c77953d49c073

It looks like a good solution for this problem, so closing this bug.

Comment 7 Felix Abecassis 2025-01-08 20:11:01 MST

I'm reopening this bug, I found out that the change I introduced in pyxis is causing errors in some cases where there is a fairly complex processes hierarchy.

It's actually related to the behavior I describe in my other RFE https://support.schedmd.com/show_bug.cgi?id=20468: I now realize that I cannot remove the container filesystem in the last call to slurm_spank_task_exit(), because it just means that the main PID has exited for each local task in the job step, but there could still be children that are currently executing at this point. The SIGKILL to the remaining children is sent *after* all calls to slurm_spank_task_exit() completed, so I really need to have the container cleanup in slurm_spank_exit() instead, like it used to be.

So please reconsider my original request in this bug. Adding another SPANK callback between task_exit() and exit() would also work, if this callback guarantees that all PIDs were removed *and* if it's called before pam_finish().

Also, while RFE https://support.schedmd.com/show_bug.cgi?id=20468 could actually help fix this problem by waiting on all children, it won't be the default behavior, and I would like to have a solution that works in all cases.

Comment 8 Tim Wickberg 2025-01-22 12:29:22 MST

Re-targeting as a potential dev project.

I can't safely move the existing hooks around - there may well be plugins reliant on the existing sequence.

Instead what I'd discussed with Felix is potentially adding an additional SPANK hook in the desired location.

Comment 9 Felix Abecassis 2025-01-22 12:36:36 MST

That would be fine for us.

Here are the requirements for a new SPANK callback:
* Has to be after all processes exited (hence after the SIGKILL of all leftover tasks), that's not the case for task_exit as it's only the exit of the parent PID of the task.
* Has to be synchronous with regard to the step ending. slurmstepd exit() is executed asynchronously so there is a potential race condition with the execution of the next step on the same node.
* Has to be before pam_finish is called in order to be able to access the filesystem on behalf of the user.

Comment 10 Tim Wickberg 2025-01-24 16:35:51 MST

Created attachment 40496 [details]
v1

Felix -

Can you check if the attached change would solve your problems? It's the same location a new API call would land if we continue down that path.

I've been staring at this one last time, and I'm willing to concede that the number of sites mixing - and cross-tying - SPANK plus PAM stacks together is close to zero, and this is probably as low a risk change as we can make to SPANK.

I'd still want to propose this adjustment as NRE - our main risk is in needing to further adapt if other users discover issues on upgrading to 25.05, and I'd still need to have someone audit our cache of public SPANK plugins for potential problems before landing this.

- Tim

Comment 11 Felix Abecassis 2025-01-28 04:33:07 MST

I couldn't fully test the patch it in the same environment than the one where we noticed the initial problem (because it's a complex setup). But yes I think this patch would work.

> It's the same location a new API call would land if we continue down that path.

I wouldn't be against introducing the new API call anyway as it would make the container cleanup / container export cleaner. 

What I currently have:
- task_exit: container cleanup + container save (--container-save)

With this patch I will go back to doing the following:
- task_exit: container save
- exit: container cleanup

However the problem of having children processes still executing after the last call to task_exit is also a problem for container save (but less critical than for container cleanup). And container save can't be in exit() because the exit() is async so this would be a race condition: the second srun would run concurrently to ubuntu.sqsh being created if I moved the save to exit():
$ srun --container-image=ubuntu --container-save=ubuntu.sqsh true
$ srun --container-image=ubuntu.sqsh ...

Hence a new API call would allow to perform safe cleanup / container export when all PIDs of the task exited.

Comment 12 Tim Wickberg 2025-01-31 16:36:19 MST

(In reply to Felix Abecassis from comment #9)
> That would be fine for us.
> 
> Here are the requirements for a new SPANK callback:
> * Has to be after all processes exited (hence after the SIGKILL of all
> leftover tasks), that's not the case for task_exit as it's only the exit of
> the parent PID of the task.
> * Has to be synchronous with regard to the step ending. slurmstepd exit() is
> executed asynchronously so there is a potential race condition with the
> execution of the next step on the same node.
> * Has to be before pam_finish is called in order to be able to access the
> filesystem on behalf of the user.

I'd missed the implication of your request that the new hook be added "after all processes exited" but "before pam_finish()".

The patch I'd proposed doesn't necessarily accomplish that - that's the same location I'd have added the new callback.

But process cleanup isn't necessarily complete until after proctrack_g_destroy().

The implications of moving both spank_fini() and pam_finish() that far back are a lot harder to sort through, and would require a lot more research before we could commit to anything.

If switching the order of pam_finish() vs. spank_fini() is still helpful, that's something we can do for 25.05.

Comment 13 Felix Abecassis 2025-02-01 01:14:37 MST

If a new SPANK callback is introduced between task_exit() and exit(), the calls to pam_finish and spank_fini don't need to be moved at all. It's in the pyxis plugin that I would move the logic around.

> If switching the order of pam_finish() vs. spank_fini() is still helpful

Yes the patch is still useful for 25.05, as it would allow us to fix the problem initially reported here. My point in the previous comment is that I think there is still room for a new callback executed when all tasks are guaranteed to be finished.

Comment 14 Tim Wickberg 2025-02-07 10:20:08 MST

I've pushed the re-ordering of pam_finish() vs spank_fini() to master. It'll be included in 25.05 when released.

(https://github.com/SchedMD/slurm/commit/8a9e4cec90)

As I'd mentioned, the requirement that an additional SPANK hook exist before pam_fini() but after all tasks have completed would require deferring pam_finish() much later in the process cleanup which has unclear ramifications, and is not something I'd like to consider at this point in time.

- Tim