Ticket 4298

Summary: slurm_spank_task_init_privileged context broken
Product: Slurm Reporter: John Thiltges <jthiltges2>
Component: slurmstepdAssignee: Danny Auble <da>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: josh
Version: 17.02.8   
Hardware: Linux   
OS: Linux   
Site: University of Nebraska–Lincoln Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 17.02.9 17.11.0-rc2 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description John Thiltges 2017-10-24 10:55:35 MDT
Commit 7e5d3d15e7 for bug 3531 appears to have broken the slurm_spank_task_init_privileged context.

Previously, _pre_task_privileged() was called inside the child process:
https://github.com/SchedMD/slurm/blob/slurm-17-02-6-1/src/slurmd/slurmstepd/mgr.c#L1736

Now it's being called later, and runs as the slurmstepd process:
https://github.com/SchedMD/slurm/blob/slurm-17-02-8-1/src/slurmd/slurmstepd/mgr.c#L1800

This breaks our per-job TMPDIR, as the bind mount is applied to slurmstepd, rather than the child processes.
Comment 5 Danny Auble 2017-10-26 16:26:59 MDT
Sorry for the delay, we are looking into this.  I know (as you probably do as well) this was to avoid a race condition with other plugin threads.  I'll see what can be done.
Comment 6 Danny Auble 2017-10-30 10:55:30 MDT
Thanks for bringing up the use case John.  We think you are right and out previous "safe" attitude was incorrect.

I have reverted this commit and changed the name of _pre_task_privileged to be _pre_task_child_privileged to avoid future confusions.  This will be in 17.02.9 commit 5447fbd8b5bd.
Comment 7 John Thiltges 2017-11-08 09:57:19 MST
Dear Danny,

We've updated to 17.02.9 and slurm_spank_task_init_privileged appears back to normal.

My understanding of the flow was based on the handy diagram in slurm/spank.h. I think this new commit makes the logic match the diagram.

Many thanks for your help.

Best regards,
John