Ticket 12403 - Document interactions between job_container/tmpfs and SPANK plugins, prologs
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Documentation
Version: 21.08.0
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Tim McMullan
QA Contact: Ben Roberts
Reported: 2021-08-31 12:31 MDT by Luke Yeager
Modified: 2022-05-16 08:00 MDT

Site: NVIDIA (PSLA)
Version Fixed: 21.08.9 22.05rc2


Description Luke Yeager 2021-08-31 12:31:23 MDT
I'm trying to populate a file in /tmp/ for a job, using any or all of a SPANK plugin, a Prolog, or a TaskProlog. From my testing, here is the effect of running 'mkdir /tmp/foo' from each of the options available to me:

Location                                 Effect
--------                                 ------
Prolog                                   Job /tmp/
TaskProlog                               Job /tmp/
slurm_spank_init(remote)                 OS /tmp/
slurm_spank_init_post_opt(remote)        OS /tmp/
slurm_spank_task_post_fork(remote)       OS /tmp/
slurm_spank_user_init(remote)            OS /tmp/
slurm_spank_task_init_privileged(remote) Job /tmp/
slurm_spank_task_init(remote)            Job /tmp/

I might prefer to see user_init() contained, too, but other than that I don't see anything obviously _wrong_ about that table. However, it was certainly not obvious to me what the behavior would be before I checked. I don't see any indication of how this works in slurm/spank.h, nor in job_container.conf.html. Would you please document somewhere how this is expected to work, or point me to the documentation if I missed it?
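
One way to reproduce a table like this without inferring the result from where a test directory ends up is to record the identity of /tmp from each hook and compare the records afterwards. The following is a minimal sketch in Python, assuming the hook is able to run a script (a Prolog or TaskProlog can; a SPANK callback would need the equivalent stat() call in C). The function names are illustrative and are not part of Slurm.

```python
import os


def tmp_identity(path="/tmp"):
    """Return (device, inode) of the directory currently visible at `path`.

    Two hooks that report different values are looking at different
    directories, for example the OS /tmp versus a per-job /tmp.
    """
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def same_tmp(a, b):
    """True when two recorded identities refer to the same directory."""
    return tuple(a) == tuple(b)


def format_record(hook_name, identity):
    """Build one log line; write it somewhere outside /tmp so that
    records from every hook end up in the same place."""
    dev, ino = identity
    return f"{hook_name} dev={dev} ino={ino}"
```

Calling format_record("TaskProlog", tmp_identity()) from each hook and collecting the lines in a shared log makes it straightforward to see which hooks share a /tmp.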
Comment 1 Luke Yeager 2021-08-31 13:26:01 MDT
To clarify that table, the "Prolog" and "TaskProlog" rows aren't related to spank. I didn't test slurm_spank_job_prolog().
Comment 2 Tim McMullan 2021-08-31 13:34:16 MDT
Thanks for the info Luke!  I'll make sure this all gets documented!
Comment 3 Luke Yeager 2021-08-31 13:36:26 MDT
Sure! I'd try verifying that you get the same results, too, before making it official. I'm currently getting different behavior on my local machine (with --enable-multiple-slurmd) - the Prolog is running w/ the OS's /tmp/ instead.
Comment 4 Luke Yeager 2021-10-12 10:09:49 MDT
Here's what I'm seeing with 21.08.2 (so, my original table was wrong about the prolog):

Location                      Which /tmp  Want changed?
--------                      ----------  -------------
spank_job_prolog()            OS
Prolog                        OS
spank_init()                  OS
spank_init_post_opt()         OS
spank_user_init()             OS          YES
spank_task_post_fork()        OS          YES
spank_task_init_privileged()  Job
spank_task_init()             Job
TaskProlog                    Job
spank_task_exit()             OS          YES
spank_exit()                  OS
spank_job_epilog()            OS
Epilog                        OS
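
For anyone who wants to check a plugin design against these observations, the same table can be written down as data. This is only a sketch: the values are what I observed on 21.08.2, not documented behavior, and the helper names are made up for illustration.

```python
# Observed with Slurm 21.08.2 and job_container/tmpfs; not official documentation.
# Each entry: (hook, which /tmp it saw, whether a change was requested).
OBSERVED = [
    ("spank_job_prolog", "OS", False),
    ("Prolog", "OS", False),
    ("spank_init", "OS", False),
    ("spank_init_post_opt", "OS", False),
    ("spank_user_init", "OS", True),
    ("spank_task_post_fork", "OS", True),
    ("spank_task_init_privileged", "Job", False),
    ("spank_task_init", "Job", False),
    ("TaskProlog", "Job", False),
    ("spank_task_exit", "OS", True),
    ("spank_exit", "OS", False),
    ("spank_job_epilog", "OS", False),
    ("Epilog", "OS", False),
]


def hooks_seeing(which):
    """List the hooks that saw the given /tmp ("OS" or "Job"), in run order."""
    return [hook for hook, tmp, _ in OBSERVED if tmp == which]


def requested_changes():
    """List the hooks for which containment was requested."""
    return [hook for hook, _, wanted in OBSERVED if wanted]
```

Here hooks_seeing("Job") gives the three hooks that currently run inside the job's /tmp, and requested_changes() gives the three I would like to see moved there.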


1) What's the timeline on documenting this somewhere?

2) I'd like to see more of the SPANK entrypoints contained, as specified in the table above. In particular, user_init is sometimes preferable to task_init because it only runs once per node instead of once per task, but with 'job_container/tmpfs' you're stuck with task_init. Also, it's hard for plugins to clean up after themselves in task_exit from work they did in task_init when the /tmp mount has changed.
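
Until user_init is contained, one possible workaround for the once-per-node problem is to guard the setup in a per-task hook with an exclusively created marker file. The sketch below shows the idea in Python; a SPANK plugin would use open() with O_CREAT | O_EXCL in C to the same effect. The function name and the marker path are hypothetical, and this assumes the marker lives in a directory that is private to the job on that node, such as the job's /tmp.

```python
import os


def claim_once(marker_path):
    """Return True for exactly one caller; later callers get False.

    O_CREAT | O_EXCL makes the creation atomic, so two tasks starting at
    the same time on one node cannot both succeed.
    """
    try:
        fd = os.open(marker_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

A task hook would call claim_once("/tmp/.myplugin_setup_done") (a hypothetical path) and run the per-node setup only when it returns True. This does not help with the cleanup problem in task_exit, which still sees a different /tmp.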
Comment 5 Tim McMullan 2021-10-15 05:21:05 MDT
Hey Luke, sorry about the delay!

(In reply to Luke Yeager from comment #4)
> 1) What's the timeline on documenting this somewhere?

I've been looking into some inconsistencies in what job_container/tmpfs does and does not contain, so I've been holding off on documenting this until I get that sorted out.

> 2) I'd like to see more of the SPANK entrypoints contained, as specified in
> the table above. In particular, user_init is sometimes preferable to
> task_init because it only runs once per node instead of once per task, but
> with 'job_container/tmpfs' you're stuck with task_init. Also, it's hard for
> plugins to clean up after themselves in task_exit from work they did in
> task_init when the /tmp mount has changed.

The changes you describe for spank_user_init(), spank_task_post_fork(), and spank_task_exit() I expect would be an enhancement.  We should break those desired changes out into an enhancement ticket and chat with Tim (Wickberg) et al. about it.

Thanks, and sorry again about the delay!
--Tim
Comment 6 Luke Yeager 2021-10-15 08:17:04 MDT
(In reply to Tim McMullan from comment #5)
> The changes you describe for spank_user_init(), spank_task_post_fork(), and
> spank_task_exit() I expect would be an enhancement.  We should break those
> desired changes out into an enhancement ticket and chat with Tim (Wickberg)
> et al. about it.
Roger that. Bug#12672.
Comment 12 Tim McMullan 2022-05-16 08:00:15 MDT
Hey Luke,

Sorry about the delay on this, but the documentation for this landed in https://github.com/SchedMD/slurm/commit/b4893df64

I'll resolve this now since the docs landed.

Thanks!
--Tim