Ticket 12567 - job_containers fails with auto mounted directories in use
Summary: job_containers fails with auto mounted directories in use
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 21.08.2
Hardware: Linux
OS: Linux
Severity: 5 - Enhancement
Assignee: Tim McMullan
QA Contact:
URL:
Duplicates: 14344 14803 14954
Depends on:
Blocks:
 
Reported: 2021-09-28 14:16 MDT by mike coyne
Modified: 2022-12-21 11:48 MST
CC: 15 users

See Also:
Site: LANL
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: RHEL
Machine Name: kit
CLE Version:
Version Fixed: 23.02pre1
Target Release: 23.02
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
tar file with patch and sample pre and post namespace clone scripts. (20.00 KB, application/x-tar)
2021-09-28 14:16 MDT, mike coyne
Details
updated patch to add pre and post clonens scripts (5.31 KB, patch)
2021-09-29 10:51 MDT, mike coyne
Details | Diff
sync version of patch to exec a script to start and later shut down an automounter (8.28 KB, application/x-troff-man)
2021-10-05 09:15 MDT, mike coyne
Details
updated sync version for 21.08.2+ (7.78 KB, patch)
2021-10-22 08:44 MDT, mike coyne
Details | Diff
Updated patch to work with slurm 22.05.2 (8.90 KB, patch)
2022-07-20 11:58 MDT, mike coyne
Details | Diff
Updated slurm 22.05.2 using the same internal script execution for nsepilog and nsprolog (9.11 KB, application/x-troff-man)
2022-07-26 09:08 MDT, mike coyne
Details

Description mike coyne 2021-09-28 14:16:25 MDT
Created attachment 21498 [details]
tar file with patch and sample pre and post namespace clone scripts.

While working with the job_container/tmpfs plugin, we discovered that in an environment with auto-mounted home and/or project directories for the users, these would not be accessible from within the job.

Working from ticket 12361, I was able to come up with a patch to run a script after the job namespace has been created, as well as one prior to removing the /dev/shm and /tmp directories during teardown of the container.

I am attempting to run a per-job automounter instance, with the fifo and map directories compiled to live in /dev/shm:

%define fifo_dir /dev/shm
%configure --disable-mount-locking --enable-ignore-busy --with-libtirpc --without-hesiod %{?systemd_configure_arg:} --with-confdir=%{_sysconfdir} --with-mapdir=%{fifo_dir} --with-fifodir=%{fifo_dir} --with-flagdir=%{fifo_dir}  --enable-force-shutdown

I have it working with one remaining issue: the run_command function waits for the automounter to complete before it returns. Since I need to leave the automounter running in the user's job namespace until the completion of the job, I need to background it.

Is there a similar command that can be used to launch a script to set up the user's mounts?
Without this I am unable to make use of the job_container functionality, as we, like many sites, make use of automounted NFS project directories for our users.
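
For context, a minimal sketch (run as root; illustrative only, not the plugin source) of why host-side autofs mounts stop being visible once a job gets a private mount namespace:

/* mimic the isolation the job_container/tmpfs plugin sets up, then show
 * the propagation state of "/" */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
    if (unshare(CLONE_NEWNS) < 0) {     /* new mount namespace for the "job" */
        perror("unshare(CLONE_NEWNS)");
        return 1;
    }
    /* Recursively mark everything private: mounts made later in the host
     * namespace (e.g. autofs triggering a home directory) no longer
     * propagate in here, which is exactly the problem described above. */
    if (mount(NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL) < 0) {
        perror("mount MS_PRIVATE");
        return 1;
    }
    execlp("findmnt", "findmnt", "-o", "TARGET,PROPAGATION", "/", (char *)NULL);
    perror("execlp");
    return 1;
}

Either the propagation has to be relaxed, or an automounter has to run inside the namespace, which is what the attached patch does.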
Comment 1 mike coyne 2021-09-28 14:26:57 MDT
Reference: the original ticket, https://bugs.schedmd.com/show_bug.cgi?id=12361
Comment 3 mike coyne 2021-09-29 10:51:33 MDT
Created attachment 21515 [details]
updated patch to add pre and post clonens scripts

I tried changing the timeout value for run_command to -1 for the clonensscript; this seems to have got it working much better. Still a work in progress.
Comment 4 Jason Booth 2021-09-29 11:00:22 MDT
Thanks, Mike. I will have one of our engineers look over this and give you some feedback.
Comment 5 mike coyne 2021-10-05 09:15:39 MDT
Created attachment 21603 [details]
sync version of patch to exec a script to start and later shut down an automounter

Sync version of the patch, which executes the nsclone script and waits for it to exit without killing the process group underneath it; modified and "shamelessly" taken from the slurmctld daemon code. This is to make sure the automounter comes up fully prior to continuing with the job launch.

Of note: even with this set, I still see a warning about the job executing in /tmp instead of the home directory, but it does in fact execute correctly in the user's auto-mounted home directory. The check must be happening prior to, or without calling, the container entry code.
Comment 9 mike coyne 2021-10-22 08:44:33 MDT
Created attachment 21893 [details]
updated sync version for 21.08.2+

I noticed that in 21.08.2 additional calls were now being made through to the _delete_ns functions, so I moved my additional calls down into the _create_ns and _delete_ns functions. I probably should have put them there in the first place.
Comment 10 Tim McMullan 2021-12-22 15:04:10 MST
Hey Mike,

I've been looking this over and I'm a little confused why for clonensscript you have run_command commented out and use your own fork/waitpid setup, but for clonensepilog run_command is fine.  Was there some specific issue you were avoiding there that I'm missing?

Thanks!
--Tim
Comment 11 mike coyne 2021-12-22 15:12:48 MST
(In reply to Tim McMullan from comment #10)
> Hey Mike,
> 
> I've been looking this over and I'm a little confused why for clonensscript
> you have run_command commented out and use your own fork/waitpid setup, but
> for clonensepilog run_command is fine.  Was there some specific issue you
> were avoiding there that I'm missing?
> 
> Thanks!
> --Tim

Yes, I just took the run_script code from, I think, one of the slurmd mains. What run_command was doing was firing up the script async, and I was concerned about waiting for the automounter to get up and running before the mount was actually done. I wanted it to wait until the filesystems were in a good state, bind mounts were in place, etc., and then proceed with the job launch.
Mike
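
For illustration, a minimal sketch of the synchronous fork()+execv()+waitpid() pattern described here, entering the job's mount namespace with setns() before running the script and only returning once it exits. The function name and paths are illustrative, not taken from the attached patch:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* run "script jobid" inside the mount namespace referenced by ns_path and
 * block until it exits (must run as root to setns() into the namespace) */
static int run_script_in_ns_sync(const char *ns_path, const char *script,
                                 const char *jobid)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {                     /* child */
        int nsfd = open(ns_path, O_RDONLY);
        if (nsfd < 0 || setns(nsfd, CLONE_NEWNS) < 0)
            _exit(127);                 /* could not join the job namespace */
        close(nsfd);
        char *argv[] = { (char *)script, (char *)jobid, NULL };
        execv(script, argv);
        _exit(127);
    }

    int status = 0;                     /* parent: block until setup is done */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

int main(void)
{
    /* hypothetical paths: the per-job namespace file and the clonens script */
    int rc = run_script_in_ns_sync("/var/run/slurm/jobns/1234/.ns",
                                   "/etc/slurm/clonensscript.sh", "1234");
    fprintf(stderr, "clonens script exited with %d\n", rc);
    return rc == 0 ? 0 : 1;
}

The point of blocking here is exactly what is described above: the job launch should not proceed until the automounter and its bind mounts are in a known-good state.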
Comment 12 mike coyne 2021-12-22 15:29:47 MST
There may be a way to get run_command to act synchronously, but I just did not know how to do that. On the cleanup side I was not as worried about running in async mode, so I just used the regular run_command, but the sync version should have been fine as well.
Mike
Comment 13 Tim McMullan 2022-01-05 12:47:01 MST
(In reply to mike coyne from comment #12)
> there may be a way to get the run_command to act sync but i just did not
> know how to do that.. on the cleanup i was not as worried about running in a
> async mode so i just used the regular run_command but the sync version
> should have been fine as well
> Mike

Ok, thanks for the clarification there!  run_command is only async if you call it with max_wait=-1; otherwise it should be able to replace the fork()+waitpid() method (there are some small improvements still happening with run_command).
Comment 19 Tim McMullan 2022-06-28 05:39:34 MDT
Hey Mike,

I was wondering if we could make this ticket public?  We have a second ticket discussing this issue as well and I think it would be ideal to have the discussion in one spot.  I can mark any attachments you would like as private to just SchedMD, if there are any attachments you are concerned about.

Let me know either way!

Thanks,
--Tim
Comment 20 mike coyne 2022-06-28 06:47:42 MDT
I personally do not have a problem with opening this, but I do need to get it Officially Reviewed for release to the public, so please bear with me.
Comment 21 Tim McMullan 2022-06-28 10:33:35 MDT
(In reply to mike coyne from comment #20)
> I personally do not have a problem with opening this but i do need to get it
> Officially Reviewed for release to the public. So please bear with me ..

Thank you Mike, just let me know how it goes!
Comment 25 mike coyne 2022-06-30 12:47:43 MDT
The data was reviewed and is OK to be made public. I unchecked all of the "only users in the selected groups" boxes, if that helps.
Comment 26 Tim McMullan 2022-06-30 13:40:09 MDT
(In reply to mike coyne from comment #25)
> the data was reviewed and is ok for to be made public . i "unchecked" all
> the  only users in the selected boxes , if that helps

Thank you!
Comment 27 Marshall Garey 2022-06-30 13:43:33 MDT
*** Ticket 14344 has been marked as a duplicate of this ticket. ***
Comment 28 Ole.H.Nielsen@fysik.dtu.dk 2022-07-01 00:12:03 MDT
FYI: As an alternative to the job_container/tmpfs plugin, I've now built and tested the SPANK plugin https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir.
Please see https://bugs.schedmd.com/show_bug.cgi?id=14344#c16

I hope that the excellent work in that plugin can help solve the autofs issue in the job_container/tmpfs plugin.
Comment 29 staeglis 2022-07-11 06:48:42 MDT
Hi,

is there any timeline for a patch release?

Best,
Stefan
Comment 30 mike coyne 2022-07-20 11:58:38 MDT
Created attachment 25937 [details]
Updated patch to work with slurm 22.05.2

I updated my patch to add a pre and post namespace script for 22.05.2.
I did leave in the previous _run_script_in_ns I had copied, as I was not sure how to launch a script with a timeout but leave a running "service" such as automount behind, with the updated run_script functions in the 22.05 release.
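
As a sketch of that "run the setup script with a timeout, but leave a running service behind" behaviour (not the actual run_command code; the script path is hypothetical, and the script itself is expected to daemonize automount and then exit quickly):

#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

/* run script, waiting at most max_wait seconds for the script itself;
 * anything it daemonizes (e.g. a per-job automount) is left running */
static int run_with_timeout(const char *script, int max_wait)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execl(script, script, (char *)NULL);
        _exit(127);
    }

    for (int i = 0; i < max_wait * 10; i++) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        if (r < 0 && errno != EINTR)
            return -1;
        usleep(100000);                 /* poll every 100 ms */
    }

    kill(pid, SIGKILL);                 /* only the hung script is killed; */
    waitpid(pid, NULL, 0);              /* a daemonized automounter is not */
    return -1;
}

int main(void)
{
    /* hypothetical prolog that starts automount inside the job namespace */
    return run_with_timeout("/etc/slurm/nsprolog.sh", 30) == 0 ? 0 : 1;
}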
Comment 31 Ole.H.Nielsen@fysik.dtu.dk 2022-07-21 01:00:46 MDT
Hi Mike,

(In reply to mike coyne from comment #30)
> i updated my patch to add a pre and post namespace script for 22.05.2
> i did leave the previous _run_script_in_ns i had copied. as i was not sure
> how to launch a script with time out but leave a running "service" such as
> automount with the updated run_script functions in the 22.05 release

Could you kindly explain the ramifications of the updated patch?  Will the autofs filesystems work with this patch?  Will the patch be included in 22.05.3?

Thanks,
Ole
Comment 32 mike coyne 2022-07-21 08:28:13 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #31)
> Hi Mike,
> 
> (In reply to mike coyne from comment #30)
> > i updated my patch to add a pre and post namespace script for 22.05.2
> > i did leave the previous _run_script_in_ns i had copied. as i was not sure
> > how to launch a script with time out but leave a running "service" such as
> > automount with the updated run_script functions in the 22.05 release
> 
> Could you kindly explain what are the ramifications of the updated patch? 
> Will the autofs filesystems work with this patch?  Will the patch be
> included in 22.05.3?
> 
> Thanks,
> Ole

Ole,
   I have been working with Tim on a possible solution. My intent has been to demonstrate a way to get autofs working along with the tmpfs mount namespace: to demonstrate the issue and show a possible solution.

As far as the patch I just added: I have been trying to get the version I had working for 21.08 to now work on 22.05, but SchedMD has rewritten the run_command family of functions. I believe I have the initiation script working, but the clonensepilog script in the _delete_ns call may not be quite right; it seems to fire but is immediately terminated. I had set its type as initscript, but it seems to need to be something else. Hoping Tim can help me with that.

So, that said: what this patch does is fire up a root-executed shell just after the namespace is created, and another one just prior to the namespace getting destroyed. This allows me to start up a purpose-built automount within the namespace. To make automount work, I had to redirect the directory it uses for its fifos and configs into the namespace, i.e. /dev/shm or /tmp if you like; this allows the automounter to run and not step on other automounters in other namespaces. The epilog script is intended to shut the automounter down with a sig 15 (SIGTERM), to allow it to shut down cleanly on exit. What I did find was that I was able to tune the user's environment, scratch access, etc. much more easily, as we control what filesystems they can access based on a predefined access "color", if you will, per job.
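
A minimal sketch of that clean-shutdown side (the pidfile path is hypothetical and assumes the per-job automount was started with a matching pid file under /dev/shm):

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* epilog/NHC helper: ask the per-job automounter to exit cleanly (SIGTERM,
 * i.e. "sig 15", lets it unmount its own triggers), escalate if it hangs */
static int stop_automount(const char *pidfile)
{
    FILE *f = fopen(pidfile, "r");
    long pid = 0;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &pid) != 1 || pid <= 1) {
        fclose(f);
        return -1;
    }
    fclose(f);

    if (kill((pid_t)pid, SIGTERM) < 0)
        return (errno == ESRCH) ? 0 : -1;   /* already gone */

    for (int i = 0; i < 50; i++) {          /* allow up to ~5 seconds */
        if (kill((pid_t)pid, 0) < 0 && errno == ESRCH)
            return 0;
        usleep(100000);
    }
    kill((pid_t)pid, SIGKILL);              /* leftover process: force it and */
    return 1;                               /* let NHC clean up stray mounts */
}

int main(void)
{
    return stop_automount("/dev/shm/automount.pid") == 0 ? 0 : 1;
}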

Regretfully, what I have seen is that there are things that can happen that cause the job to not exit through the epilog / _destroy_ns, so it may be good to have your node health check kill any automount process not associated with a running job and unmount any leftover mounts, if any.

Any feedback, thoughts, or suggestions would be very helpful.
Mike
Comment 33 Ole.H.Nielsen@fysik.dtu.dk 2022-07-22 03:26:42 MDT
Hi Mike,

(In reply to mike coyne from comment #32)
>    i have been working with tim on a possible solution , my intent has been
> to demonstrate a way to get autofs working in along with the tmpfs mnt
> namespace. My intent is just to demonstrate the issue and show a possible
> solution . 

I understand that this patch is Work in Progress and seems to be somewhat involved.  I don't have experience with this type of software, so I can't offer any help.

On 21.08.8 we have the auto_tmpdir[1] SPANK plugin working very nicely together with our autofs NFS automounted home directories.  FWIW, the implementation in auto_tmpdir[1] may serve as a proof of concept to inspire the job_container/tmpfs plugin.  Maybe you can test auto_tmpdir[1] in your environment?

Since you're having issues on 22.05, I wonder if auto_tmpdir[1] will face such issues as well?

Thanks,
Ole

[1] https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir
Comment 34 mike coyne 2022-07-26 09:08:28 MDT
Created attachment 26032 [details]
Updated  slurm 22.05.2 using the same internal script execution for nsepilog and nsprolog

Tim, Ole .. 
I corrected my patch for 22.05.2. To make it runnable, I replaced the run_command with my _run_script_in_ns. When trying to use run_command, it would not execute; it complained that it was in the process of shutting down and refused to execute the script in _delete_ns. I assume I am not calling it correctly.

Tim, another question is about combining the tmpfs plugin and the --container option for srun.  Is the execution of "runc" within the namespace of the tmpfs plugin, or is it in parallel with that namespace?  I can seem to run one or the other, but not both. It may be that adding the automounter in the tmpfs namespace is breaking some timing?

Ole, thanks, I will take a look at your SPANK plugin.
Mike
Comment 35 Ole.H.Nielsen@fysik.dtu.dk 2022-07-26 09:29:40 MDT
I'm out of the office, back on August 15.

Best regards / Venlig hilsen,
Ole Holm Nielsen
Comment 37 Tim McMullan 2022-07-28 09:59:34 MDT
(In reply to mike coyne from comment #34)
> Created attachment 26032 [details]
> Updated  slurm 22.05.2 using the same internal script execution for nsepilog
> and nsprolog
> 
> Tim, Ole .. 
> I corrected my patch for 22.05.2 , to make it run-able i replaced the
> run_command with my _run_script_in_ns . when trying to use run_command it
> would not execute as it complained that it was in the process of shutting
> down and refused to execute the script in the _delete_ns . i assume i am not
> calling it correctly .

That is quite interesting, but at the moment I'm really trying to explore mount propagation between / and the job namespaces, since this should be a much cleaner approach.  It's also the approach some of the SPANK plugins use.  I do have a proof of concept that seems to work, but it needs a lot more testing and refinement.

> Tim, another question is the combination of using the tmp_fs and the
> --container options for srun ..  Is the execution of the "runc" withing the
> name space of the  tmp_fs or is it in parallel with that namespace . I can
> seem to run one or the other but not both.. it may be the adding the
> automounter in the tmp_fs is breaking some timing ? 

The --container flag and the job_container/tmpfs plugin do a lot of things very close to each other in the code.  I've not experimented with combining them yet, so I'm not sure what is going on there.  Have you tried it without your additional patches?

> Ole, thanks,   i will take a look at your spank plugin .
> Mike
Comment 38 mike coyne 2022-07-28 10:46:31 MDT
(In reply to Tim McMullan from comment #37)
> (In reply to mike coyne from comment #34)
> > Created attachment 26032 [details]
> > Updated  slurm 22.05.2 using the same internal script execution for nsepilog
> > and nsprolog
> > 
> > Tim, Ole .. 
> > I corrected my patch for 22.05.2 , to make it run-able i replaced the
> > run_command with my _run_script_in_ns . when trying to use run_command it
> > would not execute as it complained that it was in the process of shutting
> > down and refused to execute the script in the _delete_ns . i assume i am not
> > calling it correctly .
> 
> That is quite interesting, but at the moment I'm really trying to explore
> mount propagation between / and the job namespaces since this should be a
> much cleaner approach.  Its also the approach some of the spank plugins use.
> I do have a proof of concept that seems to work, but it needs a lot more
> testing and refinement.
> 
> > Tim, another question is the combination of using the tmp_fs and the
> > --container options for srun ..  Is the execution of the "runc" withing the
> > name space of the  tmp_fs or is it in parallel with that namespace . I can
> > seem to run one or the other but not both.. it may be the adding the
> > automounter in the tmp_fs is breaking some timing ? 
> 
> The --container flag and the job_container/tmpfs plugin do a lot of things
> very close to each other in the code.  I've not experimented with combining
> them yet so I'm not sure what is going on there.  Have you tried it without
> your additional patches?
> 
> > Ole, thanks,   i will take a look at your spank plugin .
> > Mike
Tim, this is what I have seen combining them so far.
I do now have my patched tmpfs code in place and working. When I was testing, I allowed the root namespace autofs to be disconnected, in that, say, /users on the node as root no longer mounted users; with the job running with the tmpfs plugin enabled, the automounter correctly runs and allows users to access their home directories, project files, etc.
What I saw when trying to run a --container job: it seems to not launch runc (in my case) as the user within the tmpfs namespace, but instead launches the container (srun within a salloc) in a "parallel" mount namespace. The path to the container had to be a viable path in both the root namespace and the tmpfs namespace, otherwise it would fail to find the container path. I did try to wrap the runc command in an nsexec call, but as it is run as the user, that did not work out.
I was able to run full parallel containers with the tmpfs plugin enabled if I put the container filesystem in a location that was available on the root fs and had been imported into the tmpfs namespace.

Mike 

Running either one on its own works fine.
Comment 39 Tim McMullan 2022-08-25 06:22:49 MDT
*** Ticket 14803 has been marked as a duplicate of this ticket. ***
Comment 40 Tim McMullan 2022-09-13 09:23:13 MDT
*** Ticket 14954 has been marked as a duplicate of this ticket. ***
Comment 47 Tim McMullan 2022-12-20 13:04:33 MST
Hi everyone,

We've landed commits that should update the job_container/tmpfs plugin to function with autofs-managed mounts.  It follows a similar pattern to that of many of the SPANK plugins, in that it shares mounts from the root namespace into the job containers as they are mounted.

https://github.com/SchedMD/slurm/commit/516c10ce06
https://github.com/SchedMD/slurm/commit/5f67e6c801
https://github.com/SchedMD/slurm/commit/2169071395
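
Conceptually, the propagation pattern works roughly like the sketch below (run as root; an illustration of the model, not the code in the commits above): the job namespace keeps "/" as a recursive slave of the host namespace instead of fully private, so autofs mounts made on the host after job start still appear inside the container, while the job's own /tmp and /dev/shm mounts do not propagate back out.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
    if (unshare(CLONE_NEWNS) < 0) {         /* new mount namespace for the job */
        perror("unshare");
        return 1;
    }
    /* receive host-side mount events, send nothing back (assumes "/" is
     * shared in the host namespace, the systemd default) */
    if (mount(NULL, "/", NULL, MS_SLAVE | MS_REC, NULL) < 0) {
        perror("mount MS_SLAVE");
        return 1;
    }
    /* job-private /tmp: invisible to the host and to other jobs */
    if (mount("tmpfs", "/tmp", "tmpfs", 0, "size=512m") < 0) {
        perror("mount /tmp");
        return 1;
    }
    /* from here, a host-side autofs trigger (e.g. a home directory mount)
     * still propagates into this namespace automatically */
    execlp("findmnt", "findmnt", "-o", "TARGET,PROPAGATION", (char *)NULL);
    perror("execlp");
    return 1;
}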

Please let us know if you encounter any issues with this!

Thanks!
--Tim
Comment 48 Tim McMullan 2022-12-21 11:25:54 MST
Since this has landed now, I'm going to mark this as resolved.

Let us know if you have any issues!

Thanks,
--Tim
Comment 49 mike coyne 2022-12-21 11:48:07 MST
Thanks Tim, I was working on trying to get the patches in place. I am a little uncertain about going from a configuration with MS_PRIVATE to MS_SHARED & MS_SLAVE, how that will relate to any root-generated mounts in the job ns, and whether they could show up in other namespaces. I created a new prejobprivns job_container in my 22.05.6, still with the MS_PRIVATE behaviour, to compare with the shared/slave version. I may need to continue using the per-job autofs and namespaces, as being able to customize the mount namespace per job has proved very useful when it comes to sharing a cluster's compute nodes between multiple, mutually isolated programs.
Mike
