Ticket 11135

Summary: Allow job_container/tmpfs to work without private /tmp
Product: Slurm Reporter: Trey Dockendorf <tdockendorf>
Component: Other    Assignee: Tim McMullan <mcmullan>
Status: RESOLVED FIXED
Severity: C - Contributions
CC: agaur, bas.vandervlies, fabecassis, lyeager, mcmullan, pedmon, plazonic, rundall, ward.poelmans
Version: 20.11.5
Hardware: Linux
OS: Linux
Site: Ohio State OSC
Version Fixed: 23.02pre1
Attachments: Adds an option to specify multiple dirs to handle for private tmp (comment 7)
             Adds an option to specify multiple dirs to handle for private tmp (comment 12, updated for 20.11.6)

Description Trey Dockendorf 2021-03-18 14:13:21 MDT
Per the discussion started in bug #11123, it would be very useful if the job_container/tmpfs plugin could be used to make a private /dev/shm, as it currently does, but without a private /tmp.  OSC currently only needs a private /dev/shm, and we mount our compute nodes' local disks directly on /tmp.  I would imagine that enabling the job_container/tmpfs plugin in slurm.conf but not defining a BasePath in job_container.conf could be one way to configure this, if it were made possible.
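
For context, a minimal sketch of how the plugin is enabled in 20.11 (the BasePath value here is hypothetical; in 20.11 the plugin makes both /tmp and /dev/shm private unconditionally, which is the behavior this ticket asks to relax):

```
# slurm.conf
JobContainerType=job_container/tmpfs

# job_container.conf
AutoBasePath=true
BasePath=/local/scratch
```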
Comment 1 Nate Rini 2021-03-18 15:58:06 MDT
*** Ticket 11109 has been marked as a duplicate of this ticket. ***
Comment 2 Josko Plazonic 2021-03-19 07:50:16 MDT
We, at Princeton, need private directories besides /tmp, i.e. we need the set of directories to be configurable, say

Dirs=/tmp,/var/tmp,/var/locks

so that each of these directories would then be private under BasePath.
Comment 3 Ward Poelmans 2021-03-19 10:04:55 MDT
I would also like to have a configurable list of directories. At a minimum /tmp, /dev/shm and /var/tmp would be needed.
Comment 4 Felix Abecassis 2021-03-22 11:02:41 MDT
We would also like to have a list of directories, ideally with the possibility to change mount type and mount options.

* Mount type: on cluster A, /tmp might need to be a tmpfs. On cluster B, /tmp might need to be a bind-mount from a local filesystem. 
* Mount options: in addition to the memory cgroup limit, we want to cap the size of the tmpfs, or back the tmpfs with huge pages; this requires additional mount options. 

On one of our smaller clusters, we have our own homemade SPANK plugin to handle this, as others are doing right now. This plugin is configured by a file in fstab format, which satisfies the constraints above:
```
$ cat /etc/slurm/fstab
tmpfs /dev/shm tmpfs rw,nodev,nosuid,size=16G,huge=always 0 0
/raid/scratch /tmp none defaults,bind 0 0
```
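
Each fstab line corresponds to an ordinary mount(8) invocation; a sketch of the equivalents, using the paths from the file above:

```
# tmpfs on /dev/shm with a size cap and transparent huge pages
mount -t tmpfs -o rw,nodev,nosuid,size=16G,huge=always tmpfs /dev/shm
# bind-mount the node-local filesystem over /tmp
mount --bind /raid/scratch /tmp
```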

But just having a configuration option for bind-mounts and a configuration option for tmpfs would be a good start.
Comment 5 Tim Wickberg 2021-03-24 15:32:08 MDT
Thanks for all the suggestions; we'll certainly be looking into some aspects of this, given how much interest it has generated so quickly.

But at the moment I cannot commit to any specific extensions. Additional directory configuration and options to modify the mount options both strike me as useful, but they will need further development.

If a site is interested in sponsoring some of this, and/or wishes to propose a patch, I'll certainly be willing to consider that.

- Tim
Comment 6 Paul Edmon 2021-03-25 11:17:46 MDT
Just to throw in our two cents from Harvard, two additional features we would like to see are:

1. The ability to use a different directory than /tmp
2. The ability to specify multiple directories

This tmpfs plugin is really handy, thanks for putting it together.
Comment 7 Josko Plazonic 2021-03-31 15:48:32 MDT
Created attachment 18779 [details]
Adds an option to specify multiple dirs to handle for private tmp

Adds a Dirs=/tmp,/var/tmp style option so one can have multiple private job-container tmpfs directories. They all use the same BasePath, and if Dirs is not specified it defaults to /tmp.
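
A minimal job_container.conf sketch of how this patch's option would be used (the BasePath value is borrowed from the example paths mentioned below; adjust to taste):

```
# job_container.conf (sketch, with this patch applied)
AutoBasePath=true
BasePath=/scratch/slurmtemp
# Each listed directory becomes private per job; defaults to /tmp if unset
Dirs=/tmp,/var/tmp
```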

This patch also removes the namespace unmount in fini() and adds a file under /run to indicate that the bind mount of base_path was done (in container_p_restore). With these changes, restarting slurmd is reliable and does not break running jobs. I am not sure whether /run/ is a good path to use for this on all systems.

A few things could be simplified. E.g. temp dirs end up looking like /scratch/slurmtemp/3755/.3755/_var_tmp and their permissions are 1777 (easier than changing the ownership of each of /scratch/slurmtemp/3755/.3755/* to the user). A few more snprintf length checks could be added (but if your private dirs are close to PATH_MAX in length, you have other problems).

Anyway, it works in my tests.
Comment 8 Aditi Gaur 2021-03-31 16:05:33 MDT
This requirement has come up on our end as well: at a minimum to add `/var/tmp`, and ideally to have a configurable set of directories. I'll give the above patch a try when I have the time.

> This patch also removes namespace unmount in fini + adds a file to /run to indicate that the bind mount base_path was done (in container_p_restore). With these changes restart of slurmd is reliable

That's interesting. We have been using it in production for a while as well and have not seen restarts of slurmd be unreliable or disruptive to running jobs. A description of that behavior would certainly help!
Comment 9 Felix Abecassis 2021-03-31 16:06:42 MDT
Aditi: I described the slurmd restart issue here: https://bugs.schedmd.com/show_bug.cgi?id=11093
Comment 10 Aditi Gaur 2021-03-31 16:40:33 MDT
Thanks Felix, interesting that you are seeing this behavior. On my end I just tried to reproduce it by setting the base path to /var/run, submitting a job, then killing slurmd and restarting it. In my case slurmd recovered and the running job terminated fine. Obviously there could be subtle differences here. Just for reference, this is the kernel I am on: 4.15.0-140-generic. setns and mount calls can behave subtly differently on older kernels.

This is what I used in namespace.conf:

```
NodeName=linux_vb BasePath=/var/run/storage AutoBasePath=true InitScript=/usr/local/etc/test.py
```

And I killed slurmd using pkill. Maybe you killed it more aggressively?

Another thing that is helpful for debugging is that in my case slurmstepd was still running when I killed slurmd:

```
root@linux_vb:/usr/local/etc# ps aux | grep slurm
root      3109  0.0  0.6 279968  6504 ?        Sl   14:57   0:00 slurmstepd: [4.extern]
slurm     6742  0.0  0.8 690184  8860 ?        Sl   15:17   0:00 /usr/local/sbin/slurmctld -i
root      8786  0.0  0.6 213404  6524 ?        Sl   15:28   0:00 slurmstepd: [4.extern]
root      8811  0.0  0.6 346528  6596 ?        Sl   15:28   0:00 slurmstepd: [4.interactive]
root      8963  0.0  0.1  14428  1048 pts/0    S+   15:29   0:00 grep --color=auto slurm
```

In job_container/tmpfs it is the slurmstepd that actually keeps the namespace active even if the upper directory gets unmounted, in this case `/var/run/storage`. As long as slurmstepd survives for the duration of a job, the namespace should remain active even if the upper directory is unmounted. But again, if your kernel is older, the problems could be due to underlying factors, and if it's newer then we are all going to hit them soon anyway :)
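
As a quick way to confirm that a stepd still holds the namespace, one can compare mount-namespace inodes and enter the namespace (the PID is the hypothetical slurmstepd PID from the ps output above):

```
# If the mnt namespace inode differs from PID 1's, the stepd is in a private namespace
ls -l /proc/1/ns/mnt /proc/3109/ns/mnt
# Enter the stepd's mount namespace and inspect its private /tmp
nsenter --mount --target 3109 ls -la /tmp
```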

I am personally unsure what the right approach is here, but it seems like the patch above works for you all.
Comment 11 Felix Abecassis 2021-03-31 17:47:09 MDT
Aditi, I suggest we take this discussion to the other bug, I'll answer there.
Comment 12 Josko Plazonic 2021-05-03 09:56:53 MDT
Created attachment 19261 [details]
Adds an option to specify multiple dirs to handle for private tmp

Updated for 20.11.6
Comment 24 Bas van der Vlies 2022-07-25 06:34:56 MDT
We use this plugin: https://github.com/hpc2n/spank-private-tmp. It can handle multiple directories, so it would be nice if this feature could also be added to the job_container/tmpfs plugin. Will this patch be applied?
Comment 25 Ward Poelmans 2022-07-25 06:35:07 MDT
Hi,

I’ll be unavailable until August 20.

For HPC related matters, contact the helpdesk at hpc@vub.be

For other urgent matters, contact hpcadmin@vub.be

Kind regards,

Ward
Comment 26 Felix Abecassis 2022-07-25 10:11:35 MDT
I saw some activity on the "master" branch related to this bug, e.g.:
https://github.com/SchedMD/slurm/commit/3489ff75cbb88f2b2932c9982d27b25c666bd213
https://github.com/SchedMD/slurm/commit/ebe74549393c16e84ab5af8ebdaf1239f6b94d1f
Comment 27 Bas van der Vlies 2022-07-25 14:15:55 MDT
(In reply to Felix Abecassis from comment #26)
> I saw some activity on the "master" branch related to this bug, e.g.:
> https://github.com/SchedMD/slurm/commit/3489ff75cbb88f2b2932c9982d27b25c666bd213
> https://github.com/SchedMD/slurm/commit/ebe74549393c16e84ab5af8ebdaf1239f6b94d1f


Looks good ;-) Thanks Felix for the info
Comment 28 Tim McMullan 2022-07-26 15:22:01 MDT
Thank you for all the suggestions. We've landed patches that allow the plugin to work with an arbitrary set of Dirs, based on some of the contributed code along with a few additional small changes.  These changes will appear in 23.02 at the earliest.
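
Once on a release with these patches, a quick sanity check from inside a job might look like this (findmnt is standard util-linux; the directories checked are whatever Dirs= lists):

```
$ srun sh -c 'findmnt /tmp; findmnt /var/tmp; findmnt /dev/shm'
```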

The other suggestions, like expanded mount options/types, sound useful, but if there is interest in them I think we should discuss them in a new ticket.

Thank you again for the contributions and discussion on this plugin!
--Tim