| Summary: | Allow job_container/tmpfs to work without private /tmp | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Trey Dockendorf <tdockendorf> |
| Component: | Other | Assignee: | Tim McMullan <mcmullan> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | C - Contributions | Priority: | --- |
| Version: | 20.11.5 | CC: | agaur, bas.vandervlies, fabecassis, lyeager, mcmullan, pedmon, plazonic, rundall, ward.poelmans |
| Hardware: | Linux | OS: | Linux |
| Site: | Ohio State OSC | | |
| Version Fixed: | 23.02pre1 | Target Release: | --- |
| Attachments: | Adds an option to specify multiple dirs to handle for private tmp (attachment 18779) | | |
| | Adds an option to specify multiple dirs to handle for private tmp (attachment 19261) | | |
Description
Trey Dockendorf
2021-03-18 14:13:21 MDT
*** Ticket 11109 has been marked as a duplicate of this ticket. ***

We, in Princeton, need private directories besides /tmp. I.e. we need /tmp to be configurable, say Dirs=/tmp,/var/tmp,/var/locks, so that each of these directories would then be private under BasePath.

I would also like to have a configurable list of directories. At a minimum /tmp, /dev/shm and /var/tmp would be needed.

We would also like to have a list of directories, ideally with the possibility to change the mount type and mount options:

* Mount type: on cluster A, /tmp might need to be a tmpfs. On cluster B, /tmp might need to be a bind-mount from a local filesystem.
* Mount options: in addition to the memory cgroup, we want to limit the size of the tmpfs or back the tmpfs with huge pages, which requires additional mount options.

On one of our smaller clusters, we have our own homemade SPANK plugin to handle this, like others are doing right now. This plugin can be configured by a file in the fstab format, to satisfy the constraints above:

```
$ cat /etc/slurm/fstab
tmpfs          /dev/shm  tmpfs  rw,nodev,nosuid,size=16G,huge=always  0 0
/raid/scratch  /tmp      none   defaults,bind                         0 0
```

But just having a configuration option for bind-mounts and a configuration option for tmpfs would be a good start.

Thanks for all the suggestions, and we'll certainly be looking into some aspects of this given how much interest this has attracted so quickly. But at the moment I cannot commit to any specific extensions. Additional directory configuration, and options to modify the mount options, both strike me as useful, but will need further development. If a site is interested in sponsoring some of this, and/or wishes to propose a patch, I'll certainly be willing to consider that. - Tim

Just to throw in our two cents from Harvard, two additional features we would like to see are:

1. Use a different directory than /tmp
2. Multiple directories able to be specified

This tmpfs plugin is really handy, thanks for putting it together.

Created attachment 18779 [details]
Adds an option to specify multiple dirs to handle for private tmp
Adds a Dirs=/tmp,/var/tmp style option so one can have multiple job container tmpfs directories. They all use the same BasePath, and if Dirs is not specified it defaults to /tmp.
This patch also removes the namespace unmount in fini and adds a file to /run to indicate that the bind mount of base_path was done (in container_p_restore). With these changes, restarting slurmd is reliable and does not break running jobs. I am not sure if /run/ is a good path to use for this on all systems.
A few things could be simplified - e.g. temp dirs end up looking like /scratch/slurmtemp/3755/.3755/_var_tmp and their perm is 1777 (easier than changing ownership of each of /scratch/slurmtemp/3755/.3755/* to the user). A few more snprintf length checks could be added (but if your private dirs are close to PATH_MAX length you have other problems).
Anyway, it works in my tests.
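For illustration, here is a hedged sketch of what a configuration using the patch's Dirs= option could look like. The paths are examples only, and the exact syntax follows the attached patch rather than any released version (the plugin reads namespace.conf in 20.11, renamed job_container.conf in later releases):

```
# namespace.conf (job_container.conf in later releases) -- illustrative values only
AutoBasePath=true
BasePath=/scratch/slurmtemp
Dirs=/tmp,/var/tmp,/dev/shm
```

With a configuration like this, each job would get private instances of /tmp, /var/tmp and /dev/shm created under its per-job directory below BasePath, as described in the comment above.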
This requirement has come up on our end as well: at a minimum to add `/var/tmp`, and ideally to have a configurable list of directories. I'll give the above patch a try when I have the time.
> This patch also removes namespace unmount in fini + adds a file to /run to
> indicate that the bind mount base_path was done (in container_p_restore). With
> these changes restart of slurmd is reliable
That's interesting. We have been using it in production for a while as well and have not seen restarts of slurmd being unreliable or disruptive to running jobs. A description of that behavior would certainly help!
Aditi: I described the slurmd restart issue here: https://bugs.schedmd.com/show_bug.cgi?id=11093

Thanks Felix, interesting that you are seeing this behavior. On my end I just tried to reproduce by setting BasePath to /var/run, submitting a job, killing the job, and then restarting slurmd. In my case slurmd recovered and the running job terminated fine. Obviously there could be subtle differences here. Just for reference, this is the kernel I am on: 4.15.0-140-generic. setns and mount calls can behave somewhat differently on an older kernel.

This is what I used in namespace.conf:

```
NodeName=linux_vb
BasePath=/var/run/storage
AutoBasePath=true
InitScript=/usr/local/etc/test.py
```

And I killed slurmd using pkill. Maybe you killed it more aggressively? Another thing that is helpful for debugging is that in my case slurmstepd was running when I killed slurmd:

```
root@linux_vb:/usr/local/etc# ps aux | grep slurm
root      3109  0.0  0.6 279968 6504 ?      Sl   14:57   0:00 slurmstepd: [4.extern]
slurm     6742  0.0  0.8 690184 8860 ?      Sl   15:17   0:00 /usr/local/sbin/slurmctld -i
root      8786  0.0  0.6 213404 6524 ?      Sl   15:28   0:00 slurmstepd: [4.extern]
root      8811  0.0  0.6 346528 6596 ?      Sl   15:28   0:00 slurmstepd: [4.interactive]
root      8963  0.0  0.1  14428 1048 pts/0  S+   15:29   0:00 grep --color=auto slurm
```

In job_container/tmpfs it is the slurmstepd that actually keeps the namespace active even if the upper directory (in this case `/var/run/storage`) gets unmounted. As long as slurmstepd stays alive during a job, the namespace should remain active even if the upper directory is unmounted. But again, if your kernel is older, the problems could be due to underlying factors, and if it is newer then we are all going to hit it soon anyway :) I am personally unsure what the right approach is here, but it seems like the patch above works for you all.

Aditi, I suggest we take this discussion to the other bug, I'll answer there.

Created attachment 19261 [details]
Adds an option to specify multiple dirs to handle for private tmp
Updated for 20.11.6
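As background for the namespace discussion above, here is a minimal C sketch of the underlying mechanism: a process can join an existing mount namespace through a namespace file with setns(2), which is roughly how a step ends up inside a job's private mount namespace. This is not the plugin's actual code; the program name and the namespace-file path in the usage example are hypothetical.

```c
/*
 * joinns.c -- hedged sketch: join an existing mount namespace and run a
 * command inside it. The namespace file would typically be a bind mount of
 * /proc/<pid>/ns/mnt kept somewhere persistent so it outlives its creator.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 3) {
		fprintf(stderr, "usage: %s <ns-file> <command> [args...]\n", argv[0]);
		return 1;
	}

	/* Open the namespace file (e.g. a bind mount of /proc/<pid>/ns/mnt). */
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Join the mount namespace; this requires CAP_SYS_ADMIN. */
	if (setns(fd, CLONE_NEWNS) < 0) {
		perror("setns");
		return 1;
	}
	close(fd);

	/* The command now sees that namespace's private /tmp, /dev/shm, etc. */
	execvp(argv[2], &argv[2]);
	perror("execvp");
	return 1;
}
```

For example, `sudo ./joinns /run/some-ns-file ls -la /tmp` would list the private /tmp of whichever namespace that file refers to. A mount namespace stays alive as long as some process is inside it or a file descriptor/bind mount references it, which matches the slurmstepd behavior described above.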
We use this plugin: https://github.com/hpc2n/spank-private-tmp. It can handle multiple directories, so it would be nice if this feature could also be enabled for the job_container/tmpfs plugin. Will this patch be applied?

I saw some activity on the "master" branch related to this bug, e.g.:
https://github.com/SchedMD/slurm/commit/3489ff75cbb88f2b2932c9982d27b25c666bd213
https://github.com/SchedMD/slurm/commit/ebe74549393c16e84ab5af8ebdaf1239f6b94d1f

(In reply to Felix Abecassis from comment #26)
> I saw some activity on the "master" branch related to this bug, e.g.:
> https://github.com/SchedMD/slurm/commit/3489ff75cbb88f2b2932c9982d27b25c666bd213
> https://github.com/SchedMD/slurm/commit/ebe74549393c16e84ab5af8ebdaf1239f6b94d1f

Looks good ;-) Thanks Felix for the info.

Thank you for all the suggestions. We've landed patches that allow the plugin to work with an arbitrary set of Dirs, based on some of the contributed code along with a few additional small changes. These changes will appear in 23.02 at the earliest. The other suggestions, like expanded mount options/types, sound useful, but if there is interest in them I think we should discuss them in a new ticket. Thank you again for the contributions and discussion on this plugin! --Tim