Created attachment 18512 [details]
slurm.conf

I am testing job_container/tmpfs and it simply does not work as advertised.

My job_container.conf is this:

$ cat /etc/slurm/job_container.conf
AutoBasePath=false
BasePath=/dev/shm

I want /dev/shm to be private inside a job, so that if a user uses /dev/shm it is private to their job and cleaned up when the job ends.

Errors:

$ salloc -w slurmd01 -A PZS0708 srun --interactive --pty /bin/bash
salloc: Pending job allocation 2001790
salloc: job 2001790 queued and waiting for resources
salloc: job 2001790 has been allocated resources
salloc: Granted job allocation 2001790
salloc: Waiting for resource configuration
salloc: Nodes slurmd01 are ready for job
slurmstepd: error: container_p_join: open failed /dev/shm/2001790/.active: No such file or directory
slurmstepd: error: container_g_join failed: 2001790
slurmstepd: error: write to unblock task 0 failed: Broken pipe
slurmstepd: error: container_p_join: open failed /dev/shm/2001790/.active: No such file or directory
slurmstepd: error: container_g_join(2001790): No such file or directory
srun: error: slurmd01: task 0: Exited with exit code 1
salloc: Relinquishing job allocation 2001790

$ sbatch -w slurmd01 -A PZS0708 --wrap 'scontrol show job=$SLURM_JOB_ID'
Submitted batch job 2001791

$ cat slurm-2001791.out
slurmstepd: error: container_p_join: open failed /dev/shm/2001791/.active: No such file or directory
slurmstepd: error: container_g_join failed: 2001791
slurmstepd: error: write to unblock task 0 failed: Broken pipe
slurmstepd: error: container_p_join: open failed /dev/shm/2001791/.active: No such file or directory
slurmstepd: error: container_g_join(2001791): No such file or directory
To clarify, the node I am attempting to run on is configless. I have verified it has the latest configs, and for good measure I restarted slurmd before the errors were produced.
Also, this issue is somewhat time sensitive. We have a center-wide downtime on March 31, which is when we would deploy 20.11.5 to make use of this new feature and replace the SPANK plugin we currently use for private /dev/shm. Making this change live seems rather risky, since we would need to change how private /dev/shm is handled, so it would be much easier to do during our March 31 downtime.
Hi Trey,

Thanks for this bug report. I reproduced what you were seeing, but I found that the job_container/tmpfs plugin is working. We do need to clarify a couple of things in the documentation, though; we already have an internal bug open for that (bug 11107, though you can't see it since it's private). No worries about the confusion. There was some confusion among some of us as well.

In bug 11109, you said:

> If we moved to job_container/tmpfs it looks like we'd be limited to /dev/shm
> only and not both /dev/shm and /tmp. Is it possible with job_container/tmpfs
> to setup multiple private locations like /dev/shm and /tmp? My read of the
> config docs and my initial read of code is that only one BasePath per either
> config or node group is allowed.

Actually, both /dev/shm and /tmp are created as private directories for the job. You are correct that only one BasePath per config or node group is allowed, but BasePath isn't doing what you think it is doing.

For each job, the job_container/tmpfs plugin creates a <job_id> directory and then creates private /tmp and /dev/shm directories inside that <job_id> directory. The user can then use /tmp and /dev/shm however they want in the job, and it will use these private directories. These directories are torn down at the end of the job.

BasePath is where these directories are actually mounted. It needs to be a location where directories can be mounted and unmounted freely, so BasePath cannot be /tmp or /dev/shm. The errors you are seeing are caused by problems with mounting/unmounting directly in /tmp and /dev/shm. You need to change BasePath to something that is not /tmp or /dev/shm. For example, if I set BasePath=/mnt then everything is fine.

Hopefully that clears this up. Can you let me know if this makes sense, or if I can clarify anything?

- Marshall
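For reference, a minimal job_container.conf along those lines would look like the following. This reuses the BasePath=/mnt example from the comment above; the actual path is site-specific and only needs to be a location other than /tmp or /dev/shm where mounts can be created and removed freely.

$ cat /etc/slurm/job_container.conf
# BasePath must not be /tmp or /dev/shm; /mnt is only an example
AutoBasePath=false
BasePath=/mnt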
Thanks, I think I misunderstood what this plugin does. I was hoping it would let me make selected locations private, so that a private /dev/shm would be mounted somewhere like /dev/shm/slurm.$SLURM_JOB_ID but seen as /dev/shm inside the job. Having both /tmp and /dev/shm mounted to the same place, like a scratch directory, is unfortunately not what we need at this time.

At the very least, I think some documentation updates are needed if this plugin will continue to behave as it currently does. I think this case can be closed. We will continue to use https://github.com/treydock/spank-private-tmp/tree/osc for making /dev/shm private within a job.
Okay, so I looked at this again and I have good news and bad news. The bad news is that I apparently didn't know what I was talking about, and we really do need to update the documentation. The good news is that I was wrong and the plugin hopefully does what you want (or something close to it). We are also working on improving the documentation.

So let me start over and explain things as I now understand them. The job_container/tmpfs plugin's job is to create a private /tmp and a private /dev/shm for the job.

* The private /tmp is mounted inside BasePath in a <job_id> subdirectory: $BasePath/<job_id>

static int _mount_private_tmp(char *path)
{
	if (!path) {
		error("%s: cannot mount /tmp", __func__);
		return -1;
	}
#if !defined(__APPLE__) && !defined(__FreeBSD__)
	if (mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL)) {
		error("%s: making root private: failed: %s",
		      __func__, strerror(errno));
		return -1;
	}
	if (mount(path, "/tmp", NULL, MS_BIND|MS_REC, NULL)) {
		error("%s: /tmp mount failed, %s",
		      __func__, strerror(errno));
		return -1;
	}
#endif
	return 0;
}

* The private /dev/shm: /dev/shm is unmounted, then a private tmpfs is mounted at /dev/shm.

static int _mount_private_shm(void)
{
	int rc = 0;

	rc = umount("/dev/shm");
	if (rc && errno != EINVAL) {
		error("%s: umount /dev/shm failed: %s\n",
		      __func__, strerror(errno));
		return rc;
	}
#if !defined(__APPLE__) && !defined(__FreeBSD__)
	rc = mount("tmpfs", "/dev/shm", "tmpfs", 0, NULL);
	if (rc) {
		error("%s: mounting private /dev/shm failed: %s\n",
		      __func__, strerror(errno));
		return -1;
	}
#endif
	return rc;
}

It was surprising to me (and others) how BasePath actually works. So BasePath specifies where the private /tmp will be mounted, which can't be /tmp or /dev/shm; that makes sense once you understand what is actually happening.

Does this make sense?
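To make the mechanism above more concrete, here is a small standalone sketch of the same technique those two functions rely on: a new mount namespace, a per-job directory bind-mounted over /tmp, and a fresh tmpfs mounted over /dev/shm. This is illustrative code only, not Slurm source; the per-job path is hypothetical, must already exist, and the program has to run as root.

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

int main(void)
{
	/* hypothetical per-job directory under BasePath; must already exist */
	const char *job_tmp = "/mnt/2001790";

	/* new mount namespace so these mounts are private to the job */
	if (unshare(CLONE_NEWNS)) {
		fprintf(stderr, "unshare: %s\n", strerror(errno));
		return 1;
	}
	/* keep mounts in this namespace from propagating back to the host */
	if (mount(NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL))
		return 1;
	/* private /tmp: bind the per-job directory over /tmp */
	if (mount(job_tmp, "/tmp", NULL, MS_BIND | MS_REC, NULL))
		return 1;
	/* private /dev/shm: drop the existing mount, then mount a fresh tmpfs */
	umount("/dev/shm");
	if (mount("tmpfs", "/dev/shm", "tmpfs", 0, NULL))
		return 1;
	return 0;
}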
So the private /dev/shm is good. Is it cleaned up somehow, or just unmounted so it goes away when the job ends?

For /tmp, we mount our compute nodes' local disk at /tmp, and we need users to keep using things like $TMPDIR (/tmp/slurm.$SLURM_JOB_ID) on that local disk. Is there a way to get the benefits of a private /dev/shm without making /tmp private? For example, could I enable job_container/tmpfs in slurm.conf but not define a BasePath, and still get a private /dev/shm without a private /tmp?

Thanks,
- Trey
(In reply to Trey Dockendorf from comment #10)
> So the private /dev/shm is good. Is it cleaned up somehow, or just unmounted
> so it goes away when the job ends?

Since it's a private tmpfs, my understanding is that the mount is purged when the last process dies, and Slurm kills all job processes when the job ends. There's nothing explicit in the job_container/tmpfs plugin that unmounts or cleans up /dev/shm, though.

For the private /tmp, the job_container/tmpfs plugin unmounts it from the topmost directory (BasePath), then the directories are traversed and the files and directories are removed. See the functions container_p_delete() and _rm_data() for how that actually works.

> For /tmp, we mount our compute nodes' local disk at /tmp, and we need users
> to keep using things like $TMPDIR (/tmp/slurm.$SLURM_JOB_ID) on that local
> disk. Is there a way to get the benefits of a private /dev/shm without making
> /tmp private? For example, could I enable job_container/tmpfs in slurm.conf
> but not define a BasePath, and still get a private /dev/shm without a private
> /tmp?
>
> Thanks,
> - Trey

Not right now. It's certainly possible to add.
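As a rough sketch of the cleanup idea described above for the private /tmp: detach the per-job mount under BasePath, then walk the directory depth-first and remove everything beneath it. This is only an illustration of the approach, not the actual container_p_delete()/_rm_data() code; the function name and path layout here are assumptions.

#define _GNU_SOURCE
#include <ftw.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>

static int _rm_entry(const char *fpath, const struct stat *sb,
		     int typeflag, struct FTW *ftwbuf)
{
	(void) sb; (void) typeflag; (void) ftwbuf;
	return remove(fpath);	/* unlinks files, rmdir's now-empty dirs */
}

/* job_path would be something like "<BasePath>/<job_id>" */
int cleanup_job_dir(const char *job_path)
{
	/* lazily detach the private mount so removal can proceed */
	umount2(job_path, MNT_DETACH);
	/* FTW_DEPTH visits a directory's contents before the directory itself */
	return nftw(job_path, _rm_entry, 16, FTW_DEPTH | FTW_PHYS);
}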
Should this bug become an RFE, then, to get the ability to use job_container/tmpfs for a private /dev/shm without also using it for a private /tmp? Or should I open something new? Most likely it won't be until our 2022 cluster that we could redo our partition scheme and put the local disk somewhere other than /tmp, so that /tmp could be mounted privately.
Can you open a new bug for the RFE? That way it can be tracked much more easily, and nobody has to scroll through all the discussion on this bug. Thanks for your patience with me on this one as I've figured out how this new plugin works.
I opened RFE 11135. I believe this case can be closed, since I now know how this plugin works and what's needed for us to actually be able to use it, which is what the RFE should cover.
Thanks! Closing as infogiven.