Forking https://bugs.schedmd.com/show_bug.cgi?id=11673 since I realized there are 2 separate bugs.

Tested on the current master branch, commit a10619bf1d482e189fc3f0dceed5ef459b410667, running Slurm in a single-node test config on Ubuntu 20.04 with kernel 5.4.0-73-generic.

$ cat /etc/slurm/job_container.conf
BasePath=/var/run/slurm
AutoBasePath=true

When starting a new job, it has access to a per-job tmpfs mounted on /dev/shm:

$ srun --pty bash
$ echo $(date) > /dev/shm/date ; cat /dev/shm/date
Thu 20 May 2021 04:11:15 PM PDT

And if you launch another parallel job step, it has access to the same instance of the /dev/shm tmpfs:

$ srun --jobid=14 --overlap cat /dev/shm/date
Thu 20 May 2021 04:11:15 PM PDT

This mount namespace is saved in /run/slurm/${SLURM_JOBID}/.ns:

$ findmnt -R /run/slurm
TARGET              SOURCE                 FSTYPE OPTIONS
/run/slurm          tmpfs[/slurm]          tmpfs  rw,nosuid,nodev,noexec,relatime,size=32594660k,mode=755
└─/run/slurm/14/.ns nsfs[mnt:[4026532561]] nsfs   rw

But if you restart slurmd (normally) while the job is still running, a new mount namespace is created (4026532561 vs 4026532562):

$ findmnt -R /run/slurm
TARGET              SOURCE                 FSTYPE OPTIONS
/run/slurm          tmpfs[/slurm]          tmpfs  rw,nosuid,nodev,noexec,relatime,size=32594660k,mode=755
└─/run/slurm/14/.ns nsfs[mnt:[4026532562]] nsfs   rw

New job steps will now join a different, empty mount namespace, so the job steps are no longer in sync:

$ srun --jobid=14 --overlap cat /dev/shm/date
srun: error: ioctl: task 0: Exited with exit code 1
/bin/cat: /dev/shm/date: No such file or directory

Whereas the original interactive job step (srun --pty) can still access /dev/shm/date just fine:

$ cat /dev/shm/date
Thu 20 May 2021 04:11:15 PM PDT
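For anyone digging into this: joining the namespace saved in that .ns file boils down to an open() plus setns() on it. Here is a minimal C sketch of that idea (this is not the plugin's actual code; the path, the final exec, and the error handling are all simplified for illustration):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* Path is illustrative; in practice /run/slurm/<jobid>/.ns. */
    const char *ns_path = argc > 1 ? argv[1] : "/run/slurm/14/.ns";

    int fd = open(ns_path, O_RDONLY);   /* open the nsfs bind mount */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Join the saved mount namespace; from here on, /dev/shm is the
     * per-job tmpfs rather than the host's. Needs CAP_SYS_ADMIN. */
    if (setns(fd, CLONE_NEWNS) < 0) {
        perror("setns");
        return 1;
    }
    close(fd);

    /* Run a command inside the job's namespace. */
    execlp("cat", "cat", "/dev/shm/date", (char *) NULL);
    perror("execlp");
    return 1;
}

Since setns() only sees whatever namespace the .ns file currently points to, a recreated .ns silently diverts new steps into the fresh, empty namespace, which matches the behavior above.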
I've been looking into this, and reproducing it is easy. I'm working on a patch and looking at the implications of making sure /dev/shm persists!

Thanks,
--Tim
Thanks Tim!

Btw, for context, there have been some discussions about this in https://bugs.schedmd.com/show_bug.cgi?id=11093. The two likely options are:

1) Do not unmount the mount namespace bind mounts of active jobs when slurmd stops (https://bugs.schedmd.com/show_bug.cgi?id=11093#c11); see the sketch below for why the bind mount alone is enough to keep a namespace alive.
2) Continue unmounting the bind mounts, but somehow restore them when slurmd restarts (https://bugs.schedmd.com/show_bug.cgi?id=11093#c12).
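To make option 1 concrete: what keeps the namespace alive is the nsfs bind mount itself, not any process running inside it. A rough C sketch of the create-and-pin dance (pin_job_namespace is a hypothetical helper, not Slurm's actual implementation; paths and synchronization are illustrative):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a job's mount namespace and pin it to ns_file
 * (e.g. /run/slurm/14/.ns, which must already exist as an
 * empty file) so it outlives the processes inside it. */
static int pin_job_namespace(const char *ns_file)
{
    int ready[2];
    if (pipe(ready) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: new mount namespace, made private so the per-job
         * /dev/shm tmpfs does not propagate back to the host. */
        close(ready[0]);
        if (unshare(CLONE_NEWNS) < 0 ||
            mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0 ||
            mount("tmpfs", "/dev/shm", "tmpfs", 0, NULL) < 0)
            _exit(1);
        (void) write(ready[1], "x", 1);  /* signal: namespace ready */
        pause();                         /* wait to be reaped */
        _exit(0);
    }

    /* Parent: once the child reports ready, bind-mount its mnt
     * namespace onto ns_file. From then on the bind mount, not
     * the child process, is what keeps the namespace alive. */
    close(ready[1]);
    char c;
    if (read(ready[0], &c, 1) != 1) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return -1;
    }

    char src[64];
    snprintf(src, sizeof(src), "/proc/%d/ns/mnt", (int) pid);
    int rc = mount(src, ns_file, NULL, MS_BIND, NULL);

    kill(pid, SIGKILL);   /* namespace now survives regardless */
    waitpid(pid, NULL, 0);
    return rc;
}

As long as that bind mount exists, the kernel keeps the namespace alive even with zero processes in it, so simply skipping the unmount on slurmd shutdown (option 1) should be enough for new steps to keep finding the old /dev/shm.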
Thank you for the extra context!
Hi Felix,

Sorry about the delay here, but we were able to get this working. Unfortunately the changes required were a little too much to make it into 20.11, but they have landed ahead of the 21.08 release.

Note that the job_container/tmpfs plugin now also requires "PrologFlags=contain" in slurm.conf, since we've delegated all of the mount handling to the extern step.

I'm going to mark this as resolved for now, but please let us know if you notice any other issues!

Thanks,
--Tim
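P.S. For reference, a minimal configuration under the new scheme should look something like this (the job_container.conf is the one from the original report; JobContainerType=job_container/tmpfs is the plugin selection line in slurm.conf):

$ grep -iE 'PrologFlags|JobContainerType' /etc/slurm/slurm.conf
JobContainerType=job_container/tmpfs
PrologFlags=contain

$ cat /etc/slurm/job_container.conf
BasePath=/var/run/slurm
AutoBasePath=true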