Ticket 11674

Summary: /dev/shm from job_container/tmpfs not restored after slurmd restart
Product: Slurm Reporter: Felix Abecassis <fabecassis>
Component: slurmd Assignee: Tim McMullan <mcmullan>
Status: RESOLVED FIXED QA Contact: Unassigned Reviewer <reviewers>
Severity: 4 - Minor Issue    
Priority: --- CC: jbernauer, lyeager, mcmullan, rundall, tripiana
Version: 21.08.x   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=11907
Site: NVIDIA (PSLA)
Version Fixed: 21.08.0

Description Felix Abecassis 2021-05-20 17:17:56 MDT
Forking https://bugs.schedmd.com/show_bug.cgi?id=11673 since I realized there are 2 separate bugs.

Tested on the current master branch, commit a10619bf1d482e189fc3f0dceed5ef459b410667
Running Slurm in a single node test config, on Ubuntu 20.04 with kernel 5.4.0-73-generic.

$ cat /etc/slurm/job_container.conf
BasePath=/var/run/slurm
AutoBasePath=true
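
For context, the plugin itself is enabled through JobContainerType in slurm.conf; something like the following line (illustrative, not copied from this test node's config):
$ grep -i JobContainerType /etc/slurm/slurm.conf
JobContainerType=job_container/tmpfs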

When a new job starts, it has access to a per-job tmpfs mounted on /dev/shm:
$ srun --pty bash
$ echo $(date) > /dev/shm/date ; cat /dev/shm/date
Thu 20 May 2021 04:11:15 PM PDT

And if you launch another parallel job step, it will have access to the same instance of the /dev/shm tmpfs:
$ srun --jobid=14 --overlap cat /dev/shm/date
Thu 20 May 2021 04:11:15 PM PDT
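
One quick way to double-check that the two steps really share a mount namespace is to compare their namespace handles (sketch, reusing job ID 14 from above); both commands should print the same mnt:[...] identifier:
$ srun --jobid=14 --overlap readlink /proc/self/ns/mnt
$ readlink /proc/self/ns/mnt   # run inside the srun --pty shell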

This mount namespace is saved in /run/slurm/${SLURM_JOBID}/.ns:
$ findmnt -R /run/slurm
TARGET              SOURCE                 FSTYPE OPTIONS
/run/slurm          tmpfs[/slurm]          tmpfs  rw,nosuid,nodev,noexec,relatime,size=32594660k,mode=755
└─/run/slurm/14/.ns nsfs[mnt:[4026532561]] nsfs   rw
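
Since the .ns file is itself a bind mount of the job's mount namespace, it can also be entered from outside the job; for example (sketch, assuming root and util-linux nsenter), this should print the same timestamp written above:
$ sudo nsenter --mount=/run/slurm/14/.ns cat /dev/shm/date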

But if you restart slurmd (normally) while the job is still running, a new mount namespace will be created (4026532561 vs 4026532562):
$ findmnt -R /run/slurm
TARGET              SOURCE                 FSTYPE OPTIONS
/run/slurm          tmpfs[/slurm]          tmpfs  rw,nosuid,nodev,noexec,relatime,size=32594660k,mode=755
└─/run/slurm/14/.ns nsfs[mnt:[4026532562]] nsfs   rw
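
The restart here is just a plain service restart; on a systemd-managed node that would be something like:
$ sudo systemctl restart slurmd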

New job steps now join a different mount namespace, which is empty, so the job steps are no longer in sync:
$ srun --jobid=14 --overlap cat /dev/shm/date
srun: error: ioctl: task 0: Exited with exit code 1
/bin/cat: /dev/shm/date: No such file or directory

Meanwhile, the original interactive job step (srun --pty) can still access /dev/shm/date just fine:
$ cat /dev/shm/date
Thu 20 May 2021 04:11:15 PM PDT
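
A quick way to see the mismatch directly is to compare the namespace recorded in the .ns bind mount with the namespace the surviving shell is still in (sketch; run the second command inside the original srun --pty shell). After the slurmd restart the two identifiers no longer match:
$ sudo stat -c %i /run/slurm/14/.ns
$ readlink /proc/self/ns/mnt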
Comment 4 Tim McMullan 2021-06-02 10:03:34 MDT
I've been looking into this, and reproducing it is easy. I'm working on a patch and looking at the implications of making sure /dev/shm persists!

Thanks,
--Tim
Comment 5 Felix Abecassis 2021-06-02 10:14:30 MDT
Thanks Tim!

Btw, for context, there has been some discussion about this in https://bugs.schedmd.com/show_bug.cgi?id=11093

The two options are likely:
1) Do not unmount the mount namespace bind mounts of active jobs when slurmd stops (https://bugs.schedmd.com/show_bug.cgi?id=11093#c11).
2) Continue unmounting the bind mounts, but restore them somehow when slurmd starts back up and restores its state (https://bugs.schedmd.com/show_bug.cgi?id=11093#c12).
Comment 6 Tim McMullan 2021-06-02 12:29:55 MDT
Thank you for the extra context!
Comment 14 Tim McMullan 2021-08-25 08:28:33 MDT
Hi Felix,

Sorry about the delay here, but we were able to get this working.  Unfortunately the changes required were a little too much to make it into 20.11, but they have landed ahead of the 21.08 release.

Note that the job_container/tmpfs plugin now also requires "PrologFlags=contain" in the slurm.conf, since we've delegated all of the mount handling to the extern step.
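
For anyone setting this up, a minimal slurm.conf excerpt with the new requirement would look something like this (illustrative, not a complete config):
JobContainerType=job_container/tmpfs
PrologFlags=contain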

I'm going to mark this as resolved for now, but please let us know if you notice any other issues!

Thanks,
--Tim