Summary: | /dev/shm from job_container/tmpfs not restored after slurmd restart | ||
---|---|---|---|
Product: | Slurm | Reporter: | Felix Abecassis <fabecassis> |
Component: | slurmd | Assignee: | Tim McMullan <mcmullan> |
Status: | RESOLVED FIXED | QA Contact: | Unassigned Reviewer <reviewers> |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | jbernauer, lyeager, mcmullan, rundall, tripiana |
Version: | 21.08.x | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=11907 | ||
Site: | NVIDIA (PSLA) | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | 21.08.0 | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Felix Abecassis
2021-05-20 17:17:56 MDT
I've been looking into this, and reproducing it is easy. I'm working on a patch and looking at the implications of making sure /dev/shm persists!

Thanks,
--Tim

Thanks Tim! Btw, for context, there have been some discussions about this in https://bugs.schedmd.com/show_bug.cgi?id=11093

The two options are likely:
1) Do not unmount the mount namespace bind mounts of active jobs when slurmd stops (https://bugs.schedmd.com/show_bug.cgi?id=11093#c11)
2) Continue unmounting the bind mounts, but restore them somehow in restore (https://bugs.schedmd.com/show_bug.cgi?id=11093#c12).

Thank you for the extra context!

Hi Felix,

Sorry about the delay here, but we were able to get this working. Unfortunately, the changes required were a little too much to make it into 20.11, but they have landed ahead of the 21.08 release.

Note that the job_container/tmpfs plugin now also requires "PrologFlags=contain" in the slurm.conf, since we've delegated all of the mount handling to the extern step.

I'm going to mark this as resolved for now, but please let us know if you notice any other issues!

Thanks,
--Tim
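
For sites upgrading to 21.08, the new "PrologFlags=contain" requirement mentioned above can be checked before restarting slurmd. The following is only a minimal sketch, not part of Slurm: the helper name `check_tmpfs_requirements` and the sample configuration text are hypothetical, and the parsing is deliberately simplified. It assumes one `Key=Value` pair per line and only looks for a literal `contain` entry, so it does not account for other flags that may imply it.

```python
def check_tmpfs_requirements(conf_text: str) -> list[str]:
    """Return a list of problems found in slurm.conf text (empty list = OK).

    Hypothetical helper for illustration; simplified parsing, see assumptions above.
    """
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # slurm.conf parameter names are case-insensitive, so normalize them
        settings[key.strip().lower()] = value.strip()

    problems = []
    if settings.get("jobcontainertype", "").lower() != "job_container/tmpfs":
        return problems  # plugin not enabled, nothing to check

    flags = {
        flag.strip().lower()
        for flag in settings.get("prologflags", "").split(",")
        if flag.strip()
    }
    if "contain" not in flags:
        problems.append(
            "job_container/tmpfs is enabled but PrologFlags does not include 'contain'"
        )
    return problems


# Sample configuration text (illustrative values only)
sample_conf = """
JobContainerType=job_container/tmpfs
PrologFlags=contain
"""

for problem in check_tmpfs_requirements(sample_conf):
    print(problem)
```

With the sample text above, the check finds no problems and prints nothing. Removing the `PrologFlags` line from the sample would make it report the missing flag.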