Summary: | job_container/tmpfs & automounter causes first attempt to run a job on a node to fail | ||
---|---|---|---|
Product: | Slurm | Reporter: | Michael Pelletier <michael.v.pelletier> |
Component: | Accounting | Assignee: | Tim McMullan <mcmullan> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | ||
Version: | 21.08.7 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Raytheon Missile, Space and Airborne | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA Site: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Michael Pelletier
2022-08-22 09:28:27 MDT
I tried the following prolog script to no avail, so maybe my guess is wrong:

```bash
#!/bin/bash
for mountpoint in $(mount -l -t autofs | awk '{print $3}') ; do
    /bin/mount --make-rshared "$mountpoint"
done
```

I also tried adding an `ls -l $mountpoint/* >/dev/null 2>&1` with the idea of prodding the automounter, but that wound up with a "launch failed requeued held" problem.

---

Tim McMullan

This appears to actually be a known issue (bug 12567) that I've been working on getting fixed for an upcoming Slurm release.

(In reply to Michael Pelletier from comment #1)
> I tried the following prolog script to no avail, so maybe my guess is wrong:
>
> #!/bin/bash
> for mountpoint in $(mount -l -t autofs | awk '{print $3}') ; do
>     /bin/mount --make-rshared $mountpoint
> done

This would seem to help, but part of the current implementation actually forces everything to private after the prolog script runs.

> I also tried adding an "ls -l $mountpoint/* >/dev/null 2>&1" with the idea
> of prodding the automounter, but that wound up with a "launch failed
> requeued held" problem.

In an empty autofs mount point, the `ls -l` for me just errors out, which is probably why it fails; it would have to know all the mount points first and touch them all.

I'm not currently aware of an easy workaround for this inside Slurm, but there is some example code and some suggestions in the other bug. I do have a proof-of-concept fix that still requires more testing, and it will likely land in a future major release.

Let me know if you have other questions on this, but I will likely mark this ticket as a duplicate of 12567.

Thanks!
--Tim

---

Tim McMullan

As mentioned in the previous comment, this is a duplicate of 12567. Marking it as a duplicate now.

Thanks!
--Tim

*** This ticket has been marked as a duplicate of ticket 12567 ***
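The "forced back to private" behavior Tim describes can be observed from inside a job: each line of `/proc/self/mountinfo` carries optional propagation fields before the lone `-` separator, and a `shared:N` tag there marks membership in shared peer group N, while its absence means the mount is private. Below is a minimal sketch of parsing that field; the helper name and the sample mountinfo lines are illustrative, not taken from this ticket:

```shell
#!/bin/bash
# Report the propagation state (shared/private) of one mountinfo line.
# mountinfo fields: 1=mount ID, 2=parent ID, 3=major:minor, 4=root,
# 5=mount point, 6=mount options, then optional fields until a lone "-".
# A "shared:N" optional field means the mount propagates; none means private.
propagation_state() {
    awk '{
        state = "private"
        for (i = 7; i <= NF && $i != "-"; i++)
            if ($i ~ /^shared:/) state = "shared"
        print $5, state    # $5 is the mount point
    }' <<< "$1"
}

# Sample lines in /proc/self/mountinfo format (hypothetical autofs mounts):
propagation_state '36 24 0:32 / /misc rw,relatime shared:15 - autofs /etc/auto.misc rw'
propagation_state '37 24 0:33 / /net rw,relatime - autofs -hosts rw'
```

In a real job step, something like `grep autofs /proc/self/mountinfo` run under `srun` would show whether the autofs mounts still carry a `shared:` tag after the prolog's `--make-rshared` ran, which is one way to confirm the remount-to-private behavior.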