Ticket 14803

Summary: job_container/tmpfs & automounter cause first attempt to run a job on a node to fail
Product: Slurm
Reporter: Michael Pelletier <michael.v.pelletier>
Component: Accounting
Assignee: Tim McMullan <mcmullan>
Status: RESOLVED DUPLICATE
Severity: 3 - Medium Impact
Version: 21.08.7
Hardware: Linux
OS: Linux
Site: Raytheon Missile, Space and Airborne

Description Michael Pelletier 2022-08-22 09:28:27 MDT
I've implemented the job_container/tmpfs functionality to redirect node /tmp space to a large NFS fileserver to avoid limitations on local disk space.
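For reference, the configuration involved looks roughly like this (a sketch; the BasePath value is illustrative, not our production path):

slurm.conf:
    JobContainerType=job_container/tmpfs
    PrologFlags=Contain

job_container.conf:
    AutoBasePath=true
    BasePath=/nfs/slurm_tmp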

It appears that this functionality, in conjunction with autofs, causes the first attempt to run a job to fail on a machine where a required NFS directory has not yet been mounted. 

The symptom is the following in the application's log file:

/var/spool/slurmd/job32283/slurm_script: line 3: /apps/cst/cluster/CST2022/cst_settings: Too many levels of symbolic links
/var/spool/slurmd/job32283/slurm_script: line 4: /apps/cst/cluster/CST2022/cst_common: Too many levels of symbolic links

This came from an exec node where the /apps/cst automount filesystem was not yet mounted when the job started.

The tmpfs namespace is created successfully, but problems arise when the job accesses a directory that was not yet mounted when the namespace was created.

Subsequent runs on the same node work fine, because by the time the second run starts the automounter has finished mounting the filesystem.

The root cause appears to be that the automount daemon is not aware of namespaces: any new automounts go into the parent namespace by default, making them inaccessible to the job's namespace.

It looks like it's necessary to walk through the list of automounts on the system and mark each of them as a shared subtree by setting the MS_SHARED flag (for example, with the shell command mount --make-shared /apps). This allows changes to the parent namespace's automount points to be seen by the child.
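One way to verify the propagation setting on a mount point is to look for a "shared:N" tag in the optional fields of /proc/self/mountinfo; its absence means the mount is private. For example (the output line shown is illustrative):

grep ' /apps ' /proc/self/mountinfo
# e.g. "118 41 0:45 / /apps rw,relatime shared:65 - autofs /etc/auto.apps ..."
# the "shared:65" field indicates the mount is marked MS_SHARED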

My first workaround attempt is limited to a specific app, but I'll see if I can redesign that prolog script to be universally applicable, and post it here if it works.
Comment 1 Michael Pelletier 2022-08-22 10:18:06 MDT
I tried the following prolog script to no avail, so maybe my guess is wrong:

#!/bin/bash
# Mark every autofs mount point recursively shared so that automounts
# made in the parent namespace propagate into the job's namespace.
for mountpoint in $(mount -l -t autofs | awk '{print $3}') ; do
    /bin/mount --make-rshared "$mountpoint"
done

I also tried adding an "ls -l $mountpoint/* >/dev/null 2>&1" with the idea of prodding the automounter, but that wound up with a "launch failed requeued held" problem.
Comment 2 Tim McMullan 2022-08-22 12:08:12 MDT
This appears to actually be a known issue (bug 12567) that I've been working on getting fixed for an upcoming Slurm release.

(In reply to Michael Pelletier from comment #1)
> I tried the following prolog script to no avail, so maybe my guess is wrong:
> 
> #!/bin/bash
> # Mark every autofs mount point recursively shared so that automounts
> # made in the parent namespace propagate into the job's namespace.
> for mountpoint in $(mount -l -t autofs | awk '{print $3}') ; do
>     /bin/mount --make-rshared "$mountpoint"
> done

This would seem to help, but part of the current implementation actually forces everything to private after the prolog script runs.

> I also tried adding an "ls -l $mountpoint/* >/dev/null 2>&1" with the idea
> of prodding the automounter, but that wound up with a "launch failed
> requeued held" problem.

For me, an ls -l in an empty autofs mount point just errors out, which is probably why it fails. A workaround would have to know all the mount points first and touch each of them.

I'm not currently aware of an easy workaround for this inside Slurm, but there is some example code and some suggestions in the other bug.
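For what it's worth, one approach outside of Slurm is to pre-trigger the automounts in the parent namespace (from a root cron job or a node health check script, not the job prolog) so they are already mounted before the job's namespace is cloned. A rough sketch, with a hypothetical list of paths:

#!/bin/bash
# Hypothetical list of automount keys that jobs are known to need.
# stat-ing each key directory triggers the automounter; because this
# runs in the parent namespace, the resulting mounts are visible to
# any job namespace created afterwards.
for path in /apps/cst /apps/matlab ; do
    stat "$path" >/dev/null 2>&1
done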

I do have a proof-of-concept fix that still requires more testing, but it will likely land in a future major release.

Let me know if you have other questions on this, but I will likely mark this as a duplicate of 12567.

Thanks!
--Tim
Comment 3 Tim McMullan 2022-08-25 06:22:49 MDT
As mentioned in the previous comment, this is a dup of 12567.  Marking it as a duplicate now.

Thanks!
--Tim

*** This ticket has been marked as a duplicate of ticket 12567 ***
Comment 4 Michael Pelletier 2022-08-25 09:41:42 MDT
Thanks very much for your guidance, Tim! I'll take a closer look at bug 12567 and decide whether I have to revert from the container to the TmpFS= approach.
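
If I do revert, the fallback would just be pointing TmpFS= in slurm.conf at a node directory backed by the NFS server, something like this (path illustrative):

TmpFS=/nfs/slurm_tmp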