Created attachment 25576 [details]
Job script and output file

I'm testing the job_container/tmpfs plugin (https://slurm.schedmd.com/job_container.conf.html) on our test cluster, where I added these lines to slurm.conf:

PrologFlags=contain
JobContainerType=job_container/tmpfs

I created /etc/slurm/job_container.conf with just the contents:

BasePath=/scratch

since the compute nodes have a scratch file system:

$ df -Ph /scratch/
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-lv_scratch  384G  2.8G  382G   1% /scratch

The *.conf files are propagated to all nodes and the daemons are restarted.

I now submit jobs and they run, albeit with errors. The slurmd.log shows an error:

[2022-06-20T13:41:58.957] [47.batch] error: couldn't chdir to `/home/niflheim/ohni': Too many levels of symbolic links: going to /tmp instead

and the job output file shows the same error (see attachments).

It turns out that the user's home directory /home/niflheim/ohni, which is NFS auto-mounted using autofs, seems to be unavailable at the instant when slurmd starts the job. If the home directory was already mounted during a previous job, no error occurs. If I manually unmount the home directory, the error comes back.

IMHO, there seems to be a race condition between slurmd's start of the job and the NFS autofs mounting of home directories. It would be great if slurmd could postpone job starts for some milliseconds until the job's working directory had been mounted.

Before configuring the job_container/tmpfs plugin we didn't have any issues with the NFS home directories.
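For reference, the error can be reproduced at will along these lines (rough sketch using our example user's home directory; any job whose working directory is a currently unmounted autofs path should do):

$ sudo umount /home/niflheim/ohni        # or wait for the autofs idle timeout
$ sbatch --chdir=/home/niflheim/ohni --wrap='pwd'

The resulting job then hits the "couldn't chdir ... Too many levels of symbolic links: going to /tmp instead" error, whereas resubmitting while the home directory is still mounted works fine.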
Created attachment 25577 [details] Job output file
Basically this is the same issue in bug 12567 (which is private).

automount and job_container/tmpfs do not play well together: if a directory does not exist *when the tmpfs is created*, then that directory cannot be accessed by the job.

I thought that you could wait for the directory to get mounted from inside the prolog. However, the prolog runs inside of the job container, so by the time the prolog runs, if the directory does not exist it is already too late.

Ideally we would do this in InitScript. However, InitScript does not have the required information. (In 21.08 it doesn't have any SLURM_* environment variables set. In 22.05 we set SLURM_JOB_ID, SLURM_JOB_MOUNTPOINT_SRC, SLURM_CONF, and SLURMD_NODENAME, but that's not enough for what you need.) We will solve that in bug 13546 by passing more environment variables to InitScript. However, this will go into 23.02 at the earliest.

If you need autofs, then you might be able to make it work in PrologSlurmctld: from PrologSlurmctld, do something that tells autofs on the compute nodes to mount the user's home directory. (You have access to SLURM_JOB_NODELIST, SLURM_JOB_USER, SLURM_JOB_WORK_DIR (set only if --chdir was specified), and other environment variables in PrologSlurmctld, which should give you the needed information.) This is an ugly workaround, but it's all I can think of at the moment (rough sketch below).

Once we have a patch you could backport it and maintain it as a local patch until 23.02.
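For the PrologSlurmctld idea, something along these lines might work (untested sketch, not an officially supported recipe; it assumes SlurmUser can ssh non-interactively to the compute nodes - substitute pdsh/clush if you prefer - and that simply looking up the directory is enough to trigger the automount):

#!/bin/bash
# PrologSlurmctld sketch: pre-trigger the autofs mount of the job owner's
# home directory on every node allocated to the job.
HOMEDIR=$(getent passwd "$SLURM_JOB_USER" | cut -d: -f6)
for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    # "ls -d" on the autofs path should make automount mount it.
    ssh -o BatchMode=yes "$node" "ls -d '$HOMEDIR' >/dev/null 2>&1" &
done
wait
# Always exit 0 so a transient failure here doesn't requeue the job.
exit 0

If the user submitted with --chdir to some other automounted path, you would want to do the same with SLURM_JOB_WORK_DIR when it is set.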
If you can go without the job_container plugin for now, that would also be a valid workaround.
Hi Marshall,

Thanks for the disappointing news about NFS automount:

(In reply to Marshall Garey from comment #3)
> automount and job_container/tmpfs do not play well together:
> if a directory does not exist *when the tmpfs is created*, then that
> directory cannot be accessed by the job.

Do you know if some of the community SPANK plugins might possibly work despite this?

* https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir
* https://github.com/hpc2n/spank-private-tmp

Thanks,
Ole
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #5)
> Do you know if some of the community SPANK plugins might possibly work
> despite this?
>
> * https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir
> * https://github.com/hpc2n/spank-private-tmp

What I meant to ask is: could these SPANK plugins serve as alternatives to job_container/tmpfs?
(In reply to Marshall Garey from comment #3)
> Basically this is the same issue in bug 12567 (which is private).
>
> automount and job_container/tmpfs do not play well together:
> if a directory does not exist *when the tmpfs is created*, then that
> directory cannot be accessed by the job.

Can this be fixed? We use autofs for more than just the home directories, and I would prefer not to have to mount everything before a job starts.
So the main problem is that autofs isn't namespace aware. When I googled "autofs namespace" the first result was a mailing list thread about this:

https://patchwork.kernel.org/project/linux-fsdevel/patch/1460076663.3135.37.camel@themaw.net/

(1) Re the community SPANK plugins:
I don't know. However, looking at the README of both of those plugins, they both use mount namespaces, so it's possible that they're affected by this issue. I don't know whether they have successfully worked around the problem. You'll have to ask them.

(2) Can this be fixed?
We are looking into how to fix it, or at least work around it. Since this is discussed more in bug 12567, we're seeing if the site that opened that bug will allow it to be made public so you can post on it and follow the discussion.
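If you want to see this outside of Slurm, something like the following should reproduce it (untested sketch, run as root on a compute node; the home directory path is just the example from comment #1, and the directory must not already be mounted):

# unshare(1) creates a new mount namespace with private propagation by
# default, which is roughly what job_container/tmpfs sets up:
umount /home/niflheim/ohni 2>/dev/null
unshare --mount --propagation private ls /home/niflheim/ohni
# Expected to fail with something like "Too many levels of symbolic links",
# because the mount performed by the automount daemon in the host namespace
# never becomes visible inside the private namespace.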
(In reply to Marshall Garey from comment #8)
> So the main problem is that autofs isn't namespace aware. When I googled
> "autofs namespace" the first result was a mailing list thread about this:
>
> https://patchwork.kernel.org/project/linux-fsdevel/patch/1460076663.3135.37.camel@themaw.net/

Thanks. There are other discussions as well:

* autofs is now more reliable when handling namespaces, in https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.4_release_notes/bug_fixes_file_systems

* Using autofs in Docker containers and the "Too many levels of symbolic links" message, in the Red Hat knowledge base article https://access.redhat.com/articles/3104671

* Processes in mount namespaces hang or fail when accessing automount directories, in https://bugzilla.redhat.com/show_bug.cgi?id=1569146

> (1) Re the community SPANK plugins:
> I don't know. However, looking at the README of both of those plugins, they
> both use mount namespaces, so it's possible that they're affected by this
> issue. I don't know whether they have successfully worked around the
> problem. You'll have to ask them.

I have opened a question in https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir/issues/6

> (2) Can this be fixed?
> We are looking into how to fix it, or at least work around it. Since this
> is discussed more in bug 12567, we're seeing if the site that opened that
> bug will allow it to be made public so you can post on it and follow the
> discussion.

It would be really good if we could find fixes or workarounds, since I guess that many Slurm sites may be using NFS autofs for user home directories etc.
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #9) > * Using autofs in Docker containers and the "Too many levels of symbolic > links" message in bugzilla https://access.redhat.com/articles/3104671 Do you have access to this KB entry?
> https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir/issues6#issuecomment-1168789618

Maybe a hint for a fix?
(In reply to staeglis from comment #11)
> > https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir/issues6#issuecomment-1168789618
>
> Maybe a hint for a fix?

Correct URL:
https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir/issues/6#issuecomment-1168789618

> [job_container/tmpfs] clones a new namespace then remounts root recursive+private mode in it.

Thanks for that. My colleague who is working on this issue in bug 12567 realized that this is our main problem. It is nice to see how the University of Delaware handles it; I will make my colleague aware of this.

Ideally this would be part of the discussion in the other bug, but we are still waiting to see if the other site can open it up publicly.
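For anyone following along, a rough illustration of why that propagation mode matters (untested sketch with plain util-linux unshare(1), assuming / is mounted shared as on typical systemd-based systems, and that the autofs home directory is not yet mounted):

# Recursive-private (what job_container/tmpfs currently does): mounts
# performed later by the automount daemon in the host namespace do NOT
# propagate into the new namespace.
unshare --mount --propagation private ls /home/niflheim/ohni    # fails

# Recursive-slave: mount events from the host propagate into the namespace,
# so a host-side automount should become visible inside.
unshare --mount --propagation slave ls /home/niflheim/ohni

Presumably auto_tmpdir avoids the recursive-private remount, though I have not checked its source.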
(In reply to staeglis from comment #10)
> (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #9)
> > * Using autofs in Docker containers and the "Too many levels of symbolic
> >   links" message, in the Red Hat knowledge base article
> >   https://access.redhat.com/articles/3104671
>
> Do you have access to this KB entry?

I do not. Even my colleague with a developer account cannot access it. It looks like you need a subscription.
(In reply to Marshall Garey from comment #13)
> > Do you have access to this KB entry?
>
> I do not. Even my colleague with a developer account cannot access it. It
> looks like you need a subscription.

I can send the KB text privately to the SchedMD developer if desired (I already communicated with staeglis@informatik.uni-freiburg.de).

/Ole
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #14)
> I can send the KB text privately to the SchedMD developer if desired (I
> already communicated with staeglis@informatik.uni-freiburg.de).
>
> /Ole

One of my colleagues was in the end able to access those pages and took some screenshots for us. Thank you for the offer, though.
As an alternative to the job_container/tmpfs plugin, I've now built and tested the SPANK plugin https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir. This plugin provides bind mounts of configurable directories, and it also works correctly with NFS automounted user home directories!

I've documented my tests in:
https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir/issues/6
and collected the entire configuration in:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#temporary-job-directories

Hopefully the job_container/tmpfs plugin can take inspiration from auto_tmpdir and implement a solution that works with NFS autofs home directories.
I would prefer to use the job_container/tmpfs plugin as it is an official part of SLURM and doesn't need an epilog script for cleaning up the nodes. So it would be very nice indeed if this issue could be fixed soon.
(In reply to staeglis from comment #17)
> I would prefer to use the job_container/tmpfs plugin as it is an official
> part of SLURM and doesn't need an epilog script for cleaning up the nodes.

I agree with your desire for an official Slurm plugin. However, the auto_tmpdir SPANK plugin doesn't need an epilog script for cleaning up the nodes either, AFAICT. Just create plugstack.conf (sketch below) and restart the slurmd's; this worked for me.
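For completeness, the plugstack.conf is just a one-liner along these lines (illustrative only; the install path depends on where the plugin was built, and the plugin-specific options such as bind-mount directories are described in the auto_tmpdir README):

# /etc/slurm/plugstack.conf
required /usr/lib64/slurm/auto_tmpdir.so

After distributing the file, restart slurmd on the nodes; the plugin then creates and removes the per-job directories itself.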
Oh, my fault. It seems that I mixed it up with this one: https://github.com/hpc2n/spank-private-tmp
Thanks for that information, Ole. I passed it along to my colleague who is working on bug 12567.

I'm marking this bug as a duplicate of bug 12567 since it is now public. Feel free to take all the discussion to that bug.

*** This ticket has been marked as a duplicate of ticket 12567 ***