Created attachment 25576 [details]
Job script and output file

I'm testing the job_container/tmpfs plugin (https://slurm.schedmd.com/job_container.conf.html) on our test cluster, where I added these lines to slurm.conf:

PrologFlags=contain
JobContainerType=job_container/tmpfs

I created /etc/slurm/job_container.conf with just the contents:

BasePath=/scratch

since the compute nodes have a scratch file system:

$ df -Ph /scratch/
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-lv_scratch  384G  2.8G  382G   1% /scratch

The *.conf files are propagated to all nodes and the daemons are restarted.

I now submit jobs and they run, albeit with errors. The slurmd.log shows an error:

[2022-06-20T13:41:58.957] [47.batch] error: couldn't chdir to `/home/niflheim/ohni': Too many levels of symbolic links: going to /tmp instead

and the job output file shows the same error (see attachments).

It turns out that the user's home directory /home/niflheim/ohni, which is NFS auto-mounted using autofs, seems to be unavailable at the instant when slurmd starts the job. If the home directory was already mounted during a previous job, no error occurs. If I manually unmount the home directory, the error comes back.

IMHO, there seems to be a race condition between slurmd's start of the job and the NFS autofs mounting of home directories. It would be great if slurmd could postpone job starts for some milliseconds until the job's working directory had been mounted.

Before configuring the job_container/tmpfs plugin we didn't have any issues with the NFS home directories.
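For reference, the error can be reproduced at will along these lines (rough sketch using our example user's home directory; any job whose working directory is a currently unmounted autofs path should do):

$ sudo umount /home/niflheim/ohni        # or wait for the autofs idle timeout
$ sbatch --chdir=/home/niflheim/ohni --wrap='pwd'

The resulting job then hits the "couldn't chdir ... Too many levels of symbolic links: going to /tmp instead" error, whereas resubmitting while the home directory is still mounted works fine.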
Created attachment 25577 [details] Job output file
Basically this is the same issue in bug 12567 (which is private).

automount and job_container/tmpfs do not play well together: if a directory does not exist *when the tmpfs is created*, then that directory cannot be accessed by the job.

I thought that you could wait for the directory to get mounted from inside the prolog. However, the prolog runs inside of the job container, so by the time the prolog runs, if the directory does not exist it is already too late.

Ideally we would do this in InitScript. However, InitScript does not have the required information. (In 21.08 it doesn't have any SLURM_* environment variables set. In 22.05 we set SLURM_JOB_ID, SLURM_JOB_MOUNTPOINT_SRC, SLURM_CONF, and SLURMD_NODENAME, but that's not enough for what you need.) We will solve that in bug 13546 by passing more environment variables to InitScript. However, this will go into 23.02 at the earliest.

If you need autofs, then you might be able to make it work in PrologSlurmctld: from PrologSlurmctld, do something that tells autofs on the compute nodes to mount the user's home directory. (You have access to SLURM_JOB_NODELIST, SLURM_JOB_USER, SLURM_JOB_WORK_DIR (set only if --chdir was specified), and other environment variables in PrologSlurmctld, which should give you the needed information.) This is an ugly workaround, but it's all I can think of at the moment (rough sketch below).

Once we have a patch you could backport it and maintain it as a local patch until 23.02.
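For the PrologSlurmctld idea, something along these lines might work (untested sketch, not an officially supported recipe; it assumes SlurmUser can ssh non-interactively to the compute nodes - substitute pdsh/clush if you prefer - and that simply looking up the directory is enough to trigger the automount):

#!/bin/bash
# PrologSlurmctld sketch: pre-trigger the autofs mount of the job owner's
# home directory on every node allocated to the job.
HOMEDIR=$(getent passwd "$SLURM_JOB_USER" | cut -d: -f6)
for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    # "ls -d" on the autofs path should make automount mount it.
    ssh -o BatchMode=yes "$node" "ls -d '$HOMEDIR' >/dev/null 2>&1" &
done
wait
# Always exit 0 so a transient failure here doesn't requeue the job.
exit 0

If the user submitted with --chdir to some other automounted path, you would want to do the same with SLURM_JOB_WORK_DIR when it is set.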
If you can go without the job_container plugin for now, that would also be a valid workaround.
Hi Marshall,

Thanks for the disappointing news about NFS automount:

(In reply to Marshall Garey from comment #3)
> automount and job_container/tmpfs do not play well together:
> if a directory does not exist *when the tmpfs is created*, then that
> directory cannot be accessed by the job.

Do you know if some of the community SPANK plugins might possibly work despite this?

* https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir
* https://github.com/hpc2n/spank-private-tmp

Thanks,
Ole
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #5)
> Do you know if some of the community SPANK plugins might possibly work
> despite this?
>
> * https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir
> * https://github.com/hpc2n/spank-private-tmp

What I meant to ask is: could these SPANK plugins serve as alternatives to job_container/tmpfs?
(In reply to Marshall Garey from comment #3)
> Basically this is the same issue in bug 12567 (which is private).
>
> automount and job_container/tmpfs do not play well together:
> if a directory does not exist *when the tmpfs is created*, then that
> directory cannot be accessed by the job.

Can this be fixed? We use autofs for more than just the home directories, and I would prefer not to have to mount everything before a job starts.
So the main problem is that autofs isn't namespace aware. When I googled "autofs namespace" the first result was a mailing list thread about this:

https://patchwork.kernel.org/project/linux-fsdevel/patch/1460076663.3135.37.camel@themaw.net/

(1) Re the community SPANK plugins:
I don't know. However, looking at the README of both of those plugins, they both use mount namespaces, so it's possible that they're affected by this issue. I don't know whether they have successfully worked around the problem. You'll have to ask them.

(2) Can this be fixed?
We are looking into how to fix it, or at least work around it. Since this is discussed more in bug 12567, we're seeing if the site that opened that bug will allow it to be made public so you can post on it and follow the discussion.
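If you want to see this outside of Slurm, something like the following should reproduce it (untested sketch, run as root on a compute node; the home directory path is just the example from comment #1, and the directory must not already be mounted):

# unshare(1) creates a new mount namespace with private propagation by
# default, which is roughly what job_container/tmpfs sets up:
umount /home/niflheim/ohni 2>/dev/null
unshare --mount --propagation private ls /home/niflheim/ohni
# Expected to fail with something like "Too many levels of symbolic links",
# because the mount performed by the automount daemon in the host namespace
# never becomes visible inside the private namespace.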
(In reply to Marshall Garey from comment #8)
> So the main problem is that autofs isn't namespace aware. When I googled
> "autofs namespace" the first result was a mailing list thread about this:
>
> https://patchwork.kernel.org/project/linux-fsdevel/patch/1460076663.3135.37.camel@themaw.net/

Thanks. There are other discussions as well:

* autofs is now more reliable when handling namespaces, in https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.4_release_notes/bug_fixes_file_systems

* Using autofs in Docker containers and the "Too many levels of symbolic links" message, in the Red Hat knowledge base article https://access.redhat.com/articles/3104671

* Processes in mount namespaces hang or fail when accessing automount directories, in https://bugzilla.redhat.com/show_bug.cgi?id=1569146

> (1) Re the community SPANK plugins:
> I don't know. However, looking at the README of both of those plugins, they
> both use mount namespaces, so it's possible that they're affected by this
> issue. I don't know whether they have successfully worked around the
> problem. You'll have to ask them.

I have opened a question in https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir/issues/6

> (2) Can this be fixed?
> We are looking into how to fix it, or at least work around it. Since this
> is discussed more in bug 12567, we're seeing if the site that opened that
> bug will allow it to be made public so you can post on it and follow the
> discussion.

It would be really good if we could find fixes or workarounds, since I guess that many Slurm sites may be using NFS autofs for user home directories etc.
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #9) > * Using autofs in Docker containers and the "Too many levels of symbolic > links" message in bugzilla https://access.redhat.com/articles/3104671 Do you have access to this KB entry?
> https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir/issues6#issuecomment-1168789618

Maybe a hint for a fix?
(In reply to staeglis from comment #11)
> > https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir/issues6#issuecomment-1168789618
>
> Maybe a hint for a fix?

Correct URL:
https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir/issues/6#issuecomment-1168789618

> [job_container/tmpfs] clones a new namespace then remounts root recursive+private mode in it.

Thanks for that. My colleague who is working on this issue in bug 12567 realized that this is our main problem. It is nice to see how the University of Delaware handles it; I will make my colleague aware of this.

Ideally this would be part of the discussion in the other bug, but we are still waiting to see if the other site can open it up publicly.
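For anyone following along, a rough illustration of why that propagation mode matters (untested sketch with plain util-linux unshare(1), assuming / is mounted shared as on typical systemd-based systems, and that the autofs home directory is not yet mounted):

# Recursive-private (what job_container/tmpfs currently does): mounts
# performed later by the automount daemon in the host namespace do NOT
# propagate into the new namespace.
unshare --mount --propagation private ls /home/niflheim/ohni    # fails

# Recursive-slave: mount events from the host propagate into the namespace,
# so a host-side automount should become visible inside.
unshare --mount --propagation slave ls /home/niflheim/ohni

Presumably auto_tmpdir avoids the recursive-private remount, though I have not checked its source.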
(In reply to staeglis from comment #10)
> (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #9)
> > * Using autofs in Docker containers and the "Too many levels of symbolic
> >   links" message, in the Red Hat knowledge base article
> >   https://access.redhat.com/articles/3104671
>
> Do you have access to this KB entry?

I do not. Even my colleague with a developer account cannot access it. It looks like you need a subscription.
(In reply to Marshall Garey from comment #13)
> > Do you have access to this KB entry?
>
> I do not. Even my colleague with a developer account cannot access it. It
> looks like you need a subscription.

I can send the KB text privately to the SchedMD developer if desired (I already communicated with staeglis@informatik.uni-freiburg.de).

/Ole
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #14)
> I can send the KB text privately to the SchedMD developer if desired (I
> already communicated with staeglis@informatik.uni-freiburg.de).
>
> /Ole

One of my colleagues was in the end able to access those pages and took some screenshots for us. Thank you for the offer, though.
As an alternative to the job_container/tmpfs plugin, I've now built and tested the SPANK plugin https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir. This plugin provides bind mounts of configurable directories, and it also works correctly with NFS automounted user home directories!

I've documented my tests in:
https://github.com/University-of-Delaware-IT-RCI/auto_tmpdir/issues/6
and collected the entire configuration in:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#temporary-job-directories

Hopefully the job_container/tmpfs plugin can take inspiration from auto_tmpdir and implement a solution that works with NFS autofs home directories.
I would prefer to use the job_container/tmpfs plugin as it is an official part of SLURM and doesn't need an epilog script for cleaning up the nodes. So it would be very nice indeed if this issue could be fixed soon.
(In reply to staeglis from comment #17)
> I would prefer to use the job_container/tmpfs plugin as it is an official
> part of SLURM and doesn't need an epilog script for cleaning up the nodes.

I agree with your desire for an official Slurm plugin. However, the auto_tmpdir SPANK plugin doesn't need an epilog script for cleaning up the nodes either, AFAICT. Just create plugstack.conf (sketch below) and restart the slurmd's; this worked for me.
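For completeness, the plugstack.conf is just a one-liner along these lines (illustrative only; the install path depends on where the plugin was built, and the plugin-specific options such as bind-mount directories are described in the auto_tmpdir README):

# /etc/slurm/plugstack.conf
required /usr/lib64/slurm/auto_tmpdir.so

After distributing the file, restart slurmd on the nodes; the plugin then creates and removes the per-job directories itself.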
Oh, my fault. It seems that I mixed it up with this one: https://github.com/hpc2n/spank-private-tmp
Thanks for that information, Ole. I passed it along to my colleague who is working on bug 12567.

I'm marking this bug as a duplicate of bug 12567 since it is now public. Feel free to take all the discussion to that bug.

*** This ticket has been marked as a duplicate of ticket 12567 ***