Tested on the current master branch, commit a10619bf1d482e189fc3f0dceed5ef459b410667.

Running Slurm in a single-node test config, on Ubuntu 20.04 with kernel 5.4.0-73-generic.

$ cat /etc/slurm/job_container.conf
BasePath=/var/run/slurm
AutoBasePath=true

When starting a new job, it has access to a per-job filesystem mounted in /tmp:

$ srun --pty bash
$ findmnt /tmp
TARGET SOURCE             FSTYPE OPTIONS
/tmp   tmpfs[/slurm/8/.8] tmpfs  rw,nosuid,nodev,noexec,relatime,size=32594660k,mode=755

And from this job you can write to /tmp:

$ echo $(hostname) > /tmp/test ; cat /tmp/test
ioctl

From another terminal running in the root mount namespace, you can indeed see that there is a mount for this filesystem:

$ findmnt -R /run/slurm
TARGET              SOURCE                 FSTYPE OPTIONS
/run/slurm          tmpfs[/slurm]          tmpfs  rw,nosuid,nodev,noexec,relatime,size=32594660k,mode=755
└─/run/slurm/8/.ns  nsfs[mnt:[4026532561]] nsfs   rw

Now, if you stop slurmd normally while the job is still running, /run/slurm will be unmounted:

$ findmnt -R /run/slurm ; echo $?
1

And then, restarting slurmd will create new mounts, but with a different mount namespace (4026532561 vs 4026532562):

$ findmnt -R /run/slurm
TARGET              SOURCE                 FSTYPE OPTIONS
/run/slurm          tmpfs[/slurm]          tmpfs  rw,nosuid,nodev,noexec,relatime,size=32594660k,mode=755
└─/run/slurm/8/.ns  nsfs[mnt:[4026532562]] nsfs   rw

As a result, /tmp is not usable from the existing job anymore:

$ ls /tmp
ls: cannot open directory '/tmp': Permission denied
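
For anyone trying to reproduce this, the namespace IDs can also be read directly instead of being picked out of the findmnt output. The following is only a sketch: mnt_ns_of is a hypothetical helper name, not something shipped with Slurm, and it assumes a Linux /proc where each process exposes a /proc/<pid>/ns/mnt symlink.

```shell
# Hypothetical helper (not part of Slurm): print the mount namespace of a process.
# Assumes Linux, where /proc/<pid>/ns/mnt is a symlink whose target
# looks like "mnt:[4026532561]". With no argument, it inspects the caller.
mnt_ns_of() {
    readlink "/proc/${1:-self}/ns/mnt"
}
```

Running `mnt_ns_of $$` from the job's shell and comparing the result with the SOURCE column of /run/slurm/<jobid>/.ns before and after the slurmd restart should show whether the job is still in the namespace that slurmd has bind-mounted. Reading the link for another user's process may require root.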
Actually, I don't think it's related to the mount namespace (but it's likely another separate bug).

I think the issue is this line hardcoding UID 0:
https://github.com/SchedMD/slurm/blob/a10619bf1d482e189fc3f0dceed5ef459b410667/src/plugins/job_container/tmpfs/job_container_tmpfs.c#L172

This will cause the permissions of the job's /tmp folder to change underneath it, here:
https://github.com/SchedMD/slurm/blob/a10619bf1d482e189fc3f0dceed5ef459b410667/src/plugins/job_container/tmpfs/job_container_tmpfs.c#L550

From the job, before the slurmd restart:

$ ls -ld /tmp/
drwx------ 2 fabecassis root 60 May 20 16:01 /tmp/

After the slurmd restart:

$ ls -ld /tmp/
drwx------ 2 root root 60 May 20 16:01 /tmp/
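
The ownership change can also be checked in a scriptable way. This is a minimal sketch, assuming GNU coreutils stat (for the -c option); check_owner is a hypothetical helper name introduced here for illustration only.

```shell
# Hypothetical helper (not part of Slurm): compare a directory's owner
# with an expected numeric UID. Assumes GNU stat, which supports -c '%u'.
# Returns 0 on match, 1 on mismatch, 2 if stat itself fails.
check_owner() {
    actual_uid=$(stat -c '%u' "$1") || return 2
    if [ "$actual_uid" = "$2" ]; then
        echo "ok: $1 is owned by uid $actual_uid"
    else
        echo "mismatch: $1 is owned by uid $actual_uid, expected $2"
        return 1
    fi
}
```

From inside the job, `check_owner /tmp "$(id -u)"` would be expected to succeed before the slurmd restart and, given the ls output above, to report a mismatch against uid 0 afterwards.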
This seems related to this other bug I reported, where the job's /tmp remains owned by root in some cases (basically until srun is run): https://bugs.schedmd.com/show_bug.cgi?id=11609
(In reply to Felix Abecassis from comment #1)
> Actually, I don't think it's related to the mount namespace (but it's likely
> another separate bug).

You are correct: /tmp ending up owned by root after the restart and the issue with /dev/shm appear to be different problems.

(In reply to Jake Rundall from comment #3)
> This seems related to this other bug I reported, where the job's /tmp
> remains owned by root in some cases (basically until srun is run):
> https://bugs.schedmd.com/show_bug.cgi?id=11609

It's related to the eventual fix for 11609. I'll give more details on that particular bug there in a moment.

Short version is that I've reproduced this issue on master and written a patch for it; it's just awaiting review now.

Thanks!
--Tim
This issue has been resolved on master (https://github.com/SchedMD/slurm/commit/77eb6cbd2397c3bcb7b3007080942db291c6d467). Thanks for catching this! --Tim