Summary: | /tmp from job_container/tmpfs not usable after slurmd restart | ||
---|---|---|---|
Product: | Slurm | Reporter: | Felix Abecassis <fabecassis> |
Component: | slurmd | Assignee: | Tim McMullan <mcmullan> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | jbernauer, lyeager, rundall |
Version: | 21.08.x | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | NVIDIA (PSLA) | ||
Version Fixed: | 21.08pre1 | Target Release: | --- |
Description
Felix Abecassis
2021-05-20 16:46:11 MDT
Actually, I don't think it's related to the mount namespace (but it's likely another separate bug).

I think the issue is this line hardcoding UID 0:
https://github.com/SchedMD/slurm/blob/a10619bf1d482e189fc3f0dceed5ef459b410667/src/plugins/job_container/tmpfs/job_container_tmpfs.c#L172

This will cause the permissions of the job's /tmp folder to change underneath it, here:
https://github.com/SchedMD/slurm/blob/a10619bf1d482e189fc3f0dceed5ef459b410667/src/plugins/job_container/tmpfs/job_container_tmpfs.c#L550

From the job, before the slurmd restart:
$ ls -ld /tmp/
drwx------ 2 fabecassis root 60 May 20 16:01 /tmp/

After the slurmd restart:
$ ls -ld /tmp/
drwx------ 2 root root 60 May 20 16:01 /tmp/

This seems related to this other bug I reported, where the job's /tmp remains owned by root in some cases (basically until srun is run):
https://bugs.schedmd.com/show_bug.cgi?id=11609

(In reply to Felix Abecassis from comment #1)
> Actually, I don't think it's related to the mount namespace (but it's likely
> another separate bug).

You are correct: the /tmp ending up owned by root after the restart and the issue with /dev/shm appear to be different problems.

(In reply to Jake Rundall from comment #3)
> This seems related to this other bug I reported, where the job's /tmp
> remains owned by root in some cases (basically until srun is run):
> https://bugs.schedmd.com/show_bug.cgi?id=11609

It's related to the eventual fix for 11609. I'll give more details on that particular bug there in a moment.

Short version is that I've reproduced this issue on master and written a patch for it; it's just awaiting review now.

Thanks!
--Tim

This issue has been resolved on master (https://github.com/SchedMD/slurm/commit/77eb6cbd2397c3bcb7b3007080942db291c6d467).

Thanks for catching this!
--Tim
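
To illustrate the ownership problem described in this report, here is a minimal sketch of the general idea: when a daemon re-initializes a per-job directory, it can pass the job's UID instead of a hardcoded 0, and skip the chown() when the owner is already correct. This is not the actual Slurm patch; `ensure_owner` and its parameters are hypothetical names chosen for this example, and only standard POSIX calls (`stat`, `chown`) are used.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Slurm): make sure `path` is owned by
 * `job_uid`, leaving the group unchanged.
 * Returns 0 on success, -1 on error (errno is set by stat() or chown()).
 */
int ensure_owner(const char *path, uid_t job_uid)
{
	struct stat st;

	assert(path != NULL);

	if (stat(path, &st) < 0)
		return -1;

	/* Already owned by the job's user: nothing to change. */
	if (st.st_uid == job_uid)
		return 0;

	/* A group of (gid_t) -1 tells chown() to keep the current group. */
	return chown(path, job_uid, (gid_t) -1);
}
```

Calling a helper like this with the job's UID rather than 0 on the restart path would leave an already-correct /tmp untouched. Changing the owner to a different user still requires the caller to be privileged, as slurmd normally is.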