Description
Miguel Esteva
2023-02-12 17:15:40 MST
Can you share slurmd logs from the node where your files are not being cleaned up?
cheers, Marcin
In addition to the comment from Marcin, please follow these steps:
1. Set DebugFlags=jobcontainer + debug2
2. ls -lahR of the path where the file is not being removed
3. Attach the job_container.conf
4. scontrol show job of the affected job
5. Attach the sbatch script, if any, and the job submission arguments
The files should be cleaned up, even if there is a slight delay.
Hi Marcin and Jason,
Saw nothing in the logs, and indeed I bumped slurmd logging to debug3 (then lowered it to debug2). Will add JobContainer to the debug flags and share what I find. Thank you.
Created attachment 28880 [details]
slurmd node logs
Hi Marcin and Jason,
Some logs from nodes that didn't clean up are attached.
Our job_container.conf is simply:
AutoBasePath=true
BasePath=/vast/scratch/tmp
I can see these entries:
[2023-02-16T09:25:16.576] [1362352.extern] debug2: _rm_data: could not remove path: /vast/scratch/tmp/1362352: Device or resource busy
However, even when that entry is shown, I have not been able to reproduce the problem with my own jobs, even when they were cancelled halfway through. Directories get cleaned up OK. What I see from the jobs run by the users is that the majority got cancelled.
I should mention that the mount used for BasePath is NFSv3 that we use for scratch:
vast:/scratch /vast/scratch nfs vers=3,relatime,nodiratime,acregmax=3,acdirmin=3,acdirmax=3,mountproto=tcp
/etc/slurm/plugins/conf.d/tmpdir.conf is present but the contents are all commented out.
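As an aside, when a directory hits "Device or resource busy" like this, the following kind of check can show what is still holding it open. This is only a debugging sketch, assuming lsof and fuser are available on the node and reusing the job directory above purely as an example:
lsof +D /vast/scratch/tmp/1362352
fuser -vm /vast/scratch/tmp/1362352
lsof +D lists open files under the directory, while fuser -vm lists processes using the filesystem that contains it (on an NFS mount, that is the whole mount).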
Kind regards,
Miguel
Created attachment 28883 [details]
Contents of the basepath directory
Changed the configuration of job_container.conf so each node has a unique directory.
The template used to generate the file:
AutoBasePath=true
BasePath=/vast/scratch/tmp/dev/$(hostname -s)
/etc/slurm/job_container.conf on each node:
il-n01: AutoBasePath=true
il-n01: BasePath=/vast/scratch/tmp/dev/il-n01
em-n01: AutoBasePath=true
em-n01: BasePath=/vast/scratch/tmp/dev/em-n01
milton-sml-02: AutoBasePath=true
milton-sml-02: BasePath=/vast/scratch/tmp/dev/milton-sml-02
cl-n01: AutoBasePath=true
cl-n01: BasePath=/vast/scratch/tmp/dev/cl-n01
milton-sml-01: AutoBasePath=true
milton-sml-01: BasePath=/vast/scratch/tmp/dev/milton-sml-01
> Changed the configuration of job_container.conf so each node has a unique directory:
As you discovered, each node needs to have its own unique directory for BasePath. In Slurm 23.02, we added wildcard expansion for hostname (%h) and nodename (%n) in BasePath so that you can have the same job_container.conf file on every node, which should make this easier if your BasePath is on a shared filesystem.
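For example (a minimal sketch reusing the BasePath from this report), a single shared job_container.conf could then read:
AutoBasePath=true
BasePath=/vast/scratch/tmp/dev/%n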
Can you let us know if you see directories not cleaned up since you made the change?
Created attachment 28905 [details]
Post job_container.conf change logs
Unfortunately, we still noticed a couple of jobs not cleaning up their directories. The logs from the nodes are attached (jobs 1362429, 1362431).
em-n01/1362429:
total 0
drwx------ 2 root root 4096 Feb 16 16:54 .
drwxr-xr-x 2 root root 4096 Feb 17 09:10 ..
em-n01/1362431:
total 0
drwx------ 2 root root 4096 Feb 16 16:36 .
drwxr-xr-x 2 root root 4096 Feb 17 09:10 ..
Cheers.
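As an aside, a quick way to sweep a per-node BasePath layout like the one above for leftover job directories is something along these lines (the path and the one-hour age cut-off are only an illustration):
find /vast/scratch/tmp/dev -mindepth 2 -maxdepth 2 -type d -mmin +60
This lists job-level directories more than an hour old.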
(In reply to Miguel Esteva from comment #12)
> Created attachment 28905 [details]
> Post job_container.conf change logs
>
> Unfortunately, we still noticed a couple of jobs not cleaning up their
> directories. The logs from the nodes are attached (jobs 1362429, 1362431).
>
> em-n01/1362429:
> total 0
> drwx------ 2 root root 4096 Feb 16 16:54 .
> drwxr-xr-x 2 root root 4096 Feb 17 09:10 ..
>
> em-n01/1362431:
> total 0
> drwx------ 2 root root 4096 Feb 16 16:36 .
> drwxr-xr-x 2 root root 4096 Feb 17 09:10 ..
>
> Cheers.
Thanks, I'm looking into it.
I may have found a possible cause for this.
Thank you! We have since updated our test cluster to 23.02.0-0rc1. We can see that some directories were not cleaned up after the update.
Hi,
We have pushed commit 99a6d87322, which fixes a situation where job directories would not be removed if the namespace mount was already gone. From your uploaded slurmd log, I know that this happened on your node.
To test this, you should be able to either locally apply this commit to your test system or upgrade your test system to the latest commit on the slurm-23.02 branch. Will you be able to test this on your test system so we can confirm whether there are other outstanding issues that cause the job container directories to not be cleaned up?
One additional issue that we observed is umount2() failing with errno == ESTALE, which can happen on shared filesystems. If you observe this happening and job container directories are not cleaned up right away, then once the filesystem issue is resolved, restarting slurmd should cause the job container directories to be cleaned up.
Thank you. We have updated our test cluster to 23.02. Will report back once we run more jobs.
Hi Miguel,
Have you had a chance to run jobs in your Slurm 23.02 test environment to verify whether job_container/tmpfs is cleaning the job directories?
Hi Marshall,
Ran a simple array job that uses mktemp to create a file inside a directory. The file is then kept open with tail until the job times out. So far I have not seen any tmp directories left behind. Are there any other tests you would recommend?
Cheers,
Miguel
Any test where you previously noticed leftover job directories is fine. I'm marking this as fixed ahead of 23.02.1. Let us know if you encounter more issues.
Created attachment 29596 [details]
Node logs 23.02.1
Hi,
Updated to v23.02.1. Still saw some directories left behind after a user ran some jobs in our test cluster. Attaching logs.
Cheers,
Miguel
Hi Miguel,
First, just to confirm: are your mountpoints private or shared? I know in comment 7 you made each mountpoint unique to each node, and I just want to confirm this is still the case. I also want to remind you that in Slurm 23.02 we added wildcard expansion for hostname (%h) and nodename (%n) in BasePath, so you can have the same job_container.conf file on every node; this should make things easier for you to maintain if your BasePath is on a shared filesystem.
I just uploaded attachment 29647 [details] (debug patch v2). This includes two commits that will help us with debugging:
* Add an error message for a failed unmount. Previously, a failed unmount in this spot did not log a message. This will hopefully let us see why unmounts are failing and diagnose the problem.
* Prevent unmounts that will always fail in private mode; only do them in shared mode. This will silence most of the debug2 log messages that we saw in your log file:
cl-n01.log:[2023-03-30T10:33:19.281] [1363882.extern] debug2: job_container/tmpfs: _clean_job_basepath: failed to unmount /vast/scratch/tmp/dev/cl-n01/1363885 for job 1363882
Can you apply this patch (to slurmd) and upload a slurmd log file with failed unmounts with this patch applied?
Hi Marshall,
We set up Ansible to create a unique directory and job_container.conf for each node. I can give the %n wildcard a go; it will save us a couple of tasks.
Thanks for the patches; I will apply them and ask the user to run again once they are in place.
Cheers and thanks,
Miguel
Created attachment 29659 [details]
Apr04 logs
Applied the patch and rebuilt. The user ran the same workload. Now we see that most (or all) of the directories were left behind. I quickly skimmed some logs and couldn't see much. Attaching slurmd logs from the nodes that had directories left in their BasePath.
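As an aside, with the debug patch applied, a quick way to scan for the new unmount error lines is something like the following (the log path is an assumption; adjust it to your SlurmdLogFile setting):
grep -E 'failed to unmount|_clean_job_basepath' /var/log/slurm/slurmd.log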
Thank you!
Created attachment 29679 [details]
debug patch v4
Hi,
Thanks for that extra data. I don't see any error logs, so the umount didn't fail. We added one more debug patch to replace the one you have. It has the same two commits as your current debug patch, plus one more commit that adds error logs for failed rmdir calls. I suspect that is what is failing, since your directories are left behind.
Thanks!
Created attachment 29691 [details]
patch v4 logs
Hi Marshall,
Implemented the patch and captured some logs from affected nodes. Additionally, added some audit logs from the NFS side.
This might have to do with the namespace being created by unshare with the same mount flags used for the local /tmp (our NFS mounts have some additional options, listed in an earlier comment).
I didn't have much time to check the logs, but I do see some .nfs files failing to clear (processes in the job step still using those files, perhaps?), resulting in "Device or resource busy" errors; consequently, rmdir fails. For some Slurm jobs that create jobs in rapid succession, we instruct users to add a small delay to allow NFS file entries to propagate.
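For context, these .nfs* entries are the NFS client's "silly rename" placeholders for files that were deleted while a process still had them open; the directory cannot be removed until the last holder exits. A way to trace one back to its holder, assuming lsof is available and reusing one of the leftover directories from earlier purely as an example:
lsof +D /vast/scratch/tmp/dev/em-n01/1362429
Once the holding process exits, the .nfs* file disappears and the rmdir can succeed.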
Cheers.
Can you run lsof on il-n01, or another node that still has lingering job container directories?
Created attachment 29722 [details]
apr06 lsof log
Attaching logs from a node that still had files open under slurmstepd after the job ended.
This shows that something started by slurmstepd set up a conda environment, and that a process is still holding files in the job's tmpfs directory. What is setting up this conda environment? A task prolog or the job script? How exactly is it being started?
Created attachment 29902 [details]
Jobs description
(In reply to Marshall Garey from comment #52)
> This shows that something started by slurmstepd set up a conda environment,
> and that a process is still holding files in the job's tmpfs directory. What
> is setting up this conda environment? A task prolog or the job script? How
> exactly is it being started?
An apology for the late reply. Those nodes do have prolog/epilog scripts, but they do not run anything (they just exit 0). Only our GPU nodes currently use prologs and epilogs, to set up DCGM.
Those jobs are submitted with the Nextflow workflow manager. A document detailing what each job ran is attached. Most of the files left behind on the nodes are Java hsperfdata files (I attempted to run with -XX:-UsePerfData but could not tell whether it made a difference) and the conda files set up during the batch step.
Cheers.
If the processes started by the batch step are not killed when the batch step exits, those processes can still hold files open in the job's local /tmp directory. What is your ProctrackType in slurm.conf?
Created attachment 29975 [details]
cgroup.conf
We use proctrack/cgroup. The configuration is attached.
For your information, we pushed the patches that we gave you upstream. The fix for the erroneous debug2 log is in ahead of 23.02.2:
commit 7bb373e3c9
And the added logs are in master:
commit 44007fb569
commit 213c415993
Sorry for the long delay in responding to you. For further debugging, can you get the cgroups (with cat /proc/<pid>/cgroup) of the nextflow processes on a node where the tmp directory was not cleaned up? For example, in the apr06_logs/cl-n01-lsof.txt file that you uploaded, you have PID 5292:
slurmstep 5292 root 6r DIR 0,42 4096 17868314191245455793 ./1366482
slurmstep 5292 root 9r DIR 0,42 4096 15018413718779533490 ./1366482/.1366482
slurmstep 5292 root 10r DIR 0,42 4096 10784177319922843380 ./1366482/.1366482/_tmp
slurmstep 5292 root 11r DIR 0,42 4096 8197285856838784784 ./1366482/.1366482/_tmp/build-temp-356126792
slurmstep 5292 root 12r DIR 0,42 4096 2479474795689377041 ./1366482/.1366482/_tmp/build-temp-356126792/rootfs
slurmstep 5292 root 13r DIR 0,42 4096 14391705936550825154 ./1366482/.1366482/_tmp/build-temp-356126792/rootfs/opt
slurmstep 5292 root 14r DIR 0,42 4096 6512950008968731535 ./1366482/.1366482/_tmp/build-temp-356126792/rootfs/opt/conda
slurmstep 5292 root 15r DIR 0,42 4096 8617558895494467877 ./1366482/.1366482/_tmp/build-temp-356126792/rootfs/opt/conda/pkgs
slurmstep 5292 root 16r DIR 0,42 4096 6500221641326575175 ./1366482/.1366482/_tmp/build-temp-356126792/rootfs/opt/conda/pkgs/pulseaudio-16.1-h4ab2085_1
slurmstep 5292 root 17r DIR 0,42 4096 16896803527836703003 ./1366482/.1366482/_tmp/build-temp-356126792/rootfs/opt/conda/pkgs/pulseaudio-16.1-h4ab2085_1/include
slurmstep 5292 root 18r DIR 0,42 4096 16961621215152742610 ./1366482/.1366482/_tmp/build-temp-356126792/rootfs/opt/conda/pkgs/pulseaudio-16.1-h4ab2085_1/include/pulse
Running this command will get the cgroups of this process:
cat /proc/5292/cgroup
I want to see whether this process belongs to a Slurm cgroup. I am asking because if processes are started through a launcher daemon of some kind, they can end up outside of a Slurm cgroup, and then they are not killed when Slurm tries to kill all processes belonging to a step (such as the batch step).
Created attachment 31319 [details]
timeout_job
Sorry for the big delay!
We don't see processes outside Slurm running on the nodes. My understanding of that particular pipeline is that everything should be running on a login node.
I got a copy of the pipeline; I will give it another look.
We updated our test cluster to 23.02.3 and see fewer directories left behind. I do see that some jobs that are left to time out fail to clean up, because the processes cannot be terminated when the first kill signal is sent to the processes within the job; SIGKILL is sent afterwards. Attaching the slurmd log for the node and that job.
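A minimal sketch of the kind of timeout job described here, along the lines of the earlier mktemp/tail test (the time limit is arbitrary and the script is only an illustration, not the attached timeout_job itself):
#!/bin/bash
#SBATCH --time=00:02:00
# Create a file in the job's private /tmp and hold it open until the job
# hits its time limit, so cleanup has to happen after SIGTERM/SIGKILL.
f=$(mktemp -p /tmp)
tail -f "$f"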
(In reply to Miguel Esteva from comment #65)
> We updated our test cluster to 23.02.3 and see fewer directories left
> behind. I do see that some jobs that are left to time out fail to clean up,
> because the processes cannot be terminated when the first kill signal is
> sent to the processes within the job; SIGKILL is sent afterwards. Attaching
> the slurmd log for the node and that job.
We only try to clean up the directories (1) when the extern step completes (which happens when the job completes), and (2) when slurmd restarts. If the cleanup doesn't work when the job completes, does it work later if you restart slurmd? If this is the case, I can look into introducing logic to occasionally retry cleaning up stray tmpfs directories, or maybe just retrying a bit more at job completion. I'm not sure how difficult this will be.
Restarting slurmd clears all the stale jobs from the tmpfs directory successfully. I will keep running more of the sample jobs we got from our users to see if I can get more information.
We have another bug (bug 17941) where steps can be deleted in the wrong order. If the extern step is deleted before other steps, then it will fail in trying to clean up the container. That could be the underlying cause of this bug. We are targeting a fix for bug 17941 in 23.11.5. Once that is fixed, we can ask you to test with it and see whether you can still reproduce the issue of containers not getting cleaned up.
Hi Marshall,
Thank you for the heads up. Will give it a go when 23.11.5 is out.
Unfortunately, we were not able to get the patch in before 23.11.5. I'll let you know once we get a fix upstream.
Just letting you know that we are not planning to fix bug 17941 in 23.11. Fixing it is proving to be somewhat complicated and involves changes we do not want to make this late in the 23.11 release cycle. Although we haven't proven that your bug here is a duplicate of bug 17941, I am fairly confident that it is, or that it mostly is. Are you okay if we close this bug as a duplicate of bug 17941?
Hi Marshall,
Happy to track bug 17941. I assume this will be expected in 24.05? We can test then and file a new case if we still run into this issue. Thank you for the help!
(In reply to Miguel Esteva from comment #82)
> Hi Marshall,
>
> Happy to track bug 17941. I assume this will be expected in 24.05?
We are trying to get some of the fixes into 24.05, which should fix issues with job_container/tmpfs most of the time. We will need to make additional changes, but because 24.05 is very close, they won't make it upstream in time.
> We can test then and file a new case if we still run into this issue.
>
> Thank you for the help!
You're welcome! Closing as a duplicate of bug 17941.
*** This ticket has been marked as a duplicate of ticket 17941 ***
We just pushed a fix for ticket 17941. This should (in theory) fix the bug in this ticket as well. The fix will be in 24.05.