Description
Miguel Esteva
2023-02-12 17:15:40 MST
Can you share slurmd logs from the node where your files are not being cleaned up?
cheers, Marcin
In addition to the comment from Marcin, please follow these steps:
1. Set DebugFlags=jobcontainer + debug2
2. ls -lahR of the path where the file is not being removed
3. Attach the job_container.conf
4. scontrol show job of the affected job
5. Attach the sbatch script, if any, and the job submission arguments
The files should be cleaned up, even if there is a slight delay.
Hi Marcin and Jason,
Saw nothing in the logs, and indeed I bumped slurmd logging to debug3 (then lowered it to debug2). Will add JobContainer to the debug flags and share what I find. Thank you.
Created attachment 28880 [details]
slurmd node logs
Hi Marcin and Jason,
Some logs from nodes that didn't clean up are attached.
Our job_container.conf is simply:
AutoBasePath=true
BasePath=/vast/scratch/tmp
I can see these entries:
[2023-02-16T09:25:16.576] [1362352.extern] debug2: _rm_data: could not remove path: /vast/scratch/tmp/1362352: Device or resource busy
However, even when that entry is shown, I have not been able to reproduce the problem with my own jobs, even when they were cancelled halfway through. Directories get cleaned up OK. What I see from the jobs run by the users is that the majority got cancelled.
I should mention that the mount used for BasePath is NFSv3 that we use for scratch:
vast:/scratch /vast/scratch nfs vers=3,relatime,nodiratime,acregmax=3,acdirmin=3,acdirmax=3,mountproto=tcp
/etc/slurm/plugins/conf.d/tmpdir.conf is present but the contents are all commented out.
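As an aside, when a directory hits "Device or resource busy" like this, the following kind of check can show what is still holding it open. This is only a debugging sketch, assuming lsof and fuser are available on the node and reusing the job directory above purely as an example:
lsof +D /vast/scratch/tmp/1362352
fuser -vm /vast/scratch/tmp/1362352
lsof +D lists open files under the directory, while fuser -vm lists processes using the filesystem that contains it (on an NFS mount, that is the whole mount).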
Kind regards,
Miguel
Created attachment 28883 [details]
Contents of the basepath directory
Changed the configuration of job_container.conf so each node has a unique directory.
The template used to generate the file:
AutoBasePath=true
BasePath=/vast/scratch/tmp/dev/$(hostname -s)
/etc/slurm/job_container.conf on each node:
il-n01: AutoBasePath=true
il-n01: BasePath=/vast/scratch/tmp/dev/il-n01
em-n01: AutoBasePath=true
em-n01: BasePath=/vast/scratch/tmp/dev/em-n01
milton-sml-02: AutoBasePath=true
milton-sml-02: BasePath=/vast/scratch/tmp/dev/milton-sml-02
cl-n01: AutoBasePath=true
cl-n01: BasePath=/vast/scratch/tmp/dev/cl-n01
milton-sml-01: AutoBasePath=true
milton-sml-01: BasePath=/vast/scratch/tmp/dev/milton-sml-01
> Changed the configuration of job_container.conf so each node has a unique directory:
As you discovered, each node needs to have its own unique directory for BasePath. In Slurm 23.02, we added wildcard expansion for hostname (%h) and nodename (%n) in BasePath so that you can have the same job_container.conf file on every node, which should make this easier if your BasePath is on a shared filesystem.
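For example (a minimal sketch reusing the BasePath from this report), a single shared job_container.conf could then read:
AutoBasePath=true
BasePath=/vast/scratch/tmp/dev/%n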
Can you let us know if you see directories not cleaned up since you made the change?
Created attachment 28905 [details]
Post job_container.conf change logs
Unfortunately, we still noticed a couple of jobs not cleaning up their directories. The logs from the nodes are attached (jobs 1362429, 1362431).
em-n01/1362429:
total 0
drwx------ 2 root root 4096 Feb 16 16:54 .
drwxr-xr-x 2 root root 4096 Feb 17 09:10 ..
em-n01/1362431:
total 0
drwx------ 2 root root 4096 Feb 16 16:36 .
drwxr-xr-x 2 root root 4096 Feb 17 09:10 ..
Cheers.
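As an aside, a quick way to sweep a per-node BasePath layout like the one above for leftover job directories is something along these lines (the path and the one-hour age cut-off are only an illustration):
find /vast/scratch/tmp/dev -mindepth 2 -maxdepth 2 -type d -mmin +60
This lists job-level directories more than an hour old.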
(In reply to Miguel Esteva from comment #12)
> Created attachment 28905 [details]
> Post job_container.conf change logs
>
> Unfortunately, we still noticed a couple of jobs not cleaning up their
> directories. The logs from the nodes are attached (jobs 1362429, 1362431).
>
> em-n01/1362429:
> total 0
> drwx------ 2 root root 4096 Feb 16 16:54 .
> drwxr-xr-x 2 root root 4096 Feb 17 09:10 ..
>
> em-n01/1362431:
> total 0
> drwx------ 2 root root 4096 Feb 16 16:36 .
> drwxr-xr-x 2 root root 4096 Feb 17 09:10 ..
>
> Cheers.
Thanks, I'm looking into it.
I may have found a possible cause for this.
Thank you! We have since updated our test cluster to 23.02.0-0rc1. We can see that some directories were not cleaned up after the update.
Hi,
We have pushed commit 99a6d87322, which fixes a situation where job directories would not be removed if the namespace mount was already gone. From your uploaded slurmd log, I know that this happened on your node.
To test this, you should be able to either locally apply this commit to your test system or upgrade your test system to the latest commit on the slurm-23.02 branch. Will you be able to test this on your test system so we can confirm whether there are other outstanding issues that cause the job container directories to not be cleaned up?
One additional issue that we observed is umount2() failing with errno == ESTALE, which can happen on shared filesystems. If you observe this happening and job container directories are not cleaned up right away, then once the filesystem issue is resolved, restarting slurmd should cause the job container directories to be cleaned up.
Thank you. We have updated our test cluster to 23.02. Will report back once we run more jobs.
Hi Miguel,
Have you had a chance to run jobs in your Slurm 23.02 test environment to verify whether job_container/tmpfs is cleaning the job directories?
Hi Marshall,
Ran a simple array job that uses mktemp to create a file inside a directory. The file is then kept open with tail until the job times out. So far I have not seen any tmp directories left behind. Are there any other tests you would recommend?
Cheers,
Miguel
Any test where you previously noticed leftover job directories is fine. I'm marking this as fixed ahead of 23.02.1. Let us know if you encounter more issues.
Created attachment 29596 [details]
Node logs 23.02.1
Hi,
Updated to v23.02.1. Still saw some directories left behind after a user ran some jobs in our test cluster. Attaching logs.
Cheers,
Miguel
Hi Miguel,
First, just to confirm: are your mountpoints private or shared? I know in comment 7 you made each mountpoint unique to each node, and I just want to confirm this is still the case. I also want to remind you that in Slurm 23.02 we added wildcard expansion for hostname (%h) and nodename (%n) in BasePath, so you can have the same job_container.conf file on every node; this should make things easier for you to maintain if your BasePath is on a shared filesystem.
I just uploaded attachment 29647 [details] (debug patch v2). This includes two commits that will help us with debugging:
* Add an error message for a failed unmount. Previously, a failed unmount in this spot did not log a message. This will hopefully let us see why unmounts are failing and diagnose the problem.
* Prevent unmounts that will always fail in private mode; only do them in shared mode. This will silence most of the debug2 log messages that we saw in your log file:
cl-n01.log:[2023-03-30T10:33:19.281] [1363882.extern] debug2: job_container/tmpfs: _clean_job_basepath: failed to unmount /vast/scratch/tmp/dev/cl-n01/1363885 for job 1363882
Can you apply this patch (to slurmd) and upload a slurmd log file with failed unmounts with this patch applied?
Hi Marshall,
We set up Ansible to create a unique directory and job_container.conf for each node. I can give the %n wildcard a go; it will save us a couple of tasks.
Thanks for the patches; I will apply them and ask the user to run again once they are in place.
Cheers and thanks,
Miguel
Created attachment 29659 [details]
Apr04 logs
Applied the patch and rebuilt. The user ran the same workload. Now we see that most (or all) of the directories were left behind. I quickly skimmed some logs and couldn't see much. Attaching slurmd logs from the nodes that had directories left in their BasePath.
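As an aside, with the debug patch applied, a quick way to scan for the new unmount error lines is something like the following (the log path is an assumption; adjust it to your SlurmdLogFile setting):
grep -E 'failed to unmount|_clean_job_basepath' /var/log/slurm/slurmd.log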
Thank you!
Created attachment 29679 [details]
debug patch v4
Hi,
Thanks for that extra data. I don't see any error logs, so the umount didn't fail. We added one more debug patch to replace the one you have. It has the same two commits as your current debug patch, plus one more commit that adds error logs for failed rmdir calls. I suspect that is what is failing, since your directories are left behind.
Thanks!
Created attachment 29691 [details]
patch v4 logs
Hi Marshall,
Implemented the patch and captured some logs from affected nodes. Additionally, added some audit logs from the NFS side.
This might have to do with the namespace being created by unshare with the same mount flags used for the local /tmp (our NFS mounts have some additional options, listed in an earlier comment).
I didn't have much time to check the logs, but I do see some .nfs files failing to clear (processes in the job step still using those files, perhaps?), resulting in "Device or resource busy" errors; consequently, rmdir fails. For some Slurm jobs that create jobs in rapid succession, we instruct users to add a small delay to allow NFS file entries to propagate.
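For context, these .nfs* entries are the NFS client's "silly rename" placeholders for files that were deleted while a process still had them open; the directory cannot be removed until the last holder exits. A way to trace one back to its holder, assuming lsof is available and reusing one of the leftover directories from earlier purely as an example:
lsof +D /vast/scratch/tmp/dev/em-n01/1362429
Once the holding process exits, the .nfs* file disappears and the rmdir can succeed.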
Cheers.
Can you run lsof on il-n01, or another node that still has lingering job container directories?
Created attachment 29722 [details]
apr06 lsof log
Attaching logs from a node that still had files open under slurmstepd after the job ended.
This shows that something started by slurmstepd set up a conda environment, and that a process is still holding files in the job's tmpfs directory. What is setting up this conda environment? A task prolog or the job script? How exactly is it being started?
Created attachment 29902 [details]
Jobs description
(In reply to Marshall Garey from comment #52)
> This shows that something started by slurmstepd set up a conda environment,
> and that a process is still holding files in the job's tmpfs directory. What
> is setting up this conda environment? A task prolog or the job script? How
> exactly is it being started?
An apology for the late reply. Those nodes do have prolog/epilog scripts, but they do not run anything (they just exit 0). Only our GPU nodes currently use prologs and epilogs, to set up DCGM.
Those jobs are submitted with the Nextflow workflow manager. A document detailing what each job ran is attached. Most of the files left behind on the nodes are Java hsperfdata files (I attempted to run with -XX:-UsePerfData but could not tell whether it made a difference) and the conda files set up during the batch step.
Cheers.
If the processes started by the batch step are not killed when the batch step exits, those processes can still hold files open in the job's local /tmp directory. What is your ProctrackType in slurm.conf?
Created attachment 29975 [details]
cgroup.conf
We use proctrack/cgroup. The configuration is attached.
For your information, we pushed the patches that we gave you upstream. The fix for the erroneous debug2 log is in ahead of 23.02.2:
commit 7bb373e3c9
And the added logs are in master:
commit 44007fb569
commit 213c415993
Sorry for the long delay in responding to you. For further debugging, can you get the cgroups (with cat /proc/<pid>/cgroup) of the nextflow processes on a node where the tmp directory was not cleaned up? For example, in the apr06_logs/cl-n01-lsof.txt file that you uploaded, you have PID 5292:
slurmstep 5292 root 6r DIR 0,42 4096 17868314191245455793 ./1366482
slurmstep 5292 root 9r DIR 0,42 4096 15018413718779533490 ./1366482/.1366482
slurmstep 5292 root 10r DIR 0,42 4096 10784177319922843380 ./1366482/.1366482/_tmp
slurmstep 5292 root 11r DIR 0,42 4096 8197285856838784784 ./1366482/.1366482/_tmp/build-temp-356126792
slurmstep 5292 root 12r DIR 0,42 4096 2479474795689377041 ./1366482/.1366482/_tmp/build-temp-356126792/rootfs
slurmstep 5292 root 13r DIR 0,42 4096 14391705936550825154 ./1366482/.1366482/_tmp/build-temp-356126792/rootfs/opt
slurmstep 5292 root 14r DIR 0,42 4096 6512950008968731535 ./1366482/.1366482/_tmp/build-temp-356126792/rootfs/opt/conda
slurmstep 5292 root 15r DIR 0,42 4096 8617558895494467877 ./1366482/.1366482/_tmp/build-temp-356126792/rootfs/opt/conda/pkgs
slurmstep 5292 root 16r DIR 0,42 4096 6500221641326575175 ./1366482/.1366482/_tmp/build-temp-356126792/rootfs/opt/conda/pkgs/pulseaudio-16.1-h4ab2085_1
slurmstep 5292 root 17r DIR 0,42 4096 16896803527836703003 ./1366482/.1366482/_tmp/build-temp-356126792/rootfs/opt/conda/pkgs/pulseaudio-16.1-h4ab2085_1/include
slurmstep 5292 root 18r DIR 0,42 4096 16961621215152742610 ./1366482/.1366482/_tmp/build-temp-356126792/rootfs/opt/conda/pkgs/pulseaudio-16.1-h4ab2085_1/include/pulse
Running this command will get the cgroups of this process:
cat /proc/5292/cgroup
I want to see whether this process belongs to a Slurm cgroup. I am asking because if processes are started through a launcher daemon of some kind, they can end up outside of a Slurm cgroup, and then they are not killed when Slurm tries to kill all processes belonging to a step (such as the batch step).
Created attachment 31319 [details]
timeout_job
Sorry for the big delay!
We don't see processes outside Slurm running on the nodes. My understanding of that particular pipeline is that everything should be running on a login node.
I got a copy of the pipeline; I will give it another look.
We updated our test cluster to 23.02.3 and see fewer directories left behind. I do see that some jobs that are left to time out fail to clean up, because the processes cannot be terminated when the first kill signal is sent to the processes within the job; SIGKILL is sent afterwards. Attaching the slurmd log for the node and that job.
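A minimal sketch of the kind of timeout job described here, along the lines of the earlier mktemp/tail test (the time limit is arbitrary and the script is only an illustration, not the attached timeout_job itself):
#!/bin/bash
#SBATCH --time=00:02:00
# Create a file in the job's private /tmp and hold it open until the job
# hits its time limit, so cleanup has to happen after SIGTERM/SIGKILL.
f=$(mktemp -p /tmp)
tail -f "$f"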
(In reply to Miguel Esteva from comment #65)
> We updated our test cluster to 23.02.3 and see fewer directories left
> behind. I do see that some jobs that are left to time out fail to clean up,
> because the processes cannot be terminated when the first kill signal is
> sent to the processes within the job; SIGKILL is sent afterwards. Attaching
> the slurmd log for the node and that job.
We only try to clean up the directories (1) when the extern step completes (which happens when the job completes), and (2) when slurmd restarts. If the cleanup doesn't work when the job completes, does it work later if you restart slurmd? If this is the case, I can look into introducing logic to occasionally retry cleaning up stray tmpfs directories, or maybe just retrying a bit more at job completion. I'm not sure how difficult this will be.
Restarting slurmd clears all the stale jobs from the tmpfs directory successfully. I will keep running more of the sample jobs we got from our users to see if I can get more information.
We have another bug (bug 17941) where steps can be deleted in the wrong order. If the extern step is deleted before other steps, then it will fail in trying to clean up the container. That could be the underlying cause of this bug. We are targeting a fix for bug 17941 in 23.11.5. Once that is fixed, we can ask you to test with it and see whether you can still reproduce the issue of containers not getting cleaned up.
Hi Marshall,
Thank you for the heads up. Will give it a go when 23.11.5 is out.
Unfortunately, we were not able to get the patch in before 23.11.5. I'll let you know once we get a fix upstream.
Just letting you know that we are not planning to fix bug 17941 in 23.11. Fixing it is proving to be somewhat complicated and involves changes we do not want to make this late in the 23.11 release cycle. Although we haven't proven that your bug here is a duplicate of bug 17941, I am fairly confident that it is, or that it mostly is. Are you okay if we close this bug as a duplicate of bug 17941?
Hi Marshall,
Happy to track bug 17941. I assume this will be expected in 24.05? We can test then and file a new case if we still run into this issue. Thank you for the help!
(In reply to Miguel Esteva from comment #82)
> Hi Marshall,
>
> Happy to track bug 17941. I assume this will be expected in 24.05?
We are trying to get some of the fixes into 24.05, which should fix issues with job_container/tmpfs most of the time. We will need to make additional changes, but because 24.05 is very close, they won't make it upstream in time.
> We can test then and file a new case if we still run into this issue.
>
> Thank you for the help!
You're welcome! Closing as a duplicate of bug 17941.
*** This ticket has been marked as a duplicate of ticket 17941 ***
We just pushed a fix for ticket 17941. This should (in theory) fix the bug in this ticket as well. The fix will be in 24.05.