Ticket 9429

Summary: cgroups not cleaned up properly, leading to perf degradation
Product: Slurm    Reporter: Luke Yeager <lyeager>
Component: Limits    Assignee: Felip Moll <felip.moll>
Status: RESOLVED FIXED
Severity: 2 - High Impact    
Priority: --- CC: alex, bart, fabecassis, karun.kallakuri, radmer, tdockendorf, troy
Version: 20.02.3   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=3604
https://bugs.schedmd.com/show_bug.cgi?id=1771
https://bugs.schedmd.com/show_bug.cgi?id=8911
https://bugs.schedmd.com/show_bug.cgi?id=9351
https://bugs.schedmd.com/show_bug.cgi?id=9498
https://bugs.schedmd.com/show_bug.cgi?id=7536
Site: NVIDIA (PSLA)
Version Fixed: 20.02.5
Attachments: job411291.leaked_cgroups
job411291.sacct
job411516.leaked_cgroups
job411516.sacct
slurm.conf
cgroup.conf
job413891.leaked_cgroups
job413891.slurmd
perf-drift-fixed
job418939.slurmd.debug2
job418939.epilog
debug9429_2002_v1.patch

Description Luke Yeager 2020-07-20 15:15:31 MDT
We have noticed that the performance of all nodes in our cluster degrades over time (fixable by a reboot). We finally tracked it down to an issue with cgroup cleanup at job completion.

All data presented below is from our older cluster (19.05.4), and so far we have only noticed the performance impact there. But now that we know the root cause, we have noticed the same issue on our newer cluster (20.02.3). We don't yet know why it hasn't impacted performance there, but the issue does not appear to be fixed in 20.02.

After a job finishes, sometimes there are still leftover cgroups which can be found with this command:

    find /sys/fs/cgroup/*/slurm -mindepth 1 -type d

When we delete these cgroups, which are leaked from previous jobs, node performance returns to the expected level without requiring a reboot. Our current workaround is to run this in the job epilog:

    find /sys/fs/cgroup/*/slurm -depth -mindepth 1 -type d -print -exec rmdir {} \;
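For reference, the workaround above can be wrapped into a small epilog-style helper. This is a sketch only: the `cleanup_leaked_cgroups` function name and the root-directory parameter are our additions for illustration and testing, not part of the actual epilog, which is just the single `find` line against `/sys/fs/cgroup`.

```shell
#!/bin/bash
# Sketch of the epilog workaround. The cgroup root is parameterized
# here so the logic can be exercised against a scratch directory.
cleanup_leaked_cgroups() {
    local root="${1:-/sys/fs/cgroup}"
    # -depth makes find emit children before parents, so rmdir only
    # ever sees already-emptied directories; -mindepth 1 preserves
    # the per-controller slurm/ roots themselves.
    find "$root"/*/slurm -depth -mindepth 1 -type d -print \
        -exec rmdir {} \; 2>/dev/null
}
```

Note that plain `rmdir` (rather than `rm -rf`) is the safe choice here: a cgroup directory that still contains live processes is non-empty from the kernel's point of view, so `rmdir` simply fails instead of killing anything.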

1824 jobs have run on the cluster since the last reboot. 1104 of those jobs leaked one or more cgroup directories. Here is the breakdown of those 1104 jobs by job state:

    JobState    CountLeaked
    CANCELLED+  9
    COMPLETED   607
    FAILED      16
    PREEMPTED   14
    TIMEOUT     458

So, leaks occur for all sorts of JobStates. Here is the breakdown of StepState for those 1104 jobs which leaked one or more cgroup dir:

    State       CountAll  CountLeaked
    CANCELLED   5905      5838
    CANCELLED+  2067      6
    COMPLETED   27893     2
    FAILED      12        12

So, the vast majority of the leaked cgroups come from steps with the CANCELLED state. But not quite all of them. And most of those CANCELLED steps leak something, but not quite all of them.

Sanitized cluster configuration (from our older cluster) and information about two of the 1104 leaking jobs will be attached shortly.
Comment 1 Luke Yeager 2020-07-20 15:16:26 MDT
Created attachment 15089 [details]
job411291.leaked_cgroups
Comment 2 Luke Yeager 2020-07-20 15:16:50 MDT
Created attachment 15090 [details]
job411291.sacct
Comment 3 Luke Yeager 2020-07-20 15:17:08 MDT
Created attachment 15091 [details]
job411516.leaked_cgroups
Comment 4 Luke Yeager 2020-07-20 15:17:26 MDT
Created attachment 15092 [details]
job411516.sacct
Comment 5 Luke Yeager 2020-07-20 15:22:03 MDT
Created attachment 15093 [details]
slurm.conf
Comment 6 Luke Yeager 2020-07-20 15:22:21 MDT
Created attachment 15094 [details]
cgroup.conf
Comment 7 Felip Moll 2020-07-21 03:41:24 MDT
Hi Luke,

I've experienced this issue and I am investigating how we can mitigate it. We have some races with the different cgroup plugins that could lead to this issue.

Thanks for the files.

If you don't mind, I'll re-tag this as sev-3, which I think fits your issue better since you already have a valid workaround. That doesn't mean I will give it less attention.

Will keep you informed.
Comment 8 Felip Moll 2020-07-21 04:14:35 MDT
Could you attach the slurmd log from a node with recent uncleaned cgroups?
Comment 10 Luke Yeager 2020-07-21 09:59:01 MDT
Created attachment 15108 [details]
job413891.leaked_cgroups
Comment 11 Luke Yeager 2020-07-21 10:00:01 MDT
Created attachment 15109 [details]
job413891.slurmd

Sure thing. Here is the log from a job which leaked cgroups on this node. SlurmdDebug=info.
Comment 12 Felip Moll 2020-07-22 03:03:03 MDT
(In reply to luke.yeager from comment #11)
> Created attachment 15109 [details]
> job413891.slurmd
> 
> Sure thing. Here is the log from a job which leaked cgroups on this node.
> SlurmdDebug=info.

Unfortunately SlurmdDebug=info is not enough to tell whether this is the same issue I was experiencing. Is it possible for you to increase it temporarily to debug2, trigger the error, and resend the log?

Thanks
Comment 13 Luke Yeager 2020-07-24 12:59:08 MDT
Created attachment 15165 [details]
perf-drift-fixed

Sorry, I'll try to get that debug2 log to you soon.

In the meantime, look how dramatically our workaround is helping. Attached is a screenshot of average application performance. We run this CPU-heavy application twice a day on every node in the cluster. Before the workaround, the perf would drop predictably every time until we rebooted the nodes. After adding the cgroup fix to the epilog, things are MUCH better.
Comment 14 Luke Yeager 2020-07-27 14:43:46 MDT
Created attachment 15181 [details]
job418939.slurmd.debug2

Here is a debug2 slurmd log from a job which leaked cgroups.
Comment 15 Luke Yeager 2020-07-27 14:44:06 MDT
Created attachment 15182 [details]
job418939.epilog
Comment 16 Luke Yeager 2020-07-29 16:30:13 MDT
Raising to Sev2. Though we have an ugly workaround, this affects all our clusters and affects the way we would plan for multi-tenancy on our clusters, etc.
Comment 17 Felip Moll 2020-07-31 07:16:56 MDT
Thanks Luke,

I am still trying to reproduce your exact situation, so far without luck. I was looking for something like this in the logs:

[2020-07-31T15:08:11.530] [113241.batch] debug:  _slurm_cgroup_destroy: problem deleting step cgroup path /sys/fs/cgroup/freezer/slurm_gamba1/uid_1000/job_113241/step_batch: Device or resource busy

but I cannot see any similar message in your logs. Instead I see the step being cancelled, and later the 991 signal throws an error, which doesn't mean it is responsible for your leaked cgroups:

[2020-07-27T13:29:18.891] debug2: container signal 18 to job 418939.19
[2020-07-27T13:29:18.891] debug2: container signal 15 to job 418939.19
[2020-07-27T13:29:18.891] [418939.19] error: *** STEP 418939.19 ON n001 CANCELLED AT 2020-07-27T13:29:18 ***
[2020-07-27T13:29:22.533] [418939.19] done with job

[2020-07-27T13:29:23.273] debug2: Processing RPC: REQUEST_SIGNAL_TASKS
[2020-07-27T13:29:23.274] debug:  _rpc_signal_tasks: sending signal 991 to step 418939.19 flag 0
[2020-07-27T13:29:23.274] debug:  _step_connect: connect() failed dir /var/spool/slurm/d node n001 step 418939.19 No such file or directory
[2020-07-27T13:29:23.274] debug:  signal for nonexistent 418939.19 stepd_connect failed: No such file or directory


This signal 991 shouldn't be triggered, but it is. However, I have a couple of bugs which already identified that problem, and it doesn't seem it should cause any direct harm.


I am continuing with this.
Comment 19 Felip Moll 2020-07-31 07:35:30 MDT
Do you have any information on how/why exactly job 418939 was terminated?


Timelimit? scancel? error in the app?..

[2020-07-27T13:29:18.889] [418939.20] error: *** STEP 418939.20 ON n001 CANCELLED AT 2020-07-27T13:29:18 ***
Comment 21 Felip Moll 2020-08-03 13:07:52 MDT
Luke,

I haven't reproduced your situation yet. I am wondering whether there's really an issue in your cgroup controller. If I prepare a debug patch, would you be willing to test it on your systems?

There isn't any error in your logs indicating that the directories cannot be removed. I'd like to add a debug statement to the cgroup removal path that shows exactly what's going on at that exact moment.
Comment 22 Felip Moll 2020-08-03 13:54:45 MDT
Created attachment 15282 [details]
debug9429_2002_v1.patch

The patch should be as simple as this one. I just want to see *when* and which cgroup is deleted.

After applying, SlurmdDebug=debug (ideally debug2) must be enabled at least on nodes where we want to debug it.

Let me know if you're willing to apply and test it. If so, just send me back the complete slurmd debug2 logs along with the list of leaked cgroups for a job, and also the slurmctld log at debug level.

Thanks
Comment 23 Luke Yeager 2020-08-04 11:03:00 MDT
Felix pointed out that the log from comment 14 is probably missing some important entries. I changed SlurmdDebug while the job was running, so the slurmstepd processes didn't have a chance to pick up the new value.

I'll try to get a repro again and will especially look out for log lines like this:

>  problem deleting step cgroup path
Comment 24 Felix Abecassis 2020-08-04 11:07:11 MDT
And looking at the job that caused this particular leak: it creates a bunch of background jobs, with something like "srun mycmd &".

On my local machine, I can reproduce a leaking cgroup with a similar pattern:
srun -N1 -t 1 bash -c 'srun --oversubscribe -N1 sleep 180s & sleep infinity'
Or with a simple C program that just creates an orphaned process.

And I do see error messages with "Device or resource busy" related to cgroup deletion. So my guess is that it's indeed the same problem that you mentioned, and not a new one.
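For the record, the orphan pattern Felix mentions can also be shown without a C program. The sketch below is our illustration (the `spawn_orphan` helper is not the actual program used): the intermediate subshell exits immediately after forking, so its child outlives the process tree that the step was tracking.

```shell
# Illustration of the orphan pattern: the subshell backgrounds a sleep
# and exits immediately, so the sleep loses its parent and is reparented
# (typically to PID 1), surviving outside the tracked process tree.
spawn_orphan() {
    ( sleep "${1:-60}" >/dev/null 2>&1 & echo $! )
}
```

A process left behind this way keeps its cgroup directory non-empty, which is exactly why `rmdir` on the step cgroup fails with "Device or resource busy".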
Comment 25 Felip Moll 2020-08-05 02:25:31 MDT
> Felix pointed out that the log from comment 14 is probably missing some important entries. I changed SlurmdDebug while the job was running, so the slurmstepd processes didn't have a chance to pick up the new value.

Ah, that would make sense. Please let me know how the new tests go.

(In reply to Felix Abecassis from comment #24)
> And looking at the job that caused this particular leak, it creates a bunch
> of background job, with something like "srun mycmd &".
> 
> On my local machine, I can reproduce a leaking cgroup with a similar pattern:
> srun -N1 -t 1 bash -c 'srun --oversubscribe -N1 sleep 180s & sleep infinity'
> Or with a simple C program that just creates an orphaned process.

I don't. The only "cg leak" I get is when running an sbatch, where the batch directory remains, but it is different from what you are seeing.

Your example cleans it up correctly, for example memory cgroup:

-Start of job:

8 directories
[lipi@llagosti memory]$ tree -d  slurm_gamba1/
slurm_gamba1/
└── uid_1000
    └── job_3326
        ├── step_0
        │   └── task_0
        └── step_extern
            └── task_0


-Sleep 180 ends:

6 directories
[lipi@llagosti memory]$ tree -d  slurm_gamba1/
slurm_gamba1/
└── uid_1000
    └── job_3326
        └── step_extern
            └── task_0


-CTRL+C or job cancel:

6 directories
[lipi@llagosti memory]$ tree -d  slurm_gamba1/
slurm_gamba1/


> 
> And I do see error messages with "Device or resource busy" related to cgroup
> deletion. So my guess is that it's indeed the same problem that you
> mentioned, and not a new one.

That is not the issue, but a known race between the different cgroup plugins. The task plugin tries to remove the cgroup while there's still a process in it; that process is later removed by the proctrack plugin, so in the end the cgroup should still be cleaned up.
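The race described above can be tolerated generically by retrying the removal: `rmdir` on a cgroup directory fails with EBUSY while a process is still attached and succeeds once the other plugin has drained it. The sketch below is our illustration of that idea in plain shell (the `rmdir_retry` helper is not Slurm's actual cleanup code):

```shell
# Retry rmdir to ride out a window where another component still has a
# process attached to the directory (rmdir fails until it drains).
rmdir_retry() {
    local dir="$1" tries="${2:-20}"
    while [ "$tries" -gt 0 ]; do
        rmdir "$dir" 2>/dev/null && return 0
        tries=$((tries - 1))
        sleep 0.1
    done
    return 1   # still busy (or still non-empty) after all retries
}
```

If the blocking process never exits, as with the orphaned `sleep` above, no amount of retrying helps, which is why the eventual fix has to address the signal ordering rather than just retry.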
Comment 26 Felix Abecassis 2020-08-05 09:29:04 MDT
This is not exactly what I've done:
srun -N1 -t 1 bash -c 'srun --oversubscribe -N1 sleep 120s & sleep infinity'

With "-t 1", the job will timeout and be killed. So don't ctrl-C or scancel it.

Here's what I see in my slurmd log:
[2020-08-05T08:27:49.372] [566.0] debug2: Sending SIGKILL to pgid 2478110
[2020-08-05T08:27:49.372] [566.0] debug:  Waiting for IO
[2020-08-05T08:27:49.372] [566.0] debug:  Closing debug channel
[2020-08-05T08:27:49.372] [566.0] debug:  IO handler exited, rc=0
[2020-08-05T08:27:49.399] [566.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm_ioctl/uid_1000/job_566/step_0): Device or resource busy
[2020-08-05T08:27:49.399] [566.0] debug2: task/cgroup: unable to remove step memcg : Device or resource busy
[2020-08-05T08:27:49.399] [566.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm_ioctl/uid_1000/job_566): Device or resource busy
[2020-08-05T08:27:49.399] [566.0] debug2: task/cgroup: not removing job memcg : Device or resource busy
[2020-08-05T08:27:49.399] [566.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm_ioctl/uid_1000): Device or resource busy
[2020-08-05T08:27:49.399] [566.0] debug2: task/cgroup: not removing user memcg : Device or resource busy


And the memory cgroup after the job was killed (and the "sleep" is not running):
$ tree -d /sys/fs/cgroup/memory/slurm_ioctl
/sys/fs/cgroup/memory/slurm_ioctl
└── uid_1000
    └── job_566
        └── step_0
            └── task_0
Comment 27 Felip Moll 2020-08-05 10:18:25 MDT
> With "-t 1", the job will timeout and be killed. So don't ctrl-C or scancel
> it.

Thanks, I see that now.

I will come back with something asap.
Comment 28 Felip Moll 2020-08-06 09:34:59 MDT
Hi,

I can observe this happens only with the jobacct_gather/cgroup plugin.

It seems the cgroup plugin creates the task_0 directory, which conflicts with the task plugin. I've seen this kind of race before.

Can we confirm it stops happening if you move to jobacct_gather/linux? There are only slight differences between the two plugins, which shouldn't impact your accounting significantly. Specifically, jobacct_gather/cgroup reads the cpuacct.stat file, which directly accumulates the used cputime over time, while the linux plugin walks each task's /proc entry to make the calculation.

We have also opened bug 8763, which identified a possible thread-safety issue in jobacct_gather/cgroup.

Changing to jobacct_gather/linux can be done with just an 'scontrol reconfig'. The new parameter will apply to new jobs only.

Could you try it and let me know if new jobs still leave cgroup dirs behind?
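Concretely, the suggested change amounts to a single line in slurm.conf followed by a reconfigure (a sketch; the rest of slurm.conf stays as it is):

```
# slurm.conf -- switch the accounting-gather plugin
JobAcctGatherType=jobacct_gather/linux
```

Then run 'scontrol reconfig' on the controller; as noted above, the new value applies to newly started jobs only, so already-running jobs keep the old plugin.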
Comment 31 Felip Moll 2020-08-07 07:35:12 MDT
(In reply to Felip Moll from comment #28)
> May you try it and let me know if new jobs still leave cgroups dirs behind?

I've found that a kill signal is issued *after* jobacct_gather tries to remove the task_0 cgroup directory. Some processes, like the "sleep infinity", which are not launched directly by srun, may need a signal other than 996, 18, or 15 to die. In that case the process isn't gone until a harder signal is issued.

So, jobacct_gather cannot remove the cgroup directory. I am looking for workarounds, but for the moment I recommend simply using jobacct_gather/linux, which should fix your issues.
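The escalation Felip describes can be illustrated generically in plain shell (our sketch, not Slurm's actual signal sequence or signal numbers): send the polite signal first, and fall back to SIGKILL, which cannot be ignored, for stragglers that survive it.

```shell
# Send SIGTERM; if the process is still alive after ~1s of polling
# (e.g. it ignores TERM, as a stray orphan might), escalate to
# SIGKILL. Only once the process is truly gone can the cgroup
# directory containing it be removed.
kill_soft_then_hard() {
    local pid="$1"
    kill -TERM "$pid" 2>/dev/null
    for _ in 1 2 3 4 5; do
        kill -0 "$pid" 2>/dev/null || return 0   # gone after TERM
        sleep 0.2
    done
    kill -KILL "$pid" 2>/dev/null
}
```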
Comment 33 Felip Moll 2020-08-13 04:09:30 MDT
Hi,

This is to let you know that I have a patch pending review which should fix the issue.
I will let you know when it has passed QA and is accepted.

Regards!
Comment 35 Felip Moll 2020-08-17 08:28:53 MDT
*** Ticket 9351 has been marked as a duplicate of this ticket. ***
Comment 36 Felip Moll 2020-08-18 08:29:39 MDT
*** Ticket 9498 has been marked as a duplicate of this ticket. ***
Comment 40 Felip Moll 2020-08-21 07:51:58 MDT
Hi,

We've committed a fix for this issue, which will be available in the upcoming 20.02.5 release. The commit is 931d3bfb.

You can apply the individual patch, upgrade to 20.02.5 when it is released, or use jobacct_gather/linux as a workaround.

Thanks for reporting.
Comment 42 Chad Vizino 2021-07-26 11:17:32 MDT
*** Ticket 12107 has been marked as a duplicate of this ticket. ***