Ticket 9429

Summary: cgroups not cleaned up properly, leading to perf degradation
Product: Slurm    Reporter: Luke Yeager <lyeager>
Component: Limits    Assignee: Felip Moll <felip.moll>
Status: RESOLVED FIXED
Severity: 2 - High Impact    
Priority: --- CC: alex, bart, fabecassis, karun.kallakuri, radmer, tdockendorf, troy
Version: 20.02.3   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=3604
https://bugs.schedmd.com/show_bug.cgi?id=1771
https://bugs.schedmd.com/show_bug.cgi?id=8911
https://bugs.schedmd.com/show_bug.cgi?id=9351
https://bugs.schedmd.com/show_bug.cgi?id=9498
https://bugs.schedmd.com/show_bug.cgi?id=7536
Site: NVIDIA (PSLA)
Version Fixed: 20.02.5
Attachments: job411291.leaked_cgroups
job411291.sacct
job411516.leaked_cgroups
job411516.sacct
slurm.conf
cgroup.conf
job413891.leaked_cgroups
job413891.slurmd
perf-drift-fixed
job418939.slurmd.debug2
job418939.epilog
debug9429_2002_v1.patch

Description Luke Yeager 2020-07-20 15:15:31 MDT
We have noticed that the performance of all nodes in our cluster degrades over time (fixable by a reboot). We finally tracked it down to an issue with cgroup cleanup at job completion.

All data presented below is from our older cluster (19.05.4), and so far we have only noticed the performance impact there. But now that we know the root cause, we have noticed the same issue on our newer cluster (20.02.3). We don't yet know why it hasn't impacted performance there, but the issue does not appear to be fixed in 20.02.

After a job finishes, sometimes there are still leftover cgroups which can be found with this command:

    find /sys/fs/cgroup/*/slurm -mindepth 1 -type d

When we delete these cgroups, which are leaked from previous jobs, node performance returns to the expected level without requiring a reboot. Our current workaround is to run this in the job epilog:

    find /sys/fs/cgroup/*/slurm -depth -mindepth 1 -type d -print -exec rmdir {} \;
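For reference, the workaround above can be wrapped into a small epilog-style helper. This is a sketch only: the `cleanup_leaked_cgroups` function name and the root-directory parameter are our additions for illustration and testing, not part of the actual epilog, which is just the single `find` line against `/sys/fs/cgroup`.

```shell
#!/bin/bash
# Sketch of the epilog workaround. The cgroup root is parameterized
# here so the logic can be exercised against a scratch directory.
cleanup_leaked_cgroups() {
    local root="${1:-/sys/fs/cgroup}"
    # -depth makes find emit children before parents, so rmdir only
    # ever sees already-emptied directories; -mindepth 1 preserves
    # the per-controller slurm/ roots themselves.
    find "$root"/*/slurm -depth -mindepth 1 -type d -print \
        -exec rmdir {} \; 2>/dev/null
}
```

Note that plain `rmdir` (rather than `rm -rf`) is the safe choice here: a cgroup directory that still contains live processes is non-empty from the kernel's point of view, so `rmdir` simply fails instead of killing anything.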

1824 jobs have run on the cluster since the last reboot. 1104 of those jobs leaked one or more cgroup directories. Here is the breakdown of those 1104 jobs by job state:

    JobState    CountLeaked
    CANCELLED+  9
    COMPLETED   607
    FAILED      16
    PREEMPTED   14
    TIMEOUT     458

So, leaks occur for all sorts of JobStates. Here is the breakdown of StepState for those 1104 jobs which leaked one or more cgroup dir:

    State       CountAll  CountLeaked
    CANCELLED   5905      5838
    CANCELLED+  2067      6
    COMPLETED   27893     2
    FAILED      12        12

So, the vast majority of the leaked cgroups come from steps with the CANCELLED state. But not quite all of them. And most of those CANCELLED steps leak something, but not quite all of them.

Sanitized cluster configuration (from our older cluster) and information about two of the 1104 leaking jobs will be attached shortly.
Comment 1 Luke Yeager 2020-07-20 15:16:26 MDT
Created attachment 15089 [details]
job411291.leaked_cgroups
Comment 2 Luke Yeager 2020-07-20 15:16:50 MDT
Created attachment 15090 [details]
job411291.sacct
Comment 3 Luke Yeager 2020-07-20 15:17:08 MDT
Created attachment 15091 [details]
job411516.leaked_cgroups
Comment 4 Luke Yeager 2020-07-20 15:17:26 MDT
Created attachment 15092 [details]
job411516.sacct
Comment 5 Luke Yeager 2020-07-20 15:22:03 MDT
Created attachment 15093 [details]
slurm.conf
Comment 6 Luke Yeager 2020-07-20 15:22:21 MDT
Created attachment 15094 [details]
cgroup.conf
Comment 7 Felip Moll 2020-07-21 03:41:24 MDT
Hi Luke,

I've experienced this issue and I am investigating how we can mitigate it. We have some races with the different cgroup plugins that could lead to this issue.

Thanks for the files.

If you don't mind, I'll re-tag this as sev-3, which I think fits your issue better since you already have a valid workaround. That doesn't mean I will give it less attention.

Will keep you informed.
Comment 8 Felip Moll 2020-07-21 04:14:35 MDT
Could you attach the slurmd log from a node with recent uncleaned cgroups?
Comment 10 Luke Yeager 2020-07-21 09:59:01 MDT
Created attachment 15108 [details]
job413891.leaked_cgroups
Comment 11 Luke Yeager 2020-07-21 10:00:01 MDT
Created attachment 15109 [details]
job413891.slurmd

Sure thing. Here is the log from a job which leaked cgroups on this node. SlurmdDebug=info.
Comment 12 Felip Moll 2020-07-22 03:03:03 MDT
(In reply to luke.yeager from comment #11)
> Created attachment 15109 [details]
> job413891.slurmd
> 
> Sure thing. Here is the log from a job which leaked cgroups on this node.
> SlurmdDebug=info.

Unfortunately SlurmdDebug=info is not enough to tell whether this is the same issue I was experiencing. Is it possible for you to increase it temporarily to debug2, trigger the error, and resend the log?

Thanks
Comment 13 Luke Yeager 2020-07-24 12:59:08 MDT
Created attachment 15165 [details]
perf-drift-fixed

Sorry, I'll try to get that debug2 log to you soon.

In the meantime, look how dramatically our workaround is helping. Attached is a screenshot of average application performance. We run this CPU-heavy application twice a day on every node in the cluster. Before the workaround, the perf would drop predictably every time until we rebooted the nodes. After adding the cgroup fix to the epilog, things are MUCH better.
Comment 14 Luke Yeager 2020-07-27 14:43:46 MDT
Created attachment 15181 [details]
job418939.slurmd.debug2

Here is a debug2 slurmd log from a job which leaked cgroups.
Comment 15 Luke Yeager 2020-07-27 14:44:06 MDT
Created attachment 15182 [details]
job418939.epilog
Comment 16 Luke Yeager 2020-07-29 16:30:13 MDT
Raising to Sev2. Though we have an ugly workaround, this affects all our clusters and affects the way we would plan for multi-tenancy on our clusters, etc.
Comment 17 Felip Moll 2020-07-31 07:16:56 MDT
Thanks Luke,

I am still trying to reproduce your exact situation, so far without luck. I was looking for something like this in the logs:

[2020-07-31T15:08:11.530] [113241.batch] debug:  _slurm_cgroup_destroy: problem deleting step cgroup path /sys/fs/cgroup/freezer/slurm_gamba1/uid_1000/job_113241/step_batch: Device or resource busy

but I cannot see any similar message in your logs. Instead I see the step being cancelled, and later the 991 signal throws an error, which doesn't mean it is responsible for your leaked cgroups:

[2020-07-27T13:29:18.891] debug2: container signal 18 to job 418939.19
[2020-07-27T13:29:18.891] debug2: container signal 15 to job 418939.19
[2020-07-27T13:29:18.891] [418939.19] error: *** STEP 418939.19 ON n001 CANCELLED AT 2020-07-27T13:29:18 ***
[2020-07-27T13:29:22.533] [418939.19] done with job

[2020-07-27T13:29:23.273] debug2: Processing RPC: REQUEST_SIGNAL_TASKS
[2020-07-27T13:29:23.274] debug:  _rpc_signal_tasks: sending signal 991 to step 418939.19 flag 0
[2020-07-27T13:29:23.274] debug:  _step_connect: connect() failed dir /var/spool/slurm/d node n001 step 418939.19 No such file or directory
[2020-07-27T13:29:23.274] debug:  signal for nonexistent 418939.19 stepd_connect failed: No such file or directory


This signal 991 shouldn't be triggered, but it is. However, I have a couple of bugs which already identified that problem, and it doesn't seem it should cause any direct harm.


I am continuing with this.
Comment 19 Felip Moll 2020-07-31 07:35:30 MDT
Do you have any information on how/why exactly job 418939 was terminated?


Timelimit? scancel? error in the app?..

[2020-07-27T13:29:18.889] [418939.20] error: *** STEP 418939.20 ON n001 CANCELLED AT 2020-07-27T13:29:18 ***
Comment 21 Felip Moll 2020-08-03 13:07:52 MDT
Luke,

I haven't reproduced your situation yet. I am wondering whether there's really an issue in your cgroup controller. If I prepare a debug patch, would you be willing to test it on your systems?

There isn't any error in your logs indicating that the directories cannot be removed. I'd like to add a debug statement to the cgroup removal path that shows exactly what's going on at that exact moment.
Comment 22 Felip Moll 2020-08-03 13:54:45 MDT
Created attachment 15282 [details]
debug9429_2002_v1.patch

The patch should be as simple as this one. I just want to see *when* and which cgroup is deleted.

After applying, SlurmdDebug=debug (ideally debug2) must be enabled at least on nodes where we want to debug it.

Let me know if you're willing to apply and test it. If so, just send me back the complete slurmd debug2 logs along with the list of leaked cgroups for a job, and also the slurmctld log at debug level.

Thanks
Comment 23 Luke Yeager 2020-08-04 11:03:00 MDT
Felix pointed out that the log from comment 14 is probably missing some important entries. I changed SlurmdDebug while the job was running, so the slurmstepd processes didn't have a chance to pick up the new value.

I'll try to get a repro again and will especially look out for log lines like this:

>  problem deleting step cgroup path
Comment 24 Felix Abecassis 2020-08-04 11:07:11 MDT
And looking at the job that caused this particular leak: it creates a bunch of background jobs, with something like "srun mycmd &".

On my local machine, I can reproduce a leaking cgroup with a similar pattern:
srun -N1 -t 1 bash -c 'srun --oversubscribe -N1 sleep 180s & sleep infinity'
Or with a simple C program that just creates an orphaned process.

And I do see error messages with "Device or resource busy" related to cgroup deletion. So my guess is that it's indeed the same problem that you mentioned, and not a new one.
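For the record, the orphan pattern Felix mentions can also be shown without a C program. The sketch below is our illustration (the `spawn_orphan` helper is not the actual program used): the intermediate subshell exits immediately after forking, so its child outlives the process tree that the step was tracking.

```shell
# Illustration of the orphan pattern: the subshell backgrounds a sleep
# and exits immediately, so the sleep loses its parent and is reparented
# (typically to PID 1), surviving outside the tracked process tree.
spawn_orphan() {
    ( sleep "${1:-60}" >/dev/null 2>&1 & echo $! )
}
```

A process left behind this way keeps its cgroup directory non-empty, which is exactly why `rmdir` on the step cgroup fails with "Device or resource busy".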
Comment 25 Felip Moll 2020-08-05 02:25:31 MDT
> Felix pointed out that the log from comment 14 is probably missing some important entries. I changed SlurmdDebug while the job was running, so the slurmstepd processes didn't have a chance to pick up the new value.

Ah, that would make sense. Please let me know how the new tests go.

(In reply to Felix Abecassis from comment #24)
> And looking at the job that caused this particular leak, it creates a bunch
> of background job, with something like "srun mycmd &".
> 
> On my local machine, I can reproduce a leaking cgroup with a similar pattern:
> srun -N1 -t 1 bash -c 'srun --oversubscribe -N1 sleep 180s & sleep infinity'
> Or with a simple C program that just creates an orphaned process.

I don't. The only "cg leak" I get is when running an sbatch, where the batch directory remains, but it is different from what you are seeing.

Your example cleans it up correctly, for example memory cgroup:

-Start of job:

8 directories
[lipi@llagosti memory]$ tree -d  slurm_gamba1/
slurm_gamba1/
└── uid_1000
    └── job_3326
        ├── step_0
        │   └── task_0
        └── step_extern
            └── task_0


-Sleep 180 ends:

6 directories
[lipi@llagosti memory]$ tree -d  slurm_gamba1/
slurm_gamba1/
└── uid_1000
    └── job_3326
        └── step_extern
            └── task_0


-CTRL+C or job cancel:

6 directories
[lipi@llagosti memory]$ tree -d  slurm_gamba1/
slurm_gamba1/


> 
> And I do see error messages with "Device or resource busy" related to cgroup
> deletion. So my guess is that it's indeed the same problem that you
> mentioned, and not a new one.

That is not the issue, but a known race between the different cgroup plugins. The task plugin tries to remove the cgroup while there's still a process in it; that process is later removed by the proctrack plugin, so in the end the cgroup should still be cleaned up.
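The race described above can be tolerated generically by retrying the removal: `rmdir` on a cgroup directory fails with EBUSY while a process is still attached and succeeds once the other plugin has drained it. The sketch below is our illustration of that idea in plain shell (the `rmdir_retry` helper is not Slurm's actual cleanup code):

```shell
# Retry rmdir to ride out a window where another component still has a
# process attached to the directory (rmdir fails until it drains).
rmdir_retry() {
    local dir="$1" tries="${2:-20}"
    while [ "$tries" -gt 0 ]; do
        rmdir "$dir" 2>/dev/null && return 0
        tries=$((tries - 1))
        sleep 0.1
    done
    return 1   # still busy (or still non-empty) after all retries
}
```

If the blocking process never exits, as with the orphaned `sleep` above, no amount of retrying helps, which is why the eventual fix has to address the signal ordering rather than just retry.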
Comment 26 Felix Abecassis 2020-08-05 09:29:04 MDT
This is not exactly what I've done:
srun -N1 -t 1 bash -c 'srun --oversubscribe -N1 sleep 120s & sleep infinity'

With "-t 1", the job will timeout and be killed. So don't ctrl-C or scancel it.

Here's what I see in my slurmd log:
[2020-08-05T08:27:49.372] [566.0] debug2: Sending SIGKILL to pgid 2478110
[2020-08-05T08:27:49.372] [566.0] debug:  Waiting for IO
[2020-08-05T08:27:49.372] [566.0] debug:  Closing debug channel
[2020-08-05T08:27:49.372] [566.0] debug:  IO handler exited, rc=0
[2020-08-05T08:27:49.399] [566.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm_ioctl/uid_1000/job_566/step_0): Device or resource busy
[2020-08-05T08:27:49.399] [566.0] debug2: task/cgroup: unable to remove step memcg : Device or resource busy
[2020-08-05T08:27:49.399] [566.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm_ioctl/uid_1000/job_566): Device or resource busy
[2020-08-05T08:27:49.399] [566.0] debug2: task/cgroup: not removing job memcg : Device or resource busy
[2020-08-05T08:27:49.399] [566.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm_ioctl/uid_1000): Device or resource busy
[2020-08-05T08:27:49.399] [566.0] debug2: task/cgroup: not removing user memcg : Device or resource busy


And the memory cgroup after the job was killed (and the "sleep" is not running):
$ tree -d /sys/fs/cgroup/memory/slurm_ioctl
/sys/fs/cgroup/memory/slurm_ioctl
└── uid_1000
    └── job_566
        └── step_0
            └── task_0
Comment 27 Felip Moll 2020-08-05 10:18:25 MDT
> With "-t 1", the job will timeout and be killed. So don't ctrl-C or scancel
> it.

Thanks, I see that now.

I will come back with something asap.
Comment 28 Felip Moll 2020-08-06 09:34:59 MDT
Hi,

I can observe this happens only with the jobacct_gather/cgroup plugin.

It seems the cgroup plugin creates the task_0 directory, which conflicts with the task plugin. I've seen this kind of race before.

Can we confirm it stops happening if you move to jobacct_gather/linux? There are only slight differences between the two plugins, which shouldn't impact your accounting significantly. Specifically, jobacct_gather/cgroup reads the cpuacct.stat file, which directly accumulates the used cputime over time, while the linux plugin walks each task's /proc entry to make the calculation.

We have also opened bug 8763, which identified a possible thread-safety issue in jobacct_gather/cgroup.

Changing to jobacct_gather/linux can be done with just an 'scontrol reconfig'. The new parameter will apply to new jobs only.

Could you try it and let me know if new jobs still leave cgroup dirs behind?
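Concretely, the suggested change amounts to a single line in slurm.conf followed by a reconfigure (a sketch; the rest of slurm.conf stays as it is):

```
# slurm.conf -- switch the accounting-gather plugin
JobAcctGatherType=jobacct_gather/linux
```

Then run 'scontrol reconfig' on the controller; as noted above, the new value applies to newly started jobs only, so already-running jobs keep the old plugin.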
Comment 31 Felip Moll 2020-08-07 07:35:12 MDT
(In reply to Felip Moll from comment #28)
> May you try it and let me know if new jobs still leave cgroups dirs behind?

I've found that a kill signal is issued *after* jobacct_gather tries to remove the task_0 cgroup directory. Some processes, like the "sleep infinity", which are not launched directly by srun, may need a signal other than 996, 18, or 15 to die. In that case the process isn't gone until a harder signal is issued.

So, jobacct_gather cannot remove the cgroup directory. I am looking for workarounds, but for the moment I recommend simply using jobacct_gather/linux, which should fix your issues.
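The escalation Felip describes can be illustrated generically in plain shell (our sketch, not Slurm's actual signal sequence or signal numbers): send the polite signal first, and fall back to SIGKILL, which cannot be ignored, for stragglers that survive it.

```shell
# Send SIGTERM; if the process is still alive after ~1s of polling
# (e.g. it ignores TERM, as a stray orphan might), escalate to
# SIGKILL. Only once the process is truly gone can the cgroup
# directory containing it be removed.
kill_soft_then_hard() {
    local pid="$1"
    kill -TERM "$pid" 2>/dev/null
    for _ in 1 2 3 4 5; do
        kill -0 "$pid" 2>/dev/null || return 0   # gone after TERM
        sleep 0.2
    done
    kill -KILL "$pid" 2>/dev/null
}
```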
Comment 33 Felip Moll 2020-08-13 04:09:30 MDT
Hi,

This is to let you know that I have a patch pending review which should fix the issue.
I will let you know when it has passed QA and is accepted.

Regards!
Comment 35 Felip Moll 2020-08-17 08:28:53 MDT
*** Ticket 9351 has been marked as a duplicate of this ticket. ***
Comment 36 Felip Moll 2020-08-18 08:29:39 MDT
*** Ticket 9498 has been marked as a duplicate of this ticket. ***
Comment 40 Felip Moll 2020-08-21 07:51:58 MDT
Hi,

We've committed a fix for this issue, which will be available in the upcoming 20.02.5 release. The commit is 931d3bfb.

You can apply the individual patch, upgrade to 20.02.5 when it is released, or use jobacct_gather/linux as a workaround.

Thanks for reporting.
Comment 42 Chad Vizino 2021-07-26 11:17:32 MDT
*** Ticket 12107 has been marked as a duplicate of this ticket. ***