We have noticed that the performance of all nodes in our cluster degrades over time (fixable by a reboot). We finally tracked it down to an issue with cgroup cleanup at job completion.

All data presented below is from our older cluster (19.05.4), and so far we have only noticed the performance impact on this cluster. But now that we know the root cause, we have noticed the same issue on our newer cluster (20.02.3). We don't yet know why it hasn't impacted performance there, but the issue does not appear to be fixed in 20.02.

After a job finishes, sometimes there are still leftover cgroups, which can be found with this command:

    find /sys/fs/cgroup/*/slurm -mindepth 1 -type d

When we delete these cgroups, which are leaked from previous jobs, node performance returns to the expected level without requiring a reboot. Our current workaround is to run this in the job epilog:

    find /sys/fs/cgroup/*/slurm -depth -mindepth 1 -type d -print -exec rmdir {} \;

1824 jobs have run on the cluster since the last reboot. 1104 of those jobs leaked one or more cgroup directories. Of those 1104 jobs, here is the breakdown by job state:

    JobState     CountLeaked
    CANCELLED+     9
    COMPLETED    607
    FAILED        16
    PREEMPTED     14
    TIMEOUT      458

So, leaks occur for all sorts of JobStates. Here is the breakdown by StepState for those 1104 jobs which leaked one or more cgroup dir:

    StepState    CountAll   CountLeaked
    CANCELLED       5905       5838
    CANCELLED+      2067          6
    COMPLETED      27893          2
    FAILED            12         12

So the vast majority of the leaked cgroups come from steps in the CANCELLED state, but not quite all of them. And most of those CANCELLED steps leak something, but not quite all of them.

Sanitized cluster configuration (from our older cluster) and information about two of the 1104 leaking jobs will be attached shortly.
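In case it helps others hitting this, the epilog workaround above can be wrapped in a small script like the following. This is a sketch of our own making, not anything shipped with Slurm; the `clean_leaked_cgroups` helper and the parameterized root directory are our additions (parameterizing the root also makes the depth-first removal logic testable outside /sys/fs/cgroup):

```shell
#!/bin/sh
# clean_leaked_cgroups ROOT: depth-first removal of leaked cgroup
# directories under ROOT/*/slurm. In the epilog we call it with
# /sys/fs/cgroup; any directory tree works the same way.
clean_leaked_cgroups() {
    for root in "$1"/*/slurm; do
        [ -d "$root" ] || continue
        # -depth removes children before their parents, so rmdir always
        # sees an empty directory (cgroup directories cannot be removed
        # with rm -rf; only rmdir on an empty cgroup works).
        find "$root" -depth -mindepth 1 -type d -print -exec rmdir {} \;
    done
}
```

In the epilog itself this would just be `clean_leaked_cgroups /sys/fs/cgroup`, run as root on the compute node.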
Created attachment 15089 [details] job411291.leaked_cgroups
Created attachment 15090 [details] job411291.sacct
Created attachment 15091 [details] job411516.leaked_cgroups
Created attachment 15092 [details] job411516.sacct
Created attachment 15093 [details] slurm.conf
Created attachment 15094 [details] cgroup.conf
Hi Luke,

I've experienced this issue myself and am investigating how we can mitigate it. We have some races between the different cgroup plugins that could lead to this behavior. Thanks for the files.

If you don't mind, I'll re-tag this as sev-3, which I think fits your issue better since you already have a valid workaround. That doesn't mean I will give it less attention.

I will keep you informed.
Could you attach the slurmd log from a node where there are recent uncleaned cgroups?
Created attachment 15108 [details] job413891.leaked_cgroups
Created attachment 15109 [details] job413891.slurmd

Sure thing. Here is the log from a job which leaked cgroups on this node. SlurmdDebug=info.
(In reply to luke.yeager from comment #11)
> Created attachment 15109 [details]
> job413891.slurmd
>
> Sure thing. Here is the log from a job which leaked cgroups on this node.
> SlurmdDebug=info.

Unfortunately SlurmdDebug=info is not enough to see whether this is the same issue I was experiencing. Is it possible for you to increase it temporarily to debug2, trigger the error, and resend the log? Thanks
Created attachment 15165 [details] perf-drift-fixed

Sorry, I'll try to get that debug2 log to you soon. In the meantime, look how well our workaround is doing. Attached is a screenshot of average application performance. We run this CPU-heavy application twice a day on every node in the cluster. Before the workaround, the performance would drop predictably every time until we rebooted the nodes. With the cgroup fix in the epilog, things are MUCH better.
Created attachment 15181 [details] job418939.slurmd.debug2

Here is a debug2 slurmd log from a job which leaked cgroups.
Created attachment 15182 [details] job418939.epilog
Raising to Sev2. Though we have an ugly workaround, this affects all our clusters and affects the way we would plan for multi-tenancy on our clusters, etc.
Thanks Luke, I am still trying to reproduce your exact situation, so far without luck. I was looking for something like this in the logs:

[2020-07-31T15:08:11.530] [113241.batch] debug: _slurm_cgroup_destroy: problem deleting step cgroup path /sys/fs/cgroup/freezer/slurm_gamba1/uid_1000/job_113241/step_batch: Device or resource busy

but I cannot see any similar message in your logs. Instead, I see how the step is canceled and later the 991 signal throws an error, which doesn't mean it's responsible for your leaked cgroups:

[2020-07-27T13:29:18.891] debug2: container signal 18 to job 418939.19
[2020-07-27T13:29:18.891] debug2: container signal 15 to job 418939.19
[2020-07-27T13:29:18.891] [418939.19] error: *** STEP 418939.19 ON n001 CANCELLED AT 2020-07-27T13:29:18 ***
[2020-07-27T13:29:22.533] [418939.19] done with job
[2020-07-27T13:29:23.273] debug2: Processing RPC: REQUEST_SIGNAL_TASKS
[2020-07-27T13:29:23.274] debug: _rpc_signal_tasks: sending signal 991 to step 418939.19 flag 0
[2020-07-27T13:29:23.274] debug: _step_connect: connect() failed dir /var/spool/slurm/d node n001 step 418939.19 No such file or directory
[2020-07-27T13:29:23.274] debug: signal for nonexistent 418939.19 stepd_connect failed: No such file or directory

This signal 991 shouldn't be triggered, but it is. Nevertheless, I have a couple of bugs which have already identified that problem, and it doesn't seem like it should cause any direct harm. I am continuing to work on this.
Do you have any information on how/why exactly job 418939 was terminated? Timelimit? scancel? An error in the app?

[2020-07-27T13:29:18.889] [418939.20] error: *** STEP 418939.20 ON n001 CANCELLED AT 2020-07-27T13:29:18 ***
Luke, I haven't reproduced your situation yet, and I am wondering if there's really an issue in your cgroup controller. If I prepare a debug patch, would you be willing to test it on your systems? There isn't any error in your logs indicating that the directories cannot be deleted or removed. I'd like to add a debug statement when removing the cgroup that shows exactly what's going on at that moment.
Created attachment 15282 [details] debug9429_2002_v1.patch

The patch should be as simple as this one; I just want to see *when* and which cgroup is deleted. After applying it, SlurmdDebug=debug (ideally debug2) must be enabled at least on the nodes where we want to debug. Let me know if you're willing to apply and test it. If so, just send me back the complete slurmd debug2 logs along with the list of leaked cgroups for a job, and also the slurmctld log at debug level. Thanks
Felix pointed out that the log from comment 14 is probably missing some important entries. I changed SlurmdDebug while the job was running, so the slurmstepd processes didn't have a chance to pick up the new value. I'll try to get a repro again and will especially look out for log lines like this: > problem deleting step cgroup path
And looking at the job that caused this particular leak, it creates a bunch of background jobs, with something like "srun mycmd &".

On my local machine, I can reproduce a leaking cgroup with a similar pattern:

    srun -N1 -t 1 bash -c 'srun --oversubscribe -N1 sleep 180s & sleep infinity'

Or with a simple C program that just creates an orphaned process.

And I do see error messages with "Device or resource busy" related to cgroup deletion. So my guess is that it's indeed the same problem that you mentioned, and not a new one.
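For the record, the orphaned-process pattern can be demonstrated outside of Slurm with a tiny shell sketch like the one below. The `spawn_orphan` helper is hypothetical, purely for illustration: a child is launched from a subshell that exits immediately, so the child outlives its parent and keeps running (and, inside a job step, would keep its cgroup populated) after the parent is gone.

```shell
#!/bin/sh
# spawn_orphan SECONDS PIDFILE: start a sleeper from a subshell that
# exits immediately, so the sleeper is orphaned and reparented away
# from its original parent. Any helper process launched this way can
# outlive the step that created it.
spawn_orphan() {
    ( sleep "$1" < /dev/null > /dev/null 2>&1 &
      echo $! > "$2" )
}
```

Running `spawn_orphan 300 /tmp/pid` from a batch script leaves a `sleep` behind with no living ancestor inside the step.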
> Felix pointed out that the log from comment 14 is probably missing some
> important entries. I changed SlurmdDebug while the job was running, so the
> slurmstepd processes didn't have a chance to pick up the new value.

Ah, that would make sense. Please let me know how the new tests go.

(In reply to Felix Abecassis from comment #24)
> And looking at the job that caused this particular leak, it creates a bunch
> of background jobs, with something like "srun mycmd &".
>
> On my local machine, I can reproduce a leaking cgroup with a similar pattern:
> srun -N1 -t 1 bash -c 'srun --oversubscribe -N1 sleep 180s & sleep infinity'
> Or with a simple C program that just creates an orphaned process.

I don't. The only "cg leak" I get is when running an sbatch, where the batch directory remains, but that is different from what you are seeing. Your example cleans up correctly; for example, the memory cgroup:

- Start of job: 8 directories

[lipi@llagosti memory]$ tree -d slurm_gamba1/
slurm_gamba1/
└── uid_1000
    └── job_3326
        ├── step_0
        │   └── task_0
        └── step_extern
            └── task_0

- Sleep 180 ends: 6 directories

[lipi@llagosti memory]$ tree -d slurm_gamba1/
slurm_gamba1/
└── uid_1000
    └── job_3326
        ├── step_0
        │   └── task_0
        └── step_extern
            └── task_0

- CTRL+C or job cancel:

[lipi@llagosti memory]$ tree -d slurm_gamba1/
slurm_gamba1/

> And I do see error messages with "Device or resource busy" related to cgroup
> deletion. So my guess is that it's indeed the same problem that you
> mentioned, and not a new one.

That is not the issue, but a known race between the different cgroup plugins: the task plugin tries to remove the cgroup while there's still a process in it, and the process is later removed by the proctrack plugin. The end result is that the cgroup still gets cleaned up.
To be clear, this is exactly what I did:

    srun -N1 -t 1 bash -c 'srun --oversubscribe -N1 sleep 120s & sleep infinity'

With "-t 1", the job will time out and be killed, so don't ctrl-C or scancel it. Here's what I see in my slurmd log:

[2020-08-05T08:27:49.372] [566.0] debug2: Sending SIGKILL to pgid 2478110
[2020-08-05T08:27:49.372] [566.0] debug: Waiting for IO
[2020-08-05T08:27:49.372] [566.0] debug: Closing debug channel
[2020-08-05T08:27:49.372] [566.0] debug: IO handler exited, rc=0
[2020-08-05T08:27:49.399] [566.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm_ioctl/uid_1000/job_566/step_0): Device or resource busy
[2020-08-05T08:27:49.399] [566.0] debug2: task/cgroup: unable to remove step memcg : Device or resource busy
[2020-08-05T08:27:49.399] [566.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm_ioctl/uid_1000/job_566): Device or resource busy
[2020-08-05T08:27:49.399] [566.0] debug2: task/cgroup: not removing job memcg : Device or resource busy
[2020-08-05T08:27:49.399] [566.0] debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm_ioctl/uid_1000): Device or resource busy
[2020-08-05T08:27:49.399] [566.0] debug2: task/cgroup: not removing user memcg : Device or resource busy

And the memory cgroup after the job was killed (and the "sleep" is not running):

$ tree -d /sys/fs/cgroup/memory/slurm_ioctl
/sys/fs/cgroup/memory/slurm_ioctl
└── uid_1000
    └── job_566
        └── step_0
            └── task_0
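As a side note, when rmdir returns "Device or resource busy" like this, the kernel can tell you which pids are still pinning the hierarchy: every cgroup directory exposes a cgroup.procs file, and the directory can only be removed once that file is empty for it and all of its children. A small sketch for inspecting a leaked subtree (the `list_cgroup_pids` helper is our own, for illustration; it works on any directory tree containing cgroup.procs files):

```shell
#!/bin/sh
# list_cgroup_pids DIR: print the unique pids still attached anywhere
# in the cgroup subtree rooted at DIR. A non-empty result explains the
# EBUSY from rmdir: some process is still a member of the cgroup.
list_cgroup_pids() {
    find "$1" -name cgroup.procs -exec cat {} \; | sort -un
}
```

For the tree above, `list_cgroup_pids /sys/fs/cgroup/memory/slurm_ioctl/uid_1000/job_566` would reveal whether anything is still inside step_0/task_0.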
> With "-t 1", the job will timeout and be killed. So don't ctrl-C or scancel > it. Thanks, I see that now. I will come back with something asap.
Hi,

I can observe this only happens with the jobacct_gather/cgroup plugin. It seems the cgroup plugin creates the task_0 directory, which is incompatible with the task plugin. I've seen this kind of race before.

Can we confirm it stops happening if you move to jobacct_gather/linux? There are only slight differences between the two plugins, which shouldn't impact your accounting significantly. Specifically, jobacct_gather/cgroup reads the cpuacct.stat file, which directly accumulates the used cputime over time, while the linux plugin walks each task's /proc entry to make the calculation.

We have also opened bug 8763, which detected a possible thread-safety issue in jobacct_gather/cgroup.

Changing to jobacct_gather/linux can be done with just an 'scontrol reconfig'. The new parameter will be applied to new jobs only. Could you try it and let me know if new jobs still leave cgroup dirs behind?
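For anyone following along, the suggested workaround amounts to a one-line slurm.conf change, followed by 'scontrol reconfig'. Sketched below; the commented-out line shows the value being replaced (your existing file may or may not set it explicitly, since jobacct_gather/cgroup would have been configured somewhere for this bug to trigger):

```
# slurm.conf: switch accounting gather off the racy cgroup plugin
#JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherType=jobacct_gather/linux
```

Remember that, as noted above, the change only applies to jobs started after the reconfigure.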
(In reply to Felip Moll from comment #28)
> Hi,
>
> I can observe this only happens with jobacct_gather/cgroup plugin.
>
> It seems the cgroup plugin creates the task_0 directory which is
> incompatible with task plugin. I've seen this kind of race before.
>
> Can we confirm it stops happening if you move to jobacct_gather/linux? There
> are only slight differences with both plugins which shouldn't impact your
> accounting significantly. Specifically jobacctgather/cgroup reads
> cpuacct.stat file, which directly accumulates the used cputime over time
> while linux plugin goes over each /proc/ tasks to make the calculation.
>
> We have also opened bug 8763 which detected a possible issue in
> jobacct_gather/cgroups related to thread safety.
>
> Changing to jobacct_gather/linux can be done with just an 'scontrol
> reconfig'. The new parameter will be applied to new jobs only.
>
> May you try it and let me know if new jobs still leave cgroups dirs behind?

I've found that a kill signal is issued *after* jobacct_gather tries to remove the task_0 cgroup directory. Some processes, like the "sleep infinity", which are not directly launched by srun, may need a different signal than 996, 18, or 15 to die. In that case the process isn't gone until a harder signal is issued, so jobacct_gather cannot remove the cgroup directory.

I am looking for workarounds, but for the moment just switching to jobacct_gather/linux is what I recommend, and that should fix your issue.
Hi,

This is to let you know that I have a patch pending review which should fix the issue. I will let you know when it has passed QA and is accepted.

Regards!
*** Ticket 9351 has been marked as a duplicate of this ticket. ***
*** Ticket 9498 has been marked as a duplicate of this ticket. ***
Hi,

We've committed a fix for this issue, which will be available in the next release, 20.02.5. The commit is 931d3bfb. You can apply the individual patch, upgrade to 20.02.5 when it is released, or use jobacct_gather/linux as a workaround.

Thanks for reporting.
*** Ticket 12107 has been marked as a duplicate of this ticket. ***