| Summary: | Agent Queue Size bursts and no cleanup | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Martin Siegert <siegert> |
| Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | | |
| Priority: | --- | CC: | asa188, kaizaad |
| Version: | 17.11.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5111 | | |
| Site: | Simon Fraser University | | |
| Version Fixed: | 17.11.7 | | |
| Attachments: | backtrace of slurmstepd, slurmctld.log, sdiag output, slurmd.log from cdr761, patch | | |
Description
Martin Siegert
2018-05-08 17:06:08 MDT
The biggest problem right now is that jobs that get "started" by the scheduler are hung in the prolog (`scontrol show job <jobid>` shows Reason=Prolog), but nothing ever gets sent to the nodes. After a while the JobState changes to COMPLETING and the job disappears from the system without any record for the user. The agent queue size rarely drops below 2000, even after a slurmctld restart, and then quickly rises again.

Hi

Good news is, the slurmd backtrace looks fine now; this should reduce at least some RPCs. Could you send me the current slurmctld.log, slurmd.log and sdiag output?

Dominik

Yesterday evening we paused the starting of new jobs by setting all partitions to State=Down. It took a while (more than 30 min.), during which Slurm mostly completed jobs, but then the agent queue size dropped to 0. We brought the partitions back up and Slurm has been stable since. I agree that the problems we were seeing yesterday may not be related to the thread deadlocks. It looks more like, once the agent queue size grows above a certain limit, it continues to increase and never recovers.

We just had a "mini blip": the agent queue size rose to about 1500. I am attaching the slurmctld.log and sdiag output taken during that blip. Slurm did recover from this blip and the agent queue size dropped back to 0. I am also attaching the slurmd.log from one particular node (cdr761), which may have contributed to the blip because a user was running sbatch on it.

Created attachment 6815 [details]
slurmctld.log
Created attachment 6816 [details]
sdiag output
Created attachment 6817 [details]
slurmd.log from cdr761
Created attachment 6877 [details]
patch
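For reference, the diagnostic and mitigation steps described in the comments above can be reproduced with standard Slurm commands. This is a minimal sketch, assuming shell access to the controller; the `watch` interval and the loop over `sinfo` output are illustrative, not taken from the ticket:

```
# Check why a "started" job is stuck (the ticket reports Reason=Prolog):
scontrol show job <jobid> | grep -E 'JobState|Reason'

# Watch the agent queue size reported by sdiag (the metric discussed above):
watch -n 30 "sdiag | grep -i 'agent queue size'"

# Pause the starting of new jobs by setting every partition to State=DOWN
# (partitions can also be downed one by one):
for p in $(sinfo -h -o '%R' | sort -u); do
    scontrol update PartitionName="$p" State=DOWN
done

# Bring the partitions back up once the agent queue has drained:
for p in $(sinfo -h -o '%R' | sort -u); do
    scontrol update PartitionName="$p" State=UP
done
```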
Hi
This patch fixes a minor race/deadlock in slurmstepd that was introduced in 17.11.6.
Could you apply it and check if it helps?
Dominik
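As a rough illustration of applying such a patch to a source build (the paths, patch file name, and configure options below are assumptions, not from this ticket):

```
# Illustrative only: apply the attached patch to a Slurm 17.11 source tree and rebuild.
cd /usr/local/src/slurm-17.11.5          # assumed source location
patch -p1 < ~/slurmstepd-race.patch      # assumed name for the attachment above
./configure --prefix=/opt/slurm && make -j && sudo make install

# Note: already-running slurmstepd processes keep the old code, so the fix only
# takes effect for job steps launched after the rebuilt binaries are installed.
```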
Thanks Dominik. I've applied the patch on our cluster; by the looks of it, it will only apply to new slurmstepd processes, so it'll take a little while to have an effect. I'll have to have Martin update you on this, as I'm away for the next 2.5 weeks and we're doing an outage for the last week of May, so it won't be under any load during that time.

Additionally, we have now patched 17.11.5 with commit 6a74be8. The system has been stable ever since. We also plan to upgrade to 17.11.7 on May 30.

Hi

Glad to hear that it's all back to normal. If it's alright with you, I'm going to move this to resolved/fixed. If there's anything else I can help with, please reopen or file a new ticket.

Dominik
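For completeness, a sketch of backporting the referenced commit onto a local 17.11.5 source tree; the repository URL is SchedMD's public mirror, and the tag and prefix are assumptions:

```
# Illustrative only: cherry-pick the commit mentioned above onto 17.11.5.
git clone https://github.com/SchedMD/slurm.git && cd slurm
git checkout slurm-17-11-5-1     # assumed 17.11.5 release tag; use the tag actually deployed
git cherry-pick 6a74be8          # commit referenced in the comment above
./configure --prefix=/opt/slurm && make -j && sudo make install
```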