Created attachment 6799 [details] backtrace of slurmstepd

This is in reference to bug 5111. We have upgraded to 17.11.5 and additionally cherry-picked commits 1675ada0a, a7c8964e, 3be9e1ee0 and e5f03971b. We continue to see the same problem: the agent queue size increases continuously (we've seen it go as high as 200000) and jobs stay in the completing state. I am attaching the output from node cdr1545 of `gdb -batch -ex "thread apply all bt full" -p 99504`, where 99504 is the slurmstepd process:

S USER PID   PPID NI RSS  VSZ     STIME %CPU TIME     COMMAND
S root 9489     1  0 7716 1506312 May03  0.0 00:00:01 /opt/software/slurm/sbin/slurmd
S root 99504    1  0 4536 306160  07:28  0.0 00:00:00 slurmstepd: [7774580.extern]

- Martin
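P.S. For completeness, the commands used to collect the trace were roughly as follows (a sketch only, run as root on the affected node; the pgrep pattern and output file name are our own choices):

```
# Find slurmstepd processes left behind by completing jobs.
pgrep -a slurmstepd

# Dump a full backtrace of every thread for each of them.
for pid in $(pgrep slurmstepd); do
    gdb -batch -ex "thread apply all bt full" -p "$pid" > "slurmstepd-$pid.bt" 2>&1
done
```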
The biggest problem right now is that jobs that get "started" by the scheduler hang in the prolog (`scontrol show job <jobid>` shows Reason=Prolog), but nothing ever gets sent to the nodes. After a while the JobState changes to COMPLETING and the job disappears from the system without any record for the user. Even after a slurmctld restart the agent queue size rarely drops below 2000, and then it quickly rises again.
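To make the symptom easier to follow, this is the kind of thing we watch on the controller (a sketch; the grep patterns assume the usual wording of the sdiag/scontrol output and may need adjusting):

```
# Track the agent queue size every 30 s on the slurmctld host.
watch -n 30 'sdiag | grep -i "agent queue size"'

# For a job stuck in the prolog, confirm its state and reason.
scontrol show job <jobid> | grep -E "JobState|Reason"
```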
Hi,

The good news is that the slurmd backtrace looks fine now; this should reduce at least some of the RPCs. Could you send me the current slurmctld.log, slurmd.log and sdiag output?

Dominik
Yesterday evening we paused the starting of new jobs by setting all partitions to State=Down. It took a while (more than 30 min.), during which Slurm mostly completed jobs, but then the agent queue size dropped to 0. We brought the partitions back up and Slurm has been stable since. I agree that the problems we were seeing yesterday may not be related to the thread deadlocks. It looks more like once the agent queue size grows above a certain limit, it keeps increasing and never recovers. We just had a "mini blip": the agent queue size rose to about 1500. I am attaching the slurmctld.log and sdiag output during that blip. Slurm did recover from this blip, and the agent queue size dropped back to 0. I am also attaching the slurmd.log from one particular node (cdr761) which may have contributed to the blip, because a user was running sbatch on it.
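For the record, the pause was done with scontrol roughly as follows (a sketch; <partition> stands for each of our production partitions):

```
# Stop new jobs from being scheduled while running work drains.
scontrol update PartitionName=<partition> State=DOWN

# ...wait for the agent queue size reported by sdiag to drop to 0...

# Resume normal scheduling.
scontrol update PartitionName=<partition> State=UP
```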
Created attachment 6815 [details] slurmctld.log
Created attachment 6816 [details] sdiag output
Created attachment 6817 [details] slurmd.log from cdr761
Created attachment 6877 [details] patch

Hi,

This patch fixes a minor race/deadlock in slurmstepd that was introduced in 17.11.6. Could you apply it and check if it helps?

Dominik
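P.S. Applying it would look roughly like this (a sketch only; the patch file name, paths and configure options are placeholders, adjust them to whatever you use on your cluster):

```
# Assumption: the attachment is saved as slurmstepd-race.patch and applies
# with -p1 to your patched 17.11.5 source tree.
cd slurm-17.11.5
patch -p1 < slurmstepd-race.patch
./configure --prefix=/opt/software/slurm && make -j && make install

# Restart slurmd on the compute nodes; only newly launched slurmstepd
# processes pick up the fix, existing steps keep running the old code.
systemctl restart slurmd
```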
Thanks Dominik. I've applied the patch on our cluster; by the looks of it, it will only apply to new slurmstepd processes, so it'll take a little while to have an effect. I'll have to have Martin update you on this, as I'm away for the next 2.5 weeks and we're doing an outage for the last week of May, so it won't be under any load during that time.
Additionally, we have now patched 17.11.5 with commit 6a74be8. The system has been stable ever since. We also plan to upgrade to 17.11.7 on May 30.
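For completeness, the commit was picked onto our local 17.11.5 branch roughly like this (a sketch; the branch name is a placeholder for our internal one):

```
# Assumption: a clone of the upstream Slurm repository with our local branch.
git fetch origin
git checkout our-17.11.5-patches
git cherry-pick 6a74be8
# Rebuild and redeploy slurmctld/slurmd afterwards.
```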
Hi,

Glad to hear that's all back to normal. If it's alright with you, I'm going to move this to resolved/fixed. If there's anything else I can help with, please reopen or file a new ticket.

Dominik