Ticket 10767 - slurmctld hangs on `scontrol reconfig`
Status: RESOLVED FIXED
Product: Slurm
Component: slurmctld
Version: 20.11.3
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
Depends on: 10605
Reported: 2021-02-02 15:54 MST by Kilian Cavalotti
Modified: 2021-03-24 07:59 MDT

Site: Stanford
Machine Name: Sherlock
Version Fixed: 20.11.6


Attachments
`thread apply all bt full` output (599.71 KB, text/x-log)
2021-02-02 15:54 MST, Kilian Cavalotti
Details

Description Kilian Cavalotti 2021-02-02 15:54:16 MST
Created attachment 17722 [details]
`thread apply all bt full` output

Hi SchedMD!

We've noticed a pattern where `slurmctld` seems to hang after a `scontrol reconfig`. This looks like new behavior in 20.11, and it is pretty reproducible in our environment.

After a `scontrol reconfig` is issued, the scontrol command returns, and the controller logs additional steps for a few seconds. During that time, `scontrol ping` shows the controller as `UP`, but after a few seconds, a new `scontrol ping` hangs and the controller stops logging anything. It doesn't show any more process/CPU activity, but doesn't really go down either, so the secondary controller doesn't take over and things stay stuck forever, until the primary `slurmctld` process is force-killed and restarted.

I took a core dump when the `slurmctld` process was stuck, here's the regular info:

(gdb) bt
#0  0x00007fb0d1e4ad12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000420dd7 in _agent_init (arg=<optimized out>) at agent.c:1377
#2  0x00007fb0d1e46dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb0d1b7002d in clone () from /lib64/libc.so.6


And the output of `thread apply all bt full` is attached.

Happy to provide any more details from the core.

Thanks!
--
Kilian
Comment 2 Dominik Bartkiewicz 2021-02-03 02:36:45 MST
Hi

This is a duplicate of bug 10605.
Unfortunately, the fix isn't in the repo yet.
This deadlock is a regression from https://github.com/SchedMD/slurm/commit/b8f2337f04793
It should be really rare, but you can protect slurmctld from this deadlock by reverting that commit.

Dominik
Comment 3 Kilian Cavalotti 2021-02-03 09:36:01 MST
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #2)
> This is a duplicate of bug 10605.

Thanks for the pointer!

> Unfortunately fix isn't in the repo yet.
> This deadlock is a regression from
> https://github.com/SchedMD/slurm/commit/b8f2337f04793
> It should be really rare, but you can protect slurmctld from this deadlock
> by reverting this patch.

Got it. It's highly reproducible in our case: pretty much every single `scontrol reconfig` ends up in that deadlock situation.

I've deployed a version with b8f2337f04793 reverted, and that does indeed seem to resolve the issue. `scontrol reconfig` doesn't make the controller hang anymore, so that's good!

From bug 10605, I'm not exactly clear on what the status is for a fix to be merged.  Do you have any update?

Thanks!
--
Kilian
Comment 4 Dominik Bartkiewicz 2021-02-08 03:00:14 MST
Hi

We are considering different approaches to solving this issue; it will probably not be a simple revert of b8f2337f04793. But this issue is severe, and I think the fix will be included in 20.11.4.

Dominik
Comment 5 Kilian Cavalotti 2021-02-19 09:05:32 MST
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #4)
> We consider different approaches to solving this issue. Probably this will
> not be a simple revert of b8f2337f04793. But this issue is severe, and I
> think the fix will be included in 20.11.4.

Just a quick check: given that 20.11.4 was released yesterday, could you please confirm whether a fix for this issue was included in the release?

Thanks!
--
Kilian
Comment 6 Dominik Bartkiewicz 2021-02-19 09:14:57 MST
Hi 

I am sorry, but unfortunately, no.

Dominik
Comment 7 Kilian Cavalotti 2021-02-19 09:27:57 MST
On Fri, Feb 19, 2021 at 8:14 AM <bugs@schedmd.com> wrote:
> I am sorry, but unfortunately, no.

No worries! I'm preparing to deploy 20.11.4 and reviewing our local
patchset, so I just wanted to make sure that I still needed to revert
b8f2337f04793.

Thanks for the confirmation!

Cheers,
--
Kilian
Comment 8 Kilian Cavalotti 2021-03-17 14:15:51 MDT
Hi!

Just checking to see if a fix has been merged for this issue in 20.11.5?

Thanks!
--
Kilian
Comment 10 Dominik Bartkiewicz 2021-03-18 04:32:14 MDT
Hi 

I am sorry, but unfortunately, still no.

Dominik
Comment 11 Kilian Cavalotti 2021-03-18 08:28:13 MDT
On Thu, Mar 18, 2021 at 3:32 AM <bugs@schedmd.com> wrote:
> I am sorry, but unfortunately, still no.

No worries, thanks, I'll keep reverting b8f2337f04793 then.

Cheers,
--
Kilian
Comment 12 Jenny Williams 2021-03-24 07:36:36 MDT
We have been running 20.11.3 since Feb 3rd; while we cannot cause the issue at will, it is happening for us roughly once a week.

I have a different bug I am following whose fix is rolled into 20.11.6; is there a plan to roll this patch into 20.11.6 as well?

Thanks 
--
Jenny
Comment 13 Dominik Bartkiewicz 2021-03-24 07:59:03 MDT
Hi

The fix for this issue has been committed to the repo and will be included in Slurm 20.11.6.
https://github.com/SchedMD/slurm/commit/6db0aca5a

Dominik