Ticket 10767 - slurmctld hangs on `scontrol reconfig`
Status: RESOLVED FIXED
Product: Slurm
Component: slurmctld
Version: 20.11.3
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
Depends on: 10605
Reported: 2021-02-02 15:54 MST by Kilian Cavalotti
Modified: 2021-03-24 07:59 MDT

Site: Stanford
Machine Name: Sherlock
Version Fixed: 20.11.6


Attachments
`thread apply all bt full` output (599.71 KB, text/x-log)
2021-02-02 15:54 MST, Kilian Cavalotti
Details

Description Kilian Cavalotti 2021-02-02 15:54:16 MST
Created attachment 17722 [details]
`thread apply all bt full` output

Hi SchedMD!

We've noticed a pattern where `slurmctld` seems to hang after a `scontrol reconfig`. This looks like new behavior in 20.11, and it is pretty reproducible in our environment.

After a `scontrol reconfig` is issued, the scontrol command returns, and the controller logs additional steps for a few seconds. During that time, `scontrol ping` shows the controller as `UP`, but after a few seconds, a new `scontrol ping` hangs and the controller stops logging anything. It doesn't show any more process/CPU activity, but doesn't really go down either, so the secondary controller doesn't take over and things stay stuck forever, until the primary `slurmctld` process is force-killed and restarted.

I took a core dump when the `slurmctld` process was stuck, here's the regular info:

(gdb) bt
#0  0x00007fb0d1e4ad12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000420dd7 in _agent_init (arg=<optimized out>) at agent.c:1377
#2  0x00007fb0d1e46dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb0d1b7002d in clone () from /lib64/libc.so.6


And the output of `thread apply all bt full` is attached.

Happy to provide any more details from the core.

Thanks!
--
Kilian
Comment 2 Dominik Bartkiewicz 2021-02-03 02:36:45 MST
Hi

This is a duplicate of bug 10605.
Unfortunately, the fix isn't in the repo yet.
This deadlock is a regression from https://github.com/SchedMD/slurm/commit/b8f2337f04793
It should be really rare, but you can protect slurmctld from this deadlock by reverting that commit.

Dominik
Comment 3 Kilian Cavalotti 2021-02-03 09:36:01 MST
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #2)
> This is a duplicate of bug 10605.

Thanks for the pointer!

> Unfortunately fix isn't in the repo yet.
> This deadlock is a regression from
> https://github.com/SchedMD/slurm/commit/b8f2337f04793
> It should be really rare, but you can protect slurmctld from this deadlock
> by reverting this patch.

Got it. It's highly reproducible in our case: pretty much every single `scontrol reconfig` ends up in that deadlock situation.

I've deployed a version with b8f2337f04793 reverted, and that does indeed seem to resolve the issue. `scontrol reconfig` doesn't make the controller hang anymore, so that's good!

From bug 10605, I'm not exactly clear on what the status is for a fix to be merged.  Do you have any update?

Thanks!
--
Kilian
Comment 4 Dominik Bartkiewicz 2021-02-08 03:00:14 MST
Hi

We are considering different approaches to solving this issue; it will probably not be a simple revert of b8f2337f04793. But this issue is severe, and I think the fix will be included in 20.11.4.

Dominik
Comment 5 Kilian Cavalotti 2021-02-19 09:05:32 MST
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #4)
> We consider different approaches to solving this issue. Probably this will
> not be a simple revert of b8f2337f04793. But this issue is severe, and I
> think the fix will be included in 20.11.4.

Just a quick check: given that 20.11.4 was released yesterday, could you please confirm whether a fix for this issue was included in the release?

Thanks!
--
Kilian
Comment 6 Dominik Bartkiewicz 2021-02-19 09:14:57 MST
Hi 

I am sorry, but unfortunately, no.

Dominik
Comment 7 Kilian Cavalotti 2021-02-19 09:27:57 MST
On Fri, Feb 19, 2021 at 8:14 AM <bugs@schedmd.com> wrote:
> I am sorry, but unfortunately, no.

No worries! I'm preparing to deploy 20.11.4 and reviewing our local
patchset, so I just wanted to make sure that I still needed to revert
b8f2337f04793.

Thanks for the confirmation!

Cheers,
--
Kilian
Comment 8 Kilian Cavalotti 2021-03-17 14:15:51 MDT
Hi!

Just checking to see if a fix has been merged for this issue in 20.11.5?

Thanks!
--
Kilian
Comment 10 Dominik Bartkiewicz 2021-03-18 04:32:14 MDT
Hi 

I am sorry, but unfortunately, still no.

Dominik
Comment 11 Kilian Cavalotti 2021-03-18 08:28:13 MDT
On Thu, Mar 18, 2021 at 3:32 AM <bugs@schedmd.com> wrote:
> I am sorry, but unfortunately, still no.

No worries, thanks, I'll keep reverting b8f2337f04793 then.

Cheers,
--
Kilian
Comment 12 Jenny Williams 2021-03-24 07:36:36 MDT
We have been running 20.11.3 since Feb 3rd; while we cannot cause the issue at will, it is happening for us roughly once a week.

I have a different bug I am following whose fix is rolled into 20.11.6; is there a plan to roll this patch into 20.11.6 as well?

Thanks 
--
Jenny
Comment 13 Dominik Bartkiewicz 2021-03-24 07:59:03 MDT
Hi

The fix for this issue has been committed to the repo and will be included in Slurm 20.11.6.
https://github.com/SchedMD/slurm/commit/6db0aca5a

Dominik