Created attachment 17722
`thread apply all bt full` output

Hi SchedMD!

We've noticed a pattern where `slurmctld` seems to hang after a `scontrol reconfig`. This looks like new behavior in 20.11, and it is pretty reproducible in our environment.

After a `scontrol reconfig` is issued, the scontrol command returns, and the controller logs additional steps for a few seconds. During that time, `scontrol ping` shows the controller as `UP`, but after a few seconds, a new `scontrol ping` hangs, and the controller stops logging anything. It doesn't show any more process/CPU activity, but it doesn't really go down either, so the secondary controller doesn't take over and things stay stuck until the primary `slurmctld` process is forcibly killed and restarted.

I took a core dump while the `slurmctld` process was stuck. Here's the regular info:

(gdb) bt
#0  0x00007fb0d1e4ad12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000420dd7 in _agent_init (arg=<optimized out>) at agent.c:1377
#2  0x00007fb0d1e46dd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb0d1b7002d in clone () from /lib64/libc.so.6

The output of `thread apply all bt full` is attached. Happy to provide any more details from the core.

Thanks!
--
Kilian
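For anyone who wants to catch this state automatically, here is a minimal sketch of a watchdog probe. It relies only on the symptom described above: `scontrol ping` never returns once the controller is stuck. The function names `probe` and `check_controller`, the 10-second timeout, and the status labels are all made up for illustration, and treating a non-zero exit code as "error" is an assumption rather than documented `scontrol` behavior.

```python
import subprocess


def probe(cmd, timeout_s):
    """Run cmd and classify the outcome as 'responsive', 'error' or 'hung'.

    Returns a (status, output) tuple. 'hung' means the command did not
    finish within timeout_s seconds and was killed.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
            text=True,
        )
    except subprocess.TimeoutExpired:
        # The command never returned: the symptom reported in this bug.
        return "hung", ""
    # Assumption: a non-zero exit code is simply reported as 'error'.
    status = "responsive" if result.returncode == 0 else "error"
    return status, result.stdout


def check_controller(timeout_s=10):
    """Hypothetical wrapper: probe the controller with `scontrol ping`."""
    return probe(["scontrol", "ping"], timeout_s)
```

A cron job or monitoring check could call `check_controller()` and alert on a `hung` status. Since the secondary controller does not take over in this state, an external check like this is one way to notice the hang.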
Hi

This is a duplicate of bug 10605. Unfortunately, the fix isn't in the repo yet.
This deadlock is a regression from
https://github.com/SchedMD/slurm/commit/b8f2337f04793
It should be really rare, but you can protect slurmctld from this deadlock by reverting this patch.

Dominik
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #2)
> This is a duplicate of bug 10605.

Thanks for the pointer!

> Unfortunately fix isn't in the repo yet.
> This deadlock is a regression from
> https://github.com/SchedMD/slurm/commit/b8f2337f04793
> It should be really rare, but you can protect slurmctld from this deadlock
> by reverting this patch.

Got it. It's pretty reproducible in our case: pretty much every single `scontrol reconfig` ends up in that deadlock situation.

I've deployed a version with b8f2337f04793 reverted, and that does seem to resolve the issue. `scontrol reconfig` doesn't make the controller hang anymore, so that's good!

From bug 10605, I'm not exactly clear on what the status is for a fix to be merged. Do you have any update?

Thanks!
--
Kilian
Hi

We are considering different approaches to solving this issue. The fix will probably not be a simple revert of b8f2337f04793. But this issue is severe, and I think the fix will be included in 20.11.4.

Dominik
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #4)
> We consider different approaches to solving this issue. Probably this will
> not be a simple revert of b8f2337f04793. But this issue is severe, and I
> think the fix will be included in 20.11.4.

Just a quick check: given that 20.11.4 was released yesterday, could you please confirm whether a fix for this issue has been included in the release?

Thanks!
--
Kilian
Hi I am sorry, but unfortunately, no. Dominik
On Fri, Feb 19, 2021 at 8:14 AM <bugs@schedmd.com> wrote:
> I am sorry, but unfortunately, no.

No worries! I'm preparing to deploy 20.11.4 and reviewing our local patchset, so I just wanted to make sure that I still needed to revert b8f2337f04793.

Thanks for the confirmation!

Cheers,
--
Kilian
Hi! Just checking to see if a fix has been merged for this issue in 20.11.5? Thanks! -- Kilian
Hi I am sorry, but unfortunately, still no. Dominik
On Thu, Mar 18, 2021 at 3:32 AM <bugs@schedmd.com> wrote:
> I am sorry, but unfortunately, still no.

No worries, thanks, I'll keep reverting b8f2337f04793 then.

Cheers,
--
Kilian
We have been running 20.11.3 since Feb 3rd; while we cannot cause the issue at will, it is happening for us roughly once a week. I have a different bug I am following that is rolled into 20.11.6 - is there a plan to roll this patch into 20.11.6 as well? Thanks -- Jenny
Hi

The fix for this issue has been committed to the repo and will be included in Slurm 20.11.6.
https://github.com/SchedMD/slurm/commit/6db0aca5a

Dominik