Ticket 13032

Summary: HA failures with slurmctld
Product: Slurm Reporter: Google Cloud Team <slurm-gcp>
Component: Configuration Assignee: Tim McMullan <mcmullan>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: carlosboneti, dgouju, lyeager, nick
Version: 20.11.7   
Hardware: Linux   
OS: Linux   
Site: Google Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 22.05pre1 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: Log files

Description Google Cloud Team 2021-12-14 10:57:45 MST
Created attachment 22672 [details]
Log files

Cluster description:
There are two servers, A and B. Both have slurmctld and slurmdbd running.
A is configured as the primary controller, and B is configured as the backup controller for both daemons.

They mount an NFS filesystem from a third server, where they share the home directories, the configuration directory, the munge key, and the Slurm state.
Both slurmdbd daemons connect to an external SQL server.


Bug 1:
On A, killall -9 slurmctld slurmdbd
Wait for slurmctld on B to take control
On A, systemctl start slurmdbd ; sleep 1 ; systemctl start slurmctld
Wait for slurmctld on B to give control back to A
On A, killall -9 slurmctld slurmdbd
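
The cycle above can be sketched as a script (a hypothetical reproducer, to be run on node A; the `scontrol ping` grep patterns and the 5-second poll interval are assumptions to adjust for your cluster):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the bug 1 reproduction cycle, run on node A.
# Assumes `scontrol ping` output mentions "primary"/"backup" and "UP";
# adjust the patterns below to match your Slurm version's output.

# Poll `scontrol ping` until its output matches the given pattern.
wait_for() {
    local pattern="$1"
    until scontrol ping 2>/dev/null | grep -q "$pattern"; do
        sleep 5
    done
}

reproduce_bug1() {
    killall -9 slurmctld slurmdbd     # first failure on A
    wait_for "backup.*UP"             # B takes control
    systemctl start slurmdbd
    sleep 1
    systemctl start slurmctld
    wait_for "primary.*UP"            # B gives control back to A
    killall -9 slurmctld slurmdbd     # second failure: slurmdbd on B crashes
}
```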


Bug 2:
On A: echo 1 | sudo tee /proc/sys/kernel/sysrq ; echo c | sudo tee /proc/sysrq-trigger
slurmdbd on B never takes control. slurmctld on B is not able to connect to slurmdbd.

Detailed logs attached.
Comment 3 Damien Gouju 2021-12-15 02:22:44 MST
To clarify bug 1: at the end of the process, slurmdbd on B first takes over, but it then crashes when slurmctld on B takes over.
Comment 4 Tim McMullan 2021-12-15 06:48:05 MST
(In reply to Damien Gouju from comment #3)
> To clarify bug 1: in the end of the process, slurmdbd B first takes over but
> then crashes when slurmctld B takes over.

Thank you for the extra clarity!  Does the slurmdbd generate a core file when it crashes in this case?

I'm trying to reproduce locally, but if there is a core file it might be good to get a little more info on the crash that you saw specifically.

Thanks!
--Tim
Comment 5 Damien Gouju 2021-12-15 08:16:15 MST
In the attachment:
- *.kill-9.log refer to bug 1
- *.panic.log refer to bug 2

We are looking at creating a crash dump for bug 1.
Comment 6 Tim McMullan 2021-12-15 08:53:01 MST
(In reply to Damien Gouju from comment #5)
> In the attachement:
> - *.kill-9.log refer to bug 1
> - *.panic.log refer to bug 2

Thanks, I did notice those.  Unfortunately they haven't provided much insight yet :/

> We are looking at creating crashdump for bug 1.

Thank you!

As a status update from me, I have replicated bug 2 from this ticket and am looking into it.
Comment 8 Damien Gouju 2021-12-17 09:36:02 MST
Hi Tim,
Regarding bug 1, no core was found in /, /var/tmp, or /var/log/slurm (the locations listed at https://slurm.schedmd.com/slurmdbd.html#SECTION_CORE-FILE-LOCATION ).
Thank you!
Best regards,
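
As a generic Linux aside (standard kernel interfaces, not Slurm-specific), the absence of a core file in those directories can often be explained by the core pattern or the core size limit:

```shell
# Where (and how) the kernel writes core files; a leading "|" means cores are
# piped to a handler such as systemd-coredump instead of landing on disk.
cat /proc/sys/kernel/core_pattern
# Per-process core size limit as seen by this shell; 0 suppresses core files.
ulimit -c
```

When core_pattern pipes to systemd-coredump, `coredumpctl list slurmdbd` would be the place to look for captured dumps.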
Comment 9 Tim McMullan 2021-12-17 09:47:29 MST
Thanks for the update!

I've made some progress on the second issue but I still don't have it quite right.  My first try at replicating the first issue didn't succeed, but I'm going to try a couple more things to see if I can get a local reproducer for it!

Thanks again!
--Tim
Comment 10 Damien Gouju 2021-12-17 10:00:12 MST
I corrected the reported version to 20.11.7
Comment 11 Carlos Boneti 2021-12-21 11:04:06 MST
Bug #1 seems to be resolved if we restart slurmdbd manually.  It seems its systemd entry is not configured to restart on failure, but only to start on boot.

We wonder whether it would not be simpler to just change the systemd config for slurmdbd so that it restarts on failure.
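
As a sketch, the stopgap suggested here would be a systemd drop-in override (assuming the unit is named slurmdbd.service; created with `systemctl edit slurmdbd` or placed by hand):

```ini
# /etc/systemd/system/slurmdbd.service.d/override.conf
# Stopgap only: restart slurmdbd automatically if it crashes.
[Service]
Restart=on-failure
RestartSec=5
```

followed by `systemctl daemon-reload`.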
Comment 12 Tim McMullan 2021-12-21 11:44:00 MST
(In reply to Carlos Boneti from comment #11)
> Bug #1 seems to be resolved if we restart slurmdbd manually.  It seems its
> systemd entry is not configured to restart on failure, but just start on
> boot. 
> 
> We wonder if it would not be simple to just change systemd config for
> slurmdb so it would restart on failure.

That is something you could do to handle it in the meantime, but it really should still be able to handle more than one failover event.  Right now I don't think that's the long-term fix from my perspective, but I don't see anything wrong with using it for now!
Comment 13 Tim McMullan 2021-12-21 13:20:23 MST
I now have the first bug reproducing locally as well, and I was able to grab a core file.  I'm checking it out to see where that takes me!

Thanks,
--Tim
Comment 17 Tim McMullan 2022-01-06 08:37:07 MST
Hi and sorry about the delay in an update here!

For the slurmdbd crash, I've identified what's causing it.  What appeared to be the right fix does avoid the crash, but it leaves the slurmdbd in an unusual state, so there is more to it than I initially suspected.  I'm still investigating that state to see what the right way out of it is.

For the issue of the slurmdbd not taking over when the whole primary node crashes, it's simply not detecting that the socket is dead.  It's a strange problem, but I have a couple of ideas for fixing it, one of which I am testing now; results so far show that it is working.

Thanks, and I'll keep you updated as I make more progress with this!
--Tim
Comment 18 Tim McMullan 2022-01-18 08:20:14 MST
Hey, I just wanted to give you an update on where we are with this.

The first issue still requires investigation to determine the correct solution.  There are some remaining quirks in the slurmdbd HA that I need to work out before I can settle on the right fix.

I've got a chat today regarding the fix for the second issue; implementing it may be somewhat involved.

Thanks!
--Tim
Comment 19 Damien Gouju 2022-01-18 08:26:23 MST
Thanks a lot Tim, this is highly appreciated!
Comment 26 Tim McMullan 2022-01-24 14:48:17 MST
Just another update for you:

We just landed a patch (https://github.com/SchedMD/slurm/commit/979cc92) that should fix the slurmdbd crashing with multiple failover/failback events.  I expect that it will be available in 21.08.6+

The approach for fixing the second issue has been settled on and I'm working on getting a patch together for that.

Thanks!
--Tim
Comment 31 Tim McMullan 2022-02-08 06:53:44 MST
Hey, I just wanted to give you guys another update here.  We have a working solution for the second issue (primary slurmdbd crashing but backup not taking over), I'm just working on getting the patch finalized and reviewed.

Thanks!
--Tim
Comment 35 Damien Gouju 2022-04-06 03:34:55 MDT
Hi Tim,
Do you have an update on the patch for the second issue?
Thanks a lot!
Best regards,
Comment 36 Tim McMullan 2022-04-08 07:46:32 MDT
(In reply to Damien Gouju from comment #35)
> Hi Tim,
> Do you have an update on the patch for the second issue?
> Thanks a lot!
> Best regards,

Sorry about the lengthy wait here!  We've gone through a few revisions of this patch, but what looks like a final version should be reviewed soon.  It involved adding/changing some configuration options so we expect it will be targeted at the 22.05 release.

Thanks!
--Tim
Comment 39 Tim McMullan 2022-04-13 05:58:21 MDT
The remaining patches have landed!

https://github.com/SchedMD/slurm/compare/9b29afc11b..18b93bd2d3

This should allow the backup slurmdbd to detect that the primary has failed when the primary node crashes.  There are some new options in the slurmdbd.conf for tuning keepalive so you can adjust this for your system.
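
As background, the keepalive being tuned here is the standard TCP mechanism; take the exact slurmdbd.conf option names from the man page for your release. The kernel-wide Linux defaults that per-connection settings override can be inspected directly (a generic sketch, not Slurm-specific):

```shell
# Kernel-wide TCP keepalive defaults (per-connection options override these):
cat /proc/sys/net/ipv4/tcp_keepalive_time    # seconds of idle before the first probe
cat /proc/sys/net/ipv4/tcp_keepalive_intvl   # seconds between unanswered probes
cat /proc/sys/net/ipv4/tcp_keepalive_probes  # probes sent before the peer is declared dead
```

Lowering these values per connection is what lets the backup notice a hard-crashed primary in seconds rather than waiting hours on a silent socket.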

Let me know if you have any questions!

Thanks,
--Tim
Comment 40 Tim McMullan 2022-04-25 07:23:04 MDT
Marking this as resolved for now since all the patches landed. Let us know if you have any other issues!

Thanks!
--Tim