Ticket 13032

Summary: HA failures with slurmctld
Product: Slurm Reporter: Google Cloud Team <slurm-gcp>
Component: Configuration Assignee: Tim McMullan <mcmullan>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: carlosboneti, dgouju, lyeager, nick
Version: 20.11.7   
Hardware: Linux   
OS: Linux   
Site: Google Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 22.05pre1 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: Log files

Description Google Cloud Team 2021-12-14 10:57:45 MST
Created attachment 22672 [details]
Log files

Cluster description:
There are two servers, A and B. Both have slurmctld and slurmdbd running.
A is configured as the primary controller, and B is configured as the backup controller for both daemons.

They mount an NFS filesystem from a third server, where they share the home directories, the configuration directory, the munge key, and the Slurm state.
Both slurmdbd daemons connect to an external SQL server.


Bug 1:
On A, killall -9 slurmctld slurmdbd
Wait for slurmctld on B to take control
On A, systemctl start slurmdbd ; sleep 1 ; systemctl start slurmctld
Wait for slurmctld on B to give control back to A
On A, killall -9 slurmctld slurmdbd
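
The cycle above can be sketched as a script (a hypothetical reproducer, to be run on node A; the `scontrol ping` grep patterns and the 5-second poll interval are assumptions to adjust for your cluster):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the bug 1 reproduction cycle, run on node A.
# Assumes `scontrol ping` output mentions "primary"/"backup" and "UP";
# adjust the patterns below to match your Slurm version's output.

# Poll `scontrol ping` until its output matches the given pattern.
wait_for() {
    local pattern="$1"
    until scontrol ping 2>/dev/null | grep -q "$pattern"; do
        sleep 5
    done
}

reproduce_bug1() {
    killall -9 slurmctld slurmdbd     # first failure on A
    wait_for "backup.*UP"             # B takes control
    systemctl start slurmdbd
    sleep 1
    systemctl start slurmctld
    wait_for "primary.*UP"            # B gives control back to A
    killall -9 slurmctld slurmdbd     # second failure: slurmdbd on B crashes
}
```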


Bug 2:
On A: echo 1 | sudo tee /proc/sys/kernel/sysrq ; echo c | sudo tee /proc/sysrq-trigger
slurmdbd on B never takes control. slurmctld on B is not able to connect to slurmdbd.

Detailed logs attached.
Comment 3 Damien Gouju 2021-12-15 02:22:44 MST
To clarify bug 1: at the end of the process, slurmdbd on B first takes over, but it then crashes when slurmctld on B takes over.
Comment 4 Tim McMullan 2021-12-15 06:48:05 MST
(In reply to Damien Gouju from comment #3)
> To clarify bug 1: in the end of the process, slurmdbd B first takes over but
> then crashes when slurmctld B takes over.

Thank you for the extra clarity!  Does the slurmdbd generate a core file when it crashes in this case?

I'm trying to reproduce locally, but if there is a core file it might be good to get a little more info on the crash that you saw specifically.

Thanks!
--Tim
Comment 5 Damien Gouju 2021-12-15 08:16:15 MST
In the attachment:
- *.kill-9.log refer to bug 1
- *.panic.log refer to bug 2

We are looking at creating a crash dump for bug 1.
Comment 6 Tim McMullan 2021-12-15 08:53:01 MST
(In reply to Damien Gouju from comment #5)
> In the attachement:
> - *.kill-9.log refer to bug 1
> - *.panic.log refer to bug 2

Thanks, I did notice those.  Unfortunately they haven't provided much insight yet :/

> We are looking at creating crashdump for bug 1.

Thank you!

As a status update from me, I have replicated bug 2 from this ticket and am looking into it.
Comment 8 Damien Gouju 2021-12-17 09:36:02 MST
Hi Tim,
Regarding bug 1, no core was found in /, /var/tmp, or /var/log/slurm (the locations listed at https://slurm.schedmd.com/slurmdbd.html#SECTION_CORE-FILE-LOCATION ).
Thank you!
Best regards,
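
As a generic Linux aside (standard kernel interfaces, not Slurm-specific), the absence of a core file in those directories can often be explained by the core pattern or the core size limit:

```shell
# Where (and how) the kernel writes core files; a leading "|" means cores are
# piped to a handler such as systemd-coredump instead of landing on disk.
cat /proc/sys/kernel/core_pattern
# Per-process core size limit as seen by this shell; 0 suppresses core files.
ulimit -c
```

When core_pattern pipes to systemd-coredump, `coredumpctl list slurmdbd` would be the place to look for captured dumps.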
Comment 9 Tim McMullan 2021-12-17 09:47:29 MST
Thanks for the update!

I've made some progress on the second issue but I still don't have it quite right.  My first try at replicating the first issue didn't succeed, but I'm going to try a couple more things to see if I can get a local reproducer for it!

Thanks again!
--Tim
Comment 10 Damien Gouju 2021-12-17 10:00:12 MST
I corrected the reported version to 20.11.7
Comment 11 Carlos Boneti 2021-12-21 11:04:06 MST
Bug #1 seems to be resolved if we restart slurmdbd manually.  It seems its systemd entry is not configured to restart on failure, but only to start on boot.

We wonder whether it would not be simpler to just change the systemd config for slurmdbd so that it restarts on failure.
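
As a sketch, the stopgap suggested here would be a systemd drop-in override (assuming the unit is named slurmdbd.service; created with `systemctl edit slurmdbd` or placed by hand):

```ini
# /etc/systemd/system/slurmdbd.service.d/override.conf
# Stopgap only: restart slurmdbd automatically if it crashes.
[Service]
Restart=on-failure
RestartSec=5
```

followed by `systemctl daemon-reload`.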
Comment 12 Tim McMullan 2021-12-21 11:44:00 MST
(In reply to Carlos Boneti from comment #11)
> Bug #1 seems to be resolved if we restart slurmdbd manually.  It seems its
> systemd entry is not configured to restart on failure, but just start on
> boot. 
> 
> We wonder if it would not be simple to just change systemd config for
> slurmdb so it would restart on failure.

That is something you could do to handle it in the meantime, but it really should still be able to handle more than one failover event.  Right now I don't think that's the long-term fix from my perspective, but I don't see anything wrong with using it for now!
Comment 13 Tim McMullan 2021-12-21 13:20:23 MST
I now have the first bug reproducing locally as well, and I was able to grab a core file.  I'm checking it out to see where that takes me!

Thanks,
--Tim
Comment 17 Tim McMullan 2022-01-06 08:37:07 MST
Hi and sorry about the delay in an update here!

For the slurmdbd crash, I've identified what's causing it.  What appeared to be the right fix does avoid the crash, but it leaves the slurmdbd in an unusual state, so there is more to it than I initially suspected.  I'm still investigating that state to see what the right way out of it is.

For the issue of the slurmdbd not taking over when the whole primary node crashes, it's simply not detecting that the socket is dead.  It's a strange problem, but I have a couple of ideas for fixing it, one of which I am testing now; results so far show that it is working.

Thanks, and I'll keep you updated as I make more progress with this!
--Tim
Comment 18 Tim McMullan 2022-01-18 08:20:14 MST
Hey, I just wanted to give you an update on where we are with this.

The first issue still requires investigation to determine the correct solution.  There are some remaining quirks in the slurmdbd HA that I need to work out before I can settle on the right fix.

I've got a chat today regarding the fix for the second issue; implementing it may be somewhat involved.

Thanks!
--Tim
Comment 19 Damien Gouju 2022-01-18 08:26:23 MST
Thanks a lot Tim, this is highly appreciated!
Comment 26 Tim McMullan 2022-01-24 14:48:17 MST
Just another update for you:

We just landed a patch (https://github.com/SchedMD/slurm/commit/979cc92) that should fix the slurmdbd crashing with multiple failover/failback events.  I expect that it will be available in 21.08.6+

The approach for fixing the second issue has been settled on and I'm working on getting a patch together for that.

Thanks!
--Tim
Comment 31 Tim McMullan 2022-02-08 06:53:44 MST
Hey, I just wanted to give you guys another update here.  We have a working solution for the second issue (primary slurmdbd crashing but backup not taking over), I'm just working on getting the patch finalized and reviewed.

Thanks!
--Tim
Comment 35 Damien Gouju 2022-04-06 03:34:55 MDT
Hi Tim,
Do you have an update on the patch for the second issue?
Thanks a lot!
Best regards,
Comment 36 Tim McMullan 2022-04-08 07:46:32 MDT
(In reply to Damien Gouju from comment #35)
> Hi Tim,
> Do you have an update on the patch for the second issue?
> Thanks a lot!
> Best regards,

Sorry about the lengthy wait here!  We've gone through a few revisions of this patch, but what looks like a final version should be reviewed soon.  It involved adding/changing some configuration options so we expect it will be targeted at the 22.05 release.

Thanks!
--Tim
Comment 39 Tim McMullan 2022-04-13 05:58:21 MDT
The remaining patches have landed!

https://github.com/SchedMD/slurm/compare/9b29afc11b..18b93bd2d3

This should allow the backup slurmdbd to detect that the primary has failed when the primary node crashes.  There are some new options in the slurmdbd.conf for tuning keepalive so you can adjust this for your system.
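
As background, the keepalive being tuned here is the standard TCP mechanism; take the exact slurmdbd.conf option names from the man page for your release. The kernel-wide Linux defaults that per-connection settings override can be inspected directly (a generic sketch, not Slurm-specific):

```shell
# Kernel-wide TCP keepalive defaults (per-connection options override these):
cat /proc/sys/net/ipv4/tcp_keepalive_time    # seconds of idle before the first probe
cat /proc/sys/net/ipv4/tcp_keepalive_intvl   # seconds between unanswered probes
cat /proc/sys/net/ipv4/tcp_keepalive_probes  # probes sent before the peer is declared dead
```

Lowering these values per connection is what lets the backup notice a hard-crashed primary in seconds rather than waiting hours on a silent socket.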

Let me know if you have any questions!

Thanks,
--Tim
Comment 40 Tim McMullan 2022-04-25 07:23:04 MDT
Marking this as resolved for now since all the patches landed. Let us know if you have any other issues!

Thanks!
--Tim