| Summary: | HA failures with slurmctld | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Google Cloud Team <slurm-gcp> |
| Component: | Configuration | Assignee: | Tim McMullan <mcmullan> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | carlosboneti, dgouju, lyeager, nick |
| Version: | 20.11.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 22.05pre1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Log files | | |
|
Description
Google Cloud Team
2021-12-14 10:57:45 MST
To clarify bug 1: at the end of the process, slurmdbd B first takes over, but then crashes when slurmctld B takes over.

(In reply to Damien Gouju from comment #3)
> To clarify bug 1: at the end of the process, slurmdbd B first takes over,
> but then crashes when slurmctld B takes over.

Thank you for the extra clarity! Does the slurmdbd generate a core file when it crashes in this case? I'm trying to reproduce locally, but if there is a core file it might be good to get a little more info on the crash that you saw specifically.

Thanks!
--Tim

In the attachment:
- *.kill-9.log refers to bug 1
- *.panic.log refers to bug 2

We are looking at creating a crash dump for bug 1.

(In reply to Damien Gouju from comment #5)
> In the attachment:
> - *.kill-9.log refers to bug 1
> - *.panic.log refers to bug 2

Thanks, I did notice those. Unfortunately they haven't provided much insight yet :/

> We are looking at creating a crash dump for bug 1.

Thank you!

As a status update from me, I have replicated bug 2 from this ticket and am looking into it.

Hi Tim,

Regarding bug 1, no core was found in /, /var/tmp, or /var/log/slurm according to https://slurm.schedmd.com/slurmdbd.html#SECTION_CORE-FILE-LOCATION . Thank you!

Best regards,

Thanks for the update! I've made some progress on the second issue, but I still don't have it quite right. My first try at replicating the first issue didn't succeed, but I'm going to try a couple more things to see if I can get a local reproducer for it!

Thanks again!
--Tim

I corrected the used version to 20.11.7.

Bug #1 seems to be resolved if we restart slurmdbd manually. It seems its systemd entry is not configured to restart on failure, but just to start on boot.

We wonder if it would not be simpler to just change the systemd config for slurmdbd so it would restart on failure.

(In reply to Carlos Boneti from comment #11)
> Bug #1 seems to be resolved if we restart slurmdbd manually.
> It seems its systemd entry is not configured to restart on failure, but
> just to start on boot.
>
> We wonder if it would not be simpler to just change the systemd config for
> slurmdbd so it would restart on failure.

That is something you could do to handle it in the meantime, but it really should still be able to handle more than one failover event. Right now I don't think that's the long-term fix from my perspective, but I don't see anything wrong with using it for now!

I now have the first bug reproducing locally as well, and I was able to grab a core file. I'm checking it out to see where that takes me!

Thanks,
--Tim

Hi, and sorry about the delay in an update here!

For the slurmdbd crash, I've identified what's causing it, but while what appeared to be the right fix does avoid the crash, it leaves the slurmdbd in an unusual state, so there is more to it than I initially suspected. I'm still investigating that state to see what the right way out of it is.

For the issue of the slurmdbd not taking over when the whole primary node crashes, it's just not detecting that the socket is dead. It's a strange problem, but I have a couple of ideas for fixing it, one of which I am testing now; results so far show that it is working.

Thanks, and I'll keep you updated as I make more progress with this!
--Tim

Hey, I just wanted to give you an update on where we are with this.

The first issue continues to require investigation as to the correct solution. There are some remaining quirks to the slurmdbd HA that I need to work out before I can settle on the right fix.

I've got a chat today regarding the fix for the second issue; implementing the fix may be somewhat involved.

Thanks!
--Tim

Thanks a lot Tim, this is highly appreciated!

Just another update for you: we just landed a patch (https://github.com/SchedMD/slurm/commit/979cc92) that should fix the slurmdbd crashing with multiple failover/failback events.
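The interim workaround discussed above (having systemd restart slurmdbd on failure rather than only starting it at boot) can be sketched as a drop-in override. This is an illustration, not part of the Slurm distribution: the unit name `slurmdbd.service` is what common Slurm packaging uses, so verify it matches your install before applying.

```ini
# /etc/systemd/system/slurmdbd.service.d/override.conf
[Service]
Restart=on-failure
RestartSec=10s
```

Apply it with `systemctl daemon-reload` followed by `systemctl restart slurmdbd`. As noted in the thread, this masks the failover bug rather than fixing it, but it keeps the accounting daemon available in the meantime.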
I expect that it will be available in 21.08.6+.

The approach for fixing the second issue has been settled on, and I'm working on getting a patch together for that.

Thanks!
--Tim

Hey, I just wanted to give you guys another update here. We have a working solution for the second issue (primary slurmdbd crashing but backup not taking over); I'm just working on getting the patch finalized and reviewed.

Thanks!
--Tim

Hi Tim,

Do you have an update on the patch for the second issue? Thanks a lot!

Best regards,

(In reply to Damien Gouju from comment #35)
> Hi Tim,
> Do you have an update on the patch for the second issue?
> Thanks a lot!
> Best regards,

Sorry about the lengthy wait here! We've gone through a few revisions of this patch, but what looks like a final version should be reviewed soon. It involved adding and changing some configuration options, so we expect it will be targeted at the 22.05 release.

Thanks!
--Tim

The remaining patches have landed!

https://github.com/SchedMD/slurm/compare/9b29afc11b..18b93bd2d3

This should allow the backup slurmdbd to detect that the primary has failed when the primary node crashes. There are some new options in slurmdbd.conf for tuning keepalive, so you can adjust this for your system.

Let me know if you have any questions!

Thanks,
--Tim

Marking this as resolved for now since all the patches landed. Let us know if you have any other issues!

Thanks!
--Tim
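The thread does not name the new slurmdbd.conf keepalive options, so rather than guess at them, here is a sketch of the mechanism the fix relies on: TCP keepalive, which is what lets one end of a connection notice a peer whose entire node died and therefore never sent a FIN or RST. A minimal Python illustration on Linux (the `TCP_KEEP*` constants are Linux-specific, hence the guards):

```python
import socket

def enable_keepalive(sock, idle=30, interval=10, probes=3):
    """Turn on TCP keepalive so a silently dead peer is detected.

    After `idle` seconds with no traffic, the kernel sends up to
    `probes` keepalive probes, `interval` seconds apart; if none are
    answered, the connection is reported dead to the application
    (reads fail with ETIMEDOUT) instead of hanging forever.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning knobs; not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock

if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("keepalive enabled:", bool(s.getsockopt(socket.SOL_SOCKET,
                                                  socket.SO_KEEPALIVE)))
    s.close()
```

With defaults like the Linux kernel's (two hours of idle time before the first probe), a backup daemon could wait a very long time before noticing a crashed primary; per-connection tuning like the above is presumably what the new slurmdbd.conf options expose. Consult the 22.05 slurmdbd.conf man page for the actual option names.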