| Summary: | slurmdbd segfaults on null reason in as_mysql_node_down | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Ryan Cox <ryan_cox> |
| Component: | slurmdbd | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 1 - System not usable | ||
| Priority: | --- | CC: | brian, da |
| Version: | 14.11.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | BYU - Brigham Young University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | gdb, scontrol show node m8-20-[1-16], slurm.conf | | |
Created attachment 1742 [details]
scontrol show node m8-20-[1-16]
One other thing to note is that there is a reservation with flags=maint covering these nodes.
ReservationName=m8-chassis20 StartTime=2015-02-12T17:08:34 EndTime=2016-02-12T17:08:34 Duration=365-00:00:00 Nodes=m8-20-[1-16] NodeCnt=16 CoreCnt=384 Features=(null) PartitionName=(null) Flags=MAINT,OVERLAP,IGNORE_JOBS,SPEC_NODES Users=(null) Accounts=staff Licenses=(null) State=ACTIVE
This has already been fixed earlier. I don't have the commit readily available right now, but it should be easy to find.

Created attachment 1743 [details]
slurm.conf
Somehow I missed your message. Let me know when you find it. Or we can just move up to the latest commit you think is stable.

https://github.com/SchedMD/slurm/commit/2e2d924e3d042393230e006e05f506947e74dd9c

Would you recommend applying just that one commit or everything between 14.11.4 and that commit? We did have to apply one other patch as well.

Either way you should be good. The safest is just the one patch, but the whole thing should be relatively safe as well.

I decided to just apply the patch. slurmdbd is up again and working. Thanks.

Info given.
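For reference, applying just the single fix rather than moving to the branch tip is typically done with a cherry-pick onto the release tag. This is a hedged sketch of that workflow; the tag name `slurm-14-11-4-1` is an assumption about SchedMD's tagging scheme and should be checked with `git tag -l 'slurm-14-11*'`:

```shell
# Sketch: apply only the referenced commit on top of the 14.11.4 release.
# The tag name below is assumed, not confirmed in this bug report.
git clone https://github.com/SchedMD/slurm.git
cd slurm
git checkout -b local-14.11.4-patched slurm-14-11-4-1
git cherry-pick 2e2d924e3d042393230e006e05f506947e74dd9c
```

Cherry-picking keeps the running tree identical to the tested 14.11.4 release apart from the one fix, which matches the "safest is just the one patch" advice above.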
Created attachment 1741 [details]
gdb

We brought some new nodes online yesterday that had been marked in our Slurm config as state=down. We set them to resume and things were good. We ran jobs on all but one of them. m8-20-1 stayed in state=maint, but we're not sure why; I planned to diagnose it this morning.

Earlier this morning our slurmdbds (primary and backup) began crashing when calling as_mysql_node_down with a null reason, specifically for m8-20-1. I tried a few things to fix it, including marking the 16 new nodes as state=drain with a reason. That didn't work, so I restarted slurmctld and tried again. slurmdbd will only stay up for tens of seconds at most before segfaulting. I may try removing m8-20-[1-16] from the config if I can't get anything else to work. What's weird is that some of the other nodes have a null reason but don't crash.