Ticket 1545

Summary: slurmdbd segfaults on null reason in as_mysql_node_down
Product: Slurm
Reporter: Ryan Cox <ryan_cox>
Component: slurmdbd
Assignee: Brian Christiansen <brian>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 1 - System not usable
Priority: ---
CC: brian, da
Version: 14.11.4
Hardware: Linux
OS: Linux
Site: BYU - Brigham Young University
Attachments: gdb
scontrol show node m8-20-[1-16]
slurm.conf

Description Ryan Cox 2015-03-19 02:04:23 MDT
Created attachment 1741 [details]
gdb

We brought some new nodes online yesterday that had been marked in our Slurm config as state=down. We set them to resume and things were good. We ran jobs on all but one of them: m8-20-1 stayed in state=maint, and we're not sure why. I planned to diagnose it this morning.

Earlier this morning our slurmdbd instances (primary and backup) began crashing when as_mysql_node_down was called with a null reason, specifically for m8-20-1.
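For readers unfamiliar with this class of crash, here is a minimal sketch of the general pattern: a string field that may legitimately be NULL reaches code that assumes it is a valid string. This is not Slurm's code and not the actual fix. The struct `node_event` and the functions `reason_or_default` and `format_node_down` are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a node-down record; NOT Slurm's real struct. */
struct node_event {
    const char *name;
    const char *reason;   /* may be NULL if no reason was ever set */
};

/* Substitute a placeholder so callers never handle a NULL string. */
const char *reason_or_default(const char *reason)
{
    return reason ? reason : "(none)";
}

/*
 * Format an illustrative record into buf.
 * Returns the number of characters written, or -1 on bad input or truncation.
 */
int format_node_down(const struct node_event *ev, char *buf, size_t len)
{
    /* A missing buffer is a programming error, not a data condition. */
    assert(buf != NULL && len > 0);

    if (ev == NULL || ev->name == NULL)
        return -1;

    /* Passing a NULL pointer for %s is undefined behavior, so guard it. */
    int n = snprintf(buf, len, "node='%s' reason='%s'",
                     ev->name, reason_or_default(ev->reason));

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Called with a record whose `reason` is NULL, `format_node_down` writes the placeholder instead of dereferencing the NULL pointer. The choice of placeholder text is arbitrary here.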

I tried a few things to fix it including marking the 16 new nodes as state=drain with a reason.  That didn't work, so I restarted slurmctld then tried again.

slurmdbd will only stay up for tens of seconds at most before segfaulting. I may try removing m8-20-[1-16] from the config if I can't get anything else to work. What's weird is that some of the other nodes also have a null reason but don't trigger the crash.
Comment 1 Ryan Cox 2015-03-19 02:06:21 MDT
Created attachment 1742 [details]
scontrol show node m8-20-[1-16]

One other thing to note is that there is a reservation with flags=maint on these nodes.

ReservationName=m8-chassis20 StartTime=2015-02-12T17:08:34 EndTime=2016-02-12T17:08:34 Duration=365-00:00:00 Nodes=m8-20-[1-16] NodeCnt=16 CoreCnt=384 Features=(null) PartitionName=(null) Flags=MAINT,OVERLAP,IGNORE_JOBS,SPEC_NODES Users=(null) Accounts=staff Licenses=(null) State=ACTIVE
Comment 2 Danny Auble 2015-03-19 02:08:31 MDT
This has already been fixed. I don't have the commit readily available right now, but it should be easy to find.

Comment 3 Ryan Cox 2015-03-19 02:09:23 MDT
Created attachment 1743 [details]
slurm.conf
Comment 4 Ryan Cox 2015-03-19 02:11:12 MDT
Somehow I missed your message. Let me know when you find it.  Or we can just move up to the latest commit you think is stable.
Comment 5 Danny Auble 2015-03-19 02:13:17 MDT
https://github.com/SchedMD/slurm/commit/2e2d924e3d042393230e006e05f506947e74dd9c

Comment 6 Ryan Cox 2015-03-19 02:15:06 MDT
Would you recommend applying just that one commit or everything between 14.11.4 and that commit?  We did have to apply one other patch as well.
Comment 7 Danny Auble 2015-03-19 02:17:58 MDT
Either way you should be good. Applying just the one patch is safest, but taking everything up to that commit should be relatively safe as well.

Comment 8 Ryan Cox 2015-03-19 02:25:07 MDT
I decided to just apply the patch.  slurmdbd is up again and working.  Thanks.
Comment 9 Brian Christiansen 2015-03-19 03:43:22 MDT
Info given.