| Summary: | slurmdbd segfaults on null reason in as_mysql_node_down | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Ryan Cox <ryan_cox> |
| Component: | slurmdbd | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 1 - System not usable | ||
| Priority: | --- | CC: | brian, da |
| Version: | 14.11.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | BYU - Brigham Young University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | gdb, scontrol show node m8-20-[1-16], slurm.conf | | |
Created attachment 1742 [details]
scontrol show node m8-20-[1-16]
One other thing to note is that there is a reservation with flags=maint covering these nodes.
ReservationName=m8-chassis20 StartTime=2015-02-12T17:08:34 EndTime=2016-02-12T17:08:34 Duration=365-00:00:00 Nodes=m8-20-[1-16] NodeCnt=16 CoreCnt=384 Features=(null) PartitionName=(null) Flags=MAINT,OVERLAP,IGNORE_JOBS,SPEC_NODES Users=(null) Accounts=staff Licenses=(null) State=ACTIVE
This has already been fixed earlier. I don't have the commit readily available right now, but it should be easy to find.

Created attachment 1743 [details]
slurm.conf
Somehow I missed your message. Let me know when you find it. Or we can just move up to the latest commit you think is stable.

https://github.com/SchedMD/slurm/commit/2e2d924e3d042393230e006e05f506947e74dd9c

Would you recommend applying just that one commit or everything between 14.11.4 and that commit? We did have to apply one other patch as well.

Either way you should be good. The safest is just the one patch, but the whole thing should be relatively safe as well.

I decided to just apply the patch. slurmdbd is up again and working. Thanks.

Info given.
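For reference, applying just the single fix rather than moving to the branch tip is typically done with a cherry-pick onto the release tag. This is a hedged sketch of that workflow; the tag name `slurm-14-11-4-1` is an assumption about SchedMD's tagging scheme and should be checked with `git tag -l 'slurm-14-11*'`:

```shell
# Sketch: apply only the referenced commit on top of the 14.11.4 release.
# The tag name below is assumed, not confirmed in this bug report.
git clone https://github.com/SchedMD/slurm.git
cd slurm
git checkout -b local-14.11.4-patched slurm-14-11-4-1
git cherry-pick 2e2d924e3d042393230e006e05f506947e74dd9c
```

Cherry-picking keeps the running tree identical to the tested 14.11.4 release apart from the one fix, which matches the "safest is just the one patch" advice above.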
Created attachment 1741 [details]
gdb

We brought some new nodes online yesterday that had been marked in our Slurm config as state=down. We set them to resume and things were good. We ran jobs on all but one of them. m8-20-1 stayed in state=maint, but we're not sure why; I planned to diagnose it this morning.

Earlier this morning our slurmdbds (primary and backup) began crashing when calling as_mysql_node_down with a null reason, specifically for m8-20-1. I tried a few things to fix it, including marking the 16 new nodes as state=drain with a reason. That didn't work, so I restarted slurmctld and tried again. slurmdbd will only stay up for tens of seconds at most before segfaulting. I may try removing m8-20-[1-16] from the config if I can't get anything else to work. What's weird is that some of the other nodes have a null reason but don't crash.