Ticket 9827

Summary: occasional slurmctld crash with sigbus
Product: Slurm Reporter: Lloyd Brown <lloyd_brown>
Component: slurmctld    Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: bart, cinek
Version: 20.02.4   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=10492
Site: BYU - Brigham Young University
Attachments: output of gdb backtrace
output of requested "p $_siginfo" and "ptype $_siginfo"

Description Lloyd Brown 2020-09-15 14:09:08 MDT
Created attachment 15904 [details]
output of gdb backtrace

This is very low priority, but it seemed like something you should know about.

We occasionally have slurmctld crash on our primary scheduler.  Since we have a monitoring tool that restarts it automatically, and it happens relatively infrequently, we haven't worried much about it.  But occasionally, we notice that we have a core file, and go take a look.

Yesterday (Sept 14) at around 1:46 PM, this occurred.  I gathered a backtrace, using the following syntax, which I'll attach here:

gdb `which slurmctld` core -ex 'set print pretty on' -ex 'info thread' -ex 'thread apply all bt full' --batch

This particular time, it's a SIGBUS crash, though I'm certainly not qualified to diagnose it beyond that.  I've preserved the core file for now, so we can extract more information if you need.

This is a local build of 20.02.5 (as of commit 3343dc235dd9c8744698a971a08e0af28601eb72), running on Debian 9, compiled as follows:

export CFLAGS="-O3 -g3 -ggdb3"
./configure --disable-debug --sysconfdir=/etc/slurm && \
make -j3



Lloyd
Comment 1 Lloyd Brown 2020-09-15 14:29:46 MDT
I misspoke earlier.  This is v20.02.4, not 20.02.5.  I've updated the bug to reflect this.
Comment 2 Marcin Stolarek 2020-09-16 03:53:39 MDT
Could you please load the core into gdb and execute:
>(gdb) p $_siginfo
>(gdb) ptype $_siginfo

Is it possible that there was a change in the slurm binaries stored on NFS, or in files like passwd/shadow referred to by nss_compat?

cheers,
Marcin
Comment 3 Lloyd Brown 2020-09-16 10:30:44 MDT
Created attachment 15911 [details]
output of requested "p $_siginfo" and "ptype $_siginfo"
Comment 4 Lloyd Brown 2020-09-16 10:32:24 MDT
I've attached the requested output here.  Let me know if you need additional information.

As far as I know, there should not have been any changes to the slurm binaries without a restart.  As you know from Bug 9726, we did upgrade to 20.02.4, but that was days before the crash, and we restarted both slurmctld and slurmdbd after the upgrade.  Also, the storage location is local to the node, not NFS.
Comment 7 Marcin Stolarek 2020-09-22 08:56:01 MDT
Could you please build and execute the following test program on the host:
>#include <stdio.h>
>#include <unistd.h>
>
>int main(void) {
>        printf("Buffer size is:%d", _SC_GETGR_R_SIZE_MAX);
>        return 0;
>}
It should compile with just `gcc /tmp/testProgram.c -o /tmp/output`.


The gdb output shows that SIGBUS was received with si_code=2, which means BUS_ADRERR - access to a nonexistent area of a memory object[1].

This error most likely (in general) comes from a failure accessing an mmapped file - that's why I asked about NFS, which is probably not the case based on your reply. Looking at the backtrace, the active thread was outside of Slurm code, executing something like:
#getent passwd vmosquer
Is the user active? Do you store any user database file on NFS?

Probably not the case, but... there was a bug a few years ago (CentOS 6) where this si_code was set in all cases when huge pages were not disabled - what's your OS/kernel version? The output suggests that the user's shell is set to /bin/false. What is your nsswitch user information source for the user?

cheers,
Marcin
Comment 8 Lloyd Brown 2020-09-30 10:13:28 MDT
Marcin,

I apologize for the delay.  We got distracted by another systems issue.

root@sched1:/tmp# ./testProgram 
Buffer size is:69
root@sched1:/tmp#

As far as the rest goes, the slurm binary is definitely local to the host.  The only NFS involved in normal operation is the "StateSaveLocation" path.

Having said that, the user credentials (like vmosquer, who is an active user) are populated into the local /etc/{passwd,shadow,group} files via a script.  It's theoretically possible that it happened to be clobbering one of those files at the same time as the SIGBUS, I suppose.  Though it's not clear to me whether that scenario would even lead to a SIGBUS in the first place.
Comment 9 Marcin Stolarek 2020-10-01 08:10:38 MDT
Lloyd,

Looking further into the glibc nss_compat code, I think you hit a bug there. It looks like getpwnam_r (which should be thread-safe) makes use of fgets_unlocked (which is not thread-safe). I'll report this in the glibc Bugzilla and add a link to it here.

Would it be fine if I attach the backtrace you shared with us there?

cheers
Marcin
Comment 10 Marcin Stolarek 2020-10-07 04:09:50 MDT
Lloyd,

I've anonymized all real data in the thread 1 backtrace and attached it to the case in the glibc Bugzilla:
https://sourceware.org/bugzilla/show_bug.cgi?id=26713

I hope you don't mind. The attached information doesn't contain any personal info or hostnames/IP addresses - it's just the sequence of functions called in glibc.

cheers,
Marcin
Comment 11 Lloyd Brown 2020-10-07 10:37:07 MDT
Marcin,

I don't see any problem with sharing the information.  Sorry that I missed your previous message.  Too many other issues to deal with, I guess.

Lloyd
Comment 12 Marcin Stolarek 2020-10-07 22:26:11 MDT
Lloyd,


Do you agree that we can read Florian Weimer's (Red Hat) comment[1] as confirmation that the bug is actually outside of Slurm, and that we can close this bug report?

cheers,
Marcin

[1]https://sourceware.org/bugzilla/show_bug.cgi?id=26713#c1
Comment 13 Lloyd Brown 2020-10-12 08:50:00 MDT
Marcin,

I'm not sure I'm the best qualified to discuss the underlying bug.  But if you believe the problem exists outside of Slurm, and you've opened the appropriate upstream bug, then I'm certainly not going to contradict you.