Ticket 2451

Summary: slurmctld core dump
Product: Slurm
Reporter: Don Lipari <lipari1>
Component: Bluegene select plugin
Assignee: Danny Auble <da>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Version: 14.11.10
Hardware: IBM BlueGene
OS: Linux
Site: LLNL
Attachments: Full Back Trace of slurmctld core dump on rzuseq

Description Don Lipari 2016-02-16 02:04:56 MST
Created attachment 2735 [details]
Full Back Trace of slurmctld core dump on rzuseq

Appears to be a sporadic occurrence.  The system was restarted and stayed up.  In a nutshell:

(gdb) where
#0  0x0000040000c2e194 in ._ZN7log4cxx7helpers10ObjectPtrTINS_5LevelEEC2ERKS3_ () from /bgsys/drivers/ppcfloor/extlib/lib/liblog4cxx.so.10
#1  0x0000040000c44c28 in ._ZN7log4cxx3spi12LoggingEventC1ERKSsRKNS_7helpers10ObjectPtrTINS_5LevelEEES3_RKNS0_12LocationInfoE ()
   from /bgsys/drivers/ppcfloor/extlib/lib/liblog4cxx.so.10
#2  0x0000040000c3e474 in ._ZNK7log4cxx6Logger9forcedLogERKNS_7helpers10ObjectPtrTINS_5LevelEEERKSsRKNS_3spi12LocationInfoE ()
   from /bgsys/drivers/ppcfloor/extlib/lib/liblog4cxx.so.10
#3  0x0000040000769110 in boost::assertion_failed (expr=0x400008c1a18 "!pthread_mutex_lock(&m)", function=<value optimized out>, 
    file=0x400008c19e8 "/usr/include/boost/thread/pthread/mutex.hpp", line=<value optimized out>) at utility.cc:40
#4  0x00000400008923f0 in lock (filter=..., sort=..., user="") at /usr/include/boost/thread/pthread/mutex.hpp:50
#5  lock_guard (filter=..., sort=..., user="") at /usr/include/boost/thread/locks.hpp:194
#6  instance (filter=..., sort=..., user="") at /bgsys/drivers/V1R2M3/ppc64/utility/include/Singleton.h:181
#7  Instance (filter=..., sort=..., user="") at /bgsys/drivers/V1R2M3/ppc64/utility/include/Singleton.h:195
#8  bgsched::core::getJobs (filter=..., sort=..., user="") at core/core.cc:1062
#9  0x00000400004007f0 in _block_wait_for_jobs (bg_block_id=0x1079e3f0 "RMP11Ja110453417", job_ptr=0x10772ab0) at bridge_linker.cc:344
#10 0x0000040000400dd4 in _remove_jobs_on_block_and_reset (block_id=0x1079e3f0 "RMP11Ja110453417", job_ptr=0x10772ab0) at bridge_linker.cc:389
#11 0x000004000040682c in bridge_block_post_job (bg_block_id=0x1079e3f0 "RMP11Ja110453417", job_ptr=0x10772ab0) at bridge_linker.cc:1254
#12 0x00000400003d8994 in _block_agent (args=0x1078e3f0) at bg_job_run.c:595
#13 0x00000400000dc5dc in .start_thread () from /lib64/libpthread.so.0
#14 0x000004000023a8ec in .__clone () from /lib64/libc.so.6
Comment 1 Danny Auble 2016-02-16 03:21:03 MST
Hey Don, could you do a

thread apply all bt full

The call it crashed in is just getting jobs from the IBM system.  I am surprised anything would happen there as it is a fairly static call.  Based on the backtrace it appears the pthread_mutex_lock had an issue for some reason.  Perhaps IBM could look and see what is supposed to be happening there.  I am interested to see the other threads, though.  You say this happened more than once?  Is it happening often?
Comment 2 Don Lipari 2016-02-16 03:31:25 MST
(In reply to Danny Auble from comment #1)
> Hey Don, could you do a
> 
> thread apply all bt full

It's attached.
 
> The call it crashed in is just getting jobs from the IBM system.  I am
> surprised anything would happen there as it is a fairly static call.  Based
> on the backtrace it appears the pthread_mutex_lock had an issue for some
> reason.  Perhaps IBM could look and see what is supposed to be happening
> there.  I am interested to see the other threads, though.  You say this
> happened more than once?  Is it happening often?

No, actually it's been rather stable.  I've asked our IBM consultants to examine the system logs to get a more complete picture.  The slurmctld could have been just a casualty of a deeper system problem and not the perpetrator.

Let me know if you see anything more after you look through the back trace.
Comment 3 Danny Auble 2016-02-16 03:55:58 MST
Thanks Don, it appears all the threads were waiting for the same block (RMP11Ja110453417) to be free of jobs.  This is normal; I'm guessing one or more jobs were waiting for the block to free so the resources could be used elsewhere.  Based on the number of threads, it looks like this block was running quite a few jobs.  Each thread carries the jobid it was waiting on to finish, if that is relevant.  2869908 was the unlucky one that hit the issue.
Here is the complete list of jobs; many of them had multiple threads waiting on the same job, all of which is normal.
        job_id = 2869882
        job_id = 2869883
        job_id = 2869884
        job_id = 2869886
        job_id = 2869887
        job_id = 2869888
        job_id = 2869892
        job_id = 2869893
        job_id = 2869894
        job_id = 2869895
        job_id = 2869896
        job_id = 2869898
        job_id = 2869899
        job_id = 2869903
        job_id = 2869904
        job_id = 2869905
        job_id = 2869906
        job_id = 2869907
        job_id = 2869908
        job_id = 2869909
        job_id = 2869910
        job_id = 2869911
        job_id = 2869913
        job_id = 2869914
        job_id = 2869915


I don't know how these are related, but I see two memory errors, seemingly from the IBM lib...

RuntimeError: Cannot access memory at address 0x18
RuntimeError: Cannot access memory at address 0xffe8


In any case, I'm guessing the block's jobs eventually ended, the block then freed, and things went along their merry way.

I'm guessing you may be correct that something else was happening on the system when this hit.  I am not sure what else we can do on the matter; from the logs it appears Slurm was doing what you would expect.
Comment 4 Don Lipari 2016-02-16 04:00:44 MST
Ok, thanks for the analysis, Danny.  The indications are that something went bad on the sn.  I'll append more comments once I learn the details.
Comment 6 Don Lipari 2016-02-16 11:49:58 MST
We're pretty sure the problem was a result of a node crash.  No further action is required.  Thank you!
Comment 7 Don Lipari 2016-02-16 11:51:19 MST
Resolved.