| Summary: | slurmctld core dump | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | Bluegene select plugin | Assignee: | Danny Auble <da> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 14.11.10 | | |
| Hardware: | IBM BlueGene | | |
| OS: | Linux | | |
| Site: | LLNL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Full Back Trace of slurmctld core dump on rzuseq | | |
Hey Don, could you do a

thread apply all bt full

The call it crashed in is just getting jobs from the IBM system. I am surprised anything would happen there, as it is a fairly static call. Based on the backtrace it appears the pthread_mutex_lock had an issue for some reason. Perhaps IBM could look and see what is supposed to be happening there. I am interested to see the other threads, though. You say this happened more than once? Is it happening often?

(In reply to Danny Auble from comment #1)
> Hey Don, could you do a
>
> thread apply all bt full

It's attached.

> The call it crashed in is just getting jobs from the IBM system. I am
> surprised anything would happen there as it is a fairly static call. Based
> on the backtrace it appears the pthread_mutex_lock had an issue for some
> reason. Perhaps IBM could look and see what is supposed to be happening
> there. I am interested to see the other threads though. You say this
> happened more than once? Is it happening often?

No, actually it's been rather stable. I've asked our IBM consultants to examine the system logs to get a more complete picture. The slurmctld could have been just a casualty of a deeper system problem and not the perpetrator. Let me know if you see anything more after you look through the back trace.

Thanks Don. It appears all the threads were waiting for the same block (RMP11Ja110453417) to be free of jobs. This is normal; I'm guessing there was a job, or multiple jobs, waiting for the block to free so the resources could be used elsewhere. Based on the number of threads, it looks like this block was running quite a few jobs. Each thread has the job ID it was waiting on, if that is relevant. 2869908 was the unlucky one that hit the issue.
Here is the complete list of jobs, many of them with multiple threads waiting on the same job, all of which is normal.
job_id = 2869882
job_id = 2869883
job_id = 2869884
job_id = 2869886
job_id = 2869887
job_id = 2869888
job_id = 2869892
job_id = 2869893
job_id = 2869894
job_id = 2869895
job_id = 2869896
job_id = 2869898
job_id = 2869899
job_id = 2869903
job_id = 2869904
job_id = 2869905
job_id = 2869906
job_id = 2869907
job_id = 2869908
job_id = 2869909
job_id = 2869910
job_id = 2869911
job_id = 2869913
job_id = 2869914
job_id = 2869915
I don't know how these are related, but I see two memory-access errors, seemingly from the IBM lib...
RuntimeError: Cannot access memory at address 0x18
RuntimeError: Cannot access memory at address 0xffe8
In any case I'm guessing that block's jobs eventually ended and then the block freed and things went along their merry way.
I'm guessing you may be correct that something else was happening on the system when this hit. I am not sure what else we can do on the matter. From the logs it appears Slurm was doing what you would expect.
Ok, thanks for the analysis, Danny. The indications are that something went bad on the sn. I'll append more comments once I learn the details.

We're pretty sure the problem was a result of a node crash. No further action is required. Thank you!

Resolved.
Created attachment 2735 [details] Full Back Trace of slurmctld core dump on rzuseq

Appears to be a sporadic occurrence. The system was restarted and stayed up. In a nutshell:

(gdb) where
#0  0x0000040000c2e194 in ._ZN7log4cxx7helpers10ObjectPtrTINS_5LevelEEC2ERKS3_ () from /bgsys/drivers/ppcfloor/extlib/lib/liblog4cxx.so.10
#1  0x0000040000c44c28 in ._ZN7log4cxx3spi12LoggingEventC1ERKSsRKNS_7helpers10ObjectPtrTINS_5LevelEEES3_RKNS0_12LocationInfoE () from /bgsys/drivers/ppcfloor/extlib/lib/liblog4cxx.so.10
#2  0x0000040000c3e474 in ._ZNK7log4cxx6Logger9forcedLogERKNS_7helpers10ObjectPtrTINS_5LevelEEERKSsRKNS_3spi12LocationInfoE () from /bgsys/drivers/ppcfloor/extlib/lib/liblog4cxx.so.10
#3  0x0000040000769110 in boost::assertion_failed (expr=0x400008c1a18 "!pthread_mutex_lock(&m)", function=<value optimized out>, file=0x400008c19e8 "/usr/include/boost/thread/pthread/mutex.hpp", line=<value optimized out>) at utility.cc:40
#4  0x00000400008923f0 in lock (filter=..., sort=..., user="") at /usr/include/boost/thread/pthread/mutex.hpp:50
#5  lock_guard (filter=..., sort=..., user="") at /usr/include/boost/thread/locks.hpp:194
#6  instance (filter=..., sort=..., user="") at /bgsys/drivers/V1R2M3/ppc64/utility/include/Singleton.h:181
#7  Instance (filter=..., sort=..., user="") at /bgsys/drivers/V1R2M3/ppc64/utility/include/Singleton.h:195
#8  bgsched::core::getJobs (filter=..., sort=..., user="") at core/core.cc:1062
#9  0x00000400004007f0 in _block_wait_for_jobs (bg_block_id=0x1079e3f0 "RMP11Ja110453417", job_ptr=0x10772ab0) at bridge_linker.cc:344
#10 0x0000040000400dd4 in _remove_jobs_on_block_and_reset (block_id=0x1079e3f0 "RMP11Ja110453417", job_ptr=0x10772ab0) at bridge_linker.cc:389
#11 0x000004000040682c in bridge_block_post_job (bg_block_id=0x1079e3f0 "RMP11Ja110453417", job_ptr=0x10772ab0) at bridge_linker.cc:1254
#12 0x00000400003d8994 in _block_agent (args=0x1078e3f0) at bg_job_run.c:595
#13 0x00000400000dc5dc in .start_thread () from /lib64/libpthread.so.0
#14 0x000004000023a8ec in .__clone () from /lib64/libc.so.6