| Summary: | Slurm core dump on Sequoia | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | Bluegene select plugin | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 1 - System not usable | | |
| Priority: | --- | | |
| Version: | 2.4.x | | |
| Hardware: | IBM BlueGene | | |
| OS: | Linux | | |
| Site: | LLNL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | sequoia | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | stack trace and last log entries; log from blocks going into error | | |
Send the logs from when blocks RMP10Jl073343488, RMP10Jl073343399, RMP10Jl073343274, RMP10Jl073343210, and RMP09Jl224539188 were created and went into error. The point where they go into error is probably the interesting part. The most interesting creation is the second group, which should all happen at the same time, and then the first group.

Created attachment 83 [details]
log from blocks going into error

The log you sent previously was good. I'll see what I can find. I am attaching the log here since it wasn't sent to the bug before.
slurmctld dumped core again:
(gdb) bt full
#0 0x000000808fa4e6fc in .pthread_mutex_lock () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x00000000100af304 in list_iterator_create (l=0x91) at list.c:702
e = <value optimized out>
i = 0x400a80dac60
__PRETTY_FUNCTION__ = "list_iterator_create"
#2 0x00000400003f2290 in ba_sub_block_in_record_clear (bg_record=0x401501a4c08, step_ptr=0x401500f3c38) at block_allocator.c:1262
bit = 148
itr = 0x0
ba_mp = 0x0
jobinfo = 0x401500ef218
tmp_char = 0x0
tmp_char2 = 0x0
tmp_char3 = 0x0
#3 0x00000400003cffcc in select_p_step_finish (step_ptr=0x401500f3c38) at select_bluegene.c:2097
bg_record = 0x401501a4c08
jobinfo = <value optimized out>
rc = 0
tmp_char = 0x0
#4 0x00000000100d06fc in select_g_step_finish (step_ptr=0x401500f3c38) at node_select.c:798
No locals.
#5 0x000000001008dfc0 in job_step_complete (job_id=<value optimized out>, step_id=<value optimized out>, uid=<value optimized out>, requeue=<value optimized out>, job_return_code=<value optimized out>) at step_mgr.c:526
job_ptr = 0x400f000fef8
step_ptr = <value optimized out>
error_code = <value optimized out>
#6 0x0000000010073a50 in _slurm_rpc_step_complete (msg=0x400f0132198) at proc_req.c:2413
error_code = 0
rc = <value optimized out>
rem = 0
step_rc = 0
tv1 = {tv_sec = 1341959305, tv_usec = 855611}
tv2 = {tv_sec = 4402073113368, tv_usec = 4398885495200}
tv_str = "\360\024\221\340\000\000\000\000\020#(8\000\000\004\000\062\001\342@"
req = 0x400f00c53a8
job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
uid = 8830
dump_job = false
#7 0x0000000010076658 in slurmctld_req (msg=0x400f0132198) at proc_req.c:369
No locals.
#8 0x000000001002e518 in _service_connection (arg=0x40110002c48) at controller.c:1017
conn = 0x40110002c48
msg = 0x400f0132198
#9 0x000000808fa4c2bc in .start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#10 0x000000808f94866c in .__clone () from /lib64/libc.so.6
No symbol table info available.
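The interesting part of this trace is frame #1: list_iterator_create() was handed l=0x91, a garbage List handle, and the first thing the iterator does is lock the list's internal mutex, so the fault surfaces inside pthread_mutex_lock() rather than in the block-allocator code itself. Below is a minimal, self-contained sketch of that mechanism; the types and names are stand-ins of mine, not Slurm's actual list.c internals:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-ins for Slurm's opaque List/ListIterator; the real
 * definitions live in src/common/list.c. */
struct xlist {
    pthread_mutex_t mutex;  /* guards the list's contents */
    void           *head;   /* first node (details don't matter here) */
};

struct xlist_iterator {
    struct xlist *list;     /* list being walked */
    void         *pos;      /* current position */
};

/* Mirrors frame #1 of the trace: the very first operation
 * dereferences 'l' to lock its mutex.  If 'l' is NULL -- or a
 * stale value like 0x91 -- the fault lands inside
 * pthread_mutex_lock(), exactly as in the core. */
struct xlist_iterator *xlist_iterator_create(struct xlist *l)
{
    pthread_mutex_lock(&l->mutex);              /* <- crash site */
    struct xlist_iterator *i = malloc(sizeof(*i));
    if (!i)
        abort();
    i->list = l;
    i->pos = l->head;
    pthread_mutex_unlock(&l->mutex);
    return i;
}

int main(void)
{
    struct xlist good = { PTHREAD_MUTEX_INITIALIZER, NULL };
    struct xlist_iterator *it = xlist_iterator_create(&good);
    printf("valid handle: iterator %p created fine\n", (void *)it);
    free(it);

    /* Passing (struct xlist *)0x91 instead would reproduce the
     * reported SIGSEGV in pthread_mutex_lock(); left commented
     * out so this sketch runs cleanly. */
    /* xlist_iterator_create((struct xlist *)0x91); */
    return 0;
}
```

Note that a NULL check in the caller would not have helped here: 0x91 is non-NULL garbage, which is why suspicion falls on whatever freed or overwrote the record that the list pointer was read from in frame #2.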
[2012-07-10T15:28:13] add_bg_record: asking for [3000x3033] 0 0 0 0 0 T,T,T,T
[2012-07-10T15:28:13] debug: 513 can start unassigned job 1911 at 4294967295 on seq[3000x3033]
[2012-07-10T15:28:13] sched: _slurm_rpc_allocate_resources JobId=1911 NodeList=(null) usec=44008
[2012-07-10T15:28:20] Nodeboard 'N00' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:21] Nodeboard 'N01' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:21] Nodeboard 'N02' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:21] Nodeboard 'N03' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:21] Nodeboard 'N04' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:21] Nodeboard 'N05' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:21] Nodeboard 'N06' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:21] Nodeboard 'N07' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N08' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N09' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N10' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N11' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N12' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N13' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N14' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:22] Nodeboard 'N15' on Midplane R70-M1(1101), has returned to service
[2012-07-10T15:28:25] debug: Processing RPC: REQUEST_STEP_COMPLETE for 1904.0 nodes 0-0 rc=0 uid=8830
Please save this core. If this is related to the other core, then there is memory corruption somewhere. Do you know what is happening on the system?

Could you put the core from this last one, along with slurmctld, in my home dir? I can't seem to log onto seqsn.

The cores don't appear to be related at this point. I have a handle on both, but I want to verify some information, mostly the output of "print *bg_record" from the ba_sub_block_in_record_clear frame. The log from 2012-07-10T15:19:02 to 2012-07-10T15:32:51 would be helpful as well.

Both of these issues are fixed in this bug. The first one is fixed here: https://github.com/SchedMD/slurm/commit/0c371d3617d0e2ac44a1d98661b4c480f548c6ac

The second one is actually related to bug 82, I believe, but this will fix the issue if it ever shows up again: https://github.com/SchedMD/slurm/commit/4731a11b6ba665d8910743a374bda9ddd7c1910e
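For what it's worth, the usual shape of a guard against this class of crash is to validate the handle before building an iterator over it. This is purely an illustrative sketch with hypothetical names, not the contents of either commit linked above:

```c
#include <stdio.h>

struct xlist;                       /* opaque list handle, as above */

struct record {
    struct xlist *mp_list;          /* the list frame #2 iterates over */
};

/* Illustrative guard only -- hypothetical names, NOT the actual
 * patch in either commit above.  The idea: refuse to build an
 * iterator from a handle we can't trust. */
int clear_sub_block(struct record *rec)
{
    if (rec == NULL || rec->mp_list == NULL) {
        fprintf(stderr, "clear_sub_block: no list to clear\n");
        return -1;                  /* caller decides how to recover */
    }
    /* ... safe to call list_iterator_create(rec->mp_list) here ... */
    return 0;
}

int main(void)
{
    struct record empty = { 0 };    /* mp_list == NULL */
    return clear_sub_block(&empty) == -1 ? 0 : 1;
}
```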
Created attachment 82 [details]
stack trace and last log entries