| Summary: | Slurmctld Crashes: Cannot Access Memory in Select Plugin | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jason Sollom <jasons> |
| Component: | slurmctld | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | da, david.gloe |
| Version: | 14.03.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CRAY | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 14.03.0pre6 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | partial slurmctld log | ||
The slurmctld core file and slurmctld log are also in the directory galaxy:/home/users/jasons/slurmctld_crash_bug/.

I would expect p_ptr->row[i].job_list[j] to be NULL here, since on the line above the seg fault it gets explicitly set to NULL. But memory corruption does appear to be part of this issue, making this a much trickier issue. This doesn't appear related to previous memory corruption issues.
According to the core file, it looks like there were 5 jobs in the system at this time.
(gdb) print *tmpjobs[x]
$2 = {core_bitmap = 0x414141445177413a, core_bitmap_used = 0x4b383757584a354a, cpu_array_cnt = 1869898049, cpu_array_value = 0x6c71593436545853, cpu_array_reps = 0x496b497a666e614b, cpus = 0x5573516b39587a4e, cpus_used = 0x2f2b654549696e35, cores_per_socket = 0x44784965505a6732, memory_allocated = 0x6d66344f62776a63, memory_used = 0x445933477573316e, nhosts = 1701458008, node_bitmap = 0x43614e496a454972, node_req = 1312966765, nodes = 0x7849795369575663 <Address 0x7849795369575663 out of bounds>, ncpus = 1702384208, sock_core_rep_count = 0x4ccd0100003a3d, sockets_per_node = 0x3}
This is obviously bad memory.
(gdb) print x
$3 = 0
(gdb) print i
$4 = 0
(gdb) print j
$5 = 0
(gdb) print p_ptr->row[i].num_jobs
$7 = 5
The rest of the list looks OK, though:
(gdb) print *p_ptr->row[i].job_list[j+1]
$10 = {core_bitmap = 0x9eaa70, core_bitmap_used = 0x8b0440, cpu_array_cnt = 9, cpu_array_value = 0x88f408, cpu_array_reps = 0x8aaa08, cpus = 0x8a9b68, cpus_used = 0x9a7748, cores_per_socket = 0x9019f8, memory_allocated = 0xa4c078, memory_used = 0x953e38, nhosts = 32, node_bitmap = 0x93e3d0, node_req = 1, nodes = 0x9f7b78 "nid00[215-237,352-358,360-361]", ncpus = 524, sock_core_rep_count = 0x91e688, sockets_per_node = 0x8bca58}
(gdb) print *p_ptr->row[i].job_list[j+2]
$11 = {core_bitmap = 0x93e320, core_bitmap_used = 0x8cc7d0, cpu_array_cnt = 1, cpu_array_value = 0x86f628, cpu_array_reps = 0x8cc698, cpus = 0x8cc658, cpus_used = 0x9a2e48, cores_per_socket = 0x93e2e8, memory_allocated = 0x8c5348, memory_used = 0xa48cd8, nhosts = 19, node_bitmap = 0xa48ca0, node_req = 1, nodes = 0x89ee48 "nid00[217-235]", ncpus = 152, sock_core_rep_count = 0x8cc778, sockets_per_node = 0x8d1078}
(gdb) print *p_ptr->row[i].job_list[j+3]
$12 = {core_bitmap = 0x7f7650017580, core_bitmap_used = 0x7f7650014e10, cpu_array_cnt = 1, cpu_array_value = 0x7f7650017ea8, cpu_array_reps = 0x7f7650014e58, cpus = 0x7f7650014e38, cpus_used = 0x7f76500a9808, cores_per_socket = 0x7f7650017548, memory_allocated = 0x7f7650015108, memory_used = 0x7f76500a9828, nhosts = 1, node_bitmap = 0x7f76500c7780, node_req = 1, nodes = 0x7f7650069f58 "nid00236", ncpus = 23, sock_core_rep_count = 0x7f7650017568, sockets_per_node = 0x7f76500a9848}
(gdb) print *p_ptr->row[i].job_list[j+4]
$13 = {core_bitmap = 0x7f76500adb70, core_bitmap_used = 0x7f76500adb90, cpu_array_cnt = 1, cpu_array_value = 0x7f76500c70e8, cpu_array_reps = 0x7f7650017998, cpus = 0x7f7650017978, cpus_used = 0x7f765000c818, cores_per_socket = 0x7f7650013ec8, memory_allocated = 0x7f76500adc58, memory_used = 0x7f7650014dd8, nhosts = 1, node_bitmap = 0x7f7650058690, node_req = 1, nodes = 0x7f76500563a8 "nid00330", ncpus = 9, sock_core_rep_count = 0x7f7650018c18, sockets_per_node = 0x7f76500189c8}
(gdb) print *p_ptr->row[i].job_list[j+5]
Cannot access memory at address 0x0
This job was job 117991.
$14 = {account = 0x0, alias_list = 0x0, alloc_node = 0x8f3a08 "galaxy", alloc_resp_port = 0, alloc_sid = 17379, array_job_id = 0, array_task_id = 4294967294, assoc_id = 0, assoc_ptr = 0x0, batch_flag = 1, batch_host = 0x8b1298 "nid00346", check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, comment = 0x0, cpu_cnt = 222, cr_enabled = 1, db_index = 2361, derived_ec = 0, details = 0x96b088, direct_set_prio = 0, end_time = 1384631159, exit_code = 0, front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x97e848 "", gres_req = 0x8c5ce8 "", gres_used = 0x0, group_id = 11121, job_id = 117991, job_next = 0x0, job_resrcs = 0xa11a78, job_state = 32772, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set_max_cpus = 0, limit_set_max_nodes = 0, limit_set_min_cpus = 0, limit_set_min_nodes = 0, limit_set_pn_min_memory = 0, limit_set_time = 0, limit_set_qos = 0, mail_type = 0, mail_user = 0x0, magic = 4038539564, name = 0x940038 "IMB-crayxt.msg22", network = 0x0, next_step_id = 1, nodes = 0x8cf0b8 "nid00[346-348,351-355,357-358,360-361]", node_addr = 0x92f3f8, node_bitmap = 0xa11ba0, node_bitmap_cg = 0x8a92c0, node_cnt = 12, nodes_completing = 0x0, other_port = 0, partition = 0x8bff98 "workq", part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x7f7668023188, pre_sus_time = 0, preempt_time = 0, priority = 4294899834, priority_array = 0x0, prio_factors = 0x96b1e8, profile = 0, qos_id = 1, qos_ptr = 0x864758, restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, resv_flags = 0, requid = 1356, resp_host = 0x0, select_jobinfo = 0x8cc438, spank_job_env = 0x0, spank_job_env_size = 0, start_time = 1384630680, state_desc = 0x0, state_reason = 0, step_list = 0x7f764c20ffd8, suspend_time = 0, time_last_active = 1384631159, time_limit = 60, time_min = 0, tot_sus_time = 0, total_cpus = 222, total_nodes = 12, user_id = 1356, wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 
0, best_switch = true, wait4switch_start = 0}
(gdb) print *job_ptr->job_resrcs
$15 = {core_bitmap = 0x8b0ef0, core_bitmap_used = 0x92f690, cpu_array_cnt = 5, cpu_array_value = 0x92f758, cpu_array_reps = 0x92f6f8, cpus = 0x92f6c8, cpus_used = 0x92f668, cores_per_socket = 0xa11b18, memory_allocated = 0x8fbb38, memory_used = 0x8b1818, nhosts = 12, node_bitmap = 0xa11a40, node_req = 1, nodes = 0x8cd0a8 "nid00[346-348,351-355,357-358,360-361]", ncpus = 222, sock_core_rep_count = 0x8b0eb8, sockets_per_node = 0x92f638}
Created attachment 516 [details]
partial slurmctld log
This is the last 100 lines of the slurmctld log file.
It appears job 117991 was canceled by scancel. Jason, do you know if that was the case or not?
[2013-11-16T13:45:59.311] Calling NHC for jobid 117991 and apid 0 on nodes nid00[346-348,351-355,357-358,360-361](346-348,351-355,357-358,360-361)
[2013-11-16T13:46:00.791] completing job 117991 status 0
[2013-11-16T13:46:00.791] _slurm_rpc_complete_batch_script JobId=117991: Job/step already completing or completed
[2013-11-16T13:46:00.885] _run_nhc jobid 117991 and apid 0 completed took: usec=1575365
This makes it appear the NHC for the reservation started before the RPC for the batch script completion made it in. That seems backwards and is probably the reason for the memory corruption (at least that is my guess).
There must be a path where this is plausible. I'll see if I can reproduce.
(In reply to Danny Auble from comment #2)
> I would expect p_ptr->row[i].job_list[j] to be NULL here, since on the line
> above the seg fault it gets explicitly set to NULL. But memory corruption
> does appear to be part of this issue, making this a much trickier issue.
> This doesn't appear related to previous memory corruption issues.

This is an easy conclusion to jump to; however, I think it is incorrect.

(gdb) info frame
Stack level 2, frame at 0x7f766369ce40:
rip = 0x7f76648d75dd in _rm_job_from_res (select_cons_res.c:1249); saved rip 0x7f76648d9975
called by frame at 0x7f766369ce60, caller of frame at 0x7f766369cdc0
source language c.
Arglist at 0x7f766369ce30, args: part_record_ptr=0x7f7668024be8, node_usage=0x7f7668026a38, job_ptr=0xa47918, action=0
Locals at 0x7f766369ce30, Previous frame's sp is 0x7f766369ce40
Saved registers:
rbx at 0x7f766369ce28, rbp at 0x7f766369ce30, rip at 0x7f766369ce38
(gdb) p n
$22 = 1
(gdb) p i
$23 = 5

Looking at i=5 means it was row[5].job_list[j] that was set to NULL, not row[0], which is the row in question.

Interesting; in my findings this is what I see...

(gdb) print x
$3 = 0
(gdb) print i
$4 = 0
(gdb) print j
$5 = 0
(gdb) print p_ptr->row[i].num_jobs
$7 = 5

You appear to be in the wrong frame. Frame 1 is where the print should be done, not frame 2. I just ran gdb again to confirm.

While I believe there is memory corruption, I believe p_ptr->row[i].job_list[j] should be NULL. I wouldn't expect it to be anything other.

This happened again on galaxy. The backtrace looks the same, although the line number is _job_fini (args=0xb75508) at select_cray.c:694 rather than
0x00007fbffd0ef77d in _job_fini (args=0x925d78) at select_cray.c:682
***bt full***
#0 0x00007ff9a5c7db35 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007ff9a5c7f111 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007ff9a5c769f0 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#3 0x00000000004df8ce in bit_ffs (b=0x0) at bitstring.c:408
bit = 0
value = -1
__PRETTY_FUNCTION__ = "bit_ffs"
#4 0x00007ff99cde3680 in _build_row_bitmaps (p_ptr=0x7ff9a0024fe8, job_ptr=0xb75508) at select_cons_res.c:640
i = 0
j = 6
num_jobs = 19
size = 2092
x = 6
jstart = 0xa10648
this_row = 0x7ff999282db0
orig_row = 0x8719f8
tmpjobs = 0xa71488
job = 0x4d8ba6 <debug3+137>
#5 0x00007ff99cde55dd in _rm_job_from_res (part_record_ptr=0x7ff9a0024fe8, node_usage=0x7ff9a0026dc8, job_ptr=0xb75508, action=0) at select_cons_res.c:1249
p_ptr = 0x7ff9a0024fe8
job = 0xb5f708
node_ptr = 0x7ff9a0031de8
first_bit = 105
last_bit = 105
i = 5
n = 1
gres_list = 0x0
#6 0x00007ff99cde7975 in select_p_job_fini (job_ptr=0xb75508) at select_cons_res.c:2130
No locals.
#7 0x00007ff9a45314bb in other_job_fini (job_ptr=0xb75508) at other_select.c:372
No locals.
#8 0x00007ff9a452f95f in _job_fini (args=0xb75508) at select_cray.c:694
jobinfo = 0x0
job_ptr = 0xb75508
nhc_info = {apid = 0, exit_code = 1, jobid = 122940, nodelist = 0x0, step = false}
job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK}
#9 0x00007ff9a5fc97b6 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#10 0x00007ff9a5d24c5d in clone () from /lib64/libc.so.6
No symbol table info available.
#11 0x0000000000000000 in ?? ()
No symbol table info available.
***Further analysis***
**Frame 3**
(gdb) frame 3
#3 0x00000000004df8ce in bit_ffs (b=0x0) at bitstring.c:408
408 bitstring.c: No such file or directory.
(gdb) p *b
Cannot access memory at address 0x0
**Frame 4**
#4 0x00007ff99cde3680 in _build_row_bitmaps (p_ptr=0x7ff9a0024fe8, job_ptr=0xb75508) at select_cons_res.c:640
640 select_cons_res.c: No such file or directory.
(gdb) p *tmpjobs[x]
$16 = {core_bitmap = 0x7ff9a5fbdfb8 <main_arena+312>, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0, cores_per_socket = 0x0,
memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 1, nodes = 0x0, ncpus = 38, sock_core_rep_count = 0x0, sockets_per_node = 0x0}
Notice node_bitmap is NULL.
It's hard to trace this back any further because the job_list gets nullified here:
p_ptr->row[i].job_list[j] = NULL;
This occurred after we synced up with SchedMD. The latest SchedMD mod was 694e724c3308febfb0bdad970c0d1d19886d128f. This may help narrow down the window in which to look for any errors.

This happened again today on galaxy. The last SchedMD commit was 529f4324 and it has the uid/gid patch applied.
(gdb) bt full
#0 0x00000000004dfd46 in bit_ffs (b=0x280) at bitstring.c:408
bit = 0
value = -1
__PRETTY_FUNCTION__ = "bit_ffs"
#1 0x00007f1643398654 in _build_row_bitmaps (p_ptr=0x7f1644000c98, job_ptr=0x8af6d8) at select_cons_res.c:645
i = 0
j = 0
num_jobs = 1
size = 2092
x = 0
this_row = 0x7f1640848cf0
orig_row = 0x7f1638000b58
ss = 0x7f1638000d18
#2 0x00007f164339a3dd in _rm_job_from_res (part_record_ptr=0x7f1644000c98, node_usage=0x7f16440027a8, job_ptr=0x8af6d8, action=0) at select_cons_res.c:1241
p_ptr = 0x7f1644000c98
job = 0x8a97d8
node_ptr = 0x7f1644168998
first_bit = 24
last_bit = 31
i = 3
n = 1
gres_list = 0x0
#3 0x00007f164339c7e8 in select_p_job_fini (job_ptr=0x8af6d8) at select_cons_res.c:2138
No locals.
#4 0x00007f164aadb4bb in other_job_fini (job_ptr=0x8af6d8) at other_select.c:372
No locals.
#5 0x00007f164aad995f in _job_fini (args=0x8af6d8) at select_cray.c:705
jobinfo = 0x0
job_ptr = 0x8af6d8
nhc_info = {apid = 0, exit_code = 1, jobid = 149578, nodelist = 0x0, step = false}
job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK}
#6 0x00007f164c6767b6 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#7 0x00007f164c3d1c5d in clone () from /lib64/libc.so.6
No symbol table info available.
#8 0x0000000000000000 in ?? ()
No symbol table info available.
*** Ticket 476 has been marked as a duplicate of this ticket. ***

I found and fixed an issue that may be related to this. Patches:

c4511bdfb3f2caab2875cdc101b550d2509b5ce1
ad68eac67fb38e86da442ef36395ca4620bb59ee

Let me know if you see the issue after these patches.

I'm closing this based upon Danny's patches. Please re-open as needed.

We saw this today on cst82 running slurm-14-03-0-0pre5.
The core file is in /var/spool/slurm/core on the sdb.
(gdb) bt full
#0 0x00007fc54cc91b35 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fc54cc93111 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007fc54cc8a9f0 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#3 0x00000000004e17f0 in bit_ffs (b=0x7fc53001af40) at bitstring.c:451
bit = 0
value = -1
__PRETTY_FUNCTION__ = "bit_ffs"
#4 0x00007fc548112654 in _build_row_bitmaps (p_ptr=0xa371b8, job_ptr=0xc9f698) at select_cons_res.c:645
i = 0
j = 7
num_jobs = 58
size = 82224
x = 7
this_row = 0x7fc545fcdcf0
orig_row = 0x7fc53420f2d8
ss = 0x7fc534036778
#5 0x00007fc5481143dd in _rm_job_from_res (part_record_ptr=0xa371b8, node_usage=0xa69048, job_ptr=0xc9f698, action=0) at select_cons_res.c:1241
p_ptr = 0xa371b8
job = 0x7fc53c001eb8
node_ptr = 0xbbf3c8
first_bit = 3371
last_bit = 3371
i = 5
n = 1
gres_list = 0x0
#6 0x00007fc5481167e8 in select_p_job_fini (job_ptr=0xc9f698) at select_cons_res.c:2138
No locals.
#7 0x00007fc54b846277 in other_job_fini (job_ptr=0xc9f698) at other_select.c:372
No locals.
#8 0x00007fc54b844925 in _job_fini (args=0xc9f698) at select_cray.c:711
jobinfo = 0x0
job_ptr = 0xc9f698
nhc_info = {apid = 0, exit_code = 1, jobid = 591, nodelist = 0x0, step = false, user_id = 1356}
job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK}
#9 0x00007fc54cfdd7b6 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#10 0x00007fc54cd38c5d in clone () from /lib64/libc.so.6
No symbol table info available.
#11 0x0000000000000000 in ?? ()
No symbol table info available.
OK, this is fixed in 6d7e592c739ab7e89768e8ba0325587d5fdcee56. Already checked into the master branch. Let me know if you see this again, but I am fairly confident this should fix this issue.
Slurmctld crashed twice over the weekend on galaxy. The Cray slurm code was last synced with the mod 6b210cc8cd802dd37dd21e8003a6276706ee9988.

Here's the full backtrace.

(gdb) bt full
#0 0x00007fbffe940b35 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fbffe942111 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007fbffe9399f0 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#3 0x00000000004df90e in bit_ffs (b=0x0) at bitstring.c:408
bit = 0
value = -1
__PRETTY_FUNCTION__ = "bit_ffs"
#4 0x00007fbff99bf680 in _build_row_bitmaps (p_ptr=0x8c0758, job_ptr=0x925d78) at select_cons_res.c:640
i = 0
j = 2
num_jobs = 36
size = 2092
x = 2
jstart = 0x99cf38
this_row = 0x7fbff31f0dc0
orig_row = 0x9a4a08
tmpjobs = 0x8c04d8
job = 0x4d8be6 <debug3+137>
#5 0x00007fbff99c15dd in _rm_job_from_res (part_record_ptr=0x8c0758, node_usage=0x8ca838, job_ptr=0x925d78, action=0) at select_cons_res.c:1249
p_ptr = 0x8c0758
job = 0x7fbfe4020e08
node_ptr = 0x89d518
first_bit = 52
last_bit = 52
i = 5
n = 1
gres_list = 0x0
#6 0x00007fbff99c3975 in select_p_job_fini (job_ptr=0x925d78) at select_cons_res.c:2130
No locals.
#7 0x00007fbffd0f12d7 in other_job_fini (job_ptr=0x925d78) at other_select.c:372
No locals.
#8 0x00007fbffd0ef77d in _job_fini (args=0x925d78) at select_cray.c:682
jobinfo = 0x0
job_ptr = 0x925d78
job_id = 118976
node_list = 0x0
job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK}
job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK}
#9 0x00007fbffec8c7b6 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#10 0x00007fbffe9e7c5d in clone () from /lib64/libc.so.6
No symbol table info available.
#11 0x0000000000000000 in ?? ()
No symbol table info available.

I have copied the slurmctld binary as well as the *.so's into /home/users/jasons/slurmctld_crash_bug/.
Looking more in depth at this in gdb, we start out with:

***Frame 0***

(gdb) frame 0
#0 0x00000000004df912 in bit_ffs (b=0x43614e496a454972) at bitstring.c:408
(gdb) p *b
Cannot access memory at address 0x43614e496a454972

***Frame 1***

(gdb) frame 1
#1 0x00007f76648d5680 in _build_row_bitmaps (p_ptr=0x7f7668024be8, job_ptr=0xa47918) at select_cons_res.c:640

This is the code:

jstart[x] = bit_ffs(tmpjobs[x]->node_bitmap);

(gdb) p tmpjobs[x]->node_bitmap
$26 = (bitstr_t *) 0x43614e496a454972 <<< Same inaccessible address as above

Which comes from line 638:

tmpjobs[x] = p_ptr->row[i].job_list[j];

(gdb) p p_ptr->row[i].job_list[j]
$27 = (struct job_resources *) 0x0 <<< NULL, not a good sign

***Frame 2***

(gdb) frame 2
#2 0x00007f76648d75dd in _rm_job_from_res (part_record_ptr=0x7f7668024be8, node_usage=0x7f7668026a38, job_ptr=0xa47918, action=0) at select_cons_res.c:1249

Which is this line in select_cons_res.c:

/* job was found and removed, so refresh the bitmaps */
_build_row_bitmaps(p_ptr, job_ptr);

Just to reiterate the frame 1 findings, below we see this is NULL, which I believe is incorrect:

(gdb) p p_ptr->row[0].job_list[0]
$21 = (struct job_resources *) 0x0