| Summary: | slurmctld abrt (slurm_xfree/_gres_job_list_delete) | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 3 - Medium Impact | Priority: | --- |
| Version: | 18.08.8 | CC: | bart |
| Hardware: | Linux | OS: | Linux |
| Site: | Stanford | Machine Name: | Sherlock |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6739, https://bugs.schedmd.com/show_bug.cgi?id=5821 | | |
Attachments:
- "thread apply all bt full" output (attachment 11728)
- slurmctld logs (attachment 11765)
Description (Kilian Cavalotti):

Created attachment 11728 [details]: "thread apply all bt full" output

Hi there,

New slurmctld ABRT (long time no see! :)

```
(gdb) bt
#0  0x00007fcccf41a207 in raise () from /lib64/libc.so.6
#1  0x00007fcccf41b8f8 in abort () from /lib64/libc.so.6
#2  0x00007fcccf45cd27 in __libc_message () from /lib64/libc.so.6
#3  0x00007fcccf465489 in _int_free () from /lib64/libc.so.6
#4  0x00007fcccfd2faee in slurm_xfree (item=0x7fca9c0abce0, file=file@entry=0x7fcccfd42473 "gres.c", line=line@entry=3005, func=func@entry=0x7fcccfd44a40 <__func__.21392> "_job_state_delete") at xmalloc.c:244
#5  0x00007fcccfc49e5e in _job_state_delete (gres_data=<optimized out>) at gres.c:3005
#6  0x00007fcccfc4ca80 in _gres_job_list_delete (list_element=<optimized out>) at gres.c:3022
#7  0x00007fcccfc6ce29 in list_destroy (l=0x60c6eb0) at list.c:223
#8  0x0000000000451cc0 in _list_delete_job (job_entry=<optimized out>) at job_mgr.c:9347
#9  0x00007fcccfc6d2be in list_delete_all (l=0x13978a0, f=f@entry=0x441fa3 <_list_find_job_old>, key=key@entry=0x4bee86) at list.c:392
#10 0x0000000000449ba0 in purge_old_job () at job_mgr.c:10996
#11 0x000000000042d0c2 in _slurmctld_background (no_data=0x0) at controller.c:2154
#12 main (argc=<optimized out>, argv=<optimized out>) at controller.c:764
```

"thread apply all bt full" output attached. The backup controller took over, and the primary controller didn't abort again when it was restarted.

Cheers,
--
Kilian

Comment 1 (Dominik Bartkiewicz):

Hi,

Could you send me additional info from the core dump?

```
f 5
p *gres_ptr
f 6
p *gres_ptr
f 8
p *job_ptr
```

Thanks,
Dominik

Comment 2 (Kilian Cavalotti):

Hi Dominik,

Here you go:

```
(gdb) f 5
#5  0x00007fcccfc49e5e in _job_state_delete (gres_data=<optimized out>) at gres.c:3005
3005	gres.c: No such file or directory.
(gdb) p *gres_ptr
$1 = {gres_name = 0x7fca9c0cf2a0 "gpu", type_id = 0, type_name = 0x0, cpus_per_gres = 0, gres_per_job = 0, gres_per_node = 1, gres_per_socket = 0, gres_per_task = 0, mem_per_gres = 0, def_cpus_per_gres = 0, def_mem_per_gres = 0, total_gres = 0, gres_cnt_node_alloc = 0x7fccc5e6b3e0, node_cnt = 1, gres_bit_alloc = 0x0, gres_bit_step_alloc = 0x0, gres_cnt_step_alloc = 0x7fcc2c424900}
(gdb) f 6
#6  0x00007fcccfc4ca80 in _gres_job_list_delete (list_element=<optimized out>) at gres.c:3022
3022	in gres.c
(gdb) p *gres_ptr
$2 = {plugin_id = 7696487, gres_data = 0x7fca9c0abc80}
(gdb) f 8
#8  0x0000000000451cc0 in _list_delete_job (job_entry=<optimized out>) at job_mgr.c:9347
9347	job_mgr.c: No such file or directory.
(gdb) p *job_ptr
$3 = {magic = 0, account = 0x0, admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x0, alloc_resp_port = 0, alloc_sid = 165757, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 13936, assoc_ptr = 0x1801190, batch_features = 0x0, batch_flag = 1, batch_host = 0x0, billable_tres = 2, bit_flags = 436471808, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 0, cpus_per_tres = 0x0, cr_enabled = 1, db_index = 962432220, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x0, direct_set_prio = 0, end_time = 1569598874, end_time_exp = 4294967294, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x7fcbbc1f1150, gres_list = 0x60c6eb0, gres_alloc = 0x0, gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x0, gres_used = 0x0, group_id = 244300, job_id = 51420364, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 3, kill_on_node_fail = 1, last_sched_eval = 1569594955, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x7fca9c0d59e0}, mail_type = 271, mail_user = 0x0, mem_per_tres = 0x0, mcs_label = 0x0, name = 0x7fca9c098a20 "basecalling.flowcell=FAK26979", network = 0x0, next_step_id = 0, ngids = 3, nodes = 0x7fccc4f10490 "sh-115-11", node_addr = 0x7fccc58d3130, node_bitmap = 0x7fccc5616710, node_bitmap_cg = 0x7fcca45305d0, node_cnt = 0, node_cnt_wag = 1, nodes_completing = 0x0, origin_cluster = 0x0, other_port = 0, pack_details = 0x0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x7fccc40205f0 "owners,gpu", part_ptr_list = 0x7fccc6f7aa80, part_nodes_missing = false, part_ptr = 0x7fca9022ebd0, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 38899, priority_array = 0x7fca9c0ac6a0, prio_factors = 0x7fca9c638370, profile = 4294967295, qos_id = 3, qos_ptr = 0x13b0080, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 2, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x7fca9c0ac610, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8448, start_time = 1569594955, state_desc = 0x0, state_reason = 0, state_reason_prev = 3, step_list = 0x7fca849f03e0, suspend_time = 0, system_comment = 0x0, time_last_active = 1569598874, time_limit = 240, time_min = 0, tot_sus_time = 4, total_cpus = 2, total_nodes = 1, tres_bind = 0x0, tres_freq = 0x0, tres_per_job = 0x0, tres_per_node = 0x7fca9c01ea50 "gpu:1", tres_per_socket = 0x0, tres_per_task = 0x0, tres_req_cnt = 0x7fca9c5a4a00, tres_req_str = 0x7fca9c51b1c0 "1=2,2=4096,4=1,5=2,1002=1", tres_fmt_req_str = 0x7fca9c0ac3a0 "cpu=2,mem=4G,node=1,billing=2,gres/gpu=1", tres_alloc_cnt = 0x7fccc4feb150, tres_alloc_str = 0x7fcad04c1e60 "1=2,2=4096,3=669114,4=1,5=2,1002=1", tres_fmt_alloc_str = 0x7fcad0412cd0 "cpu=2,mem=4G,energy=669114,node=1,billing=2,gres/gpu=1", user_id = 342258, user_name = 0x7fcbbc544b80 "ndolson", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
```

Cheers,
--
Kilian

Comment 3 (Dominik Bartkiewicz):

Hi,

I still can't reproduce this. Could you send me slurmctld.log covering the hour before the crash?

Dominik

Comment 4 (Kilian Cavalotti):

Created attachment 11765 [details]:
slurmctld logs
Sure, here it is (the crash happened Sep 27 08:57)
Thanks!
--
Kilian
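Some context for the gdb requests in the next comments: the frame-4 signature shows slurm_xfree() receiving `item` as a pointer-to-pointer (item=0x7fca9c0abce0, with *item being the allocation itself). That is the usual "free and NULL the caller's pointer" idiom, presumably reached through an xfree() macro that passes the address of the pointer. A minimal sketch of that idiom, with illustrative names rather than Slurm's exact xmalloc.c source:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Sketch of an xfree-style helper: it takes the ADDRESS of the caller's
 * pointer so it can both free the block and reset the pointer to NULL.
 * Freeing through the same variable twice is then harmless, because the
 * second call sees NULL and does nothing. Illustrative only.
 */
static void my_xfree(void **item)
{
	if (item && *item) {
		free(*item);
		*item = NULL;	/* caller's pointer is cleared */
	}
}

int main(void)
{
	char *buf = malloc(16);
	if (!buf)
		return 1;
	strcpy(buf, "hello");
	my_xfree((void **)&buf);	/* buf is NULL afterwards */
	my_xfree((void **)&buf);	/* safe: *item is NULL, no-op */
	return 0;
}
```

Because the helper clears the caller's pointer, glibc aborting inside _int_free suggests either two distinct copies of the same pointer being freed, or a heap block whose metadata was corrupted before the free.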
Comment 5 (Dominik Bartkiewicz):

Hi,

One more request, could you send me the output from:

```
f 4
p p
p p[0]
p p[1]
```

Which version of glibc do you use?

Dominik

Comment 6 (Kilian Cavalotti):

Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #5)
> One more request, could you send me the output from:
> f 4
> p p
> p p[0]
> p p[1]

```
(gdb) f 4
#4  0x00007fcccfd2faee in slurm_xfree (item=0x7fca9c0abce0, file=file@entry=0x7fcccfd42473 "gres.c", line=line@entry=3005, func=func@entry=0x7fcccfd44a40 <__func__.21392> "_job_state_delete") at xmalloc.c:244
244	xmalloc.c: No such file or directory.
(gdb) p p
$1 = <optimized out>
(gdb) p p[0]
value has been optimized out
(gdb) p p[1]
value has been optimized out
```

Not sure that's very helpful :\

> Which version of glibc do you use?

Stock CentOS 7.6: glibc-2.17-260.el7_6.4.x86_64

Cheers,
--
Kilian

Comment 7 (Dominik Bartkiewicz):

Sorry, could you try this?

```
f 4
p *item
p ((size_t *)*item - 2)[1]
p ((size_t *)*item - 2)[0]
```

Dominik

Comment 8 (Kilian Cavalotti):

Yup:

```
(gdb) f 4
#4  0x00007fcccfd2faee in slurm_xfree (item=0x7fca9c0abce0, file=file@entry=0x7fcccfd42473 "gres.c", line=line@entry=3005, func=func@entry=0x7fcccfd44a40 <__func__.21392> "_job_state_delete") at xmalloc.c:244
244	xmalloc.c: No such file or directory.
(gdb) p *item
$1 = (void *) 0x7fccc5e6b3e0
(gdb) p ((size_t *)*item - 2)[1]
$2 = 8
(gdb) p ((size_t *)*item - 2)[0]
$3 = 0
```

Cheers,

Comment 9 (Dominik Bartkiewicz):

Hi,

Did you notice this abort again? Unfortunately, I can't reproduce this and I have no idea what could cause it.

Dominik

Comment 10 (Kilian Cavalotti):

Hi Dominik,

No, we haven't seen this recently, so it's OK to close this ticket. Thanks!

Comment 11 (Dominik Bartkiewicz):

Thanks, if this is OK with you I'm closing the bug as "cannot reproduce".

Dominik
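Why comment 7 probes ((size_t *)*item - 2): an xmalloc-style allocator typically stashes bookkeeping words immediately before the pointer it hands out, and the local `p` requested in comment 5 (optimized out) presumably points at that header. In the probe, *item = 0x7fccc5e6b3e0 matches the gres_cnt_node_alloc field dumped in comment 2, and the second header word of 8 is plausible for one 64-bit per-node counter with node_cnt = 1; the first word, which would normally hold a magic canary, read 0. A sketch of the pattern, with an illustrative (magic, size) header layout and magic value that are assumptions here, not Slurm's actual xmalloc.c:

```c
#include <stdio.h>
#include <stdlib.h>

#define MY_MAGIC ((size_t)0x42)	/* illustrative canary value */

/* Allocate size bytes, prepending a two-word header: [magic, size]. */
static void *my_xmalloc(size_t size)
{
	size_t *p = malloc(size + 2 * sizeof(size_t));
	if (!p)
		abort();
	p[0] = MY_MAGIC;
	p[1] = size;
	return p + 2;		/* caller sees memory just past the header */
}

/* Free a block, first validating the header the way the gdb probe did. */
static void my_xfree(void **item)
{
	size_t *p;

	if (!item || !*item)
		return;
	p = (size_t *)*item - 2;	/* same arithmetic as the probe */
	if (p[0] != MY_MAGIC) {
		/* Header clobbered or block already freed: in this core
		 * dump, p[0] was 0 while p[1] was still 8. */
		fprintf(stderr, "xfree: bad magic %zx (size word %zu)\n",
			p[0], p[1]);
		abort();
	}
	p[0] = 0;		/* scrub the canary so a second free is caught */
	free(p);
	*item = NULL;
}

int main(void)
{
	void *buf = my_xmalloc(8);
	my_xfree(&buf);		/* ok: header intact */
	return 0;
}
```

Whether and how Slurm validates this header at free time depends on the build, but the probe result stands on its own: with the first header word already zeroed when gres.c:3005 ran, the block either was never a live xmalloc allocation or had been freed or overwritten earlier, which is consistent with glibc detecting an invalid free.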
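One structural note on the backtrace at the top of the report: the invalid free fires while purge_old_job() destroys an old job record and list_destroy() walks the job's gres_list, invoking a per-element destructor (_gres_job_list_delete, which in turn frees the fields of each gres state struct). A minimal sketch of that destructor-callback list pattern, with illustrative types and names rather than Slurm's actual list.c API:

```c
#include <stdlib.h>

/* Minimal singly linked list whose destroy() runs a per-element
 * destructor, mirroring the list_destroy -> _gres_job_list_delete ->
 * _job_state_delete chain in the backtrace. Illustrative only. */

typedef void (*del_f)(void *data);

struct node { void *data; struct node *next; };
struct list { struct node *head; del_f del; };

static struct list *list_create(del_f del)
{
	struct list *l = calloc(1, sizeof(*l));
	if (!l)
		abort();
	l->del = del;
	return l;
}

/* Push an element onto the head of the list. */
static void list_push(struct list *l, void *data)
{
	struct node *n = calloc(1, sizeof(*n));
	if (!n)
		abort();
	n->data = data;
	n->next = l->head;
	l->head = n;
}

/* Destroying the list invokes the destructor on every element; this is
 * where a stale pointer inside an element turns into an invalid free. */
static void list_destroy(struct list *l)
{
	struct node *n = l->head;

	while (n) {
		struct node *next = n->next;
		if (l->del)
			l->del(n->data);
		free(n);
		n = next;
	}
	free(l);
}

int main(void)
{
	struct list *gres_list = list_create(free);	/* free() as destructor */
	list_push(gres_list, malloc(8));
	list_destroy(gres_list);	/* destructor runs once per element */
	return 0;
}
```

In code shaped like this, a field pointer that was already freed elsewhere, or shared by two elements, only blows up much later, at list-destroy time, which is part of what makes crashes like this one hard to reproduce.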