Ticket 7826

Summary: slurmctld abrt (slurm_xfree/_gres_job_list_delete)
Product: Slurm Reporter: Kilian Cavalotti <kilian>
Component: slurmctld    Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED CANNOTREPRODUCE QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: bart
Version: 18.08.8   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=6739
https://bugs.schedmd.com/show_bug.cgi?id=5821
Site: Stanford
Machine Name: Sherlock
Attachments: "thread apply all bt full" output
slurmctld logs

Description Kilian Cavalotti 2019-09-27 10:10:54 MDT
Created attachment 11728 [details]
"thread apply all bt full" output

Hi there,

New slurmctld ABRT (long time no see! :)

(gdb) bt
#0  0x00007fcccf41a207 in raise () from /lib64/libc.so.6
#1  0x00007fcccf41b8f8 in abort () from /lib64/libc.so.6
#2  0x00007fcccf45cd27 in __libc_message () from /lib64/libc.so.6
#3  0x00007fcccf465489 in _int_free () from /lib64/libc.so.6
#4  0x00007fcccfd2faee in slurm_xfree (item=0x7fca9c0abce0, file=file@entry=0x7fcccfd42473 "gres.c", line=line@entry=3005, func=func@entry=0x7fcccfd44a40 <__func__.21392> "_job_state_delete") at xmalloc.c:244
#5  0x00007fcccfc49e5e in _job_state_delete (gres_data=<optimized out>) at gres.c:3005
#6  0x00007fcccfc4ca80 in _gres_job_list_delete (list_element=<optimized out>) at gres.c:3022
#7  0x00007fcccfc6ce29 in list_destroy (l=0x60c6eb0) at list.c:223
#8  0x0000000000451cc0 in _list_delete_job (job_entry=<optimized out>) at job_mgr.c:9347
#9  0x00007fcccfc6d2be in list_delete_all (l=0x13978a0, f=f@entry=0x441fa3 <_list_find_job_old>, key=key@entry=0x4bee86) at list.c:392
#10 0x0000000000449ba0 in purge_old_job () at job_mgr.c:10996
#11 0x000000000042d0c2 in _slurmctld_background (no_data=0x0) at controller.c:2154
#12 main (argc=<optimized out>, argv=<optimized out>) at controller.c:764

"thread apply all bt full" output attached.

The backup controller took over, and the primary controller didn't abort again when it was restarted.

Cheers,
-- 
Kilian
Comment 1 Dominik Bartkiewicz 2019-09-30 08:19:07 MDT
Hi

Could you send me additional info from core dump?
f 5
p *gres_ptr
f 6
p *gres_ptr
f 8
p *job_ptr

Thanks,
Dominik
Comment 2 Kilian Cavalotti 2019-09-30 08:40:26 MDT
Hi Dominik,

Here you go:

(gdb) f 5
#5  0x00007fcccfc49e5e in _job_state_delete (gres_data=<optimized out>) at gres.c:3005
3005    gres.c: No such file or directory.
(gdb) p *gres_ptr
$1 = {gres_name = 0x7fca9c0cf2a0 "gpu", type_id = 0, type_name = 0x0, cpus_per_gres = 0, 
  gres_per_job = 0, gres_per_node = 1, gres_per_socket = 0, gres_per_task = 0, 
  mem_per_gres = 0, def_cpus_per_gres = 0, def_mem_per_gres = 0, total_gres = 0, 
  gres_cnt_node_alloc = 0x7fccc5e6b3e0, node_cnt = 1, gres_bit_alloc = 0x0, 
  gres_bit_step_alloc = 0x0, gres_cnt_step_alloc = 0x7fcc2c424900}

(gdb) f 6
#6  0x00007fcccfc4ca80 in _gres_job_list_delete (list_element=<optimized out>) at gres.c:3022
3022    in gres.c
(gdb) p *gres_ptr
$2 = {plugin_id = 7696487, gres_data = 0x7fca9c0abc80}

(gdb) f 8
#8  0x0000000000451cc0 in _list_delete_job (job_entry=<optimized out>) at job_mgr.c:9347
9347    job_mgr.c: No such file or directory.
(gdb) p *job_ptr
$3 = {magic = 0, account = 0x0, admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x0,
  alloc_resp_port = 0, alloc_sid = 165757, array_job_id = 0, array_task_id = 4294967294,
  array_recs = 0x0, assoc_id = 13936, assoc_ptr = 0x1801190, batch_features = 0x0,
  batch_flag = 1, batch_host = 0x0, billable_tres = 2, bit_flags = 436471808,
  burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0,
  ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 0, cpus_per_tres = 0x0,
  cr_enabled = 1, db_index = 962432220, deadline = 0, delay_boot = 0, derived_ec = 0,
  details = 0x0, direct_set_prio = 0, end_time = 1569598874, end_time_exp = 4294967294,
  epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0,
  gids = 0x7fcbbc1f1150, gres_list = 0x60c6eb0, gres_alloc = 0x0, gres_detail_cnt = 0,
  gres_detail_str = 0x0, gres_req = 0x0, gres_used = 0x0, group_id = 244300,
  job_id = 51420364, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0,
  job_resrcs = 0x0, job_state = 3, kill_on_node_fail = 1, last_sched_eval = 1569594955,
  licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x7fca9c0d59e0},
  mail_type = 271, mail_user = 0x0, mem_per_tres = 0x0, mcs_label = 0x0,
  name = 0x7fca9c098a20 "basecalling.flowcell=FAK26979", network = 0x0, next_step_id = 0,
  ngids = 3, nodes = 0x7fccc4f10490 "sh-115-11", node_addr = 0x7fccc58d3130,
  node_bitmap = 0x7fccc5616710, node_bitmap_cg = 0x7fcca45305d0, node_cnt = 0,
  node_cnt_wag = 1, nodes_completing = 0x0, origin_cluster = 0x0, other_port = 0,
  pack_details = 0x0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0,
  pack_job_list = 0x0, partition = 0x7fccc40205f0 "owners,gpu",
  part_ptr_list = 0x7fccc6f7aa80, part_nodes_missing = false, part_ptr = 0x7fca9022ebd0,
  power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false,
  priority = 38899, priority_array = 0x7fca9c0ac6a0, prio_factors = 0x7fca9c638370,
  profile = 4294967295, qos_id = 3, qos_ptr = 0x13b0080, qos_blocking_ptr = 0x0,
  reboot = 0 '\000', restart_cnt = 2, resize_time = 0, resv_id = 0, resv_name = 0x0,
  resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0,
  select_jobinfo = 0x7fca9c0ac610, spank_job_env = 0x0, spank_job_env_size = 0,
  start_protocol_ver = 8448, start_time = 1569594955, state_desc = 0x0, state_reason = 0,
  state_reason_prev = 3, step_list = 0x7fca849f03e0, suspend_time = 0, system_comment = 0x0,
  time_last_active = 1569598874, time_limit = 240, time_min = 0, tot_sus_time = 4,
  total_cpus = 2, total_nodes = 1, tres_bind = 0x0, tres_freq = 0x0, tres_per_job = 0x0,
  tres_per_node = 0x7fca9c01ea50 "gpu:1", tres_per_socket = 0x0, tres_per_task = 0x0,
  tres_req_cnt = 0x7fca9c5a4a00, tres_req_str = 0x7fca9c51b1c0 "1=2,2=4096,4=1,5=2,1002=1",
  tres_fmt_req_str = 0x7fca9c0ac3a0 "cpu=2,mem=4G,node=1,billing=2,gres/gpu=1",
  tres_alloc_cnt = 0x7fccc4feb150,
  tres_alloc_str = 0x7fcad04c1e60 "1=2,2=4096,3=669114,4=1,5=2,1002=1",
  tres_fmt_alloc_str = 0x7fcad0412cd0 "cpu=2,mem=4G,energy=669114,node=1,billing=2,gres/gpu=1", user_id = 342258, user_name = 0x7fcbbc544b80 "ndolson", wait_all_nodes = 0, warn_flags = 0,
  warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0,
  best_switch = true, wait4switch_start = 0}


Cheers, 
-- 
Kilian
Comment 3 Dominik Bartkiewicz 2019-10-01 08:40:06 MDT
Hi

I still can't reproduce this.
Could you send me slurmctld.log covering 1h before the crash?

Dominik
Comment 4 Kilian Cavalotti 2019-10-01 09:08:25 MDT
Created attachment 11765 [details]
slurmctld logs

Sure, here it is (the crash happened Sep 27 08:57).

Thanks!
-- 
Kilian
Comment 5 Dominik Bartkiewicz 2019-10-02 10:08:23 MDT
Hi

One more request, could you send me the output from:
f 4 
p p
p p[0]
p p[1]

Which version of Glibc do you use?

Dominik
Comment 6 Kilian Cavalotti 2019-10-02 10:10:40 MDT
Hi Dominik, 

(In reply to Dominik Bartkiewicz from comment #5)
> One more request, could you send me the output from:
> f 4 
> p p
> p p[0]
> p p[1]


(gdb) f 4
#4  0x00007fcccfd2faee in slurm_xfree (item=0x7fca9c0abce0, file=file@entry=0x7fcccfd42473 "gres.c", line=line@entry=3005, func=func@entry=0x7fcccfd44a40 <__func__.21392> "_job_state_delete") at xmalloc.c:244
244     xmalloc.c: No such file or directory.
(gdb) p p
$1 = <optimized out>
(gdb) p p[0]
value has been optimized out
(gdb) p p[1]
value has been optimized out

Not sure that's very helpful :\


> Which version of Glibc do you use?

Stock CentOS 7.6: glibc-2.17-260.el7_6.4.x86_64

Cheers,
-- 
Kilian
Comment 7 Dominik Bartkiewicz 2019-10-02 10:22:25 MDT
Sorry, could you try this?

f 4
p *item
p ((size_t *)*item - 2)[1]
p ((size_t *)*item - 2)[0]


Dominik
Comment 8 Kilian Cavalotti 2019-10-02 10:42:59 MDT
Yup:

(gdb) f 4
#4  0x00007fcccfd2faee in slurm_xfree (item=0x7fca9c0abce0,
file=file@entry=0x7fcccfd42473 "gres.c", line=line@entry=3005,
func=func@entry=0x7fcccfd44a40 <__func__.21392> "_job_state_delete")
at xmalloc.c:244
244     xmalloc.c: No such file or directory.
(gdb) p *item
$1 = (void *) 0x7fccc5e6b3e0
(gdb) p ((size_t *)*item - 2)[1]
$2 = 8
(gdb) p ((size_t *)*item - 2)[0]
$3 = 0

Cheers,
Comment 9 Dominik Bartkiewicz 2019-10-28 07:20:24 MDT
Hi

Have you seen this abort again?
Unfortunately, I can't reproduce it,
and I have no idea what could be causing it.

Dominik
Comment 10 Kilian Cavalotti 2019-10-28 08:40:22 MDT
Hi Dominik,

No, we haven't seen this recently, so it's ok to close this ticket.

Thanks!
Comment 11 Dominik Bartkiewicz 2019-10-28 09:54:52 MDT
Thanks. If this is OK with you, I'm closing the bug as "cannot reproduce".

Dominik