Ticket 3515 - slurmctld abort() packing job
Summary: slurmctld abort() packing job
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld (show other tickets)
Version: 17.02.0
Hardware: Cray XC Linux
: 2 - High Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-02-28 08:50 MST by Doug Jacobsen
Modified: 2017-03-09 07:06 MST (History)
2 users (show)

See Also:
Site: NERSC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 17.02.1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmctld log (3.49 MB, text/x-log)
2017-02-28 09:15 MST, Doug Jacobsen
Details

Note You need to log in before you can comment on or make changes to this ticket.
Description Doug Jacobsen 2017-02-28 08:50:37 MST
Hello,

I'm marking this high priority because we are planning to move to slurm 17.02 in the next couple days.

Last evening slurmctld aborted.


(gdb) bt
#0  0x00002b742ecd40c7 in raise () from /lib64/libc.so.6
#1  0x00002b742ecd5478 in abort () from /lib64/libc.so.6
#2  0x00002b742eccd146 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00002b742eccd1f2 in __assert_fail () from /lib64/libc.so.6
#4  0x00000000005295de in bit_fmt (str=0x2b7441583920 "", len=128, b=0x0) at bitstring.c:1000
#5  0x000000000062910b in gres_build_job_details (job_gres_list=0xaf5e70, gres_detail_cnt=0x2b74d404ea30, gres_detail_str=0x2b74d404ea38) at gres.c:6482
#6  0x000000000047160a in _pack_job_gres (dump_job_ptr=0x2b74d404e920, buffer=0x2b74cc006970, protocol_version=7936) at job_mgr.c:8645
#7  0x00000000004726de in pack_job (dump_job_ptr=0x2b74d404e920, show_flags=2, buffer=0x2b74cc006970, protocol_version=7936, uid=70817) at job_mgr.c:8817
#8  0x000000000047128b in pack_all_jobs (buffer_ptr=0x2b7441593c70, buffer_size=0x2b7441593c60, show_flags=2, uid=70817, filter_uid=4294967294, 
    protocol_version=7936) at job_mgr.c:8534
#9  0x00000000004bd145 in _slurm_rpc_dump_jobs (msg=0x2b7441593eb0) at proc_req.c:1338
#10 0x00000000004b9e70 in slurmctld_req (msg=0x2b7441593eb0, arg=0x2b74480008f0) at proc_req.c:314
#11 0x00000000004478c1 in _service_connection (arg=0x2b74480008f0) at controller.c:1116
#12 0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#13 0x00002b742ed8402d in clone () from /lib64/libc.so.6
(gdb) thread apply all bt

Thread 17 (Thread 0x2b7440a7e700 (LWP 17777)):
#0  0x00002b742ea8e3e8 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b7440776788 in bb_sleep (state_ptr=0x2b744097d8c0 <bb_state>, add_secs=30) at burst_buffer_common.c:952
#2  0x00002b74407623c2 in _bb_agent (args=0x0) at burst_buffer_cray.c:424
#3  0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00002b742ed8402d in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x2b7440b7f700 (LWP 17784)):
#0  0x00002b742ed7d293 in select () from /lib64/libc.so.6
#1  0x00000000004473f7 in _slurmctld_rpc_mgr (no_data=0x0) at controller.c:1008
#2  0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b742ed8402d in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x2b7435622700 (LWP 17570)):
#0  0x00002b742ed560dd in nanosleep () from /lib64/libc.so.6
#1  0x00002b742ed55f74 in sleep () from /lib64/libc.so.6
#2  0x00002b7435319107 in _set_db_inx_thread (no_data=0x0) at accounting_storage_slurmdbd.c:426
#3  0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00002b742ed8402d in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x2b743635d700 (LWP 17656)):
#0  0x00002b742ea8e03f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b7436058ea3 in _script_agent (args=0x0) at jobcomp_nersc.c:163
#2  0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b742ed8402d in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x2b7435a28700 (LWP 17574)):
#0  0x00002b742ea8e3e8 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005dd907 in _agent (x=0x0) at slurmdbd_defs.c:1963
#2  0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b742ed8402d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x2b7435314700 (LWP 17569)):
#0  0x00002b742ed560dd in nanosleep () from /lib64/libc.so.6
#1  0x00002b742ed55f74 in sleep () from /lib64/libc.so.6
#2  0x00002b743457861f in _lease_extender (args=0x0) at cookies.c:350
#3  0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00002b742ed8402d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x2b7441493700 (LWP 17789)):
#0  0x00002b742ed560dd in nanosleep () from /lib64/libc.so.6
#1  0x00002b742ed55f74 in sleep () from /lib64/libc.so.6
---Type <return> to continue, or q <return> to quit--- 
#2  0x00000000004b781e in _init_power_save (arg=0x0) at power_save.c:741
#3  0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00002b742ed8402d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x2b744075c700 (LWP 17659)):
#0  0x00002b742ed7bbfd in poll () from /lib64/libc.so.6
#1  0x00002b742fc9531b in _aeld_event_loop (args=0x0) at select_cray.c:519
#2  0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b742ed8402d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x2b743666b700 (LWP 17657)):
#0  0x00002b742ea8e3e8 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b74363628a3 in _my_sleep (usec=120000000) at backfill.c:495
#2  0x00002b7436363576 in backfill_agent (args=0x0) at backfill.c:804
#3  0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00002b742ed8402d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x2b7435723700 (LWP 17571)):
#0  0x00002b742ea8b4c2 in pthread_join () from /lib64/libpthread.so.0
#1  0x00002b743531912c in _cleanup_thread (no_data=0x0) at accounting_storage_slurmdbd.c:434
#2  0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b742ed8402d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x2b742e78b700 (LWP 17568)):
#0  0x00002b742ed560dd in nanosleep () from /lib64/libc.so.6
#1  0x00002b742ed55f74 in sleep () from /lib64/libc.so.6
#2  0x00002b7434163330 in _queue_agent (args=0x0) at node_features_knl_cray.c:1967
#3  0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00002b742ed8402d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x2b7441291700 (LWP 17787)):
#0  0x00002b742ea915a9 in do_sigwait () from /lib64/libpthread.so.0
#1  0x00002b742ea91633 in sigwait () from /lib64/libpthread.so.0
#2  0x0000000000446d23 in _slurmctld_signal_hand (no_data=0x0) at controller.c:870
#3  0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00002b742ed8402d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x2b7440f8e700 (LWP 17786)):
#0  0x00002b742ea8b4c2 in pthread_join () from /lib64/libpthread.so.0
#1  0x00002b7440b86e3d in _cleanup_thread (no_data=0x0) at priority_multifactor.c:1404
#2  0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b742ed8402d in clone () from /lib64/libc.so.6

---Type <return> to continue, or q <return> to quit---
Thread 4 (Thread 0x2b742e689800 (LWP 17565)):
#0  0x00002b742ed560dd in nanosleep () from /lib64/libc.so.6
#1  0x00002b742ed7d9b4 in usleep () from /lib64/libc.so.6
#2  0x000000000044923d in _slurmctld_background (no_data=0x0) at controller.c:1741
#3  0x0000000000446635 in main (argc=1, argv=0x7ffc5b38e198) at controller.c:605

Thread 3 (Thread 0x2b7440e8d700 (LWP 17785)):
#0  0x00002b742ed560dd in nanosleep () from /lib64/libc.so.6
#1  0x00002b742ed55f74 in sleep () from /lib64/libc.so.6
#2  0x00002b7440b86cfc in _decay_thread (no_data=0x0) at priority_multifactor.c:1350
#3  0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#4  0x00002b742ed8402d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x2b7441392700 (LWP 17788)):
#0  0x00002b742ea8e3e8 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004e525b in slurmctld_state_save (no_data=0x0) at state_save.c:201
#2  0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#3  0x00002b742ed8402d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2b7441594700 (LWP 32118)):
#0  0x00002b742ecd40c7 in raise () from /lib64/libc.so.6
#1  0x00002b742ecd5478 in abort () from /lib64/libc.so.6
#2  0x00002b742eccd146 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00002b742eccd1f2 in __assert_fail () from /lib64/libc.so.6
#4  0x00000000005295de in bit_fmt (str=0x2b7441583920 "", len=128, b=0x0) at bitstring.c:1000
#5  0x000000000062910b in gres_build_job_details (job_gres_list=0xaf5e70, gres_detail_cnt=0x2b74d404ea30, gres_detail_str=0x2b74d404ea38) at gres.c:6482
#6  0x000000000047160a in _pack_job_gres (dump_job_ptr=0x2b74d404e920, buffer=0x2b74cc006970, protocol_version=7936) at job_mgr.c:8645
#7  0x00000000004726de in pack_job (dump_job_ptr=0x2b74d404e920, show_flags=2, buffer=0x2b74cc006970, protocol_version=7936, uid=70817) at job_mgr.c:8817
#8  0x000000000047128b in pack_all_jobs (buffer_ptr=0x2b7441593c70, buffer_size=0x2b7441593c60, show_flags=2, uid=70817, filter_uid=4294967294, 
    protocol_version=7936) at job_mgr.c:8534
#9  0x00000000004bd145 in _slurm_rpc_dump_jobs (msg=0x2b7441593eb0) at proc_req.c:1338
#10 0x00000000004b9e70 in slurmctld_req (msg=0x2b7441593eb0, arg=0x2b74480008f0) at proc_req.c:314
#11 0x00000000004478c1 in _service_connection (arg=0x2b74480008f0) at controller.c:1116
#12 0x00002b742ea8a0a4 in start_thread () from /lib64/libpthread.so.0
#13 0x00002b742ed8402d in clone () from /lib64/libc.so.6
(gdb) 




Will give more detail shortly.
-Doug
Comment 1 Doug Jacobsen 2017-02-28 09:00:40 MST
(gdb) print *dump_job_ptr
$1 = {account = 0x2b74d40493a0 "nstaff", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x2b74d40924b0 "gert01", alloc_resp_port = 0, 
  alloc_sid = 22360, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 3340, assoc_ptr = 0x967130, batch_flag = 1, 
  batch_host = 0x2b743c002100 "nid00034", billable_tres = 4294967294, bit_flags = 0, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, 
  ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 1088, cr_enabled = 1, db_index = 0, deadline = 0, delay_boot = 0, 
  derived_ec = 0, details = 0x2b74d404ec90, direct_set_prio = 0, end_time = 1488246784, end_time_exp = 1488246784, epilog_running = false, exit_code = 0, 
  fed_details = 0x0, front_end_ptr = 0x0, gres = 0x2b74d404eea0 "craynetwork:1", gres_list = 0xaf5e70, 
  gres_alloc = 0x2b743c004dd0 "craynetwork:16,7168616:0", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x2b743c004f00 "craynetwork:4", 
  gres_used = 0x0, group_id = 70817, job_id = 18972, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x2b743c002bb0, 
  job_state = 1, kill_on_node_fail = 1, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x2b74d404e540}, mail_type = 0, 
  mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, name = 0x2b74d404ee10 "test-suite.q", network = 0x0, next_step_id = 1, 
  nodes = 0x2b743c0020d0 "nid000[34-37]", node_addr = 0x2b743c004e30, node_bitmap = 0x2b743c0021f0, node_bitmap_cg = 0x0, node_cnt = 4, node_cnt_wag = 4, 
  nodes_completing = 0x0, other_port = 0, pack_leader = 0, partition = 0x2b74d404e5b0 "knl_reboot", part_ptr_list = 0x0, part_nodes_missing = false, 
  part_ptr = 0xb8f130, pelog_env = 0x0, pelog_env_size = 0, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, 
  priority = 1, priority_array = 0x0, prio_factors = 0x2b74d404dec0, profile = 0, qos_id = 21, qos_ptr = 0x946350, reboot = 0 '\000', restart_cnt = 0, 
  resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x2b74d404ef30, 
  spank_job_env = 0x2b74d4060970, spank_job_env_size = 1, start_protocol_ver = 7936, start_time = 1488244984, state_desc = 0x0, state_reason = 0, 
  state_reason_prev = 0, step_list = 0xb89050, suspend_time = 0, time_last_active = 1488244987, time_limit = 30, time_min = 0, tot_sus_time = 0, 
  total_cpus = 1088, total_nodes = 4, tres_req_cnt = 0x2b74d404e570, tres_req_str = 0x2b74d404d740 "1=4,4=4", 
  tres_fmt_req_str = 0x2b74d404ee40 "cpu=4,node=4", tres_alloc_cnt = 0x2b743c002370, tres_alloc_str = 0x2b743c0023b0 "1=1088,2=368640,4=4", 
  tres_fmt_alloc_str = 0x2b743c004e90 "cpu=1088,mem=360G,node=4", user_id = 70817, wait_all_nodes = 1, warn_flags = 0, warn_signal = 0, warn_time = 0, 
  wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
(gdb) 



(gdb) print *(gres_state_t *)(dump_job_ptr.gres_list->head->data)
$6 = {plugin_id = 4047587904, gres_data = 0x2b74d404e5e0}

(gdb) print *(gres_job_state_t *)((gres_state_t *)(dump_job_ptr.gres_list->head->data)).gres_data
$11 = {type_model = 0x0, gres_cnt_alloc = 1, node_cnt = 4, gres_bit_alloc = 0x2b743c002980, gres_bit_step_alloc = 0x0, gres_cnt_step_alloc = 0x2b743c002940, 
  gres_name_type_id = 4047587904}
(gdb)
Comment 2 Tim Wickberg 2017-02-28 09:13:33 MST
Dominik's looking into this now.

One warning, although unrelated to this crash - there's a deadlock in _slurm_rpc_dump_conf() that may lead to a 17.02.1 release soon; I'll update you once a patch is available for that.
Comment 3 Doug Jacobsen 2017-02-28 09:15:03 MST
Created attachment 4122 [details]
slurmctld log
Comment 4 Doug Jacobsen 2017-02-28 09:16:25 MST
As long as we have the commit ids for the recommended changes, I'll be fine with a patched build.
Comment 5 Tim Wickberg 2017-02-28 09:18:11 MST
(In reply to Doug Jacobsen from comment #4)
> As long as we have the commit ids for the recommended changes, I'll be fine
> with a patched build.

I assumed as much, I'll send them over when they land. Just didn't want you to trip into it and spend too much time collecting data for that - we already have two backtraces pinpointing the exact cause, and should have the patch out shortly.
Comment 6 Doug Jacobsen 2017-02-28 09:24:10 MST
As you can guess (from crash frame #5):

(gdb) print *job_gres_data                
$12 = {type_model = 0x0, gres_cnt_alloc = 1, node_cnt = 4, gres_bit_alloc = 0x2b743c002980, gres_bit_step_alloc = 0x0, gres_cnt_step_alloc = 0x2b743c002940, 
  gres_name_type_id = 4047587904}
(gdb) print *job_gres_data->gres_bit_alloc
$13 = (bitstr_t *) 0x0
(gdb) print i
$14 = 0
Comment 7 Tim Wickberg 2017-02-28 09:43:42 MST
The deadlock fix is commit 0c5e35089c9ea. Commit b17e2aee851 would be nice to add as well.
Comment 8 Doug Jacobsen 2017-02-28 11:28:23 MST
OK, I have those integrated into my build and running on gerty now.  I'm about to lose gerty for 1.5 days (or more) owing to the hardware upgrade.  Will try to grab the core file and RPMs for this bug in case there are further questions.
Comment 11 Dominik Bartkiewicz 2017-03-01 04:43:03 MST
Yesterday we prepared few patches that should prevent from
this kind of problems.

58a2f4501e
f7a2428503
2c3b7ad735

Dominik
Comment 12 Doug Jacobsen 2017-03-01 04:57:42 MST
Great, should I include them in my production build?  They seem to be
committed onto the 17.02 branch, so I suppose so.

Doug

On Mar 1, 2017 3:48 AM, <bugs@schedmd.com> wrote:

> *Comment # 11 <https://bugs.schedmd.com/show_bug.cgi?id=3515#c11> on bug
> 3515 <https://bugs.schedmd.com/show_bug.cgi?id=3515> from Dominik
> Bartkiewicz <bart@schedmd.com> *
>
> Yesterday we prepared few patches that should prevent from
> this kind of problems.
>
> 58a2f4501e
> f7a2428503
> 2c3b7ad735
>
> Dominik
>
> ------------------------------
> You are receiving this mail because:
>
>    - You reported the bug.
>
>
Comment 13 Dominik Bartkiewicz 2017-03-09 06:55:20 MST
Hi

Sorry for late response.
Yes, that commits are in slurm-17-02-1-2.
Can we close this ticket ?

Dominik
Comment 14 Doug Jacobsen 2017-03-09 07:02:00 MST
Sure, thanks so much.

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________


On Thu, Mar 9, 2017 at 5:55 AM, <bugs@schedmd.com> wrote:

> *Comment # 13 <https://bugs.schedmd.com/show_bug.cgi?id=3515#c13> on bug
> 3515 <https://bugs.schedmd.com/show_bug.cgi?id=3515> from Dominik
> Bartkiewicz <bart@schedmd.com> *
>
> Hi
>
> Sorry for late response.
> Yes, that commits are in slurm-17-02-1-2.
> Can we close this ticket ?
>
> Dominik
>
> ------------------------------
> You are receiving this mail because:
>
>    - You reported the bug.
>
>