Ticket 2922 - segfault in slurmctld while terminating job
Summary: segfault in slurmctld while terminating job
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 16.05.2
Hardware: Cray XC Linux
Severity: 2 - High Impact
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-07-20 04:48 MDT by Doug Jacobsen
Modified: 2016-07-24 22:04 MDT
CC: 1 user

See Also:
Site: NERSC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 16.05.3
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmctld log (56.39 MB, application/x-bzip2)
2016-07-20 04:48 MDT, Doug Jacobsen

Description Doug Jacobsen 2016-07-20 04:48:55 MDT
Created attachment 3338
slurmctld log

Hello,

This morning slurmctld segfaulted. The backtrace is distinct from earlier instances. Our slurmctld is currently built from commit 0c7bd6d024359e7de5146534e8f1cc150f28be38 on the 16.05 branch.

Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f6ec6b6551b in select_p_step_finish (step_ptr=0x7f6e746a62d0, killing_step=true) at select_cray.c:2251
2251	select_cray.c: No such file or directory.
Missing separate debuginfos, use: zypper install slurm-debuginfo-16.05.2-20160715210839head.x86_64
(gdb) bt
#0  0x00007f6ec6b6551b in select_p_step_finish (step_ptr=0x7f6e746a62d0, killing_step=true) at select_cray.c:2251
#1  0x0000000000537516 in select_g_step_finish (step_ptr=0x7f6e746a62d0, killing_step=true) at node_select.c:795
#2  0x00000000004a16a0 in re_kill_job (job_ptr=0x7f6e756ede60) at node_scheduler.c:3742
#3  0x0000000000467d07 in _list_find_job_old (job_entry=0x7f6e756ede60, key=0x649a3c) at job_mgr.c:8090
#4  0x000000000050c506 in list_delete_all (l=0x8d6990, f=0x467c6b <_list_find_job_old>, key=0x649a3c) at list.c:494
#5  0x000000000046e5af in purge_old_job () at job_mgr.c:9468
#6  0x0000000000447a6e in _slurmctld_background (no_data=0x0) at controller.c:1872
#7  0x0000000000444940 in main (argc=1, argv=0x7fffabb5aaa8) at controller.c:605
(gdb)


(gdb) print *step_ptr
$1 = {
  batch_step = 0,
  ckpt_interval = 0,
  check_job = 0x0,
  ckpt_dir = 0x0,
  ckpt_time = 0,
  core_bitmap_job = 0x0,
  cpu_count = 1,
  cpu_freq_min = 0,
  cpu_freq_max = 0,
  cpu_freq_gov = 0,
  cpus_per_task = 0,
  cyclic_alloc = 0,
  exclusive = 0,
  exit_code = 4294967294,
  exit_node_bitmap = 0x0,
  ext_sensors = 0x0,
  gres = 0x0,
  gres_list = 0x0,
  host = 0x7f6e75689cc0 "nid01574",
  job_ptr = 0x7f6e756ede60,
  jobacct = 0x7f6e74de13a0,
  pn_min_memory = 0,
  name = 0x0,
  network = 0x0,
  no_kill = 0 '\000',
  port = 60458,
  pre_sus_time = 0,
  start_protocol_ver = 7680,
  resv_port_array = 0x0,
  resv_port_cnt = 0,
  resv_ports = 0x0,
  requid = 4294967295,
  start_time = 0,
  time_limit = 4294967295,
  select_jobinfo = 0x0,
  state = 0,
  step_id = 4294967293,
  step_layout = 0x0,
  step_node_bitmap = 0x7f6e74dfcca0,
  switch_job = 0x0,
  time_last_active = 1468996336,
  tot_sus_time = 0,
  tres_alloc_str = 0x0,
  tres_fmt_alloc_str = 0x0
}
(gdb)



The specific crash was caused by a NULL-pointer dereference through step_ptr->select_jobinfo (note select_jobinfo = 0x0 in the dump above).

slurmctld logged the following before crashing:

[2016-07-20T00:02:13.859] debug:  Job 2758074.4294967293 has no nodes or app name, skipping

It looks like it was trying to kill a step with the new pending step id. It also looks like 2758074 was a burst-buffer job, so perhaps there's an interaction with the implicit stage-out steps.

The job was being abnormally terminated because it hit its time limit. It had submitted nearly 1000 numbered steps over its lifetime, not all of which failed...

-Doug
Comment 1 Doug Jacobsen 2016-07-20 04:49:47 MDT
(gdb) print *step_ptr->job_ptr
$2 = {
  account = 0x7f6e74e4c4e0 "mpccc",
  alias_list = 0x0,
  alloc_node = 0x7f6e754bc900 "cori02",
  alloc_resp_port = 0,
  alloc_sid = 45156,
  array_job_id = 0,
  array_task_id = 4294967294,
  array_recs = 0x0,
  assoc_id = 1124,
  assoc_ptr = 0x190bbc0,
  batch_flag = 1,
  batch_host = 0x7f6eb42eded0 "nid01574",
  bit_flags = 0,
  burst_buffer = 0x7f6e756ee200 "#DW jobdw access_mode=striped type=scratch capacity=539GiB",
  check_job = 0x0,
  ckpt_interval = 0,
  ckpt_time = 0,
  comment = 0x0,
  cpu_cnt = 64,
  billable_tres = 1024,
  cr_enabled = 1,
  deadline = 0,
  db_index = 197470884,
  derived_ec = 6,
  details = 0x7f6e74ffcdc0,
  direct_set_prio = 0,
  end_time = 1468997953,
  end_time_exp = 4294967294,
  epilog_running = false,
  exit_code = 1,
  front_end_ptr = 0x0,
  gres = 0x7f6e74ffcfd0 "craynetwork:1",
  gres_list = 0x4523d40,
  gres_alloc = 0x7f6eb48768d0 "craynetwork:64",
  gres_req = 0x7f6eb423ba60 "craynetwork:16",
  gres_used = 0x0,
  group_id = 68441,
  job_id = 2758074,
  job_next = 0x0,
  job_array_next_j = 0x0,
  job_array_next_t = 0x0,
  job_resrcs = 0x7f6eb484ea80,
  job_state = 32774,
  kill_on_node_fail = 1,
  licenses = 0x7f6e74ffccb0 "cscratch1:1",
  license_list = 0x4c09a70,
  limit_set = {
    qos = 0,
    time = 0,
    tres = 0x7f6e74ffcc80
---Type <return> to continue, or q <return> to quit---
  },
  mail_type = 0,
  mail_user = 0x0,
  magic = 4038539564,
  mcs_label = 0x0,
  name = 0x7f6e74093430 "job.20160719.195536.948663493.sh",
  network = 0x0,
  next_step_id = 994,
  nodes = 0x7f6eb48c97a0 "nid0[1574-1584,1948-1952]",
  node_addr = 0x7f6eb488ca40,
  node_bitmap = 0x7f6eb49e6c80,
  node_bitmap_cg = 0x4ddb5a0,
  node_cnt = 1,
  node_cnt_wag = 16,
  nodes_completing = 0x7f6e5c000ba0 "nid0[1574-1584,1948-1952]",
  other_port = 0,
  partition = 0x7f6e76d700b0 "regular",
  part_ptr_list = 0x0,
  part_nodes_missing = false,
  part_ptr = 0x2612600,
  power_flags = 0 '\000',
  pre_sus_time = 0,
  preempt_time = 0,
  preempt_in_progress = false,
  priority = 69120,
  priority_array = 0x0,
  prio_factors = 0x7f6e74821e00,
  profile = 0,
  qos_id = 69,
  qos_ptr = 0x908a40,
  reboot = 0 '\000',
  restart_cnt = 0,
  resize_time = 0,
  resv_id = 0,
  resv_name = 0x0,
  resv_ptr = 0x0,
  requid = 4294967295,
  resp_host = 0x0,
  sched_nodes = 0x0,
  select_jobinfo = 0x7f6e756ee260,
  spank_job_env = 0x7f6e7449a7b0,
  spank_job_env_size = 1,
  start_protocol_ver = 7680,
  start_time = 1468983537,
  state_desc = 0x0,
  state_reason = 24,
  state_reason_prev = 1,
  step_list = 0x4da4850,
  suspend_time = 0,
  time_last_active = 1468998133,
  time_limit = 240,
---Type <return> to continue, or q <return> to quit---
  time_min = 0,
  tot_sus_time = 5,
  total_cpus = 1024,
  total_nodes = 16,
  tres_req_cnt = 0x7f6e740aff60,
  tres_req_str = 0x7f6e74ffcf40 "1=16,4=16,6=614400",
  tres_fmt_req_str = 0x7f6e756ee1a0 "cpu=16,node=16,bb/cray=614400",
  tres_alloc_cnt = 0x7f6eb418d860,
  tres_alloc_str = 0x7f6eb48220f0 "1=1024,2=1998848,4=16,6=614400",
  tres_fmt_alloc_str = 0x7f6eb4a35190 "cpu=1024,mem=1.91T,node=16,bb/cray=600G",
  user_id = 68441,
  wait_all_nodes = 0,
  warn_flags = 0,
  warn_signal = 0,
  warn_time = 0,
  wckey = 0x0,
  req_switch = 0,
  wait4switch = 0,
  best_switch = true,
  wait4switch_start = 0
}
(gdb)
Comment 2 Doug Jacobsen 2016-07-20 04:54:56 MDT
I've reviewed this user's script before (the user is a staff member); it submits steps in a tight loop. Near the end of the job all the steps were calling MPI_Abort(), so it wouldn't surprise me if steps were queued to start when the time-limit cancellation came down.
Comment 5 Tim Wickberg 2016-07-20 09:09:35 MDT
Hey Doug -

As you surmised, this was caused by a pending step still being present as the job was terminated.

I have a patch ready that we're testing internally now - I want to verify the fix doesn't just move the issue somewhere else on you - but I can provide it now if you're concerned that this may recur today.

- Tim
Comment 8 Tim Wickberg 2016-07-20 12:07:27 MDT
Doug - the fix is committed as 3b914e5b4fd. This will be in 16.05.3 when released.

- Tim
Comment 9 Doug Jacobsen 2016-07-24 22:04:03 MDT
OK, this hit us again; unfortunately I hadn't updated Slurm since I left on vacation on the 20th. Anyway, the other ticket that Dave Paul filed this evening appears to have been a resurgence of this issue. I updated to the latest 16.05 HEAD and your fixes from the 20th appear to have corrected the situation.

Thanks!
Doug
