Created attachment 3641 [details]
Slurmctld log

From what I can tell, a job was requeued from a failed node (see the attached log at Oct 25 22:08:51.687175), which appears to suspend job 36765, perhaps the job in question; the backtrace shows a job_id of 0 for the bad job. In any case, the slurmctld continues on, attempting to signal a job that has already completed but is still part of the system. That job was running on the node that went bad. The log starts with the creation of this job; it may or may not be related. I will add the relevant backtraces after creation. The main one is:

#0  0x0000556235eb2c2d in add_job_to_cores (job_resrcs_ptr=0x0, full_core_bitmap=0x7f4488007b70, bits_per_node=0x7f448800e780) at ../../../../slurm/src/common/job_resources.c:1220
#1  0x0000556235ca3fa5 in _add_job_to_active (job_ptr=0x7f449c015710, p_ptr=0x7f4488007b40) at ../../../../slurm/src/slurmctld/gang.c:508
#2  0x0000556235ca4f1c in _update_active_row (p_ptr=0x7f4488007b40, add_new_jobs=1) at ../../../../slurm/src/slurmctld/gang.c:882
#3  0x0000556235ca4fc5 in _update_all_active_rows () at ../../../../slurm/src/slurmctld/gang.c:913
#4  0x0000556235ca6369 in gs_job_fini (job_ptr=0x7f44bc0016b0) at ../../../../slurm/src/slurmctld/gang.c:1360
#5  0x0000556235d43e27 in slurm_sched_g_freealloc (job_ptr=0x7f44bc0016b0) at ../../../../slurm/src/slurmctld/sched_plugin.c:215
#6  0x0000556235cee4be in cleanup_completing (job_ptr=0x7f44bc0016b0) at ../../../../slurm/src/slurmctld/job_scheduler.c:4197
#7  0x0000556235cff587 in make_node_idle (node_ptr=0x7f4488010238, job_ptr=0x7f44bc0016b0) at ../../../../slurm/src/slurmctld/node_mgr.c:3667
#8  0x0000556235cdb99c in job_epilog_complete (job_id=37035, node_name=0x7f449c015d60 "snowflake3", return_code=0) at ../../../../slurm/src/slurmctld/job_mgr.c:13194
#9  0x0000556235c9616b in _thread_per_group_rpc (args=0x7f44d0000f70) at ../../../../slurm/src/slurmctld/agent.c:963
#10 0x00007f44faba170a in start_thread (arg=0x7f44ec7e1700) at pthread_create.c:333
#11 0x00007f44fa8db0af in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105

It appears the job being looked at in frame 1 hasn't been fully cleaned up:

(gdb) print *job_ptr
$1 = {account = 0x0, alias_list = 0x0, alloc_node = 0x0,
  alloc_resp_port = 34147, alloc_sid = 17317, array_job_id = 0,
  array_task_id = 4294967294, array_recs = 0x0, assoc_id = 4,
  assoc_ptr = 0x556237698930, batch_flag = 0, batch_host = 0x0,
  bit_flags = 0, burst_buffer = 0x0, check_job = 0x0, ckpt_interval = 0,
  ckpt_time = 0, comment = 0x0, cpu_cnt = 0, billable_tres = 8,
  cr_enabled = 1, deadline = 0, db_index = 15956, derived_ec = 0,
  details = 0x0, direct_set_prio = 0, end_time = 1477439304,
  end_time_exp = 4294967294, epilog_running = false, exit_code = 0,
  front_end_ptr = 0x0, gres = 0x0, gres_list = 0x0, gres_alloc = 0x0,
  gres_req = 0x0, gres_used = 0x0, group_id = 7558, job_id = 0,
  job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0,
  job_resrcs = 0x0, job_state = 4, kill_on_node_fail = 1, licenses = 0x0,
  license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x0},
  mail_type = 0, mail_user = 0x0, magic = 0, mcs_label = 0x0, name = 0x0,
  network = 0x0, next_step_id = 1, nodes = 0x0, node_addr = 0x0,
  node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 0, node_cnt_wag = 4,
  nodes_completing = 0x0, other_port = 35113, partition = 0x0,
  part_ptr_list = 0x0, part_nodes_missing = false,
  part_ptr = 0x7f4488009380, power_flags = 0 '\000', pre_sus_time = 0,
  preempt_time = 0, preempt_in_progress = false, priority = 55051,
  priority_array = 0x0, prio_factors = 0x7f449c014700, profile = 0,
  qos_id = 1, qos_ptr = 0x5562376974a0, reboot = 0 '\000', restart_cnt = 0,
  resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0,
  requid = 7558, resp_host = 0x0, sched_nodes = 0x0,
  select_jobinfo = 0x7f449c016020, spank_job_env = 0x0,
  spank_job_env_size = 0, start_protocol_ver = 7680,
  start_time = 1477439183, state_desc = 0x0, state_reason = 0,
  state_reason_prev = 0, step_list = 0x0, suspend_time = 0,
  time_last_active = 1477454898, time_limit = 1, time_min = 0,
  tot_sus_time = 0, total_cpus = 8, total_nodes = 4, tres_req_cnt = 0x0,
  tres_req_str = 0x0, tres_fmt_req_str = 0x0, tres_alloc_cnt = 0x0,
  tres_alloc_str = 0x0, tres_fmt_alloc_str = 0x0, user_id = 7558,
  wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0,
  wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true,
  wait4switch_start = 0}

You can see it was canceled (job_state = 4), but the job_id has been zeroed out. I am wondering whether, when a job is canceled because of a failed node or the like, the proper cleanup does not happen. Job 36765 ran on snowflake[3-6], and snowflake5 was the node that went bad. Based on accounting, this was the last job that ran on the node.

A core dump can be found on snowflake at /home/da/slurm/16.05/snowflake/sbin/snowflake-agent-27871.core. Built on the 16.05 branch at 8dfa2e9e3c6a. Let me know if you need anything else. It doesn't seem that job should still be in the partition's gang job lists, though.
Created attachment 3642 [details]
slurm.conf

I believe the important parts are the gang scheduling settings. Note that the jobs were all running in the debug partition.
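For readers without access to the attachment, gang scheduling in a 16.05-era slurm.conf is typically enabled with settings along these lines. These are illustrative values only, not the contents of the attached file:

```
# Illustrative gang-scheduling configuration (not the attached slurm.conf)
SelectType=select/cons_res
SelectTypeParameters=CR_Core
PreemptMode=GANG,SUSPEND
SchedulerTimeSlice=30
PartitionName=debug Nodes=snowflake[3-6] Shared=FORCE:2 Default=YES
```

With PreemptMode=GANG, slurmctld keeps per-partition lists of runnable jobs and time-slices them, which is why a stale job_ptr left in those lists can crash later scheduling passes.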
Created attachment 3643 [details]
slurmd log

Here is the slurmd log from snowflake5. It appears it ran out of threads. Luckily the stepd was still hanging out, though it is hard to tell if it is all that useful. The threads are almost all stuck on:

#0  __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007ffbd777a3a8 in reused_arena (avoid_arena=0x0) at arena.c:809
#2  arena_get2 (size=size@entry=40, avoid_arena=avoid_arena@entry=0x0) at arena.c:880
#3  0x00007ffbd777fe9a in arena_get2 (avoid_arena=0x0, size=40) at malloc.c:2923
#4  __GI___libc_malloc (bytes=40) at malloc.c:2923
#5  0x00005568ee803672 in slurm_xmalloc (size=24, clear=false, file=0x5568ee9a8510 "../../../../slurm/src/common/pack.c", line=152, func=0x5568ee9a8509 "") at ../../../../slurm/src/common/xmalloc.c:85
#6  0x00005568ee81d875 in init_buf (size=16384) at ../../../../slurm/src/common/pack.c:152
#7  0x00005568ee7d0fa7 in _handle_accept (arg=0x0) at ../../../../../slurm/src/slurmd/slurmstepd/req.c:419
#8  0x00007ffbd7ac870a in start_thread (arg=0x7ffad33f3700) at pthread_create.c:333
#9  0x00007ffbd78020af in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105

Which makes you wonder how the malloc arenas all got locked up. Nearly every thread is stuck there; most likely they were all trying to send a SIGKILL to the step. The main thread was just waiting on I/O (joining the I/O thread):

#0  0x00007ffbd7ac99cd in pthread_join (threadid=140719549171456, thread_return=0x0) at pthread_join.c:90
#1  0x00005568ee7c41dd in _wait_for_io (job=0x5568efe82f50) at ../../../../../slurm/src/slurmd/slurmstepd/mgr.c:2160
#2  0x00005568ee7c23f3 in job_manager (job=0x5568efe82f50) at ../../../../../slurm/src/slurmd/slurmstepd/mgr.c:1336
#3  0x00005568ee7bd1a4 in main (argc=1, argv=0x7fff8213c528) at ../../../../../slurm/src/slurmd/slurmstepd/slurmstepd.c:163
I verified in the database that db_index = 15956 corresponds to job 36765. Luckily that field wasn't zeroed out.
This is fixed in commit 9c0a2f2bd9b99470. It turns out there were quite a few places that only partially cleaned up a "completing" job; this patch fixes all of those areas. In this particular case the job_ptr was being "completed" without ever ending gang scheduling for the job, which left the job_ptr in a few gang scheduling lists. License counts would most likely have been missed in these corner cases as well. All of the issues seem to stem from a node going down or becoming unresponsive while a job was running.