Ticket 11019

Summary: Slurmctld crashing after upgrading to 20.11.4
Product: Slurm
Reporter: Steve Ford <fordste5>
Component: slurmctld
Assignee: Director of Support <support>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
CC: alex, bart, marshall, mcmullan
Version: 20.11.4
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=12451
Site: MSU
Version Fixed: 20.11.5
Attachments: thread apply all bt
slurm.conf

Description Steve Ford 2021-03-05 04:58:24 MST
Created attachment 18260 [details]
thread apply all bt

Hello,

After updating our SLURM server to 20.11.4, we are seeing frequent crashes of the slurmctld daemon. Attached is the output of `thread apply all bt` for the latest crash.

Thanks,
Steve
Comment 1 Steve Ford 2021-03-05 05:02:01 MST
Created attachment 18261 [details]
slurm.conf
Comment 4 Steve Ford 2021-03-05 05:15:48 MST
There are similar thread 1 traces from other dumps. They reference different nodes, but have the same jobid: 15435906.

Thread 1 (Thread 0x7f375f6f6700 (LWP 39487)):
#0  _next_month (tm=0x7f375f6f5b50, entry=0x0) at cron.c:193
#1  calc_next_cron_start (entry=0x0) at cron.c:310
#2  0x0000000000457311 in job_hold_requeue (job_ptr=job_ptr@entry=0x3e531d0) at job_mgr.c:17740
#3  0x0000000000470ce9 in cleanup_completing (job_ptr=job_ptr@entry=0x3e531d0) at job_scheduler.c:4960
#4  0x0000000000477792 in make_node_idle (node_ptr=0x363f1a8, job_ptr=job_ptr@entry=0x3e531d0) at node_mgr.c:3933
#5  0x000000000045051d in job_epilog_complete (job_id=15435906, node_name=0x7f3760000c50 "css-039", return_code=0) at job_mgr.c:15411
#6  0x000000000049593a in _slurm_rpc_epilog_complete (msg=0x7f3760001870) at proc_req.c:1933
#7  0x000000000049b730 in slurmctld_req (msg=msg@entry=0x7f3760001870) at proc_req.c:6501
#8  0x000000000042aec5 in _service_connection (arg=0x0) at controller.c:1270
#9  0x00007f37ae9a5e25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f37ae6cfbad in clone () from /lib64/libc.so.6

---

Thread 1 (Thread 0x7f0c90fcf700 (LWP 22547)):
#0  _next_month (tm=0x7f0c90fceb50, entry=0x0) at cron.c:193
#1  calc_next_cron_start (entry=0x0) at cron.c:310
#2  0x0000000000457311 in job_hold_requeue (job_ptr=job_ptr@entry=0x2ca51d0) at job_mgr.c:17740
#3  0x0000000000470ce9 in cleanup_completing (job_ptr=job_ptr@entry=0x2ca51d0) at job_scheduler.c:4960
#4  0x0000000000477792 in make_node_idle (node_ptr=0x24911a8, job_ptr=job_ptr@entry=0x2ca51d0) at node_mgr.c:3933
#5  0x000000000045051d in job_epilog_complete (job_id=15435906, node_name=0x7f0c40000b40 "css-039", return_code=0) at job_mgr.c:15411
#6  0x000000000049593a in _slurm_rpc_epilog_complete (msg=0x7f0c40000bc0) at proc_req.c:1933
#7  0x000000000049b730 in slurmctld_req (msg=msg@entry=0x7f0c40000bc0) at proc_req.c:6501
#8  0x000000000042aec5 in _service_connection (arg=0x0) at controller.c:1270
#9  0x00007f0caad53e25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f0caaa7dbad in clone () from /lib64/libc.so.6

---

Thread 1 (Thread 0x7fa216cac700 (LWP 21031)):
#0  _next_month (tm=0x7fa216cabb50, entry=0x0) at cron.c:193
#1  calc_next_cron_start (entry=0x0) at cron.c:310
#2  0x0000000000457311 in job_hold_requeue (job_ptr=job_ptr@entry=0x2ea71d0) at job_mgr.c:17740
#3  0x0000000000470ce9 in cleanup_completing (job_ptr=job_ptr@entry=0x2ea71d0) at job_scheduler.c:4960
#4  0x0000000000477792 in make_node_idle (node_ptr=0x26931a8, job_ptr=job_ptr@entry=0x2ea71d0) at node_mgr.c:3933
#5  0x000000000045051d in job_epilog_complete (job_id=15435906, node_name=0x7fa208009670 "css-039", return_code=0) at job_mgr.c:15411
#6  0x000000000049593a in _slurm_rpc_epilog_complete (msg=0x7fa208009760) at proc_req.c:1933
#7  0x000000000049b730 in slurmctld_req (msg=msg@entry=0x7fa208009760) at proc_req.c:6501
#8  0x000000000042aec5 in _service_connection (arg=0x0) at controller.c:1270
#9  0x00007fa22f57ae25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fa22f2a4bad in clone () from /lib64/libc.so.6

---

Thread 1 (Thread 0x7f70f65a5700 (LWP 20537)):
#0  _next_month (tm=0x7f70f65a4b50, entry=0x0) at cron.c:193
#1  calc_next_cron_start (entry=0x0) at cron.c:310
#2  0x0000000000457311 in job_hold_requeue (job_ptr=job_ptr@entry=0x2955190) at job_mgr.c:17740
#3  0x0000000000470ce9 in cleanup_completing (job_ptr=job_ptr@entry=0x2955190) at job_scheduler.c:4960
#4  0x0000000000477792 in make_node_idle (node_ptr=0x21411a8, job_ptr=job_ptr@entry=0x2955190) at node_mgr.c:3933
#5  0x000000000045051d in job_epilog_complete (job_id=15435906, node_name=0x7f709c52e460 "css-039", return_code=0) at job_mgr.c:15411
#6  0x000000000049593a in _slurm_rpc_epilog_complete (msg=0x7f709c46f360) at proc_req.c:1933
#7  0x000000000049b730 in slurmctld_req (msg=msg@entry=0x7f709c46f360) at proc_req.c:6501
#8  0x000000000042aec5 in _service_connection (arg=0x0) at controller.c:1270
#9  0x00007f7112b80e25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f71128aabad in clone () from /lib64/libc.so.6

---

Thread 1 (Thread 0x7faee48c8700 (LWP 13476)):
#0  _next_month (tm=0x7faee48c7b50, entry=0x0) at cron.c:193
#1  calc_next_cron_start (entry=0x0) at cron.c:310
#2  0x0000000000457311 in job_hold_requeue (job_ptr=job_ptr@entry=0x439c1c0) at job_mgr.c:17740
#3  0x0000000000470ce9 in cleanup_completing (job_ptr=job_ptr@entry=0x439c1c0) at job_scheduler.c:4960
#4  0x0000000000477792 in make_node_idle (node_ptr=0x3b975b8, job_ptr=job_ptr@entry=0x439c1c0) at node_mgr.c:3933
#5  0x000000000045051d in job_epilog_complete (job_id=15435906, node_name=0x7faf14600020 "lac-041", return_code=0) at job_mgr.c:15411
#6  0x000000000049593a in _slurm_rpc_epilog_complete (msg=0x7faf14449170) at proc_req.c:1933
#7  0x000000000049b730 in slurmctld_req (msg=msg@entry=0x7faf14449170) at proc_req.c:6501
#8  0x000000000042aec5 in _service_connection (arg=0x0) at controller.c:1270
#9  0x00007faf78a72e25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007faf7879cbad in clone () from /lib64/libc.so.6

---

Thread 1 (Thread 0x7f0928282700 (LWP 4752)):
#0  _next_month (tm=0x7f0928281b50, entry=0x0) at cron.c:193
#1  calc_next_cron_start (entry=0x0) at cron.c:310
#2  0x0000000000457311 in job_hold_requeue (job_ptr=job_ptr@entry=0x2e03190) at job_mgr.c:17740
#3  0x0000000000470ce9 in cleanup_completing (job_ptr=job_ptr@entry=0x2e03190) at job_scheduler.c:4960
#4  0x0000000000477792 in make_node_idle (node_ptr=0x25fe5b8, job_ptr=job_ptr@entry=0x2e03190) at node_mgr.c:3933
#5  0x000000000045051d in job_epilog_complete (job_id=15435906, node_name=0x7f09000de810 "lac-041", return_code=0) at job_mgr.c:15411
#6  0x000000000049593a in _slurm_rpc_epilog_complete (msg=0x7f090000aef0) at proc_req.c:1933
#7  0x000000000049b730 in slurmctld_req (msg=msg@entry=0x7f090000aef0) at proc_req.c:6501
#8  0x000000000042aec5 in _service_connection (arg=0x0) at controller.c:1270
#9  0x00007f0945a88e25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f09457b2bad in clone () from /lib64/libc.so.6

Foreground logs before a crash also implicate this job:

slurmctld: debug2: node_did_resp amr-162
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
slurmctld: cleanup_completing: JobId=15435906 completion process took 24652 seconds
Segmentation fault
Comment 5 Steve Ford 2021-03-05 08:20:45 MST
Digging a little further, it looks like this block on line 17736 in job_mgr.c is where things go awry. The job's bit_flags pass the check identifying it as a cron job, but then a null crontab_entry gets passed to calc_next_cron_start.

  /* handle crontab jobs */
  if (job_ptr->bit_flags & CRON_JOB) {
    job_ptr->job_state |= JOB_REQUEUE;
    job_ptr->details->begin_time =
      calc_next_cron_start(job_ptr->details->crontab_entry);
  }


from frame 2 in gdb:

p job_ptr->bit_flags 
$9 = 516163584

p job_ptr->details->crontab_entry 
$10 = (cron_entry_t *) 0x0

I'm thinking a check could be added here that aborts the requeue of the cron job if crontab_entry is null.


  /* handle crontab jobs */
  if ((job_ptr->bit_flags & CRON_JOB) && job_ptr->details->crontab_entry) {
    job_ptr->job_state |= JOB_REQUEUE;
    job_ptr->details->begin_time =
      calc_next_cron_start(job_ptr->details->crontab_entry);
  }

I'm going to go ahead and compile this. Let me know if you have any other thoughts about what might fix this issue.

Thanks,
Steve
Comment 6 Jason Booth 2021-03-05 08:59:46 MST
Hi Steve, we are looking into this for you and will update you soon.
Comment 7 Jason Booth 2021-03-05 09:15:42 MST
That workaround looks good, at least until we find out why crontab_entry ended up null. I have a few engineers looking into the root cause in the meantime.
Comment 8 Dominik Bartkiewicz 2021-03-05 09:19:36 MST
Hi,

Could you send us additional gdb output from the following commands?
t 1
f 2
p *job_ptr
p *job_ptr->details

Dominik
Comment 9 Steve Ford 2021-03-05 09:22:20 MST
Hello Jason,

We're back online with that patch. I imagine the worst this might do is prevent scrontab jobs from working properly, but that would be a fair tradeoff for having the scheduler online.

Thanks,
Steve
Comment 10 Steve Ford 2021-03-05 09:27:37 MST
Hello Dominik,

(gdb) print *job_ptr
$1 = {
  magic = 4038539564, 
  account = 0x34866f0 "XXXXX", 
  admin_comment = 0x0, 
  alias_list = 0x0, 
  alloc_node = 0x34866c0 "XXXXX", 
  alloc_resp_port = 0, 
  alloc_sid = 3765, 
  array_job_id = 0, 
  array_task_id = 4294967294, 
  array_recs = 0x0, 
  assoc_id = 5516, 
  assoc_ptr = 0x1f7f840, 
  batch_features = 0x0, 
  batch_flag = 3, 
  batch_host = 0x86a63a0 "css-039", 
  billable_tres = 1245.184, 
  bit_flags = 516163584, 
  burst_buffer = 0x0, 
  burst_buffer_state = 0x0, 
  clusters = 0x0, 
  comment = 0x3486720 "stdout=XXXXX", 
  cpu_cnt = 4, 
  cpus_per_tres = 0x0, 
  cr_enabled = 0, 
  db_flags = 4, 
  db_index = 1112966738, 
  deadline = 0, 
  delay_boot = 0, 
  derived_ec = 0, 
  details = 0x3485dc0, 
  direct_set_prio = 0, 
  end_time = 1614921755, 
  end_time_exp = 4294967294, 
  epilog_running = false, 
  exit_code = 0, 
  fed_details = 0x0, 
  front_end_ptr = 0x0, 
  gres_list = 0x0, 
  gres_detail_cnt = 0, 
  gres_detail_str = 0x0, 
  gres_used = 0x0, 
  group_id = 2156, 
  het_details = 0x0, 
  het_job_id = 0, 
  het_job_id_set = 0x0, 
  het_job_offset = 0, 
  het_job_list = 0x0, 
  job_id = 15435906, 
  job_next = 0x0, 
  job_array_next_j = 0x0, 
  job_array_next_t = 0x0, 
  job_preempt_comp = 0x0, 
  job_resrcs = 0x34867f0, 
  job_state = 1027, 
  kill_on_node_fail = 1, 
  last_sched_eval = 1614921750, 
  licenses = 0x0, 
  license_list = 0x0, 
  limit_set = {
    qos = 0, 
    time = 0, 
    tres = 0x3486040
  }, 
  mail_type = 0, 
  mail_user = 0x3486790 "XXXXX", 
  mem_per_tres = 0x0, 
  mcs_label = 0x0, 
  name = 0x341fc90 "HTML", 
  network = 0x0, 
  next_step_id = 0, 
  nodes = 0x3451bb0 "css-039", 
  node_addr = 0x85a8310, 
  node_bitmap = 0x8644c60, 
  node_bitmap_cg = 0x869c9f0, 
  node_cnt = 0, 
  node_cnt_wag = 0, 
  nodes_completing = 0x347fb10 "css-039", 
  origin_cluster = 0x0, 
  other_port = 0, 
  partition = 0x3486530 "XXXXX,general-short", 
  part_ptr_list = 0x3485930, 
  part_nodes_missing = false, 
  part_ptr = 0x2bfd480, 
  power_flags = 0 '\000', 
  pre_sus_time = 0, 
  preempt_time = 0, 
  preempt_in_progress = false, 
  prep_epilog_cnt = 0, 
  prep_prolog_cnt = 0, 
  prep_prolog_failed = false, 
  priority = 62889, 
  priority_array = 0x0, 
  prio_factors = 0x3485f90, 
  profile = 0, 
  qos_id = 1, 
  qos_ptr = 0x0, 
  qos_blocking_ptr = 0x0, 
  reboot = 0 '\000', 
  restart_cnt = 2, 
  resize_time = 0, 
  resv_id = 0, 
  resv_list = 0x0, 
  resv_name = 0x0, 
  resv_ptr = 0x0, 
  requid = 4294967295, 
  resp_host = 0x0, 
  sched_nodes = 0x0, 
  select_jobinfo = 0x34867c0, 
  site_factor = 2147483648, 
  spank_job_env = 0x0, 
  spank_job_env_size = 0, 
  start_protocol_ver = 8704, 
  start_time = 1614921750, 
  state_desc = 0x0, 
  state_reason = 0, 
  state_reason_prev_db = 36, 
  step_list = 0x3486480, 
  suspend_time = 0, 
  system_comment = 0x0, 
  time_last_active = 1614944043, 
  time_limit = 60, 
  time_min = 0, 
  tot_sus_time = 0, 
  total_cpus = 4, 
  total_nodes = 1, 
  tres_bind = 0x0, 
  tres_freq = 0x0, 
  tres_per_job = 0x0, 
  tres_per_node = 0x0, 
  tres_per_socket = 0x0, 
  tres_per_task = 0x0, 
  tres_req_cnt = 0x3487990, 
  tres_req_str = 0x3487550 "1=4,2=8192,4=1,5=1245", 
  tres_fmt_req_str = 0x3487c70 "cpu=4,mem=8G,node=1,billing=1245", 
  tres_alloc_cnt = 0x34875b0, 
  tres_alloc_str = 0x7f329c02f960 "1=4,2=8192,4=1,5=1245", 
  tres_fmt_alloc_str = 0x7f329c01a180 "cpu=4,mem=8G,node=1,billing=1245", 
  user_id = 926004, 
  user_name = 0x3486690 "XXXXX", 
  wait_all_nodes = 0, 
  warn_flags = 0, 
  warn_signal = 0, 
  warn_time = 0, 
  wckey = 0x0, 
  req_switch = 0, 
  wait4switch = 0, 
  best_switch = true, 
  wait4switch_start = 0
}
(gdb) print *job_ptr->details
$2 = {
  magic = 233473255, 
  acctg_freq = 0x0, 
  accrue_time = 1614921750, 
  argc = 1, 
  argv = 0x3486ae0, 
  begin_time = 1614921750, 
  cluster_features = 0x0, 
  contiguous = 0, 
  core_spec = 65534, 
  cpu_bind = 0x0, 
  cpu_bind_type = 65534, 
  cpu_freq_min = 4294967294, 
  cpu_freq_max = 4294967294, 
  cpu_freq_gov = 4294967294, 
  cpus_per_task = 4, 
  crontab_entry = 0x0, 
  orig_cpus_per_task = 4, 
  depend_list = 0x0, 
  dependency = 0x0, 
  orig_dependency = 0x3486940 "afterok:15435901,afterok:15435879,afterok:15435883,afterok:15435888,afterok:15435892,afterok:15435897", 
  env_cnt = 0, 
  env_sup = 0x0, 
  exc_node_bitmap = 0x0, 
  exc_nodes = 0x0, 
  expanding_jobid = 0, 
  extra = 0x0, 
  feature_list = 0x8cf85b0, 
  features = 0x3486570 "[intel14|intel16|intel18|amr|nvf]", 
  max_cpus = 4294967294, 
  orig_max_cpus = 4294967294, 
  max_nodes = 1, 
  mc_ptr = 0x3486ab0, 
  mem_bind = 0x0, 
  mem_bind_type = 0, 
  min_cpus = 4, 
  orig_min_cpus = 4, 
  min_gres_cpu = 0, 
  min_nodes = 1, 
  nice = 2147483648, 
  ntasks_per_node = 1, 
  ntasks_per_tres = 0, 
  num_tasks = 1, 
  open_mode = 0 '\000', 
  overcommit = 0 '\000', 
  plane_size = 0, 
  pn_min_cpus = 4, 
  orig_pn_min_cpus = 4, 
  pn_min_memory = 8192, 
  orig_pn_min_memory = 8192, 
  pn_min_tmp_disk = 0, 
  prolog_running = 0 '\000', 
  reserved_resources = 0, 
  req_node_bitmap = 0x0, 
  preempt_start_time = 0, 
  req_nodes = 0x0, 
  requeue = 1, 
  share_res = 254 '\376', 
  std_err = 0x0, 
  std_in = 0x34869c0 "/dev/null", 
  std_out = 0x34869f0 "XXXXX", 
  submit_time = 1614921629, 
  task_dist = 8192, 
  usable_nodes = 0, 
  whole_node = 0 '\000', 
  work_dir = 0x3486a60 "XXXXX", 
  x11 = 0, 
  x11_magic_cookie = 0x0, 
  x11_target = 0x0, 
  x11_target_port = 0
}
Comment 14 Jason Booth 2021-03-05 10:31:08 MST
>We're back online with that patch. I imagine the worst this might do is prevent scrontab jobs from working properly, but that would be a fair tradeoff for having the scheduler online.

Steve - I am going to downgrade this in the meantime. Feel free to change priority back if needed.
Comment 24 Steve Ford 2021-03-08 09:31:42 MST
Jason and Dominik,

FYI, we do not have scrontab enabled on our cluster.

Best,
Steve
Comment 26 Michael Hinton 2021-03-08 10:06:14 MST
(In reply to Steve Ford from comment #24)
> FYI, we do not have scrontab enabled on our cluster.
You don't need scrontab for this issue to happen. We removed the INVALID_DEPEND job flag from 19.05 and reused that spot in 20.11 for the CRON_JOB flag for scrontab. (Commit https://github.com/SchedMD/slurm/commit/4abf5423b4 removes INVALID_DEPEND and commit https://github.com/SchedMD/slurm/commit/5ff307dfdc reuses that spot for CRON_JOB.)
Comment 27 Michael Hinton 2021-03-08 10:28:58 MST
(In reply to Steve Ford from comment #0)
> After updating our SLURM server to 20.11.4, we are seeing frequent crashes
> of the slurmctld daemon.
What version did you upgrade from?
Comment 28 Michael Hinton 2021-03-08 10:33:57 MST
(In reply to Michael Hinton from comment #26)
> We removed the INVALID_DEPEND job flag from 19.05
Correction: that flag was last present in 19.05, and we removed it starting in 20.02.
Comment 29 Steve Ford 2021-03-08 10:40:31 MST
Michael,

We updated from 19.05.7.

Steve
Comment 35 Michael Hinton 2021-03-08 13:09:06 MST
Steve,

This issue has now been fixed in the following commits and will land in 20.11.5:
* https://github.com/SchedMD/slurm/commit/507508257f
* https://github.com/SchedMD/slurm/commit/2db5823105

I put you down as a co-author for the second commit, since you did the initial legwork to find the segfaulting spot and to develop the workaround.

I'll go ahead and close this out. If you have any more questions, let me know.

Thanks!
-Michael