Ticket 5010 - Slurmctld Segfault
Summary: Slurmctld Segfault
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.11.3
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-03-31 23:40 MDT by Adam
Modified: 2019-08-27 09:59 MDT

See Also:
Site: Simon Fraser University


Attachments
quick patch to avoid segfault (448 bytes, patch)
2018-04-01 00:00 MDT, Tim Wickberg
slurmctld log during the issue, not after (missing those logs for some reason) (66.55 KB, text/plain)
2018-04-27 14:30 MDT, Adam
Slightly newer slurm.conf but should be fairly accurate (49.80 KB, text/x-matlab)
2018-04-27 14:38 MDT, Adam

Description Adam 2018-03-31 23:40:39 MDT
Slurmctld is segfaulting at the same spot and refuses to stay running for more than a minute.
Our 17.11.3 build is patched with the supplied fix for the sacct db vulnerability that went into 17.11.5.

It segfaults in the exact same spot each time.
Array job 6602860_715 (job ID 6603604) failed on node cdr388, but I do not see either of those job IDs in the logs for that node.

sacct shows that job, but I do not see it in any of the hash.# folders in /var/spool/slurmctld.

(gdb) bt
#0  _step_dealloc_lps (step_ptr=0x849fad0) at step_mgr.c:2081
#1  post_job_step (step_ptr=step_ptr@entry=0x849fad0) at step_mgr.c:4652
#2  0x00000000004a98fd in _post_job_step (step_ptr=0x849fad0) at step_mgr.c:266
#3  _internal_step_complete (job_ptr=job_ptr@entry=0x849efd0, step_ptr=step_ptr@entry=0x849fad0)
    at step_mgr.c:307
#4  0x00000000004a9a89 in job_step_complete (job_id=6603604, step_id=4294967295, uid=uid@entry=0,
    requeue=requeue@entry=false, job_return_code=<optimized out>) at step_mgr.c:853
#5  0x000000000048d9b9 in _slurm_rpc_step_complete (running_composite=false, msg=0x7ffff1343e80)
    at proc_req.c:3776
#6  slurmctld_req (msg=msg@entry=0x7ffff1343e80, arg=arg@entry=0x7fffd0001840) at proc_req.c:516
#7  0x000000000042478a in _service_connection (arg=0x7fffd0001840) at controller.c:1122
#8  0x00007ffff7601e25 in start_thread (arg=0x7ffff1344700) at pthread_create.c:308
#9  0x00007ffff732f34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

(gdb) frame 0
#0  _step_dealloc_lps (step_ptr=0x849fad0) at step_mgr.c:2081
2081            i_first = bit_ffs(job_resrcs_ptr->node_bitmap);
(gdb) print * job_ptr
$13 = {account = 0x849f5f0 "def-tgleeson_cpu", admin_comment = 0x0, alias_list = 0x0,
  alloc_node = 0x849f5d0 "cedar5", alloc_resp_port = 0, alloc_sid = 186132, array_job_id = 6602860,
  array_task_id = 715, array_recs = 0x0, assoc_id = 42165, assoc_ptr = 0xbb6f00, batch_flag = 3,
  batch_host = 0x849f690 "cdr388", billable_tres = 1, bit_flags = 2048, burst_buffer = 0x0,
  burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0,
  cpu_cnt = 1, cr_enabled = 0, db_index = 36363375, deadline = 0, delay_boot = 0, derived_ec = 0,
  details = 0x849f380, direct_set_prio = 0, end_time = 1522552699, end_time_exp = 4294967294,
  epilog_running = false, exit_code = 1, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0,
  gres_list = 0x0, gres_alloc = 0x849f620 "", gres_detail_cnt = 0, gres_detail_str = 0x0,
  gres_req = 0x849f640 "", gres_used = 0x0, group_id = 3066817, job_id = 6603604, job_next = 0x0,
  job_array_next_j = 0x849a070, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 5,
  kill_on_node_fail = 1, last_sched_eval = 1522531514, licenses = 0x0, license_list = 0x0, limit_set = {
    qos = 0, time = 0, tres = 0x8499cd0}, mail_type = 2, mail_user = 0x849f660 "samuelczipper@gmail.com",
  magic = 4038539564, mcs_label = 0x0, name = 0x849ada0 "mf", network = 0x0, next_step_id = 0, ngids = 0,
  nodes = 0x849ad80 "cdr388", node_addr = 0x0, node_bitmap = 0x8993ba0, node_bitmap_cg = 0x0, node_cnt = 0,
  node_cnt_wag = 0, nodes_completing = 0x0, origin_cluster = 0x0, other_port = 0, pack_job_id = 0,
  pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0,
  partition = 0x849ef80 "cpubase_bycore_b1,cpubackfill,c12hbackfill", part_ptr_list = 0x5333aa0,
  part_nodes_missing = false, part_ptr = 0x5429260, power_flags = 0 '\000', pre_sus_time = 0,
  preempt_time = 0, preempt_in_progress = false, priority = 467143, priority_array = 0x0,
  prio_factors = 0x849f520, profile = 0, qos_id = 1, qos_ptr = 0x0, qos_blocking_ptr = 0x0, reboot = 0 '\000',
  restart_cnt = 2, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295,
  resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x849f6b0, spank_job_env = 0x0, spank_job_env_size = 0,
  start_protocol_ver = 8192, start_time = 1522552699, state_desc = 0x0, state_reason = 21,
  state_reason_prev = 21, step_list = 0x5333af0, suspend_time = 0, time_last_active = 1522557472,
  time_limit = 2, time_min = 0, tot_sus_time = 0, total_cpus = 1, total_nodes = 1, tres_req_cnt = 0x84a4020,
  tres_req_str = 0x849fa70 "1=1,2=100,4=1", tres_fmt_req_str = 0x849ff50 "cpu=1,mem=100M,node=1",
  tres_alloc_cnt = 0x846c640, tres_alloc_str = 0x7fffe0002c00 "1=1,2=100,4=1,5=1",
  tres_fmt_alloc_str = 0x7fffe0002c90 "cpu=1,mem=100M,node=1,billing=1", user_id = 3066817,
  user_name = 0x849f5b0 "zipper", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0,
  wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
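
For what it's worth, the dump above shows job_resrcs = 0x0, which would make the job_resrcs_ptr->node_bitmap dereference at step_mgr.c:2081 a NULL-pointer access. Below is a minimal stand-alone sketch of that failure mode and the kind of guard that avoids it - stub types only, not the real step_mgr.c code and not the patch attached to this ticket:

/* Illustrative stand-ins for the real Slurm structures. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t *node_bitmap;               /* stand-in for Slurm's bitstr_t */
} job_resources_stub_t;

typedef struct {
    uint32_t job_id;
    job_resources_stub_t *job_resrcs;    /* NULL in the crashing job record */
} job_record_stub_t;

static void step_dealloc_lps_guarded(job_record_stub_t *job_ptr)
{
    job_resources_stub_t *job_resrcs_ptr = job_ptr->job_resrcs;

    /* Guard before touching node_bitmap; the unguarded dereference is
     * what frame #0 shows faulting. */
    if (!job_resrcs_ptr || !job_resrcs_ptr->node_bitmap) {
        fprintf(stderr, "job %u has no job_resrcs, skipping step dealloc\n",
                job_ptr->job_id);
        return;
    }

    /* normal path would scan node_bitmap here (bit_ffs() in the real code) */
}

int main(void)
{
    job_record_stub_t broken = { .job_id = 6603604, .job_resrcs = NULL };
    step_dealloc_lps_guarded(&broken);   /* logs a warning instead of segfaulting */
    return 0;
}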

Let me know what else I can provide to help.

Thanks,

Adam
Comment 1 Tim Wickberg 2018-04-01 00:00:35 MDT
Created attachment 6517 [details]
quick patch to avoid segfault

I'm attaching a quick patch that should avoid that segfault, although I would still like to understand how your system got into that state, and this may only move the next crash somewhere else.

When you get a chance, a current slurm.conf for the system, along with recent slurmctld logs, would be nice to have on hand. If you can also make sure to save that core file along with a copy of that slurmctld binary somewhere, we might have a few further questions it could answer in the future.

(Please do not attach it - the core file is useless without the corresponding binary and libraries which only you have.)
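
In case it becomes useful later, the saved core can be revisited at any point by loading it against the exact slurmctld binary that produced it (paths below are placeholders):

$ gdb /path/to/saved/slurmctld /path/to/saved/core
(gdb) bt full
(gdb) frame 0
(gdb) print *job_ptr
(gdb) print *step_ptr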
Comment 2 Tim Wickberg 2018-04-01 00:01:52 MDT
> Segfaults on the exact same spot.
> Array job: 6602860_715  Job_id: 6603604  failed on Node: cdr388 but I do not
> see either of those job ids in the logs for that node.

Did anything change with the Node definitions recently?

> sacct shows that job, I do not see that job in any of that hash.# folders in
> /var/spool/slurmctld

Job arrays only save a single copy of the job script and environment. But that's not what's implicated here either - you're running into a problem with job cleanup crashing, not with launching.
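
If you want to double-check, the single saved copy should sit under the array leader's job ID rather than each task's - this assumes the usual StateSaveLocation layout where job directories hash into hash.0 through hash.9 by job ID; the path below is illustrative:

$ ls /var/spool/slurmctld/hash.0/job.6602860/
environment  script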
Comment 3 Adam 2018-04-01 00:20:55 MDT
I don't believe we've changed the node definitions in at least a week or more. We did have a large import of fairshare data and such into sacctmgr as we prepare for our new round of users/groups, and a mistake could well have been made there, but I'm not sure whether core counts, etc. would be affected by that or not.

We had a segfault related to priority_multifactor 4 days ago, but that only happened the one time.

I'm gathering the logs for you and setting them aside before applying the patch; I've set aside the core dump and the slurmctld binary for now. I forgot to set the log aside at first, so it's 5 GB at the moment from my testing with -vvv.
Comment 4 Adam 2018-04-01 01:19:27 MDT
That patch did the trick. We can follow up on this after the long weekend; I'll be back in on Tuesday. Thanks for the quick response, Tim.

Feel free to adjust the severity to whatever you deem appropriate now that we're back in production.
Comment 5 Tim Wickberg 2018-04-01 23:07:38 MDT
Glad that helped get things back up and running again.

I'm going to ask one of my colleagues to look into this further this week. If you're able to get those logs attached at some point, they may still prove helpful in narrowing down what led to the initial problem.

- Tim
Comment 6 Alejandro Sanchez 2018-04-11 05:48:01 MDT
Adam, I'm retagging this as sev-3 since you reported the repeated segfaults ceased with the quick patch. We're still curious, though, to understand what caused it. Did you manage to grab the ctld logs from that day? The slurm.conf in use when the crash happened would be worth seeing as well. Thank you.
Comment 7 Adam 2018-04-27 14:18:54 MDT
Sorry about the delay; I've been busy adding our second stage of deployment (another 640 nodes).

We're not running the patched version anymore and things have been smooth.  It was only needed to get past that hiccup.

I've attached the slurmctld.log from the time of the issue and the slurm.conf we're currently using. Unfortunately I didn't grab the conf from that time, but there should be very few differences. I do still have the core file if you need me to run any debug commands against it.

Thanks,

Adam
Comment 8 Adam 2018-04-27 14:30:25 MDT
Created attachment 6706 [details]
slurmctld log during the issue, not after (missing those logs for some reason)

The logs from during the issue (the 5 GB worth) seem to have walked off on me, unfortunately; this log file covers a single launch while broken, up until it crashed. I'm not sure there was anything actually useful in it.
Comment 9 Adam 2018-04-27 14:38:10 MDT
Created attachment 6707 [details]
Slightly newer slurm.conf but should be fairly accurate
Comment 10 Alejandro Sanchez 2018-04-30 05:54:52 MDT
Hi Adam. At this point, I'm inclined to think this is a consequence of the problem reported in bug 4800, fixed in 17.11.4:

https://github.com/SchedMD/slurm/commit/f381e4e6abca6ce45

I think this for two reasons:

1. You report a failed job array task on a node, but you don't see any reference to this job in that node's logs.

2. The backtrace print of the job record shows:

job_array_next_j = 0x849a070, job_array_next_t = 0x0, job_resrcs = 0x0

I find it weird the job_record has job_array_next_j but no job_array_next_t. So I think the array hash corruption contributed messing things up for this job array. I'd vote for upgrading to the latest 17.11 micro release and reopen if the same bt is encountered. Thanks.