Summary: | Slurmctld segfault after update from 17.11.7 to 18.08.1 | ||
---|---|---|---|
Product: | Slurm | Reporter: | Steve Ford <fordste5> |
Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | bart |
Version: | 18.08.1 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=7826 | ||
Site: | MSU | Version Fixed: | 18.08.2
Attachments: |
Patch to create structures outside of if
Slurmctld log
Slurmctld log +SelectType, compressed with xz
Slurmctld log +TraceJobs
Description
Steve Ford
2018-10-08 07:42:21 MDT
Hi,

Could you send me the output from these gdb commands?

t 1
f 5
p *job_ptr

Dominik

(gdb) t 1
[Switching to thread 1 (Thread 0x7f0d69234740 (LWP 21567))]
#0  0x00007f0d68cdef38 in _job_alloc (core_bitmap=0x2c93b40, node_name=0x28f4790 "nvl-002", job_id=290382, gres_name=0x229ccb0 "gpu", node_offset=0, node_cnt=1, node_gres_data=0x2a2ef20, job_gres_data=0x2b0fa70) at gres.c:5856
5856    in gres.c
(gdb) f 5
#5  0x000000000044cef8 in reset_job_bitmaps () at job_mgr.c:10919
10919   job_mgr.c: No such file or directory.
(gdb) p *job_ptr
$1 = {magic = 4038539564, account = 0x2b0f6d0 "general", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x2b0f6a0 "dev-intel18", alloc_resp_port = 52491, alloc_sid = 29888, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 2762, assoc_ptr = 0x2403620, batch_features = 0x0, batch_flag = 0, batch_host = 0x2ac7f80 "nvl-002", billable_tres = 8, bit_flags = 32768, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x2b0f6f0 "stdout=/mnt/home/dunan/slurm-290382.out", cpu_cnt = 8, cpus_per_tres = 0x0, cr_enabled = 0, db_index = 2958410, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x2b0f470, direct_set_prio = 0, end_time = 1539559241, end_time_exp = 1539559241, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres_list = 0x2804ec0, gres_alloc = 0x2b0f750 "7696487:1", gres_detail_cnt = 1, gres_detail_str = 0x2b114f0, gres_req = 0x2b0f780 "gpu:1", gres_used = 0x0, group_id = 2003, job_id = 290382, job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x2b0f7f0, job_state = 1, kill_on_node_fail = 1, last_sched_eval = 1538954441, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x2b0efe0}, mail_type = 0, mail_user = 0x0, mem_per_tres = 0x0, mcs_label = 0x0, name = 0x2b0efc0 "sh", network = 0x0, next_step_id = 1, ngids = 0, nodes = 0x2b0ef50 "nvl-002", node_addr = 0x2c85ea0, node_bitmap = 0x2c85bd0, node_bitmap_cg = 0x0, node_cnt = 1, node_cnt_wag = 0, nodes_completing = 0x0, origin_cluster = 0x2b11290 "msuhpcc", other_port = 52490, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x2b0ef70 "general-long-18,general-long-14,general-long-16", part_ptr_list = 0x26cf950, part_nodes_missing = false, part_ptr = 0x28f5e90, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 3389, priority_array = 0x0, prio_factors = 0x2b0ee90, profile = 0, qos_id = 1, qos_ptr = 0x22c17a0, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x2b0ef20 "192.168.9.200", sched_nodes = 0x0, select_jobinfo = 0x2b0f7c0, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, start_time = 1538954441, state_desc = 0x0, state_reason = 0, state_reason_prev = 0, step_list = 0x26cf9a0, suspend_time = 0, system_comment = 0x0, time_last_active = 1539005708, time_limit = 10080, time_min = 0, tot_sus_time = 0, total_cpus = 8, total_nodes = 1, tres_bind = 0x0, tres_freq = 0x0, tres_per_job = 0x0, tres_per_node = 0x2b0f730 "gpu:1", tres_per_socket = 0x0, tres_per_task = 0x0, tres_req_cnt = 0x2b11340, tres_req_str = 0x2b112b0 "1=8,2=6000,4=1,1001=1", tres_fmt_req_str = 0x2b11250 "cpu=8,mem=6000M,node=1,gres/gpu=1", tres_alloc_cnt = 0x2b112e0, tres_alloc_str = 0x2b113a0 "1=8,2=6000,3=18446744073709551614,4=1,5=8,1001=1", tres_fmt_alloc_str = 0x2b113f0 "cpu=8,mem=6000M,node=1,billing=8,gres/gpu=1", user_id = 721898, user_name = 0x2b0f680 "dunan", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}

Dominik,

I reverted our production server back to 17.11.7 for now. I'm going to change this to a severity 2.

Hi,

Good to know that you are "up" now. Could you also send job_gres_ptr?

t 1
f 0
p *job_gres_ptr

Dominik

Hi,

I can recreate this issue. I will let you know when I have a fix.

Dominik

Hi,

I have a preliminary patch now, but it still needs some review; let me know if you need it. I suggest staying on the 17.11 slurmctld until it is officially in the git repo.

Dominik

Dominik,

That's good news! We are aiming to be on 18.08.1 by next Monday. Do you think this internal review will be done by then?

Thanks,
Steve

Hi,

If it weren't a US holiday, it would be done already. Because of Columbus Day it will take a little longer; it should be in the repo tomorrow.

Dominik

Hey Dominik,

I'm hoping to do this upgrade later this evening. Will this patch be available today?

Thanks,
Steve

Comment on attachment 7994 [details]
Patch to create structures outside of if
Steve, please use this patch; it should get you what you need. It will be in 18.08.2, which will probably be tagged in a few weeks (commit 33f8f22d).
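As context for the attachment title above, the following is a minimal, hypothetical C sketch of the "create structures outside of if" pattern; the type and function names are invented for illustration and this is not the attached patch. The idea is that the bookkeeping structure is allocated unconditionally, so a later code path such as the reset/dealloc logic seen in the backtrace can never dereference a pointer that only existed when the guarding condition held.

```c
/*
 * Hypothetical sketch only -- NOT the attached SchedMD patch. It shows the
 * general "create structures outside of the if" pattern: allocate the
 * bookkeeping structure unconditionally so that later code never
 * dereferences a pointer that was only created inside a conditional branch.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Invented, simplified stand-in for a per-job GRES state record. */
typedef struct {
	int node_cnt;
	uint64_t *gres_cnt_node_alloc;	/* per-node allocation counts */
} job_gres_state_t;

/* Buggy shape: the per-node array is only built when a bitmap is present,
 * so any caller that later assumes it exists crashes on a NULL deref. */
static job_gres_state_t *build_state_buggy(int node_cnt, int have_bitmap)
{
	job_gres_state_t *st = calloc(1, sizeof(*st));
	st->node_cnt = node_cnt;
	if (have_bitmap)
		st->gres_cnt_node_alloc = calloc(node_cnt, sizeof(uint64_t));
	return st;	/* gres_cnt_node_alloc may still be NULL here */
}

/* Fixed shape: the structure is fully created outside of the if; the
 * conditional only fills in optional detail. */
static job_gres_state_t *build_state_fixed(int node_cnt, int have_bitmap)
{
	job_gres_state_t *st = calloc(1, sizeof(*st));
	st->node_cnt = node_cnt;
	st->gres_cnt_node_alloc = calloc(node_cnt, sizeof(uint64_t));
	if (have_bitmap)
		st->gres_cnt_node_alloc[0] = 1;	/* e.g. derive from bitmap */
	return st;
}

int main(void)
{
	job_gres_state_t *bad = build_state_buggy(1, 0);
	if (!bad->gres_cnt_node_alloc)
		puts("buggy build: per-node array is NULL; a later deref would segfault");
	free(bad->gres_cnt_node_alloc);
	free(bad);

	job_gres_state_t *ok = build_state_fixed(1, 0);
	printf("fixed build: node 0 alloc = %llu\n",
	       (unsigned long long)ok->gres_cnt_node_alloc[0]);
	free(ok->gres_cnt_node_alloc);
	free(ok);
	return 0;
}
```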
Danny,

We updated using this patch and slurmctld runs; however, we're seeing a lot of the following errors in the log:

[2018-10-10T21:28:54.125] error: gres/gpu: job 326801 dealloc node nvl-006 topo gres count underflow (0 4)
[2018-10-10T21:28:54.125] error: gres/gpu: job 326801 dealloc node nvl-006 type v100 gres count underflow (0 4)
[2018-10-10T21:28:54.127] error: gres/gpu: job 303352 dealloc node nvl-002 topo gres count underflow (0 1)
[2018-10-10T21:28:54.127] error: gres/gpu: job 303352 dealloc node nvl-002 type v100 gres count underflow (0 1)
[2018-10-10T21:28:54.127] error: gres/gpu: job 326800 dealloc node nvl-006 topo gres count underflow (1 4)
[2018-10-10T21:28:54.127] error: gres/gpu: job 326800 dealloc node nvl-006 type v100 gres count underflow (0 4)
[2018-10-10T21:28:54.127] error: gres/gpu: job 326801 dealloc node nvl-006 topo gres count underflow (0 4)
[2018-10-10T21:28:54.128] error: gres/gpu: job 326801 dealloc node nvl-006 type v100 gres count underflow (0 4)
[2018-10-10T21:28:54.155] error: gres/gpu: job 303352 dealloc node nvl-002 topo gres count underflow (0 1)
[2018-10-10T21:28:54.155] error: gres/gpu: job 303352 dealloc node nvl-002 type v100 gres count underflow (0 1)
[2018-10-10T21:28:54.157] error: gres/gpu: job 303352 dealloc node nvl-002 topo gres count underflow (0 1)
[2018-10-10T21:28:54.157] error: gres/gpu: job 303352 dealloc node nvl-002 type v100 gres count underflow (0 1)
[2018-10-10T21:28:54.160] error: gres/gpu: job 303352 dealloc node nvl-002 topo gres count underflow (0 1)
[2018-10-10T21:28:54.160] error: gres/gpu: job 303352 dealloc node nvl-002 type v100 gres count underflow (0 1)
[2018-10-10T21:28:54.174] error: gres/gpu: job 326800 dealloc node nvl-006 topo gres count underflow (1 4)
[2018-10-10T21:28:54.174] error: gres/gpu: job 326800 dealloc node nvl-006 type v100 gres count underflow (0 4)
[2018-10-10T21:28:54.174] error: gres/gpu: job 326801 dealloc node nvl-006 topo gres count underflow (0 4)
[2018-10-10T21:28:54.174] error: gres/gpu: job 326801 dealloc node nvl-006 type v100 gres count underflow (0 4)
[2018-10-10T21:28:54.177] error: gres/gpu: job 326800 dealloc node nvl-006 topo gres count underflow (1 4)

Hi,

We expect this is only a transitional thing and it should go away after all 17.11 jobs are gone. I will let you know when I find out exactly what caused this problem.

Dominik

Hi,

Could you send me the full slurmctld.log and the output from "scontrol show job <one of the affected job_ids, e.g. 326800>", if such jobs still exist in the system?

Dominik
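For readers parsing the messages above, here is a small, hypothetical C sketch of the kind of defensive check that emits a "gres count underflow" error; the function and variable names are invented and this is not the Slurm source. It assumes the two numbers in parentheses are the counter's current value and the amount the job is trying to release; when the node-side counter is already smaller (for example, because of stale pre-upgrade job state), the code logs both values and clamps to zero instead of letting the unsigned counter wrap.

```c
/*
 * Hypothetical sketch only -- NOT the Slurm source. It illustrates a
 * defensive dealloc-side check of the kind that a "gres count underflow"
 * error implies: before subtracting a job's GRES count from a node
 * counter, verify the counter is large enough; if it is not, log the two
 * values and clamp to zero rather than wrapping the unsigned counter.
 */
#include <stdio.h>
#include <stdint.h>

static void dealloc_gres(const char *gres_name, uint32_t job_id,
			 const char *node_name,
			 uint64_t *node_alloc_cnt, uint64_t job_cnt)
{
	if (*node_alloc_cnt < job_cnt) {
		/* Assumption: "(x y)" in the real message is
		 * (current counter value, amount being released). */
		fprintf(stderr,
			"error: gres/%s: job %u dealloc node %s "
			"gres count underflow (%llu %llu)\n",
			gres_name, job_id, node_name,
			(unsigned long long)*node_alloc_cnt,
			(unsigned long long)job_cnt);
		*node_alloc_cnt = 0;	/* clamp rather than wrap */
	} else {
		*node_alloc_cnt -= job_cnt;
	}
}

int main(void)
{
	/* Node-side counter says 0 GPUs allocated, but the (stale) job
	 * record still claims 4 -- reproduces the "(0 4)" shape above. */
	uint64_t node_alloc = 0;
	dealloc_gres("gpu", 326801, "nvl-006", &node_alloc, 4);
	return 0;
}
```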
# scontrol show job 326800
JobId=326800 JobName=4zh9_long_200ns_MD
   UserId=foouser(*) GroupId=hpccclass(*) MCS_label=N/A
   Priority=20007 Nice=0 Account=general QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-14:46:46 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2018-10-10T18:39:43 EligibleTime=2018-10-10T18:39:43
   AccrueTime=Unknown
   StartTime=2018-10-10T18:39:43 EndTime=2018-10-17T18:39:43 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-10-10T18:39:43
   Partition=general-long-18 AllocNode:Sid=lac-249:2131
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=nvl-006
   BatchHost=nvl-006
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=12G,node=1,billing=1,gres/gpu=4
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=12G MinTmpDiskNode=0
   Features=intel16|intel18 DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/home/foouser/MD/CRBP/longRun/4zh9_long_200ns_MD.sb
   WorkDir=/mnt/home/foouser/MD/CRBP/longRun
   Comment=stdout=/mnt/home/foouser/MD/CRBP/longRun/4zh9_long_200ns_MD.sb.o%A
   StdErr=/mnt/home/foouser/MD/CRBP/longRun/4zh9_long_200ns_MD.sb.e326800
   StdIn=/dev/null
   StdOut=/mnt/home/foouser/MD/CRBP/longRun/4zh9_long_200ns_MD.sb.o326800
   Power=
   TresPerNode=gpu:4

Hi,

Does this error still occur? I am still working on this, but I can't reproduce it.

Dominik

Hi,

Can we drop the severity to 3 now, as the patch is already in the git repo?

Dominik

Dominik,

This is fine to drop to sev 3. I am still seeing this error in the logs, and I'm now seeing it for jobs submitted after the update. For example, job 672084:

[2018-10-25T12:23:44.040] error: gres/gpu: job 672084 dealloc node lac-196 gres count underflow
[2018-10-25T12:23:44.040] error: gres/gpu: job 672084 dealloc node lac-196 topo gres count underflow (3 4)
[2018-10-25T12:23:44.040] error: gres/gpu: job 672084 dealloc node lac-196 type k80 gres count underflow (3 4)

JobId=672084 JobName=bash
   UserId=foouser(*) GroupId=group(*) MCS_label=N/A
   Priority=24787 Nice=0 Account=general QOS=normal
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=130:0
   RunTime=00:14:28 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2018-10-25T12:09:16 EligibleTime=2018-10-25T12:09:16
   StartTime=2018-10-25T12:09:16 EndTime=2018-10-25T12:23:44 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-10-25T12:09:16
   Partition=general-long-16 AllocNode:Sid=dev-intel16-k80:18489
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=lac-196
   BatchHost=lac-196
   NumNodes=1 NumCPUs=8 NumTasks=4 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=2G,node=1,billing=8,gres/gpu=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryNode=2G MinTmpDiskNode=0
   Features=gpgpu DelayBoot=00:00:00
   Gres=gpu:4 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/mnt/home/foouser/Documents/ShearWaveImaging/CUDA_GFS_ForHPCC_Release
   Comment=stdout=/mnt/home/foouser/Documents/ShearWaveImaging/CUDA_GFS_ForHPCC_Release/slurm-672084.out
   Power=

Thanks,
Steve

Created attachment 8105 [details]
Slurmctld log
Hi,

It seems that you can reproduce this easily, while I still can't. Could you generate slurmctld.log with the SelectType DebugFlag enabled? e.g.:

scontrol setdebugflags +SelectType
(submit the job and wait for the log)
scontrol setdebugflags -SelectType

Could you also send me your slurm.conf and the exact command you used to submit this job?

Dominik

Dominik,

I have not been able to submit a job that produces these errors; I am only noticing these jobs after the fact. I'll attach the slurm log with the SelectType flag set. Perhaps the information from that flag for the jobs that are already throwing this error will be helpful.

Best,
Steve

Created attachment 8202 [details]
Slurmctld log +SelectType, compressed with xz
Hi,

It looks like the logic in the select plugin is working well; the problem must be somewhere else. Could you generate slurmctld.log with the TraceJobs DebugFlag enabled?

Dominik

Created attachment 9233 [details]
Slurmctld log +TraceJobs
Dominik,
Here is the log output with TraceJobs enabled.
Best,
Steve
Hi,

These two commits should fix the "gres count underflow" issue:

https://github.com/SchedMD/slurm/commit/6f8cd92e1091e
https://github.com/SchedMD/slurm/commit/69d78159c3305

Both patches will be included in 18.08.6. In a separate bug we are handling a wrongly set debug level, which produces false-positive messages like this: "select_nodes: calling _get_req_features() for JobId=10129368 with not NULL job resources". Let me know if we can close this bug now.

Dominik

Hi,

I will close this ticket as resolved/fixed and set "Version Fixed" to 18.08.2 (the version where the original bug was fixed). Please open a new ticket if you notice more "count underflow" messages on versions later than 18.08.6.

Dominik