Ticket 8322

Summary: slurmctld repeatedly dies after restart, generating core files
Product: Slurm Reporter: Fenglai Liu <fenglai>
Component: slurmctld Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: bart, fenglai
Version: 18.08.4   
Hardware: Linux   
OS: Linux   
Site: Vanderbilt
Linux Distro: CentOS
Attachments: for missing job_resrsc struct
slurmctld log file for the failure

Description Fenglai Liu 2020-01-12 14:55:03 MST
Hello there,

This morning we hit a serious issue with the Slurm controller. The slurmctld service keeps dying after we restart it, and each restart attempt generates a core file. The core file is too large to upload here directly; if you need it, we can share it via a private link. Please help us investigate the slurmctld failure and give us some insight into what is going wrong. Right now we cannot bring the Slurm controller back.

Thank you so much!

Fenglai
Comment 1 Dominik Bartkiewicz 2020-01-12 15:05:48 MST
Hi

Can you load the core file into gdb and share the backtrace with us?
e.g.:
gdb -ex 't a a bt' -batch slurmctld <corefile>

Dominik
Comment 2 Fenglai Liu 2020-01-12 16:05:09 MST
Hi Dominik,

Thank you for reaching out so quickly! Here is what we got from gdb:

[New LWP 129003]
[New LWP 129011]
[New LWP 129010]
[New LWP 129013]
[New LWP 129149]
[New LWP 129148]
[New LWP 129153]
[New LWP 129136]
[New LWP 129155]
[New LWP 129150]
[New LWP 139038]
[New LWP 129151]
[New LWP 129152]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  _step_dealloc_lps (step_ptr=0x361b870) at step_mgr.c:2092

Thread 13 (Thread 0x7fbf13bfb700 (LWP 129152)):
#0  0x00007fbf261b2101 in sigwait () from /lib64/libpthread.so.0
#1  0x000000000042d867 in _slurmctld_signal_hand (no_data=<optimized out>) at controller.c:1052
#2  0x00007fbf261aadc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fbf25ed976d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7fbf13cfc700 (LWP 129151)):
#0  0x00007fbf25ed0bd3 in select () from /lib64/libc.so.6
#1  0x00000000004290f6 in _slurmctld_rpc_mgr (no_data=<optimized out>) at controller.c:1183
#2  0x00007fbf261aadc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fbf25ed976d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7fbb06ded700 (LWP 139038)):
#0  0x00007fbf261ade24 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1  0x000000000046c9db in lock_slurmctld (lock_levels=...) at locks.c:117
#2  0x00000000004875df in _slurm_rpc_dump_jobs (msg=msg@entry=0x7fbb06dece50) at proc_req.c:1783
#3  0x0000000000493a4d in slurmctld_req (msg=msg@entry=0x7fbb06dece50, arg=arg@entry=0x7fbee40072e0) at proc_req.c:337
#4  0x00000000004288f8 in _service_connection (arg=0x7fbee40072e0) at controller.c:1282
#5  0x00007fbf261aadc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fbf25ed976d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7fbf13fff700 (LWP 129150)):
#0  0x00007fbf261aea82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fbf200786e0 in _decay_thread (no_data=<optimized out>) at priority_multifactor.c:1334
#2  0x00007fbf261aadc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fbf25ed976d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7fbf138f8700 (LWP 129155)):
#0  0x00007fbf261ae6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000426f75 in _purge_files_thread (no_data=<optimized out>) at controller.c:3345
#2  0x00007fbf261aadc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fbf25ed976d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fbf215fc700 (LWP 129136)):
#0  0x00007fbf261aea82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fbf21600ed6 in _my_sleep (usec=120000000) at backfill.c:591
#2  0x00007fbf216075eb in backfill_agent (args=<optimized out>) at backfill.c:934
#3  0x00007fbf261aadc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fbf25ed976d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fbf13afa700 (LWP 129153)):
#0  0x00007fbf261ae6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004aa46f in slurmctld_state_save (no_data=<optimized out>) at state_save.c:207
#2  0x00007fbf261aadc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fbf25ed976d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fbf2047f700 (LWP 129148)):
#0  0x00007fbf261aea82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000043115f in _agent_thread (arg=<optimized out>) at fed_mgr.c:2225
#2  0x00007fbf261aadc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fbf25ed976d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fbf2037e700 (LWP 129149)):
#0  0x00007fbf261aea82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000435029 in _fed_job_update_thread (arg=<optimized out>) at fed_mgr.c:2183
#2  0x00007fbf261aadc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fbf25ed976d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fbf22514700 (LWP 129013)):
#0  0x00007fbf25ecee2d in poll () from /lib64/libc.so.6
#1  0x00007fbf26695606 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fbf22513da0) at /usr/include/bits/poll2.h:46
#2  _conn_readable (persist_conn=persist_conn@entry=0xcf4ad0) at slurm_persist_conn.c:138
#3  0x00007fbf26696b0f in slurm_persist_recv_msg (persist_conn=0xcf4ad0) at slurm_persist_conn.c:882
#4  0x00007fbf22c43cb8 in _handle_mult_rc_ret () at slurmdbd_agent.c:168
#5  _agent (x=<optimized out>) at slurmdbd_agent.c:667
#6  0x00007fbf261aadc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fbf25ed976d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fbf26b93700 (LWP 129010)):
#0  0x00007fbf261ae03e in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x000000000046ca1a in lock_slurmctld (lock_levels=...) at locks.c:119
#2  0x0000000000420209 in _agent_retry (mail_too=false, min_wait=999) at agent.c:1526
#3  _agent_init (arg=<optimized out>) at agent.c:1396
#4  0x00007fbf261aadc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fbf25ed976d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fbf22c39700 (LWP 129011)):
#0  0x00007fbf261aea82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fbf22c3eae0 in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:447
#2  0x00007fbf261aadc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fbf25ed976d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fbf26b94740 (LWP 129003)):
#0  _step_dealloc_lps (step_ptr=0x361b870) at step_mgr.c:2092
#1  post_job_step (step_ptr=step_ptr@entry=0x361b870) at step_mgr.c:4769
#2  0x00000000004b42ae in _post_job_step (step_ptr=0x361b870) at step_mgr.c:270
#3  _internal_step_complete (job_ptr=job_ptr@entry=0x361acd0, step_ptr=step_ptr@entry=0x361b870) at step_mgr.c:311
#4  0x00000000004b432c in delete_step_records (job_ptr=job_ptr@entry=0x361acd0) at step_mgr.c:340
#5  0x0000000000469a31 in cleanup_completing (job_ptr=job_ptr@entry=0x361acd0) at job_scheduler.c:4961
#6  0x00000000004561fc in kill_running_job_by_node_name (node_name=0x1936a70 "cn1277") at job_mgr.c:3831
#7  0x000000000047335a in set_node_down_ptr (node_ptr=node_ptr@entry=0x1994b60, reason=reason@entry=0x4cb72f "Not responding") at node_mgr.c:3555
#8  0x0000000000481868 in ping_nodes () at ping_nodes.c:290
#9  0x000000000042c9ad in _slurmctld_background (no_data=0x0) at controller.c:2090
#10 main (argc=<optimized out>, argv=<optimized out>) at controller.c:763

Thank you,

Fenglai


Comment 3 Dominik Bartkiewicz 2020-01-12 17:17:21 MST
Hi

Could you send me the output of these gdb commands, plus the slurmctld.log covering this crash?
e.g.:

gdb slurmctld <corefile>

p *job_resrcs_ptr
f 3
p *step_ptr
p *job_ptr

Dominik
Comment 4 Fenglai Liu 2020-01-12 17:46:14 MST
Hi Dominik,

Here is the output from running the gdb commands you suggested. There was no relevant output in slurmctld.log, so I am just copying the gdb output for you:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
Program terminated with signal 11, Segmentation fault.
#0  _step_dealloc_lps (step_ptr=0x361b870) at step_mgr.c:2092
2092    step_mgr.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-18.08.4-1.el7.centos.x86_64
(gdb) p *job_resrcs_ptr
Cannot access memory at address 0x0
(gdb) f 3
#3  _internal_step_complete (job_ptr=job_ptr@entry=0x361acd0, step_ptr=step_ptr@entry=0x361b870) at step_mgr.c:311
311     in step_mgr.c
(gdb) p *step_ptr
$1 = {magic = 3405695742, batch_step = 0, ckpt_interval = 0, check_job = 0x0, ckpt_dir = 0x0, ckpt_time = 0, core_bitmap_job = 0x0, cpu_count = 0, cpu_freq_min = 0, cpu_freq_max = 0,
   cpu_freq_gov = 0, cpus_per_task = 0, cpus_per_tres = 0x0, cyclic_alloc = 0, exclusive = 0, exit_code = 4294967294, exit_node_bitmap = 0x0, ext_sensors = 0x361c0c0, gres_list = 0x0,
   host = 0x0, job_ptr = 0x361acd0, jobacct = 0x361ba00, mem_per_tres = 0x0, name = 0x361b6d0 "extern", network = 0x0, no_kill = 0 '\000', pn_min_memory = 0, port = 0, pre_sus_time = 0,
   start_protocol_ver = 8448, resv_port_array = 0x0, resv_port_cnt = 0, resv_ports = 0x0, requid = 4294967295, start_time = 1578792396, time_limit = 4294967295, select_jobinfo = 0x361b7f0,
   srun_pid = 0, state = 32769, step_id = 4294967295, step_layout = 0x361b6f0, step_node_bitmap = 0x3c294b0, switch_job = 0x0, time_last_active = 0, tot_sus_time = 0,
   tres_alloc_str = 0x361b820 "1=1,2=5120,3=18446744073709551614,4=1,5=1", tres_bind = 0x0, tres_fmt_alloc_str = 0x0, tres_freq = 0x0, tres_per_step = 0x0, tres_per_node = 0x0,
   tres_per_socket = 0x0, tres_per_task = 0x0}
(gdb) p *job_ptr
$2 = {magic = 4038539564, account = 0x361b2c0 "h_oguz_lab", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x361ac20 "gw346", alloc_resp_port = 0, alloc_sid = 30645,
   array_job_id = 15905261, array_task_id = 225, array_recs = 0x361c3b0, assoc_id = 3325, assoc_ptr = 0xdbc760, batch_features = 0x0, batch_flag = 2, batch_host = 0x35b3ce0 "cn1277",
   billable_tres = 1, bit_flags = 262144, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 0,
   cpus_per_tres = 0x0, cr_enabled = 0, db_index = 466836285, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x361b0d0, direct_set_prio = 0, end_time = 1578792880,
   end_time_exp = 4294967294, epilog_running = false, exit_code = 1, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres_list = 0x0, gres_alloc = 0x361b2f0 "", gres_detail_cnt = 0,
   gres_detail_str = 0x0, gres_req = 0x361b310 "", gres_used = 0x0, group_id = 20832, job_id = 15905261, job_next = 0x0, job_array_next_j = 0x3597330, job_array_next_t = 0x0, job_resrcs = 0x0,
   job_state = 32768, kill_on_node_fail = 1, last_sched_eval = 1578792396, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x361ac40}, mail_type = 271,
   mail_user = 0x361b330 "larsonke@vanderbilt.edu", mem_per_tres = 0x0, mcs_label = 0x0, name = 0x361abe0 "FS-FRSBE+CorticalChanges.sh", network = 0x0, next_step_id = 0, ngids = 0,
   nodes = 0x361ab90 "cn1277", node_addr = 0x3c29340, node_bitmap = 0x3c29410, node_bitmap_cg = 0x3c29370, node_cnt = 0, node_cnt_wag = 1, nodes_completing = 0x361a780 "cn1277",
   origin_cluster = 0x0, other_port = 0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x361abb0 "production", part_ptr_list = 0x0,
   part_nodes_missing = false, part_ptr = 0x19611d0, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 28120, priority_array = 0x0,
   prio_factors = 0x361ab00, profile = 0, qos_id = 1, qos_ptr = 0xd194d0, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 1, resize_time = 0, resv_id = 0, resv_name = 0x0,
   resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x361b380, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8448, start_time = 0,
   state_desc = 0x0, state_reason = 35, state_reason_prev = 35, step_list = 0x18531a0, suspend_time = 0, system_comment = 0x0, time_last_active = 1578857972, time_limit = 6000, time_min = 0,
   tot_sus_time = 0, total_cpus = 1, total_nodes = 1, tres_bind = 0x0, tres_freq = 0x0, tres_per_job = 0x0, tres_per_node = 0x0, tres_per_socket = 0x0, tres_per_task = 0x0,
   tres_req_cnt = 0x361c630, tres_req_str = 0x361b7c0 "1=1,2=5120,4=1,5=1", tres_fmt_req_str = 0x361c370 "cpu=1,mem=5G,node=1,billing=1", tres_alloc_cnt = 0x361c260,
   tres_alloc_str = 0x361c170 "1=1,2=5120,3=18446744073709551614,4=1,5=1", tres_fmt_alloc_str = 0x361c330 "cpu=1,mem=5G,node=1,billing=1", user_id = 645142, user_name = 0x361b290 "larsonke",
   wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}



Thank you very much for your help!

Fenglai


Comment 5 Dominik Bartkiewicz 2020-01-12 23:41:59 MST
Created attachment 12712 [details]
for missing job_resrsc struct

Hi

Could you apply this patch?
This should bypass that specific issue.

Dominik
Comment 6 Fenglai Liu 2020-01-13 06:43:09 MST
Hi Dominik,

Thank you, I see the patch file. We will apply it and update you later on whether the problem is solved.

Thank you again,

Fenglai

Comment 7 Dominik Bartkiewicz 2020-01-13 07:26:10 MST
Hi

Glad to hear that slurmctld is online now.

Now we need to find the root cause of this issue (this patch only bypasses this bug).
Could you send me slurm.conf, and slurmctld.log covering the time from submission job 15905261 to the first crash?

Could we drop the severity to 2 or 3 now that slurmctld is running?

Dominik
Comment 9 Fenglai Liu 2020-01-13 08:37:13 MST
Hi Dominik,

We are rebuilding the RPM and applying it to the controller now. I will let
you know how it goes. If everything is good, I will drop the severity to 2/3
once Slurm can run.

Thank you,

Fenglai

Comment 10 Fenglai Liu 2020-01-14 06:05:49 MST
Hi Dominik,

I apologize for keeping you waiting so long. We suspect the Slurm issue
may have been caused by storage problems, and since yesterday my
colleagues have been trying to fix those storage issues. I added your
patch to the new RPM, but unfortunately I am still waiting for them to finish.

Let me drop the severity to 3; if slurmctld still has problems,
I will let you know.

Thank you,

Fenglai

Comment 11 Dominik Bartkiewicz 2020-01-14 07:33:29 MST
Hi

No problem.
Currently, your slurmctld has a corrupted state file, and without this patch it will crash every time.
An issue like this can be triggered by disk problems, but it needs to be handled gracefully in slurmctld.
If you can attach slurmctld.log and briefly describe the potential disk issue, it will help me figure out whether this is a duplicate of bug 6837
(https://github.com/SchedMD/slurm/commit/70d12f070908c33) or something new.

Dominik
Comment 12 Fenglai Liu 2020-01-14 11:18:47 MST
Created attachment 12733 [details]
slurmctld log file for the failure
Comment 13 Fenglai Liu 2020-01-14 11:20:56 MST
Hi Dominik,

We just brought the Slurm controller back online. So far so good. Thank you for your patch file!

I attached the slurmctld.log here for your reference. We will check the local disk for potential problems, although we rather suspect the issue was caused by the GPFS shared file system. I will keep you updated if we have new findings.

Thank you again,

Fenglai
Comment 14 Dominik Bartkiewicz 2020-01-20 06:30:27 MST
Hi

I am sure that this bug is a duplicate of bug 6837.

Slurm 18.08.8 and all later versions are free from this issue.
https://github.com/SchedMD/slurm/commit/70d12f070908c33

Let me know if we can close this bug.

Dominik
Comment 15 Fenglai Liu 2020-01-20 08:49:34 MST
Sure, thank you so much for your work!

Have a good day,

Fenglai