| Summary: | after removal of reservation slurmctld is crashing at start with a coredump | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Michael Hebenstreit <michael.hebenstreit> |
| Component: | slurmctld | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | bart, cinek, nate, richard.dahringer |
| Version: | 20.02.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9720 | ||
| Site: | Intel CRT | Alineos Sites: | --- |
| Version Fixed: | 21.08pre1 | Target Release: | --- |
| Attachments: | output from starting slurmctld as root with option -v -v -v -D | ||
| | workaround (v1) | ||
Could you please load the core into gdb with your slurmctld binary and share the backtrace from it:

    gdb /path/to/slurmctld /path/to/core.XX
    (gdb) t a a bt f

cheers,
Marcin

[root@eslurm1 ~]# gdb /opt/slurm/20.02.3/sbin/slurmctld /var/lib/systemd/coredump/core.slurmctld.509.00fce0901101407abc89e88bd95377c3.12557.1603808600000000
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-11.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /opt/slurm/20.02.3/sbin/slurmctld...done.
[New LWP 12557]
[New LWP 12561]
[New LWP 12563]
[New LWP 12558]
[New LWP 12559]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: Loadable section ".note.gnu.property" outside of ELF segments
(warning repeated 12 times)
Core was generated by `/opt/slurm/20.02.3/sbin/slurmctld -D -vvv'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000014a142e80e59 in bit_clear (b=0x0, bit=bit@entry=277) at bitstring.c:258
258             b[_bit_word(bit)] &= ~_bit_mask(bit);
[Current thread is 1 (Thread 0x14a14340e100 (LWP 12557))]
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-101.el8.x86_64 libblkid-2.32.1-22.el8.x86_64 libcap-2.26-3.el8.x86_64 libgcc-8.3.1-5.el8.0.2.x86_64 libmount-2.32.1-22.el8.x86_64 libselinux-2.9-3.el8.x86_64 libuuid-2.32.1-22.el8.x86_64 munge-libs-0.5.13-1.el8.x86_64 openssl-libs-1.1.1c-15.el8.x86_64 systemd-libs-239-31.el8_2.2.x86_64 zlib-1.2.11-16.el8_2.x86_64
(gdb) t a a bt f?
Thread 5 (Thread 0x14a13eafb700 (LWP 12559)):
No symbol "f" in current context.
(gdb) t a a bt f
Thread 5 (Thread 0x14a13eafb700 (LWP 12559)):
#0  0x000014a1427df7da in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000014a13eb01073 in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:451
        err = <optimized out>
        local_job_list = <optimized out>
        job_ptr = <optimized out>
        itr = <optimized out>
        tvnow = {tv_sec = 1603808597, tv_usec = 749411}
        abs = {tv_sec = 1603808602, tv_nsec = 749411000}
        job_read_lock = {conf = NO_LOCK, job = READ_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
        job_write_lock = {conf = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, part = NO_LOCK, fed = NO_LOCK}
        __func__ = "_set_db_inx_thread"
#2  0x000014a1427d92de in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x000014a14250ae83 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x14a14340c700 (LWP 12558)):
#0  0x000014a1427df7da in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x0000000000421fef in _agent_init (arg=<optimized out>) at agent.c:1395
        err = <optimized out>
        min_wait = <optimized out>
        mail_too = <optimized out>
        ts = {tv_sec = 1603808601, tv_nsec = 0}
        last_defer_attempt = 0
        __func__ = "_agent_init"
#2  0x000014a1427d92de in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x000014a14250ae83 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x14a13d37c700 (LWP 12563)):
#0  0x000014a1427df7da in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000014a13d381110 in _my_sleep (usec=30000000) at backfill.c:668
        err = <optimized out>
        nsec = <optimized out>
        sleep_time = 0
        ts = {tv_sec = 1603808629, tv_nsec = 525767000}
        tv1 = {tv_sec = 1603808599, tv_usec = 525767}
        tv2 = {tv_sec = 0, tv_usec = 0}
        __func__ = "_my_sleep"
#2  0x000014a13d388644 in backfill_agent (args=<optimized out>) at backfill.c:1028
        now = <optimized out>
        wait_time = <optimized out>
        last_backfill_time = 1603808599
        all_locks = {conf = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, part = READ_LOCK, fed = READ_LOCK}
        load_config = <optimized out>
        short_sleep = false
        backfill_cnt = 0
        __func__ = "backfill_agent"
#3  0x000014a1427d92de in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x000014a14250ae83 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x14a13e9fa700 (LWP 12561)):
#0  0x000014a1424fff21 in poll () from /lib64/libc.so.6
No symbol table info available.
#1  0x000014a142f04885 in _conn_readable (persist_conn=persist_conn@entry=0x7c5240) at slurm_persist_conn.c:138
        ufds = {fd = 6, events = 1, revents = 0}
        rc = <optimized out>
        time_left = 900000
        __func__ = "_conn_readable"
#2  0x000014a142f05e77 in slurm_persist_recv_msg (persist_conn=0x7c5240) at slurm_persist_conn.c:925
        msg_size = 226
        nw_size = 3791650816
        msg = 0x14a134004c70 ""
        msg_read = <optimized out>
        offset = 0
        buffer = <optimized out>
        __func__ = "slurm_persist_recv_msg"
#3  0x000014a13eb072c4 in _handle_mult_rc_ret () at slurmdbd_agent.c:189
        buffer = <optimized out>
        msg_type = 0
        msg = 0x0
        list_msg = 0x0
        rc = -1
        out_buf = 0x0
        buffer = <optimized out>
        msg_type = <optimized out>
        msg = <optimized out>
        list_msg = <optimized out>
        rc = <optimized out>
        out_buf = <optimized out>
        __func__ = "_handle_mult_rc_ret"
        err = <optimized out>
        itr = <optimized out>
        b = <optimized out>
        err = <optimized out>
#4  _agent (x=<optimized out>) at slurmdbd_agent.c:852
        rc = 0
        cnt = 12
        buffer = 0xcc4ba0
        abs_time = {tv_sec = 1603808607, tv_nsec = 0}
        fail_time = 0
        sigarray = {10, 0}
        list_req = {conn = 0x0, data = 0x14a13e9f9e70, data_size = 0, msg_type = 1474}
        list_msg = {my_list = 0x7b5bb0, return_code = 0}
        tv1 = {tv_sec = 1603808600, tv_usec = 250167}
        tv2 = {tv_sec = 1603808600, tv_usec = 250165}
        tv_str = "usec=44067\000\000\000\000\000\000\000\000\000"
        delta_t = 44067
        __func__ = "_agent"
#5  0x000014a1427d92de in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#6  0x000014a14250ae83 in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x14a14340e100 (LWP 12557)):
#0  0x000014a142e80e59 in bit_clear (b=0x0, bit=bit@entry=277) at bitstring.c:258
No locals.
#1  0x0000000000479e41 in make_node_idle (node_ptr=node_ptr@entry=0x14a1432a3c38, job_ptr=job_ptr@entry=0xca7950) at node_mgr.c:3859
        inx = 277
        node_flags = <optimized out>
        now = 1603808600
        node_bitmap = <optimized out>
        __func__ = "make_node_idle"
#2  0x0000000000446c4d in excise_node_from_job (job_ptr=job_ptr@entry=0xca7950, node_ptr=node_ptr@entry=0x14a1432a3c38) at job_mgr.c:4146
        i = <optimized out>
        i_first = <optimized out>
        i_last = <optimized out>
        orig_pos = -1
        new_pos = -1
        orig_bitmap = 0xd6f580
        __func__ = "excise_node_from_job"
#3  0x00000000004a2e4a in _sync_nodes_to_active_job (job_ptr=0xca7950) at read_config.c:2669
        save_accounting_enforce = 11
        i = 277
        cnt = 0
        node_flags = <optimized out>
        node_ptr = 0x14a1432a3c38
        i = <optimized out>
        cnt = <optimized out>
        node_flags = <optimized out>
        node_ptr = <optimized out>
        __func__ = "_sync_nodes_to_active_job"
        save_accounting_enforce = <optimized out>
#4  _sync_nodes_to_jobs (reconfig=false) at read_config.c:2548
        job_ptr = 0xca7950
        job_iterator = 0xd22c40
        update_cnt = 0
        job_ptr = <optimized out>
        job_iterator = <optimized out>
        update_cnt = <optimized out>
#5  read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1375
        tv1 = {tv_sec = 1603808599, tv_usec = 486287}
        tv2 = {tv_sec = 0, tv_usec = 0}
        tv_str = '\000' <repeats 19 times>
        delta_t = 0
        error_code = 0
        i = <optimized out>
        rc = 0
        load_job_ret = 0
        old_node_record_count = <optimized out>
        old_node_table_ptr = <optimized out>
        node_ptr = <optimized out>
        do_reorder_nodes = <optimized out>
        old_part_list = <optimized out>
        old_def_part_name = <optimized out>
        old_auth_type = 0xaed980 "auth/munge"
        old_bb_type = 0x0
        old_cred_type = 0xaed950 "cred/munge"
        old_preempt_mode = <optimized out>
        old_preempt_type = 0xaed8c0 "preempt/none"
        old_sched_type = 0xaed5c0 "sched/backfill"
        old_select_type = 0xac5160 "select/cons_res"
        old_switch_type = 0xac5190 "switch/none"
        state_save_dir = 0xaed9b0 "/opt/slurm/current/var/spool"
        mpi_params = 0x0
        old_select_type_p = <optimized out>
        cgroup_mem_confinement = <optimized out>
        __func__ = "read_slurm_conf"
#6  0x000000000042e578 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:680
        cnt = <optimized out>
        error_code = <optimized out>
        i = 3
        start = {tv_sec = 1603808597, tv_usec = 743865}
        now = {tv_sec = 1603808597, tv_usec = 743891}
        stat_buf = {st_dev = 2054, st_ino = 1521783, st_nlink = 1, st_mode = 33261, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 34280, st_blksize = 4096, st_blocks = 72, st_atim = {tv_sec = 1591804356, tv_nsec = 0}, st_mtim = {tv_sec = 1591804356, tv_nsec = 0}, st_ctim = {tv_sec = 1600794204, tv_nsec = 406611702}, __glibc_reserved = {0, 0, 0}}
        rlim = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        config_write_lock = {conf = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, part = WRITE_LOCK, fed = NO_LOCK}
        callbacks = {acct_full = 0x4bd08e <trigger_primary_ctld_acct_full>, dbd_fail = 0x4bd275 <trigger_primary_dbd_fail>, dbd_resumed = 0x4bd2f9 <trigger_primary_dbd_res_op>, db_fail = 0x4bd374 <trigger_primary_db_fail>, db_resumed = 0x4bd3f8 <trigger_primary_db_res_op>}
        prep_callbacks = {prolog_slurmctld = 0x48d36f <prep_prolog_slurmctld_callback>, epilog_slurmctld = 0x48d625 <prep_epilog_slurmctld_callback>}
        create_clustername_file = <optimized out>
        conf_file = <optimized out>
        __func__ = "main"

Created attachment 16372 [details]
workaround (v1)
Michael,
I'm looking into the details. Could you please apply the attached patch as a temporary workaround that should let you start slurmctld.
cheers,
Marcin
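For reference, both traces agree that the crash is a NULL-pointer dereference: `bit_clear (b=0x0, bit=277)` means the job's node bitmap was never allocated when `make_node_idle()` tried to clear a node's bit. A guard of roughly this shape would let the controller skip the bad bitmap instead of segfaulting; this is only an illustrative sketch of the crash pattern, not the actual attached patch (`bitstr_t`, `bit_clear_safe` are hypothetical stand-ins for the Slurm internals):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t bitstr_t;

/* Illustrative mirror of bitstring.c:258, the crashing line:
 *   b[_bit_word(bit)] &= ~_bit_mask(bit);
 * With b == NULL this dereference is the SIGSEGV from the core dump. */
static void bit_clear(bitstr_t *b, int bit)
{
    b[bit / 64] &= ~((bitstr_t)1 << (bit % 64));
}

/* Guarded variant: refuse to touch an unallocated bitmap and report it,
 * so the caller can log an error and continue instead of crashing. */
static int bit_clear_safe(bitstr_t *b, int bit)
{
    if (b == NULL || bit < 0)
        return -1;
    bit_clear(b, bit);
    return 0;
}
```

With such a guard, starting slurmctld against the stale job state would degrade to an error log entry rather than a core dump.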
Could you please elaborate a little bit more on the subject? What are the steps that resulted in your case?
Did the workaround patch work for you?
cheers,
Marcin

Just finished applying the patch.
Cluster working again but some strange results in sinfo are still present:
[root@eslurm1 ~]# squeue -p idealq
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2829 idealq gamess_d aknyaze1 PD 0:00 32 (Resources)
2830 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2831 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2833 idealq gamess_d aknyaze1 PD 0:00 32 (ReqNodeNotAvail, Reserved for maintenance)
2834 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2835 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2836 idealq gamess_d aknyaze1 PD 0:00 16 (ReqNodeNotAvail, Reserved for maintenance)
2837 idealq gamess_d aknyaze1 PD 0:00 32 (ReqNodeNotAvail, Reserved for maintenance)
2838 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2839 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2840 idealq gamess_d aknyaze1 PD 0:00 16 (ReqNodeNotAvail, Reserved for maintenance)
2841 idealq gamess_d aknyaze1 PD 0:00 32 (ReqNodeNotAvail, Reserved for maintenance)
2842 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2843 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2882 idealq gamess_d aknyaze1 PD 0:00 16 (ReqNodeNotAvail, Reserved for maintenance)
2883 idealq gamess_d aknyaze1 PD 0:00 32 (ReqNodeNotAvail, Reserved for maintenance)
2884 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2885 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2886 idealq gamess_d aknyaze1 PD 0:00 16 (ReqNodeNotAvail, Reserved for maintenance)
2887 idealq gamess_d aknyaze1 PD 0:00 32 (ReqNodeNotAvail, Reserved for maintenance)
2888 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2889 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2890 idealq gamess_d aknyaze1 PD 0:00 16 (ReqNodeNotAvail, Reserved for maintenance)
2891 idealq gamess_d aknyaze1 PD 0:00 32 (ReqNodeNotAvail, Reserved for maintenance)
2892 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2893 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2894 idealq gamess_d aknyaze1 PD 0:00 16 (ReqNodeNotAvail, Reserved for maintenance)
2895 idealq gamess_d aknyaze1 PD 0:00 32 (ReqNodeNotAvail, Reserved for maintenance)
2896 idealq gamess_d aknyaze1 PD 0:00 48 (ReqNodeNotAvail, Reserved for maintenance)
2897 idealq gamess_d aknyaze1 PD 0:00 64 (ReqNodeNotAvail, Reserved for maintenance)
2960 idealq hycom_lr msazhin PD 0:00 29 (ReqNodeNotAvail, Reserved for maintenance)
2334 idealq gamess_d aknyaze1 PD 0:00 43 (Priority)
2774 idealq gamess_d aknyaze1 R 5:57 57 eia[076-080,092-097,106-126,154-159,161-179]
2832 idealq gamess_d aknyaze1 R 5:57 16 eia[072,086-090,100-104,145-149]
[root@eslurm1 ~]# scontrol show partition=idealq
PartitionName=idealq
AllowGroups=idealqug AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=60-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=ekln[01-02],eca[001-036,038-040],eia[072-073,075-126,145-152,154-179]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=20208 TotalNodes=129 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
The jobs are not released because of “maintenance”, when in reality the nodes are simply not free.
In the morning I received information jobs were not starting. Reason was soon found – I had left behind a weekly reservation. I deleted the reservation with “scontrol delete reservation=test”, but no change. I decided to kill slurmctld and restart it. That’s when the issues started. Here is the entry of test from mysql database:
MariaDB [slurm_acct_db]> select * from endeavour_resv_table where resv_name='test';
+---------+---------+-----------+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----------+------------+------------+----------+-------------+
| id_resv | deleted | assoclist | flags | nodelist | node_inx | resv_name | time_start | time_end | tres | unused_wall |
+---------+---------+-----------+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----------+------------+------------+----------+-------------+
| 8 | 0 | 1 | 557137 | eca[001-036,038-088],ecnv[01-10],ecx[001-009],ega[001-080],ehx[01-02,04],eia[181-275,278-301,303-304,306-370,501-511,513-536,538-540],ekln[01-02],elb[01-04],elc[02-04],elh[01-04],els[02-05],enb[101-108,201-216,301-316,401-408,501-524],epb[001-252,254-360,501-519,555-584,587-590,601-618,701-718,801-818],epx[03-13,15-17],est[01-02] | 0-983 | test | 1603206000 | 1603206300 | 1=112392 | 300 |
+---------+---------+-----------+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+-----------+------------+------------+----------+-------------+
1 row in set (0.000 sec)
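The `node_inx` column above (`0-983`) is the reservation's node-index footprint at the time it was saved. If the node table changes between the save and a controller restart, indices replayed from the database can point at missing or different nodes. A hedged sketch of the kind of bounds check such restore code needs (the function name `resv_range_valid` is hypothetical, not a Slurm API):

```c
#include <stdbool.h>

/* A persisted node-index range "first-last" is only safe to replay
 * against the current node table if every index still exists. */
static bool resv_range_valid(int inx_first, int inx_last,
                             int node_record_count)
{
    return inx_first >= 0 &&
           inx_first <= inx_last &&
           inx_last < node_record_count;
}
```

For the row above, a cluster with at least 984 node records passes the check; any shrinkage of the node table would make replaying the saved range unsafe.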
> Just finished applying the patch
> Cluster working again but some strange results in sinfo are still present:

That's positive. Are you OK with lowering the ticket severity to 3 now?

> The jobs are not released because of “maintenance”, when in reality the nodes are simply not free.

This is a known misleading value of the state reason (Bug 9720), which should not impact job scheduling.

> In the morning I received information jobs were not starting. Reason was soon found – I had left behind a weekly reservation. I deleted the reservation with “scontrol delete reservation=test”, but no change. I decided to kill slurmctld and restart it. That’s when the issues started.

Were there any changes to the partition configuration in the meantime?
cheers,
Marcin

No changes to partitions AFAIK.

Any idea when the state update bug will be fixed?
Thanks
Michael

Are you OK with lowering the ticket severity to 3 now?

> Any idea when the state update bug will be fixed?

That should rather go to the mentioned bug. In short: there is a patch there being actively reviewed by our QA team, and a workaround mentioned in bug 9720 comment 7.
cheers,
Marcin

Michael - I am lowering the priority since the cluster is up and running again, and we have a workaround for the other issue you brought up. Let us know if you have questions.

So you think I can run with the patch in place?

(In reply to Michael Hebenstreit from comment #12)
> So you think I can run with the patch in place?

Yes, the patch should resolve the crashing, but we will likely have a different long-term solution. I'm going to lower this to SEV4 as it is now a research bug. Please respond if you have any more issues and we can increase the severity.
Thanks,
--Nate

Michael,
The workaround you applied is safe and you don't have to remove it. However, the final patch that landed in 21.08 is a little bit different, commits: 8410df4e0deb0670..19e8311e74ae2cb. I'm closing the case as fixed now.
best regards,
Marcin
Created attachment 16371 [details]
output from starting slurmctld as root with option -v -v -v -D

Oct 27 07:54:08 eslurm1 systemd-coredump[11882]: Process 11874 (slurmctld) of user 509 dumped core.

Stack trace of thread 11874:
#0  0x000014a0fc542e80 bit_clear (libslurmfull.so)
#1  0x000000000047979c make_node_idle (slurmctld)
#2  0x0000000000446c43 excise_node_from_job (slurmctld)
#3  0x00000000004a2816 _sync_nodes_to_active_job (slurmctld)
#4  0x000000000042e554 main (slurmctld)
#5  0x000014a0fbaf46a3 __libc_start_main (libc.so.6)
#6  0x0000000000418c1e _start (slurmctld)

Stack trace of thread 11879:
#0  0x000014a0fbea17da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000014a0f2987110 _my_sleep (sched_backfill.so)
#2  0x000014a0f298e3a8 backfill_agent (sched_backfill.so)
#3  0x000014a0fbe9b2de start_thread (libpthread.so.0)
#4  0x000014a0fbbcce83 __clone (libc.so.6)

Stack trace of thread 11878:
#0  0x000014a0fbbc1f21 __poll (libc.so.6)
#1  0x000014a0fc5c6b77 _conn_readable (libslurmfull.so)
#2  0x000014a0fc5c80cd slurm_persist_recv_msg (libslurmfull.so)
#3  0x000014a0f81ca29e _handle_mult_rc_ret (accounting_storage_slurmdbd.so)
#4  0x000014a0fbe9b2de start_thread (libpthread.so.0)
#5  0x000014a0fbbcce83 __clone (libc.so.6)

Stack trace of thread 11875:
#0  0x000014a0fbea17da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x0000000000421fcb _agent_init (slurmctld)
#2  0x000014a0fbe9b2de start_thread (libpthread.so.0)
#3  0x000014a0fbbcce83 __clone (libc.so.6)

Stack trace of thread 11876:
#0  0x000014a0fbea17da pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0)
#1  0x000014a0f81c4073 _set_db_inx_thread (accounting_storage_slurmdbd.so)
#2  0x000014a0fbe9b2de start_thread (libpthread.so.0)
#3  0x000014a0fbbcce83 __clone (libc.so.6)