Ticket 5457

Summary: SlurmCtld Segfaulted
Product: Slurm Reporter: Steve Ford <fordste5>
Component: slurmctld    Assignee: Jason Booth <jbooth>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: brian
Version: 17.11.7   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=5474
Site: MSU
Version Fixed: 17.11.9
Attachments: Slurmctld log
Messages log
slurmctld log from 7-24
Job submit script
Slurm Config File
css-033 logs
css-033 logs
css-076 logs
Patch 5457

Description Steve Ford 2018-07-20 12:40:58 MDT
Our slurmctld daemon segfaulted. I was able to restart it and it's running now. Any idea what caused it to crash?

Here is the backtrace:

Thread 15 (Thread 0x7f64774b7700 (LWP 8071)):
#0  0x00007f647ab3156d in nanosleep () from /lib64/libc.so.6
#1  0x00007f647ab31404 in sleep () from /lib64/libc.so.6
#2  0x00007f64774bd745 in _decay_thread (no_data=<optimized out>) at priority_multifactor.c:1333
#3  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f647ab6abad in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7f64770b3700 (LWP 8073)):
#0  0x00007f647ab61c73 in select () from /lib64/libc.so.6
#1  0x0000000000425716 in _slurmctld_rpc_mgr (no_data=<optimized out>) at controller.c:1026
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f6476eb1700 (LWP 8075)):
#0  0x00007f647ae44995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004a2221 in slurmctld_state_save (no_data=<optimized out>) at state_save.c:204
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f6476fb2700 (LWP 8074)):
#0  0x00007f647ae48461 in sigwait () from /lib64/libpthread.so.0
#1  0x0000000000429b81 in _slurmctld_signal_hand (no_data=<optimized out>) at controller.c:891
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f6476caf700 (LWP 8077)):
#0  0x00007f647ae44995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000423cd6 in _purge_files_thread (no_data=<optimized out>) at controller.c:3182
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f64773b6700 (LWP 8072)):
#0  0x00007f647ae41f97 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f64774baee5 in _cleanup_thread (no_data=<optimized out>) at priority_multifactor.c:1462
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f64786e0700 (LWP 8022)):
#0  0x00007f647ab3156d in nanosleep () from /lib64/libc.so.6
#1  0x00007f647ab31404 in sleep () from /lib64/libc.so.6
#2  0x00007f64786e4928 in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:437
#3  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f647ab6abad in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f64777c3700 (LWP 8062)):
#0  0x00007f647ae44d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000431384 in _fed_job_update_thread (arg=<optimized out>) at fed_mgr.c:2161
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f64778c4700 (LWP 8061)):
#0  0x00007f647ae44d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000042d4c3 in _agent_thread (arg=<optimized out>) at fed_mgr.c:2203
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f6477bc7700 (LWP 8053)):
#0  0x00007f647ae44d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6477bcbc46 in _my_sleep (usec=30000000) at backfill.c:540
#2  0x00007f6477bd2062 in backfill_agent (args=<optimized out>) at backfill.c:876
#3  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f647ab6abad in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f64782da700 (LWP 8025)):
#0  0x00007f647ae44d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f647b39c4e5 in _agent (x=<optimized out>) at slurmdbd_defs.c:1988
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f64785df700 (LWP 8023)):
#0  0x00007f647ae41f97 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f64786e4070 in _cleanup_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:445
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f647b827740 (LWP 8020)):
#0  0x00007f647ab3156d in nanosleep () from /lib64/libc.so.6
#1  0x00007f647ab62404 in usleep () from /lib64/libc.so.6
#2  0x0000000000428376 in _slurmctld_background (no_data=0x0) at controller.c:1778
#3  main (argc=<optimized out>, argv=<optimized out>) at controller.c:604

Thread 2 (Thread 0x7f647b826700 (LWP 8021)):
#0  0x00007f647ae44d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000041e477 in _agent_init (arg=<optimized out>) at agent.c:1313
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f6418f0f700 (LWP 2118)):
#0  0x00007f647aaa2277 in raise () from /lib64/libc.so.6
#1  0x00007f647aaa3968 in abort () from /lib64/libc.so.6
#2  0x00007f647aa9b096 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f647aa9b142 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f647b2c097a in bit_test (b=<optimized out>, bit=bit@entry=166) at bitstring.c:228
#5  0x0000000000449a8f in validate_jobs_on_node (reg_msg=reg_msg@entry=0x7f6450006b10) at job_mgr.c:13934
#6  0x000000000048dc51 in _slurm_rpc_node_registration (running_composite=false, msg=0x7f6418f0ee50) at proc_req.c:3076
#7  slurmctld_req (msg=msg@entry=0x7f6418f0ee50, arg=arg@entry=0x7f6468019510) at proc_req.c:407
#8  0x0000000000424f28 in _service_connection (arg=0x7f6468019510) at controller.c:1125
#9  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f647ab6abad in clone () from /lib64/libc.so.6
Comment 2 Jason Booth 2018-07-20 13:16:02 MDT
Hi Steve,

 Please send in the output of the following.

While in gdb, run: 'thread 1', 'frame 5', 'info locals', and 'thread apply all bt full'.

It would also be good to know if you have ECC enabled on this system. If so, have any errors been reported? You may also want to look at the output of 'dmesg' and check /var/log/messages for any entries around the time of the event.
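For reference, the requested commands can be collected in one pass with gdb's batch mode. A sketch only; the binary path and core file name here are assumptions and will differ per install:

```
gdb -batch /usr/sbin/slurmctld core.8020 \
    -ex 'thread 1' \
    -ex 'frame 5' \
    -ex 'info locals' \
    -ex 'thread apply all bt full' > slurmctld-bt-full.txt 2>&1
```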

Kind regards,
Jason
Comment 3 Steve Ford 2018-07-23 08:05:42 MDT
Here is the output from gdb:

[Switching to thread 1 (Thread 0x7f6418f0f700 (LWP 2118))]
#0  0x00007f647aaa2277 in raise () from /lib64/libc.so.6
#5  0x0000000000449a8f in validate_jobs_on_node (reg_msg=reg_msg@entry=0x7f6450006b10) at job_mgr.c:13934
13934 job_mgr.c: No such file or directory.
i = 0
node_inx = 166
jobs_on_node = <optimized out>
node_ptr = 0x29f0a40
job_ptr = 0x7f6450015a70
step_ptr = <optimized out>
step_str = "Z%R[\000\000\000\000\256\305\b\000\000\000\000\000\020\000\000\000[\000\000\000 \352\360\030d\177\000\000\340\351\360\030d\177\000\000 \212\000Pd\177\000\000[\224\000Pd\177\000\000\346mF\000\000\000\000"
now = 1532110170
__func__ = "validate_jobs_on_node"

Thread 15 (Thread 0x7f64774b7700 (LWP 8071)):
#0  0x00007f647ab3156d in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f647ab31404 in sleep () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007f64774bd745 in _decay_thread (no_data=<optimized out>) at priority_multifactor.c:1333
        start_time = 1532109903
        last_reset = 1527873057
        next_reset = 1546318800
        calc_period = 300
        decay_hl = <optimized out>
        reset_period = 6
        now = 1532109903
        run_delta = <optimized out>
        real_decay = <optimized out>
        elapsed = <optimized out>
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = READ_LOCK, partition = READ_LOCK, federation = NO_LOCK}
        locks = {assoc = WRITE_LOCK, file = NO_LOCK, qos = NO_LOCK, res = NO_LOCK, tres = NO_LOCK, user = NO_LOCK, wckey = NO_LOCK}
        __func__ = "_decay_thread"
#3  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 14 (Thread 0x7f64770b3700 (LWP 8073)):
#0  0x00007f647ab61c73 in select () from /lib64/libc.so.6
No symbol table info available.
#1  0x0000000000425716 in _slurmctld_rpc_mgr (no_data=<optimized out>) at controller.c:1026
        max_fd = 7
        newsockfd = <optimized out>
        sockfd = 0x7f64680008d0
        cli_addr = {sin_family = 2, sin_port = 53453, sin_addr = {s_addr = 2819008704}, sin_zero = "\000\000\000\000\000\000\000"}
        srv_addr = {sin_family = 2, sin_port = 41242, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}
        port = 41242
        ip = "0.0.0.0", '\000' <repeats 24 times>
        fd_next = 0
        i = <optimized out>
        nports = 1
        rfds = {__fds_bits = {128, 0 <repeats 15 times>}}
        conn_arg = <optimized out>
        config_read_lock = {config = READ_LOCK, job = NO_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        sigarray = {10, 0}
        node_addr = <optimized out>
        __func__ = "_slurmctld_rpc_mgr"
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 13 (Thread 0x7f6476eb1700 (LWP 8075)):
#0  0x00007f647ae44995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00000000004a2221 in slurmctld_state_save (no_data=<optimized out>) at state_save.c:204
        err = <optimized out>
        last_save = 1532110141
        now = 1532110141
        save_delay = <optimized out>
        run_save = <optimized out>
        save_count = 0
        __func__ = "slurmctld_state_save"
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 12 (Thread 0x7f6476fb2700 (LWP 8074)):
#0  0x00007f647ae48461 in sigwait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x0000000000429b81 in _slurmctld_signal_hand (no_data=<optimized out>) at controller.c:891
        sig = 0
        i = <optimized out>
        rc = <optimized out>
        sig_array = {2, 15, 1, 6, 12, 0}
        set = {__val = {18467, 0 <repeats 15 times>}}
        __func__ = "_slurmctld_signal_hand"
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 11 (Thread 0x7f6476caf700 (LWP 8077)):
#0  0x00007f647ae44995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x0000000000423cd6 in _purge_files_thread (no_data=<optimized out>) at controller.c:3182
        err = <optimized out>
        job_id = 0x0
        __func__ = "_purge_files_thread"
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 10 (Thread 0x7f64773b6700 (LWP 8072)):
#0  0x00007f647ae41f97 in pthread_join () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f64774baee5 in _cleanup_thread (no_data=<optimized out>) at priority_multifactor.c:1462
No locals.
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 9 (Thread 0x7f64786e0700 (LWP 8022)):
#0  0x00007f647ab3156d in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f647ab31404 in sleep () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007f64786e4928 in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:437
        local_job_list = <optimized out>
        job_ptr = <optimized out>
        itr = <optimized out>
        job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        __func__ = "_set_db_inx_thread"
#3  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 8 (Thread 0x7f64777c3700 (LWP 8062)):
#0  0x00007f647ae44d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x0000000000431384 in _fed_job_update_thread (arg=<optimized out>) at fed_mgr.c:2161
        err = <optimized out>
        ts = {tv_sec = 1532110171, tv_nsec = 0}
        job_update_info = <optimized out>
        __func__ = "_fed_job_update_thread"
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 7 (Thread 0x7f64778c4700 (LWP 8061)):
#0  0x00007f647ae44d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000000000042d4c3 in _agent_thread (arg=<optimized out>) at fed_mgr.c:2203
        err = <optimized out>
        cluster = <optimized out>
        ts = {tv_sec = 1532110171, tv_nsec = 0}
        rpc_rec = <optimized out>
        req_msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x0, body_offset = 0, buffer = 0x0, conn = 0x0, conn_fd = 0, data = 0x0, data_size = 0, flags = 0, msg_index = 0, msg_type = 0, protocol_version = 0, forward = {cnt = 0, init = 0, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
        resp_msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x0, body_offset = 0, buffer = 0x0, conn = 0x0, conn_fd = 0, data = 0x0, data_size = 0, flags = 0, msg_index = 0, msg_type = 0, protocol_version = 0, forward = {cnt = 0, init = 0, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
        ctld_req_msg = {my_list = 0x0}
        success_bits = <optimized out>
        rc = <optimized out>
        resp_inx = <optimized out>
        success_size = <optimized out>
        fed_read_lock = {config = NO_LOCK, job = NO_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = READ_LOCK}
        __func__ = "_agent_thread"
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 6 (Thread 0x7f6477bc7700 (LWP 8053)):
#0  0x00007f647ae44d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f6477bcbc46 in _my_sleep (usec=30000000) at backfill.c:540
        err = <optimized out>
        nsec = <optimized out>
        sleep_time = 0
        ts = {tv_sec = 1532110187, tv_nsec = 808441000}
        tv1 = {tv_sec = 1532110157, tv_usec = 808441}
        tv2 = {tv_sec = 0, tv_usec = 0}
        __func__ = "_my_sleep"
#2  0x00007f6477bd2062 in backfill_agent (args=<optimized out>) at backfill.c:876
        now = <optimized out>
        wait_time = <optimized out>
        last_backfill_time = 1532110157
        all_locks = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = READ_LOCK, federation = READ_LOCK}
        load_config = <optimized out>
        short_sleep = false
        backfill_cnt = 556
        __func__ = "backfill_agent"
#3  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4  0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 5 (Thread 0x7f64782da700 (LWP 8025)):
#0  0x00007f647ae44d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f647b39c4e5 in _agent (x=<optimized out>) at slurmdbd_defs.c:1988
        err = <optimized out>
        cnt = <optimized out>
        rc = <optimized out>
        buffer = <optimized out>
        abs_time = {tv_sec = 1532110173, tv_nsec = 0}
        fail_time = 0
        sigarray = {10, 0}
        list_req = {msg_type = 1474, data = 0x7f64782d9eb0}
        list_msg = {my_list = 0x0, return_code = 0}
        __func__ = "_agent"
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 4 (Thread 0x7f64785df700 (LWP 8023)):
#0  0x00007f647ae41f97 in pthread_join () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f64786e4070 in _cleanup_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:445
No locals.
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 3 (Thread 0x7f647b827740 (LWP 8020)):
#0  0x00007f647ab3156d in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f647ab62404 in usleep () from /lib64/libc.so.6
No symbol table info available.
#2  0x0000000000428376 in _slurmctld_background (no_data=0x0) at controller.c:1778
        i = 8
        job_limit = <optimized out>
        delta_t = 27
        last_full_sched_time = 1532110143
        last_ctld_bu_ping = 1532110169
        last_uid_update = 1532106905
        last_reboot_msg_time = 1532092504
        ping_interval = 100
        job_read_lock = {config = READ_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        job_write_lock = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = READ_LOCK, federation = READ_LOCK}
        job_node_read_lock = {config = NO_LOCK, job = READ_LOCK, node = READ_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        last_group_time = 1532109903
        last_acct_gather_node_time = 1532092503
        last_ext_sensors_time = 1532092503
        last_resv_time = 1532110165
        tv1 = {tv_sec = 1532110169, tv_usec = 697263}
        node_write_lock2 = {config = NO_LOCK, job = NO_LOCK, node = WRITE_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        last_timelimit_time = 1532110145
        last_assert_primary_time = 1532092503
        purge_job_interval = 60
        tv2 = {tv_sec = 1532110169, tv_usec = 697290}
        config_read_lock = {config = READ_LOCK, job = NO_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        node_write_lock = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        last_purge_job_time = 1532110143
        last_node_acct = 1532109903
        no_resp_msg_interval = <optimized out>
        tv_str = "usec=27\000\000\065\000\000\000\000\000\000\000\000\000"
        job_write_lock2 = {config = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
        last_no_resp_msg_time = 1532109903
        now = <optimized out>
        last_sched_time = 1532110143
        last_ping_node_time = 1532110107
        part_write_lock = {config = NO_LOCK, job = NO_LOCK, node = NO_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
        last_health_check_time = 1532110158
        last_checkpoint_time = 1532110083
        last_ping_srun_time = 1532092503
        last_trigger = 1532110169
#3  main (argc=<optimized out>, argv=<optimized out>) at controller.c:604
        cnt = <optimized out>
        error_code = <optimized out>
        i = 3
        stat_buf = {st_dev = 64769, st_ino = 33752695, st_nlink = 1, st_mode = 33261, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 392880, st_blksize = 4096, st_blocks = 768, st_atim = {tv_sec = 1532024690, tv_nsec = 668992174}, st_mtim = {tv_sec = 1523430473, tv_nsec = 0}, st_ctim = {tv_sec = 1531333934, tv_nsec = 727046754}, __unused = {0, 0, 0}}
        rlim = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        config_write_lock = {config = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
        node_part_write_lock = {config = NO_LOCK, job = NO_LOCK, node = WRITE_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
        callbacks = {acct_full = 0x4acad5 <trigger_primary_ctld_acct_full>, dbd_fail = 0x4acce4 <trigger_primary_dbd_fail>, dbd_resumed = 0x4acd72 <trigger_primary_dbd_res_op>, db_fail = 0x4acdf7 <trigger_primary_db_fail>, db_resumed = 0x4ace85 <trigger_primary_db_res_op>}
        create_clustername_file = 120
        __func__ = "main"

Thread 2 (Thread 0x7f647b826700 (LWP 8021)):
#0  0x00007f647ae44d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000000000041e477 in _agent_init (arg=<optimized out>) at agent.c:1313
        err = <optimized out>
        min_wait = <optimized out>
        mail_too = <optimized out>
        ts = {tv_sec = 1532110171, tv_nsec = 0}
        __func__ = "_agent_init"
#2  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.

Thread 1 (Thread 0x7f6418f0f700 (LWP 2118)):
#0  0x00007f647aaa2277 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f647aaa3968 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007f647aa9b096 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3  0x00007f647aa9b142 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4  0x00007f647b2c097a in bit_test (b=<optimized out>, bit=bit@entry=166) at bitstring.c:228
        __PRETTY_FUNCTION__ = "bit_test"
#5  0x0000000000449a8f in validate_jobs_on_node (reg_msg=reg_msg@entry=0x7f6450006b10) at job_mgr.c:13934
        i = 0
        node_inx = 166
        jobs_on_node = <optimized out>
        node_ptr = 0x29f0a40
        job_ptr = 0x7f6450015a70
        step_ptr = <optimized out>
        step_str = "Z%R[\000\000\000\000\256\305\b\000\000\000\000\000\020\000\000\000[\000\000\000 \352\360\030d\177\000\000\340\351\360\030d\177\000\000 \212\000Pd\177\000\000[\224\000Pd\177\000\000\346mF\000\000\000\000"
        now = 1532110170
        __func__ = "validate_jobs_on_node"
#6  0x000000000048dc51 in _slurm_rpc_node_registration (running_composite=false, msg=0x7f6418f0ee50) at proc_req.c:3076
        tv1 = {tv_sec = 1532110170, tv_usec = 575017}
        tv_str = '\000' <repeats 19 times>
        delta_t = 140068815669792
        error_code = <optimized out>
        tv2 = {tv_sec = 390842023984, tv_usec = 140067891899248}
        newly_up = false
        node_reg_stat_msg = 0x7f6450006b10
        job_write_lock = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK, federation = READ_LOCK}
        uid = 0
#7  slurmctld_req (msg=msg@entry=0x7f6418f0ee50, arg=arg@entry=0x7f6468019510) at proc_req.c:407
        tv1 = {tv_sec = 1532110170, tv_usec = 575017}
        tv2 = {tv_sec = 8589934593, tv_usec = 2}
        tv_str = '\000' <repeats 19 times>
        delta_t = 390842023984
        i = 0
        rpc_type_index = 6
        rpc_user_index = 0
        rpc_uid = <optimized out>
        __func__ = "slurmctld_req"
#8  0x0000000000424f28 in _service_connection (arg=0x7f6468019510) at controller.c:1125
        conn = 0x7f6468019510
        msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x7f6450000a10, body_offset = 173, buffer = 0x7f645002d430, conn = 0x0, conn_fd = 4, data = 0x7f6450006b10, data_size = 0, flags = 0, msg_index = 0, msg_type = 1002, protocol_version = 8192, forward = {cnt = 0, init = 65534, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
        __func__ = "_service_connection"
#9  0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#10 0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.


Our slurm server is a virtual machine. No ECC errors have been reported by the host system.

Dmesg shows a number of processes that hung in XFS calls. We had some network issues that day that may have affected the VM host's ability to access its back-end storage. Slurmctld was not one of the hung processes.
Comment 4 Jason Booth 2018-07-23 10:50:03 MDT
Hi Steve,

 Would you also attach your slurmctld.log and the /var/log/messages from that day?

Kind regards,
Jason
Comment 6 Jason Booth 2018-07-23 11:50:51 MDT
Hi Steve,

In addition to my last email would you also include:

thread 1
frame 5
p *job_ptr


The backtrace shows that node_bitmap is optimized out:

#4  0x00007f647b2c097a in bit_test (b=<optimized out>, bit=bit@entry=166) at
bitstring.c:228
        __PRETTY_FUNCTION__ = "bit_test"


It'd be nice to know if it was null or just corrupted.

Kind regards,
Jason
Comment 7 Steve Ford 2018-07-24 10:30:27 MDT
Created attachment 7382 [details]
Slurmctld log
Comment 8 Steve Ford 2018-07-24 10:32:22 MDT
Created attachment 7384 [details]
Messages log
Comment 9 Steve Ford 2018-07-24 10:32:45 MDT
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f6418f0f700 (LWP 2118))]
#0  0x00007f647aaa2277 in raise () from /lib64/libc.so.6
(gdb) frame 5
#5  0x0000000000449a8f in validate_jobs_on_node (reg_msg=reg_msg@entry=0x7f6450006b10) at job_mgr.c:13934
13934	job_mgr.c: No such file or directory.
(gdb) p *job_ptr
$1 = {account = 0x7f64500161d0 "classres", admin_comment = 0x0, alias_list = 0x0, 
  alloc_node = 0x7f64500161a0 "dev-intel18", alloc_resp_port = 51895, alloc_sid = 21447, array_job_id = 0, 
  array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1607, assoc_ptr = 0x264cad0, batch_flag = 0, 
  batch_host = 0x7f645000fed0 "css-033", billable_tres = 2, bit_flags = 16384, burst_buffer = 0x0, 
  burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, 
  cpu_cnt = 2, cr_enabled = 1, db_index = 101612, deadline = 0, delay_boot = 0, derived_ec = 0, 
  details = 0x7f6450015e20, direct_set_prio = 0, end_time = 1532113356, end_time_exp = 1532113356, 
  epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, 
  gres_list = 0x0, gres_alloc = 0x7f645000fe70 "", gres_detail_cnt = 0, gres_detail_str = 0x0, 
  gres_req = 0x7f6450012140 "", gres_used = 0x0, group_id = 2103, job_id = 6256, job_next = 0x0, 
  job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1, 
  last_sched_eval = 1532109756, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, 
    tres = 0x7f6450015650}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, 
  name = 0x7f6450000f50 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7f6450012160 "css-[033,076]", 
  node_addr = 0x7f645000ef90, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 2, node_cnt_wag = 2, 
  nodes_completing = 0x0, origin_cluster = 0x7f6450001720 "msuhpcc", other_port = 51894, pack_job_id = 0, 
  pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, 
  partition = 0x7f645000ef00 "classres,general-short", part_ptr_list = 0x7f64480bc270, part_nodes_missing = false, 
  part_ptr = 0x29a9d40, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, 
  priority = 101, priority_array = 0x7f6450016340, prio_factors = 0x7f6450015fd0, profile = 0, qos_id = 0, 
  qos_ptr = 0x0, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, 
  resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x7f6450016200 "192.168.9.200", 
  sched_nodes = 0x0, select_jobinfo = 0x7f6450016260, spank_job_env = 0x0, spank_job_env_size = 0, 
  start_protocol_ver = 8192, start_time = 1532109756, 
  state_desc = 0x7f645000f8d0 "ReqNodeNotAvail, UnavailableNodes:csm-[000-023],csn-[000-039],csp-[001-027],css-[001-032,034-075,078-127],lac-[000-197,199-337,340-371,373-445],vim-[000-002]", state_reason = 15, 
  state_reason_prev = 15, step_list = 0x290f030, suspend_time = 0, time_last_active = 1532109756, time_limit = 60, 
  time_min = 0, tot_sus_time = 0, total_cpus = 2, total_nodes = 2, tres_req_cnt = 0x7f64500156b0, 
  tres_req_str = 0x7f64500160e0 "1=2,2=6554,4=2", tres_fmt_req_str = 0x7f6450016140 "cpu=2,mem=6554M,node=2", 
  tres_alloc_cnt = 0x7f645000ee80, tres_alloc_str = 0x7f645000ff40 "1=2,2=6554,3=18446744073709551614,4=2,5=2", 
  tres_fmt_alloc_str = 0x7f645000efd0 "cpu=2,mem=6554M,node=2,billing=2", user_id = 804793, 
  user_name = 0x7f6450015680 "changc81", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, 
  wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
Comment 10 Jason Booth 2018-07-24 11:38:14 MDT
Hi Steve,

 After reviewing the logs you uploaded and the traces, I believe this was caused by changes to the cluster configuration made without restarting the slurmd and slurmctld processes.

You will notice a number of errors in the slurmctld logs like the following:

 error: Node lac-338 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.


Making changes to the slurm.conf and not restarting the processes will result in strange behavior similar to what you are seeing. We therefore suggest that you make sure each node has the same slurm.conf as the controller, then restart all the slurmd processes and the slurmctld process.

 We would also like to know whether Slurm is being managed by another utility such as Bright Cluster Manager. This will help us know if these changes are being made somewhere other than by the admins.
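The recommended workflow amounts to syncing slurm.conf everywhere and restarting every daemon. A sketch assuming systemd-managed services (unit names vary per site):

```
# On every compute node, after distributing the identical slurm.conf:
systemctl restart slurmd

# On the controller:
systemctl restart slurmctld

# Only if intentionally running differing configs, the hash check can be
# silenced by adding this line to slurm.conf (per the error message above):
#   DebugFlags=NO_CONF_HASH
```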


Kind regards,
Jason
Comment 11 Steve Ford 2018-07-24 12:30:12 MDT
All our Slurm clients and the Slurm server are managed with Puppet, so the daemons are restarted shortly after the configuration is updated.

We added some more nodes to the config file this morning, which may explain the log messages. I don't think that's what caused the segfaults, though, at least not all of them. Segfaults occurred at 12:39 and 1:20, several hours after the config files were changed and the services restarted. I don't see any of those log messages around those times either. Could something else have caused these?
Comment 12 Steve Ford 2018-07-24 13:04:54 MDT
We just saw another segfault. The config files have not changed.

#0  0x00007f0f7beaf277 in raise () from /lib64/libc.so.6
#1  0x00007f0f7beb0968 in abort () from /lib64/libc.so.6
#2  0x00007f0f7bea8096 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f0f7bea8142 in __assert_fail () from /lib64/libc.so.6
#4  0x00007f0f7c6cd97a in bit_test (b=<optimized out>, bit=bit@entry=416) at bitstring.c:228
#5  0x0000000000449de6 in _purge_missing_jobs (now=1532458700, node_inx=<optimized out>) at job_mgr.c:14059
#6  validate_jobs_on_node (reg_msg=reg_msg@entry=0x7f0f340171c0) at job_mgr.c:14014
#7  0x000000000048dc51 in _slurm_rpc_node_registration (running_composite=false, msg=0x7f0f19352e50)
    at proc_req.c:3076
#8  slurmctld_req (msg=msg@entry=0x7f0f19352e50, arg=arg@entry=0x7f0f680008f0) at proc_req.c:407
#9  0x0000000000424f28 in _service_connection (arg=0x7f0f680008f0) at controller.c:1125
#10 0x00007f0f7c24de25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f0f7bf77bad in clone () from /lib64/libc.so.6
Comment 13 Jason Booth 2018-07-24 13:32:06 MDT
Hi Steve,

Could I have you gather some output from the last backtrace? Find the thread containing either:

_purge_missing_jobs
validate_jobs_on_node

For example:

thread 1
frame 5
p *job_ptr

We are curious to know whether this may be caused by the same job, 6256.

Kind regards,
Jason
Comment 15 Steve Ford 2018-07-25 07:57:44 MDT
Here is thread 1, frame 5, p *job_ptr for all of the core dumps.

PID 37953 __assert_fail, _purge_missing_jobs
$1 = {account = 0x7f70cc001680 "test1", admin_comment = 0x0, alias_list = 0x0, 
  alloc_node = 0x7f70cc0e0900 "dev-intel18", alloc_resp_port = 51010, alloc_sid = 14161, array_job_id = 0, 
  array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1646, assoc_ptr = 0x7f70cc008cc0, batch_flag = 0, 
  batch_host = 0x7f70cc0c2200 "css-077", billable_tres = 18, bit_flags = 16384, burst_buffer = 0x0, 
  burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, 
  cpu_cnt = 18, cr_enabled = 1, db_index = 102622, deadline = 0, delay_boot = 0, derived_ec = 0, 
  details = 0x7f70cc0e0580, direct_set_prio = 0, end_time = 1532455676, end_time_exp = 1532455676, 
  epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, 
  gres_list = 0x0, gres_alloc = 0x7f70cc0c2240 "", gres_detail_cnt = 0, gres_detail_str = 0x0, 
  gres_req = 0x7f70cc0c21c0 "", gres_used = 0x0, group_id = 2000, job_id = 6777, job_next = 0x0, 
  job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1, 
  last_sched_eval = 1532441276, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, 
    tres = 0x7f70cc0e5450}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, 
  name = 0x7f70cc002190 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7f70cc0c21e0 "css-077", 
  node_addr = 0x7f70cc0c25d0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 1, node_cnt_wag = 1, 
  nodes_completing = 0x0, origin_cluster = 0x7f70cc00b110 "msuhpcc", other_port = 51009, pack_job_id = 0, 
  pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, 
  partition = 0x7f70cc0c20a0 "general-short-18,test1-14,test1-16,test1-18,general-short-14,general-short-16", 
  part_ptr_list = 0x7f70cc0bb160, part_nodes_missing = false, part_ptr = 0x1ba9270, power_flags = 0 '\000', 
  pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 100101, 
  priority_array = 0x7f70cc0e58b0, prio_factors = 0x7f70cc0e0730, profile = 0, qos_id = 1, qos_ptr = 0x1842350, 
  qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, 
  resv_ptr = 0x0, requid = 4294967295, resp_host = 0x7f70cc0e0930 "192.168.9.200", sched_nodes = 0x0, 
  select_jobinfo = 0x7f70cc0e09a0, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, 
  start_time = 1532441276, state_desc = 0x0, state_reason = 20, state_reason_prev = 20, step_list = 0x1afb8b0, 
  suspend_time = 0, time_last_active = 1532441276, time_limit = 240, time_min = 0, tot_sus_time = 0, 
  total_cpus = 18, total_nodes = 1, tres_req_cnt = 0x7f70cc0e54f0, 
  tres_req_str = 0x7f70cc0e0840 "1=18,2=58986,4=1", tres_fmt_req_str = 0x7f70cc0e08a0 "cpu=18,mem=58986M,node=1", 
  tres_alloc_cnt = 0x7f70cc0c2580, tres_alloc_str = 0x7f70cc0c2600 "1=18,2=58986,3=18446744073709551614,4=1,5=18", 
  tres_fmt_alloc_str = 0x7f70cc0c2770 "cpu=18,mem=58986M,node=1,billing=18", user_id = 175025, 
  user_name = 0x7f70cc0c2260 "jal", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, 
  wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}

PID 55525 __assert_fail, _purge_missing_jobs
$1 = {account = 0x7f48b4002e60 "test1", admin_comment = 0x0, alias_list = 0x0, 
  alloc_node = 0x7f48b400be80 "dev-intel18", alloc_resp_port = 52268, alloc_sid = 27577, array_job_id = 0, 
  array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1646, assoc_ptr = 0xaf5880, batch_flag = 0, 
  batch_host = 0x7f48b400dd80 "css-077", billable_tres = 10, bit_flags = 16384, burst_buffer = 0x0, 
  burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, 
  cpu_cnt = 10, cr_enabled = 1, db_index = 102674, deadline = 0, delay_boot = 0, derived_ec = 0, 
  details = 0x7f48b400bb00, direct_set_prio = 0, end_time = 1532457014, end_time_exp = 1532457014, 
  epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, 
  gres_list = 0x0, gres_alloc = 0x7f48b400ddc0 "", gres_detail_cnt = 0, gres_detail_str = 0x0, 
  gres_req = 0x7f48b400e830 "", gres_used = 0x0, group_id = 2000, job_id = 6805, job_next = 0x0, 
  job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1, 
  last_sched_eval = 1532442614, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, 
    tres = 0x7f48b400b2f0}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, 
  name = 0x7f48b4002670 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7f48b400e850 "css-077", 
  node_addr = 0x7f48b400e120, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 1, node_cnt_wag = 1, 
  nodes_completing = 0x0, origin_cluster = 0x7f48b40042c0 "msuhpcc", other_port = 52267, pack_job_id = 0, 
  pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, 
  partition = 0x7f48b400e7b0 "general-short-18,test1-14,test1-16,test1-18,general-short-14,general-short-16", 
  part_ptr_list = 0xdb7c20, part_nodes_missing = false, part_ptr = 0xe557a0, power_flags = 0 '\000', 
  pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 18091, 
  priority_array = 0x7f48b400bff0, prio_factors = 0x7f48b400bcb0, profile = 0, qos_id = 1, qos_ptr = 0xaee2a0, 
  qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, 
  resv_ptr = 0x0, requid = 4294967295, resp_host = 0x7f48b400beb0 "192.168.9.200", sched_nodes = 0x0, 
  select_jobinfo = 0x7f48b400bf10, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, 
  start_time = 1532442614, state_desc = 0x0, state_reason = 20, state_reason_prev = 20, step_list = 0xda7bb0, 
  suspend_time = 0, time_last_active = 1532442614, time_limit = 240, time_min = 0, tot_sus_time = 0, 
  total_cpus = 10, total_nodes = 1, tres_req_cnt = 0x7f48b400b390, 
  tres_req_str = 0x7f48b400bdc0 "1=10,2=32770,4=1", tres_fmt_req_str = 0x7f48b400be20 "cpu=10,mem=32770M,node=1", 
  tres_alloc_cnt = 0x7f48b400e0d0, tres_alloc_str = 0x7f48b400e150 "1=10,2=32770,3=18446744073709551614,4=1,5=10", 
  tres_fmt_alloc_str = 0x7f48b400e2c0 "cpu=10,mem=32770M,node=1,billing=10", user_id = 175025, 
  user_name = 0x7f48b400dde0 "jal", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, 
  wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}

PID 64876 __assert_fail, validate_jobs_on_node
$1 = {account = 0x7fd54400b430 "test1", admin_comment = 0x0, alias_list = 0x0, 
  alloc_node = 0x7fd54400b400 "dev-intel18", alloc_resp_port = 51102, alloc_sid = 27577, array_job_id = 0, 
  array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1646, assoc_ptr = 0xa6e880, batch_flag = 0, 
  batch_host = 0x7fd54400d3a0 "css-077", billable_tres = 20, bit_flags = 16384, burst_buffer = 0x0, 
  burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, 
  cpu_cnt = 20, cr_enabled = 1, db_index = 102708, deadline = 0, delay_boot = 0, derived_ec = 0, 
  details = 0x7fd54400b080, direct_set_prio = 0, end_time = 1532457865, end_time_exp = 1532457865, 
  epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, 
  gres_list = 0x0, gres_alloc = 0x7fd54400d3e0 "", gres_detail_cnt = 0, gres_detail_str = 0x0, 
  gres_req = 0x7fd54400de50 "", gres_used = 0x0, group_id = 2000, job_id = 6821, job_next = 0x0, 
  job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1, 
  last_sched_eval = 1532443465, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, 
    tres = 0x7fd54400a870}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, 
  name = 0x7fd544003780 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7fd54400de70 "css-077", 
  node_addr = 0x7fd54400d740, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 1, node_cnt_wag = 1, 
  nodes_completing = 0x0, origin_cluster = 0x7fd54400b480 "msuhpcc", other_port = 51101, pack_job_id = 0, 
  pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, 
  partition = 0x7fd54400ddd0 "general-short-18,test1-14,test1-16,test1-18,general-short-14,general-short-16", 
  part_ptr_list = 0xd30f40, part_nodes_missing = false, part_ptr = 0xdce7a0, power_flags = 0 '\000', 
  pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 8266, 
  priority_array = 0x7fd54400b610, prio_factors = 0x7fd54400b230, profile = 0, qos_id = 1, qos_ptr = 0xa672a0, 
  qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, 
  resv_ptr = 0x0, requid = 4294967295, resp_host = 0x7fd54400b450 "192.168.9.200", sched_nodes = 0x0, 
  select_jobinfo = 0x7fd54400b500, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, 
  start_time = 1532443465, state_desc = 0x0, state_reason = 20, state_reason_prev = 20, step_list = 0xd30f90, 
  suspend_time = 0, time_last_active = 1532443465, time_limit = 240, time_min = 0, tot_sus_time = 0, 
  total_cpus = 20, total_nodes = 1, tres_req_cnt = 0x7fd54400a910, 
  tres_req_str = 0x7fd54400b340 "1=20,2=40960,4=1", tres_fmt_req_str = 0x7fd54400b3a0 "cpu=20,mem=40G,node=1", 
  tres_alloc_cnt = 0x7fd54400d6f0, tres_alloc_str = 0x7fd54400d770 "1=20,2=40960,3=18446744073709551614,4=1,5=20", 
  tres_fmt_alloc_str = 0x7fd54400d8e0 "cpu=20,mem=40G,node=1,billing=20", user_id = 175025, 
  user_name = 0x7fd54400d400 "jal", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, 
  wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}

PID 12769 __assert_fail, validate_jobs_on_node
$1 = {account = 0x7fb06c015870 "classres", admin_comment = 0x0, alias_list = 0x0, 
  alloc_node = 0x7fb06c00c000 "lac-249", alloc_resp_port = 51636, alloc_sid = 26149, array_job_id = 0, 
  array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1607, assoc_ptr = 0x7fb080066fb0, batch_flag = 0, 
  batch_host = 0x7fb06c016820 "lac-338", billable_tres = 2, bit_flags = 16384, burst_buffer = 0x0, 
  burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, 
  cpu_cnt = 2, cr_enabled = 1, db_index = 102715, deadline = 0, delay_boot = 0, derived_ec = 0, 
  details = 0x7fb06c0154f0, direct_set_prio = 0, end_time = 1532452897, end_time_exp = 1532452897, 
  epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, 
  gres_list = 0x0, gres_alloc = 0x7fb06c017710 "", gres_detail_cnt = 0, gres_detail_str = 0x0, 
  gres_req = 0x7fb06c0181d0 "", gres_used = 0x0, group_id = 2103, job_id = 6825, job_next = 0x0, 
  job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1, 
  last_sched_eval = 1532449297, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, 
    tres = 0x7fb06c014ce0}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, 
  name = 0x7fb06c014580 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7fb06c0181f0 "lac-[338-339]", 
  node_addr = 0x7fb06c017b70, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 2, node_cnt_wag = 2, 
  nodes_completing = 0x0, origin_cluster = 0x7fb06c005cd0 "msuhpcc", other_port = 51635, pack_job_id = 0, 
  pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, 
  partition = 0x7fb06c019d60 "general-short-16,classres-14,classres-16,general-short-14,general-short-18", 
  part_ptr_list = 0x904b40, part_nodes_missing = false, part_ptr = 0xb700a0, power_flags = 0 '\000', 
  pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 101, priority_array = 0x7fb06c015990, 
  prio_factors = 0x7fb06c0156a0, profile = 0, qos_id = 1, qos_ptr = 0x7fb080002700, qos_blocking_ptr = 0x0, 
  reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, 
  requid = 4294967295, resp_host = 0x7fb06c0158a0 "192.168.8.49", sched_nodes = 0x0, 
  select_jobinfo = 0x7fb06c015900, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, 
  start_time = 1532449297, 
  state_desc = 0x7fb06c01e780 "ReqNodeNotAvail, UnavailableNodes:csm-[000-023],csn-[000-039],csp-[001-027],css-[001-032,034-075,078-127],lac-[000-197,199-337,340-371,373-445],qml-[000-005],test-skl-000,vim-[000-002]", 
  state_reason = 15, state_reason_prev = 15, step_list = 0x838500, suspend_time = 0, time_last_active = 1532449297, 
  time_limit = 60, time_min = 0, tot_sus_time = 0, total_cpus = 2, total_nodes = 2, tres_req_cnt = 0x7fb06c014d80, 
  tres_req_str = 0x7fb06c0157b0 "1=2,2=6554,4=2", tres_fmt_req_str = 0x7fb06c015810 "cpu=2,mem=6554M,node=2", 
  tres_alloc_cnt = 0x7fb06c017a60, tres_alloc_str = 0x7fb06c017bb0 "1=2,2=6554,3=18446744073709551614,4=2,5=2", 
  tres_fmt_alloc_str = 0x7fb06c017ae0 "cpu=2,mem=6554M,node=2,billing=2", user_id = 804793, 
  user_name = 0x7fb06c017c80 "changc81", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, 
  wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}

PID 17835 __assert_fail, validate_jobs_on_node
$1 = {account = 0x7ff03c00d3c0 "classres", admin_comment = 0x0, alias_list = 0x0, 
  alloc_node = 0x7ff03c00c6d0 "lac-249", alloc_resp_port = 52611, alloc_sid = 26149, array_job_id = 0, 
  array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1607, assoc_ptr = 0x764130, batch_flag = 0, 
  batch_host = 0x7ff03c00da40 "lac-338", billable_tres = 2, bit_flags = 16384, burst_buffer = 0x0, 
  burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, 
  cpu_cnt = 2, cr_enabled = 1, db_index = 102759, deadline = 0, delay_boot = 0, derived_ec = 0, 
  details = 0x7ff03c014e70, direct_set_prio = 0, end_time = 1532455646, end_time_exp = 1532455646, 
  epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, 
  gres_list = 0x0, gres_alloc = 0x7ff03c00ed30 "", gres_detail_cnt = 0, gres_detail_str = 0x0, 
  gres_req = 0x7ff03c004fa0 "", gres_used = 0x0, group_id = 2103, job_id = 6847, job_next = 0x0, 
  job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1, 
  last_sched_eval = 1532452046, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, 
    tres = 0x7ff03c00d290}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, 
  name = 0x7ff03c004920 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7ff03c014420 "lac-[338-339]", 
  node_addr = 0x7ff03c005450, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 2, node_cnt_wag = 2, 
  nodes_completing = 0x0, origin_cluster = 0x7ff03c000e40 "msuhpcc", other_port = 52610, pack_job_id = 0, 
  pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, 
  partition = 0x7ff03c010ad0 "general-short-16,classres-14,classres-16,general-short-14,general-short-18", 
  part_ptr_list = 0x9f7180, part_nodes_missing = false, part_ptr = 0xa957a0, power_flags = 0 '\000', 
  pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 101, priority_array = 0x7ff03c00f0e0, 
  prio_factors = 0x7ff03c015020, profile = 0, qos_id = 0, qos_ptr = 0x0, qos_blocking_ptr = 0x0, reboot = 0 '\000', 
  restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, 
  resp_host = 0x7ff03c00d4f0 "192.168.8.49", sched_nodes = 0x0, select_jobinfo = 0x7ff03c00ed50, 
  spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, start_time = 1532452046, 
  state_desc = 0x7ff03c006a90 "ReqNodeNotAvail, UnavailableNodes:csm-[000-023],csn-[000-039],csp-[001-027],css-[001-075,078-127],lac-[000-197,199-337,340-371,373-445],qml-[000-005],test-skl-000,vim-[000-002]", state_reason = 15, 
  state_reason_prev = 15, step_list = 0x9f7220, suspend_time = 0, time_last_active = 1532452046, time_limit = 60, 
  time_min = 0, tot_sus_time = 0, total_cpus = 2, total_nodes = 2, tres_req_cnt = 0x7ff03c0052a0, 
  tres_req_str = 0x7ff03c0048c0 "1=2,2=6554,4=2", tres_fmt_req_str = 0x7ff03c0050c0 "cpu=2,mem=6554M,node=2", 
  tres_alloc_cnt = 0x7ff03c007380, tres_alloc_str = 0x7ff03c00eeb0 "1=2,2=6554,3=18446744073709551614,4=2,5=2", 
  tres_fmt_alloc_str = 0x7ff03c005320 "cpu=2,mem=6554M,node=2,billing=2", user_id = 804793, 
  user_name = 0x7ff03c0055a0 "changc81", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, 
  wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}

PID 36384 __assert_fail, _purge_missing_jobs
$1 = {account = 0x7f0f64004080 "test1", admin_comment = 0x0, alias_list = 0x0, 
  alloc_node = 0x7f0f64002c70 "lac-249", alloc_resp_port = 51917, alloc_sid = 11118, array_job_id = 0, 
  array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1646, assoc_ptr = 0xfbf880, batch_flag = 0, 
  batch_host = 0x7f0f6400e600 "lac-338", billable_tres = 20, bit_flags = 16384, burst_buffer = 0x0, 
  burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, 
  cpu_cnt = 20, cr_enabled = 1, db_index = 102946, deadline = 0, delay_boot = 0, derived_ec = 0, 
  details = 0x7f0f6400b8c0, direct_set_prio = 0, end_time = 1532472964, end_time_exp = 1532472964, 
  epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, 
  gres_list = 0x0, gres_alloc = 0x7f0f6400e640 "", gres_detail_cnt = 0, gres_detail_str = 0x0, 
  gres_req = 0x7f0f6400e620 "", gres_used = 0x0, group_id = 2000, job_id = 6943, job_next = 0x0, 
  job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1, 
  last_sched_eval = 1532458564, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, 
    tres = 0x7f0f6400b0b0}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, 
  name = 0x7f0f64001d80 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7f0f6400e660 "lac-[338-339]", 
  node_addr = 0x7f0f6400dfc0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 2, node_cnt_wag = 2, 
  nodes_completing = 0x0, origin_cluster = 0x7f0f6400bf30 "msuhpcc", other_port = 51916, pack_job_id = 0, 
  pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, 
  partition = 0x7f0f6400e440 "general-short-18,test1-14,test1-16,test1-18,general-short-14,general-short-16", 
  part_ptr_list = 0x1271b60, part_nodes_missing = false, part_ptr = 0x131f7a0, power_flags = 0 '\000', 
  pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 102, priority_array = 0x7f0f6400c040, 
  prio_factors = 0x7f0f6400ba70, profile = 0, qos_id = 1, qos_ptr = 0xfb82a0, qos_blocking_ptr = 0x0, 
  reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, 
  requid = 4294967295, resp_host = 0x7f0f6400bf00 "192.168.8.49", sched_nodes = 0x0, 
  select_jobinfo = 0x7f0f6400bf70, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, 
  start_time = 1532458564, state_desc = 0x0, state_reason = 20, state_reason_prev = 20, step_list = 0x1281f90, 
  suspend_time = 0, time_last_active = 1532458564, time_limit = 240, time_min = 0, tot_sus_time = 0, 
  total_cpus = 20, total_nodes = 2, tres_req_cnt = 0x7f0f6400b150, 
  tres_req_str = 0x7f0f6400be40 "1=20,2=40960,4=2", tres_fmt_req_str = 0x7f0f6400bea0 "cpu=20,mem=40G,node=2", 
  tres_alloc_cnt = 0x7f0f6400deb0, tres_alloc_str = 0x7f0f6400df30 "1=20,2=40960,3=18446744073709551614,4=2,5=20", 
  tres_fmt_alloc_str = 0x7f0f6400e000 "cpu=20,mem=40G,node=2,billing=20", user_id = 175025, 
  user_name = 0x7f0f6400db80 "jal", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, 
  wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}


It looks like a different job for each segfault. I'll attach the logs from yesterday to this ticket as well.
Comment 16 Steve Ford 2018-07-25 07:58:32 MDT
Created attachment 7403 [details]
slurmctld log from 7-24
Comment 17 Jason Booth 2018-07-25 09:06:28 MDT
Hi Steve,

We think we know how to avoid the segfault. These jobs are assigned to a group of nodes; however, the node bitmap associated with each job is NULL when it should have a value. We are not yet sure why the bitmap is NULL, so we are still looking into this.

Kind regards,
Jason
Comment 18 Steve Ford 2018-07-25 10:21:54 MDT
Thanks, Jason. Let me know if there's any more information I can provide to track down why the node bitmap is NULL. I'll attach our conf file and job submit script in case those shed light on the issue.
Comment 19 Steve Ford 2018-07-25 10:25:36 MDT
Created attachment 7406 [details]
Job submit script
Comment 20 Steve Ford 2018-07-25 10:26:26 MDT
Created attachment 7407 [details]
Slurm Config File
Comment 21 Jason Booth 2018-07-25 12:03:40 MDT
Hi Steve,


Please send us the slurmd logs from 2018-07-20 on nodes css-[033,076]. We would like to confirm whether job 6256 was running at the time of the crash.


Kind regards,
Jason
Comment 22 Steve Ford 2018-07-25 12:26:43 MDT
Created attachment 7409 [details]
css-033 logs
Comment 23 Steve Ford 2018-07-25 12:28:49 MDT
Created attachment 7410 [details]
css-033 logs
Comment 24 Steve Ford 2018-07-25 12:29:59 MDT
Created attachment 7411 [details]
css-076 logs
Comment 25 Steve Ford 2018-07-25 12:35:22 MDT
Jason, I looked at the job from the first segfault on 7/20; it was 6256.

$6 = {account = 0x7f64500161d0 "classres", admin_comment = 0x0, alias_list = 0x0, 
  alloc_node = 0x7f64500161a0 "dev-intel18", alloc_resp_port = 51895, alloc_sid = 21447, array_job_id = 0, 
  array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1607, assoc_ptr = 0x264cad0, batch_flag = 0, 
  batch_host = 0x7f645000fed0 "css-033", billable_tres = 2, bit_flags = 16384, burst_buffer = 0x0, 
  burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, 
  cpu_cnt = 2, cr_enabled = 1, db_index = 101612, deadline = 0, delay_boot = 0, derived_ec = 0, 
  details = 0x7f6450015e20, direct_set_prio = 0, end_time = 1532113356, end_time_exp = 1532113356, 
  epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0, 
  gres_list = 0x0, gres_alloc = 0x7f645000fe70 "", gres_detail_cnt = 0, gres_detail_str = 0x0, 
  gres_req = 0x7f6450012140 "", gres_used = 0x0, group_id = 2103, job_id = 6256, job_next = 0x0, 
  job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1, 
  last_sched_eval = 1532109756, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, 
    tres = 0x7f6450015650}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0, 
  name = 0x7f6450000f50 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7f6450012160 "css-[033,076]", 
  node_addr = 0x7f645000ef90, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 2, node_cnt_wag = 2, 
  nodes_completing = 0x0, origin_cluster = 0x7f6450001720 "msuhpcc", other_port = 51894, pack_job_id = 0, 
  pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, 
  partition = 0x7f645000ef00 "classres,general-short", part_ptr_list = 0x7f64480bc270, part_nodes_missing = false, 
  part_ptr = 0x29a9d40, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, 
  priority = 101, priority_array = 0x7f6450016340, prio_factors = 0x7f6450015fd0, profile = 0, qos_id = 0, 
  qos_ptr = 0x0, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, 
  resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x7f6450016200 "192.168.9.200", 
  sched_nodes = 0x0, select_jobinfo = 0x7f6450016260, spank_job_env = 0x0, spank_job_env_size = 0, 
  start_protocol_ver = 8192, start_time = 1532109756, 
  state_desc = 0x7f645000f8d0 "ReqNodeNotAvail, UnavailableNodes:csm-[000-023],csn-[000-039],csp-[001-027],css-[001-032,034-075,078-127],lac-[000-197,199-337,340-371,373-445],vim-[000-002]", state_reason = 15, 
  state_reason_prev = 15, step_list = 0x290f030, suspend_time = 0, time_last_active = 1532109756, time_limit = 60, 
  time_min = 0, tot_sus_time = 0, total_cpus = 2, total_nodes = 2, tres_req_cnt = 0x7f64500156b0, 
  tres_req_str = 0x7f64500160e0 "1=2,2=6554,4=2", tres_fmt_req_str = 0x7f6450016140 "cpu=2,mem=6554M,node=2", 
  tres_alloc_cnt = 0x7f645000ee80, tres_alloc_str = 0x7f645000ff40 "1=2,2=6554,3=18446744073709551614,4=2,5=2", 
  tres_fmt_alloc_str = 0x7f645000efd0 "cpu=2,mem=6554M,node=2,billing=2", user_id = 804793, 
  user_name = 0x7f6450015680 "changc81", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, 
  wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
Comment 27 Jason Booth 2018-07-25 16:28:31 MDT
Hi Steve,


Just passing along a quick update. We have not been able to trigger this issue internally yet. We would like to know if things have calmed down for you or if you are still seeing consistent crashes (today). 


Kind regards,
Jason
Comment 28 Steve Ford 2018-07-26 08:27:37 MDT
We saw two more crashes yesterday. Both had validate_jobs_on_node in their backtrace.
Comment 29 Steve Ford 2018-07-26 11:53:01 MDT
Three more segfaults today. Two in validate_jobs_on_node and one in _purge_missing_jobs.
Comment 30 Jason Booth 2018-07-26 13:01:27 MDT
Hi Steve,

This issue may be related to ticket 5452 and a few others (5438, 5447, 5276). We believe that jobs submitted to multiple partitions are being started immediately, before _select_nodes_parts() finishes, when "EnforcePartLimits=ALL" is set. I will keep you updated as we make progress toward a patch.

Kind regards,
Jason
Comment 31 Jason Booth 2018-07-27 10:23:37 MDT
Created attachment 7441 [details]
Patch 5457

Hi Steve,

This patch should fix the issue. It has not been committed yet, but we expect it will be soon, in this or a similar form.

Kind regards,
Jason
Comment 32 Steve Ford 2018-07-30 11:21:11 MDT
Jason,

We've applied this patch. We'll see how things go for the next few days.

Thanks
Comment 33 Jason Booth 2018-07-30 17:12:09 MDT
*** Ticket 5472 has been marked as a duplicate of this ticket. ***
Comment 34 Jason Booth 2018-07-30 17:12:39 MDT
*** Ticket 5473 has been marked as a duplicate of this ticket. ***
Comment 35 Jason Booth 2018-07-31 12:55:06 MDT
Hi Steve,

I am checking in to see how the last 24 hours have gone with the patch. Have you seen any issues since applying it?

Kind regards,
Jason
Comment 36 Steve Ford 2018-08-01 07:09:39 MDT
Hello Jason,

I'm happy to report that we have not had any crashes since we applied the patch.
Comment 37 Jason Booth 2018-08-01 09:02:07 MDT
Hi Steve,

> I am checking in to see how the last 24 hours have gone with the patch. Have you seen any issues since applying it?

This is good news. I will drop the priority of this ticket to a sev3 for now since the patch seems to have mitigated the disruptions. 

Kind regards,
Jason
Comment 39 Jason Booth 2018-08-03 14:04:42 MDT
Hi Steve,

Since this appears to have been resolved by the following commit, I will proceed to close this ticket.

https://github.com/SchedMD/slurm/commit/fef07a40972

Please do let me know if the issue happens again.

Best regards,
Jason