| Summary: | SlurmCtld Segfaulted | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Steve Ford <fordste5> |
| Component: | slurmctld | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | Priority: | --- |
| CC: | brian | Site: | MSU |
| Version: | 17.11.7 | Version Fixed: | 17.11.9 |
| Hardware: | Linux | OS: | Linux |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5474 | | |
| Attachments: | Slurmctld log, Messages log, slurmctld log from 7-24, Job submit script, Slurm Config File, css-033 logs, css-033 logs, css-076 logs, Patch 5457 | | |
Description
Steve Ford
2018-07-20 12:40:58 MDT

Hi Steve,

Please send in the output of the following while in gdb: 'thread 1', 'frame 5', 'info locals', and 'thread apply all bt full'. It would also be good to know if you have ECC enabled on this system; if so, have any errors been reported? You may also want to look at the output of 'dmesg' and check for any entries in /var/log/messages around the time of the event.

Kind regards,
Jason

Here is the output from gdb:
[Switching to thread 1 (Thread 0x7f6418f0f700 (LWP 2118))]
#0 0x00007f647aaa2277 in raise () from /lib64/libc.so.6
#5 0x0000000000449a8f in validate_jobs_on_node (reg_msg=reg_msg@entry=0x7f6450006b10) at job_mgr.c:13934
13934 job_mgr.c: No such file or directory.
i = 0
node_inx = 166
jobs_on_node = <optimized out>
node_ptr = 0x29f0a40
job_ptr = 0x7f6450015a70
step_ptr = <optimized out>
step_str = "Z%R[\000\000\000\000\256\305\b\000\000\000\000\000\020\000\000\000[\000\000\000 \352\360\030d\177\000\000\340\351\360\030d\177\000\000 \212\000Pd\177\000\000[\224\000Pd\177\000\000\346mF\000\000\000\000"
now = 1532110170
__func__ = "validate_jobs_on_node"
Thread 15 (Thread 0x7f64774b7700 (LWP 8071)):
#0 0x00007f647ab3156d in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007f647ab31404 in sleep () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007f64774bd745 in _decay_thread (no_data=<optimized out>) at priority_multifactor.c:1333
start_time = 1532109903
last_reset = 1527873057
next_reset = 1546318800
calc_period = 300
decay_hl = <optimized out>
reset_period = 6
now = 1532109903
run_delta = <optimized out>
real_decay = <optimized out>
elapsed = <optimized out>
job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = READ_LOCK, partition = READ_LOCK, federation = NO_LOCK}
locks = {assoc = WRITE_LOCK, file = NO_LOCK, qos = NO_LOCK, res = NO_LOCK, tres = NO_LOCK, user = NO_LOCK, wckey = NO_LOCK}
__func__ = "_decay_thread"
#3 0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4 0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 14 (Thread 0x7f64770b3700 (LWP 8073)):
#0 0x00007f647ab61c73 in select () from /lib64/libc.so.6
No symbol table info available.
#1 0x0000000000425716 in _slurmctld_rpc_mgr (no_data=<optimized out>) at controller.c:1026
max_fd = 7
newsockfd = <optimized out>
sockfd = 0x7f64680008d0
cli_addr = {sin_family = 2, sin_port = 53453, sin_addr = {s_addr = 2819008704}, sin_zero = "\000\000\000\000\000\000\000"}
srv_addr = {sin_family = 2, sin_port = 41242, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}
port = 41242
ip = "0.0.0.0", '\000' <repeats 24 times>
fd_next = 0
i = <optimized out>
nports = 1
rfds = {__fds_bits = {128, 0 <repeats 15 times>}}
conn_arg = <optimized out>
config_read_lock = {config = READ_LOCK, job = NO_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
sigarray = {10, 0}
node_addr = <optimized out>
__func__ = "_slurmctld_rpc_mgr"
#2 0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 13 (Thread 0x7f6476eb1700 (LWP 8075)):
#0 0x00007f647ae44995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x00000000004a2221 in slurmctld_state_save (no_data=<optimized out>) at state_save.c:204
err = <optimized out>
last_save = 1532110141
now = 1532110141
save_delay = <optimized out>
run_save = <optimized out>
save_count = 0
__func__ = "slurmctld_state_save"
#2 0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 12 (Thread 0x7f6476fb2700 (LWP 8074)):
#0 0x00007f647ae48461 in sigwait () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x0000000000429b81 in _slurmctld_signal_hand (no_data=<optimized out>) at controller.c:891
sig = 0
i = <optimized out>
rc = <optimized out>
sig_array = {2, 15, 1, 6, 12, 0}
set = {__val = {18467, 0 <repeats 15 times>}}
__func__ = "_slurmctld_signal_hand"
#2 0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 11 (Thread 0x7f6476caf700 (LWP 8077)):
#0 0x00007f647ae44995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x0000000000423cd6 in _purge_files_thread (no_data=<optimized out>) at controller.c:3182
err = <optimized out>
job_id = 0x0
__func__ = "_purge_files_thread"
#2 0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 10 (Thread 0x7f64773b6700 (LWP 8072)):
#0 0x00007f647ae41f97 in pthread_join () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x00007f64774baee5 in _cleanup_thread (no_data=<optimized out>) at priority_multifactor.c:1462
No locals.
#2 0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 9 (Thread 0x7f64786e0700 (LWP 8022)):
#0 0x00007f647ab3156d in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007f647ab31404 in sleep () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007f64786e4928 in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:437
local_job_list = <optimized out>
job_ptr = <optimized out>
itr = <optimized out>
job_read_lock = {config = NO_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
job_write_lock = {config = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
__func__ = "_set_db_inx_thread"
#3 0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4 0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 8 (Thread 0x7f64777c3700 (LWP 8062)):
#0 0x00007f647ae44d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x0000000000431384 in _fed_job_update_thread (arg=<optimized out>) at fed_mgr.c:2161
err = <optimized out>
ts = {tv_sec = 1532110171, tv_nsec = 0}
job_update_info = <optimized out>
__func__ = "_fed_job_update_thread"
#2 0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 7 (Thread 0x7f64778c4700 (LWP 8061)):
#0 0x00007f647ae44d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x000000000042d4c3 in _agent_thread (arg=<optimized out>) at fed_mgr.c:2203
err = <optimized out>
cluster = <optimized out>
ts = {tv_sec = 1532110171, tv_nsec = 0}
rpc_rec = <optimized out>
req_msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x0, body_offset = 0, buffer = 0x0, conn = 0x0, conn_fd = 0, data = 0x0, data_size = 0, flags = 0, msg_index = 0, msg_type = 0, protocol_version = 0, forward = {cnt = 0, init = 0, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
resp_msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x0, body_offset = 0, buffer = 0x0, conn = 0x0, conn_fd = 0, data = 0x0, data_size = 0, flags = 0, msg_index = 0, msg_type = 0, protocol_version = 0, forward = {cnt = 0, init = 0, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
ctld_req_msg = {my_list = 0x0}
success_bits = <optimized out>
rc = <optimized out>
resp_inx = <optimized out>
success_size = <optimized out>
fed_read_lock = {config = NO_LOCK, job = NO_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = READ_LOCK}
__func__ = "_agent_thread"
#2 0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 6 (Thread 0x7f6477bc7700 (LWP 8053)):
#0 0x00007f647ae44d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x00007f6477bcbc46 in _my_sleep (usec=30000000) at backfill.c:540
err = <optimized out>
nsec = <optimized out>
sleep_time = 0
ts = {tv_sec = 1532110187, tv_nsec = 808441000}
tv1 = {tv_sec = 1532110157, tv_usec = 808441}
tv2 = {tv_sec = 0, tv_usec = 0}
__func__ = "_my_sleep"
#2 0x00007f6477bd2062 in backfill_agent (args=<optimized out>) at backfill.c:876
now = <optimized out>
wait_time = <optimized out>
last_backfill_time = 1532110157
all_locks = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = READ_LOCK, federation = READ_LOCK}
load_config = <optimized out>
short_sleep = false
backfill_cnt = 556
__func__ = "backfill_agent"
#3 0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#4 0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 5 (Thread 0x7f64782da700 (LWP 8025)):
#0 0x00007f647ae44d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x00007f647b39c4e5 in _agent (x=<optimized out>) at slurmdbd_defs.c:1988
err = <optimized out>
cnt = <optimized out>
rc = <optimized out>
buffer = <optimized out>
abs_time = {tv_sec = 1532110173, tv_nsec = 0}
fail_time = 0
sigarray = {10, 0}
list_req = {msg_type = 1474, data = 0x7f64782d9eb0}
list_msg = {my_list = 0x0, return_code = 0}
__func__ = "_agent"
#2 0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 4 (Thread 0x7f64785df700 (LWP 8023)):
#0 0x00007f647ae41f97 in pthread_join () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x00007f64786e4070 in _cleanup_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:445
No locals.
#2 0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 3 (Thread 0x7f647b827740 (LWP 8020)):
#0 0x00007f647ab3156d in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007f647ab62404 in usleep () from /lib64/libc.so.6
No symbol table info available.
#2 0x0000000000428376 in _slurmctld_background (no_data=0x0) at controller.c:1778
i = 8
job_limit = <optimized out>
delta_t = 27
last_full_sched_time = 1532110143
last_ctld_bu_ping = 1532110169
last_uid_update = 1532106905
last_reboot_msg_time = 1532092504
ping_interval = 100
job_read_lock = {config = READ_LOCK, job = READ_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
job_write_lock = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = READ_LOCK, federation = READ_LOCK}
job_node_read_lock = {config = NO_LOCK, job = READ_LOCK, node = READ_LOCK, partition = NO_LOCK, federation = NO_LOCK}
last_group_time = 1532109903
last_acct_gather_node_time = 1532092503
last_ext_sensors_time = 1532092503
last_resv_time = 1532110165
tv1 = {tv_sec = 1532110169, tv_usec = 697263}
node_write_lock2 = {config = NO_LOCK, job = NO_LOCK, node = WRITE_LOCK, partition = NO_LOCK, federation = NO_LOCK}
last_timelimit_time = 1532110145
last_assert_primary_time = 1532092503
purge_job_interval = 60
tv2 = {tv_sec = 1532110169, tv_usec = 697290}
config_read_lock = {config = READ_LOCK, job = NO_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
node_write_lock = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK, federation = NO_LOCK}
last_purge_job_time = 1532110143
last_node_acct = 1532109903
no_resp_msg_interval = <optimized out>
tv_str = "usec=27\000\000\065\000\000\000\000\000\000\000\000\000"
job_write_lock2 = {config = NO_LOCK, job = WRITE_LOCK, node = NO_LOCK, partition = NO_LOCK, federation = NO_LOCK}
last_no_resp_msg_time = 1532109903
now = <optimized out>
last_sched_time = 1532110143
last_ping_node_time = 1532110107
part_write_lock = {config = NO_LOCK, job = NO_LOCK, node = NO_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
last_health_check_time = 1532110158
last_checkpoint_time = 1532110083
last_ping_srun_time = 1532092503
last_trigger = 1532110169
#3 main (argc=<optimized out>, argv=<optimized out>) at controller.c:604
cnt = <optimized out>
error_code = <optimized out>
i = 3
stat_buf = {st_dev = 64769, st_ino = 33752695, st_nlink = 1, st_mode = 33261, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 392880, st_blksize = 4096, st_blocks = 768, st_atim = {tv_sec = 1532024690, tv_nsec = 668992174}, st_mtim = {tv_sec = 1523430473, tv_nsec = 0}, st_ctim = {tv_sec = 1531333934, tv_nsec = 727046754}, __unused = {0, 0, 0}}
rlim = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
config_write_lock = {config = WRITE_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
node_part_write_lock = {config = NO_LOCK, job = NO_LOCK, node = WRITE_LOCK, partition = WRITE_LOCK, federation = NO_LOCK}
callbacks = {acct_full = 0x4acad5 <trigger_primary_ctld_acct_full>, dbd_fail = 0x4acce4 <trigger_primary_dbd_fail>, dbd_resumed = 0x4acd72 <trigger_primary_dbd_res_op>, db_fail = 0x4acdf7 <trigger_primary_db_fail>, db_resumed = 0x4ace85 <trigger_primary_db_res_op>}
create_clustername_file = 120
__func__ = "main"
Thread 2 (Thread 0x7f647b826700 (LWP 8021)):
#0 0x00007f647ae44d42 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1 0x000000000041e477 in _agent_init (arg=<optimized out>) at agent.c:1313
err = <optimized out>
min_wait = <optimized out>
mail_too = <optimized out>
ts = {tv_sec = 1532110171, tv_nsec = 0}
__func__ = "_agent_init"
#2 0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.
Thread 1 (Thread 0x7f6418f0f700 (LWP 2118)):
#0 0x00007f647aaa2277 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007f647aaa3968 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007f647aa9b096 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3 0x00007f647aa9b142 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4 0x00007f647b2c097a in bit_test (b=<optimized out>, bit=bit@entry=166) at bitstring.c:228
__PRETTY_FUNCTION__ = "bit_test"
#5 0x0000000000449a8f in validate_jobs_on_node (reg_msg=reg_msg@entry=0x7f6450006b10) at job_mgr.c:13934
i = 0
node_inx = 166
jobs_on_node = <optimized out>
node_ptr = 0x29f0a40
job_ptr = 0x7f6450015a70
step_ptr = <optimized out>
step_str = "Z%R[\000\000\000\000\256\305\b\000\000\000\000\000\020\000\000\000[\000\000\000 \352\360\030d\177\000\000\340\351\360\030d\177\000\000 \212\000Pd\177\000\000[\224\000Pd\177\000\000\346mF\000\000\000\000"
now = 1532110170
__func__ = "validate_jobs_on_node"
#6 0x000000000048dc51 in _slurm_rpc_node_registration (running_composite=false, msg=0x7f6418f0ee50) at proc_req.c:3076
tv1 = {tv_sec = 1532110170, tv_usec = 575017}
tv_str = '\000' <repeats 19 times>
delta_t = 140068815669792
error_code = <optimized out>
tv2 = {tv_sec = 390842023984, tv_usec = 140067891899248}
newly_up = false
node_reg_stat_msg = 0x7f6450006b10
job_write_lock = {config = READ_LOCK, job = WRITE_LOCK, node = WRITE_LOCK, partition = NO_LOCK, federation = READ_LOCK}
uid = 0
#7 slurmctld_req (msg=msg@entry=0x7f6418f0ee50, arg=arg@entry=0x7f6468019510) at proc_req.c:407
tv1 = {tv_sec = 1532110170, tv_usec = 575017}
tv2 = {tv_sec = 8589934593, tv_usec = 2}
tv_str = '\000' <repeats 19 times>
delta_t = 390842023984
i = 0
rpc_type_index = 6
rpc_user_index = 0
rpc_uid = <optimized out>
__func__ = "slurmctld_req"
#8 0x0000000000424f28 in _service_connection (arg=0x7f6468019510) at controller.c:1125
conn = 0x7f6468019510
msg = {address = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x7f6450000a10, body_offset = 173, buffer = 0x7f645002d430, conn = 0x0, conn_fd = 4, data = 0x7f6450006b10, data_size = 0, flags = 0, msg_index = 0, msg_type = 1002, protocol_version = 8192, forward = {cnt = 0, init = 65534, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
__func__ = "_service_connection"
#9 0x00007f647ae40e25 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#10 0x00007f647ab6abad in clone () from /lib64/libc.so.6
No symbol table info available.
Our Slurm server is a virtual machine. No ECC errors have been reported by the host system.

Dmesg shows a number of processes that hung in XFS calls. We had some network issues that day that may have affected the VM host's ability to access its backend storage. Slurmctld was not one of the hung processes.
Hi Steve,

Would you also attach your slurmctld.log and the /var/log/messages from that day?

Kind regards,
Jason

Hi Steve,

In addition to my last email, would you also include:
thread 1
frame 5
p *job_ptr
The bt shows that node_bitmap is optimized out:
#4 0x00007f647b2c097a in bit_test (b=<optimized out>, bit=bit@entry=166) at
bitstring.c:228
__PRETTY_FUNCTION__ = "bit_test"
It'd be nice to know if it was null or just corrupted.
Kind regards,
Jason
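
For readers following the frames: the assert in frame #4 fires inside Slurm's bitstring library before any bit is read, so a NULL job_ptr->node_bitmap alone is enough to abort the daemon. Below is a minimal standalone sketch of that failure shape; it is not Slurm source, and the header layout and macro bodies are simplified assumptions that only mirror the bitstring API names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for Slurm's bitstring type; the real layout in
 * src/common/bitstring.c also carries a magic word and a size. */
typedef uint64_t bitstr_t;
#define BITSTR_MAGIC 0x42544f4cULL

/* The validity check runs before any bit is read, so a NULL pointer
 * (or a clobbered header) ends in assert() -> abort() -> raise(),
 * matching frames #0-#4 of the backtraces above. */
#define _assert_bitstr_valid(b)                 \
	do {                                    \
		assert((b) != NULL);            \
		assert((b)[0] == BITSTR_MAGIC); \
	} while (0)

static int bit_test(bitstr_t *b, int bit)
{
	_assert_bitstr_valid(b);   /* analogue of bitstring.c:228 */
	return (int)((b[2 + (bit >> 6)] >> (bit & 63)) & 1);
}

int main(void)
{
	bitstr_t *node_bitmap = NULL;   /* job_ptr->node_bitmap == 0x0 */

	/* With assertions enabled this aborts before printing. */
	printf("%d\n", bit_test(node_bitmap, 166));
	return 0;
}
```

Under this reading, the p *job_ptr output later in the ticket (node_bitmap = 0x0) points at the plain NULL case rather than corruption.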
Created attachment 7382 [details]
Slurmctld log
Created attachment 7384 [details]
Messages log
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f6418f0f700 (LWP 2118))]
#0 0x00007f647aaa2277 in raise () from /lib64/libc.so.6
(gdb) frame 5
#5 0x0000000000449a8f in validate_jobs_on_node (reg_msg=reg_msg@entry=0x7f6450006b10) at job_mgr.c:13934
13934 job_mgr.c: No such file or directory.
(gdb) p *job_ptr
$1 = {account = 0x7f64500161d0 "classres", admin_comment = 0x0, alias_list = 0x0,
alloc_node = 0x7f64500161a0 "dev-intel18", alloc_resp_port = 51895, alloc_sid = 21447, array_job_id = 0,
array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1607, assoc_ptr = 0x264cad0, batch_flag = 0,
batch_host = 0x7f645000fed0 "css-033", billable_tres = 2, bit_flags = 16384, burst_buffer = 0x0,
burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0,
cpu_cnt = 2, cr_enabled = 1, db_index = 101612, deadline = 0, delay_boot = 0, derived_ec = 0,
details = 0x7f6450015e20, direct_set_prio = 0, end_time = 1532113356, end_time_exp = 1532113356,
epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0,
gres_list = 0x0, gres_alloc = 0x7f645000fe70 "", gres_detail_cnt = 0, gres_detail_str = 0x0,
gres_req = 0x7f6450012140 "", gres_used = 0x0, group_id = 2103, job_id = 6256, job_next = 0x0,
job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1,
last_sched_eval = 1532109756, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0,
tres = 0x7f6450015650}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0,
name = 0x7f6450000f50 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7f6450012160 "css-[033,076]",
node_addr = 0x7f645000ef90, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 2, node_cnt_wag = 2,
nodes_completing = 0x0, origin_cluster = 0x7f6450001720 "msuhpcc", other_port = 51894, pack_job_id = 0,
pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0,
partition = 0x7f645000ef00 "classres,general-short", part_ptr_list = 0x7f64480bc270, part_nodes_missing = false,
part_ptr = 0x29a9d40, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false,
priority = 101, priority_array = 0x7f6450016340, prio_factors = 0x7f6450015fd0, profile = 0, qos_id = 0,
qos_ptr = 0x0, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0,
resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x7f6450016200 "192.168.9.200",
sched_nodes = 0x0, select_jobinfo = 0x7f6450016260, spank_job_env = 0x0, spank_job_env_size = 0,
start_protocol_ver = 8192, start_time = 1532109756,
state_desc = 0x7f645000f8d0 "ReqNodeNotAvail, UnavailableNodes:csm-[000-023],csn-[000-039],csp-[001-027],css-[001-032,034-075,078-127],lac-[000-197,199-337,340-371,373-445],vim-[000-002]", state_reason = 15,
state_reason_prev = 15, step_list = 0x290f030, suspend_time = 0, time_last_active = 1532109756, time_limit = 60,
time_min = 0, tot_sus_time = 0, total_cpus = 2, total_nodes = 2, tres_req_cnt = 0x7f64500156b0,
tres_req_str = 0x7f64500160e0 "1=2,2=6554,4=2", tres_fmt_req_str = 0x7f6450016140 "cpu=2,mem=6554M,node=2",
tres_alloc_cnt = 0x7f645000ee80, tres_alloc_str = 0x7f645000ff40 "1=2,2=6554,3=18446744073709551614,4=2,5=2",
tres_fmt_alloc_str = 0x7f645000efd0 "cpu=2,mem=6554M,node=2,billing=2", user_id = 804793,
user_name = 0x7f6450015680 "changc81", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0,
wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
Hi Steve,

After reviewing the logs you have uploaded, and the traces, I believe this is caused by changes to the cluster configuration without restarting the slurmd and slurmctld process(es). You will notice a number of errors in the slurmctld logs like the following:

error: Node lac-338 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

Making changes to the slurm.conf and not restarting the processes will result in strange behavior similar to what you are seeing. We therefore suggest that you make sure that each node has the same slurm.conf as the controller, and restart all the slurmd processes and the slurmctld process. We would also like to know if Slurm is being managed by another utility such as Bright Cluster Manager. This will help us know if these changes are being made somewhere other than by the admins.

Kind regards,
Jason

All our Slurm clients and the Slurm server are managed with Puppet, so the daemons are restarted shortly after being updated. We added some more nodes to the config file this morning, which may have caused the log messages. I don't think that's what caused the segfaults, at least not all of them. We had segfaults occur at 12:39 and 1:20, several hours after the config files were changed and the services restarted. I don't see any of those log messages around this time either. Could something else have caused these?

We just saw another segfault. The config files have not changed.
#0 0x00007f0f7beaf277 in raise () from /lib64/libc.so.6
#1 0x00007f0f7beb0968 in abort () from /lib64/libc.so.6
#2 0x00007f0f7bea8096 in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007f0f7bea8142 in __assert_fail () from /lib64/libc.so.6
#4 0x00007f0f7c6cd97a in bit_test (b=<optimized out>, bit=bit@entry=416) at bitstring.c:228
#5 0x0000000000449de6 in _purge_missing_jobs (now=1532458700, node_inx=<optimized out>) at job_mgr.c:14059
#6 validate_jobs_on_node (reg_msg=reg_msg@entry=0x7f0f340171c0) at job_mgr.c:14014
#7 0x000000000048dc51 in _slurm_rpc_node_registration (running_composite=false, msg=0x7f0f19352e50)
at proc_req.c:3076
#8 slurmctld_req (msg=msg@entry=0x7f0f19352e50, arg=arg@entry=0x7f0f680008f0) at proc_req.c:407
#9 0x0000000000424f28 in _service_connection (arg=0x7f0f680008f0) at controller.c:1125
#10 0x00007f0f7c24de25 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f0f7bf77bad in clone () from /lib64/libc.so.6
Hi Steve,

Could I have you gather the output from the last backtrace? Find the thread with either _purge_missing_jobs or validate_jobs_on_node. For example:

thread 1
frame 5
p *job_ptr

We are curious to know if this may be caused by the same job '6256'.

Kind regards,
Jason

Here is thread 1, frame 5, p *job_ptr for all of the core dumps.
PID 37953 __assert_fail, _purge_missing_jobs
$1 = {account = 0x7f70cc001680 "test1", admin_comment = 0x0, alias_list = 0x0,
alloc_node = 0x7f70cc0e0900 "dev-intel18", alloc_resp_port = 51010, alloc_sid = 14161, array_job_id = 0,
array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1646, assoc_ptr = 0x7f70cc008cc0, batch_flag = 0,
batch_host = 0x7f70cc0c2200 "css-077", billable_tres = 18, bit_flags = 16384, burst_buffer = 0x0,
burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0,
cpu_cnt = 18, cr_enabled = 1, db_index = 102622, deadline = 0, delay_boot = 0, derived_ec = 0,
details = 0x7f70cc0e0580, direct_set_prio = 0, end_time = 1532455676, end_time_exp = 1532455676,
epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0,
gres_list = 0x0, gres_alloc = 0x7f70cc0c2240 "", gres_detail_cnt = 0, gres_detail_str = 0x0,
gres_req = 0x7f70cc0c21c0 "", gres_used = 0x0, group_id = 2000, job_id = 6777, job_next = 0x0,
job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1,
last_sched_eval = 1532441276, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0,
tres = 0x7f70cc0e5450}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0,
name = 0x7f70cc002190 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7f70cc0c21e0 "css-077",
node_addr = 0x7f70cc0c25d0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 1, node_cnt_wag = 1,
nodes_completing = 0x0, origin_cluster = 0x7f70cc00b110 "msuhpcc", other_port = 51009, pack_job_id = 0,
pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0,
partition = 0x7f70cc0c20a0 "general-short-18,test1-14,test1-16,test1-18,general-short-14,general-short-16",
part_ptr_list = 0x7f70cc0bb160, part_nodes_missing = false, part_ptr = 0x1ba9270, power_flags = 0 '\000',
pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 100101,
priority_array = 0x7f70cc0e58b0, prio_factors = 0x7f70cc0e0730, profile = 0, qos_id = 1, qos_ptr = 0x1842350,
qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0,
resv_ptr = 0x0, requid = 4294967295, resp_host = 0x7f70cc0e0930 "192.168.9.200", sched_nodes = 0x0,
select_jobinfo = 0x7f70cc0e09a0, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192,
start_time = 1532441276, state_desc = 0x0, state_reason = 20, state_reason_prev = 20, step_list = 0x1afb8b0,
suspend_time = 0, time_last_active = 1532441276, time_limit = 240, time_min = 0, tot_sus_time = 0,
total_cpus = 18, total_nodes = 1, tres_req_cnt = 0x7f70cc0e54f0,
tres_req_str = 0x7f70cc0e0840 "1=18,2=58986,4=1", tres_fmt_req_str = 0x7f70cc0e08a0 "cpu=18,mem=58986M,node=1",
tres_alloc_cnt = 0x7f70cc0c2580, tres_alloc_str = 0x7f70cc0c2600 "1=18,2=58986,3=18446744073709551614,4=1,5=18",
tres_fmt_alloc_str = 0x7f70cc0c2770 "cpu=18,mem=58986M,node=1,billing=18", user_id = 175025,
user_name = 0x7f70cc0c2260 "jal", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0,
wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
PID 55525 __assert_fail, _purge_missing_jobs
$1 = {account = 0x7f48b4002e60 "test1", admin_comment = 0x0, alias_list = 0x0,
alloc_node = 0x7f48b400be80 "dev-intel18", alloc_resp_port = 52268, alloc_sid = 27577, array_job_id = 0,
array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1646, assoc_ptr = 0xaf5880, batch_flag = 0,
batch_host = 0x7f48b400dd80 "css-077", billable_tres = 10, bit_flags = 16384, burst_buffer = 0x0,
burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0,
cpu_cnt = 10, cr_enabled = 1, db_index = 102674, deadline = 0, delay_boot = 0, derived_ec = 0,
details = 0x7f48b400bb00, direct_set_prio = 0, end_time = 1532457014, end_time_exp = 1532457014,
epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0,
gres_list = 0x0, gres_alloc = 0x7f48b400ddc0 "", gres_detail_cnt = 0, gres_detail_str = 0x0,
gres_req = 0x7f48b400e830 "", gres_used = 0x0, group_id = 2000, job_id = 6805, job_next = 0x0,
job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1,
last_sched_eval = 1532442614, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0,
tres = 0x7f48b400b2f0}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0,
name = 0x7f48b4002670 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7f48b400e850 "css-077",
node_addr = 0x7f48b400e120, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 1, node_cnt_wag = 1,
nodes_completing = 0x0, origin_cluster = 0x7f48b40042c0 "msuhpcc", other_port = 52267, pack_job_id = 0,
pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0,
partition = 0x7f48b400e7b0 "general-short-18,test1-14,test1-16,test1-18,general-short-14,general-short-16",
part_ptr_list = 0xdb7c20, part_nodes_missing = false, part_ptr = 0xe557a0, power_flags = 0 '\000',
pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 18091,
priority_array = 0x7f48b400bff0, prio_factors = 0x7f48b400bcb0, profile = 0, qos_id = 1, qos_ptr = 0xaee2a0,
qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0,
resv_ptr = 0x0, requid = 4294967295, resp_host = 0x7f48b400beb0 "192.168.9.200", sched_nodes = 0x0,
select_jobinfo = 0x7f48b400bf10, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192,
start_time = 1532442614, state_desc = 0x0, state_reason = 20, state_reason_prev = 20, step_list = 0xda7bb0,
suspend_time = 0, time_last_active = 1532442614, time_limit = 240, time_min = 0, tot_sus_time = 0,
total_cpus = 10, total_nodes = 1, tres_req_cnt = 0x7f48b400b390,
tres_req_str = 0x7f48b400bdc0 "1=10,2=32770,4=1", tres_fmt_req_str = 0x7f48b400be20 "cpu=10,mem=32770M,node=1",
tres_alloc_cnt = 0x7f48b400e0d0, tres_alloc_str = 0x7f48b400e150 "1=10,2=32770,3=18446744073709551614,4=1,5=10",
tres_fmt_alloc_str = 0x7f48b400e2c0 "cpu=10,mem=32770M,node=1,billing=10", user_id = 175025,
user_name = 0x7f48b400dde0 "jal", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0,
wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
PID 64876 __assert_fail, validate_jobs_on_node
$1 = {account = 0x7fd54400b430 "test1", admin_comment = 0x0, alias_list = 0x0,
alloc_node = 0x7fd54400b400 "dev-intel18", alloc_resp_port = 51102, alloc_sid = 27577, array_job_id = 0,
array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1646, assoc_ptr = 0xa6e880, batch_flag = 0,
batch_host = 0x7fd54400d3a0 "css-077", billable_tres = 20, bit_flags = 16384, burst_buffer = 0x0,
burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0,
cpu_cnt = 20, cr_enabled = 1, db_index = 102708, deadline = 0, delay_boot = 0, derived_ec = 0,
details = 0x7fd54400b080, direct_set_prio = 0, end_time = 1532457865, end_time_exp = 1532457865,
epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0,
gres_list = 0x0, gres_alloc = 0x7fd54400d3e0 "", gres_detail_cnt = 0, gres_detail_str = 0x0,
gres_req = 0x7fd54400de50 "", gres_used = 0x0, group_id = 2000, job_id = 6821, job_next = 0x0,
job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1,
last_sched_eval = 1532443465, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0,
tres = 0x7fd54400a870}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0,
name = 0x7fd544003780 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7fd54400de70 "css-077",
node_addr = 0x7fd54400d740, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 1, node_cnt_wag = 1,
nodes_completing = 0x0, origin_cluster = 0x7fd54400b480 "msuhpcc", other_port = 51101, pack_job_id = 0,
pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0,
partition = 0x7fd54400ddd0 "general-short-18,test1-14,test1-16,test1-18,general-short-14,general-short-16",
part_ptr_list = 0xd30f40, part_nodes_missing = false, part_ptr = 0xdce7a0, power_flags = 0 '\000',
pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 8266,
priority_array = 0x7fd54400b610, prio_factors = 0x7fd54400b230, profile = 0, qos_id = 1, qos_ptr = 0xa672a0,
qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0,
resv_ptr = 0x0, requid = 4294967295, resp_host = 0x7fd54400b450 "192.168.9.200", sched_nodes = 0x0,
select_jobinfo = 0x7fd54400b500, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192,
start_time = 1532443465, state_desc = 0x0, state_reason = 20, state_reason_prev = 20, step_list = 0xd30f90,
suspend_time = 0, time_last_active = 1532443465, time_limit = 240, time_min = 0, tot_sus_time = 0,
total_cpus = 20, total_nodes = 1, tres_req_cnt = 0x7fd54400a910,
tres_req_str = 0x7fd54400b340 "1=20,2=40960,4=1", tres_fmt_req_str = 0x7fd54400b3a0 "cpu=20,mem=40G,node=1",
tres_alloc_cnt = 0x7fd54400d6f0, tres_alloc_str = 0x7fd54400d770 "1=20,2=40960,3=18446744073709551614,4=1,5=20",
tres_fmt_alloc_str = 0x7fd54400d8e0 "cpu=20,mem=40G,node=1,billing=20", user_id = 175025,
user_name = 0x7fd54400d400 "jal", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0,
wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
PID 12769 __assert_fail, validate_jobs_on_node
$1 = {account = 0x7fb06c015870 "classres", admin_comment = 0x0, alias_list = 0x0,
alloc_node = 0x7fb06c00c000 "lac-249", alloc_resp_port = 51636, alloc_sid = 26149, array_job_id = 0,
array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1607, assoc_ptr = 0x7fb080066fb0, batch_flag = 0,
batch_host = 0x7fb06c016820 "lac-338", billable_tres = 2, bit_flags = 16384, burst_buffer = 0x0,
burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0,
cpu_cnt = 2, cr_enabled = 1, db_index = 102715, deadline = 0, delay_boot = 0, derived_ec = 0,
details = 0x7fb06c0154f0, direct_set_prio = 0, end_time = 1532452897, end_time_exp = 1532452897,
epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0,
gres_list = 0x0, gres_alloc = 0x7fb06c017710 "", gres_detail_cnt = 0, gres_detail_str = 0x0,
gres_req = 0x7fb06c0181d0 "", gres_used = 0x0, group_id = 2103, job_id = 6825, job_next = 0x0,
job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1,
last_sched_eval = 1532449297, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0,
tres = 0x7fb06c014ce0}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0,
name = 0x7fb06c014580 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7fb06c0181f0 "lac-[338-339]",
node_addr = 0x7fb06c017b70, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 2, node_cnt_wag = 2,
nodes_completing = 0x0, origin_cluster = 0x7fb06c005cd0 "msuhpcc", other_port = 51635, pack_job_id = 0,
pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0,
partition = 0x7fb06c019d60 "general-short-16,classres-14,classres-16,general-short-14,general-short-18",
part_ptr_list = 0x904b40, part_nodes_missing = false, part_ptr = 0xb700a0, power_flags = 0 '\000',
pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 101, priority_array = 0x7fb06c015990,
prio_factors = 0x7fb06c0156a0, profile = 0, qos_id = 1, qos_ptr = 0x7fb080002700, qos_blocking_ptr = 0x0,
reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0,
requid = 4294967295, resp_host = 0x7fb06c0158a0 "192.168.8.49", sched_nodes = 0x0,
select_jobinfo = 0x7fb06c015900, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192,
start_time = 1532449297,
state_desc = 0x7fb06c01e780 "ReqNodeNotAvail, UnavailableNodes:csm-[000-023],csn-[000-039],csp-[001-027],css-[001-032,034-075,078-127],lac-[000-197,199-337,340-371,373-445],qml-[000-005],test-skl-000,vim-[000-002]",
state_reason = 15, state_reason_prev = 15, step_list = 0x838500, suspend_time = 0, time_last_active = 1532449297,
time_limit = 60, time_min = 0, tot_sus_time = 0, total_cpus = 2, total_nodes = 2, tres_req_cnt = 0x7fb06c014d80,
tres_req_str = 0x7fb06c0157b0 "1=2,2=6554,4=2", tres_fmt_req_str = 0x7fb06c015810 "cpu=2,mem=6554M,node=2",
tres_alloc_cnt = 0x7fb06c017a60, tres_alloc_str = 0x7fb06c017bb0 "1=2,2=6554,3=18446744073709551614,4=2,5=2",
tres_fmt_alloc_str = 0x7fb06c017ae0 "cpu=2,mem=6554M,node=2,billing=2", user_id = 804793,
user_name = 0x7fb06c017c80 "changc81", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0,
wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
PID 17835 __assert_fail, validate_jobs_on_node
$1 = {account = 0x7ff03c00d3c0 "classres", admin_comment = 0x0, alias_list = 0x0,
alloc_node = 0x7ff03c00c6d0 "lac-249", alloc_resp_port = 52611, alloc_sid = 26149, array_job_id = 0,
array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1607, assoc_ptr = 0x764130, batch_flag = 0,
batch_host = 0x7ff03c00da40 "lac-338", billable_tres = 2, bit_flags = 16384, burst_buffer = 0x0,
burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0,
cpu_cnt = 2, cr_enabled = 1, db_index = 102759, deadline = 0, delay_boot = 0, derived_ec = 0,
details = 0x7ff03c014e70, direct_set_prio = 0, end_time = 1532455646, end_time_exp = 1532455646,
epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0,
gres_list = 0x0, gres_alloc = 0x7ff03c00ed30 "", gres_detail_cnt = 0, gres_detail_str = 0x0,
gres_req = 0x7ff03c004fa0 "", gres_used = 0x0, group_id = 2103, job_id = 6847, job_next = 0x0,
job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1,
last_sched_eval = 1532452046, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0,
tres = 0x7ff03c00d290}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0,
name = 0x7ff03c004920 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7ff03c014420 "lac-[338-339]",
node_addr = 0x7ff03c005450, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 2, node_cnt_wag = 2,
nodes_completing = 0x0, origin_cluster = 0x7ff03c000e40 "msuhpcc", other_port = 52610, pack_job_id = 0,
pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0,
partition = 0x7ff03c010ad0 "general-short-16,classres-14,classres-16,general-short-14,general-short-18",
part_ptr_list = 0x9f7180, part_nodes_missing = false, part_ptr = 0xa957a0, power_flags = 0 '\000',
pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 101, priority_array = 0x7ff03c00f0e0,
prio_factors = 0x7ff03c015020, profile = 0, qos_id = 0, qos_ptr = 0x0, qos_blocking_ptr = 0x0, reboot = 0 '\000',
restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295,
resp_host = 0x7ff03c00d4f0 "192.168.8.49", sched_nodes = 0x0, select_jobinfo = 0x7ff03c00ed50,
spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192, start_time = 1532452046,
state_desc = 0x7ff03c006a90 "ReqNodeNotAvail, UnavailableNodes:csm-[000-023],csn-[000-039],csp-[001-027],css-[001-075,078-127],lac-[000-197,199-337,340-371,373-445],qml-[000-005],test-skl-000,vim-[000-002]", state_reason = 15,
state_reason_prev = 15, step_list = 0x9f7220, suspend_time = 0, time_last_active = 1532452046, time_limit = 60,
time_min = 0, tot_sus_time = 0, total_cpus = 2, total_nodes = 2, tres_req_cnt = 0x7ff03c0052a0,
tres_req_str = 0x7ff03c0048c0 "1=2,2=6554,4=2", tres_fmt_req_str = 0x7ff03c0050c0 "cpu=2,mem=6554M,node=2",
tres_alloc_cnt = 0x7ff03c007380, tres_alloc_str = 0x7ff03c00eeb0 "1=2,2=6554,3=18446744073709551614,4=2,5=2",
tres_fmt_alloc_str = 0x7ff03c005320 "cpu=2,mem=6554M,node=2,billing=2", user_id = 804793,
user_name = 0x7ff03c0055a0 "changc81", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0,
wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
PID 36384 __assert_fail, _purge_missing_jobs
$1 = {account = 0x7f0f64004080 "test1", admin_comment = 0x0, alias_list = 0x0,
alloc_node = 0x7f0f64002c70 "lac-249", alloc_resp_port = 51917, alloc_sid = 11118, array_job_id = 0,
array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1646, assoc_ptr = 0xfbf880, batch_flag = 0,
batch_host = 0x7f0f6400e600 "lac-338", billable_tres = 20, bit_flags = 16384, burst_buffer = 0x0,
burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0,
cpu_cnt = 20, cr_enabled = 1, db_index = 102946, deadline = 0, delay_boot = 0, derived_ec = 0,
details = 0x7f0f6400b8c0, direct_set_prio = 0, end_time = 1532472964, end_time_exp = 1532472964,
epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0,
gres_list = 0x0, gres_alloc = 0x7f0f6400e640 "", gres_detail_cnt = 0, gres_detail_str = 0x0,
gres_req = 0x7f0f6400e620 "", gres_used = 0x0, group_id = 2000, job_id = 6943, job_next = 0x0,
job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1,
last_sched_eval = 1532458564, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0,
tres = 0x7f0f6400b0b0}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0,
name = 0x7f0f64001d80 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7f0f6400e660 "lac-[338-339]",
node_addr = 0x7f0f6400dfc0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 2, node_cnt_wag = 2,
nodes_completing = 0x0, origin_cluster = 0x7f0f6400bf30 "msuhpcc", other_port = 51916, pack_job_id = 0,
pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0,
partition = 0x7f0f6400e440 "general-short-18,test1-14,test1-16,test1-18,general-short-14,general-short-16",
part_ptr_list = 0x1271b60, part_nodes_missing = false, part_ptr = 0x131f7a0, power_flags = 0 '\000',
pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 102, priority_array = 0x7f0f6400c040,
prio_factors = 0x7f0f6400ba70, profile = 0, qos_id = 1, qos_ptr = 0xfb82a0, qos_blocking_ptr = 0x0,
reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0,
requid = 4294967295, resp_host = 0x7f0f6400bf00 "192.168.8.49", sched_nodes = 0x0,
select_jobinfo = 0x7f0f6400bf70, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8192,
start_time = 1532458564, state_desc = 0x0, state_reason = 20, state_reason_prev = 20, step_list = 0x1281f90,
suspend_time = 0, time_last_active = 1532458564, time_limit = 240, time_min = 0, tot_sus_time = 0,
total_cpus = 20, total_nodes = 2, tres_req_cnt = 0x7f0f6400b150,
tres_req_str = 0x7f0f6400be40 "1=20,2=40960,4=2", tres_fmt_req_str = 0x7f0f6400bea0 "cpu=20,mem=40G,node=2",
tres_alloc_cnt = 0x7f0f6400deb0, tres_alloc_str = 0x7f0f6400df30 "1=20,2=40960,3=18446744073709551614,4=2,5=20",
tres_fmt_alloc_str = 0x7f0f6400e000 "cpu=20,mem=40G,node=2,billing=20", user_id = 175025,
user_name = 0x7f0f6400db80 "jal", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0,
wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
It looks like a different job for each segfault. I'll attach the logs from yesterday to this ticket as well.
Created attachment 7403 [details]
slurmctld log from 7-24
Hi Steve,

We think that we know how to avoid the segfault. It seems these jobs are assigned to a group of nodes; however, the node bitmap that is also assigned to the jobs is NULL when it should have a value. We are still not sure why the bitmap is NULL, so we are still looking into this.

Kind regards,
Jason

Thanks, Jason. Let me know if there's any more information I can provide to track down why the node bitmap is NULL. I'll attach our conf file and job submit script in case those shed light on the issue.

Created attachment 7406 [details]
Job submit script
Created attachment 7407 [details]
Slurm Config File
Hi Steve,

Please send us the slurmd logs from 2018-07-20 on nodes css-[033,076]. We would like to confirm whether job 6256 was running at the same time as the crash.

Kind regards,
Jason

Created attachment 7409 [details]
css-033 logs
Created attachment 7410 [details]
css-033 logs
Created attachment 7411 [details]
css-076 logs
Jason, I looked at the job from the first segfault on 7/20; it was 6256.
$6 = {account = 0x7f64500161d0 "classres", admin_comment = 0x0, alias_list = 0x0,
alloc_node = 0x7f64500161a0 "dev-intel18", alloc_resp_port = 51895, alloc_sid = 21447, array_job_id = 0,
array_task_id = 4294967294, array_recs = 0x0, assoc_id = 1607, assoc_ptr = 0x264cad0, batch_flag = 0,
batch_host = 0x7f645000fed0 "css-033", billable_tres = 2, bit_flags = 16384, burst_buffer = 0x0,
burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0,
cpu_cnt = 2, cr_enabled = 1, db_index = 101612, deadline = 0, delay_boot = 0, derived_ec = 0,
details = 0x7f6450015e20, direct_set_prio = 0, end_time = 1532113356, end_time_exp = 1532113356,
epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres = 0x0,
gres_list = 0x0, gres_alloc = 0x7f645000fe70 "", gres_detail_cnt = 0, gres_detail_str = 0x0,
gres_req = 0x7f6450012140 "", gres_used = 0x0, group_id = 2103, job_id = 6256, job_next = 0x0,
job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x0, job_state = 1, kill_on_node_fail = 1,
last_sched_eval = 1532109756, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0,
tres = 0x7f6450015650}, mail_type = 0, mail_user = 0x0, magic = 4038539564, mcs_label = 0x0,
name = 0x7f6450000f50 "sh", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x7f6450012160 "css-[033,076]",
node_addr = 0x7f645000ef90, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 2, node_cnt_wag = 2,
nodes_completing = 0x0, origin_cluster = 0x7f6450001720 "msuhpcc", other_port = 51894, pack_job_id = 0,
pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0,
partition = 0x7f645000ef00 "classres,general-short", part_ptr_list = 0x7f64480bc270, part_nodes_missing = false,
part_ptr = 0x29a9d40, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false,
priority = 101, priority_array = 0x7f6450016340, prio_factors = 0x7f6450015fd0, profile = 0, qos_id = 0,
qos_ptr = 0x0, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0,
resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x7f6450016200 "192.168.9.200",
sched_nodes = 0x0, select_jobinfo = 0x7f6450016260, spank_job_env = 0x0, spank_job_env_size = 0,
start_protocol_ver = 8192, start_time = 1532109756,
state_desc = 0x7f645000f8d0 "ReqNodeNotAvail, UnavailableNodes:csm-[000-023],csn-[000-039],csp-[001-027],css-[001-032,034-075,078-127],lac-[000-197,199-337,340-371,373-445],vim-[000-002]", state_reason = 15,
state_reason_prev = 15, step_list = 0x290f030, suspend_time = 0, time_last_active = 1532109756, time_limit = 60,
time_min = 0, tot_sus_time = 0, total_cpus = 2, total_nodes = 2, tres_req_cnt = 0x7f64500156b0,
tres_req_str = 0x7f64500160e0 "1=2,2=6554,4=2", tres_fmt_req_str = 0x7f6450016140 "cpu=2,mem=6554M,node=2",
tres_alloc_cnt = 0x7f645000ee80, tres_alloc_str = 0x7f645000ff40 "1=2,2=6554,3=18446744073709551614,4=2,5=2",
tres_fmt_alloc_str = 0x7f645000efd0 "cpu=2,mem=6554M,node=2,billing=2", user_id = 804793,
user_name = 0x7f6450015680 "changc81", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0,
wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
Hi Steve,

Just passing along a quick update. We have not been able to trigger this issue internally yet. We would like to know if things have calmed down for you or if you are still seeing consistent crashes (today).

Kind regards,
Jason

We saw two more crashes yesterday. Both had validate_jobs_on_node in their backtrace.

Three more segfaults today. Two in validate_jobs_on_node and one in _purge_missing_jobs.

Hi Steve,

This issue may be related to 5452 and a few others (5438, 5447, 5276). We believe that the issue here is that jobs which can run in multiple partitions are started immediately before _select_nodes_parts() finishes when "EnforcePartLimits=ALL" is set. I will keep you updated as we make progress towards a patch.

Kind regards,
Jason

Created attachment 7441 [details]
Patch 5457
Hi Steve,
This patch should fix this issue.
It hasn't been committed yet, but we think it will be soon in this or similar form.
Kind regards,
Jason
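
The patch itself ships as attachment 7441 rather than inline, so the following is only a hypothetical sketch of the defensive shape such a fix can take at the two crash sites: treating a job whose node_bitmap is still unset (e.g. a multi-partition job caught mid-start under EnforcePartLimits=ALL) as not matching the node, instead of asserting. The struct, helper, and log text are illustrative assumptions, not the committed change.

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal stand-ins for the Slurm structures seen in the dumps. */
typedef uint64_t bitstr_t;

struct job_record {
	uint32_t  job_id;
	bitstr_t *node_bitmap;   /* NULL while a multi-partition start
	                          * is still in flight (the race here) */
};

static int bit_test(const bitstr_t *b, int bit)
{
	return (int)((b[bit >> 6] >> (bit & 63)) & 1);
}

/* Hypothetical guard, NOT the committed fef07a40972 change: skip and
 * log jobs whose bitmap has not been populated, instead of letting
 * bit_test() dereference NULL and trip the validity assert. */
static int job_runs_on_node(const struct job_record *job, int node_inx)
{
	if (job->node_bitmap == NULL) {
		fprintf(stderr, "JobId=%u has no node_bitmap\n",
			job->job_id);
		return 0;
	}
	return bit_test(job->node_bitmap, node_inx);
}

int main(void)
{
	bitstr_t bits[8] = { 0 };
	bits[166 >> 6] |= 1ULL << (166 & 63);   /* node 166 allocated */

	struct job_record ok  = { 6256, bits };
	struct job_record bad = { 6256, NULL }; /* mid-start, as crashed */

	printf("ok:  %d\n", job_runs_on_node(&ok, 166));   /* 1 */
	printf("bad: %d\n", job_runs_on_node(&bad, 166));  /* 0, logged */
	return 0;
}
```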
Jason, we've applied this patch. We'll see how things go for the next few days. Thanks.

*** Ticket 5472 has been marked as a duplicate of this ticket. ***
*** Ticket 5473 has been marked as a duplicate of this ticket. ***

Hi Steve,

I am checking in with you to see how the last 24 hours have gone with the patch. Have you seen any issues since applying it?

Kind regards,
Jason

Hello Jason, I'm happy to report that we have not had any crashes since we applied the patch.

Hi Steve,
> I am checking in with you to see how the last 24 hours have gone with the patch. Have you seen any issues since applying it?
This is good news. I will drop the priority of this ticket to a sev3 for now since the patch seems to have mitigated the disruptions.
Kind regards,
Jason
Hi Steve,

Since this appears to have been resolved with the following commit, I will proceed to close this issue out: https://github.com/SchedMD/slurm/commit/fef07a40972

Please do let me know if the issue happens again.

Best regards,
Jason