| Summary: | Slurmctld crashing after upgrading to 20.11.4 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Steve Ford <fordste5> |
| Component: | slurmctld | Assignee: | Director of Support <support> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | alex, bart, marshall, mcmullan |
| Version: | 20.11.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=12451 | ||
| Site: | MSU | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 20.11.5 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | thread apply all bt, slurm.conf | | |
Created attachment 18261 [details]
slurm.conf
There are similar thread 1 traces from other dumps. They reference different nodes, but have the same JobId: 15435906.

Thread 1 (Thread 0x7f375f6f6700 (LWP 39487)):
#0 _next_month (tm=0x7f375f6f5b50, entry=0x0) at cron.c:193
#1 calc_next_cron_start (entry=0x0) at cron.c:310
#2 0x0000000000457311 in job_hold_requeue (job_ptr=job_ptr@entry=0x3e531d0) at job_mgr.c:17740
#3 0x0000000000470ce9 in cleanup_completing (job_ptr=job_ptr@entry=0x3e531d0) at job_scheduler.c:4960
#4 0x0000000000477792 in make_node_idle (node_ptr=0x363f1a8, job_ptr=job_ptr@entry=0x3e531d0) at node_mgr.c:3933
#5 0x000000000045051d in job_epilog_complete (job_id=15435906, node_name=0x7f3760000c50 "css-039", return_code=0) at job_mgr.c:15411
#6 0x000000000049593a in _slurm_rpc_epilog_complete (msg=0x7f3760001870) at proc_req.c:1933
#7 0x000000000049b730 in slurmctld_req (msg=msg@entry=0x7f3760001870) at proc_req.c:6501
#8 0x000000000042aec5 in _service_connection (arg=0x0) at controller.c:1270
#9 0x00007f37ae9a5e25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f37ae6cfbad in clone () from /lib64/libc.so.6
---
Thread 1 (Thread 0x7f0c90fcf700 (LWP 22547)):
#0 _next_month (tm=0x7f0c90fceb50, entry=0x0) at cron.c:193
#1 calc_next_cron_start (entry=0x0) at cron.c:310
#2 0x0000000000457311 in job_hold_requeue (job_ptr=job_ptr@entry=0x2ca51d0) at job_mgr.c:17740
#3 0x0000000000470ce9 in cleanup_completing (job_ptr=job_ptr@entry=0x2ca51d0) at job_scheduler.c:4960
#4 0x0000000000477792 in make_node_idle (node_ptr=0x24911a8, job_ptr=job_ptr@entry=0x2ca51d0) at node_mgr.c:3933
#5 0x000000000045051d in job_epilog_complete (job_id=15435906, node_name=0x7f0c40000b40 "css-039", return_code=0) at job_mgr.c:15411
#6 0x000000000049593a in _slurm_rpc_epilog_complete (msg=0x7f0c40000bc0) at proc_req.c:1933
#7 0x000000000049b730 in slurmctld_req (msg=msg@entry=0x7f0c40000bc0) at proc_req.c:6501
#8 0x000000000042aec5 in _service_connection (arg=0x0) at controller.c:1270
#9 0x00007f0caad53e25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f0caaa7dbad in clone () from /lib64/libc.so.6
---
Thread 1 (Thread 0x7fa216cac700 (LWP 21031)):
#0 _next_month (tm=0x7fa216cabb50, entry=0x0) at cron.c:193
#1 calc_next_cron_start (entry=0x0) at cron.c:310
#2 0x0000000000457311 in job_hold_requeue (job_ptr=job_ptr@entry=0x2ea71d0) at job_mgr.c:17740
#3 0x0000000000470ce9 in cleanup_completing (job_ptr=job_ptr@entry=0x2ea71d0) at job_scheduler.c:4960
#4 0x0000000000477792 in make_node_idle (node_ptr=0x26931a8, job_ptr=job_ptr@entry=0x2ea71d0) at node_mgr.c:3933
#5 0x000000000045051d in job_epilog_complete (job_id=15435906, node_name=0x7fa208009670 "css-039", return_code=0) at job_mgr.c:15411
#6 0x000000000049593a in _slurm_rpc_epilog_complete (msg=0x7fa208009760) at proc_req.c:1933
#7 0x000000000049b730 in slurmctld_req (msg=msg@entry=0x7fa208009760) at proc_req.c:6501
#8 0x000000000042aec5 in _service_connection (arg=0x0) at controller.c:1270
#9 0x00007fa22f57ae25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fa22f2a4bad in clone () from /lib64/libc.so.6
---
Thread 1 (Thread 0x7f70f65a5700 (LWP 20537)):
#0 _next_month (tm=0x7f70f65a4b50, entry=0x0) at cron.c:193
#1 calc_next_cron_start (entry=0x0) at cron.c:310
#2 0x0000000000457311 in job_hold_requeue (job_ptr=job_ptr@entry=0x2955190) at job_mgr.c:17740
#3 0x0000000000470ce9 in cleanup_completing (job_ptr=job_ptr@entry=0x2955190) at job_scheduler.c:4960
#4 0x0000000000477792 in make_node_idle (node_ptr=0x21411a8, job_ptr=job_ptr@entry=0x2955190) at node_mgr.c:3933
#5 0x000000000045051d in job_epilog_complete (job_id=15435906, node_name=0x7f709c52e460 "css-039", return_code=0) at job_mgr.c:15411
#6 0x000000000049593a in _slurm_rpc_epilog_complete (msg=0x7f709c46f360) at proc_req.c:1933
#7 0x000000000049b730 in slurmctld_req (msg=msg@entry=0x7f709c46f360) at proc_req.c:6501
#8 0x000000000042aec5 in _service_connection (arg=0x0) at controller.c:1270
#9 0x00007f7112b80e25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f71128aabad in clone () from /lib64/libc.so.6
---
Thread 1 (Thread 0x7faee48c8700 (LWP 13476)):
#0 _next_month (tm=0x7faee48c7b50, entry=0x0) at cron.c:193
#1 calc_next_cron_start (entry=0x0) at cron.c:310
#2 0x0000000000457311 in job_hold_requeue (job_ptr=job_ptr@entry=0x439c1c0) at job_mgr.c:17740
#3 0x0000000000470ce9 in cleanup_completing (job_ptr=job_ptr@entry=0x439c1c0) at job_scheduler.c:4960
#4 0x0000000000477792 in make_node_idle (node_ptr=0x3b975b8, job_ptr=job_ptr@entry=0x439c1c0) at node_mgr.c:3933
#5 0x000000000045051d in job_epilog_complete (job_id=15435906, node_name=0x7faf14600020 "lac-041", return_code=0) at job_mgr.c:15411
#6 0x000000000049593a in _slurm_rpc_epilog_complete (msg=0x7faf14449170) at proc_req.c:1933
#7 0x000000000049b730 in slurmctld_req (msg=msg@entry=0x7faf14449170) at proc_req.c:6501
#8 0x000000000042aec5 in _service_connection (arg=0x0) at controller.c:1270
#9 0x00007faf78a72e25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007faf7879cbad in clone () from /lib64/libc.so.6
---
Thread 1 (Thread 0x7f0928282700 (LWP 4752)):
#0 _next_month (tm=0x7f0928281b50, entry=0x0) at cron.c:193
#1 calc_next_cron_start (entry=0x0) at cron.c:310
#2 0x0000000000457311 in job_hold_requeue (job_ptr=job_ptr@entry=0x2e03190) at job_mgr.c:17740
#3 0x0000000000470ce9 in cleanup_completing (job_ptr=job_ptr@entry=0x2e03190) at job_scheduler.c:4960
#4 0x0000000000477792 in make_node_idle (node_ptr=0x25fe5b8, job_ptr=job_ptr@entry=0x2e03190) at node_mgr.c:3933
#5 0x000000000045051d in job_epilog_complete (job_id=15435906, node_name=0x7f09000de810 "lac-041", return_code=0) at job_mgr.c:15411
#6 0x000000000049593a in _slurm_rpc_epilog_complete (msg=0x7f090000aef0) at proc_req.c:1933
#7 0x000000000049b730 in slurmctld_req (msg=msg@entry=0x7f090000aef0) at proc_req.c:6501
#8 0x000000000042aec5 in _service_connection (arg=0x0) at controller.c:1270
#9 0x00007f0945a88e25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f09457b2bad in clone () from /lib64/libc.so.6

Foreground logs before a crash also implicate this job:

slurmctld: debug2: node_did_resp amr-162
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
slurmctld: cleanup_completing: JobId=15435906 completion process took 24652 seconds
Segmentation fault

Digging a little further, it looks like this block at line 17736 in job_mgr.c is where things go awry. The job's bit_flags pass the check that identifies it as a cron job, but then a null crontab_entry gets passed to calc_next_cron_start.
	/* handle crontab jobs */
	if (job_ptr->bit_flags & CRON_JOB) {
		job_ptr->job_state |= JOB_REQUEUE;
		job_ptr->details->begin_time =
			calc_next_cron_start(job_ptr->details->crontab_entry);
	}
from frame 2 in gdb:
p job_ptr->bit_flags
$9 = 516163584
p job_ptr->details->crontab_entry
$10 = (cron_entry_t *) 0x0
I'm thinking a check could be added here that aborts the requeue of the cron job if crontab_entry is null.
	/* handle crontab jobs */
	if (job_ptr->bit_flags & CRON_JOB && job_ptr->details->crontab_entry) {
		job_ptr->job_state |= JOB_REQUEUE;
		job_ptr->details->begin_time =
			calc_next_cron_start(job_ptr->details->crontab_entry);
	}
I'm going to go ahead and compile this. Let me know if you have any other thoughts about what might fix this issue.
Thanks,
Steve
Hi Steve,

We are looking into this for you and will update you soon. That workaround looks good, at least until we find out why crontab_entry ended up null. I have a few engineers looking into the root cause in the meantime.

Hi,

Could you send us additional gdb output from:

t 1
f 2
p *job_ptr
p *job_ptr->details

Dominik

Hello Jason,

We're back online with that patch. I imagine the worst this might do is prevent scrontab jobs from working properly, but that would be a fair tradeoff for having the scheduler online.

Thanks,
Steve

Hello Dominik,
print *job_ptr
$1 = {
magic = 4038539564,
account = 0x34866f0 "XXXXX",
admin_comment = 0x0,
alias_list = 0x0,
alloc_node = 0x34866c0 "XXXXX",
alloc_resp_port = 0,
alloc_sid = 3765,
array_job_id = 0,
array_task_id = 4294967294,
array_recs = 0x0,
assoc_id = 5516,
assoc_ptr = 0x1f7f840,
batch_features = 0x0,
batch_flag = 3,
batch_host = 0x86a63a0 "css-039",
billable_tres = 1245.184,
bit_flags = 516163584,
burst_buffer = 0x0,
burst_buffer_state = 0x0,
clusters = 0x0,
comment = 0x3486720 "stdout=XXXXX",
cpu_cnt = 4,
cpus_per_tres = 0x0,
cr_enabled = 0,
db_flags = 4,
db_index = 1112966738,
deadline = 0,
delay_boot = 0,
derived_ec = 0,
details = 0x3485dc0,
direct_set_prio = 0,
end_time = 1614921755,
end_time_exp = 4294967294,
epilog_running = false,
exit_code = 0,
fed_details = 0x0,
front_end_ptr = 0x0,
gres_list = 0x0,
gres_detail_cnt = 0,
gres_detail_str = 0x0,
gres_used = 0x0,
group_id = 2156,
het_details = 0x0,
het_job_id = 0,
het_job_id_set = 0x0,
het_job_offset = 0,
het_job_list = 0x0,
job_id = 15435906,
job_next = 0x0,
job_array_next_j = 0x0,
job_array_next_t = 0x0,
job_preempt_comp = 0x0,
job_resrcs = 0x34867f0,
job_state = 1027,
kill_on_node_fail = 1,
last_sched_eval = 1614921750,
licenses = 0x0,
license_list = 0x0,
limit_set = {
qos = 0,
time = 0,
tres = 0x3486040
},
mail_type = 0,
mail_user = 0x3486790 "XXXXX",
mem_per_tres = 0x0,
mcs_label = 0x0,
name = 0x341fc90 "HTML",
network = 0x0,
next_step_id = 0,
nodes = 0x3451bb0 "css-039",
node_addr = 0x85a8310,
node_bitmap = 0x8644c60,
node_bitmap_cg = 0x869c9f0,
node_cnt = 0,
node_cnt_wag = 0,
nodes_completing = 0x347fb10 "css-039",
origin_cluster = 0x0,
other_port = 0,
partition = 0x3486530 "XXXXX,general-short",
part_ptr_list = 0x3485930,
part_nodes_missing = false,
part_ptr = 0x2bfd480,
power_flags = 0 '\000',
pre_sus_time = 0,
preempt_time = 0,
preempt_in_progress = false,
prep_epilog_cnt = 0,
prep_prolog_cnt = 0,
prep_prolog_failed = false,
priority = 62889,
priority_array = 0x0,
prio_factors = 0x3485f90,
profile = 0,
qos_id = 1,
qos_ptr = 0x0,
qos_blocking_ptr = 0x0,
reboot = 0 '\000',
restart_cnt = 2,
resize_time = 0,
resv_id = 0,
resv_list = 0x0,
resv_name = 0x0,
resv_ptr = 0x0,
requid = 4294967295,
resp_host = 0x0,
sched_nodes = 0x0,
select_jobinfo = 0x34867c0,
site_factor = 2147483648,
spank_job_env = 0x0,
spank_job_env_size = 0,
start_protocol_ver = 8704,
start_time = 1614921750,
state_desc = 0x0,
state_reason = 0,
state_reason_prev_db = 36,
step_list = 0x3486480,
suspend_time = 0,
system_comment = 0x0,
time_last_active = 1614944043,
time_limit = 60,
time_min = 0,
tot_sus_time = 0,
total_cpus = 4,
total_nodes = 1,
tres_bind = 0x0,
tres_freq = 0x0,
tres_per_job = 0x0,
tres_per_node = 0x0,
tres_per_socket = 0x0,
tres_per_task = 0x0,
tres_req_cnt = 0x3487990,
tres_req_str = 0x3487550 "1=4,2=8192,4=1,5=1245",
tres_fmt_req_str = 0x3487c70 "cpu=4,mem=8G,node=1,billing=1245",
tres_alloc_cnt = 0x34875b0,
tres_alloc_str = 0x7f329c02f960 "1=4,2=8192,4=1,5=1245",
tres_fmt_alloc_str = 0x7f329c01a180 "cpu=4,mem=8G,node=1,billing=1245",
user_id = 926004,
user_name = 0x3486690 "XXXXX",
wait_all_nodes = 0,
warn_flags = 0,
warn_signal = 0,
warn_time = 0,
wckey = 0x0,
req_switch = 0,
wait4switch = 0,
best_switch = true,
wait4switch_start = 0
}
(gdb) print *job_ptr->details
$2 = {
magic = 233473255,
acctg_freq = 0x0,
accrue_time = 1614921750,
argc = 1,
argv = 0x3486ae0,
begin_time = 1614921750,
cluster_features = 0x0,
contiguous = 0,
core_spec = 65534,
cpu_bind = 0x0,
cpu_bind_type = 65534,
cpu_freq_min = 4294967294,
cpu_freq_max = 4294967294,
cpu_freq_gov = 4294967294,
cpus_per_task = 4,
crontab_entry = 0x0,
orig_cpus_per_task = 4,
depend_list = 0x0,
dependency = 0x0,
orig_dependency = 0x3486940 "afterok:15435901,afterok:15435879,afterok:15435883,afterok:15435888,afterok:15435892,afterok:15435897",
env_cnt = 0,
env_sup = 0x0,
exc_node_bitmap = 0x0,
exc_nodes = 0x0,
expanding_jobid = 0,
extra = 0x0,
feature_list = 0x8cf85b0,
features = 0x3486570 "[intel14|intel16|intel18|amr|nvf]",
max_cpus = 4294967294,
orig_max_cpus = 4294967294,
max_nodes = 1,
mc_ptr = 0x3486ab0,
mem_bind = 0x0,
mem_bind_type = 0,
min_cpus = 4,
orig_min_cpus = 4,
min_gres_cpu = 0,
min_nodes = 1,
nice = 2147483648,
ntasks_per_node = 1,
ntasks_per_tres = 0,
num_tasks = 1,
open_mode = 0 '\000',
overcommit = 0 '\000',
plane_size = 0,
pn_min_cpus = 4,
orig_pn_min_cpus = 4,
pn_min_memory = 8192,
orig_pn_min_memory = 8192,
pn_min_tmp_disk = 0,
prolog_running = 0 '\000',
reserved_resources = 0,
req_node_bitmap = 0x0,
preempt_start_time = 0,
req_nodes = 0x0,
requeue = 1,
share_res = 254 '\376',
std_err = 0x0,
std_in = 0x34869c0 "/dev/null",
std_out = 0x34869f0 "XXXXX",
submit_time = 1614921629,
task_dist = 8192,
usable_nodes = 0,
whole_node = 0 '\000',
work_dir = 0x3486a60 "XXXXX",
x11 = 0,
x11_magic_cookie = 0x0,
x11_target = 0x0,
x11_target_port = 0
}
>We're back online with that patch. I imagine the worst this might do is prevent scrontab jobs from working properly, but that would be a fair tradeoff for having the scheduler online.
Steve - I am going to downgrade this in the meantime. Feel free to change priority back if needed.
Jason and Dominik,

FYI, we do not have scrontab enabled on our cluster.

Best,
Steve

(In reply to Steve Ford from comment #24)
> FYI, we do not have scrontab enabled on our cluster.

You don't need scrontab for this issue to happen. We removed the INVALID_DEPEND job flag from 19.05 and reused that spot in 20.11 for the CRON_JOB flag for scrontab. (Commit https://github.com/SchedMD/slurm/commit/4abf5423b4 removes INVALID_DEPEND and commit https://github.com/SchedMD/slurm/commit/5ff307dfdc reuses that spot for CRON_JOB.)

(In reply to Steve Ford from comment #0)
> After updating our SLURM server to 20.11.4, we are seeing frequent crashes of the slurmctld daemon.

What version did you upgrade from?

(In reply to Michael Hinton from comment #26)
> We removed the INVALID_DEPEND job flag from 19.05

Correction: that flag was last present in 19.05, and we removed it starting in 20.02.

Michael,

We updated from 19.05.7.

Steve

Steve,

This issue has now been fixed in the following commits and will land in 20.11.5:

* https://github.com/SchedMD/slurm/commit/507508257f
* https://github.com/SchedMD/slurm/commit/2db5823105

I put you down as a co-author for the second commit, since you did the initial legwork to find the segfaulting spot and to develop the workaround. I'll go ahead and close this out. If you have any more questions, let me know.

Thanks!
-Michael
Created attachment 18260 [details]
thread apply all bt

Hello,

After updating our SLURM server to 20.11.4, we are seeing frequent crashes of the slurmctld daemon. Attached is the output of `thread apply all bt` for the latest crash.

Thanks,
Steve