Hi!

We're having issues with a job making the primary (and the backup) controller segfault at start. It stops on the following messages:

slurmctld[73758]: error: job_resources_node_inx_to_cpu_inx: no job_resrcs or node_bitmap
slurmctld[73758]: error: job_update_tres_cnt: problem getting offset of job 22135921
slurmctld[73758]: cleanup_completing: job 22135921 completion process took 1355 seconds

and in dmesg:

srvcn[76591]: segfault at 58 ip 00000000004aa9c2 sp 00007fda2df00f70 error 4 in slurmctld[400000+de000]

Right now, the cluster is more or less down, so I made this a Sev 1 issue.

Some info about the crash itself:

# addr2line -e /usr/sbin/slurmctld 00000000004aa9c2
/root/rpmbuild/BUILD/slurm-17.11.6/src/slurmctld/step_mgr.c:2081

(gdb) bt
#0  _step_dealloc_lps (step_ptr=0x5246690) at step_mgr.c:2081
#1  post_job_step (step_ptr=step_ptr@entry=0x5246690) at step_mgr.c:4652
#2  0x00000000004aafb3 in _post_job_step (step_ptr=0x5246690) at step_mgr.c:266
#3  _internal_step_complete (job_ptr=job_ptr@entry=0x5245b80, step_ptr=step_ptr@entry=0x5246690) at step_mgr.c:307
#4  0x00000000004ab031 in delete_step_records (job_ptr=job_ptr@entry=0x5245b80) at step_mgr.c:336
#5  0x0000000000463d3f in cleanup_completing (job_ptr=job_ptr@entry=0x5245b80) at job_scheduler.c:4695
#6  0x000000000046e02f in make_node_idle (node_ptr=0x296ed20, job_ptr=job_ptr@entry=0x5245b80) at node_mgr.c:3744
#7  0x000000000044ca53 in job_epilog_complete (job_id=<optimized out>, node_name=0x7fd9d80566e0 "sh-28-04", return_code=0) at job_mgr.c:14584
#8  0x00000000004870a7 in _slurm_rpc_epilog_complete (running_composite=true, run_scheduler=0x7fda2df019ef, msg=0x7fd9d8057550) at proc_req.c:2200
#9  _slurm_rpc_comp_msg_list (comp_msg=comp_msg@entry=0x7fd9d80d3a80, run_scheduler=run_scheduler@entry=0x7fda2df019ef, msg_list_in=0x5b240e0, start_tv=start_tv@entry=0x7fda2df019f0, timeout=timeout@entry=2000000) at proc_req.c:6696
#10 0x00000000004867db in _slurm_rpc_comp_msg_list (comp_msg=comp_msg@entry=0x7fd9d80d0710, run_scheduler=run_scheduler@entry=0x7fda2df019ef, msg_list_in=0x5b23960, start_tv=start_tv@entry=0x7fda2df019f0, timeout=2000000) at proc_req.c:6659
#11 0x0000000000487627 in _slurm_rpc_composite_msg (msg=msg@entry=0x7fda2df01e50) at proc_req.c:6574
#12 0x000000000048edf4 in slurmctld_req (msg=msg@entry=0x7fda2df01e50, arg=arg@entry=0x7fda00000b50) at proc_req.c:579
#13 0x0000000000424e78 in _service_connection (arg=0x7fda00000b50) at controller.c:1125
#14 0x00007fda3239de25 in start_thread () from /lib64/libpthread.so.0
#15 0x00007fda320c7bad in clone () from /lib64/libc.so.6

Could you please advise on:
1. how to remove the job that is causing the issue
2. how to avoid the segfault in case such a job is produced again?

Thanks!
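From the faulting address and the backtrace, this looks like step cleanup dereferencing a NULL job_resrcs pointer. For illustration only, here is a minimal hypothetical C sketch of that failure mode -- this is not the actual Slurm source; the struct layout and all names are invented just to show why a read would land at address 0x58:

/* Hypothetical sketch of the failure mode -- not the actual Slurm source.
 * If job_resrcs is NULL, reading a member near offset 0x58 dereferences
 * address 0x58, matching the faulting address reported in dmesg. */
#include <stdint.h>

typedef struct bitstr bitstr_t;        /* opaque stand-in */

struct job_resources_sketch {          /* layout is illustrative only */
    uint8_t   other_fields[0x58];
    bitstr_t *node_bitmap;             /* member sits near offset 0x58 */
};

struct job_record_sketch {
    struct job_resources_sketch *job_resrcs;   /* NULL for the bad job */
};

static int node_inx_to_cpu_inx_sketch(struct job_record_sketch *job_ptr)
{
    /* Faults when job_ptr->job_resrcs is NULL: the node_bitmap load
     * reads from address 0x58. */
    return job_ptr->job_resrcs->node_bitmap ? 0 : -1;
}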
Can you grab 'thread apply all bt full'? Is it possible to get 17.11.7 installed on the controller quickly?
Created attachment 7320 [details]
gdb output
(In reply to Tim Wickberg from comment #1)
> Can you grab 'thread apply all bt full'?

Sure, it's attached.

> Is it possible to get 17.11.7 installed on the controller quickly?

I'm gonna try this temporarily, yes (we can't run 17.11.7 because of #5240).

Thanks,
--
Kilian
(In reply to Kilian Cavalotti from comment #3)
> (In reply to Tim Wickberg from comment #1)
> > Can you grab 'thread apply all bt full'?
>
> Sure, it's attached.
>
> > Is it possible to get 17.11.7 installed on the controller quickly?
>
> I'm gonna try this temporarily, yes (we can't run 17.11.7 because of #5240).

Could you run slurm-17.11 head? That has that fix in it, and will be close to the 17.11.8 release due out this week.
(In reply to Tim Wickberg from comment #4)
> Could you run slurm-17.11 head? That has that fix in it, and will be close
> to the 17.11.8 release due out this week.

Good idea. It's compiling right now.
Created attachment 7321 [details]
bypass crash for missing job_resrsc struct

This should bypass that specific issue. There's something odd here in your state - it looks like the job itself has no resources at this point, and this is a stray extra epilog message coming in very late.

I can't guarantee this won't shift the crash elsewhere, but that will hopefully get you up and running.
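In case it helps others reading along, a guard of this general shape is what such a bypass usually amounts to. This is a hedged sketch only -- it is not the contents of attachment 7321, and the struct fields and function names are hypothetical:

/* Hedged sketch only -- not the actual patch in attachment 7321.
 * Idea: log and skip the step deallocation when job_resrcs is missing,
 * instead of dereferencing a NULL pointer. All names are hypothetical. */
#include <stdio.h>
#include <stdint.h>

struct job_resources_sketch { void *node_bitmap; };

struct job_record_sketch {
    uint32_t                     job_id;
    struct job_resources_sketch *job_resrcs;
};

struct step_record_sketch {
    struct job_record_sketch *job_ptr;
};

static void step_dealloc_lps_sketch(struct step_record_sketch *step_ptr)
{
    struct job_record_sketch *job_ptr = step_ptr->job_ptr;

    if (!job_ptr->job_resrcs || !job_ptr->job_resrcs->node_bitmap) {
        fprintf(stderr,
                "error: job %u has no job_resrcs, skipping step dealloc\n",
                (unsigned) job_ptr->job_id);
        return;   /* bypass instead of segfaulting */
    }

    /* ...normal per-node CPU/memory release would follow here... */
}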
(In reply to Tim Wickberg from comment #6)
> Created attachment 7321 [details]
> bypass crash for missing job_resrsc struct
>
> This should bypass that specific issue. There's something odd here in your
> state - it looks like the job itself has no resources at this point, and
> this is a stray extra epilog message coming in very late.
>
> I can't guarantee this won't shift the crash elsewhere, but that will
> hopefully get you up and running.

Appreciate the patch, thank you. I'll give it a try if 17.11.7-master still segfaults.

This is apparently a job that is being recovered when the controller starts, but which is done already. There's no trace of any associated process on the compute node, and the epilog is done too.

We've been suffering a lot of accumulating jobs stuck in CG state lately, pretty much like what's been reported in those bug reports:
• https://bugs.schedmd.com/show_bug.cgi?id=5401
• https://bugs.schedmd.com/show_bug.cgi?id=5121
• https://bugs.schedmd.com/show_bug.cgi?id=5111
• https://bugs.schedmd.com/show_bug.cgi?id=5177

Our current workaround is to restart the controller, which has about a 50% success rate. Is it possible that a job got corrupted during a controller restart?
(In reply to Kilian Cavalotti from comment #7)
> > This should bypass that specific issue. There's something odd here in your
> > state - it looks like the job itself has no resources at this point, and
> > this is a stray extra epilog message coming in very late.
> >
> > I can't guarantee this won't shift the crash elsewhere, but that will
> > hopefully get you up and running.
>
> Appreciate the patch, thank you. I'll give it a try if 17.11.7-master still
> segfaults.

17.11.7-master still segfaults on the same job. I'll try the patch next.
Patch has been applied, seems to be holding up so far!
Since the patch seems to have mitigated the issue for now, I'm moving this to a sev-4. This is very similar to bug 5276, so I'm taking over this bug as well.
(In reply to Marshall Garey from comment #12)
> Since the patch seems to have mitigated the issue for now, I'm moving this
> to a sev-4. This is very similar to bug 5276, so I'm taking over this bug as
> well.

Good, thanks!
We're back in business, thanks!

We would still appreciate:
1. if that fix could be incorporated in 17.11.8 and up, in case this happens again,
2. a chance to understand how that job's structure could be missing its resources part. I don't think there's anything funky about the way the job was submitted (it was a 2-item array job, if that's useful).
(In reply to Kilian Cavalotti from comment #14)
> We're back in business, thanks!
>
> We would still appreciate:
> 1. if that fix could be incorporated in 17.11.8 and up, in case this happens
> again,

I've added Tim to CC to see what he thinks. We want to get 17.11.8 out this week.

> 2. a chance to understand how that job's structure could be missing its
> resources part. I don't think there's anything funky about the way the job
> was submitted (it was a 2-item array job, if that's useful).

We're still trying to figure out how this happens. It really shouldn't be missing the job_resrcs.
Good news: I accidentally reproduced this. What happened:

- A batch array job was running (I don't know if it matters that it's a batch job, or a job array)
- Some other srun jobs were running, too
- My computer lost power (due to a brief power outage at work)
- I rebooted
- I started slurmdbd, then the slurmd's
- Then I started the slurmctld
- Then slurmctld crashed because of a batch job, in the same place as yours

So, my job PIDs obviously didn't exist anymore, because the computer had shut down. The batch script had also disappeared. But the job still existed in the slurmctld.

Anyway, I can investigate this a little better now, since I've got it on my own box, and I've saved the entire state of my system at the time of the crash (StateSaveLocation, database, core dump, slurmctld binary). We'll keep you updated.
*** Ticket 5487 has been marked as a duplicate of this ticket. ***
Created attachment 7433 [details]
Prevent job_resrcs from being overwritten for multi-partition job submissions

Can you apply this patch and see if it fixes the issue? We've been able to reliably reproduce this segfault, and this patch fixes it for us. It hasn't been committed yet, but we think it will be soon.
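The attachment title suggests the underlying problem: for a job submitted to multiple partitions, an already-populated job_resrcs could be overwritten while the job is evaluated against another partition. A rough, hypothetical sketch of that idea follows -- it is not the actual patch, and the flag and function names are invented purely for illustration:

/* Rough, hypothetical sketch of the idea in the attachment title -- not the
 * actual patch. The point is simply: don't rebuild or replace job_resrcs for
 * a job that already has an allocation while trying its other partitions. */
#include <stdbool.h>
#include <stddef.h>

struct job_resources_sketch;

struct job_record_sketch {
    struct job_resources_sketch *job_resrcs;
    bool                         resources_allocated;   /* invented flag */
};

/* Stand-in for whatever builds a fresh allocation for one partition. */
static struct job_resources_sketch *
build_allocation_sketch(struct job_record_sketch *job_ptr)
{
    (void) job_ptr;
    return NULL;   /* placeholder */
}

static void try_partition_sketch(struct job_record_sketch *job_ptr)
{
    if (job_ptr->resources_allocated && job_ptr->job_resrcs)
        return;   /* keep the existing allocation; don't overwrite it */

    job_ptr->job_resrcs = build_allocation_sketch(job_ptr);
    if (job_ptr->job_resrcs)
        job_ptr->resources_allocated = true;
}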
(In reply to Marshall Garey from comment #19)
> Created attachment 7433 [details]
> Prevent job_resrcs from being overwritten for multi-partition job submissions
>
> Can you apply this patch and see if it fixes the issue? We've been able to
> reliably reproduce this segfault, and this patch fixes it for us. It hasn't
> been committed yet, but we think it will be soon.

Thanks! I'll apply the patch and remove the earlier "bypass" patch to make sure we can verify it fixes the issue.

--
Kilian
Have you seen this segfault again?
Hi Marshall,

(In reply to Marshall Garey from comment #21)
> Have you seen this segfault again?

Not since we've applied the patch on 17.11.8, no.

Cheers,
--
Kilian
Well, that's almost 4 weeks. I'm going to close this as resolved/fixed for now. But please reopen it if you see it again.

This patch was committed to 17.11.9:
https://github.com/SchedMD/slurm/commit/fef07a40972

I marked this as a duplicate of bug 5452 since Dominik committed this patch and that's the bug he was working on.

*** This ticket has been marked as a duplicate of ticket 5452 ***