Description
Kilian Cavalotti
2018-07-16 19:38:33 MDT
Can you grab 'thread apply all bt full'? Is it possible to get 17.11.7
installed on the controller quickly?

Created attachment 7320 [details]
gdb output
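For reference, the backtrace requested above is captured inside gdb against
the crashed slurmctld. A minimal session might look like the following; the
binary and core file paths are assumptions and will differ per site:

    # Load the core dump into gdb (paths are examples; adjust to your site).
    $ gdb /usr/sbin/slurmctld /var/spool/slurmctld/core.1234

    # Inside gdb, log the full per-thread backtrace to a file:
    (gdb) set logging file bt-full.txt
    (gdb) set logging on
    (gdb) thread apply all bt full
    (gdb) set logging off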
(In reply to Tim Wickberg from comment #1)
> Can you grab 'thread apply all bt full'?

Sure, it's attached.

> Is it possible to get 17.11.7 installed on the controller quickly?

I'm gonna try this temporarily, yes (we can't run 17.11.7 because of #5240)

Thanks,
--
Kilian

(In reply to Kilian Cavalotti from comment #3)
> (In reply to Tim Wickberg from comment #1)
> > Can you grab 'thread apply all bt full'?
>
> Sure, it's attached.
>
> > Is it possible to get 17.11.7 installed on the controller quickly?
>
> I'm gonna try this temporarily, yes (we can't run 17.11.7 because of #5240)

Could you run slurm-17.11 head? That has that fix in it, and will be close
to the 17.11.8 release due out this week.

(In reply to Tim Wickberg from comment #4)
> Could you run slurm-17.11 head? That has that fix in it, and will be close
> to the 17.11.8 release due out this week.

Good idea. It's compiling right now.

Created attachment 7321 [details]
bypass crash for missing job_resrsc struct
This should bypass that specific issue. There's something odd here in your state - it looks like the job itself has no resources at this point, and this is a stray extra epilog message coming in very late.
I can't guarantee this won't shift the crash elsewhere, but that will hopefully get you up and running.
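The patch itself is in the attachment above. Purely as a rough,
self-contained sketch of the kind of NULL guard such a bypass typically
adds — every struct, field, and function name below is an illustrative
stand-in, not Slurm's actual internals or the attached patch:

    #include <stdio.h>

    /* Illustrative stand-ins for Slurm's internal structures; these are
     * assumptions for the sketch, not Slurm's real definitions. */
    typedef struct {
        int ncpus;                    /* placeholder for allocated resources */
    } job_resources_t;

    typedef struct {
        unsigned int job_id;
        job_resources_t *job_resrcs;  /* may be NULL for a stale job */
    } job_record_t;

    /* Handle a (possibly stray, very late) epilog-complete message.
     * Without the NULL guard, a stale job whose resources are already
     * gone would segfault on the dereference below. */
    static void handle_epilog_complete(job_record_t *job_ptr)
    {
        if (job_ptr->job_resrcs == NULL) {
            fprintf(stderr, "job %u: no job_resrcs, ignoring stray epilog\n",
                    job_ptr->job_id);
            return;                   /* bypass instead of crashing */
        }
        /* ...normal resource deallocation would dereference
         * job_ptr->job_resrcs here... */
        printf("job %u: released %d CPUs\n", job_ptr->job_id,
               job_ptr->job_resrcs->ncpus);
    }

    int main(void)
    {
        job_record_t stale = { .job_id = 12345, .job_resrcs = NULL };
        handle_epilog_complete(&stale);   /* warns instead of segfaulting */
        return 0;
    }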
(In reply to Tim Wickberg from comment #6)
> Created attachment 7321 [details]
> bypass crash for missing job_resrsc struct
>
> This should bypass that specific issue. There's something odd here in your
> state - it looks like the job itself has no resources at this point, and
> this is a stray extra epilog message coming in very late.
>
> I can't guarantee this won't shift the crash elsewhere, but that will
> hopefully get you up and running.

Appreciate the patch, thank you. I'll give it a try if 17.11.7-master still
segfaults.

This is apparently a job that is being recovered when the controller starts,
but which is already done. There's no trace of any associated process on the
compute node, and the epilog is done too.

We've been suffering from a lot of jobs accumulating in CG state lately,
much like what's been reported in these bug reports:
• https://bugs.schedmd.com/show_bug.cgi?id=5401
• https://bugs.schedmd.com/show_bug.cgi?id=5121
• https://bugs.schedmd.com/show_bug.cgi?id=5111
• https://bugs.schedmd.com/show_bug.cgi?id=5177

Our current workaround is to restart the controller, which has about a 50%
success rate. Is it possible that a job got corrupted during a controller
restart?

(In reply to Kilian Cavalotti from comment #7)
> > This should bypass that specific issue. There's something odd here in
> > your state - it looks like the job itself has no resources at this
> > point, and this is a stray extra epilog message coming in very late.
> >
> > I can't guarantee this won't shift the crash elsewhere, but that will
> > hopefully get you up and running.
>
> Appreciate the patch, thank you. I'll give it a try if 17.11.7-master
> still segfaults.

17.11.7-master still segfaults on the same job. I'll try the patch next.

Patch has been applied, seems to be holding up so far!

Since the patch seems to have mitigated the issue for now, I'm moving this
to a sev-4. This is very similar to bug 5276, so I'm taking over this bug
as well.

(In reply to Marshall Garey from comment #12)
> Since the patch seems to have mitigated the issue for now, I'm moving this
> to a sev-4. This is very similar to bug 5276, so I'm taking over this bug
> as well.

Good, thanks!

We're back in business, thanks!

We would still appreciate:
1. if that fix could be incorporated in 17.11.8 and up, in case this
happens again,
2. a chance to understand how that job's structure could be missing its
resources part. I don't think there's anything funky about the way the job
has been submitted (it was a 2-item array job, if that's useful).

(In reply to Kilian Cavalotti from comment #14)
> We're back in business, thanks!
>
> We would still appreciate:
> 1. if that fix could be incorporated in 17.11.8 and up, in case this
> happens again,

I've added Tim to CC to see what he thinks. We're wanting to get 17.11.8
out this week.

> 2. a chance to understand how that job's structure could be missing its
> resources part. I don't think there's anything funky about the way the
> job has been submitted (it was a 2-item array job, if that's useful)

We're still trying to figure out how this happens. It really shouldn't be
missing the job_resrcs.

Good news: I accidentally reproduced this.
What happened:
- A batch array job was running (I don't know if it matters that it's a
batch job, or a job array)
- Some other srun jobs were running, too
- My computer lost power (due to a brief power outage at work)
- I rebooted
- I started slurmdbd, then the slurmd's
- Then I started the slurmctld
- Then slurmctld crashed because of a batch job, in the same place as yours

So my job PIDs obviously didn't exist anymore, because the computer had
shut down. The batch script had also disappeared. But the job still existed
in the slurmctld.

Anyway, I can investigate this a little better now, since I've got it on my
own box, and I've saved the entire state of my system at the time of the
crash (StateSaveLocation, database, core dump, slurmctld binary). We'll
keep you updated.

*** Ticket 5487 has been marked as a duplicate of this ticket. ***

Created attachment 7433 [details]
Prevent job_resrcs from being overwritten for multi-partition job submissions
Can you apply this patch and see if it fixes the issue? We've been able to reliably reproduce this segfault, and this patch fixes it for us. It hasn't been committed yet, but we think it will be soon.
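The real fix is the attached patch. Purely as an illustration of the
failure mode the attachment title suggests — the structures and control
flow below are assumptions for the sketch, not Slurm's actual scheduling
code — the hazard with multi-partition submissions is that evaluating a job
against several partitions in turn can let a later, failed attempt
overwrite the job_resrcs pointer that an earlier attempt populated:

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative stand-ins, not Slurm's real definitions. */
    typedef struct { int ncpus; } job_resources_t;
    typedef struct { job_resources_t *job_resrcs; } job_record_t;

    /* Pretend per-partition selection: succeeds only for partition 1,
     * returning a fresh allocation; fails (NULL) elsewhere. */
    static job_resources_t *select_in_partition(int part)
    {
        if (part != 1)
            return NULL;
        job_resources_t *jr = malloc(sizeof(*jr));
        jr->ncpus = 4;
        return jr;
    }

    /* Buggy pattern: every partition attempt assigns job_resrcs, so a
     * later failed attempt clobbers (and leaks) an earlier success. */
    static void schedule_buggy(job_record_t *job)
    {
        for (int p = 0; p < 3; p++)
            job->job_resrcs = select_in_partition(p);
    }

    /* Fixed pattern: keep the first successful allocation and stop. */
    static void schedule_fixed(job_record_t *job)
    {
        for (int p = 0; p < 3; p++) {
            job_resources_t *jr = select_in_partition(p);
            if (jr != NULL) {
                job->job_resrcs = jr;
                break;
            }
        }
    }

    int main(void)
    {
        job_record_t a = { NULL }, b = { NULL };
        schedule_buggy(&a);   /* ends up NULL: later attempt overwrote it */
        schedule_fixed(&b);   /* keeps the successful allocation */
        printf("buggy: %s, fixed: %s\n",
               a.job_resrcs ? "has resources" : "NULL",
               b.job_resrcs ? "has resources" : "NULL");
        free(b.job_resrcs);
        return 0;
    }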
(In reply to Marshall Garey from comment #19)
> Created attachment 7433 [details]
> Prevent job_resrcs from being overwritten for multi-partition job
> submissions
>
> Can you apply this patch and see if it fixes the issue? We've been able to
> reliably reproduce this segfault, and this patch fixes it for us. It
> hasn't been committed yet, but we think it will be soon.

Thanks! I'll apply the patch and remove the earlier "bypass" patch to make
sure we can verify it fixes the issue.

Thanks!
--
Kilian

Have you seen this segfault again?

Hi Marshall,

(In reply to Marshall Garey from comment #21)
> Have you seen this segfault again?

Not since we applied the patch on 17.11.8, no.

Cheers,
--
Kilian

Well, that's almost 4 weeks. I'm going to close this as resolved/fixed for
now. But please reopen it if you see it again.

This patch was committed to 17.11.9:
https://github.com/SchedMD/slurm/commit/fef07a40972

I marked this as a duplicate of bug 5452, since Dominik committed this
patch and that's the bug he was working on.

*** This ticket has been marked as a duplicate of ticket 5452 ***