Description
Kilian Cavalotti
2018-07-16 19:38:33 MDT
Can you grab 'thread apply all bt full'? Is it possible to get 17.11.7
installed on the controller quickly?

Created attachment 7320 [details]
gdb output
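For reference, the backtrace requested above is captured inside gdb against
the crashed slurmctld. A minimal session might look like the following; the
binary and core file paths are assumptions and will differ per site:

    # Load the core dump into gdb (paths are examples; adjust to your site).
    $ gdb /usr/sbin/slurmctld /var/spool/slurmctld/core.1234

    # Inside gdb, log the full per-thread backtrace to a file:
    (gdb) set logging file bt-full.txt
    (gdb) set logging on
    (gdb) thread apply all bt full
    (gdb) set logging off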
(In reply to Tim Wickberg from comment #1)
> Can you grab 'thread apply all bt full'?

Sure, it's attached.

> Is it possible to get 17.11.7 installed on the controller quickly?

I'm gonna try this temporarily, yes (we can't run 17.11.7 because of #5240)

Thanks,
--
Kilian

(In reply to Kilian Cavalotti from comment #3)
> (In reply to Tim Wickberg from comment #1)
> > Can you grab 'thread apply all bt full'?
>
> Sure, it's attached.
>
> > Is it possible to get 17.11.7 installed on the controller quickly?
>
> I'm gonna try this temporarily, yes (we can't run 17.11.7 because of #5240)

Could you run slurm-17.11 head? That has that fix in it, and will be close
to the 17.11.8 release due out this week.

(In reply to Tim Wickberg from comment #4)
> Could you run slurm-17.11 head? That has that fix in it, and will be close
> to the 17.11.8 release due out this week.

Good idea. It's compiling right now.

Created attachment 7321 [details]
bypass crash for missing job_resrsc struct
This should bypass that specific issue. There's something odd here in your state - it looks like the job itself has no resources at this point, and this is a stray extra epilog message coming in very late.
I can't guarantee this won't shift the crash elsewhere, but that will hopefully get you up and running.
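The patch itself is in the attachment above. Purely as a rough,
self-contained sketch of the kind of NULL guard such a bypass typically
adds — every struct, field, and function name below is an illustrative
stand-in, not Slurm's actual internals or the attached patch:

    #include <stdio.h>

    /* Illustrative stand-ins for Slurm's internal structures; these are
     * assumptions for the sketch, not Slurm's real definitions. */
    typedef struct {
        int ncpus;                    /* placeholder for allocated resources */
    } job_resources_t;

    typedef struct {
        unsigned int job_id;
        job_resources_t *job_resrcs;  /* may be NULL for a stale job */
    } job_record_t;

    /* Handle a (possibly stray, very late) epilog-complete message.
     * Without the NULL guard, a stale job whose resources are already
     * gone would segfault on the dereference below. */
    static void handle_epilog_complete(job_record_t *job_ptr)
    {
        if (job_ptr->job_resrcs == NULL) {
            fprintf(stderr, "job %u: no job_resrcs, ignoring stray epilog\n",
                    job_ptr->job_id);
            return;                   /* bypass instead of crashing */
        }
        /* ...normal resource deallocation would dereference
         * job_ptr->job_resrcs here... */
        printf("job %u: released %d CPUs\n", job_ptr->job_id,
               job_ptr->job_resrcs->ncpus);
    }

    int main(void)
    {
        job_record_t stale = { .job_id = 12345, .job_resrcs = NULL };
        handle_epilog_complete(&stale);   /* warns instead of segfaulting */
        return 0;
    }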
(In reply to Tim Wickberg from comment #6)
> Created attachment 7321 [details]
> bypass crash for missing job_resrsc struct
>
> This should bypass that specific issue. There's something odd here in your
> state - it looks like the job itself has no resources at this point, and
> this is a stray extra epilog message coming in very late.
>
> I can't guarantee this won't shift the crash elsewhere, but that will
> hopefully get you up and running.

Appreciate the patch, thank you. I'll give it a try if 17.11.7-master still
segfaults.

This is apparently a job that is being recovered when the controller starts,
but which is already done. There's no trace of any associated process on the
compute node, and the epilog is done too.

We've been suffering from a lot of jobs accumulating in CG state lately,
much like what's been reported in these bug reports:
• https://bugs.schedmd.com/show_bug.cgi?id=5401
• https://bugs.schedmd.com/show_bug.cgi?id=5121
• https://bugs.schedmd.com/show_bug.cgi?id=5111
• https://bugs.schedmd.com/show_bug.cgi?id=5177

Our current workaround is to restart the controller, which has about a 50%
success rate. Is it possible that a job got corrupted during a controller
restart?

(In reply to Kilian Cavalotti from comment #7)
> > This should bypass that specific issue. There's something odd here in
> > your state - it looks like the job itself has no resources at this
> > point, and this is a stray extra epilog message coming in very late.
> >
> > I can't guarantee this won't shift the crash elsewhere, but that will
> > hopefully get you up and running.
>
> Appreciate the patch, thank you. I'll give it a try if 17.11.7-master
> still segfaults.

17.11.7-master still segfaults on the same job. I'll try the patch next.

Patch has been applied, seems to be holding up so far!

Since the patch seems to have mitigated the issue for now, I'm moving this
to a sev-4. This is very similar to bug 5276, so I'm taking over this bug
as well.

(In reply to Marshall Garey from comment #12)
> Since the patch seems to have mitigated the issue for now, I'm moving this
> to a sev-4. This is very similar to bug 5276, so I'm taking over this bug
> as well.

Good, thanks!

We're back in business, thanks!

We would still appreciate:
1. if that fix could be incorporated in 17.11.8 and up, in case this
happens again,
2. a chance to understand how that job's structure could be missing its
resources part. I don't think there's anything funky about the way the job
has been submitted (it was a 2-item array job, if that's useful).

(In reply to Kilian Cavalotti from comment #14)
> We're back in business, thanks!
>
> We would still appreciate:
> 1. if that fix could be incorporated in 17.11.8 and up, in case this
> happens again,

I've added Tim to CC to see what he thinks. We're wanting to get 17.11.8
out this week.

> 2. a chance to understand how that job's structure could be missing its
> resources part. I don't think there's anything funky about the way the
> job has been submitted (it was a 2-item array job, if that's useful)

We're still trying to figure out how this happens. It really shouldn't be
missing the job_resrcs.

Good news: I accidentally reproduced this.
What happened:
- A batch array job was running (I don't know if it matters that it's a
batch job, or a job array)
- Some other srun jobs were running, too
- My computer lost power (due to a brief power outage at work)
- I rebooted
- I started slurmdbd, then the slurmd's
- Then I started the slurmctld
- Then slurmctld crashed because of a batch job, in the same place as yours

So my job PIDs obviously didn't exist anymore, because the computer had
shut down. The batch script had also disappeared. But the job still existed
in the slurmctld.

Anyway, I can investigate this a little better now, since I've got it on my
own box, and I've saved the entire state of my system at the time of the
crash (StateSaveLocation, database, core dump, slurmctld binary). We'll
keep you updated.

*** Ticket 5487 has been marked as a duplicate of this ticket. ***

Created attachment 7433 [details]
Prevent job_resrcs from being overwritten for multi-partition job submissions
Can you apply this patch and see if it fixes the issue? We've been able to reliably reproduce this segfault, and this patch fixes it for us. It hasn't been committed yet, but we think it will be soon.
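The real fix is the attached patch. Purely as an illustration of the
failure mode the attachment title suggests — the structures and control
flow below are assumptions for the sketch, not Slurm's actual scheduling
code — the hazard with multi-partition submissions is that evaluating a job
against several partitions in turn can let a later, failed attempt
overwrite the job_resrcs pointer that an earlier attempt populated:

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative stand-ins, not Slurm's real definitions. */
    typedef struct { int ncpus; } job_resources_t;
    typedef struct { job_resources_t *job_resrcs; } job_record_t;

    /* Pretend per-partition selection: succeeds only for partition 1,
     * returning a fresh allocation; fails (NULL) elsewhere. */
    static job_resources_t *select_in_partition(int part)
    {
        if (part != 1)
            return NULL;
        job_resources_t *jr = malloc(sizeof(*jr));
        jr->ncpus = 4;
        return jr;
    }

    /* Buggy pattern: every partition attempt assigns job_resrcs, so a
     * later failed attempt clobbers (and leaks) an earlier success. */
    static void schedule_buggy(job_record_t *job)
    {
        for (int p = 0; p < 3; p++)
            job->job_resrcs = select_in_partition(p);
    }

    /* Fixed pattern: keep the first successful allocation and stop. */
    static void schedule_fixed(job_record_t *job)
    {
        for (int p = 0; p < 3; p++) {
            job_resources_t *jr = select_in_partition(p);
            if (jr != NULL) {
                job->job_resrcs = jr;
                break;
            }
        }
    }

    int main(void)
    {
        job_record_t a = { NULL }, b = { NULL };
        schedule_buggy(&a);   /* ends up NULL: later attempt overwrote it */
        schedule_fixed(&b);   /* keeps the successful allocation */
        printf("buggy: %s, fixed: %s\n",
               a.job_resrcs ? "has resources" : "NULL",
               b.job_resrcs ? "has resources" : "NULL");
        free(b.job_resrcs);
        return 0;
    }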
(In reply to Marshall Garey from comment #19)
> Created attachment 7433 [details]
> Prevent job_resrcs from being overwritten for multi-partition job
> submissions
>
> Can you apply this patch and see if it fixes the issue? We've been able to
> reliably reproduce this segfault, and this patch fixes it for us. It
> hasn't been committed yet, but we think it will be soon.

Thanks! I'll apply the patch and remove the earlier "bypass" patch to make
sure we can verify it fixes the issue.

Thanks!
--
Kilian

Have you seen this segfault again?

Hi Marshall,

(In reply to Marshall Garey from comment #21)
> Have you seen this segfault again?

Not since we applied the patch on 17.11.8, no.

Cheers,
--
Kilian

Well, that's almost 4 weeks. I'm going to close this as resolved/fixed for
now. But please reopen it if you see it again.

This patch was committed to 17.11.9:
https://github.com/SchedMD/slurm/commit/fef07a40972

I marked this as a duplicate of bug 5452, since Dominik committed this
patch and that's the bug he was working on.

*** This ticket has been marked as a duplicate of ticket 5452 ***