| Summary: | slurmctld segfault (slurm_persist_msg_pack) | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | slurmctld | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | Priority: | --- |
| CC: | felip.moll | Version: | 17.11.8 |
| Hardware: | Linux | OS: | Linux |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5772, https://bugs.schedmd.com/show_bug.cgi?id=5875 | | |
| Site: | Stanford | | |
| CLE Version: | --- | Version Fixed: | 19.05 |
| Attachments: | gdb output, slurmctld logs | | |
Hi Kilian,

I see that the slurmdb connection is null. I'd need the slurmctld logs for both the controller and the backup; I tried but cannot reproduce it with normal operation/workload.

Thanks

Created attachment 7682 [details]
slurmctld logs

Logs for both controllers around 15:36 on Aug 22, when the segfault happened.
Hi Felip,

(In reply to Felip Moll from comment #1)
> I see that slurmdb connection is null.
>
> I'd need the slurmctld logs for both the controller and the backup, I tried
> but cannot reproduce it with normal operation/workload.

Sure, I attached the controllers' logs. And I appreciate the willingness to find out the root cause of the problem, but my more immediate concern is the segfault itself.

Even if the slurmdb connection is null, the controller should not segfault. Slurm is way past the debugging-code stage now, and when values are unexpectedly null, missing, empty, or 0, it should catch the problem, log an error message, and eventually exit if there's nothing else it can do. But not segfault.

Controller crashes upon segmentation faults are not really things that users should have to deal with, especially coming from a piece of software as mature as Slurm.

Those segfaults happen way too often, and they don't give the best impression of the code quality. It really feels like checks are just missing in many places. There are 142 bugs here with the word "segfault" in their description or comments, so it's clearly not an isolated case. I'm not saying catching all those cases and fixing them is an easy task, but I really think there's some room for improvement.

I know some other open-source projects of major scale that went through code-cleaning and hardening phases. They are not fun, and they don't feel gratifying, but in the end they're tremendously useful in improving the overall robustness, stability, and perceived quality of the software. And I can only hope that Slurm will get there too.

Sorry for the rant, and thanks for looking into this crash!

Cheers,
--
Kilian

(In reply to Kilian Cavalotti from comment #4)
> Even if the slurmdb connection is null, the controller should not segfault.
> [...]

Hi Kilian,

First of all, the technical stuff: I see your primary failing and not responding a couple of times. The first time, the backup takes control and performs the tasks correctly. Then the primary recovers and takes control again.
Later, at a certain moment, the primary is frozen:

Aug 22 15:35:15 sh-sl01 slurmctld[1287]: prolog_running_decr: Configuration for JobID=23949980 is complete
Aug 22 15:36:58 sh-sl01 slurmctld[1287]: prolog_running_decr: Configuration for JobID=23882066 is complete

During this time, the backup sees this, but does not take control:

Aug 22 15:36:20 sh-sl02 slurmctld[23483]: error: Invalid RPC received 4001 while in standby mode
Aug 22 15:36:20 sh-sl02 slurmctld[23483]: error: Last message to the controller was at 1534977309, but the last heartbeat was written at 1534977374, trusting the filesystem instead of the network and not asserting control at this time.
Aug 22 15:36:21 sh-sl02 slurmctld[23483]: error: Invalid RPC received 2049 while in standby mode

Nevertheless, there seems to be an operation still pending from the time when it had control that is being executed, or it performs the operation even without having taken control. I guess this causes the segfault, since it tries to connect to the database while it is still in backup mode:

Aug 22 15:36:24 sh-sl02 slurmctld[23483]: error: Invalid signal (17) received
Aug 22 15:36:24 sh-sl02 slurmctld[23483]: prolog_running_decr: Configuration for JobID=23946670 is complete
Aug 22 15:36:26 sh-sl02 systemd: slurmctld.service: main process exited, code=killed, status=11/SEGV

I am trying to figure out what could've happened here. I see you have a slurmctld prolog; does it take a long time to execute? Can it be stuck on I/O at some point? I'll continue trying to find out what's going on here.

Regarding segfaults and so on: I fully understand your concerns and I share your opinion. A segfault is not an acceptable issue in a program and this should not happen at all. The idea is to carefully control all the parts of the code to avoid any segfault, and in parts that can be vulnerable for some reason, to add asserts and extra checks.
In certain cases it should be impossible to get a segfault (like in this one), so we do not check for it. Checking every place in the code for a possible null pointer would mean adding *a lot* of possibly unnecessary checks. We always try to minimize the possibilities, but Slurm is a very complex piece of software and there are very rare edge cases that happen only on certain machines and configurations, out of our control, mainly because testing all the configuration combinations on very different OSes and hardware is impossible.

We are currently working hard to grow our team in order to give Slurm even better quality. I will pass your thoughts on to the whole team, since I think it is a valuable opinion. I am also aware that you have been hit with many ugly bugs lately, so I apologize for this, and I want to let you know we're conscious of your situation, so we perfectly understand your observations.

I'll update you on any news I have in this respect.

Thanks for your patience,
Felip M

Hi Felip,

First, thanks a lot for everything you replied; this is much appreciated. More details in-line below.

(In reply to Felip Moll from comment #5)
> I see your primary failing and not responding a couple of times.

Yeah, that happened in the middle of our battle to regain control of things in the context of https://bugs.schedmd.com/show_bug.cgi?id=5443. Many slurmctld restarts happened.

> I am trying to figure out what could've happened here,

Thanks for the step-by-step analysis, it's very useful!

> I see you have a slurmctld prolog, does it take a long time to execute?
> Can it be stuck by I/O at some point?

It shouldn't take a long time to execute, but there's a possibility for it to be stuck on I/O, yes. What it does is:

1. create a directory on an NFS mountpoint if it doesn't exist already
2. dump the environment, the job script (from the Slurm spool dir) and the output of "scontrol show job" into that directory.
We haven't noticed any particular issue with the NFS mount at the time of the crash, but those could be particularly difficult to spot.

> I'll continue trying to find out what's going on here.

Much appreciated, thanks.

> Regarding to segfaults and so on.. I fully understand your concerns and I
> share your opinion. A segfault is not an acceptable issue in a program and
> this should not happen at all. [...]

I completely understand, and again, I realize that it's not an easy task, I know. And I'm sure you guys are doing everything possible to avoid those situations as much as possible.

> We are currently working hard to grow our team in order to give even better
> quality to Slurm. I will transmit your thoughts to all the team since I
> think it is a valuable opinion.

That's great to know, thank you.

> I am also aware that you have been hit with many ugly bugs lately, [...]
>
> I'll update you on any news I have in respect.
>
> Thanks for your patience,

Much much appreciated, thanks a lot.

Cheers,
--
Kilian

Kilian,
I managed to reproduce the issue.
I set up a backup controller which at a certain point takes control. There are jobs starting on the backup, and the slurmctld prolog is stuck (I intentionally do a sleep 200 inside).
While the job is in CF state, I recover the primary.
I then wait 200 seconds, or kill one of the sleep processes, and then I get the segfault:
├─sshd(1165)─┬─sshd(6111)───sshd(6113)───bash(6114)───slurmctld(6131)─┬─slurm.ctld_prol(6807)───sleep(6810)
│ │ ├─slurm.ctld_prol(6850)───sleep(6854)
│ │ ├─slurm.ctld_prol(6851)───sleep(6857)
│ │ ├─slurm.ctld_prol(6870)───sleep(6873)
│ │ ├─slurm.ctld_prol(6903)───sleep(6909)
│ │ ├─slurm.ctld_prol(6904)───sleep(6907)
│ │ ├─{slurmctld}(6132)
│ │ ├─{slurmctld}(6806)
│ │ ├─{slurmctld}(6848)
│ │ ├─{slurmctld}(6849)
│ │ ├─{slurmctld}(6869)
│ │ ├─{slurmctld}(6901)
│ │ ├─{slurmctld}(6902)
│ │ ├─{slurmctld}(6922)
│ │ └─{slurmctld}(6923)
│ └─sshd(6924)───sshd(6926)───bash(6927)───pstree(6955)
[slurm@moll1 ~]$ kill -9 6907
slurmctld: error: Invalid RPC received 6018 while in standby mode
slurmctld: error: Invalid RPC received 6018 while in standby mode
slurmctld: trigger pulled for SLURMCTLD event 16384 successful
slurmctld: error: prolog_slurmctld job 1112 prolog exit status 137:0
slurmctld: error: slurm_jobcomp plugin context not initialized
slurmctld: error: Invalid signal (17) received
slurmctld: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
slurmctld: error: ../../../slurm/src/common/slurm_persist_conn.c:936: slurm_persist_msg_pack(): Assertion (persist_conn) failed.
or if I let it go:
[2018-08-28T16:00:55.122] trigger pulled for SLURMCTLD event 16384 successful
[2018-08-28T16:03:07.607] prolog_running_decr: Configuration for JobID=1209 is complete
[2018-08-28T16:03:07.607] Extending job 1209 time limit by 200 secs for configuration
[2018-08-28T16:03:07.607] error: Invalid signal (17) received
[2018-08-28T16:03:07.609] error: Invalid signal (17) received
[2018-08-28T16:03:07.609] error: Invalid signal (17) received
[2018-08-28T16:03:07.612] Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
[2018-08-28T16:03:07.612] error: ../../../slurm/src/common/slurm_persist_conn.c:936: slurm_persist_msg_pack(): Assertion (persist_conn) failed.
I'll study the situation and prepare a patch to fix this.
Hi Felip,

Thanks for the update, the details, and the great sleuthing! I'm glad you found a way to reproduce the issue.

Cheers,
--
Kilian

Hi Kilian,

The segfault has been fixed in commit c07784b4b6f6, in 17.11 and up.

We are working to redesign how the slurmctld prolog is run and its related detached threads. Hopefully in 19.05 we will change that to have a more reliable ctld prolog infrastructure.

Regarding your questions about segfaults: as I said, we're currently working internally to improve detection mechanisms and dedicate more time to testing.

Best regards,
Felip M

Hi Felip,

(In reply to Felip Moll from comment #22)
> Segfault has been fixed in commit c07784b4b6f6, in 17.11 and up.

Thanks for fixing it!

> We are working to redesign how slurmctld prolog is run and their related
> detached threads. [...]

That's excellent to hear, thanks for the update.

Cheers,
--
Kilian

Hi Kilian,

We reopened this bug since the fix had to be reverted in 17.11.12 and in 18.08.3 because it caused another important issue; see bug 5875.

I don't know which version you are on now, but we encourage everybody to move to 17.11.12 or to 18.08.3, which will still have this bug's issue but not 5875, which was worse.

A new patch is already being reviewed and will be included in the upcoming 18.08.4.

Sorry for the inconvenience.

(In reply to Felip Moll from comment #25)
> We reopened this bug since the fix had to be reverted in 17.11.12 and in
> 18.08.3 because it caused another important issue, see bug 5875.
> [...]
> A new patch is already being reviewed and will be included into future
> 18.08.4.

Gotcha! Thanks for letting me know.

Cheers,
--
Kilian

Hi Kilian,

This bug has been around for a long time. We've finally committed a definitive solution into the master branch. We added a new feature that will keep track of the problematic prolog scripts which caused your segfault in the first place. Even more, it will allow us to keep track of several threads that manipulate global structures and are in a wait state while we are changing the controller's role, thus avoiding any possible segfault. It's been long, and we had to deal with many race conditions, but now everything seems settled.

It's in commit 78f5a097b944085888f16c1e75eef5d81437f77f and will be available in the upcoming 19.05 release. Since the probability that this happens is very low (though it happened to you), and the severity of a segfault in the backup controller is lower, we are not applying this to previous releases.

Hope you are fine with the solutions. I am closing the bug now. Thanks for reporting, and sorry again for the delay.

Hi Felip,

Including the patch in 19.05 only seems reasonable to me because, as you mentioned, that's not too frequent a case (although we seem to have developed some ability to put ourselves in those low-probability corner cases; bug #6739 comes to mind).

Anyway, that's great! Thanks a lot for your work on this.

Cheers,
--
Kilian
Created attachment 7678 [details]
gdb output

Got a new slurmctld segfault on the backup controller (which was active at the time) when the primary slurmdbd came back online:

kernel: sched[30918]: segfault at 30 ip 00007f0d0a34c8ff sp 00007f0d05d7aa10 error 4 in libslurmfull.so[7f0d0a282000+1ae000]

# gdb /usr/sbin/slurmctld /var/spool/slurm.state/core.23483
(gdb) bt
#0  slurm_persist_msg_pack (persist_conn=0x0, req_msg=req_msg@entry=0x7f0d05d7ab40) at slurm_persist_conn.c:938
#1  0x00007f0d0a3c5ab8 in slurm_send_slurmdbd_msg (rpc_version=rpc_version@entry=8192, req=req@entry=0x7f0d05d7ab40) at slurmdbd_defs.c:348
#2  0x00007f0d06eb1250 in jobacct_storage_p_step_start (db_conn=<optimized out>, step_ptr=0x7f0bfc0012c0) at accounting_storage_slurmdbd.c:2758
#3  0x00007f0d0a33e5de in jobacct_storage_g_step_start (db_conn=0x0, step_ptr=step_ptr@entry=0x7f0bfc0012c0) at slurm_accounting_storage.c:968
#4  0x00000000004ab9b6 in build_extern_step (job_ptr=job_ptr@entry=0x3262360) at step_mgr.c:4702
#5  0x000000000047247d in launch_prolog (job_ptr=job_ptr@entry=0x3262360) at node_scheduler.c:2954
#6  0x0000000000440104 in job_config_fini (job_ptr=job_ptr@entry=0x3262360) at job_mgr.c:8387
#7  0x0000000000460d6f in prolog_running_decr (job_ptr=job_ptr@entry=0x3262360) at job_scheduler.c:4437
#8  0x0000000000464b65 in _run_prolog (arg=0x3262360) at job_scheduler.c:4389
#9  0x00007f0d09c53e25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f0d0997dbad in clone () from /lib64/libc.so.6

gdb "thread apply all bt full" output attached.