Ticket 10560 - slurmctld continues to seg fault after starting
Summary: slurmctld continues to seg fault after starting
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.02.3
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-01-05 11:50 MST by Wei Feinstein
Modified: 2021-01-22 19:57 MST

See Also:
Site: UC Berkeley

Attachments
slurm log files (52.31 MB, application/x-tar)
2021-01-05 11:50 MST, Wei Feinstein
slurm config files (40.00 KB, application/x-tar)
2021-01-05 11:51 MST, Wei Feinstein
last core file (5.16 MB, application/x-bzip2)
2021-01-05 11:54 MST, Wei Feinstein

Description Wei Feinstein 2021-01-05 11:50:19 MST
Created attachment 17344
slurm log files

I upgraded Slurm from 18.08.7 to 19.05.3 and then to 20.02.3. Now, whenever I start slurmctld, it segfaults and crashes.

It is receiving signal 11 (SIGSEGV).
Comment 1 Wei Feinstein 2021-01-05 11:51:07 MST
Created attachment 17345
slurm config files
Comment 2 Wei Feinstein 2021-01-05 11:54:03 MST
Created attachment 17346
last core file
Comment 3 Jason Booth 2021-01-05 11:55:47 MST
We cannot read these core files locally, so you will need to generate the backtrace for us.

> gdb $(which slurmctld) $PATH_TO_CORE
> (gdb) t a a bt full
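
A non-interactive variant of the two commands above, as a minimal sketch. gdb's `-batch` and `-ex` options run `thread apply all bt full` (the long form of `t a a bt full`) and then exit, so the output can be redirected to a file and attached. The function name `dump_backtrace` and its three arguments are hypothetical and do not come from this ticket.

```shell
# Sketch only: dump_backtrace is a hypothetical helper name.
# Writes a full backtrace of every thread in a core file to a text file.
dump_backtrace() {
    binary=$1   # the executable that crashed, e.g. "$(which slurmctld)"
    core=$2     # path to the core file
    out=$3      # where to write the backtrace

    # Fail early with a clear message instead of running gdb on nothing.
    if [ ! -r "$core" ]; then
        echo "core file not readable: $core" >&2
        return 1
    fi

    # -batch: run the -ex commands, then exit without an interactive prompt.
    gdb -batch -ex "thread apply all bt full" "$binary" "$core" > "$out" 2>&1
}
```

Usage would look like `dump_backtrace "$(which slurmctld)" "$PATH_TO_CORE" backtrace.txt`, run on the host where slurmctld crashed so that the binary and libraries match the core.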
Comment 4 Marshall Garey 2021-01-05 12:04:05 MST
There are several fixes for possible slurmctld segfaults in Slurm 20.02 past 20.02.3. The latest version is 20.02.6. Can you upgrade slurmctld to 20.02.6?
Comment 5 Wei Feinstein 2021-01-05 12:27:31 MST
Here is a google doc with the output from the gdb commands.

https://docs.google.com/document/d/1s-84ysRbgrgNARr3fV_05BmvXlc9yOdNt0Q3VdWRTJU/edit?usp=sharing

I will give both of you, Jason and Marshall, access.

Thanks

Jackie

On Tue, Jan 5, 2021 at 10:55 AM <bugs@schedmd.com> wrote:

> Jason Booth <jbooth@schedmd.com> changed bug 10560
> <https://bugs.schedmd.com/show_bug.cgi?id=10560>
> What Removed Added
> Assignee support@schedmd.com hinton@schedmd.com
Comment 7 Wei Feinstein 2021-01-05 13:39:54 MST
I have upgraded to 20.02.6 and I am experiencing the same issue.

Have you discovered anything in the core file?

We have to bring the system back online today before 5:00 p.m.

Thanks

Jackie

Comment 8 Wei Feinstein 2021-01-05 14:01:49 MST
What changed between 19 and 20 regarding the spank plugins?

Thanks

Jackie

Comment 9 Marshall Garey 2021-01-05 14:13:31 MST
It looks like it might be segfaulting in the JobComp plugin. You have this set:

JobCompType=jobcomp/slurm_banking
JobCompLoc=/var/log/slurm/jobcomp.log


This isn't one of our plugins, so is it a plugin written by your site? Can you try disabling this plugin (comment out JobCompType and JobCompLoc in slurm.conf) and restarting slurmctld?
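
One way to make the change suggested above, as a hedged sketch. The helper name `disable_jobcomp` is hypothetical, and it assumes the two parameters are spelled exactly as shown and begin a line (optionally indented). It keeps a `.bak` copy so the settings can be restored once the plugin is fixed.

```shell
# Sketch only: disable_jobcomp is a hypothetical helper name.
# Comments out JobCompType and JobCompLoc in the given slurm.conf.
disable_jobcomp() {
    conf=$1   # path to slurm.conf (location varies by site)

    # Keep a backup with the original permissions and timestamps.
    cp -p "$conf" "$conf.bak" || return 1

    # Prefix the two matching lines with '#'; all other lines are unchanged.
    sed -e 's/^[[:space:]]*\(JobCompType=\)/#\1/' \
        -e 's/^[[:space:]]*\(JobCompLoc=\)/#\1/' \
        "$conf.bak" > "$conf"
}
```

After the edit, slurmctld still has to be restarted in whatever way the site normally does it for the change to take effect.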



Regarding the SPANK plugins: they don't run in slurmctld, so they have nothing to do with this bug. However, whenever you upgrade, you have to recompile all your SPANK plugins for them to work.
Comment 12 Wei Feinstein 2021-01-05 14:20:13 MST
I have done so and the jobs are being accepted now. We are trying to
figure out what is broken within the plugins. We have definitely
recompiled against the version of Slurm we just built, 20.02.6. Are there
any changes from 18 to 20 that could significantly break our plugins? Just
checking.

Jackie

Comment 15 Marshall Garey 2021-01-05 14:35:49 MST
There have been many changes from 18.08 to 20.02 that could potentially break a plugin. The protocol and the APIs always change between major versions. I recommend starting by checking the backtrace (it's thread 1 in the backtrace you left me) and doing some debugging with gdb to see what's broken.

My colleague Michael told me that you might be using the `slurm-rs` Rust crate, which is built and tested against Slurm 17.11. It's possible that 18.08 was fine, but 20.02 is not. From https://crates.io/crates/slurm:

"At the moment, this crate is being developed against Slurm 17.11. The Slurm C API is not especially stable, so it is possible that this crate will fail to compile against other versions of Slurm, or even exhibit wrong runtime behavior."

Michael tried looking into using Rust with Slurm at one point but only found libraries that are compatible with older versions of Slurm (17.11, 18.08). So if you are using a Rust library, it's quite possible that it isn't compatible with Slurm 20.02.

Since we found the problem and you're up and running, can I close this ticket?



p.s.
18/19/20 aren't Slurm versions. The major version number is Year.Month, and the micro (bug fix) version number is the number after the second period. This is like the way Ubuntu does it. For example, 20.02 and 20.11 are two completely different major Slurm versions. 20.02.3 and 20.02.6 are different micro (bug fix) versions but are both the same major Slurm version (20.02).
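
The version scheme described above can be illustrated with a small sketch. The helper names `slurm_major`, `slurm_micro`, and `same_major` are hypothetical, and the code assumes a three-field version string such as 20.02.6.

```shell
# Sketch only: these helper names are hypothetical.
# A Slurm version looks like Year.Month.Micro, e.g. 20.02.6.

# Major release: everything before the last period (20.02.6 -> 20.02).
slurm_major() { printf '%s\n' "${1%.*}"; }

# Micro (bug fix) release: everything after the last period (20.02.6 -> 6).
slurm_micro() { printf '%s\n' "${1##*.}"; }

# Succeeds when both versions belong to the same major release.
same_major() { [ "$(slurm_major "$1")" = "$(slurm_major "$2")" ]; }

slurm_major 20.02.6   # prints 20.02
slurm_micro 20.02.6   # prints 6
```

With these helpers, `same_major 20.02.3 20.02.6` succeeds because both are bug fix releases of 20.02, while `same_major 20.02.6 20.11.0` fails because 20.02 and 20.11 are different major versions.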
Comment 19 Wei Feinstein 2021-01-05 14:45:51 MST
Please leave the case open in case I run into other issues since this is an
upgrade on a production system.

Thanks

Jackie

Comment 20 Marshall Garey 2021-01-05 14:49:15 MST
I can certainly leave this open, but I'll mark it as a sev-4 for now since the cluster is up and running. If you run into issues unrelated to the slurmctld segfault, I'll probably ask you to open a new ticket.
Comment 21 Marshall Garey 2021-01-22 17:44:46 MST
I'm marking this as resolved/infogiven. Let us know if you run into more issues.
Comment 22 Wei Feinstein 2021-01-22 19:57:22 MST
Yes please go ahead, thanks!

Jackie

On Fri, Jan 22, 2021 at 4:44 PM <bugs@schedmd.com> wrote:

> Marshall Garey <marshall@schedmd.com> changed bug 10560
> <https://bugs.schedmd.com/show_bug.cgi?id=10560>
> What Removed Added
> Resolution --- INFOGIVEN
> Status OPEN RESOLVED