Ticket 17205

Summary: slurmctld segfault while trying to set an external dbd for a second cluster
Product: Slurm Reporter: Richard Johnson <rjohnson>
Component: Configuration Assignee: Nate Rini <nate>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: mcmullan, nate
Version: 23.02.3   
Hardware: Linux   
OS: Linux   
Site: HudsonAlpha Biotechnology Slinky Site: ---
Attachments: slurm.conf
back trace from slurmctld

Description Richard Johnson 2023-07-13 14:03:24 MDT
I'm trying to set up a scenario where I have 2 clusters, each with its own slurmctld/dbd, in separate physical locations connected by a WAN.

I have both clusters running on their own, but when I add AccountingStorageExternalHost for the remote cluster to the local slurm.conf, slurmctld segmentation faults on startup.

What am I doing wrong?  I will attach a slurm.conf for the local cluster.
Comment 1 Richard Johnson 2023-07-13 14:05:22 MDT
Created attachment 31229 [details]
slurm.conf
Comment 2 Jason Booth 2023-07-13 16:25:31 MDT
Would you please gather a backtrace from the segfault and attach that here?

gdb slurmctld

inside gdb:
> set print pretty
> r -D
> thread apply all bt full

Or simply

> bt full
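
The same steps can also be run non-interactively. This is only a sketch: it assumes gdb and slurmctld are both on PATH, and the function name and output file name are made up for illustration. If slurmctld does not crash, it keeps running in the foreground until it is stopped.

```shell
# Sketch: capture a full backtrace from a crashing slurmctld in one command.
# capture_bt and the default output file name are hypothetical.
capture_bt() {
    out_file=${1:-slurmctld-bt.txt}
    # -batch exits gdb once the -ex commands have run;
    # --args passes -D (run in foreground) through to slurmctld.
    gdb -batch \
        -ex "set print pretty on" \
        -ex "run" \
        -ex "thread apply all bt full" \
        --args slurmctld -D > "$out_file" 2>&1
}
```

Calling `capture_bt` writes the trace to slurmctld-bt.txt in the current directory, which can then be attached here.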
Comment 4 Richard Johnson 2023-07-13 16:57:14 MDT
Created attachment 31232 [details]
back trace from slurmctld
Comment 5 Nate Rini 2023-07-13 17:31:45 MDT
This should have already been resolved for the upcoming Slurm-23.02.4 patch release:
> https://github.com/SchedMD/slurm/commit/833ca8dd2121a2c980736c05821608324c7ae97a

You can cherry-pick the commit and re-compile if waiting for the next release is not fast enough. Please reply if more detailed instructions are needed.
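
A rough sketch of the cherry-pick route follows. The commit hash is the one linked above; the release tag name, checkout directory, branch name, and install prefix are assumptions and should be adjusted to match how Slurm was originally built at your site.

```shell
# Sketch only: BASE_TAG, SRC_DIR, PREFIX and the branch name are assumptions.
COMMIT=833ca8dd2121a2c980736c05821608324c7ae97a
BASE_TAG=${BASE_TAG:-slurm-23-02-3-1}   # assumed tag name for the 23.02.3 release
SRC_DIR=${SRC_DIR:-$HOME/slurm-src}     # hypothetical checkout location
PREFIX=${PREFIX:-/usr/local}            # hypothetical install prefix

# Print each command; execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

patch_slurm() {
    run git clone https://github.com/SchedMD/slurm.git "$SRC_DIR" &&
    run cd "$SRC_DIR" &&
    run git checkout -b local-fix "$BASE_TAG" &&
    run git cherry-pick "$COMMIT" &&
    run ./configure --prefix="$PREFIX" &&
    run make &&
    run make install   # may need root, depending on PREFIX
}
```

Setting `DRY_RUN=1` before calling `patch_slurm` prints the commands without executing them, which is a cheap way to check the paths first. slurmctld needs to be restarted after the install for the fix to take effect.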
Comment 7 Richard Johnson 2023-07-14 08:39:12 MDT
Thanks Nate.  Do you have an estimate on when 23.02.4 will be released?

Thanks,
Rich

Comment 8 Nate Rini 2023-07-14 09:24:11 MDT
(In reply to Richard Johnson from comment #7)
> Thanks Nate.  Do you have an estimate on when 23.02.4 will be released?

I don't currently have an ETA, but we usually publish a patch release about every 2 months for the latest major release.
Comment 9 Nate Rini 2023-07-14 09:24:28 MDT
Are there any more questions?
Comment 10 Richard Johnson 2023-07-14 09:26:44 MDT
No.  Thank you.  I was able to apply that commit and it did fix my issue.

Thanks,
Rich

Comment 11 Nate Rini 2023-07-14 09:28:14 MDT
(In reply to Richard Johnson from comment #10)
> No.  Thank you.  I was able to apply that commit and it did fix my issue.

Understood. Closing out ticket as a duplicate. Please respond if any new related questions should arise.

*** This ticket has been marked as a duplicate of ticket 16669 ***