Ticket 17350 - Segmentation Fault on Slurmctld failover
Summary: Segmentation Fault on Slurmctld failover
Status: RESOLVED DUPLICATE of ticket 16669
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 23.02.2
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-08-03 15:18 MDT by Emerson
Modified: 2023-08-03 15:51 MDT
1 user



Attachments
Tracefile (4.00 KB, text/plain), 2023-08-03 15:34 MDT, Emerson

Description Emerson 2023-08-03 15:18:01 MDT
We are attempting to deploy Slurm in a high-availability configuration using three machines:
- a slurmdbd server that also hosts an NFS export
- two controllers, each mounting the NFS share and using it as the shared state directory

While testing failover, I noticed that the backup controller segfaults both when handing responsibility back to the primary controller and when the primary dies and the backup needs to take over.
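For context, failover of this kind is typically exercised with scontrol; the following is a sketch rather than the exact procedure used here, with hostnames taken from the slurm.conf further down:

```shell
# Show which controller is currently responding (primary vs. backup)
scontrol ping

# Simulate failure of the primary so the backup (slurmctl1b) takes over...
ssh slurmctl1a systemctl stop slurmctld

# ...then bring the primary back; the backup should relinquish control,
# which is one of the points where the segfault was observed
ssh slurmctl1a systemctl start slurmctld

# Alternatively, ask the backup to take control directly
scontrol takeover
```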

Running slurmctld under Valgrind gives the following stack trace at the crash:

==93912== Jump to the invalid address stated on the next line
==93912==    at 0x7FBFB28: ???
==93912==    by 0x42DDDE: ctld_assoc_mgr_init (controller.c:2513)
==93912==    by 0x42A1B4: run_backup (backup.c:250)
==93912==    by 0x430552: main (controller.c:660)


The slurm.conf parameters I believe to be relevant are as follows (more details can be supplied on request):
SlurmctldDebug=debug5
# Management Policies
ClusterName=hpcc
SlurmctldHost=slurmctl1a
SlurmctldHost=slurmctl1b
#SlurmctldPrimaryOnProg=
#SlurmctldPrimaryOffProg=
SlurmctldTimeout=30
SlurmctldPort=6817
SlurmctldParameters=enable_configless
SlurmdPort=6819
AuthType=auth/munge
CredType=cred/munge
SlurmUser=root
MessageTimeout=100

# Location of logs and state info
StateSaveLocation=/mnt/slurm_shared/slurmctld
# Accounting
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageEnforce=limits

AccountingStoragePort=6819
AccountingStorageHost=slurmdb1
AccountingStorageTRES=gres/gpu
JobCompType=jobcomp/elasticsearch
JobCompLoc=kibana2:9200/slurm/_doc


Additionally, I notice the following in the slurmdbd logs: "A new registration for cluster hpcc CONN:8 just came in, but I am already talking to that cluster (CONN:7), closing other connection." This seems to happen when the primary node comes back online and attempts to take over from the backup.

Slurm was also compiled on all machines with the following options:
./configure --with-mysql_config=/usr/bin \
    --enable-pam --enable-shared \
    --disable-gtktest --enable-multiple-slurmd \
    --with-pam_dir=/usr/include/security \
    --with-munge --with-hwloc --with-libcurl \
    --with-pmix=/opt/linux/rhel/8.x/x86_64/pkgs/pmix/3.1.3rc2/ \
    --prefix=/usr/opt/slurm-23.02.2
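When builds live under a versioned prefix like this, it is worth confirming which binary each daemon is actually running; a quick check, assuming the install prefix from the configure line above:

```shell
# Version compiled into the controller binary at this prefix
/usr/opt/slurm-23.02.2/sbin/slurmctld -V

# Version reported by the running controller
scontrol version
```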

Please let me know if you would like any additional configuration details.
Comment 1 Jason Booth 2023-08-03 15:21:44 MDT
If you have a core file, we would be interested in looking at the backtrace.

> gdb -ex 't a a bt full' -batch /path/to/slurmctld <core_file>
Comment 2 Emerson 2023-08-03 15:34:46 MDT
Created attachment 31590 [details]
Tracefile

Let me know if this is what you expected.
Comment 3 Jason Booth 2023-08-03 15:51:29 MDT
You may be interested in ticket 16669. You may or may not be aware that SchedMD provides commercial support; normally, issues logged without a support contract go unanswered. In this case, it seems you ran into an issue we fixed in 23.02.3.

If you encounter more issues, I would encourage you to reach out to sales@schedmd.com, and they can work with you to get you onto support.

*** This ticket has been marked as a duplicate of ticket 16669 ***