We are attempting to deploy Slurm in a high-availability configuration using 3 machines:

- A slurmdbd server, which also hosts an NFS export
- 2 controllers, each mounting the NFS share and using it as the shared state directory

While testing failover, I noticed that the backup controller segfaults both when transferring responsibility back to the primary controller and when the primary controller dies and the backup needs to take control. Running it through Valgrind gives me the following stack trace at the crash:

==93912== Jump to the invalid address stated on the next line
==93912==    at 0x7FBFB28: ???
==93912==    by 0x42DDDE: ctld_assoc_mgr_init (controller.c:2513)
==93912==    by 0x42A1B4: run_backup (backup.c:250)
==93912==    by 0x430552: main (controller.c:660)

The slurm.conf parameters I believe to be relevant are as follows (more details can be supplied if requested):

SlurmctldDebug=debug5
# Management Policies
ClusterName=hpcc
SlurmctldHost=slurmctl1a
SlurmctldHost=slurmctl1b
#SlurmctldPrimaryOnProg=
#SlurmctldPrimaryOffProg=
SlurmctldTimeout=30
SlurmctldPort=6817
SlurmctldParameters=enable_configless
SlurmdPort=6819
AuthType=auth/munge
CredType=cred/munge
SlurmUser=root
MessageTimeout=100
# Location of logs and state info
StateSaveLocation=/mnt/slurm_shared/slurmctld
# Accounting
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageEnforce=limits
AccountingStoragePort=6819
AccountingStorageHost=slurmdb1
AccountingStorageTRES=gres/gpu
JobCompType=jobcomp/elasticsearch
JobCompLoc=kibana2:9200/slurm/_doc

Additionally, I notice the following in the slurmdbd logs: "A new registration for cluster hpcc CONN:8 just came in, but I am already talking to that cluster (CONN:7), closing other connection." This seems to happen when the primary node comes back online and attempts to take over from the backup node.
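For what it's worth, one sanity check we use for this setup is confirming that each controller can actually write to the shared StateSaveLocation before exercising failover. Below is a minimal sketch; check_state_dir is a hypothetical helper of ours, not a Slurm command, and the path is simply the one from our slurm.conf.

```shell
# Hedged sketch: probe whether the shared StateSaveLocation is present
# and writable from this controller. check_state_dir is a hypothetical
# helper, not part of Slurm.
check_state_dir() {
    dir="$1"
    # The directory must exist (i.e. the NFS share is mounted)...
    [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
    # ...and be writable. Probe file is named per-host and per-PID so
    # concurrent checks from both controllers do not collide.
    probe="$dir/.ha_probe.$(hostname).$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "ok: $dir"
    else
        echo "not writable: $dir"
        return 1
    fi
}

# Example (run on each controller):
# check_state_dir /mnt/slurm_shared/slurmctld
```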
Slurm was also compiled on all machines with the following configure options:

./configure --with-mysql_config=/usr/bin \
    --enable-pam --enable-shared \
    --disable-gtktest --enable-multiple-slurmd \
    --with-pam_dir=/usr/include/security \
    --with-munge --with-hwloc --with-libcurl \
    --with-pmix=/opt/linux/rhel/8.x/x86_64/pkgs/pmix/3.1.3rc2/ \
    --prefix=/usr/opt/slurm-23.02.2

Please let me know if you would like any additional configuration details.
If you have a core file, we would be interested in looking at the backtrace:

gdb -ex 't a a bt full' -batch /path/to/slurmctld <core_file>
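A core file is only produced if core dumps are enabled on the controller node; a minimal sketch of checking that environment before reproducing the crash (how slurmctld is launched, and therefore which limits apply to it, is an assumption about your setup):

```shell
# Hedged sketch: verify core dumps can be produced before reproducing
# the segfault. core_pattern shows where the kernel writes core files.
core_pattern() { cat /proc/sys/kernel/core_pattern; }

# Remove the core-size limit for processes started from this shell
# (guarded: may be capped by a lower hard limit on some systems).
ulimit -c unlimited 2>/dev/null || true

core_pattern
```

If core_pattern points at a pipe handler (e.g. systemd-coredump or abrt), retrieve the core through that tool rather than looking for a file in the working directory.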
Created attachment 31590 [details]
Tracefile

Let me know if this is what you expect.
You may be interested in bug #16669.

You may or may not be aware that SchedMD provides commercial support. Normally, issues logged without a support contract go unanswered. In this case, it seems you ran into an issue we fixed in 23.02.3. If you encounter more issues, I would encourage you to reach out to sales@schedmd.com and they can work with you to get you onto support.

*** This ticket has been marked as a duplicate of ticket 16669 ***