Ticket 11274

Summary: slurmctld exits when using "systemctl reload slurmctld" and a node change has been made in slurm.conf
Product: Slurm Reporter: Nick Ihli <nick>
Component: slurmctld Assignee: Tim McMullan <mcmullan>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 5 - Enhancement    
Priority: Lowest    
Version: 20.11.4   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=10597
Site: SchedMD Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Nick Ihli 2021-04-01 14:34:49 MDT
I am working with Jump Trading, and they reported that running "systemctl reload slurmctld" after adding, removing, or modifying a node in slurm.conf causes slurmctld to exit: a SIGHUP is sent and slurmctld terminates. I have reproduced this as well. If the reload is done when no node changes have been made, slurmctld does not exit.

They are now aware that they need to do a restart, but having slurmctld terminate itself after a SIGHUP was not expected behavior.
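
Until a reload handles node changes, a site could check for them before choosing between a reload and a restart. The following is a minimal sketch, not how slurmctld itself decides: it assumes that any textual change to a NodeName= line counts as a node change, it does not expand hostlist ranges or follow backslash line continuations, and the helper names and sample configuration text are made up for illustration.

```python
import re


def node_lines(conf_text):
    """Return the set of whitespace-normalized NodeName= lines in slurm.conf text."""
    lines = set()
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if re.match(r"(?i)nodename\s*=", line):  # match the key in any letter case
            lines.add(" ".join(line.split()))
    return lines


def reload_or_restart(old_conf, new_conf):
    """Suggest 'reload' only when the NodeName= definitions are unchanged."""
    return "reload" if node_lines(old_conf) == node_lines(new_conf) else "restart"


# Hypothetical sample configuration, not taken from the reporting site.
old = "NodeName=fpip1-compute0002 CPUs=35\nPartitionName=debug Nodes=ALL\n"
new = old + "NodeName=fpip1-compute0004 CPUs=35\n"

print(reload_or_restart(old, old))  # reload
print(reload_or_restart(old, new))  # restart
```

The comparison is deliberately conservative: a cosmetic edit to a node line also yields "restart", which is safer than sending a SIGHUP that ends in a fatal error.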


[2021-04-01T12:36:20.299] Reconfigure signal (SIGHUP) received
[2021-04-01T12:36:20.300] debug:  Reading slurm.conf file: /opt/slurm/fpip1_testing_default/etc/slurm.conf
[2021-04-01T12:36:20.300] debug:  NodeNames=fpip1-compute0002 setting Sockets=35 based on CPUs(35)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2021-04-01T12:36:20.300] debug:  NodeNames=fpip1-compute0004 setting Sockets=35 based on CPUs(35)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2021-04-01T12:36:20.300] debug:  NodeNames=fpip1-compute0005 setting Sockets=35 based on CPUs(35)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2021-04-01T12:36:20.300] debug:  NodeNames=fpip1-compute0006 setting Sockets=35 based on CPUs(35)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2021-04-01T12:36:20.300] debug:  NodeNames=fpip1-login0001 setting Sockets=35 based on CPUs(35)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2021-04-01T12:36:20.300] debug:  Reading cgroup.conf file /opt/slurm/fpip1_testing_default/etc/cgroup.conf
[2021-04-01T12:36:20.301] error: _compare_hostnames: node count has changed before reconfiguration from 4 to 5. You have to restart slurmctld.
[2021-04-01T12:36:20.301] fatal: read_slurm_conf: hostnames inconsistency detected
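
For monitoring, the error line above can be matched to tell this failure apart from other slurmctld exits. This is a minimal sketch that assumes the message wording shown in this log (version 20.11.4); other versions may word it differently, and the function name is hypothetical.

```python
import re

# Wording copied from the error line in the log above.
PATTERN = re.compile(
    r"node count has changed before reconfiguration from (\d+) to (\d+)"
)


def node_count_change(log_text):
    """Return (old_count, new_count) if the reconfigure error is present, else None."""
    match = PATTERN.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = (
    "[2021-04-01T12:36:20.301] error: _compare_hostnames: node count has changed "
    "before reconfiguration from 4 to 5. You have to restart slurmctld."
)
print(node_count_change(sample))  # (4, 5)
```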

Comment 4 Tim Wickberg 2021-05-03 14:36:39 MDT
Marking as a duplicate of bug 10597.

*** This ticket has been marked as a duplicate of ticket 10597 ***