Ticket 3829

Summary: Existing jobs were restarted when slurm.conf was updated.
Product: Slurm
Reporter: Damien <damien.leong>
Component: Configuration
Assignee: Tim Wickberg <tim>
Status: RESOLVED TIMEDOUT
Severity: 3 - Medium Impact
Version: 16.05.4
Hardware: Linux
OS: Linux
Site: Monash University

Description Damien 2017-05-22 20:20:31 MDT
Dear Support,

We're running slurm 16.05.4. We keep our slurm.conf on local disks and push to all nodes (scheduler and compute). Yesterday we updated the slurm.conf (Added a node, added two partitions, and changed the default partition). We then restarted slurmd on the compute nodes and restarted slurmctld on the schedulers. 

Jobs that were running on the compute nodes were restarted. We didn't expect this (we expected to restart slurmd and have it find the existing processes). Is it expected behaviour? Was it necessary for me to restart slurmd on the compute nodes or should I have skipped this step?

Kind regards
Comment 1 Tim Wickberg 2017-05-23 18:30:04 MDT
(In reply to Damien from comment #0)
> Dear Support,
> 
> We're running slurm 16.05.4. We keep our slurm.conf on local disks and push
> to all nodes (scheduler and compute). Yesterday we updated the slurm.conf
> (Added a node, added two partitions, and changed the default partition). We
> then restarted slurmd on the compute nodes and restarted slurmctld on the
> schedulers. 
> 
> Jobs that were running on the compute nodes were restarted. We didn't expect
> this (we expected to restart slurmd and have it find the existing
> processes). Is it expected behaviour? Was it necessary for me to restart
> slurmd on the compute nodes or should I have skipped this step?

The logs would help isolate the exact issue, but I believe the problem is that adding nodes to the config and then restarting slurmd on the compute nodes led to a mismatch between the nodes and slurmctld over which node was running which job. Each node registers itself with a sequence number corresponding to its position in the full sorted set of nodes in the system; if a node was added ahead of it in that sorted list, slurmctld would think the wrong node was running the job, resulting in the job being requeued.
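A toy sketch of the index shift described above (this is an illustration, not Slurm's actual code; the node names and index-based lookup are assumptions for the example):

```python
# Hypothetical illustration: slurmctld tracks jobs by a node's position in
# the sorted node table. Adding a node that sorts ahead of existing ones
# shifts every later index.

old_nodes = ["node01", "node02", "node03"]            # controller's old table
new_nodes = ["node00", "node01", "node02", "node03"]  # node00 added, sorts first

# A job recorded as running at index 1 under the old table:
job_node_index = old_nodes.index("node02")   # -> 1

# After the config change, the same index names a different node,
# so the controller believes the wrong node holds the job:
print(old_nodes[job_node_index])   # node02
print(new_nodes[job_node_index])   # node01  <- mismatch; job gets requeued
```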

When adding or removing nodes, the safe order is to restart slurmctld, then run 'scontrol reconfigure' to have the slurmd on the nodes re-read the configuration.
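A minimal sketch of that safe order (assumes a systemd-managed install; the unit name may differ on your site):

```shell
# On the controller: restart slurmctld first so it reads the new slurm.conf
systemctl restart slurmctld

# Then have the slurmd daemons on the compute nodes re-read the config
# without restarting them, leaving running job steps untouched
scontrol reconfigure
```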

- Tim
Comment 2 Tim Wickberg 2017-06-21 16:24:15 MDT
Hey Damien -

I'm marking this as 'timedout', as without the logs I'd requested I don't have much to work from here. Please reopen if you'd like to continue discussing this.

- Tim
Comment 3 Damien 2017-06-21 22:10:52 MDT
Hi Tim

Thanks