Ticket 3829

Summary: Existing jobs were restarted when slurm.conf was updated.
Product: Slurm
Reporter: Damien <damien.leong>
Component: Configuration
Assignee: Tim Wickberg <tim>
Status: RESOLVED TIMEDOUT
Severity: 3 - Medium Impact
Version: 16.05.4
Hardware: Linux
OS: Linux
Site: Monash University

Description Damien 2017-05-22 20:20:31 MDT
Dear Support,

We're running slurm 16.05.4. We keep our slurm.conf on local disks and push to all nodes (scheduler and compute). Yesterday we updated the slurm.conf (Added a node, added two partitions, and changed the default partition). We then restarted slurmd on the compute nodes and restarted slurmctld on the schedulers. 

Jobs that were running on the compute nodes were restarted. We didn't expect this (we expected to restart slurmd and have it find the existing processes). Is it expected behaviour? Was it necessary for me to restart slurmd on the compute nodes or should I have skipped this step?

Kind regards
Comment 1 Tim Wickberg 2017-05-23 18:30:04 MDT
(In reply to Damien from comment #0)
> Dear Support,
> 
> We're running slurm 16.05.4. We keep our slurm.conf on local disks and push
> to all nodes (scheduler and compute). Yesterday we updated the slurm.conf
> (Added a node, added two partitions, and changed the default partition). We
> then restarted slurmd on the compute nodes and restarted slurmctld on the
> schedulers. 
> 
> Jobs that were running on the compute nodes were restarted. We didn't expect
> this (we expected to restart slurmd and have it find the existing
> processes). Is it expected behaviour? Was it necessary for me to restart
> slurmd on the compute nodes or should I have skipped this step?

The logs would help isolate the exact issue, but I believe the problem is that adding nodes to the config and then restarting slurmd on the compute nodes led to a mismatch between the nodes and slurmctld over which node was running which job. Each node registers itself with a sequence number corresponding to its position in the full sorted set of nodes in the system; if a node was added ahead of it in that sorted list, slurmctld would think the wrong node was running the job, resulting in the job being requeued.
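A toy sketch of the index shift described above (this is an illustration, not Slurm's actual code; the node names and index-based lookup are assumptions for the example):

```python
# Hypothetical illustration: slurmctld tracks jobs by a node's position in
# the sorted node table. Adding a node that sorts ahead of existing ones
# shifts every later index.

old_nodes = ["node01", "node02", "node03"]            # controller's old table
new_nodes = ["node00", "node01", "node02", "node03"]  # node00 added, sorts first

# A job recorded as running at index 1 under the old table:
job_node_index = old_nodes.index("node02")   # -> 1

# After the config change, the same index names a different node,
# so the controller believes the wrong node holds the job:
print(old_nodes[job_node_index])   # node02
print(new_nodes[job_node_index])   # node01  <- mismatch; job gets requeued
```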

When adding or removing nodes, the safe order is to restart slurmctld, then run 'scontrol reconfigure' to have the slurmd on the nodes re-read the configuration.
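A minimal sketch of that safe order (assumes a systemd-managed install; the unit name may differ on your site):

```shell
# On the controller: restart slurmctld first so it reads the new slurm.conf
systemctl restart slurmctld

# Then have the slurmd daemons on the compute nodes re-read the config
# without restarting them, leaving running job steps untouched
scontrol reconfigure
```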

- Tim
Comment 2 Tim Wickberg 2017-06-21 16:24:15 MDT
Hey Damien -

I'm marking this as 'timedout', as without the logs I'd requested I don't have much to work from here. Please reopen if you'd like to continue discussing this.

- Tim
Comment 3 Damien 2017-06-21 22:10:52 MDT
Hi Tim

Thanks