Ticket 8293

Summary: Jobs not requeued when nodes DOWN due to reconfigure
Product: Slurm    Reporter: Gordon Dexter <gmdexter>
Component: slurmctld    Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID    QA Contact:
Severity: 6 - No support contract
Priority: ---
Version: 19.05.3
Hardware: Linux   
OS: Linux   
Site: -Other-

Description Gordon Dexter 2020-01-06 10:04:25 MST
We have a few nodes set to permanently DOWN in slurm.conf due to long-term hardware issues.

e.g.
DownNodes=node2,node17 state=DOWN reason="GPU failure"

Some of those nodes were temporarily resumed (I think they came up after an scontrol reboot) and jobs were scheduled on them. On the next scontrol reconfigure the nodes were correctly put back into the DOWN state, but the jobs running on them were not requeued, despite requeueing being the default behavior.
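For reference, the default behavior referred to here is governed by the JobRequeue parameter in slurm.conf, which we have not overridden (so it should be at its default of 1, i.e. requeue batch jobs when their node fails):

```
# slurm.conf default (not set explicitly on our cluster):
# batch jobs on a failed or DOWNed node are requeued
JobRequeue=1
```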

I've tried this with a few other jobs and can confirm that DOWNing a node by hand (i.e. scontrol update nodename=node2 state=DOWN reason=whatever) does requeue the killed jobs. However, when a reconfigure puts the node into the DOWN state, they are not requeued.
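To make the comparison concrete, here is a sketch of the two paths (node name and reason are just examples):

```shell
# Path 1: DOWN the node by hand -- running jobs ARE requeued (observed)
scontrol update NodeName=node2 State=DOWN Reason="GPU failure"

# Path 2: node is listed under DownNodes= in slurm.conf but was
# temporarily resumed; on the next reconfigure it returns to DOWN,
# yet its jobs are NOT requeued (the behavior reported here)
scontrol reconfigure
```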

Is this expected behavior?