Ticket 8293

Summary: Jobs not requeued when nodes DOWN due to reconfigure
Product: Slurm    Reporter: Gordon Dexter <gmdexter>
Component: slurmctld    Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID    QA Contact:
Severity: 6 - No support contract
Priority: ---
Version: 19.05.3
Hardware: Linux   
OS: Linux   
Site: -Other-

Description Gordon Dexter 2020-01-06 10:04:25 MST
We have a few nodes set to permanently DOWN in slurm.conf due to long-term hardware issues.

e.g.
DownNodes=node2,node17 state=DOWN reason="GPU failure"

Some of those nodes were temporarily resumed (I think they came up after an scontrol reboot) and jobs were scheduled on them. On the next scontrol reconfigure the nodes were correctly put back into the DOWN state, but the jobs running on them were not requeued, despite requeueing being the default behavior.
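For reference, the default behavior referred to here is governed by the JobRequeue parameter in slurm.conf, which we have not overridden (so it should be at its default of 1, i.e. requeue batch jobs when their node fails):

```
# slurm.conf default (not set explicitly on our cluster):
# batch jobs on a failed or DOWNed node are requeued
JobRequeue=1
```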

I've tried this with a few other jobs and can confirm that DOWNing a node by hand (i.e. scontrol update nodename=node2 state=DOWN reason=whatever) does requeue the killed jobs. However, when a reconfigure puts the node into the DOWN state, they are not requeued.
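To make the comparison concrete, here is a sketch of the two paths (node name and reason are just examples):

```shell
# Path 1: DOWN the node by hand -- running jobs ARE requeued (observed)
scontrol update NodeName=node2 State=DOWN Reason="GPU failure"

# Path 2: node is listed under DownNodes= in slurm.conf but was
# temporarily resumed; on the next reconfigure it returns to DOWN,
# yet its jobs are NOT requeued (the behavior reported here)
scontrol reconfigure
```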

Is this expected behavior?