Ticket 8293 - Jobs not requeued when nodes DOWN due to reconfigure
Summary: Jobs not requeued when nodes DOWN due to reconfigure
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 19.05.3
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
 
Reported: 2020-01-06 10:04 MST by Gordon Dexter
Modified: 2020-01-06 10:04 MST

Site: -Other-


Description Gordon Dexter 2020-01-06 10:04:25 MST
We have a few nodes set to permanently DOWN in slurm.conf due to long-term hardware issues.

e.g.
DownNodes=node2,node17 state=DOWN reason="GPU failure"

Some of those nodes were temporarily resumed (I think they came back up after an scontrol reboot), and jobs were scheduled on them. On the next scontrol reconfigure, the nodes were correctly put back into the DOWN state, but the jobs running on them were not requeued, despite requeueing being the default behavior.

I've tried with a few other jobs and can confirm that DOWNing a node by hand (i.e., scontrol update NodeName=node2 State=DOWN Reason=whatever) does requeue the killed jobs. However, when a reconfigure is what puts the node into the DOWN state, the jobs aren't requeued.
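To summarize, the sequence that triggers this looks roughly as follows (a sketch, not a runnable script: it needs a live Slurm cluster, and the node name is just the one from our config):

```shell
# node2 is listed as permanently DOWN in slurm.conf:
#   DownNodes=node2,node17 State=DOWN Reason="GPU failure"

scontrol reboot node2        # after the reboot the node comes back up
                             # and jobs get scheduled on it

scontrol reconfigure         # node2 is correctly returned to DOWN,
                             # but the jobs killed on it are NOT requeued

# By contrast, DOWNing the node by hand does requeue the killed jobs:
scontrol update NodeName=node2 State=DOWN Reason="GPU failure"
```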

Is this expected behavior?