Ticket 7942

Summary: drain node if TERMINATE_JOB fails too many times
Product: Slurm
Reporter: Nate Rini <nate>
Component: slurmctld
Assignee: Unassigned Developer <dev-unassigned>
Status: OPEN ---
QA Contact:
Severity: 5 - Enhancement    
Priority: ---    
Version: 20.02.x   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=7872
https://bugs.schedmd.com/show_bug.cgi?id=7941
https://bugs.schedmd.com/show_bug.cgi?id=7949
Site: NOAA
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: NESCC
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---

Description Nate Rini 2019-10-16 12:17:22 MDT
Breaking this issue out from bug#7872 comment#49

When a node becomes unresponsive while the Slurm controller is trying to terminate a job, the controller should eventually drain the node. Repeated failures mean something is wrong with the node or with the network connection to it, which leaves the system in an unknown state.

Aiming this change at 20.02 since it changes how Slurm behaves.
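
For illustration only, the requested behavior could be sketched as a per-node failure counter. This is not slurmctld code (slurmctld is written in C) and it does not reflect any existing Slurm API. The class name `TerminateFailureTracker`, its methods, and the `max_failures` threshold are all hypothetical, and the ticket does not say what "too many times" should be or whether a successful termination should reset the count.

```python
from collections import defaultdict


class TerminateFailureTracker:
    """Hypothetical per-node counter of consecutive failed TERMINATE_JOB attempts."""

    def __init__(self, max_failures=3):
        # max_failures is an assumed, configurable threshold; the ticket gives no value.
        if max_failures < 1:
            raise ValueError("max_failures must be at least 1")
        self.max_failures = max_failures
        self._failures = defaultdict(int)
        self.drained = {}  # node name -> drain reason

    def record_failure(self, node):
        """Count one failed attempt; return True if the node is (now) drained."""
        if node in self.drained:
            return True
        self._failures[node] += 1
        if self._failures[node] >= self.max_failures:
            self.drained[node] = (
                f"TERMINATE_JOB failed {self._failures[node]} times"
            )
            return True
        return False

    def record_success(self, node):
        """Assumption: a successful termination resets the node's failure count."""
        self._failures.pop(node, None)


if __name__ == "__main__":
    tracker = TerminateFailureTracker(max_failures=3)
    for attempt in range(1, 4):
        drained = tracker.record_failure("node001")
        print(f"attempt {attempt}: drained={drained}")
    print(tracker.drained)
```

In this sketch a node stays drained once the threshold is reached, so an administrator would have to inspect it and return it to service by hand, which matches the "unknown state" concern in the description.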