7942 – drain node if TERMINATE_JOB fails too many times

Status:	OPEN

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	slurmctld
Version:	20.02.x
Hardware:	Linux Linux

Severity:	5 - Enhancement
Assignee:	Unassigned Developer
QA Contact:

URL:

Depends on:
Blocks:

Reported:	2019-10-16 12:17 MDT by Nate Rini
Modified:	2019-12-03 11:09 MST
CC List:	0 users

See Also:	7872 7941 7949
Site:	NOAA
NOAA Site:	NESCC

Description Nate Rini 2019-10-16 12:17:22 MDT

Breaking this issue out from bug#7872 comment#49

When a node becomes unresponsive while the Slurm controller (slurmctld) is trying to terminate a job, the controller should eventually drain the node. Repeated TERMINATE_JOB failures mean that something is wrong with the node or with the network connection to it, which leaves the system in an unknown state.

Aiming this change at 20.02 since it changes how Slurm behaves.
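
The requested behavior could be sketched roughly as follows. This is an illustrative Python model only, not Slurm code: the `Node` class, the `record_terminate_result` helper, and the `MAX_TERMINATE_FAILURES` threshold are hypothetical names, and the ticket does not say what the limit should be or whether failures must be consecutive (the sketch assumes they must).

```python
from dataclasses import dataclass

# Hypothetical limit; the ticket does not specify a value or a config option.
MAX_TERMINATE_FAILURES = 3


@dataclass
class Node:
    """Minimal stand-in for the controller's view of a compute node."""
    name: str
    state: str = "ALLOCATED"
    terminate_failures: int = 0
    reason: str = ""


def record_terminate_result(node, ok, max_failures=MAX_TERMINATE_FAILURES):
    """Record one TERMINATE_JOB attempt; return True if this call drained the node."""
    if ok:
        # Assumption: a successful termination resets the failure count.
        node.terminate_failures = 0
        return False
    node.terminate_failures += 1
    if node.terminate_failures >= max_failures and node.state != "DRAIN":
        # Too many failures: take the node out of scheduling and record why.
        node.state = "DRAIN"
        node.reason = f"TERMINATE_JOB failed {node.terminate_failures} times"
        return True
    return False


# Usage: three failed attempts in a row drain the node on the third one.
node = Node("n001")
results = [record_terminate_result(node, ok=False) for _ in range(3)]
```

Recording a reason alongside the state change mirrors how an administrator would normally drain a node by hand, so that the cause is visible when the node is inspected later.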