Ticket 11758 - Preemption leading to drained nodes
Summary: Preemption leading to drained nodes
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.6
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-06-03 10:34 MDT by Adam
Modified: 2021-06-03 11:47 MDT

See Also:
Site: Yale
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Adam 2021-06-03 10:34:05 MDT
So far this seems to be user-specific, but we're seeing job preemptions leading to drained nodes. For example (from a compute node):

Jun  3 12:12:32 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: error: *** JOB 27415722 ON p08r05n13 CANCELLED AT 2021-06-03T12:12:32 DUE TO PREEMPTION ***
Jun  3 12:23:03 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: error: *** JOB 27415722 STEPD TERMINATED ON p08r05n13 AT 2021-06-03T12:23:02 DUE TO JOB NOT ENDING WITH SIGNALS ***
Jun  3 12:23:03 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status:-1
Jun  3 12:23:03 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: done with job

...after which the node is drained due to "Kill task failed". Is there a way we can/should configure things to address this? We can follow up with the user, but it is of concern that a single user has the power to drain the entire cluster in this way.

Thank you,
Adam
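
For reference, a node drained with reason "Kill task failed" stays out of service until an admin clears it. A minimal sketch of the usual checks, using the node name from the log above for illustration:

    # List drained/down nodes with the reason recorded by slurmd
    sinfo -R

    # Once the stuck job processes have actually exited, return the node to service
    scontrol update NodeName=p08r05n13 State=RESUME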
Comment 1 Adam 2021-06-03 11:19:01 MDT
Hi, this bug can be closed. We are fairly certain that this situation is occurring because a storage system isn't responding quickly enough (there's nothing SLURM can do about that except increase the unkillable timeout).

Thank you,
Adam
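
For reference, the "unkillable timeout" mentioned above is the UnkillableStepTimeout parameter in slurm.conf; a minimal sketch, with an illustrative value rather than a recommendation (the change must be propagated to the compute nodes and slurmd reconfigured or restarted to take effect):

    # slurm.conf
    # Allow processes stuck on slow storage more time to exit before the node
    # is drained with "Kill task failed" (default is 60 seconds; value illustrative)
    UnkillableStepTimeout=180

    # Optionally run a site script when an unkillable step is detected
    # (path is illustrative)
    UnkillableStepProgram=/usr/local/sbin/report_unkillable.sh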
Comment 2 Jason Booth 2021-06-03 11:47:45 MDT
Resolving out.