Ticket 11758 - Preemption leading to drained nodes
Summary: Preemption leading to drained nodes
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.6
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-06-03 10:34 MDT by Adam
Modified: 2021-06-03 11:47 MDT

See Also:
Site: Yale
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Adam 2021-06-03 10:34:05 MDT
So far this seems to be user-specific, but we're seeing job preemptions leading to drained nodes. For example (from a compute node):

Jun  3 12:12:32 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: error: *** JOB 27415722 ON p08r05n13 CANCELLED AT 2021-06-03T12:12:32 DUE TO PREEMPTION ***
Jun  3 12:23:03 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: error: *** JOB 27415722 STEPD TERMINATED ON p08r05n13 AT 2021-06-03T12:23:02 DUE TO JOB NOT ENDING WITH SIGNALS ***
Jun  3 12:23:03 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status:-1
Jun  3 12:23:03 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: done with job

...after which the node is drained due to "Kill task failed". Is there a way we can/should configure things to address this? We can follow up with the user, but it is of concern that a single user has the power to drain the entire cluster in this way.

Thank you,
Adam
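
For reference, a node drained with reason "Kill task failed" stays out of service until an admin clears it. A minimal sketch of the usual checks, using the node name from the log above for illustration:

    # List drained/down nodes with the reason recorded by slurmd
    sinfo -R

    # Once the stuck job processes have actually exited, return the node to service
    scontrol update NodeName=p08r05n13 State=RESUME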
Comment 1 Adam 2021-06-03 11:19:01 MDT
Hi, this bug can be closed. We are fairly certain that this situation is occurring because a storage system isn't responding quickly enough (there's nothing SLURM can do about that except increase the unkillable timeout).

Thank you,
Adam
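
For reference, the "unkillable timeout" mentioned above is the UnkillableStepTimeout parameter in slurm.conf; a minimal sketch, with an illustrative value rather than a recommendation (the change must be propagated to the compute nodes and slurmd reconfigured or restarted to take effect):

    # slurm.conf
    # Allow processes stuck on slow storage more time to exit before the node
    # is drained with "Kill task failed" (default is 60 seconds; value illustrative)
    UnkillableStepTimeout=180

    # Optionally run a site script when an unkillable step is detected
    # (path is illustrative)
    UnkillableStepProgram=/usr/local/sbin/report_unkillable.sh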
Comment 2 Jason Booth 2021-06-03 11:47:45 MDT
Resolving out.