Ticket 11758

Summary: Preemption leading to drained nodes
Product: Slurm Reporter: Adam <adam.munro>
Component: Scheduling    Assignee: Director of Support <support>
Status: RESOLVED INVALID QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 20.02.6   
Hardware: Linux   
OS: Linux   
Site: Yale Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Adam 2021-06-03 10:34:05 MDT
So far this seems to be user-specific, but we're seeing job preemptions lead to drained nodes. For example (from a compute node):

Jun  3 12:12:32 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: error: *** JOB 27415722 ON p08r05n13 CANCELLED AT 2021-06-03T12:12:32 DUE TO PREEMPTION ***
Jun  3 12:23:03 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: error: *** JOB 27415722 STEPD TERMINATED ON p08r05n13 AT 2021-06-03T12:23:02 DUE TO JOB NOT ENDING WITH SIGNALS ***
Jun  3 12:23:03 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status:-1
Jun  3 12:23:03 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: done with job

...after which the node is drained due to "Kill task failed". Is there a way we can or should configure things to address this? We can follow up with the user, but it is concerning that a single user has the power to drain the entire cluster in this way.

Thank you,
Adam
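
[Editor's note: for context, nodes drained with a "Kill task failed" reason can be listed and, once the stuck processes are confirmed gone, returned to service with standard Slurm admin commands. A sketch; the node name is taken from the log above and is illustrative:]

```
# List drained/down nodes together with the reason slurmctld recorded
sinfo -R

# Inspect a specific node's state, reason, and timestamp
scontrol show node p08r05n13

# After confirming the unkillable processes have exited, undrain the node
scontrol update NodeName=p08r05n13 State=RESUME
```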
Comment 1 Adam 2021-06-03 11:19:01 MDT
Hi, this bug can be closed. We are fairly certain this situation is occurring because a storage system isn't responding quickly enough; there's nothing Slurm can do about that except increase the unkillable timeout.

Thank you,
Adam
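
[Editor's note: the timeout referenced above is UnkillableStepTimeout in slurm.conf. A sketch of the relevant settings; the 180-second value is an assumption for illustration, not a recommendation:]

```
# slurm.conf fragment
# Seconds slurmstepd waits for processes to exit after SIGKILL before
# declaring the step unkillable and draining the node (default: 60).
UnkillableStepTimeout=180

# Optional: program slurmd runs when an unkillable step is detected,
# e.g. to notify admins (path is hypothetical).
#UnkillableStepProgram=/usr/local/sbin/unkillable_alert.sh
```

Raising the timeout gives a slow filesystem more time to release I/O-blocked processes, at the cost of holding the node in COMPLETING longer before it is marked drained.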
Comment 2 Jason Booth 2021-06-03 11:47:45 MDT
Resolving out.