So far this seems to be user-specific, but we're seeing job preemptions leading to drained nodes. For example (from a compute node):

Jun 3 12:12:32 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: error: *** JOB 27415722 ON p08r05n13 CANCELLED AT 2021-06-03T12:12:32 DUE TO PREEMPTION ***
Jun 3 12:23:03 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: error: *** JOB 27415722 STEPD TERMINATED ON p08r05n13 AT 2021-06-03T12:23:02 DUE TO JOB NOT ENDING WITH SIGNALS ***
Jun 3 12:23:03 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status:-1
Jun 3 12:23:03 p08r05n13.grace.hpc.yale.internal slurmstepd[134597]: done with job

...after which the node is drained due to "Kill task failed".

Is there a way we can/should configure things to address this? We can follow up with the user, but it is of concern that a single user has the power to drain the entire cluster in this way.

Thank you,
Adam
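
For reference, this is roughly how the drained nodes can be inspected and put back into service by hand once the stuck processes are gone (a sketch; the node name is taken from the log excerpt above):

    sinfo -R                                          # list drained nodes with their drain reason ("Kill task failed")
    scontrol update NodeName=p08r05n13 State=RESUME   # return the node to service after confirming the processes have exited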
Hi, this bug can be closed. We are fairly certain that this situation is occurring because a storage system isn't responding quickly enough (there's nothing SLURM can do about that except increase the unkillable timeout). Thank you, Adam
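
In case it's useful to anyone hitting the same issue, the timeout in question is the UnkillableStepTimeout setting in slurm.conf (the default is 60 seconds); a minimal sketch with example values, to be tuned per site:

    # slurm.conf -- example values only, adjust for your site
    # Allow more time for processes blocked on slow storage to exit before
    # the node is drained with "Kill task failed".
    UnkillableStepTimeout=180
    # Optionally run a program when a step still has not exited after the
    # timeout, e.g. to gather diagnostics (path here is hypothetical).
    #UnkillableStepProgram=/usr/local/sbin/report_unkillable_step.sh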
Resolving out.