Hi there,

I'm new to Slurm and learning how to use it. I've noticed an issue with some of our nodes going into a drain state after a job finishes or nears completion. The messages I'm getting are the following:

    Reason=Kill task failed [root@2025-03-22T01:24:21]
    [2025-03-22T01:24:21.001] [91099.batch] error: *** JOB 91099 STEPD TERMINATED ON nodeamd018 AT 2025-03-22T01:24:20 DUE TO JOB NOT ENDING WITH SIGNALS ***

I did some research on the error logs and found that increasing UnkillableStepTimeout to around 120 seconds could give I/O more time before crashing, preventing the nodes from draining:

    https://support.schedmd.com/show_bug.cgi?id=3941

My question is, how do you update the value for UnkillableStepTimeout? I've read discussions and documentation and found that it's updated in slurm.conf. However, it's not showing up in my slurm.conf. When I run the scontrol command I can see that it shows up, but it's not in the file itself:

    [root@koko-slurm1 slurm]# scontrol show config | grep Unkill
    UnkillableStepProgram = (null)
    UnkillableStepTimeout = 60 sec
    [root@koko-slurm1 slurm]# cat /etc/slurm/slurm.conf | grep Unkill
    [root@koko-slurm1 slurm]#

We do use Warewulf for our cluster, but not even that slurm.conf has UnkillableStepTimeout written in it.

Can this value be updated through scontrol? Could it be located in another Slurm file?

I'm sorry if this is a bad question, but I cannot figure out how or where to update this value. Please let me know if you need additional information.
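
For reference, the grep check above can be wrapped in a small helper. This is only a minimal sketch: the function name conf_has_key is hypothetical, and the default path is simply the one used in the commands above. It matches the keyword case-insensitively and skips commented-out lines. If the key is absent from the file, that would suggest scontrol is reporting a default rather than an explicitly configured value, though it could also come from a file pulled in by an Include line, which this sketch does not follow.

```shell
#!/bin/sh
# conf_has_key FILE KEY  (hypothetical helper name, not part of Slurm)
# Succeeds if KEY is assigned on a non-comment line of FILE.
# The match is case-insensitive because slurm.conf keywords are.
conf_has_key() {
    grep -Eiq "^[[:space:]]*$2[[:space:]]*=" "$1"
}

# Path taken from the commands in this post; pass another path as $1 if needed.
conf=${1:-/etc/slurm/slurm.conf}
key=UnkillableStepTimeout

# Only report when the file is actually readable on this host.
if [ -r "$conf" ]; then
    if conf_has_key "$conf" "$key"; then
        echo "$key is set explicitly in $conf"
    else
        echo "$key is not set in $conf"
    fi
fi
```

Running the same check against the slurm.conf that Warewulf distributes to the compute nodes would show whether the two copies agree.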