Ticket 22425 - How to update UnkillableStepTimeout?
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: - Unsupported Older Versions
Hardware: Linux Linux
: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2025-03-24 14:42 MDT by wkudo
Modified: 2025-03-25 09:24 MDT
Site: -Other-
Version Fixed: slurm 20.11.8


Description wkudo 2025-03-24 14:42:29 MDT
Hi there,

I'm new to Slurm and learning how to use it. I've noticed an issue with some of our nodes going into a drain state after a job finishes or nears completion. The messages I'm getting are the following:

Reason=Kill task failed [root@2025-03-22T01:24:21]
[2025-03-22T01:24:21.001] [91099.batch] error: *** JOB 91099 STEPD TERMINATED ON nodeamd018 AT 2025-03-22T01:24:20 DUE TO JOB NOT ENDING WITH SIGNALS ***

I researched the error and found that increasing UnkillableStepTimeout to around 120 seconds could give pending I/O more time to finish before the step is treated as unkillable, which should keep the nodes from draining:
https://support.schedmd.com/show_bug.cgi?id=3941

My question is, how do you update the value for UnkillableStepTimeout? I've read discussions and documentation and found that it's updated in slurm.conf. However, it's not showing up in my slurm.conf. When I run the scontrol command I can see that it shows up, but it's not in the file itself:

[root@koko-slurm1 slurm]# scontrol show config | grep Unkill
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec

[root@koko-slurm1 slurm]# cat /etc/slurm/slurm.conf | grep Unkill
[root@koko-slurm1 slurm]# 

We use Warewulf for our cluster, but even its copy of slurm.conf does not contain UnkillableStepTimeout. Can this value be updated through scontrol? Could it be set in another Slurm file? I'm sorry if this is a basic question, but I cannot figure out how or where to update this value. Please let me know if you need additional information.
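
For context on the question above: a parameter that is absent from slurm.conf simply takes its built-in default, which is why scontrol show config reports 60 sec even though grep finds nothing in the file. The usual way to change it is to add a line such as UnkillableStepTimeout=120 to slurm.conf. The sketch below is one possible way to do that idempotently. The helper name set_slurm_param is hypothetical (it is not part of Slurm), and the match is case-sensitive and assumes the file ends with a newline.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Slurm: set KEY=VALUE in a
# slurm.conf-style file. If an uncommented "KEY=" line already exists it is
# rewritten in place; otherwise a new line is appended.
# Assumptions: case-sensitive key match, one parameter per line,
# and the file ends with a newline.
set_slurm_param() {
    conf=$1
    key=$2
    value=$3
    if grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$conf"; then
        # Portable in-place edit through a temporary file.
        tmp=$(mktemp) || return 1
        sed -E "s|^[[:space:]]*${key}[[:space:]]*=.*|${key}=${value}|" "$conf" > "$tmp" \
            && cat "$tmp" > "$conf"
        rm -f "$tmp"
    else
        printf '%s=%s\n' "$key" "$value" >> "$conf"
    fi
}
```

On a real cluster this might be used as set_slurm_param /etc/slurm/slurm.conf UnkillableStepTimeout 120, followed by scontrol reconfigure and a check with scontrol show config | grep Unkill. Whether the daemons also need a restart, and how the edited file reaches the compute nodes under Warewulf, depends on the Slurm version and site setup, so treat those steps as assumptions to verify locally.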
Comment 1 wkudo 2025-03-25 09:24:14 MDT
.