Ticket 17597

Summary: preemption grace timeout
Product: Slurm Reporter: devops <richard>
Component: Configuration Assignee: Marshall Garey <marshall>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 23.02.4   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=17191
https://bugs.schedmd.com/show_bug.cgi?id=16263
Site: Stability AI

Description devops@stability.ai 2023-09-01 07:59:39 MDT
Hi, we need to introduce a preemption timeout, but I noticed we need to change the PreemptMode from requeue to cancel to make this happen.

So it seems we have to choose between allowing a grace time and allowing jobs to be requeued, which is a hard choice to make.

Is there any scenario where we could have both?
Comment 3 Marshall Garey 2023-09-01 09:45:04 MDT
Actually, the documentation is wrong. GraceTime also works with PreemptMode=requeue. We have already updated the documentation in commit e889aa0c9e0a, which will be live on the website after 23.02.5 is released.

However, GraceTime does not yet work fully with preemption, whether PreemptMode is cancel or requeue.

If you have GraceTime configured, then if a job is preempted, the job's steps/tasks are signaled with SIGCONT and SIGTERM.
* If the job exits before GraceTime is over, then the job will not be considered preempted. The job will not be requeued, even if PreemptMode=requeue, and the state of the job in accounting is based on the job's exit code.
* If the job does not exit before GraceTime is over, then the job and all its steps are signaled with SIGKILL and the job is considered preempted. The job is then requeued if PreemptMode=requeue, and the job's state is PREEMPTED.
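
The two cases above can be summarized as a small decision function. This is only a sketch of the behavior described in this comment, not Slurm code; `preemption_outcome` and the `Outcome` fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    considered_preempted: bool
    requeued: bool
    accounting_state: str


def preemption_outcome(exited_within_grace: bool, preempt_mode: str) -> Outcome:
    """Model what happens to a preempted job when GraceTime is configured.

    Hypothetical helper mirroring the description above; not part of Slurm.
    """
    if exited_within_grace:
        # The job reacted to SIGCONT/SIGTERM and exited on its own, so it is
        # not treated as preempted and is never requeued.
        return Outcome(False, False, "based on exit code")
    # The job outlived GraceTime: it gets SIGKILL and is treated as preempted.
    requeued = preempt_mode.lower() == "requeue"
    return Outcome(True, requeued, "PREEMPTED")
```

For example, `preemption_outcome(True, "requeue")` reports `requeued=False`: a job that exits promptly on SIGTERM is not requeued even though PreemptMode=requeue.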


This was reported in bug 16263, and we are exploring how this can be fixed. For now, have users make their jobs catch SIGTERM and not exit so that after GraceTime has passed, the jobs will be killed with SIGKILL and preemption will work properly.
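
A minimal Python sketch of this workaround, assuming the job's main process is the one that receives SIGTERM. The names `work_until_preempted`, `wait_for_sigkill`, and the `checkpoint` callback are hypothetical; the idea is only to note SIGTERM, optionally save state, and then keep the process alive until SIGKILL arrives.

```python
import signal
import threading

# Set by the handler when SIGTERM arrives (the start of the grace period).
preempt_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the preemption notice instead of letting the process exit."""
    preempt_requested.set()


def work_until_preempted(do_work_step, checkpoint=lambda: None):
    """Run work steps until SIGTERM is seen, then call the checkpoint hook."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    while not preempt_requested.is_set():
        do_work_step()
    checkpoint()


def wait_for_sigkill():
    """Stay alive on purpose; only SIGKILL ends the process from here."""
    while True:
        signal.pause()  # sleeps until any signal is delivered (Unix only)
```

A job would call `work_until_preempted(step, checkpoint=save_state)` followed by `wait_for_sigkill()`, where `step` and `save_state` stand for the application's own functions. Any state worth keeping has to be saved inside GraceTime, since the process is killed once it expires.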
Comment 4 devops@stability.ai 2023-09-01 11:51:37 MDT
cool, thanks
Comment 5 Marshall Garey 2023-09-01 12:11:21 MDT
Closing as info given per your response.