Hi, we need to introduce a preemption timeout, but I noticed that to do so we would have to change PreemptMode from requeue to cancel. So it seems we must choose between having GraceTime and requeuing preempted jobs, and this is a hard choice to make. Is there any configuration where we could have both?
Actually, the documentation is wrong. GraceTime also works with PreemptMode=requeue. We have already updated the documentation in commit e889aa0c9e0a, which will be live on the website after 23.02.5 is released.

However, GraceTime does not fully work with preemption (cancel or requeue). If you have GraceTime configured, then when a job is preempted, the job's steps/tasks are signaled with SIGCONT and SIGTERM.

* If the job exits before GraceTime is over, then the job will not be considered preempted. The job will not be requeued even if PreemptMode=requeue, and the state of the job in accounting is based on whatever the job's exit code was.
* If the job does not exit before GraceTime is over, then the job and all its steps are signaled with SIGKILL and the job is considered preempted. The job is requeued if PreemptMode=requeue, and the job's state is PREEMPTED.

This was reported in bug 16263, and we are exploring how this can be fixed. For now, have users make their jobs catch SIGTERM and not exit, so that after GraceTime has passed the jobs will be killed with SIGKILL and preemption will work properly.
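As a rough illustration of the workaround, here is a minimal Python sketch of a job that catches SIGTERM and keeps running instead of exiting. It is only a sketch under assumptions: the names `preempted`, `handle_sigterm`, `install_handler`, and `run_job`, and the checkpoint callback, are hypothetical and not part of Slurm. It assumes the job's main task is a Python process that receives the SIGTERM directly.

```python
import signal

# Hypothetical flag for illustration only; this is not a Slurm API.
preempted = {"notified": False}


def handle_sigterm(signum, frame):
    """Record the preemption notice but deliberately do NOT exit."""
    preempted["notified"] = True


def install_handler():
    # Replace the default SIGTERM action (terminate the process) with our
    # handler, so the process survives until SIGKILL arrives after GraceTime.
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_job(do_work, save_checkpoint, max_steps):
    """Run do_work(step) for max_steps steps, continuing after SIGTERM.

    The first time a SIGTERM notice is seen, save_checkpoint(step) is
    called once; the loop then carries on rather than exiting.
    Returns True if a checkpoint was written.
    """
    checkpointed = False
    for step in range(max_steps):
        if preempted["notified"] and not checkpointed:
            save_checkpoint(step)  # use the grace period to save state
            checkpointed = True
        do_work(step)
    return checkpointed
```

A real job would call `install_handler()` at startup and then run its normal workload. The key design point is that the handler never calls `sys.exit()`, so the process stays alive until it is killed with SIGKILL and the job is then treated as preempted, as described above. SIGCONT needs no handler here because its default action does not terminate the process.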
cool, thanks
Closing as "info given", per your response.