| Summary: | preemption grace timeout | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | devops <richard> |
| Component: | Configuration | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 23.02.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=17191 https://bugs.schedmd.com/show_bug.cgi?id=16263 | ||
| Site: | Stability AI | Slinky Site: | --- |
Description
devops@stability.ai
2023-09-01 07:59:39 MDT
Actually, the documentation is wrong. GraceTime also works with PreemptMode=requeue. We have already updated the documentation in commit e889aa0c9e0a, which will be live on the website after 23.02.5 is released.

However, GraceTime actually does not totally work with preemption (cancel or requeue). If you have GraceTime configured, then if a job is preempted, the job's steps/tasks are signaled with SIGCONT and SIGTERM.

* If the job exits before GraceTime is over, then the job will not be considered preempted. The job will not be requeued if PreemptMode=requeue and the state of the job in accounting is based on whatever the job's exit code was.
* If the job does not exit before GraceTime is over, then the job and all its steps are signalled with SIGKILL and the job is considered preempted. This includes requeuing the job if PreemptMode=requeue and the job's state is PREEMPTED.

This was reported in bug 16263, and we are exploring how this can be fixed.

For now, have users make their jobs catch SIGTERM and not exit so that after GraceTime has passed, the jobs will be killed with SIGKILL and preemption will work properly.

cool, thanks

Closing as info given per your response.
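For anyone landing here later, the workaround described above (catch SIGTERM, do not exit, let SIGKILL end the job after GraceTime) could be sketched roughly as follows. This is only an illustration, not SchedMD's code. It assumes the Python process is one of the job's tasks that receives the signal. `PreemptFlag`, `run`, `do_work`, `save_checkpoint` and `max_idle_polls` are hypothetical names made up for this example.

```python
import signal
import time


class PreemptFlag:
    """Hypothetical helper (not a Slurm API): remembers that SIGTERM arrived."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Only record the notice. Returning without exiting keeps the job
        # alive, so it is still running when GraceTime expires.
        self.requested = True


def run(do_work, save_checkpoint, flag, poll_seconds=1.0, max_idle_polls=None):
    """Work until SIGTERM arrives, checkpoint once, then idle until killed.

    max_idle_polls exists only so the sketch can be exercised without
    waiting for SIGKILL; leave it as None in a real job.
    """
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, flag.handle)

    while not flag.requested:
        do_work()

    # SIGTERM was seen: save state, but do not exit.
    save_checkpoint()

    polls = 0
    while max_idle_polls is None or polls < max_idle_polls:
        # Idle here; after GraceTime the job is expected to receive SIGKILL,
        # at which point it is treated as preempted.
        time.sleep(poll_seconds)
        polls += 1
```

In a real job, `do_work` would be one unit of the actual computation and `save_checkpoint` would write whatever state the requeued job needs to resume. The checkpoint has to finish within GraceTime, since SIGKILL cannot be caught.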