5867 – GraceTime for PreemptMode=REQUEUE

Summary: GraceTime for PreemptMode=REQUEUE

Status:	RESOLVED FIXED

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Scheduling
Version:	17.11.7
Hardware:	Linux Linux

Severity:	5 - Enhancement
Assignee:	Brian Christiansen
QA Contact:

URL:

Duplicates (1):	5515
Depends on:
Blocks:

Reported:	2018-10-17 08:17 MDT by Paul Edmon
Modified:	2021-10-07 09:20 MDT
CC List:	6 users

See Also:
Site:	Harvard University
Version Fixed:	19.05.0pre4

Attachments
Patch to allow user-selectable warning signal (859 bytes, patch) 2019-04-02 10:23 MDT, S Senator

Description Paul Edmon 2018-10-17 08:17:09 MDT

I see in the docs this option:

GraceTime
    Specifies, in units of seconds, the preemption grace time to be extended to a job which has been selected for preemption. The default value is zero, no preemption grace time is allowed on this partition. Once a job has been selected for preemption, its end time is set to the current time plus GraceTime. The job is immediately sent SIGCONT and SIGTERM signals in order to provide notification of its imminent termination. This is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence upon reaching its new end time. (Meaningful only for PreemptMode=CANCEL)

I had always understood that when preemption occurs, a series of signals is sent to the job, each separated by 30 seconds.  Apparently this changed, which is fine, but I see that GraceTime only applies to PreemptMode=CANCEL.  Is that true?  Can it be used with PreemptMode=REQUEUE?  We have users whose codes run in our requeue partition and key off of these signals to dump a checkpoint and thus not lose compute time.  If this feature is not available with PreemptMode=REQUEUE, can it be added?  I can't imagine it is that hard, given that the idea is very similar to CANCEL mode but with the added step of the job being re-entered into the PENDING queue.

Comment 1 Brian Christiansen 2018-10-18 16:59:08 MDT

Hey Paul,

It is correct that GraceTime is directly used for PreemptMode=cancel. It is also used when one of the other preempt modes fails for whatever reason and Slurm attempts to cancel the job instead.

However, in my testing with PreemptMode=requeue, requeues are subject to KillWait: the job is sent SIGCONT and SIGTERM, and then SIGKILL after KillWait seconds. Also, if I use PreemptMode=cancel with a GraceTime, a SIGTERM is sent at preemption time, another SIGTERM when GraceTime expires, and then SIGKILL after KillWait. I'm still digging into the code and understanding the different paths.

Are you not seeing the SIGCONT, SIGTERM, SIGKILL sequence with requeuing?

Thanks,
Brian


e.g.
brian@lappy:~/slurm/17.11/lappy$ cat sub.sh 
#!/bin/bash

handler() {
        echo `date | tr -d '\n'`: signal caught
}
trap handler SIGTERM

echo `date | tr -d '\n'`: pid: $$

while [ /bin/true ]; do
        echo `date | tr -d '\n'`: pid: $$
        sleep 1
done

brian@lappy:~/slurm/17.11/lappy$ scontrol show config | grep -i killwait
KillWait                = 90 sec

brian@lappy:~/slurm/17.11/lappy$ sbatch -wlappy1 -c8 -n1 -pdebug sub.sh
Submitted batch job 235353

brian@lappy:~/slurm/17.11/lappy$ sbatch -wlappy1 -c8 -n1 -pdebug2 sub.sh
Submitted batch job 235354

debug2 preempts debug.

slurmctld logs:
Oct 17 23:04:55.460887 31734 srvcn        0x7f9da0254700: _slurm_rpc_submit_batch_job: JobId=235354 InitPrio=461 usec=237
Oct 17 23:04:55.946483 31734 sched        0x7f9da5b6ab80: preempted job 235353 has been requeued to reclaim resources for job 235354
Oct 17 23:06:27.14000  31734 srvcn        0x7f9da095b700: cleanup_completing: job 235353 completion process took 92 seconds


brian@lappy:~/slurm/17.11/lappy$ cat slurm-235353.out 
Wed Oct 17 23:04:47 MDT 2018: pid: 9502
...
Wed Oct 17 23:04:55 MDT 2018: pid: 9502
Oct 17 23:04:55.951189  9497 slurmstepd   0x7f42170c5700: error: *** JOB 235353 ON lappy1 CANCELLED AT 2018-10-17T23:04:55 DUE TO PREEMPTION ***
Wed Oct 17 23:04:56 MDT 2018: signal caught
Wed Oct 17 23:04:56 MDT 2018: pid: 9502
...
Wed Oct 17 23:06:25 MDT 2018: pid: 9502
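
For readers adapting this test, a minimal sketch of a script that uses the window between SIGTERM and SIGKILL to save state follows. The names `CKPT_FILE`, `save_checkpoint`, `on_term` and `run_work` are hypothetical, and the background `kill` only simulates the signal that Slurm would deliver on requeue.

```shell
#!/bin/bash
# Sketch: save state when SIGTERM arrives, using the KillWait window before SIGKILL.
# CKPT_FILE, save_checkpoint, on_term and run_work are hypothetical names.

CKPT_FILE="${CKPT_FILE:-${TMPDIR:-/tmp}/job_checkpoint.$$}"
step=0
terminating=0

save_checkpoint() {
    # Write to a temp file and rename it, so a kill in the middle of the write
    # cannot leave a truncated checkpoint behind.
    printf 'step=%s\n' "$step" > "${CKPT_FILE}.tmp" && mv "${CKPT_FILE}.tmp" "$CKPT_FILE"
}

on_term() {
    terminating=1
    save_checkpoint
}
trap on_term TERM

run_work() {
    # Stand-in for real work: loop until the handler asks us to stop
    # or the step limit is reached.
    local max_steps=$1
    while [ "$terminating" -eq 0 ] && [ "$step" -lt "$max_steps" ]; do
        step=$((step + 1))
        sleep 0.1
    done
}

# Demo only: deliver SIGTERM to this shell after a few steps.
( sleep 0.35; kill -TERM $$ ) &
run_work 50
wait
echo "stopped at step $step; checkpoint: $(cat "$CKPT_FILE")"
```

In a real job, KillWait bounds how long `save_checkpoint` may take before SIGKILL arrives.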

Comment 2 Paul Edmon 2018-10-18 19:22:01 MDT

One of my users wrote me and said he didn't see it when requeuing and 
then he found the GraceTime option online and sent it to me. I will have 
to double check my KillWait.

-Paul Edmon-


Comment 3 Paul Edmon 2018-10-19 08:41:34 MDT

Looks like our KillWait is set at 30s:

[root@holyitc01 ~]# scontrol show conf | grep KillWait
KillWait                = 30 sec

Assuming the terminate process uses this timer on REQUEUE then unless 
there is a bug it should work the way we want.

-Paul Edmon-


Comment 4 Brian Christiansen 2018-10-19 09:14:52 MDT

ok. Could you try the example bash script that I used and let me know if you aren't seeing the SIGTERM be sent? I get the same behavior if I have a job step inside my allocation that ignores SIGTERM as well.

Comment 5 Paul Edmon 2018-10-22 09:58:51 MDT

Sorry, it took a bit to get around to testing this.  It does capture the 
SIGTERM with both scancel and scontrol requeue.

-Paul Edmon-

Comment 6 Paul Edmon 2018-10-22 09:59:51 MDT

By the way the user in question is using job steps I believe so if there 
is an issue with that then he would be impacted.  I think he did some 
previous testing that showed that if you run subshells the signals don't 
get passed down properly.

-Paul Edmon-

Comment 7 Brian Christiansen 2018-10-23 09:56:05 MDT

Hey Paul,

I can confirm that with proctrack/cgroup only the direct children of the stepd are getting the SIGTERM.

https://github.com/SchedMD/slurm/blob/slurm-17.11/src/plugins/proctrack/cgroup/proctrack_cgroup.c#L540

https://github.com/SchedMD/slurm/blob/slurm-17.11/src/plugins/proctrack/cgroup/proctrack_cgroup.c#L421

With proctrack/linuxproc, all descendants of the stepd are being signaled -- which causes my sruns to get signaled twice -- once by the batch step and once by the actual step.

I believe the way that proctrack/cgroup is working is the most desirable, otherwise everything in the call stack would need to be prepared to handle SIGTERM. Handling it at the top layer allows one process to be in control of the shutdown phase.
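
A minimal sketch of that "top layer in control" pattern, assuming the batch script is the only process that receives SIGTERM: it relays the signal to the child it started and then waits for the child to finish. The names `worker`, `forward_term` and `MARKER` are hypothetical, and the background `kill` merely simulates the stepd signaling the script.

```shell
#!/bin/bash
# Sketch: the top-level script relays SIGTERM to the child it launched.
# worker, forward_term and MARKER are hypothetical names.

MARKER="${TMPDIR:-/tmp}/worker_got_term.$$"

worker() {
    # Stand-in for the real application: record the signal, then stop.
    trap 'echo "worker: TERM received" > "$MARKER"; exit 0' TERM
    while true; do sleep 0.1; done
}

forward_term() {
    # Relay the signal to the child so it can shut down cleanly.
    kill -TERM "$child_pid" 2>/dev/null
}

worker &
child_pid=$!
trap forward_term TERM

# Demo only: simulate a signal that reaches just this top-level script.
( sleep 0.3; kill -TERM $$ ) &

# A wait interrupted by a trapped signal returns non-zero right away,
# so keep waiting until the child has really exited.
while kill -0 "$child_pid" 2>/dev/null; do
    wait "$child_pid" || true
done
echo "child exited; marker: $(cat "$MARKER")"
```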

Thanks,
Brian

Comment 8 Paul Edmon 2018-10-23 10:04:55 MDT

Yup, that's fine.  I will let that user know.

-Paul Edmon-

Comment 9 Brian Christiansen 2018-10-23 10:19:23 MDT

Thanks Paul.

Comment 10 Paul Edmon 2018-10-24 08:37:02 MDT

Just a follow up from my user, he says:

The signal my jobs are looking for is SIGUSR1, not SIGTERM. My jobs get
SIGUSR1 because I set the flag "--signal B:USR1@240". This causes SIGUSR1
to be sent to the job script 240 seconds before it hits the time limit. It
would be great if this signal were sent before pre-emption as well.

I don't know if you have any suggestions for him about this.  Or is there any way that --signal could apply to preemption too?
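
For context, a job script of the kind the user describes might look like the sketch below. The handler name `on_usr1`, the marker file, and the `sleep` stand-in for the real workload are assumptions; the background `kill` only simulates Slurm delivering the signal. The workload runs in the background because bash defers a trap until the current foreground command returns, whereas `wait` is interrupted immediately.

```shell
#!/bin/bash
# Ask Slurm to send SIGUSR1 to the batch shell 240 seconds before the time limit.
#SBATCH --signal=B:USR1@240

# Sketch only: on_usr1 and WARNED_FILE are hypothetical names.
WARNED_FILE="${TMPDIR:-/tmp}/usr1_seen.$$"

on_usr1() {
    echo "warning signal received" > "$WARNED_FILE"
    # A real job would trigger its checkpoint here.
}
trap on_usr1 USR1

# Stand-in for the real workload, started in the background so that
# the trap can run as soon as the signal arrives.
sleep 2 &
work_pid=$!

# Demo only: simulate the warning signal.
( sleep 0.3; kill -USR1 $$ ) &

status=0
wait "$work_pid" || status=$?
echo "wait returned $status after the warning signal"

# Demo cleanup: stop the stand-in workload.
kill "$work_pid" 2>/dev/null
```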

-Paul Edmon-

On 10/23/18 12:19 PM, bugs@schedmd.com wrote:
> Brian Christiansen <mailto:brian@schedmd.com> changed bug 5867 
> <https://bugs.schedmd.com/show_bug.cgi?id=5867>
> What 	Removed 	Added
> Status 	UNCONFIRMED 	RESOLVED
> Resolution 	--- 	INFOGIVEN

Comment 11 Brian Christiansen 2018-10-24 15:55:15 MDT

This is where GraceTime does come into play -- but only for cancel. I'm looking into adding the ability for the signal to be sent at preemption time.

Comment 19 S Senator 2018-12-19 11:13:07 MST

Our users would benefit from the request described in comments #10 and #11. Their use case is identical: catching a signal whenever a checkpoint may be necessary, whether due to requeueing or end of allocation time. They also need to be able to choose whether the signal is delivered to the batch script and/or the executable launched by srun ("--signal B:USR1@240", as in comment #10).

Comment 20 Brian Christiansen 2018-12-21 11:35:28 MST

Sorry for the delay. I had been investigating sending the signal for any type of preemption. However, after testing and thinking about it more I think that the signal should be sent any time the job is being cancelled (this would include requeued jobs) if it hasn't already been sent due to specified time with --signal. For example, even scancel <jobid> would trigger the signal (if configured on the job) to be sent.

So what I'm proposing is that there be no default time for --signal (currently 60 seconds) and that the signal be sent when the job is being cancelled. If no time is given to --signal, then the job will have KillWait time to do any extra work before the job is SIGKILLED.

Let me know if you have any thoughts on this.

Comment 21 Ben Santos 2018-12-21 11:35:46 MST

I am currently out of the office.  I will have limited email access but will follow up as soon as possible. If you have an urgent issue, please contact the HPC Consulting Office consult@lanl.gov or 505-665-4444 opt 3.

Comment 22 Paul Edmon 2018-12-21 11:39:46 MST

I think that makes sense.  Thanks!

-Paul Edmon-

On 12/21/2018 1:35 PM, bugs@schedmd.com wrote:
> Brian Christiansen <mailto:brian@schedmd.com> changed bug 5867 
> <https://bugs.schedmd.com/show_bug.cgi?id=5867>
> What 	Removed 	Added
> Severity 	4 - Minor Issue 	5 - Enhancement

Comment 23 S Senator 2018-12-21 14:04:13 MST

First-cut thoughts: yes, re-send the signal without regard to the Requeue setting. This is the simplest and most general case, so it is easy to communicate to users and easy for them to code to.

Comment 24 mike coyne 2019-01-07 10:01:26 MST

Brian, 
  Question: if GraceTime is specified, will the end time of the job be set to now + GraceTime when it is preempted (hopefully for both cancel and requeue)? Will the user-specified signal then be sent to the job at jobendtime - @userspecified time if GraceTime >= @userspecified time, and otherwise at the time of preemption? And will SIGKILL then be sent at jobendtime + KillWait if the job has not otherwise exited?

Mike Coyne
mcoyne@lanl.gov

Comment 25 Brian Christiansen 2019-01-09 10:31:54 MST

Steven,

Just to clarify, it's not "re-sending". It'll just send the signal if it hasn't yet.


Mike,

With GraceTime, this is what happens currently:

1. Job is "preempted" (in gracetime fashion)
1a. Job is sent SIGCONT, SIGTERM
1b. Job's end_time is set to now + gracetime
2. The user signal is sent once the job is within the signal time of its end_time. Because the end time has been brought forward, the signal gets sent.
3. Job is sent SIGCONT, SIGTERM at gracetime (the new job end time).

Using GraceTime gives the behavior that I would expect, with the job getting the signal at the end of the job. The proposed change makes it so that the signal will be sent if the job is ever killed.

Also note that currently GraceTime only works for PreemptMode=cancel.
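
The timeline above can be sketched as simple arithmetic, in seconds relative to the moment of preemption. `grace_timeline` is a hypothetical helper, not a Slurm command. It assumes the user signal goes out immediately when the `--signal` time is larger than GraceTime, which is one reading of step 2, and it takes the final SIGKILL to follow KillWait seconds after the new end time, as in comment 1. Real timings depend on how often slurmctld checks job end times.

```shell
#!/bin/bash
# Sketch of the GraceTime timeline, in seconds relative to preemption.
# grace_timeline is a hypothetical helper for illustration only.

grace_timeline() {
    local grace=$1 killwait=$2 warn=$3
    local end=$grace                 # new end_time = now + GraceTime
    local sig_at=$(( end - warn ))   # --signal fires <warn> seconds before end_time
    if [ "$sig_at" -lt 0 ]; then
        sig_at=0                     # already inside the warning window
    fi
    echo "t+0: SIGCONT, SIGTERM"
    echo "t+${sig_at}: user signal"
    echo "t+${end}: SIGCONT, SIGTERM"
    echo "t+$(( end + killwait )): SIGKILL"
}

# GraceTime=600, KillWait=30, --signal=B:USR1@240
grace_timeline 600 30 240
```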

Comment 26 S Senator 2019-04-02 10:23:41 MDT

Created attachment 9771 [details]
Patch to allow user-selectable warning signal

Please consider reviewing and including the attached patch which enables user-selectable signals for the job warning

Comment 27 Brian Christiansen 2019-04-09 10:49:52 MDT

Sorry for the delay. Here is the branch with my proposed changes.

https://github.com/SchedMD/slurm/commits/preemptsig

Let me know if you have any questions with the changes. I'll look at getting them into master/19.05 this week.

Thanks,
Brian

Comment 28 S Senator 2019-04-09 11:08:09 MDT

Would you be able to review the patch that I had attached earlier which allows the specific signal number to also be specified? That is a local patch that we're presently carrying forward for our users. Ideally, we would appreciate seeing this in the main branch.

Thank you,
-Steve Senator

Comment 29 Brian Christiansen 2019-04-09 12:32:06 MDT

Sorry if I'm missing something. Your patch looks like it just prevents the SIGTERM from being sent if your warning signal hasn't been sent yet.

A user can specify which signal they want with the --signal=<sig_num> option.

e.g.
sbatch --signal=USR1 script.sh

https://slurm.schedmd.com/sbatch.html#OPT_signal

Comment 30 Brian Christiansen 2019-04-09 12:37:16 MDT

And in the branch, the warn_signal will be sent at preemption time.

Comment 31 S Senator 2019-04-09 13:12:46 MDT

I think the 5867 patch covers the case that we were trying to cover too.

Thank you,
-Steve Senator

Comment 32 S Senator 2019-04-09 14:18:12 MDT

Rather than send "SIGCONT" followed by "SIGTERM", as the preemption warning signal, send "SIGCONT" followed by the user's preferred warning signal or default to SIGTERM.

Comment 33 Brian Christiansen 2019-04-09 16:19:24 MDT

ok. I understand what you are doing now. You're forcing the user signal at the beginning of gracetime. 

Currently, even without my changes, you will still get the signal, as described in Comment 25, just not as soon as you are wanting/expecting.

Since we do say -- the comment in the preempt.c -- that we are signaling the job at the beginning of gracetime with SIGCONT, SIGTERM I could understand sending the user signal at the same time.

I can also understand not sending it there, because the job hasn't hit its signal time yet. Let me think about it some more.

What led you to make the change that you have?

What is your use case for GraceTime? I ask because a few customers have been using it to guarantee a minimum run time before being preempted. We are adding a new parameter to do this but in a better way.

Comment 34 S Senator 2019-04-09 18:22:44 MDT

Our users (or application readiness/development team) trigger a checkpoint and some other preparation for subsequent job runs upon receipt of the signal.
At present, we use GraceTime to bound the time for the jobs to guarantee that this checkpoint is written and associated metadata is synchronized with it.

Thank you,
-Steve Senator

Comment 35 mike coyne 2019-04-10 06:57:50 MDT

The original reason for creating the attached patch is that user jobs are being killed by the SIGTERM without being able to catch it. The jobs are killed in such a way that they cannot perform a checkpoint and recover from preemption. This is why my patch replaces the SIGCONT, SIGTERM combination with SIGCONT followed by the user-selected signal, reusing the --signal option for that purpose as well. If the --signal option is not set, the signal sent at preemption falls back to SIGTERM. In effect, the use of SIGTERM has become unacceptable for many user codes.

Comment 36 Brian Christiansen 2019-04-10 09:40:33 MDT

ok. Thanks for the explanations. I want to discuss it internally and will get back to you.

My inclination is to remove SIGCONT, SIGTERM from gracetime, because the job hasn't necessarily been preempted yet -- it has just had its end_time shortened -- and have the user signal sent either when the signal time hits or when the job is actually being killed. This would make it more like normal preemption.

We have one customer with a 10-minute gracetime who uses the user signal at the end of gracetime to signal the job; they ignore the SIGTERMs. However, they are using gracetime to provide a minimum run time before being preempted and will benefit from the new parameter.

Comment 37 Brian Christiansen 2019-04-12 16:51:55 MDT

ok, I/we have come full circle on this. We've decided to back off on my changes for the following reasons.

--signal=<sig>[@<time>] is used to signal a job so many seconds before the job's end time. If we were to send <sig> at preemption time or termination (e.g. cancel, requeue, time limit, etc.), the job wouldn't have the expected <time> to do necessary cleanup. It would also be overloading what SIGTERM is being used for. SIGTERM is the signal used to tell the job that the job is going to be terminated soon. In the GraceTime situation, the first set of SIGCONT,SIGTERM signals represent that the job has GraceTime+KillWait seconds left and the second set represents KillWait seconds left until the job is SIGKILL'ed.

As for gracetime preemption, we aren't ready to make changes here yet -- at least in 19.05. From previous run-ins with gracetime, we feel that GraceTime was initially intended to be used as a way to guarantee a minimum run time before the job could be preempted -- however the implementation was otherwise. This is why we are adding the PreemptExemptTime feature.

We would expect users to be able to wrap applications to catch and ignore the SIGTERM.
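
One way such a wrapper might look is sketched below. `run_shielded` is a hypothetical name. It relies on the POSIX rule that a signal ignored with `trap ''` stays ignored in the commands the shell starts, so the wrapped program never sees the early SIGTERM unless it installs its own handler. SIGKILL cannot be ignored, so KillWait still applies.

```shell
#!/bin/bash
# Sketch of a wrapper that shields an application from SIGTERM.
# run_shielded is a hypothetical name; the wrapped command is whatever is passed in.

run_shielded() {
    # The subshell keeps the ignored-signal setting from leaking into the caller.
    (
        trap '' TERM
        "$@"
    )
}

# Demo: the child sends SIGTERM to itself and keeps running.
out=$(run_shielded sh -c 'kill -TERM $$; echo "still running"')
echo "$out"
```

In a batch script this could be used as `run_shielded srun ./my_app`, where `./my_app` is a placeholder for the real executable.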

Let me know if you have any questions.

Thanks,
Brian

Comment 38 Brian Christiansen 2019-04-12 17:24:56 MDT

Actually, one more full circle (maybe half) :). We're considering adding this behavior with a flag. The flag will turn on sending the user signal at preemption time, whether it's gracetime preemption or normal preemption. And it will not send the SIGCONT and SIGTERM in gracetime preemption.

I'll make these changes in the branch and let you know.

Thanks,
Brian

Comment 39 S Senator 2019-04-12 17:26:03 MDT

Thank you for the detailed commentary.
Adding this behavior with a flag would be appreciated.

Thank you,
-Steve Senator

Comment 40 Brian Christiansen 2019-04-16 16:34:06 MDT

Alright, I've committed the changes to 19.05. It is turned on with:

SlurmctldParameters=preempt_send_user_signal

https://github.com/SchedMD/slurm/commit/d36947a8a53e8381d4e6a8ee9af98fcce8c22696

Let me know if you have any questions.

Thanks for everyone's help.
Brian
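
A quick way to confirm that the flag is active on a cluster might be to scan the `scontrol show config` output, as sketched below. `has_preempt_user_signal` is a hypothetical helper, and the exact layout of the `SlurmctldParameters` line is assumed to match the `KillWait` line shown earlier in this ticket. The canned input stands in for real `scontrol` output.

```shell
#!/bin/bash
# Sketch: detect preempt_send_user_signal in `scontrol show config` style
# output read from stdin. has_preempt_user_signal is a hypothetical helper.

has_preempt_user_signal() {
    awk '
        {
            # Everything before the first "=" is the parameter name.
            key = $0; sub(/=.*/, "", key); gsub(/[ \t]/, "", key)
            if (tolower(key) != "slurmctldparameters") next
            # Everything after the first "=" is a comma-separated option list.
            val = $0; sub(/^[^=]*=/, "", val); gsub(/[ \t]/, "", val)
            n = split(val, items, ",")
            for (i = 1; i <= n; i++)
                if (tolower(items[i]) == "preempt_send_user_signal") found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# On a cluster: scontrol show config | has_preempt_user_signal && echo enabled
# Demo with canned input:
printf 'KillWait                = 30 sec\nSlurmctldParameters     = allow_user_triggers,preempt_send_user_signal\n' |
    has_preempt_user_signal && echo "enabled"
```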

Comment 41 S Senator 2019-04-16 16:57:33 MDT

Thank you very much for all of your time and effort on this ticket,
-Steve Senator

________________________________________
From: bugs@schedmd.com <bugs@schedmd.com>
Sent: Tuesday, April 16, 2019 4:34:06 PM
To: Senator, Steven Terry
Subject: [Bug 5867] GraceTime for PreemptMode=REQUEUE

Brian Christiansen<mailto:brian@schedmd.com> changed bug 5867<https://bugs.schedmd.com/show_bug.cgi?id=5867>
What    Removed Added
Version Fixed           19.05.0pre4
Status  CONFIRMED       RESOLVED
Resolution      ---     FIXED

Comment 42 Marshall Garey 2021-10-07 09:20:54 MDT

*** Ticket 5515 has been marked as a duplicate of this ticket. ***