Ticket 954 - Add over-time limit notification
Summary: Add over-time limit notification
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 2.6.7
Hardware: Linux Linux
Severity: 5 - Enhancement
Assignee: Moe Jette
Reported: 2014-07-09 04:52 MDT by Michael Gutteridge
Modified: 2014-07-10 06:02 MDT
Site: FHCRC - Fred Hutchinson Cancer Research Center
Version Fixed: 14.11.0-pre2


Attachments
implements email triggers at 50%, 80%, 90% and 100% of time limit (9.08 KB, patch)
2014-07-10 06:01 MDT, Moe Jette
Description Michael Gutteridge 2014-07-09 04:52:17 MDT
We use overtime limits (OverTimeLimit=720) in our configuration.  One handy feature we had with our previous scheduler was an email notification that a job had gone over its limit.  The actual notification logic for time limits allowed for notification at some percentage of the requested wall time:

notify = 80% * time_limit

We just had the percentage set to 100% so that it notified when a job went over.

We'd like to see about getting a feature like that into Slurm.  I think we'd be fine with setting options in a job submit plugin to handle default behaviour if you thought it more appropriate to set these thresholds in the job request rather than having an option in slurm.conf.

Simply having a mechanism where a job would send email when the job reached some percentage of its time limit would be really helpful.

We're at 2.6.9, with plans to go to 14.x in late summer/early fall FWIW.

Thanks.

Michael
Comment 1 Moe Jette 2014-07-09 05:04:08 MDT
The strigger command as described here:
http://slurm.schedmd.com/strigger.html
can already execute an arbitrary program when a job approaches its time limit. Here is an example:
Execute the program "/home/joe/clean_up" when job 1234 is within 10 minutes of reaching its time limit.

> strigger --set --jobid=1234 --time --offset=-600 \
           --program=/home/joe/clean_up
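
The percentage-style threshold from the original request can be mapped onto the strigger offset with a little arithmetic. The following is only a sketch: the helper name strigger_cmd_for_percent and its arguments are hypothetical, and it prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print, but do not run, an strigger
# command that fires once a job has used PERCENT of its time limit.
# Usage: strigger_cmd_for_percent JOBID LIMIT_MINUTES PERCENT PROGRAM
strigger_cmd_for_percent() {
    jobid=$1
    limit_min=$2
    percent=$3
    program=$4
    # Seconds before the time limit at which PERCENT of it has elapsed,
    # expressed as a negative offset (0 means "at the time limit").
    offset=$(( -(limit_min * 60 * (100 - percent) / 100) ))
    echo "strigger --set --jobid=$jobid --time --offset=$offset --program=$program"
}

# 80% of a 60-minute limit is 12 minutes (720 seconds) before the limit.
strigger_cmd_for_percent 1234 60 80 /home/joe/clean_up
```

The arithmetic is integer-only, so thresholds that do not divide evenly are truncated to whole seconds.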

A simpler solution for users would be to add a mail flag to Slurm's current email notification mechanism, which would be pretty simple. 

In your current environment, is that notification value a global or per job percentage?
Comment 2 Michael Gutteridge 2014-07-09 05:45:21 MDT
My only concern with the trigger approach is that we'd like this as the default for all jobs.  While I can set a mail type parameter in a job submit plugin, I'm not sure about setting a trigger.  I also get worried about setting triggers when we have users who dump thousands of jobs in a batch.

I do like the idea of an additional mail type.  The old scheduler had this as a global setting; IIRC, there was no per-job option.

Thanks

Michael

Comment 3 Moe Jette 2014-07-09 08:33:29 MDT
What do you think about adding a handful of new mail triggers? This is what I have in mind:
time_limit
Comment 4 Moe Jette 2014-07-09 08:35:29 MDT
That last comment got sent a bit prematurely...

What do you think about adding a handful of new mail triggers? This is what I have in mind:
time_limit     - reached 100% of time limit
time_limit_90  - reached 90% of time limit
time_limit_80  - reached 80% of time limit
Multiple trigger times can be set on the same job if desired.
Comment 5 Michael Gutteridge 2014-07-09 09:27:48 MDT
(In reply to Moe Jette from comment #4)
> That last comment got sent a bit prematurely...
> 
> What do you think about adding a handful of new mail triggers? This is what
> I have in mind:
> time_limit     - reached 100% of time limit
> time_limit_90  - reached 90% of time limit
> time_limit_80  - reached 80% of time limit
> Multiple trigger times can be set on the same job if desired.

These being options to sbatch, or strigger?
Comment 6 Moe Jette 2014-07-09 09:31:19 MDT
(In reply to Michael Gutteridge from comment #5)
> (In reply to Moe Jette from comment #4)
> > What do you think about adding a handful of new mail triggers? This is what
> > I have in mind:
> > time_limit     - reached 100% of time limit
> > time_limit_90  - reached 90% of time limit
> > time_limit_80  - reached 80% of time limit
> > Multiple trigger times can be set on the same job if desired.
> 
> These being options to sbatch, or strigger?

sbatch (or set with a job_submit plugin). The following would send email to the user at 80%, 90% and 100% of the time limit:
$ sbatch -n1 --mail-type=TIME_LIMIT,TIME_LIMIT_90,TIME_LIMIT_80 -t10 tmp

Or just on reaching the time limit:
$ sbatch -n1 --mail-type=TIME_LIMIT -t10 tmp
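
The same request could presumably be carried inside the batch script instead of on the command line. This is a sketch only: it assumes a Slurm build that includes the mail types proposed in this ticket, the mail address is a placeholder, and run_job is a hypothetical stand-in for the real workload.

```shell
#!/bin/sh
#SBATCH -n1
#SBATCH -t10
#SBATCH --mail-type=TIME_LIMIT_80,TIME_LIMIT
# The address below is a placeholder, not a real recipient.
#SBATCH --mail-user=user@example.com

# Hypothetical stand-in for the real job payload.
run_job() {
    echo "job running"
}
run_job
```

Submitting this with a plain "sbatch script.sh" should then behave like the command-line form above, with mail sent at 80% and 100% of the 10-minute limit.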
Comment 7 Moe Jette 2014-07-09 09:34:58 MDT
Let me know if those trigger points are good for you too or if you want different ones. This seems a bit better than having a single configurable trigger point in that users have some control over when they get notified. They can also be notified multiple times if desired.
Comment 8 Michael Gutteridge 2014-07-10 05:33:11 MDT
(In reply to Moe Jette from comment #7)
> Let me know if those trigger points are good for you too or if you want
> different ones. This seems a bit better than having a single configurable
> trigger point in that users have some control over when they get notified.
> They can also be notified multiple times if desired.

I agree, this is much better than a single notification level.  I am inclined to think that a greater spread is desirable, say 50%, 80%, and 100% of TimeLimit.  This matters particularly for shorter jobs, since it will give folks a bit more time to take corrective action.
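
To make the point about short jobs concrete, here is a small worked sketch of how much time is left at each threshold discussed in this thread. The function name remaining_at_thresholds is hypothetical, and the arithmetic is integer-only.

```shell
#!/bin/sh
# Hypothetical helper: for a time limit in minutes, print how much time has
# elapsed and how much is left when each notification threshold is reached.
# Usage: remaining_at_thresholds LIMIT_MINUTES
remaining_at_thresholds() {
    limit_min=$1
    for percent in 50 80 90 100; do
        # Integer division truncates fractional minutes.
        elapsed=$(( limit_min * percent / 100 ))
        echo "$percent% of limit: $elapsed min elapsed, $(( limit_min - elapsed )) min left"
    done
}

# For the 10-minute limit used in the sbatch examples (-t10).
remaining_at_thresholds 10
```

For a 10-minute job, the 90% notice leaves one minute to react, while a 50% notice leaves five.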
Comment 9 Moe Jette 2014-07-10 06:01:41 MDT
Created attachment 1047 [details]
implements email triggers at 50%, 80%, 90% and 100% of time limit
Comment 10 Moe Jette 2014-07-10 06:02:51 MDT
I've implemented this in version 14.11, which will not be released until November. The attached patch should also work with version 14.03, but that has not been tested.