We use overtime limits (OverTimeLimit=720) in our configuration. One handy feature we had with our previous scheduler was an email notification that a job had gone over its limit. The actual notification logic allowed for notification at some percentage of the requested wall time:

notify = 80% * time_limit

We just had the percentage set to 100% to notify when a job went over.

We'd like to see about getting a feature like that into Slurm. I think we'd be fine with setting options in a job submit plugin to handle default behaviour if you thought it more appropriate to set these thresholds in the job request rather than having an option in slurm.conf. Simply having a mechanism where a job would send email when it reached some percentage of its time limit would be really helpful.

We're at 2.6.9, with plans to go to 14.x in late summer/early fall, FWIW.

Thanks.

Michael
The strigger command, as described at http://slurm.schedmd.com/strigger.html, can already execute an arbitrary program when a job approaches its time limit. Here is an example:

Execute the program "/home/joe/clean_up" when job 1234 is within 10 minutes of reaching its time limit.
> strigger --set --jobid=1234 --time --offset=-600 \
      --program=/home/joe/clean_up

A simpler solution for users would be to add a mail flag to Slurm's current email notification mechanism, which would be pretty simple.

In your current environment, is that notification value a global or per-job percentage?
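To connect the percentage-based notification described in the original request with strigger's --offset option, here is a minimal sketch. It assumes the time limit is known in minutes and that a negative offset means "this many seconds before the limit", as in the example above. The helper names offset_for_percent and print_trigger_cmd are hypothetical and not part of Slurm; the second one only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): convert "notify at PCT percent
# of the time limit" into the negative --offset, in seconds, used by strigger.
# Usage: offset_for_percent LIMIT_MINUTES PCT
offset_for_percent() {
    limit_sec=$(( $1 * 60 ))
    # Seconds left before the limit once PCT percent has elapsed.
    remaining=$(( limit_sec * (100 - $2) / 100 ))
    echo $(( -remaining ))
}

# Print (do not run) the strigger command for one job.
# Usage: print_trigger_cmd JOBID LIMIT_MINUTES PCT PROGRAM
print_trigger_cmd() {
    offset=$(offset_for_percent "$2" "$3")
    echo "strigger --set --jobid=$1 --time --offset=$offset --program=$4"
}
```

For a job with a 10 minute limit, print_trigger_cmd 1234 10 80 /home/joe/clean_up yields an offset of -120, that is, two minutes before the limit.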
My only concern with the trigger approach is that we'd like this as the default for all jobs. While I can set a mail type parameter in a job submit plugin, I'm not sure about setting a trigger. I get kind of worried about setting triggers when we have users who dump thousands of jobs in a batch.

I do like the idea of an additional mail type.

The old scheduler had this as a global setting. IIRC, there's no per-job option.

Thanks

Michael
What do you think about adding a handful of new mail triggers? This is what I have in mind: time_limit
That last comment got sent a bit prematurely...

What do you think about adding a handful of new mail triggers? This is what I have in mind:
time_limit - reached 100% of time limit
time_limit_90 - reached 90% of time limit
time_limit_80 - reached 80% of time limit
Multiple trigger times can be set on the same job if desired.
(In reply to Moe Jette from comment #4)
> What do you think about adding a handful of new mail triggers? This is what
> I have in mind:
> time_limit - reached 100% of time limit
> time_limit_90 - reached 90% of time limit
> time_limit_80 - reached 80% of time limit
> Multiple trigger times can be set on the same job if desired.

Would these be options to sbatch, or to strigger?
(In reply to Michael Gutteridge from comment #5)
> These being options to sbatch, or strigger?

sbatch (or set with a job_submit plugin).

The following would send email to the user at 80%, 90% and 100% of the time limit:
$ sbatch -n1 --mail-type=TIME_LIMIT,TIME_LIMIT_90,TIME_LIMIT_80 -t10 tmp

Or just on reaching the time limit:
$ sbatch -n1 --mail-type=TIME_LIMIT -t10 tmp
Let me know if those trigger points are good for you too or if you want different ones. This seems a bit better than having a single configurable trigger point in that users have some control over when they get notified. They can also be notified multiple times if desired.
(In reply to Moe Jette from comment #7)
> Let me know if those trigger points are good for you too or if you want
> different ones. This seems a bit better than having a single configurable
> trigger point in that users have some control over when they get notified.
> They can also be notified multiple times if desired.

I agree, this is much better than a single notification level. I am inclined to think that having a greater spread is desirable, say 50%, 80%, and 100% of TimeLimit. This matters particularly for shorter jobs, where it will allow folks a bit more time to take corrective action.
Created attachment 1047
implements email triggers at 50%, 80%, 90% and 100% of time limit
I've implemented this in version 14.11, which will not be released until November. The attached patch should also work with version 14.03, but that has not been tested.
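As a rough illustration of how a site wrapper might turn the thresholds from the attached patch into a --mail-type value, here is a small sketch. The helper mail_types_for is hypothetical, and the name TIME_LIMIT_50 is assumed by analogy with the TIME_LIMIT_90 and TIME_LIMIT_80 names proposed earlier in this thread; check the sbatch documentation for your Slurm version before relying on it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): map percentage thresholds to a
# comma-separated --mail-type list. TIME_LIMIT_50 is an assumed name, formed
# by analogy with TIME_LIMIT_90 and TIME_LIMIT_80 from this thread.
# Usage: mail_types_for PCT [PCT ...]
mail_types_for() {
    list=""
    for pct in "$@"; do
        case "$pct" in
            100)      name="TIME_LIMIT" ;;
            50|80|90) name="TIME_LIMIT_$pct" ;;
            *)        echo "unsupported threshold: $pct" >&2; return 1 ;;
        esac
        # Append with a comma only when the list is non-empty.
        list="${list:+$list,}$name"
    done
    echo "$list"
}
```

A submission in the style of the earlier examples could then be written as: sbatch -n1 --mail-type="$(mail_types_for 50 80 100)" -t10 tmp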