| Summary: | Add over-time limit notification | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Michael Gutteridge <mrg> |
| Component: | slurmctld | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | CC: | da |
| Version: | 2.6.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | FHCRC - Fred Hutchinson Cancer Research Center | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | 14.11.0-pre2 | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | implements email triggers at 50%, 80%, 90% and 100% of time limit | ||
|
Description
Michael Gutteridge
2014-07-09 04:52:17 MDT
The strigger command as described here: http://slurm.schedmd.com/strigger.html can execute an arbitrary program when a program approaches its time limit today. Here is an example:

Execute the program "/home/joe/clean_up" when job 1234 is within 10 minutes of reaching its time limit.

    > strigger --set --jobid=1234 --time --offset=-600 \
               --program=/home/joe/clean_up

A simpler solution for users would be to add a mail flag to Slurm's current email notification mechanism, which would be pretty simple.

In your current environment, is that notification value a global or per job percentage?

My only concern with the trigger approach is that we'd like this as the default for all jobs. While I can set a mail type parameter in a job submit plugin, I'm not sure about setting a trigger. I get kind of worried about setting triggers when we have users who dump thousands of jobs in a batch.

I do like the idea of an additional mail type. The old scheduler had this as a global setting. IIRC, there's no per-job option.
Thanks
Michael
----- Original Message -----
> Comment # 1 on bug 954 from Moe Jette
>
> The strigger command as described here:
> http://slurm.schedmd.com/strigger.html can execute an arbitrary program
> when a program approaches its time limit today. Here is an example:
>
> Execute the program "/home/joe/clean_up" when job 1234 is within 10
> minutes of reaching its time limit.
> > strigger --set --jobid=1234 --time --offset=-600 \
> >            --program=/home/joe/clean_up
>
> A simpler solution for users would be to add a mail flag to Slurm's
> current email notification mechanism, which would be pretty simple.
>
> In your current environment, is that notification value a global or
> per job percentage?
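The strigger example quoted above uses a fixed offset of -600 seconds, whereas the request in this thread is phrased as a percentage of the time limit. A minimal sketch of the conversion, assuming the time limit is known in minutes: `offset_for_percent` and `print_trigger_cmd` are hypothetical helpers, not part of Slurm, and the command is only printed, not executed. Only the strigger options already shown in this thread are used.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): convert "notify at PCT percent of a
# LIMIT_MIN-minute time limit" into the negative offset, in seconds, relative
# to the time limit, in the form used by --offset in the example above.
offset_for_percent() {
    limit_min=$1
    pct=$2
    # Seconds remaining after PCT percent has elapsed, negated.
    echo $(( 0 - limit_min * 60 * (100 - pct) / 100 ))
}

# Dry run: print the strigger command instead of executing it.
print_trigger_cmd() {
    jobid=$1
    limit_min=$2
    pct=$3
    program=$4
    offset=$(offset_for_percent "$limit_min" "$pct")
    echo "strigger --set --jobid=$jobid --time --offset=$offset --program=$program"
}

# 80% of a 50-minute limit leaves 10 minutes, i.e. an offset of -600 seconds.
print_trigger_cmd 1234 50 80 /home/joe/clean_up
```

One trigger per job would still have to be registered this way, which is the scaling concern raised above for users who submit thousands of jobs at once.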
What do you think about adding a handful of new mail triggers. This is what I have in mind:
time_limit

That last comment got sent a bit prematurely...

What do you think about adding a handful of new mail triggers? This is what I have in mind:
time_limit - reached 100% of time limit
time_limit_90 - reached 90% of time limit
time_limit_80 - reached 80% of time limit
Multiple trigger times can be set on the same job if desired.

(In reply to Moe Jette from comment #4)
> That last comment got sent a bit prematurely...
>
> What do you think about adding a handful of new mail triggers? This is what
> I have in mind:
> time_limit - reached 100% of time limit
> time_limit_90 - reached 90% of time limit
> time_limit_80 - reached 80% of time limit
> Multiple trigger times can be set on the same job if desired.

These being options to sbatch, or strigger?

(In reply to Michael Gutteridge from comment #5)
> (In reply to Moe Jette from comment #4)
> > What do you think about adding a handful of new mail triggers? This is what
> > I have in mind:
> > time_limit - reached 100% of time limit
> > time_limit_90 - reached 90% of time limit
> > time_limit_80 - reached 80% of time limit
> > Multiple trigger times can be set on the same job if desired.
>
> These being options to sbatch, or strigger?

sbatch (or set with job_submit plugin). The following would send email to the user at 80%, 90% and 100% of time limit:

    $ sbatch -n1 --mail-type=TIME_LIMIT,TIME_LIMIT_90,TIME_LIMIT_80 -t10 tmp

Or just on reaching the time limit:

    $ sbatch -n1 --mail-type=TIME_LIMIT -t10 tmp

Let me know if those trigger points are good for you too or if you want different ones. This seems a bit better than having a single configurable trigger point in that users have some control over when they get notified. They can also be notified multiple times if desired.

(In reply to Moe Jette from comment #7)
> Let me know if those trigger points are good for you too or if you want
> different ones. This seems a bit better than having a single configurable
> trigger point in that users have some control over when they get notified.
> They can also be notified multiple times if desired.

I agree; this is much better than a single notification level. I am inclined to think that having a greater spread is desirable, say 50%, 80%, and 100% of TimeLimit. This matters particularly for shorter jobs, since it will allow folks a bit more time to take corrective action.

Created attachment 1047 [details]
implements email triggers at 50%, 80%, 90% and 100% of time limit
I've implemented this in version 14.11, which will not be released until November. The attached patch should also work with version 14.03, but that has not been tested. |
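To make the trigger points concrete, here is a rough sketch of the arithmetic for the `-t10` job used in the sbatch examples above. It only computes the nominal elapsed time for each threshold; when slurmctld actually sends the mail depends on the implementation in the patch. `notify_points` is a hypothetical helper, and the name `TIME_LIMIT_50` is assumed from the attachment description and the naming pattern of the other mail types.

```shell
#!/bin/sh
# Illustrative arithmetic only: nominal elapsed seconds at which each
# time-limit mail type would apply for a job with a LIMIT_MIN-minute limit.
# notify_points is a hypothetical helper, not a Slurm command.
# TIME_LIMIT_50 is an assumed name, following the pattern of the other types.
notify_points() {
    limit_min=$1
    for entry in TIME_LIMIT_50:50 TIME_LIMIT_80:80 TIME_LIMIT_90:90 TIME_LIMIT:100; do
        name=${entry%%:*}   # text before the colon: mail type name
        pct=${entry##*:}    # text after the colon: percentage of the limit
        echo "$name $(( limit_min * 60 * pct / 100 ))"
    done
}

# For a 10-minute limit this prints 300, 480, 540 and 600 seconds.
notify_points 10
```

For short jobs the gap between the 90% and 100% points is small (60 seconds here), which is the motivation given above for also offering an earlier 50% notification.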