| Summary: | question about maximum time set for slurm jobs | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | RAMYA ERANNA <reranna> |
| Component: | Configuration | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 22.05.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | SLAC | Alineos Sites: | --- |
| Attachments: | slurm.conf | ||
This option could be something you are interested in:
https://slurm.schedmd.com/sacctmgr.html#OPT_MaxWallDurationPerJob

You could add this association limit to the accounts you want and set it to the value you need.

    sacctmgr update account name=sub1 set maxwalldurationperjob+=10

I added a MaxWall time of 10 minutes to my account "sub1". I can still submit a 12-minute job to my account "sub2", which doesn't have a limit. When I submit a 12-minute job to "sub1", it is accepted but stays pending with this reason:

    $ squeue
    JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    210 normal wrap caden PD 0:00 0 (AssocMaxWallDurationPerJobLimit)

If you want these kinds of jobs to be rejected upfront, you can add the DenyOnLimit QOS flag.

    sacctmgr update qos name=normal set flags+=denyonlimit

My "sub1" account has the QOS "normal", so now the 12-minute job is rejected at submission.

    sbatch -Asub1 -t12 --wrap="sleep 10"
    sbatch: error: AssocMaxWallDurationPerJobLimit
    sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

It then does not land in the queue at all.

Does this answer your question?

Caden

Hi,

Can this value (maxwalldurationperjob) be higher than the system default time?

Thanks
Ramya

This is the order in which resource limits are enforced:
https://slurm.schedmd.com/resource_limits.html#hierarchy

If by system default time you mean the partition default time:
https://slurm.schedmd.com/slurm.conf.html#OPT_DefaultTime

Then yes, it can be higher. DefaultTime is only the time limit used when a job does not specify one. You can't go over the partition "MaxTime" parameter with the method I described.

Caden

Hi,

We have set the partition "MaxTime" parameter in slurm.conf to 10 days. If I now set the user association limit to 5 days for lcls:default and scavengers, does the 5-day limit take effect, or will the limit still be 10 days as set in slurm.conf?

Thank you
Ramya

Created attachment 32476: slurm.conf
lcls:default and scavengers will have the limit of 5 days if you set the assoc limit to 5 days, even if the partition has a MaxTime of 10 days. Other accounts that don't have a limit of 5 days will be able to go to 10 days on your partitions.

Caden

Let me know if you have other questions. Closing.

Caden
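The arithmetic behind that answer can be sketched in a few lines of shell. This is only an illustration of the rule as stated in the replies (a job's ceiling is the smaller of the partition MaxTime and the association limit, when one is set); the function names `days_to_minutes` and `effective_limit` are hypothetical helpers, not part of Slurm, and the sketch ignores any QOS settings that might change the outcome.

```shell
#!/bin/sh
# Hypothetical helper: convert a whole number of days to minutes.
days_to_minutes() {
    echo $(( $1 * 24 * 60 ))
}

# Hypothetical helper illustrating the assumed rule from the replies:
# the effective ceiling is the smaller of the partition MaxTime and the
# association limit. An empty second argument means "no association limit".
effective_limit() {
    partition_max=$1
    assoc_max=$2
    if [ -n "$assoc_max" ] && [ "$assoc_max" -lt "$partition_max" ]; then
        echo "$assoc_max"
    else
        echo "$partition_max"
    fi
}

partition=$(days_to_minutes 10)   # partition MaxTime of 10 days
assoc=$(days_to_minutes 5)        # association limit of 5 days

echo "lcls:default: $(effective_limit "$partition" "$assoc") minutes"
echo "other accounts: $(effective_limit "$partition" "") minutes"
```

With the values from this ticket, the limited accounts come out at 7200 minutes (5 days) and the rest at 14400 minutes (10 days).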
Hi Team,

Is it possible to set the maximum time for Slurm jobs per account? The lcls:default and scavenger accounts should have a maximum time of 5 days, while other normal accounts should have 10 days.

    [reranna@sdfmgr002 ~]$ sacctmgr show account | grep lcls
    lcls lcls lcls
    lcls:amol+ lcls:amolu0017 lcls:amolu0017
    lcls:amox+ lcls:amox33017 lcls:amox33017
    lcls:cxic+ lcls:cxic00121 lcls:cxic00121
    lcls:data lcls:data lcls:data
    lcls:detd+ lcls:detdaq21 lcls:detdaq21
    lcls:diad+ lcls:diadaq13 lcls:diadaq13
    lcls:mecc+ lcls:mecc00121 lcls:mecc00121
    lcls:mecl+ lcls:mecl1002021 lcls:mecl1002021
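Following the approach from the replies, the request could be sketched as a small dry-run script. Several things here are assumptions: the account names `lcls:default` and `scavengers` are copied from the ticket text and may not match the names in the accounting database exactly, `set_max_wall` and `DRY_RUN` are hypothetical names introduced for illustration, and the limit is written with a plain `=` in `days-hours:minutes:seconds` form. The `-i` flag asks sacctmgr to commit without prompting.

```shell
#!/bin/sh
# Hypothetical helper: print (or run) the sacctmgr command that caps the
# per-job wall time for one account. DRY_RUN=1 (the default) only prints.
set_max_wall() {
    account=$1
    limit=$2
    cmd="sacctmgr -i update account name=$account set maxwalldurationperjob=$limit"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        # Relies on word splitting; the arguments contain no spaces.
        $cmd
    fi
}

# Account names as written in the ticket (assumed, verify before running).
for acct in lcls:default scavengers; do
    set_max_wall "$acct" 5-00:00:00
done
```

Reviewing the printed commands before setting `DRY_RUN=0` keeps the change easy to audit. Accounts left out of the loop keep no association limit and stay bounded by the partition MaxTime.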