| Summary: | Limit TRES per user across the cluster? | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Greg Wickham <greg.wickham> |
| Component: | Accounting | Assignee: | Skyler Malinowski <skyler> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 20.11.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | KAUST | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Greg Wickham
2021-02-09 03:01:41 MST
Hi Greg,

You can use `sacctmgr` to set a `grpTRES` on an account to constrain it and all of its children. This acts independently of QOS.

> sacctmgr modify account root set grpTRES=gres/gpu=1

In this example I have set an arbitrary limit of `grpTRES=gres/gpu=1`. Please replace it with your desired limits. Below you can see what it would look like when applied (albeit in a simple environment).

> sacctmgr show assoc format=account,user,GrpTRES

   Account       User       GrpTRES
---------- ---------- -------------
      root               gres/gpu=1
      root       root
        qa
        qa malinowski

Hi Skyler,

From the documentation:

GrpTRES= The total count of TRES able to be used at any given time from jobs running from an association and its children or QOS. If this limit is reached, new jobs will be queued but only allowed to run after resources have been relinquished from this group.

GrpTRES is the limit on the association, hence your example would place a limit of a maximum of 1 GPU across all users of the 'root' account. I'm after a "per user" limit.

-Greg

Hi Greg,

You are correct in that `GrpTRES` is not a per-user constraint. Unfortunately there is not a `MaxTRESPerUser` option for account associations at this time. So no, you cannot apply a per-user, cluster-wide limit independent of QOS at this time. An option is to have the partition-bound QOS set `MaxTRESPerUser`. Depending on your configuration, this may meet what you want. Anything beyond that would be a feature request.

Regards,
Skyler

Hi Skyler,

Thanks. Please consider this a feature request.

-greg

Hi Greg,
There may be another option. You can run a similar command on all the users (per cluster). I know this may not be very ergonomic, especially with thousands of users and multiple clusters.
> sacctmgr modify user malinowski set grpTRES=gres/gpu=1 where cluster=qa
Does this cover your use case? Would you prefer this to be a single setting on a cluster?
Regards,
Skyler
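If this per-user route were taken at scale, the command above could be scripted over every association on a cluster. The following is only a sketch, not from the ticket: the `DRY_RUN` switch and the placeholder user names are mine, and the use of `sacctmgr -i` (commit immediately, no prompt) and `sacctmgr -nP show assoc` (no header, parsable output) to enumerate users is an assumption about how one might wire it up.

```shell
# Dry-run sketch: apply one per-user GrpTRES limit on every user of a
# cluster. With DRY_RUN=1 the sacctmgr commands are only printed; set it
# to 0 on a real Slurm system to execute them.
DRY_RUN=1
CLUSTER=qa
LIMIT="grpTRES=gres/gpu=1"

if [ "$DRY_RUN" = 1 ]; then
    users="malinowski wickham"   # placeholder names standing in for real output
else
    # Distinct user names holding an association on this cluster.
    users=$(sacctmgr -nP show assoc cluster="$CLUSTER" format=user | sort -u)
fi

for u in $users; do
    cmd="sacctmgr -i modify user $u set $LIMIT where cluster=$CLUSTER"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"
    else
        $cmd
    fi
done
```

As noted in the thread, though, this still sets the limit per association rather than per user, so it does not solve the multiple-accounts problem on its own.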
Hi Skyler,

Your suggestion won't work for us, as an individual user is associated with multiple accounts, and hence the GrpTRES would apply distinctly for each account.

-Greg

Hi Greg,

I will need some more information regarding your configuration and what specifically you are attempting to accomplish. Please provide a sample configuration based on your site and a situation that illustrates the dilemma. We (SchedMD) need to better understand why account and QOS limits cannot satisfy your use case before we can proceed with a formal feature request.

Thanks,
Skyler

Hi Greg,

There is internal push-back on a feature like `MaxTRESPerUser` for accounts, unless it can be shown that there is no other method to accomplish the desired outcome. Hence your feature request is on hold until more information proves the feature necessary.

After reviewing the ticket, I believe that a QOS on the partition with `MaxTRESPerUser`, together with the optional flag `OverPartQOS`, may work for you. In the following example I create a QOS that limits GRES cluster-wide.

> sacctmgr add qos global_limits
> sacctmgr modify qos global_limits set MaxTRESPerUser=gres/gpu=1
> sacctmgr mod qos global_limits set flags=overpartqos  # optional: if set, jobs using this QOS will be able to override the limits of the requested partition's QOS.

In your `slurm.conf`, add the following line before the partition definitions:

> PartitionName=DEFAULT Qos=global_limits

Or attach it to select partitions as normal:

> PartitionName=debug Qos=global_limits

Then reconfigure `slurmctld`:

> scontrol reconfigure

The above example limits GRES on a per-user basis across all partitions, hence the entire cluster.

Regards,
Skyler

Hi Skyler,

Before this is implemented, please let me explain our environment:

- we have many partitions
- we have many accounts
- we have multiple QOS
- users can access all partitions
- users can belong to multiple accounts

Will your suggestion work if there are multiple partitions?
Is TRES tracking for a partition QOS only accounting for usage on that partition? The primary issue we are facing is users belonging to multiple accounts: by choosing different accounts when they submit jobs, they can exceed the desired global limit.

-greg

Hi Greg,

> Will your suggestion work if there are multiple partitions? Is TRES tracking for a partition QOS only accounting for usage on that partition?

A partition QOS will track all TRES across all partitions that share that same partition QOS. If you need partition-level control, you could make a QOS for each partition, giving you even more fine-grained control.

> The primary issue we are facing is users belonging to multiple accounts: by choosing different accounts when they submit jobs, they can exceed the desired global limit.

Users can still submit from multiple accounts and QOS, but the partition QOS will never be breached unless the user submits with a QOS that has `Flags=OverPartQOS`.

Does that help and clear things up?

Regards,
Skyler

Hi Skyler,

Many thanks. A global QOS has been set up and is working a treat.

-Greg

Hi Greg,

Great to hear! I am glad I could find you a solution.

Regards,
Skyler
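For reference, the working recipe from this thread can be condensed into one script. This is only a sketch: the `run` wrapper, the `APPLY` switch, and the `-i` (immediate, non-interactive) flag on `sacctmgr` are my additions; the QOS name `global_limits` and the `gres/gpu=1` limit come from the example in the ticket, and the `slurm.conf` line still has to be added by hand.

```shell
# Sketch of the cluster-wide, per-user GRES limit recipe. With APPLY unset
# (the default) every command is only printed, so this is safe to run on a
# machine without Slurm; set APPLY=1 to execute for real.
APPLY=${APPLY:-0}

run() {
    if [ "$APPLY" = 1 ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

run sacctmgr -i add qos global_limits
run sacctmgr -i modify qos global_limits set MaxTRESPerUser=gres/gpu=1
# Optional: allow jobs submitted with this QOS to override partition QOS limits.
run sacctmgr -i modify qos global_limits set flags=overpartqos

# slurm.conf, before the partition definitions (edit by hand):
#   PartitionName=DEFAULT Qos=global_limits

run scontrol reconfigure
```

Because the limit lives on the partition QOS rather than on any account association, it holds regardless of which of a user's accounts a job is submitted under, which is what resolved this ticket.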