We're trying to limit the TRES (specifically GPUs) per user across the cluster. The issue we face is that there are multiple QOSs (e.g. normal and priority), and because MaxTRESPerUser is an attribute of a QOS, the limit applies per QOS. As a result, a user can access more TRES than intended. Is there any way to apply a per-user, cluster-wide limit independent of QOS? -greg
Hi Greg,
You can use `sacctmgr` to set a `grpTRES` on an account to constrain it and all of its children. This acts independently of QOS.
> sacctmgr modify account root set grpTRES=gres/gpu=1
In this example I have set an arbitrary limit of `grpTRES=gres/gpu=1`. Please replace it with your desired limits. Below you can see what it looks like when applied (albeit in a simple environment).
> sacctmgr show assoc format=account,user,GrpTRES
>    Account       User       GrpTRES
> ---------- ---------- -------------
>       root               gres/gpu=1
>       root       root
>         qa
>         qa malinowski
Hi Skyler, Per the documentation: "GrpTRES= The total count of TRES able to be used at any given time from jobs running from an association and its children or QOS. If this limit is reached new jobs will be queued but only allowed to run after resources have been relinquished from this group." GrpTRES is the limit on the association, so your example would place a maximum of 1 GPU across all users of the 'root' account. I'm after a "per user" limit. -Greg
Hi Greg, You are correct that `GrpTRES` is not a per-user constraint. Unfortunately, there is no `MaxTRESPerUser` option for account associations at this time. So no, you cannot currently apply a per-user, cluster-wide limit independent of QOS. One option is to set `MaxTRESPerUser` on the QOS bound to the partition. Depending on your configuration, this may give you what you want. Anything beyond that would be a feature request. Regards, Skyler
Hi Skyler, Thanks. Please consider this a feature request. -greg
Hi Greg,
There may be another option. You can run a similar command on all the users (per cluster). I know this may not be very ergonomic, especially with thousands of users and multiple clusters.
> sacctmgr modify user malinowski set grpTRES=gres/gpu=1 where cluster=qa
Does this cover your use case? Would you prefer this to be a single setting on a cluster?
Regards, Skyler
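To illustrate the "run it for every user" idea above, here is a minimal Python sketch that only builds the command strings (a dry run) and does not call `sacctmgr`. The function name `grp_tres_commands` and the second user name `greg` are hypothetical; the command text itself mirrors the one in the message.

```python
import shlex


def grp_tres_commands(users, cluster, tres="gres/gpu=1"):
    """Return one `sacctmgr modify user` command string per user (dry run only)."""
    return [
        "sacctmgr modify user {} set grpTRES={} where cluster={}".format(
            shlex.quote(user), tres, shlex.quote(cluster)
        )
        for user in users
    ]


# Hypothetical user list; in practice it would come from the site's own records.
for command in grp_tres_commands(["malinowski", "greg"], "qa"):
    print(command)
```

Printing the commands first makes it possible to review them before anything is applied to the accounting database.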
Hi Skyler, Your suggestion won't work for us as an individual user is associated with multiple accounts, and hence the GrpTRES would apply distinctly for each account. -Greg
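The arithmetic behind this objection can be sketched in a few lines of Python. This is a toy model, not Slurm code: it assumes the `grpTRES=gres/gpu=1` limit is enforced separately on each of the user's account associations, as described above, and the account names are hypothetical.

```python
GRP_GPU_LIMIT = 1  # grpTRES=gres/gpu=1 applied to each association of the user


def max_gpus_for_user(accounts, per_association_limit=GRP_GPU_LIMIT):
    """Upper bound on GPUs one user can hold when every association is capped separately."""
    return len(set(accounts)) * per_association_limit


# A user who belongs to two (hypothetical) accounts can reach twice the intended limit.
print(max_gpus_for_user(["physics", "chemistry"]))  # 2
```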
Hi Greg, I will need some more information about your configuration and what specifically you are trying to accomplish. Please provide a sample configuration based on your site and a scenario that illustrates the dilemma. We (SchedMD) need to better understand why account and QOS limits cannot satisfy your use case before we can proceed with a formal feature request. Thanks, Skyler
Hi Greg,
There is internal push-back on a feature like `MaxTRESPerUser` for accounts unless it can be shown that no other method accomplishes the desired outcome. Hence your feature request is on hold until more information proves the feature necessary.
After reviewing the ticket, I believe that a QOS on the partition with `MaxTRESPerUser`, with the optional flag `OverPartQOS`, may work for you. In the following example I will create a QOS that limits GRES cluster-wide.
> sacctmgr add qos global_limits
> sacctmgr modify qos global_limits set MaxTRESPerUser=gres/gpu=1
> sacctmgr mod qos global_limits set flags=overpartqos # optional: if set, jobs using this QOS will be able to override any limits used by the requested partition's QOS limits.
In your `slurm.conf`, add the following line before the partition definitions.
> PartitionName=DEFAULT Qos=global_limits
Or attach it to selected partitions as usual.
> PartitionName=debug Qos=global_limits
Then reconfigure `slurmctld`.
> scontrol reconfigure
The above example will limit GRES on a per-user basis across all partitions, hence the entire cluster.
Regards, Skyler
Hi Skyler,
Before this is implemented, please let me explain our environment:
- we have many partitions
- we have many accounts
- we have multiple QOS
- users can access all partitions
- users can belong to multiple accounts
Will your suggestion work if there are multiple partitions? Is TRES tracking for a partition QOS only accounting for usage on that partition?
The primary issue we are facing is users belonging to multiple accounts; by choosing different accounts when they submit jobs, they can exceed the desired global limit.
-greg
Hi Greg,
> Will your suggestion work if there are multiple partitions? Is TRES tracking for a partition QOS only accounting for usage on that partition?
A partition QOS tracks TRES across all partitions that share that same partition QOS. If you need partition-level control, you could create a QOS for each partition, which gives you finer-grained control.
> The primary issue we are facing is users belonging to multiple accounts; by choosing different accounts when they submit jobs, they can exceed the desired global limit.
Users can still submit from multiple accounts and QOS, but the partition QOS limit will never be breached unless the user submits with a QOS that has `Flags=OverPartQOS`.
Does that help and clear things up?
Regards, Skyler
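The behaviour described in this reply can be illustrated with a small Python toy model. It is a sketch under stated assumptions, not Slurm's implementation: the class `ToyLimits` and its fields are hypothetical, it tracks GPUs only, and it assumes that every partition shares one partition QOS, so the partition QOS counter is per user regardless of account or job QOS.

```python
from collections import defaultdict


class ToyLimits:
    """Toy model: per-user GPU limits on job QOS plus one shared partition QOS."""

    def __init__(self, job_qos_limits, partition_qos_limit, over_part_qos=()):
        self.job_qos_limits = job_qos_limits            # e.g. {"normal": 1, "priority": 1}
        self.partition_qos_limit = partition_qos_limit  # MaxTRESPerUser on the partition QOS
        self.over_part_qos = set(over_part_qos)         # job QOS names with Flags=OverPartQOS
        self.used_by_job_qos = defaultdict(int)         # (user, job QOS) -> GPUs in use
        self.used_by_partition_qos = defaultdict(int)   # user -> GPUs in use

    def try_start(self, user, qos, gpus):
        """Return True and record usage if the job fits within every applicable limit."""
        if self.used_by_job_qos[(user, qos)] + gpus > self.job_qos_limits[qos]:
            return False
        exempt = qos in self.over_part_qos
        if not exempt and self.used_by_partition_qos[user] + gpus > self.partition_qos_limit:
            return False
        self.used_by_job_qos[(user, qos)] += gpus
        self.used_by_partition_qos[user] += gpus
        return True


limits = ToyLimits({"normal": 1, "priority": 1}, partition_qos_limit=1)
print(limits.try_start("greg", "normal", 1))    # True
print(limits.try_start("greg", "priority", 1))  # False: the partition QOS cap is reached
```

In this model the second job is refused even though the `priority` QOS still has headroom, which is the per-user, cluster-wide effect Greg asked for. Listing `priority` in `over_part_qos` would let it through, mirroring the `OverPartQOS` caveat.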
Hi Skyler, Many thanks. A global QOS has been set up and is working a treat. -Greg
Hi Greg, Great to hear! I am glad I could find you a solution. Regards, Skyler