| Summary: | What are the QOS sacctmgr commands to set the below for all users. | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Bill Pappas <bpappas> |
| Component: | Configuration | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | akail |
| Version: | 20.02.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Analysis Group | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 20.02.06 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
Hi Bill,
You should be able to accomplish what you're looking for by setting a per-user GPU limit and a group limit on a QOS, then associating that QOS with a partition. Setting the MaxTRESPerJob limit is somewhat redundant, since the per-user limit effectively enforces a per-job limit as well, but it doesn't hurt to set it. I set up an example to demonstrate how to accomplish this.
I started by setting limits on the number of GPUs for MaxTRESPerUser, MaxTRESPerJob and GrpTRES (I used 4, 4 and 8 respectively to make it easier to demonstrate on my test system).
$ sacctmgr show qos limited format=name,maxtrespu,maxtresperjob,grptres
Name MaxTRESPU MaxTRES GrpTRES
---------- ------------- ------------- -------------
limited gres/gpu=4 gres/gpu=4 gres/gpu=8
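For reference, limits like these can be set on a QOS with sacctmgr along these lines (a sketch based on the 'limited' QOS from this example; run as a Slurm administrator):

```shell
# Create the QOS (skip if it already exists)
sacctmgr add qos limited

# Per-user, per-job, and aggregate GPU limits on the QOS
sacctmgr modify qos limited set MaxTRESPerUser=gres/gpu=4
sacctmgr modify qos limited set MaxTRESPerJob=gres/gpu=4
sacctmgr modify qos limited set GrpTRES=gres/gpu=8
```

These commands talk to slurmdbd, so they require accounting to be configured; sacctmgr prompts for confirmation unless run with -i.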
I associated this QOS with the 'high' partition in my slurm.conf. You can see that scontrol shows the 'limited' QOS associated with the partition. This means that users who request this partition will have any limits from the 'limited' QOS applied to their jobs.
$ scontrol show partition high | grep QoS
AllocNodes=ALL Default=NO QoS=limited
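For reference, that association is made on the partition line in slurm.conf, along these lines (the node list here is a placeholder for this sketch):

```
# slurm.conf (fragment) -- node names are hypothetical
PartitionName=high Nodes=node[01-02] QOS=limited State=UP
```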
As my user, I submitted two jobs to this partition, each requesting 4 GPUs. The first job was able to start, but the second stayed pending because it would have violated the MaxTRESPerUser limit.
$ sbatch -phigh --gpus=4 --wrap='srun sleep 60'
Submitted batch job 25782
$ sbatch -phigh --gpus=4 --wrap='srun sleep 60'
Submitted batch job 25783
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
25783 high wrap ben PD 0:00 1 (QOSMaxGRESPerUser)
25782 high wrap ben R 0:01 1 node01
Then I became 'user1' and submitted two more jobs in the same way. Again, the first job was able to run, but the second was blocked; this time squeue shows the pending jobs hitting the GrpTRES limit.
$ sbatch -phigh --gpus=4 --wrap='srun sleep 60'
Submitted batch job 25784
$ sbatch -phigh --gpus=4 --wrap='srun sleep 60'
Submitted batch job 25785
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
25785 high wrap user1 PD 0:00 1 (QOSGrpGRES)
25783 high wrap ben PD 0:00 1 (QOSGrpGRES)
25784 high wrap user1 R 0:05 1 node02
25782 high wrap ben R 0:15 1 node01
I became 'user2' and submitted a similar job. It isn't able to run until the number of GPUs in use by jobs drops below the GrpTRES limit of 8, even though 'user2' hasn't reached the MaxTRESPerUser limit.
$ sbatch -phigh --gpus=4 --wrap='srun sleep 60'
Submitted batch job 25786
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
25786 high wrap user2 PD 0:00 1 (QOSGrpGRES)
25785 high wrap user1 PD 0:00 1 (QOSGrpGRES)
25783 high wrap ben PD 0:00 1 (QOSGrpGRES)
25784 high wrap user1 R 0:16 1 node02
25782 high wrap ben R 0:26 1 node01
Does this look like it will work for what you are trying to do?
Thanks,
Ben
Hi Bill,
I wanted to make sure the information I sent on enforcing the limits you described looks like it will work for you. Let me know if you have any additional questions, or if it's ok to close the ticket.
Thanks,
Ben

Hi Bill,
I believe the information I sent answered your questions about enforcing limits, and I haven't heard any follow-up questions. I'll go ahead and close this ticket, but feel free to update it if you do have additional questions.
Thanks,
Ben

Please close.
Bill Pappas

(On Mar 19, 2021, Ben Roberts changed bug 11007: Resolution --- → INFOGIVEN; Status OPEN → RESOLVED.)

Hi Ben,
I'm working with Bill on this project and we are unable to verify that the settings have taken effect. When we submit jobs using the settings Bill provided, users requesting over 24 GPUs in total are not being limited. For instance, we can submit fifty 1-GPU jobs and they all run. I will provide output on that during off hours, when there is more availability. This afternoon we submitted thirty 1-GPU jobs; several ran, but the pending reason for the rest was Priority or Resources, not a "QOS" reason. Is there another way we can confirm these limits are working?
Thanks,
Andrew

Hi Andrew,
The limits you set with sacctmgr are stored in the database and the command communicates with slurmdbd. You should be able to see what slurmctld knows about the limits that are defined with sacctmgr by running 'scontrol show assoc flags=qos'.
As an example I have the following limits defined with sacctmgr:
$ sacctmgr show qos limited format=name,maxtres,grptres%20
Name MaxTRES GrpTRES
---------- ------------- --------------------
limited gres/gpu=4 gres/gpu=10,node=6
Here is what scontrol sees for these limits:
$ scontrol show assoc flags=qos qos=limited
Current Association Manager state
QOS Records
QOS=limited(60)
UsageRaw=0.000000
GrpJobs=N(0) GrpJobsAccrue=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=6(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/asdf=N(0),gres/gpu=10(0),gres/test=N(0),license/local=N(0),license/testlic=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/asdf=N(0),gres/gpu=N(0),gres/test=N(0),license/local=N(0),license/testlic=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/asdf=N(0),gres/gpu=N(0),gres/test=N(0),license/local=N(0),license/testlic=N(0)
MaxWallPJ=
MaxTRESPJ=gres/gpu=4
MaxTRESPN=
MaxTRESMinsPJ=
MinPrioThresh=
MinTRESPJ=
PreemptMode=OFF
Priority=50
Account Limits
No Accounts
User Limits
No Users
You can see that the GrpTRES shows the limits of 6 nodes and 10 GPUs.
GrpTRES=...,node=6(0)...,gres/gpu=10(0)
The MaxTRES limit also shows up in the output.
MaxTRESPJ=gres/gpu=4
If you aren't seeing these limits enforced, can you send the output of 'scontrol show assoc flags=qos', along with the 'squeue' output and the 'scontrol show job <jobid>' output for one of the jobs that should have the limit enforced but is still able to run?
Thanks,
Ben
[root@head ~]# sacctmgr show qos gpu_limits format=name,maxtres,grptres%20
Name MaxTRES GrpTRES
---------- ------------- --------------------
gpu_limits gres/gpu=24
[root@head01 ~]# scontrol show assoc flags=qos qos=gpu_limits
Current Association Manager state
QOS Records
QOS=gpu_limits(14)
UsageRaw=0.000000
GrpJobs=N(0) GrpJobsAccrue=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0),gres/gpu:tesla_v100=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0),gres/gpu:tesla_v100=N(0)
MaxWallPJ=
MaxTRESPJ=gres/gpu=24
MaxTRESPN=
MaxTRESMinsPJ=
MinPrioThresh=
MinTRESPJ=
PreemptMode=OFF
Priority=0
Account Limits
No Accounts
User Limits
No Users
It looks like GrpTRES doesn't have any GPUs configured, which is odd.
Looking at the partition as well:
[root@head01 ~]# scontrol show partition hgx-1
PartitionName=hgx-1
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=gpu_limits
Does AccountingStorageEnforce need to be set?
Do we need to set AccountingStorageEnforce to qos to enforce these settings? If we do, I believe we also need to configure the new QOS as the default, right?

Hi Andrew,
My apologies for the delayed response; I was out of the office last week. You do need to have AccountingStorageEnforce configured with 'qos' specified for this to be enforced. My apologies for not verifying that you had this configured previously. You don't need to make this QOS the default, because you have it associated with the hgx-1 partition. This means that any job that goes to that partition will have the limits defined in the gpu_limits QOS applied without the user specifying that QOS manually. Let me know if you still have problems getting this enforced with AccountingStorageEnforce set.
Thanks,
Ben

Thanks Ben. I believe we have narrowed our issue down to a lack of associations in the Slurm database. We tested on another system, and the QOS works only for the root user, which is in the database. The system owner is not interested in maintaining their slurmdbd list of users right now, so we need to find another way around this if possible.

Hi Andrew,
I'm afraid there isn't a way to enforce limits in the way you're asking without having user associations created. I understand that it can be a lot of extra work to maintain another list of users. Since we support the ability to put users in multiple accounts, which can get complicated in a hurry, we don't have an automated option for creating users from AD. One option you might be able to implement would be a script that monitors for new users and adds them to a default account in Slurm. If you don't care about splitting users into different accounts to track different types of usage, that might work for you; or the script could handle most cases, with users placed in different accounts manually in special cases.
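For reference, the slurm.conf setting discussed above looks along these lines (per the slurm.conf documentation, 'limits' and 'qos' each imply 'associations'):

```
# slurm.conf (fragment)
AccountingStorageEnforce=limits,qos
```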
I'm afraid that automating something like that is outside the scope of our support, but I wanted to bring it up as a possibility.
Thanks,
Ben

Thanks Ben, appreciate the help on this. We'll be going the script route to automate the process.
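A minimal sketch of the script approach described above, assuming local passwd entries with UID >= 1000 should land in a single 'default' Slurm account (the account name, UID cutoff, and scheduling via cron are assumptions, not part of this ticket):

```shell
#!/bin/bash
# Sketch: add OS users missing from the Slurm database to a default account.
# Run periodically (e.g. from cron) on a node that can reach slurmdbd.
# The 'default' account and the UID >= 1000 cutoff are site-specific assumptions.
for user in $(getent passwd | awk -F: '$3 >= 1000 {print $1}'); do
    # 'sacctmgr -n show user' prints nothing when the user has no record
    if [ -z "$(sacctmgr -n show user "$user")" ]; then
        sacctmgr -i add user "$user" account=default   # -i: skip confirmation
    fi
done
```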
What are the QOS sacctmgr commands to set the below for all users? I see these to set the 24-GPU-per-user simultaneous-use limit and the max job size:

sacctmgr modify QOS default set MaxTRESPerUser=gres/gpu=24
sacctmgr modify QOS default set MaxTRESPerJob=gres/gpu=24

I am not sure how to restrict 48 GPUs submitted in the partition. See below:

(iii) Restrict a user to 24 GPUs in simultaneous use (resource use per user)
(iv) Restrict a user to 48 GPUs requested by jobs in the queue (resource requests submitted per user)

Configuring limits (max job count, max job size, etc.):
○ Per-job limits (e.g. MaxNodes)
○ Aggregate limits by user, account or QOS (e.g. GrpJobs)

(v) Configure max job size = 24 GPUs in Slurm

sacctmgr: list tres
    Type            Name     ID
-------- --------------- ------
     cpu                      1
     mem                      2
  energy                      3
    node                      4
 billing                      5
      fs            disk      6
    vmem                      7
   pages                      8
    gres             gpu   1001
    gres gpu:tesla_v100   1002

sacctmgr: list qos
(wide output; most columns are empty)
Name Priority GraceTime Preempt PreemptExemptTime PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall MaxTRES MaxTRESPerNode MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU MaxTRESPA MaxJobsPA MaxSubmitPA MinTRES
normal 0 00:00:00 cluster 1.000000 3 3
default 0 00:00:00 cluster 1.000000 3
special 100 00:00:00 cluster 1.000000

sacctmgr: list user
      User   Def Acct     Admin
---------- ---------- ---------
  pragnesh    default      None
      root       root Administ+
   trickey    default      None
  z003tjsa   trinidad      None

sacctmgr: list account
   Account                Descr                  Org
---------- -------------------- --------------------
   default              default              default
    needle               needle               needle
      root default root account                 root
  trinidad             trinidad              default