Ticket 10964

Summary: Setting maxnodes does not have the expected behavior; users are able to bypass the limit
Product: Slurm Reporter: Teddy Valette <teddy.valette>
Component: Limits Assignee: Marshall Garey <marshall>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: abatcha.olloh, hyacinthe.cartiaux, lyeager, Sebastien.Varrette, teddy.valette
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: University of Luxembourg
Linux Distro: CentOS
Attachments: slurm limit table
associations.txt
slurm.conf

Description Teddy Valette 2021-02-26 07:40:00 MST
Dear all,

We have detected a user who is making abusive use of our cluster. We therefore decided to restrict the user until he understands and changes the way he is using his scripts and jobs, with the following command:

 sacctmgr modify user where name=USERNAME set maxnodes=4

To check that the restriction is active, we ran this command (the output shows that the user is limited in both job count and node resources):

 sacctmgr show association where users=USERNAME format=cluster,account%20,user%15,share,qos%50,maxjobs,maxsubmit,maxtres,
   Cluster              Account            User     Share                                                QOS MaxJobs MaxSubmit       MaxTRES 
 ---------- -------------------- --------------- --------- -------------------------------------------------- ------- --------- ------------- 
       iris              ACCOUNT        USERNAME         1                   besteffort,debug,long,low,normal       7                  node=4 


Nevertheless, this user is still able to submit jobs and have many of them running, exceeding the 4-node restriction:

$ susage -u USERNAME
# sacct -X -S 2021-02-26 -E 2021-02-26 -u USERNAME --format User,JobID,partition%12,qos,state,time,elapsed,nnodes,ncpus,nodelist
     User        JobID    Partition        QOS      State  Timelimit    Elapsed   NNodes      NCPUS        NodeList 
--------- ------------ ------------ ---------- ---------- ---------- ---------- -------- ---------- --------------- 
   USERNAME 2270718             batch     normal    RUNNING 2-00:00:00   14:31:02        1         28        iris-112 
   USERNAME 2270719             batch     normal    RUNNING 2-00:00:00   14:20:36        1         28        iris-133 
   USERNAME 2270720             batch     normal    RUNNING 2-00:00:00   14:20:36        1         28        iris-140 
   USERNAME 2270721             batch     normal    RUNNING 2-00:00:00   14:01:35        1         28        iris-127 
   USERNAME 2270722             batch     normal    RUNNING 2-00:00:00   14:01:35        1         28        iris-111 
   USERNAME 2270723             batch     normal    RUNNING 2-00:00:00   13:55:34        1         28        iris-135 
   USERNAME 2270724             batch     normal    RUNNING 2-00:00:00   13:55:34        1         28        iris-128 
   USERNAME 2270725             batch     normal    RUNNING 2-00:00:00   13:55:34        1         28        iris-146 
   USERNAME 2270726             batch     normal    RUNNING 2-00:00:00   13:55:01        1         28        iris-162 
   USERNAME 2270727             batch     normal    RUNNING 2-00:00:00   13:37:31        1         28        iris-116 
   USERNAME 2270728             batch     normal    RUNNING 2-00:00:00   13:37:31        1         28        iris-132 
   USERNAME 2270729             batch     normal    RUNNING 2-00:00:00   13:37:31        1         28        iris-168 
   USERNAME 2270730             batch     normal    RUNNING 2-00:00:00   13:29:28        1         28        iris-115 
   USERNAME 2270731             batch     normal    RUNNING 2-00:00:00   13:29:28        1         28        iris-124 
   USERNAME 2270732             batch     normal    RUNNING 2-00:00:00   13:10:23        1         28        iris-117 
   USERNAME 2270733             batch     normal    RUNNING 2-00:00:00   13:10:23        1         28        iris-121 
   USERNAME 2270734             batch     normal    RUNNING 2-00:00:00   13:10:23        1         28        iris-126 
   USERNAME 2270735             batch     normal    RUNNING 2-00:00:00   11:54:06        1         28        iris-134 
   USERNAME 2270736             batch     normal    RUNNING 2-00:00:00   11:54:06        1         28        iris-136 
   USERNAME 2270737             batch     normal    RUNNING 2-00:00:00   08:58:04        1         28        iris-114 
   USERNAME 2270738             batch     normal    RUNNING 2-00:00:00   08:17:58        1         28        iris-120 
   USERNAME 2270739             batch     normal    RUNNING 2-00:00:00   07:46:52        1         28        iris-166 
   USERNAME 2270740             batch     normal    RUNNING 2-00:00:00   06:15:49        1         28        iris-129 
   USERNAME 2270741             batch     normal    RUNNING 2-00:00:00   05:57:46        1         28        iris-130 
   USERNAME 2270742             batch     normal    RUNNING 2-00:00:00   05:51:03        1         28        iris-157 
   USERNAME 2270743             batch     normal    RUNNING 2-00:00:00   05:45:43        1         28        iris-145 
   USERNAME 2270744             batch     normal    RUNNING 2-00:00:00   05:15:39        1         28        iris-167 
   USERNAME 2270745             batch     normal    RUNNING 2-00:00:00   04:25:41        1         28        iris-109 
   USERNAME 2270746             batch     normal    RUNNING 2-00:00:00   01:35:05        1         28        iris-164 
   USERNAME 2270747             batch     normal    RUNNING 2-00:00:00   01:12:25        1         28        iris-147 
   USERNAME 2270748             batch     normal    RUNNING 2-00:00:00   01:03:02        1         28        iris-152 
   USERNAME 2270749             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270750             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270751             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270752             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270753             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270754             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270755             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270756             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270757             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270758             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270759             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270760             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270761             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270762             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270763             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270764             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 
   USERNAME 2270765             batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned 

### Statistics on 'batch,gpu,bigmem' partition(s)
# sacct -X -S 2021-02-26 -E 2021-02-26 -u USERNAME --partition batch,gpu,bigmem --format state --noheader -P | sort | uniq -c
     17 PENDING
     31 RUNNING

$ sabuse
=> List users with running jobs totalling more than 140 cores / 
          [...]
            USERNAME: 1344
          [...]


Is that behavior normal? Are we using the limits the right way? It is also a bit confusing: the node limitation is not working as we expected, the same goes for MaxJobs, and we do not understand the differences between TRES, GRES, ...

Best regards,
Teddy
Comment 1 Marshall Garey 2021-02-26 08:15:13 MST
Can you run this command and upload its output?

scontrol show config |grep AccountingStorageEnforce
Comment 2 Teddy Valette 2021-02-26 08:36:13 MST
Of course:

$ scontrol show config |grep AccountingStorageEnforce
AccountingStorageEnforce = associations,limits,qos
Comment 3 Marshall Garey 2021-02-26 08:40:35 MST
Okay, I was just checking that you have limits enforced. I believe I know what the problem is here. The "MaxNodes" setting actually translates to "MaxTresPerJob", and looking at the output, I believe none of that user's jobs uses more than 4 nodes. What you actually want to set is

MaxTresPerUser=node=4


Can you unset the MaxNodes value and try MaxTresPerUser instead and let me know if it works for you?

As a (probably unnecessary) reminder, you can unset things with sacctmgr using -1:

sacctmgr mod user where name=USERNAME set maxnodes=-1


Link to documentation:

https://slurm.schedmd.com/sacctmgr.html
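
As a quick check (a hedged sketch; USERNAME is a placeholder), the translation of MaxNodes into a per-job TRES limit can be seen by listing the association's MaxTRES field:

  # MaxNodes set on an association shows up as a per-job limit (MaxTRES)
  $ sacctmgr show association where users=USERNAME format=cluster,account,user,maxtres,maxjobs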
Comment 4 Marshall Garey 2021-02-26 08:41:23 MST
Also, setting the limits won't cancel currently running jobs. It will only prevent additional jobs from starting (or, for submit limits, from being submitted) once they would exceed the limits.
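
As a hedged illustration (USERNAME is a placeholder), jobs held back by an association limit stay pending with a reason such as AssocMaxJobsLimit or AssocGrpNodeLimit, which can be seen with squeue:

  # Show the pending reason for each of the user's jobs
  $ squeue -u USERNAME -o "%.10i %.9P %.8T %.20r"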
Comment 5 Teddy Valette 2021-02-26 10:50:21 MST
Created attachment 18156 [details]
slurm limit table
Comment 6 Teddy Valette 2021-02-26 10:50:39 MST
This command doesn't seem to work (unknown option)

  $ sacctmgr mod user where name=USERNAME set MaxTresPerUser=node=4

According to your documentation, we're trying with:

  $ sacctmgr modify user where name=USERNAME set GrpTRES=node=4
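
For reference, a hedged way to confirm that the new value took effect (USERNAME is a placeholder):

  # Check that GrpTRES=node=4 is now recorded on the association
  $ sacctmgr show association where users=USERNAME format=cluster,account,user,grptres,maxjobs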

Also, we are confused: why does the user manage to have more than 7 jobs running at the same time despite MaxJobs being set to 7?

I have attached to this ticket a table we made to summarize the Slurm limits. Can you please confirm that this table is correct, or correct us? Your response seems to conflict with our understanding (cf. the 1st line).

Thank you in advance

Best regards,
Teddy
Comment 7 Marshall Garey 2021-02-26 11:00:11 MST
I'll look at the table.

MaxTresPerUser=node=4 didn't work because that's a QOS limit - I mistakenly thought it was both a QOS and an association limit, but it is only available for QOS.

> Why MaxJobs wasn't working:

- Did you set MaxJobs before or after those jobs started running?
- Can you upload your slurm.conf?
- Does this user currently have more than the MaxJobs limit of jobs running? If so, can you attach the output of this command (as an attachment, since it will be very verbose):

scontrol show assoc

and let me know specifically which user to look for? Alternatively you can look for the specific user in that output and only upload the information for that user.
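
If the full output is too verbose, a hedged sketch for pulling out just one user's record (USERNAME is a placeholder; the -A line count is approximate):

  # Extract only the relevant user's block from the association dump
  $ scontrol show assoc | grep -A 13 "UserName=USERNAME("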
Comment 8 Teddy Valette 2021-02-27 02:52:34 MST
Hello,

> I'll look at the table.
Thank you!

> - Did you set MaxJobs before or after those jobs started running?
We set MaxJobs before those jobs started running

> - Can you upload your slurm.conf?
Yes, please check slurm.conf in attachments

> - Does this user currently have more than the MaxJobs limit of jobs running?
I gave the details of his running jobs in the original post (31 RUNNING)

> can you attach the output of this command
Same, please check associations.txt in attachments

> let me know specifically which user to look for?
The user involved here is mzheng

Best regards,
Teddy
Comment 9 Teddy Valette 2021-02-27 02:52:59 MST
Created attachment 18166 [details]
associations.txt
Comment 10 Teddy Valette 2021-02-27 02:53:28 MST
Created attachment 18167 [details]
slurm.conf
Comment 11 Marshall Garey 2021-03-02 18:19:52 MST
Looking at your table of limits:

The only thing I found that may be slightly incorrect is the description for GrpCPUMins. You have:

"Maximum combined CPU*minutes for all jobs running under association/QOS."

It's actually not just running jobs but past and future jobs as well, per the sacctmgr man page:

For associations:

GrpTRESMins=<TRES=max TRES minutes,...>
    The total number of TRES minutes that can possibly be used by past, present and future jobs running from this association and its children.

For QOS:

GrpTRESMins
    The total number of TRES minutes that can possibly be used by past, present and future jobs running from this QOS. 

ALSO NOTE: This limit only applies when using the Priority Multifactor plugin. The time is decayed using the value of PriorityDecayHalfLife or PriorityUsageResetPeriod as set in the slurm.conf. When this limit is reached all associated jobs running will be killed and all future jobs submitted with associations in the group will be delayed until they are able to run inside the limit. 



The limit that applies only to running jobs is GrpTRESRunMins. (I think GrpCPURunMins will work - internally it will translate to GrpTRESRunMins.)
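
For illustration, a hedged sketch of setting a running-jobs limit on an association (USERNAME is a placeholder and 80640 is an arbitrary example value):

  # Cap the combined CPU-minutes of *running* jobs under this association
  $ sacctmgr modify user where name=USERNAME set GrpTRESRunMins=cpu=80640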
Comment 12 Marshall Garey 2021-03-02 18:26:18 MST
Thanks for uploading the debugging information I asked for and helping me debug this problem.

Looking at the scontrol show assoc output that you uploaded:

You indicated that user "mzheng" was the one you had concerns about:

ClusterName=iris Account=antonio.delsol UserName=mzheng(5824) Partition= Priority=0 ID=491
    SharesRaw/Norm/Level/Factor=1/0.12/8/0.31
    UsageRaw/Norm/Efctv=148686867.29/0.02/0.32
    ParentAccount= Lft=3369 DefAssoc=Yes
    GrpJobs=N(12) GrpJobsAccrue=N(28)
    GrpSubmitJobs=N(40) GrpWall=N(88085.04)
    GrpTRES=cpu=N(48),mem=N(196608),energy=N(0),node=4(4),billing=N(96),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0),gres/gpu:volta=N(0)
    GrpTRESMins=cpu=N(1765434),mem=N(3537617414),energy=N(0),node=N(88085),billing=N(2476807),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0),gres/gpu:volta=N(0)
    GrpTRESRunMins=cpu=N(118704),mem=N(486213495),energy=N(0),node=N(29676),billing=N(237408),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0),gres/gpu:volta=N(0)
    MaxJobs=7(12) MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESMinsPJ=
    MinPrioThresh=



The number inside the parentheses is the current usage. The number outside the parentheses is the limit, or "N" if there is no limit.

The limits I can see applied:

* GrpTRES=node=4. mzheng is using 4/4 nodes.
* MaxJobs=7. mzheng has 12 running jobs.

That clearly shows mzheng has more running jobs than is allowed.


What version of Slurm is this cluster running? I want to do some more testing using that version.
Comment 13 Teddy Valette 2021-03-03 00:34:18 MST
Thank you for your feedback on our table. We will update it then.

We are running Slurm 19.05.6. We planned to update it with our next cluster, but that action is still pending at the moment.
Comment 15 Marshall Garey 2021-04-19 17:21:22 MDT
Hi Teddy,

I'm sorry for the delay. I was finally able to reproduce what you're seeing (on Slurm 20.11, but I think it applies to all versions of Slurm):

If I have a GrpTres=node limit set on both an association and a QOS, then the QOS limit overrides the association limit.

My example:

* My user association has a 4 node limit
* My QOS "normal" (which is also my default QOS) has an 8 node limit

$ sacctmgr show assoc cluster=c1 account=acct1 format=account,user,grptres
   Account       User       GrpTRES 
---------- ---------- ------------- 
     acct1                          
     acct1   marshall        node=4 

$ sacctmgr show qos normal format=name,grptres
      Name       GrpTRES 
---------- ------------- 
    normal        node=8 


With this configuration, I can run on 8 nodes, but no more.


$ for i in {1..9}; do sbatch -N1 --exclusive --wrap='sleep 100'; done
Submitted batch job 697
Submitted batch job 698
Submitted batch job 699
Submitted batch job 700
Submitted batch job 701
Submitted batch job 702
Submitted batch job 703
Submitted batch job 704
Submitted batch job 705

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               705     debug     wrap marshall PD       0:00      1 (QOSGrpNodeLimit)
               697     debug     wrap marshall  R       0:03      1 n1-1
               698     debug     wrap marshall  R       0:03      1 n1-2
               699     debug     wrap marshall  R       0:03      1 n1-3
               700     debug     wrap marshall  R       0:03      1 n1-4
               701     debug     wrap marshall  R       0:03      1 n1-5
               702     debug     wrap marshall  R       0:03      1 n1-6
               703     debug     wrap marshall  R       0:03      1 n1-7
               704     debug     wrap marshall  R       0:03      1 n1-8


If I cancel all my jobs and then remove the GrpTRES node limit from the QOS "normal", I am limited to running on only 4 nodes (the association limit).

$ scancel -u marshall
$ sacctmgr mod qos normal set grptres=node=-1
 Modified qos...
  normal
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
$ sacctmgr show qos normal format=name,grptres
      Name       GrpTRES 
---------- ------------- 
    normal               

$ for i in {1..5}; do sbatch -N1 --exclusive --wrap='sleep 100'; done
Submitted batch job 706
Submitted batch job 707
Submitted batch job 708
Submitted batch job 709
Submitted batch job 710

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               710     debug     wrap marshall PD       0:00      1 (AssocGrpNodeLimit)
               706     debug     wrap marshall  R       0:02      1 n1-1
               707     debug     wrap marshall  R       0:02      1 n1-2
               708     debug     wrap marshall  R       0:02      1 n1-3
               709     debug     wrap marshall  R       0:02      1 n1-4


Can you confirm this is what you are seeing? Does the QOS that user "mzheng" uses for their jobs have a GrpTRES=node limit?


I don't yet know if this is intended behavior, but I will find out.
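
As a hedged way to answer the question above (assuming the job QOS in use is "normal"):

  # Show whether the QOS itself defines a group node limit that would take precedence
  $ sacctmgr show qos normal format=name,grptres,maxtrespu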
Comment 16 Marshall Garey 2021-04-19 17:26:34 MDT
It didn't take long. I found out this behavior is intended. From our resource_limits web page:

https://slurm.schedmd.com/resource_limits.html

"
Hierarchy

Slurm's hierarchical limits are enforced in the following order with Job QOS and Partition QOS order being reversible by using the QOS flag 'OverPartQOS':

    Partition QOS limit
    Job QOS limit
    User association
    Account association(s), ascending the hierarchy
    Root/Cluster association
    Partition limit
    None

Note: If limits are defined at multiple points in this hierarchy, the point in this list where the limit is first defined will be used. Consider the following example:

    MaxJobs=20 and MaxSubmitJobs is undefined in the partition QOS
    No limits are set in the job QOS and
    MaxJobs=4 and MaxSubmitJobs=50 in the user association

The limits in effect will be MaxJobs=20 and MaxSubmitJobs=50.
"


So, all QOS limits will override all association limits.
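
In practice, that means the association's node=4 cap only becomes the binding limit if no QOS in the job's hierarchy defines its own GrpTRES node limit. One hedged option, mirroring the demonstration above (assuming "normal" is the QOS in question), is to clear the QOS-level limit:

  # Clear the QOS group node limit so the stricter association limit applies
  $ sacctmgr modify qos normal set GrpTRES=node=-1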
Comment 17 Marshall Garey 2021-04-29 11:43:25 MDT
Since you haven't responded, I assume I answered your questions, so I'm closing this as infogiven. Please re-open it if you have more questions or issues.