| Summary: | Setting maxnodes does not have the expected behavior; users are able to bypass the limit | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Teddy Valette <teddy.valette> |
| Component: | Limits | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | abatcha.olloh, hyacinthe.cartiaux, lyeager, Sebastien.Varrette, teddy.valette |
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | University of Luxembourg | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | CentOS | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | slurm limit table, associations.txt, slurm.conf | ||
Can you run this command and upload its output?
scontrol show config | grep AccountingStorageEnforce

Of course:
$ scontrol show config | grep AccountingStorageEnforce
AccountingStorageEnforce = associations,limits,qos

Okay, I was just checking that you have limits enforced. I believe I know what the problem is here. The "MaxNodes" setting actually translates to "MaxTresPerJob", and looking at the output, I believe none of that user's jobs have more than 4 nodes per job. What you actually want to set is MaxTresPerUser=node=4.
Can you unset the MaxNodes value, try MaxTresPerUser instead, and let me know if it works for you? As a (probably unnecessary) reminder, you can unset values with sacctmgr using -1:
sacctmgr mod user where name=USERNAME set maxnodes=-1
Link to the documentation: https://slurm.schedmd.com/sacctmgr.html
Also, setting the limits won't cancel currently running jobs. It will only prevent future jobs from being submitted that would exceed the limits.
Created attachment 18156 [details]
slurm limit table
This command doesn't seem to work (unknown option):
$ sacctmgr mod user where name=USERNAME set MaxTresPerUser=node=4
According to your documentation, we're trying with:
$ sacctmgr modify user where name=USERNAME set GrpTRES=node=4
Also, we are confused as to why the user manages to have more than 7 jobs running at the same time despite MaxJobs being set to 7.
I attach to this ticket a table we made to summarize Slurm limits. Can you please confirm that this table is correct, or correct us? Your response seems to conflict with our understanding (cf. 1st line).
Thank you in advance. Best regards, Teddy

I'll look at the table.
The reason the MaxTresPerUser=node=4 didn't work is because that's a QOS limit - I mistakenly thought it was both a QOS and an association limit, but it is only for QOS.
> Why MaxJobs wasn't working:
- Did you set MaxJobs before or after those jobs started running?
- Can you upload your slurm.conf?
- Does this user currently have more than the MaxJobs limit of jobs running? If so, can you attach the output of this command (as an attachment, since it will be very verbose):
scontrol show assoc
and let me know specifically which user to look for? Alternatively you can look for the specific user in that output and only upload the information for that user.
Hello,
> I'll look at the table.
Thank you!
> Did you set MaxJobs before or after those jobs started running?
We set MaxJobs before those jobs started running.
> Can you upload your slurm.conf?
Yes, please check slurm.conf in the attachments.
> Does this user currently have more than the MaxJobs limit of jobs running?
I gave you the details of his running jobs (31 RUNNING) in the original post.
> Can you attach the output of this command?
Same, please check associations.txt in the attachments.
> Let me know specifically which user to look for.
The user involved here is mzheng.
Best regards, Teddy
Created attachment 18166 [details]
associations.txt
Created attachment 18167 [details]
slurm.conf
Looking at your table of limits:
The only thing I found that may be slightly incorrect is the description for GrpCPUMins. You have:
"Maximum combined CPU*minutes for all jobs running under association/QOS."
It's actually not just running jobs but past and future jobs as well, per the sacctmgr man page:
For associations:
GrpTRESMins=<TRES=max TRES minutes,...>
The total number of TRES minutes that can possibly be used by past, present and future jobs running from this association and its children.
For QOS:
GrpTRESMins
The total number of TRES minutes that can possibly be used by past, present and future jobs running from this QOS.
ALSO NOTE: This limit only applies when using the Priority Multifactor plugin. The time is decayed using the value of PriorityDecayHalfLife or PriorityUsageResetPeriod as set in the slurm.conf. When this limit is reached all associated jobs running will be killed and all future jobs submitted with associations in the group will be delayed until they are able to run inside the limit.
The limit that applies only to running jobs is GrpTRESRunMins. (I think GrpCPURunMins will work - internally it will translate to GrpTRESRunMins.)
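The decay mentioned in the note above (PriorityDecayHalfLife) is an exponential half-life decay of recorded usage. Here is a small illustrative sketch of that idea; the `decayed_usage` function is my own illustration, not Slurm's actual accounting code:

```python
def decayed_usage(usage_minutes, minutes_elapsed, half_life_minutes):
    """Exponential half-life decay: after one half-life has elapsed,
    half of the recorded TRES-minute usage still counts against the
    GrpTRESMins limit."""
    return usage_minutes * 0.5 ** (minutes_elapsed / half_life_minutes)

week = 7 * 24 * 60  # e.g. a half-life of 7 days, expressed in minutes
print(decayed_usage(1000.0, week, week))      # one half-life -> 500.0
print(decayed_usage(1000.0, 2 * week, week))  # two half-lives -> 250.0
```

This is why GrpTRESMins can cover "past" jobs: old usage never disappears entirely, it just shrinks geometrically with each half-life.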
Thanks for uploading the debugging information I asked for and helping me debug this problem.
Looking at the scontrol show assoc output that you uploaded:
You indicated that user "mzheng" was the one you had concerns about:
ClusterName=iris Account=antonio.delsol UserName=mzheng(5824) Partition= Priority=0 ID=491
SharesRaw/Norm/Level/Factor=1/0.12/8/0.31
UsageRaw/Norm/Efctv=148686867.29/0.02/0.32
ParentAccount= Lft=3369 DefAssoc=Yes
GrpJobs=N(12) GrpJobsAccrue=N(28)
GrpSubmitJobs=N(40) GrpWall=N(88085.04)
GrpTRES=cpu=N(48),mem=N(196608),energy=N(0),node=4(4),billing=N(96),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0),gres/gpu:volta=N(0)
GrpTRESMins=cpu=N(1765434),mem=N(3537617414),energy=N(0),node=N(88085),billing=N(2476807),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0),gres/gpu:volta=N(0)
GrpTRESRunMins=cpu=N(118704),mem=N(486213495),energy=N(0),node=N(29676),billing=N(237408),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0),gres/gpu:volta=N(0)
MaxJobs=7(12) MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESMinsPJ=
MinPrioThresh=
The number inside the parentheses is the current usage. The number outside the parentheses is the limit, or "N" if there is no limit.
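That limit(usage) notation can be unpacked mechanically. Here is a small hypothetical Python helper (the `parse_assoc_tres` name and the simplified parsing are my own, not part of any Slurm tooling):

```python
import re

def parse_assoc_tres(field):
    """Parse a 'scontrol show assoc' TRES field such as
    'cpu=N(48),node=4(4)' into {tres_name: (limit, usage)}.
    A limit of None corresponds to 'N', i.e. no limit set."""
    out = {}
    for part in field.split(","):
        name, limit, usage = re.match(r"(.+)=(.+)\((.+)\)", part).groups()
        out[name] = (None if limit == "N" else float(limit), float(usage))
    return out

tres = parse_assoc_tres("cpu=N(48),mem=N(196608),node=4(4)")
print(tres["node"])  # limit 4.0, usage 4.0: the user is at the node cap
print(tres["cpu"])   # no limit, 48 CPUs in use
```

Reading the output this way makes the two findings below immediate: node is the only TRES at its cap, and everything else is unlimited.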
The limits I can see applied:
* GrpTRES=node=4. mzheng is using 4/4 nodes.
* MaxJobs=7. mzheng has 12 running jobs.
That clearly shows mzheng has more running jobs than is allowed.
What version of Slurm is this cluster running? I want to do some more testing using that version.
Thank you for your feedback on our table. We will update it accordingly. We are running Slurm 19.05.6; we planned to upgrade it with our next cluster, but that is still pending at the moment.

Hi Teddy,
I'm sorry for the delay. I was finally able to reproduce what you're seeing (on Slurm 20.11, but I think it applies to all versions of Slurm):
If I have a GrpTres=node limit set on both an association and a QOS, then the QOS limit overrides the association limit.
My example:
* My user association has a 4 node limit
* My QOS "normal" (which is also my default QOS) has an 8 node limit
$ sacctmgr show assoc cluster=c1 account=acct1 format=account,user,grptres
Account User GrpTRES
---------- ---------- -------------
acct1
acct1 marshall node=4
$ sacctmgr show qos normal format=name,grptres
Name GrpTRES
---------- -------------
normal node=8
With this configuration, I can run on 8 nodes, but no more.
$ for i in {1..9}; do sbatch -N1 --exclusive --wrap='sleep 100'; done
Submitted batch job 697
Submitted batch job 698
Submitted batch job 699
Submitted batch job 700
Submitted batch job 701
Submitted batch job 702
Submitted batch job 703
Submitted batch job 704
Submitted batch job 705
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
705 debug wrap marshall PD 0:00 1 (QOSGrpNodeLimit)
697 debug wrap marshall R 0:03 1 n1-1
698 debug wrap marshall R 0:03 1 n1-2
699 debug wrap marshall R 0:03 1 n1-3
700 debug wrap marshall R 0:03 1 n1-4
701 debug wrap marshall R 0:03 1 n1-5
702 debug wrap marshall R 0:03 1 n1-6
703 debug wrap marshall R 0:03 1 n1-7
704 debug wrap marshall R 0:03 1 n1-8
If I cancel all my jobs, then remove the GrpTRES=node=8 limit from the QOS "normal", then I am limited to running on only 4 nodes.
$ scancel -u marshall
$ sacctmgr mod qos normal set grptres=node=-1
Modified qos...
normal
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
$ sacctmgr show qos normal format=name,grptres
Name GrpTRES
---------- -------------
normal
$ for i in {1..5}; do sbatch -N1 --exclusive --wrap='sleep 100'; done
Submitted batch job 706
Submitted batch job 707
Submitted batch job 708
Submitted batch job 709
Submitted batch job 710
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
710 debug wrap marshall PD 0:00 1 (AssocGrpNodeLimit)
706 debug wrap marshall R 0:02 1 n1-1
707 debug wrap marshall R 0:02 1 n1-2
708 debug wrap marshall R 0:02 1 n1-3
709 debug wrap marshall R 0:02 1 n1-4
Can you confirm this is what you are seeing? Does the QOS that user "mzheng" uses for their jobs have a GrpTRES=node limit?
I don't yet know if this is intended behavior, but I will find out.
It didn't take long. I found out this behavior is intended. From our resource_limits web page: https://slurm.schedmd.com/resource_limits.html

"Hierarchy
Slurm's hierarchical limits are enforced in the following order, with Job QOS and Partition QOS order being reversible by using the QOS flag 'OverPartQOS':
1. Partition QOS limit
2. Job QOS limit
3. User association
4. Account association(s), ascending the hierarchy
5. Root/Cluster association
6. Partition limit
7. None
Note: If limits are defined at multiple points in this hierarchy, the point in this list where the limit is first defined will be used. Consider the following example:
- MaxJobs=20 and MaxSubmitJobs is undefined in the partition QOS
- No limits are set in the job QOS
- MaxJobs=4 and MaxSubmitJobs=50 in the user association
The limits in effect will be MaxJobs=20 and MaxSubmitJobs=50."

So, all QOS limits will override all association limits.

Since you haven't responded, I assume that I answered your questions, so I'm closing this as infogiven. Please re-open it if you have more questions or issues.
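The "first defined wins" rule quoted from the resource_limits page can be sketched as a small Python model. The level names and the `effective_limit` helper are illustrative only, not Slurm internals:

```python
# Hypothetical model of Slurm's limit hierarchy: the first level in
# enforcement order that defines a limit supplies its value, regardless
# of whether a lower level is stricter.

HIERARCHY = [
    "partition_qos",   # Partition QOS limit
    "job_qos",         # Job QOS limit
    "user_assoc",      # User association
    "account_assoc",   # Account association(s), ascending the hierarchy
    "root_assoc",      # Root/Cluster association
    "partition",       # Partition limit
]

def effective_limit(name, limits_by_level):
    """Return the value from the first level that defines the limit."""
    for level in HIERARCHY:
        value = limits_by_level.get(level, {}).get(name)
        if value is not None:
            return value
    return None  # "None" in the hierarchy: no limit anywhere

# The example from the resource_limits page:
limits = {
    "partition_qos": {"MaxJobs": 20},  # MaxSubmitJobs undefined here
    "user_assoc": {"MaxJobs": 4, "MaxSubmitJobs": 50},
}
print(effective_limit("MaxJobs", limits))        # 20, not the stricter 4
print(effective_limit("MaxSubmitJobs", limits))  # 50, from the association
```

This matches the behavior reproduced above: the QOS GrpTRES=node=8 masked the association's node=4 limit until the QOS limit was cleared.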
Dear all,

We have detected a user making abusive use of our cluster. Thus, we decided to restrict the user until he understands and modifies the way he's using his scripts and jobs, with the following command:

sacctmgr modify user where name=USERNAME set maxnodes=4

To check that the restriction is active, we run this command (we see here that the user is limited in job count and node resources):

sacctmgr show association where users=USERNAME format=cluster,account%20,user%15,share,qos%50,maxjobs,maxsubmit,maxtres,
   Cluster              Account            User     Share                                                QOS MaxJobs MaxSubmit       MaxTRES
---------- -------------------- --------------- --------- -------------------------------------------------- ------- --------- -------------
      iris              ACCOUNT        USERNAME         1                   besteffort,debug,long,low,normal       7                  node=4

Nevertheless, this user is still able to submit and have many running jobs, exceeding the 4-node restriction:

$ susage -u USERNAME
# sacct -X -S 2021-02-26 -E 2021-02-26 -u USERNAME --format User,JobID,partition%12,qos,state,time,elapsed,nnodes,ncpus,nodelist
     User        JobID    Partition        QOS      State  Timelimit    Elapsed   NNodes      NCPUS        NodeList
--------- ------------ ------------ ---------- ---------- ---------- ---------- -------- ---------- ---------------
 USERNAME      2270718        batch     normal    RUNNING 2-00:00:00   14:31:02        1         28        iris-112
 USERNAME      2270719        batch     normal    RUNNING 2-00:00:00   14:20:36        1         28        iris-133
 USERNAME      2270720        batch     normal    RUNNING 2-00:00:00   14:20:36        1         28        iris-140
 USERNAME      2270721        batch     normal    RUNNING 2-00:00:00   14:01:35        1         28        iris-127
 USERNAME      2270722        batch     normal    RUNNING 2-00:00:00   14:01:35        1         28        iris-111
 USERNAME      2270723        batch     normal    RUNNING 2-00:00:00   13:55:34        1         28        iris-135
 USERNAME      2270724        batch     normal    RUNNING 2-00:00:00   13:55:34        1         28        iris-128
 USERNAME      2270725        batch     normal    RUNNING 2-00:00:00   13:55:34        1         28        iris-146
 USERNAME      2270726        batch     normal    RUNNING 2-00:00:00   13:55:01        1         28        iris-162
 USERNAME      2270727        batch     normal    RUNNING 2-00:00:00   13:37:31        1         28        iris-116
 USERNAME      2270728        batch     normal    RUNNING 2-00:00:00   13:37:31        1         28        iris-132
 USERNAME      2270729        batch     normal    RUNNING 2-00:00:00   13:37:31        1         28        iris-168
 USERNAME      2270730        batch     normal    RUNNING 2-00:00:00   13:29:28        1         28        iris-115
 USERNAME      2270731        batch     normal    RUNNING 2-00:00:00   13:29:28        1         28        iris-124
 USERNAME      2270732        batch     normal    RUNNING 2-00:00:00   13:10:23        1         28        iris-117
 USERNAME      2270733        batch     normal    RUNNING 2-00:00:00   13:10:23        1         28        iris-121
 USERNAME      2270734        batch     normal    RUNNING 2-00:00:00   13:10:23        1         28        iris-126
 USERNAME      2270735        batch     normal    RUNNING 2-00:00:00   11:54:06        1         28        iris-134
 USERNAME      2270736        batch     normal    RUNNING 2-00:00:00   11:54:06        1         28        iris-136
 USERNAME      2270737        batch     normal    RUNNING 2-00:00:00   08:58:04        1         28        iris-114
 USERNAME      2270738        batch     normal    RUNNING 2-00:00:00   08:17:58        1         28        iris-120
 USERNAME      2270739        batch     normal    RUNNING 2-00:00:00   07:46:52        1         28        iris-166
 USERNAME      2270740        batch     normal    RUNNING 2-00:00:00   06:15:49        1         28        iris-129
 USERNAME      2270741        batch     normal    RUNNING 2-00:00:00   05:57:46        1         28        iris-130
 USERNAME      2270742        batch     normal    RUNNING 2-00:00:00   05:51:03        1         28        iris-157
 USERNAME      2270743        batch     normal    RUNNING 2-00:00:00   05:45:43        1         28        iris-145
 USERNAME      2270744        batch     normal    RUNNING 2-00:00:00   05:15:39        1         28        iris-167
 USERNAME      2270745        batch     normal    RUNNING 2-00:00:00   04:25:41        1         28        iris-109
 USERNAME      2270746        batch     normal    RUNNING 2-00:00:00   01:35:05        1         28        iris-164
 USERNAME      2270747        batch     normal    RUNNING 2-00:00:00   01:12:25        1         28        iris-147
 USERNAME      2270748        batch     normal    RUNNING 2-00:00:00   01:03:02        1         28        iris-152
 USERNAME      2270749        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270750        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270751        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270752        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270753        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270754        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270755        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270756        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270757        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270758        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270759        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270760        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270761        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270762        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270763        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270764        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned
 USERNAME      2270765        batch     normal    PENDING 2-00:00:00   00:00:00        1         28   None assigned

### Statistics on 'batch,gpu,bigmem' partition(s)
# sacct -X -S 2021-02-26 -E 2021-02-26 -u USERNAME --partition batch,gpu,bigmem --format state --noheader -P | sort | uniq -c
     17 PENDING
     31 RUNNING

$ sabuse => List users with running jobs totalling more than 140 cores /
[...]
USERNAME: 1344
[...]

Is that behavior normal? Are we using limitations the right way? Also, it is a bit confusing: the node limitation is not working as we expected, the same issue occurs for MaxJobs, and we don't get the differences between TRES, GRES, ...

Best regards, Teddy