| Summary: | GPU subtype not allowed when setting up MaxTRES per user | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Marc Caubet Serrabou <marc.caubet> |
| Component: | Limits | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 20.11.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Paul Scherrer | ||
| Attachments: | Slurm config file | ||
Hi Marc,
You should be able to define a MaxTRES limit that includes the type of a gres. It looks like the syntax you were trying is just a little off. You used an equal sign before the type and a colon for the number, like this:
MAXTRES='...gres/gpu=A100:2...'
You should put a colon before the type and an equals sign before the count of the gres. Here's an example where I set a limit that included the type (though I used GrpTRES instead of MaxTRES).
$ sacctmgr modify user ben account=sub6_scav set grptres=gres/gpu:tesla1=2
Modified user associations...
C = knight A = sub6_scav U = ben
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
$ sacctmgr show assoc tree account=sub6_scav user=ben format=account,user,grptres%30
Account User GrpTRES
-------------------- ---------- ------------------------------
sub6_scav ben gres/gpu:tesla1=2
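For anyone coming from the --gres request format, the difference can be sketched as a small conversion. The gres_to_tres helper below is hypothetical (it is not part of Slurm), and the user, cluster and account names are the ones from the original report. The resulting sacctmgr command is only printed, not executed, so it can be reviewed before it is run.

```shell
#!/bin/sh
# Hypothetical helper, not part of Slurm: convert a --gres style spec
# (name[:type]:count) into the TRES limit form (gres/name[:type]=count).
gres_to_tres() {
    spec=$1
    count=${spec##*:}      # text after the last colon, e.g. "2"
    name_type=${spec%:*}   # text before the last colon, e.g. "gpu:A100"
    printf 'gres/%s=%s\n' "$name_type" "$count"
}

limit=$(gres_to_tres gpu:A100:2)

# Print the command instead of running it (assumption: you want to review it first).
echo "sacctmgr update user caubet_m Clusters=gmerlin6 Account=merlin set MaxTRES='cpu=40,gres/gpu=8,$limit,mem=200G'"
```

The typed limit sits alongside the untyped gres/gpu=8 entry, mirroring how both forms appear in AccountingStorageTRES.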
Let me know if you have any trouble setting the limit with this syntax.
Thanks,
Ben
Dear Ben,
Thanks a lot for your help, and my sincerest apologies for this silly mistake. I was using the format of the --gpus/--gres options and did not realize that the syntax was different; in fact, it matches the AccountingStorageTRES values (with the difference that one adds the number of GPUs). Thanks a lot for pointing this out, and once again sorry for bothering you with this ticket.
Best regards,
Marc

Hi Marc,
No problem, it's an easy mistake to make. I'm glad to hear it's working the way you want now.
Thanks,
Ben
Created attachment 19318
Slurm config file

Hi,
Having defined the following TRES resources:
AccountingStorageTRES=gres/gpu,gres/gpu:geforce_gtx_1080,gres/gpu:geforce_gtx_1080_ti,gres/gpu:geforce_rtx_2080_ti,gres/gpu:K4200,gres/gpu:P2000,gres/gpu:A100,ic/ofed,gres/mps
I would like to limit the usage of A100 GPU cards. However, this does not seem to work. For example, the association originally looks like this:
[root@merlin-slurmctld03 ~]# sacctmgr show assoc User=caubet_m Clusters=gmerlin6 Account=merlin -p
Cluster|Account|User|Partition|Share|Priority|GrpJobs|GrpTRES|GrpSubmit|GrpWall|GrpTRESMins|MaxJobs|MaxTRES|MaxTRESPerNode|MaxSubmit|MaxWall|MaxTRESMins|QOS|Def QOS|GrpTRESRunMins|
gmerlin6|merlin|caubet_m||1||||||||cpu=40,gres/gpu=8,mem=200G|||||gpu_normal|||
When I update the user to limit A100 by adding "gres/gpu=A100:2", sacctmgr does not accept it:
[root@merlin-slurmctld03 ~]# sacctmgr update user caubet_m Clusters=gmerlin6 Account=merlin set MAXTRES='cpu=40,gres/gpu=8,gres/gpu=A100:2,mem=200G'
sacctmgr: error: Invalid unit type 'A'. Possible options are 'KMGTP'
Modified user associations...
C = gmerlin6 A = merlin U = caubet_m
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
In fact, it corrupts the input and sets "gres/gpu=0". It looks like it expects only integers with the K, M, G, T or P units.
[root@merlin-slurmctld03 ~]# sacctmgr show assoc User=caubet_m Clusters=gmerlin6 Account=merlin -p
Cluster|Account|User|Partition|Share|Priority|GrpJobs|GrpTRES|GrpSubmit|GrpWall|GrpTRESMins|MaxJobs|MaxTRES|MaxTRESPerNode|MaxSubmit|MaxWall|MaxTRESMins|QOS|Def QOS|GrpTRESRunMins|
gmerlin6|merlin|caubet_m||1||||||||cpu=40,gres/gpu=0,mem=200G|||||gpu_normal|||
It looks like this feature is missing; one should be able to set MaxTRES according to the different resources defined in Slurm.
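Because sacctmgr still offered to commit after printing the error, it may be worth checking what was actually stored. The following is a minimal sketch, assuming the pipe-separated layout shown in the -p output of this report; the parsable_column helper is hypothetical, and the sample data is copied from the report so the snippet runs without Slurm.

```shell
#!/bin/sh
# Hypothetical helper: print one named column from `sacctmgr ... -p` output.
parsable_column() {
    awk -F'|' -v col="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
        idx     { print $idx }
    '
}

# Real usage (requires Slurm):
#   sacctmgr show assoc User=caubet_m Clusters=gmerlin6 Account=merlin -p | parsable_column MaxTRES

# Sample data copied from the report, taken after the failed update.
sample='Cluster|Account|User|Partition|Share|Priority|GrpJobs|GrpTRES|GrpSubmit|GrpWall|GrpTRESMins|MaxJobs|MaxTRES|MaxTRESPerNode|MaxSubmit|MaxWall|MaxTRESMins|QOS|Def QOS|GrpTRESRunMins|
gmerlin6|merlin|caubet_m||1||||||||cpu=40,gres/gpu=0,mem=200G|||||gpu_normal|||'

stored=$(printf '%s\n' "$sample" | parsable_column MaxTRES)

# Look for the typed limit as one comma-separated entry, independent of ordering.
case ",$stored," in
    *",gres/gpu:A100=2,"*) echo "A100 limit present in MaxTRES: $stored" ;;
    *)                     echo "A100 limit missing, stored MaxTRES is: $stored" ;;
esac
```

Matching on a single comma-delimited entry avoids assuming that sacctmgr reports the TRES entries in the order they were given.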