Ticket 11531 - GPU subtype not allowed when setting up MaxTRES per user
Summary: GPU subtype not allowed when setting up MaxTRES per user
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 20.11.5
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Ben Roberts
Reported: 2021-05-05 03:23 MDT by Marc Caubet Serrabou
Modified: 2021-05-10 08:54 MDT
Site: Paul Scherrer


Attachments
Slurm config file (6.22 KB, text/plain)
2021-05-05 03:23 MDT, Marc Caubet Serrabou

Description Marc Caubet Serrabou 2021-05-05 03:23:38 MDT
Created attachment 19318
Slurm config file

Hi,

Having defined the following TRES:

AccountingStorageTRES=gres/gpu,gres/gpu:geforce_gtx_1080,gres/gpu:geforce_gtx_1080_ti,gres/gpu:geforce_rtx_2080_ti,gres/gpu:K4200,gres/gpu:P2000,gres/gpu:A100,ic/ofed,gres/mps

I would like to limit the usage of the A100 GPU cards, but this does not seem to work. For example, starting from:

[root@merlin-slurmctld03 ~]# sacctmgr show assoc User=caubet_m Clusters=gmerlin6 Account=merlin -p
Cluster|Account|User|Partition|Share|Priority|GrpJobs|GrpTRES|GrpSubmit|GrpWall|GrpTRESMins|MaxJobs|MaxTRES|MaxTRESPerNode|MaxSubmit|MaxWall|MaxTRESMins|QOS|Def QOS|GrpTRESRunMins|
gmerlin6|merlin|caubet_m||1||||||||cpu=40,gres/gpu=8,mem=200G|||||gpu_normal|||

When I update the user to limit A100 usage by adding "gres/gpu=A100:2", sacctmgr reports an error:

[root@merlin-slurmctld03 ~]# sacctmgr update user caubet_m Clusters=gmerlin6 Account=merlin set MAXTRES='cpu=40,gres/gpu=8,gres/gpu=A100:2,mem=200G'
sacctmgr: error: Invalid unit type 'A'. Possible options are 'KMGTP'
 Modified user associations...
  C = gmerlin6   A = merlin               U = caubet_m 
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

Worse, committing the change corrupts the limit and sets "gres/gpu=0". It looks like the value is expected to be an integer, optionally followed by a K, M, G, T or P unit.

[root@merlin-slurmctld03 ~]# sacctmgr show assoc User=caubet_m Clusters=gmerlin6 Account=merlin -p
Cluster|Account|User|Partition|Share|Priority|GrpJobs|GrpTRES|GrpSubmit|GrpWall|GrpTRESMins|MaxJobs|MaxTRES|MaxTRESPerNode|MaxSubmit|MaxWall|MaxTRESMins|QOS|Def QOS|GrpTRESRunMins|
gmerlin6|merlin|caubet_m||1||||||||cpu=40,gres/gpu=0,mem=200G|||||gpu_normal|||

It looks like this feature is missing: one should be able to set MaxTRES for each of the different TRES defined in Slurm, including typed GPUs.
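The mistake above can be caught before calling sacctmgr. Below is a minimal bash sketch, assuming the `name[:type]=count` item form that the resolution of this ticket confirms; the function `check_tres_spec` and its regular expression are hypothetical and deliberately simplified, not Slurm's own parser.

```shell
#!/usr/bin/env bash
# Simplified, hypothetical pre-check for a TRES limit string; this is NOT
# Slurm's own parser. Each comma-separated item must look like
#   name[/name][:type]=count[K|M|G|T|P]
# e.g. cpu=40, mem=200G, gres/gpu=8, gres/gpu:A100=2.
check_tres_spec() {
  local spec=$1 item rc=0
  local re='^[A-Za-z0-9_]+(/[A-Za-z0-9_]+)?(:[A-Za-z0-9_.-]+)?=[0-9]+[KMGTP]?$'
  local -a items
  IFS=, read -r -a items <<< "$spec"   # split on commas only
  for item in "${items[@]}"; do
    if [[ ! $item =~ $re ]]; then
      echo "bad TRES item: $item (expected name[:type]=count)" >&2
      rc=1
    fi
  done
  return "$rc"
}

# The limit from this ticket: the item written as '=type:count' is flagged.
check_tres_spec 'cpu=40,gres/gpu=8,gres/gpu=A100:2,mem=200G' || echo "rejected"
# The same limit written as name:type=count passes.
check_tres_spec 'cpu=40,gres/gpu=8,gres/gpu:A100=2,mem=200G' && echo "ok"
```

The check only looks at the shape of each item; it does not know which TRES are actually listed in AccountingStorageTRES.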
Comment 1 Ben Roberts 2021-05-06 10:55:48 MDT
Hi Marc,

You should be able to define a MaxTRES limit that includes the type of a gres.  It looks like the syntax you were trying is just a little off.  You used an equal sign before the type and a colon for the number, like this:
MAXTRES='...gres/gpu=A100:2...'

You should put a colon before the type and an equals sign before the number of the gres. Here's an example where I set a limit that included the type (though I used GrpTRES instead of MaxTRES).

$ sacctmgr modify user ben account=sub6_scav set grptres=gres/gpu:tesla1=2
 Modified user associations...
  C = knight     A = sub6_scav            U = ben      
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

$ sacctmgr show assoc tree account=sub6_scav user=ben format=account,user,grptres%30
             Account       User                        GrpTRES 
-------------------- ---------- ------------------------------ 
sub6_scav                   ben              gres/gpu:tesla1=2 

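Applied to the association from the description, the corrected limit would presumably be `gres/gpu:A100=2`, kept next to the generic `gres/gpu=8` as in the original attempt. The bash sketch below only prints (does not run) the resulting sacctmgr command; the helper name `maxtres_cmd` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the sacctmgr command that sets MaxTRES for an
# association. Remaining arguments are TRES items in name[:type]=count form.
maxtres_cmd() {
  local user=$1 cluster=$2 account=$3
  shift 3
  local IFS=,            # "$*" joins the remaining items with commas
  printf "sacctmgr update user %s Clusters=%s Account=%s set MaxTRES='%s'\n" \
    "$user" "$cluster" "$account" "$*"
}

maxtres_cmd caubet_m gmerlin6 merlin cpu=40 gres/gpu=8 gres/gpu:A100=2 mem=200G
```

After committing such a change, the limit can be checked with the same `sacctmgr show assoc ... -p` query used in the description.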

Let me know if you have any trouble setting the limit with this syntax.

Thanks,
Ben
Comment 2 Marc Caubet Serrabou 2021-05-07 01:10:14 MDT
Dear Ben,

Thanks a lot for your help, and my sincerest apologies for this silly mistake. I was using the format of the --gpus/--gres options and did not realize that the syntax here is different: it actually matches the AccountingStorageTRES values, with the number of GPUs appended.
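The difference between the two notations can be made concrete with a small bash sketch that rewrites a `--gres` style request (`name[[:type]:count]`, count defaulting to 1) into the TRES limit form (`gres/name[:type]=count`). The function `gres_to_tres` is hypothetical; it handles only the `--gres` form, not the `--gpus` form, and rejects a type given without a count.

```shell
#!/usr/bin/env bash
# Hypothetical converter: --gres request -> TRES limit item.
#   gpu          -> gres/gpu=1
#   gpu:8        -> gres/gpu=8
#   gpu:A100:2   -> gres/gpu:A100=2
gres_to_tres() {
  local req=$1 type="" count=1
  local -a parts
  IFS=: read -r -a parts <<< "$req"
  local name=${parts[0]-}
  case ${#parts[@]} in
    1) ;;                                        # name only, count 1
    2) if [[ ${parts[1]} =~ ^[0-9]+$ ]]; then
         count=${parts[1]}                       # name:count
       else
         echo "unsupported request: $req" >&2; return 1
       fi ;;
    3) type=${parts[1]}; count=${parts[2]} ;;    # name:type:count
    *) echo "unsupported request: $req" >&2; return 1 ;;
  esac
  if [[ -n $type ]]; then
    printf 'gres/%s:%s=%s\n' "$name" "$type" "$count"
  else
    printf 'gres/%s=%s\n' "$name" "$count"
  fi
}

gres_to_tres gpu:A100:2
```

In short, the type moves in front of an equals sign, and the count follows it.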

Thanks a lot for pointing this out and for your help, and once again sorry for bothering you with this ticket.

Best regards,
Marc
Comment 3 Ben Roberts 2021-05-10 08:54:14 MDT
Hi Marc,

No problem, it's an easy mistake to make.  I'm glad to hear it's working the way you want now.

Thanks,
Ben