Ticket 2757

Summary: Limiting GPUs per Account
Product: Slurm
Component: Limits
Version: 15.08.11
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Hardware: Linux
OS: Linux
Site: Vanderbilt
Reporter: Will French <will>
Assignee: Alejandro Sanchez <alex>
CC: davide.vanzo
Attachments: slurm.conf, sacctmgr show assoc tree, sacctmgr show qos

Description Will French 2016-05-23 05:16:40 MDT
Hello,

We have two types of GPUs on our cluster: old Fermi cards and new Maxwell Titan X's. We are trying to use AccountingStorageTRES to limit the number of Titan X cards a group can be using at once, but we don't want to place any limits on the older Fermi hardware.

The way I attempted to set this up was the following:

In slurm.conf (attached):

.
.
GresTypes=gpu
AccountingStorageTRES=gres/gpu:maxwell,gres/gpu:fermi
NodeName=vmp[802,805-808,813,815,818,824-826,833-838,844] RealMemory=45000 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 Gres=gpu:fermi:4 Feature=fermi
NodeName=vmp[1243-1254] RealMemory=128000 CPUs=12 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 Gres=gpu:maxwell:4 Feature=maxwell
PartitionName=fermi Nodes=vmp[802,805-808,813,815,818,824-826,833-838,844] Default=NO MaxTime=20160 DefaultTime=15 DefMemPerNode=2000 MaxMemPerNode=45000 State=UP
PartitionName=maxwell Nodes=vmp[1243-1254] Default=NO MaxTime=20160 DefaultTime=15 DefMemPerNode=2000 MaxMemPerNode=124000 State=UP
.
.

Here's our gres.conf:

NodeName=vmp[802,805-808,813,815,818,824-826,833-838,844] Name=gpu Type=fermi File=/dev/nvidia0 CPUs=0-1
NodeName=vmp[802,805-808,813,815,818,824-826,833-838,844] Name=gpu Type=fermi File=/dev/nvidia1 CPUs=2-3
NodeName=vmp[802,805-808,813,815,818,824-826,833-838,844] Name=gpu Type=fermi File=/dev/nvidia2 CPUs=4-5
NodeName=vmp[802,805-808,813,815,818,824-826,833-838,844] Name=gpu Type=fermi File=/dev/nvidia3 CPUs=6-7
NodeName=vmp[1243-1254] Name=gpu Type=maxwell File=/dev/nvidia0 CPUs=0-2
NodeName=vmp[1243-1254] Name=gpu Type=maxwell File=/dev/nvidia1 CPUs=3-5
NodeName=vmp[1243-1254] Name=gpu Type=maxwell File=/dev/nvidia2 CPUs=6-8
NodeName=vmp[1243-1254] Name=gpu Type=maxwell File=/dev/nvidia3 CPUs=9-11

I then set up the TRES limits on the Maxwell cards:

$ sacctmgr modify account accre_gpu set GrpTRES=gres/gpu:maxwell=4
$ sacctmgr show associations format=account,grptres%30 | grep accre_g
 accre_gpu             gres/gpu:maxwell=4 
 accre_gpu                                
 accre_gpu                                
 accre_gpu                                
 accre_gpu                                
 accre_gpu                                
 accre_gpu                                
 accre_gpu                                
 accre_gpu                              

Then I submit jobs with directives like:

#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --partition=maxwell
#SBATCH --account=accre_gpu
#SBATCH --mem=100G
#SBATCH --gres=gpu:maxwell:1
#SBATCH --time=1-0



However, it does not appear that the limits are being imposed as I hoped they would. For instance, I am able to submit six jobs like the one above (with a 4 Maxwell GPU limit in place on the accre_gpu account) and they all start immediately.

When I check scontrol and squeue it looks like the maxwell TRES has not been registered with any of the jobs:

$ squeue -u frenchwr --Format=username,tres:%50,gres:.50
USER                TRES                                              GRES
frenchwr            cpu=1,mem=102400,node=1                                     gpu:maxwell:1

$ scontrol show job 8758733
JobId=8758733 JobName=test.slurm
   UserId=frenchwr(112888) GroupId=accre(36014)
   Priority=8908 Nice=0 Account=accre_gpu QOS=normal
   JobState=PENDING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2016-05-23T10:49:14 EligibleTime=2016-05-23T10:49:14
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=maxwell AllocNode:Sid=vmps12:33092
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=102400,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=100G MinTmpDiskNode=0
   Features=(null) Gres=gpu:maxwell:1 Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/gpfs22/home/frenchwr/hoomd/simple-rand-init-lj/million-particles/single-maxwell-gpu/test1/test.slurm
   WorkDir=/gpfs22/home/frenchwr/hoomd/simple-rand-init-lj/million-particles/single-maxwell-gpu/test1
   StdErr=/gpfs22/home/frenchwr/hoomd/simple-rand-init-lj/million-particles/single-maxwell-gpu/test1/slurm-8758733.out
   StdIn=/dev/null
   StdOut=/gpfs22/home/frenchwr/hoomd/simple-rand-init-lj/million-particles/single-maxwell-gpu/test1/slurm-8758733.out
   Power= SICP=0


Is this a bug or a known limitation with managing GPUs with TRES? Or am I going about this the wrong way?

Thanks, Will
Comment 1 Will French 2016-05-23 05:21:45 MDT
Created attachment 3127 [details]
slurm.conf
Comment 3 Alejandro Sanchez 2016-05-24 00:21:28 MDT
Hi Will. We're taking a look at it and will get back to you.
Comment 6 Will French 2016-05-24 01:27:19 MDT
(In reply to Alejandro Sanchez from comment #3)
> Hi Will. We're taking a look at it and will come back to you.

Thanks, Alejandro. We have users waiting to run on our GPU nodes and we'd like to have the configuration nailed down before allowing jobs to run.
Comment 7 Alejandro Sanchez 2016-05-24 04:18:25 MDT
Hi Will. I can reproduce this on 15.08.11, but I tried with 16.05.0rc2 and it works. In the newer version, when GrpTRES is exceeded, jobs remain pending (PD) with the AssocGrpGRES reason. I believe the exact commit where this was fixed is this one:

https://github.com/SchedMD/slurm/commit/0cd692967b

In fact, this bug is a duplicate of bug #2482: GRES TRES limits were only enforced at the coarse granularity of gpu:4, not at the finer-grained gpu:maxwell:4. I believe we should mark this bug as a duplicate and close it, unless you have any more questions. Please let me know what you think.
Comment 8 Will French 2016-05-24 06:24:14 MDT
(In reply to Alejandro Sanchez from comment #7)
> Hi Will. I can reproduce this on 15.08.11, but tried with 16.05.0rc2 and it
> works. In the newer version, when GrpTRES is exceeded jobs remain PD with
> AssocGrpGRES reason. I believe the exact commit where this was fixed is this
> one:
> 
> https://github.com/SchedMD/slurm/commit/0cd692967b
> 
> In fact, this bug is a duplicate of bug #2482, where GRES TRES were just
> enforced with granularity as gpu:4 and not the so fine grained
> gpu:maxwell:4. I believe we should mark this bug as duplicate and close it,
> unless you have any more questions. Please, let me know what do you think.

Thanks, Alejandro. We probably won't transition to 16.05.XX until the end of the summer. In the meantime, what would you suggest? I'm thinking through the options and here's what I've come up with: 

1. Apply patch
2. Remove GRES from fermi partition and apply generic gres/gpu to maxwell partition only
3. Apply generic gres/gpu across both partitions
4. Other option? Is there some sort of QOS or association limits we could apply that could give us the desired behavior?

To reiterate, what we're wanting to do is limit the number of Maxwell GPUs a group can access at a given time, but allow a group to use as many Fermi GPUs as they want.
Comment 9 Alejandro Sanchez 2016-05-24 21:41:43 MDT
> 1. Apply patch

The patch depends on other commits made against 16.05 and cannot be applied directly to 15.08.11; git reports a patch-failed error when I try to apply it.

> 2. Remove GRES from fermi partition and apply generic gres/gpu to maxwell
> partition only

I think we can preserve generic gres/gpu across both partitions.

> 3. Apply generic gres/gpu across both partitions

This is the option that I like the most. Let's see if this workaround works for you:

Change the Gres parameter for the NodeName lines in both partitions so that Gres=gpu:4. For the maxwell partition, also add QOS=maxwell and create a new QOS named 'maxwell' with:

$ sacctmgr create qos maxwell GrpTRES=gres/gpu=4

By default, partition QOS limits override the job's QOS limits, so there is no need to change OverPartQOS.

Remove the Type parameter in gres.conf too. I tested this: pending jobs in maxwell now show the (QOSGrpGRES) reason, and there are no limits on fermi.
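
Pulled together, the workaround above amounts to something like the following (a sketch using the node and partition names from this ticket; the elided options are unchanged from the original slurm.conf):

```
# slurm.conf -- generic GRES (no type) on both node sets,
# and a QOS attached to the maxwell partition only
NodeName=vmp[1243-1254] ... Gres=gpu:4 Feature=maxwell
PartitionName=maxwell Nodes=vmp[1243-1254] QOS=maxwell ... State=UP

# gres.conf -- drop the Type= parameter
NodeName=vmp[1243-1254] Name=gpu File=/dev/nvidia0 CPUs=0-2

# Create the QOS carrying the group GPU limit
$ sacctmgr create qos maxwell GrpTRES=gres/gpu=4
```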


> 4. Other option? Is there some sort of QOS or association limits we could
> apply that could give us the desired behavior?
> 
> To reiterate, what we're wanting to do is limit the number of Maxwell GPUs a
> group can access at a given time, but allow a group to use as many Fermi
> GPUs as they want.

I think the approach in point 3 is a good alternative. Please let me know whether this works for you or not.
Comment 10 Will French 2016-05-25 04:05:44 MDT
> This is the option that I like the most. Let's see if this workaround works
> for you:
> 
> Change GRES parameter for NodeNames in both partitions so that Gres=gpu:4.
> For partition maxwell, also add a QOS=maxwell and create a new qos named
> 'maxwell' with:
> 
> $ sacctmgr create qos maxwell GrpTRES=gres/gpu=4
> 
> The default is that partition QOS limits overrides job's qos limits, so no
> need to change OverPartQOS.
> 
> Remove the Type parameter in gres.conf too. I tested and now reason PD jobs
> in maxwell show (QOSGrpGRES) reason and I have no limits on fermi.
> 


We've made these changes and they work great - thanks!

One last snag - we want to be able to control the number of Maxwell GPUs that are accessible on a group-by-group basis. Actually, in practice we only want two levels for now: two groups should be allowed to use 20 Maxwell GPUs all at once, while all other groups should be allowed to use only 4 Maxwell GPUs at once. 

I was hoping I could just set a GrpTRES=gres/gpu=20 on the two groups in combination with the OverPartQOS flag set on the maxwell QOS, but that does not appear to work. 

Can you explain how we might accomplish this? We also have not played with QOS's much, we tend to set account-level limits instead.

Thanks!

Will
Comment 11 Alejandro Sanchez 2016-05-25 20:15:11 MDT
Will, could you try this?

# Remove OverPartQos flag from maxwell
$ sacctmgr modify qos maxwell set flags=-1

# Create a second QOS
$ sacctmgr add qos maxwell_20 GrpTRES=gres/gpu=20 Flags=OverPartQos

# Add the QOS maxwell_20 to the accounts that should be allowed 20 maxwell GPUs
$ sacctmgr modify account <allowed_acct> set qos=<other_qos_they_had>,maxwell_20

Now accounts with the maxwell_20 QOS should be allowed 20 running maxwell GPUs if they request jobs with --qos=maxwell_20. You can either train those groups to submit their jobs that way, or force it through a job_submit plugin: detect jobs requesting partition=maxwell with account=<one of the allowed accounts>, and force --qos=maxwell_20 for jobs matching both. I set up this scenario and it works for me. Please let me know if it also works for you.
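
For reference, the job_submit approach mentioned here could be sketched with Slurm's job_submit/lua plugin along these lines (the account names are placeholders, and this is an untested sketch rather than a drop-in file):

```lua
-- job_submit.lua (sketch): force the maxwell_20 QOS for jobs from
-- allowed accounts on the maxwell partition.
-- "accre_gpu" and "other_allowed_acct" are placeholder account names.
local allowed = { accre_gpu = true, other_allowed_acct = true }

function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.partition == "maxwell" and allowed[job_desc.account] then
        job_desc.qos = "maxwell_20"
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```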
Comment 12 Will French 2016-05-26 07:22:37 MDT
(In reply to Alejandro Sanchez from comment #11)
> Will, could you try this?
> 
> # Remove OverPartQos flag from maxwell
> $ sacctmgr modify qos maxwell set flags=-1
> 
> # Create a second QOS
> $ sacctmgr add qos maxwell_20 GrpTRES=gres/gpu=20 Flags=OverPartQos
> 
> # Add the QOS maxwell_20 to the accounts that should be allowed 20 maxwell
> GPUs
> $ sacctmgr modify account <allowed_acct> set
> qos=<other_qos_they_had>,maxwell_20
> 
> Now accounts with with maxwell_20 qos should be allowed to have 20 running
> maxwell GPUs if they request the job with --qos=maxwell_20. You can either
> train that groups to request the jobs that way or force it through a
> job_submit plugin by detecting comparing the job request partition=maxwell,
> account=<one of the allowed accounts> and in that case force
> --qos=maxwell_20 for the jobs satisfying that. I tried to setup this
> scenario and it works for me. Please, let me know if this also works for you.

Hey Alejandro, this is working well. In fact, it appears that --qos=maxwell_20 is not even needed at submit time. When I run:

sacctmgr modify account accre_gpu set qos=maxwell_20

the account has the qos tied to it, and all the associations with this account also have this qos applied automatically.

One thing I'm still failing to understand with qos's:

How do you limit a qos to a group of users? For example, if I run:

sacctmgr modify account accre_gpu set qos=normal

SLURM only allows me to run with up to 4 Maxwell GPUs at once because of the GrpTRES limit placed on the maxwell partition. However, I am able to run on up to 20 Maxwell GPUs if I submit jobs with --qos=maxwell_20. Is there a way to limit access to the qos to only those groups and users who have the qos assigned to their association?

Thanks again, Will
Comment 13 Alejandro Sanchez 2016-05-26 19:09:36 MDT
> Hey Alejandro, this is working well. In fact, it appears that
> --qos=maxwell_20 is not even needed at submit time. When I run:
> 
> sacctmgr modify account accre_gpu set qos=maxwell_20
> 
> the account has the qos tied to it, and all the associations with this
> account also have this qos applied automatically.

Specifying the QOS isn't needed unless the associations also have other QOSs applied; in that case the job might be launched under a different QOS and thus not get the 20-GPU limit. But if the association has only one QOS, you're right that there's no need for an explicit --qos=maxwell_20 parameter at request time.

> 
> One thing I'm still failing to understand with qos's:
> 
> How do you limit a qos to a group of users? For example, if I run:
> 
> sacctmgr modify account accre_gpu set qos=normal
> 
> SLURM only allows me to run with up to 4 Maxwell GPUs at once because of the
> GrpTRES limit placed on the maxwell partition. However, I am able to run on
> up to 20 Maxwell GPUs if I submit jobs with --qos=maxwell_20. Is there a way
> to limit access to the qos to only those groups and users who have the qos
> assigned to their association?
> 
> Thanks again, Will

Slurm should reject submissions with --qos=maxwell_20 when they are submitted under an association that does not have the maxwell_20 QOS:

$ sbatch --qos=maxwell_20 test.batch 
sbatch: error: Batch job submission failed: Invalid qos specification

Can you attach the output of:

$ sacctmgr show assoc tree
$ sacctmgr show qos

Thanks.
Comment 14 Will French 2016-05-26 23:01:12 MDT
> Slurm should reject submissions with --qos=maxwell_20 if submitted by an
> account in an assoc not having maxwell_20 qos:
> 
> $ sbatch --qos=maxwell_20 test.batch 
> sbatch: error: Batch job submission failed: Invalid qos specification


For some reason I'm allowed. I thought maybe it was because I have administrative privileges in SLURM, but even after I removed those I'm still allowed to submit jobs with the maxwell_20 QOS when none of my accounts or associations have it:

$ salloc --qos=maxwell_20 --partition=maxwell --account=accre_gpu
salloc: Granted job allocation 8823505

$ sacctmgr show user frenchwr
      User   Def Acct     Admin 
---------- ---------- --------- 
  frenchwr      accre      None

> 
> Can you attach the output of:
> 
> $ sacctmgr show assoc tree
> $ sacctmgr show qos

Attaching. Thanks.
Comment 15 Will French 2016-05-26 23:01:49 MDT
Created attachment 3157 [details]
sacctmgr show assoc tree
Comment 16 Will French 2016-05-26 23:02:11 MDT
Created attachment 3158 [details]
sacctmgr show qos
Comment 17 Alejandro Sanchez 2016-05-26 23:53:54 MDT
I think the solution is appending 'qos' to AccountingStorageEnforce. I see you have 'safe', which automatically implies 'limits' and 'associations', but not 'qos'. Note that changing AccountingStorageEnforce requires restarting the slurmctld daemon; 'scontrol reconfigure' is not enough.
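
Concretely, that is one line in slurm.conf followed by a controller restart (the restart command depends on how slurmctld is managed at your site; systemctl here is an assumption):

```
# slurm.conf -- 'safe' implies 'limits' and 'associations'; add 'qos'
AccountingStorageEnforce=safe,qos

# A full restart of slurmctld is required;
# 'scontrol reconfigure' does not pick up this change
$ systemctl restart slurmctld
```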

Also, I see you haven't added the QOSs 'maxwell_20' and 'maxwell_40' to any account. After updating AccountingStorageEnforce you should add these QOSs to the desired accounts.
Comment 18 Will French 2016-05-27 02:45:19 MDT
(In reply to Alejandro Sanchez from comment #17)
> I think the solution is appending 'qos' to AccountingStorageEnforce. I see
> you have 'safe', which automatically sets 'limits' and 'associations', but
> not 'qos'. Note that a change in AccountingStorageEnforce requires a restart
> on slurmctld daemon, not just 'scontrol reconfigure'.


Yep, this was the issue. We are now all set and we have the ability to enforce qos requests and GPU limits from SLURM. Woot!


> Also I see you've not added QOS 'maxwell_20' and 'maxwell_40' to any
> account. After updating AccountingStorageEnforce you should add these qos to
> the desired accounts.

Right, I had temporarily removed all accounts from the QOS while I was testing whether groups not pinned to a QOS could still submit to it.

We're happy on our end; feel free to close this ticket. Many thanks for your assistance! We look forward to the 16.05 release, which will make managing multiple GRES types cleaner.
Comment 19 Alejandro Sanchez 2016-05-29 18:09:01 MDT
Great, thanks for your cooperation. Closing the bug.