Ticket 7809

Summary: How to ignore specific qos maxwall for a specific partition
Product: Slurm    Reporter: Anthony DelSorbo <anthony.delsorbo>
Component: Configuration    Assignee: Nate Rini <nate>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: ---    
Version: 19.05.1   
Hardware: Linux   
OS: Linux   
Site: NOAA
NOAA Site: NESCC

Description Anthony DelSorbo 2019-09-25 06:47:32 MDT
We have created a partition, fgewf, for our GPU nodes that is limited to a specific QOS (windfall).  This QOS has the lowest priority.  Here's the configuration of the fgewf partition:

# OverSubscribe=EXCLUSIVE forces jobs to use the entire node
PartitionName=fgewf Nodes=h26n[01-16],h27n[01-16],h28n[01-16],h29n[01-16],h30n[01-18],h31n[01-18] \
  OverSubscribe=EXCLUSIVE \
  MaxTime=1-6:00:00 \
  DefMemPerCpu=12500 \
  AllowQos=windfall,admin

The same QOS is permitted across our configured partitions, and its MaxWall is set to 8:00:00.  On the fgewf partition, however, we want to permit runtimes of up to 30 hours.

The problem is that when a user submits a job to this partition, it is rejected with:

sbatch: error: QOSMaxWallDurationPerJobLimit

For this partition only, we would like the QOS MaxWall to be ignored and the partition's MaxTime enforced instead.

What are my options for achieving this goal?

Thanks,

Tony.
Comment 2 Nate Rini 2019-09-25 12:05:25 MDT
Tony,

Looking at how best to implement your request.

--Nate
Comment 4 Nate Rini 2019-09-25 14:57:35 MDT
(In reply to Anthony DelSorbo from comment #0)
> What are my options for achieving this goal?

I believe the cleanest solution is to use a Partition QOS, per https://slurm.schedmd.com/qos.html:
> The Partition QOS will override the job's QOS. If the opposite is desired you need to have the job's QOS have the 'OverPartQOS' flag which will reverse the order of precedence.

Here is a simple example:
1. Create the partition QOS 
> $ sacctmgr create qos fgewf-windfall set MaxWall=1-6:00:00
2. Assign the partition QOS in slurm.conf:
> # OverSubscribe=EXCLUSIVE forces jobs to use the entire node
> PartitionName=fgewf Nodes=h26n[01-16],h27n[01-16],h28n[01-16],h29n[01-16],h30n[01-18],h31n[01-18] \
>   OverSubscribe=EXCLUSIVE \
>   MaxTime=1-6:00:00 \
>   DefMemPerCpu=12500 \
>   AllowQos=windfall,admin Qos=fgewf-windfall
3. Restart slurmctld
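
The result can then be sanity-checked from the command line. A sketch, assuming the QOS was created with the name fgewf-windfall as above (run on a node with sacctmgr/scontrol access):

```shell
# Confirm the QOS exists and carries the intended wall limit
sacctmgr show qos fgewf-windfall format=Name,MaxWall

# Confirm the partition picked up the QOS after the slurmctld restart
scontrol show partition fgewf | grep -i qos
```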

Please give it a try.

Thanks,
--Nate
Comment 5 Anthony DelSorbo 2019-09-26 06:43:43 MDT
(In reply to Nate Rini from comment #4)
> (In reply to Anthony DelSorbo from comment #0)
> > What are my options for achieving this goal?
> 
> I believe the cleanest solution is to use a Partition Qos per

Nate,

Sorry about the missing information: we had that solution and moved away from it.  The customer wanted to reduce the total number of QOSs visible to users.  The point was: "we already have a qos of windfall, why do we need fgewindfall qos?"

Tony.
Comment 6 Nate Rini 2019-09-26 09:44:04 MDT
Tony,

Please provide the output of the following and a copy of your slurm.conf:
> sacctmgr show qos -p
> scontrol show part

Thanks,
--Nate
Comment 7 Anthony DelSorbo 2019-09-26 09:53:18 MDT
Thanks Nate.  See below.

[root@bqs1 ~]# sacctmgr show qos -p
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|
batch|20|00:00:00|||cluster|DenyOnLimit||1.000000||||||||||||||||||
windfall|1|00:00:00|||cluster|DenyOnLimit||0.000000||||||||||||||||||
debug|30|00:00:00|||cluster|DenyOnLimit||1.000000|||||||cpu=4104|||00:30:00||||||||
urgent|40|00:00:00|||cluster|DenyOnLimit||1.000000|||||||cpu=4104|||08:00:00||||||||
novel|50|00:00:00|||cluster|DenyOnLimit||1.000000||||||||||08:00:00|||||||cpu=4105|
admin|90|00:00:00|||cluster|DenyOnLimit||1.000000|||||||cpu=4104|||1-00:00:00||||||||
maximum-qos-normalization|100|00:00:00|||cluster|DenyOnLimit||1.000000|||||||||||||0|||0||


[root@bqs1 ~]# scontrol show part
PartitionName=fge
   AllowGroups=ALL AllowAccounts=nesccmgmt,rda-aidata,rda-esrl-ai,rda-ghpcs,rda-gpucm,rda-isp1,rda-nmfs,rda-rdo1,sena AllowQos=batch,debug,admin
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-06:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=h26n[01-16],h27n[01-16],h28n[01-16],h29n[01-16],h30n[01-18],h31n[01-18]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2000 TotalNodes=100 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=12500 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=cpu=1.0

PartitionName=fgewf
   AllowGroups=ALL AllowAccounts=ALL AllowQos=windfall,admin
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-06:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=h26n[01-16],h27n[01-16],h28n[01-16],h29n[01-16],h30n[01-18],h31n[01-18]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2000 TotalNodes=100 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=12500 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=cpu=1.0

PartitionName=hera
   AllowGroups=ALL AllowAccounts=ALL AllowQos=windfall,batch,debug,novel,urgent,admin
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=08:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=h1c[01-52],h[2-5]c[01-56],h6c[01-57],h8c[01-56],h9c[01-54],h10c[01-57],h[11-12]c[01-56],h[13,14]c[01-52],h[15-24]c[01-56],h[25]c[01-52]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=53120 TotalNodes=1328 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=2300 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=cpu=1.0

PartitionName=service
   AllowGroups=ALL AllowAccounts=ALL AllowQos=windfall,batch,debug,urgent,admin
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=1-00:00:00 MinNodes=0 LLN=YES MaxCPUsPerNode=32
   Nodes=hfe[01-12]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=480 TotalNodes=12 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=2300 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=cpu=1.0

PartitionName=bigmem
   AllowGroups=ALL AllowAccounts=ALL AllowQos=windfall,batch,debug,urgent,admin
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=08:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=h1m[01-04],h13m[01-04]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=320 TotalNodes=8 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=9600 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=cpu=1

PartitionName=admin
   AllowGroups=ALL AllowAccounts=nesccmgmt AllowQos=windfall,batch,debug,urgent,admin,novel
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=YES
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=h1c[01-52],h1m[01-04],h2c[01-56],h3c[01-56],h4c[01-56],h5c[01-56],h6c[01-57],h8c[01-56],h9c[01-54],h10c[01-57],h11c[01-56],h12c[01-56],h13c[01-52],h13m[01-04],h14c[01-52],h15c[01-56],h16c[01-56],h17c[01-56],h18c[01-56],h19c[01-56],h20c[01-56],h21c[01-56],h22c[01-56],h23c[01-56],h24c[01-56],h25c[01-52],h26n[01-16],h27n[01-16],h28n[01-16],h29n[01-16],h30n[01-18],h31n[01-18],hfe[01-12]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=55920 TotalNodes=1448 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=2300 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=cpu=1.0
Comment 8 Nate Rini 2019-09-26 10:09:34 MDT
Tony,

The debug, urgent, novel, and admin QOS all set MaxWall, and multiple partitions allow users a choice of several QOS. That rules out the simpler solution of just applying a wall limit per partition.

I believe the next best solution is a job_submit plugin that enforces the wall clock limits explicitly, per your site's rules. That avoids creating a separate QOS for every QOS/partition pair.
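
For illustration only, a job_submit/lua sketch of that idea. It assumes MaxWall is removed from the windfall QOS, that JobSubmitPlugins=lua is enabled in slurm.conf, and that the 8-hour default and the 30-hour fgewf cap from this ticket are the intended policy; the exact slurm.* constants and job_desc fields should be checked against your Slurm version:

```lua
-- job_submit.lua (sketch, not a drop-in): with MaxWall unset on the
-- windfall QOS, enforce per-partition wall limits here instead.
-- job_desc.time_limit is in minutes; slurm.NO_VAL means "not requested".

local DEFAULT_WINDFALL_CAP = 8 * 60    -- 08:00:00 on most partitions
local FGEWF_CAP            = 30 * 60   -- 1-06:00:00 on fgewf

function slurm_job_submit(job_desc, part_list, submit_uid)
   if job_desc.qos == "windfall" then
      local cap = DEFAULT_WINDFALL_CAP
      if job_desc.partition == "fgewf" then
         cap = FGEWF_CAP
      end
      -- Reject jobs with no explicit limit or one over the cap, so the
      -- site policy stays visible to the user at submit time.
      if job_desc.time_limit == slurm.NO_VAL or job_desc.time_limit > cap then
         slurm.log_user("windfall jobs are limited to %d minutes on %s",
                        cap, job_desc.partition or "the default partition")
         return slurm.ESLURM_INVALID_TIME_LIMIT
      end
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end
```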

How does that sound?

Thanks,
--Nate
Comment 9 Nate Rini 2019-10-07 15:14:07 MDT
Tony,

There hasn't been a response in over a week. Please reply to reopen this ticket.

Thanks,
--Nate