Ticket 7809

Summary: How to ignore specific qos maxwall for a specific partition
Product: Slurm    Reporter: Anthony DelSorbo <anthony.delsorbo>
Component: Configuration    Assignee: Nate Rini <nate>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: ---    
Version: 19.05.1   
Hardware: Linux   
OS: Linux   
Site: NOAA
NOAA Site: NESCC

Description Anthony DelSorbo 2019-09-25 06:47:32 MDT
We have created a partition, fgewf, for our GPU nodes that is limited to a specific QOS (windfall).  This QOS has the lowest priority.  Here's the configuration of the fgewf partition:

# OverSubscribe=EXCLUSIVE forces jobs to use the entire node
PartitionName=fgewf Nodes=h26n[01-16],h27n[01-16],h28n[01-16],h29n[01-16],h30n[01-18],h31n[01-18] \
  OverSubscribe=EXCLUSIVE \
  MaxTime=1-6:00:00 \
  DefMemPerCpu=12500 \
  AllowQos=windfall,admin

The same QOS is permitted across our configured partitions, and its MaxWall is set to 8:00:00.  On the fgewf partition, however, we want to permit runtimes of up to 30 hours.

The problem is that when a user submits a job to this partition, it is rejected with:

sbatch: error: QOSMaxWallDurationPerJobLimit

For this partition only, we would like the QOS MaxWall to be ignored and the partition's MaxTime enforced instead.

What are my options for achieving this goal?

Thanks,

Tony.
Comment 2 Nate Rini 2019-09-25 12:05:25 MDT
Tony,

Looking at how best to implement your request.

--Nate
Comment 4 Nate Rini 2019-09-25 14:57:35 MDT
(In reply to Anthony DelSorbo from comment #0)
> What are my options for achieving this goal?

I believe the cleanest solution is to use a Partition QOS, per https://slurm.schedmd.com/qos.html:
> The Partition QOS will override the job's QOS. If the opposite is desired you need to have the job's QOS have the 'OverPartQOS' flag which will reverse the order of precedence.

Here is a simple example:
1. Create the partition QOS 
> $ sacctmgr create qos fgewf-windfall set MaxWall=1-6:00:00
2. Assign the partition QOS in slurm.conf:
> # OverSubscribe=EXCLUSIVE forces jobs to use the entire node
> PartitionName=fgewf Nodes=h26n[01-16],h27n[01-16],h28n[01-16],h29n[01-16],h30n[01-18],h31n[01-18] \
>   OverSubscribe=EXCLUSIVE \
>   MaxTime=1-6:00:00 \
>   DefMemPerCpu=12500 \
>   AllowQos=windfall,admin Qos=fgewf-windfall
3. Restart slurmctld
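
The result can then be sanity-checked from the command line. A sketch, assuming the QOS was created with the name fgewf-windfall as above (run on a node with sacctmgr/scontrol access):

```shell
# Confirm the QOS exists and carries the intended wall limit
sacctmgr show qos fgewf-windfall format=Name,MaxWall

# Confirm the partition picked up the QOS after the slurmctld restart
scontrol show partition fgewf | grep -i qos
```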

Please give it a try.

Thanks,
--Nate
Comment 5 Anthony DelSorbo 2019-09-26 06:43:43 MDT
(In reply to Nate Rini from comment #4)
> (In reply to Anthony DelSorbo from comment #0)
> > What are my options for achieving this goal?
> 
> I believe the cleanest solution is to use a Partition Qos per

Nate,

Sorry about the missing information: we had that solution and moved away from it.  The customer wanted to reduce the total number of QOSs visible to users.  The point was: "we already have a qos of windfall, why do we need fgewindfall qos?"

Tony.
Comment 6 Nate Rini 2019-09-26 09:44:04 MDT
Tony,

Please provide the output of the following and a copy of your slurm.conf:
> sacctmgr show qos -p
> scontrol show part

Thanks,
--Nate
Comment 7 Anthony DelSorbo 2019-09-26 09:53:18 MDT
Thanks Nate.  See below.

[root@bqs1 ~]# sacctmgr show qos -p
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|
batch|20|00:00:00|||cluster|DenyOnLimit||1.000000||||||||||||||||||
windfall|1|00:00:00|||cluster|DenyOnLimit||0.000000||||||||||||||||||
debug|30|00:00:00|||cluster|DenyOnLimit||1.000000|||||||cpu=4104|||00:30:00||||||||
urgent|40|00:00:00|||cluster|DenyOnLimit||1.000000|||||||cpu=4104|||08:00:00||||||||
novel|50|00:00:00|||cluster|DenyOnLimit||1.000000||||||||||08:00:00|||||||cpu=4105|
admin|90|00:00:00|||cluster|DenyOnLimit||1.000000|||||||cpu=4104|||1-00:00:00||||||||
maximum-qos-normalization|100|00:00:00|||cluster|DenyOnLimit||1.000000|||||||||||||0|||0||


[root@bqs1 ~]# scontrol show part
PartitionName=fge
   AllowGroups=ALL AllowAccounts=nesccmgmt,rda-aidata,rda-esrl-ai,rda-ghpcs,rda-gpucm,rda-isp1,rda-nmfs,rda-rdo1,sena AllowQos=batch,debug,admin
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-06:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=h26n[01-16],h27n[01-16],h28n[01-16],h29n[01-16],h30n[01-18],h31n[01-18]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2000 TotalNodes=100 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=12500 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=cpu=1.0

PartitionName=fgewf
   AllowGroups=ALL AllowAccounts=ALL AllowQos=windfall,admin
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-06:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=h26n[01-16],h27n[01-16],h28n[01-16],h29n[01-16],h30n[01-18],h31n[01-18]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2000 TotalNodes=100 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=12500 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=cpu=1.0

PartitionName=hera
   AllowGroups=ALL AllowAccounts=ALL AllowQos=windfall,batch,debug,novel,urgent,admin
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=08:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=h1c[01-52],h[2-5]c[01-56],h6c[01-57],h8c[01-56],h9c[01-54],h10c[01-57],h[11-12]c[01-56],h[13,14]c[01-52],h[15-24]c[01-56],h[25]c[01-52]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=53120 TotalNodes=1328 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=2300 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=cpu=1.0

PartitionName=service
   AllowGroups=ALL AllowAccounts=ALL AllowQos=windfall,batch,debug,urgent,admin
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=1-00:00:00 MinNodes=0 LLN=YES MaxCPUsPerNode=32
   Nodes=hfe[01-12]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=480 TotalNodes=12 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=2300 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=cpu=1.0

PartitionName=bigmem
   AllowGroups=ALL AllowAccounts=ALL AllowQos=windfall,batch,debug,urgent,admin
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=08:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=h1m[01-04],h13m[01-04]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=320 TotalNodes=8 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=9600 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=cpu=1

PartitionName=admin
   AllowGroups=ALL AllowAccounts=nesccmgmt AllowQos=windfall,batch,debug,urgent,admin,novel
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=YES
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=h1c[01-52],h1m[01-04],h2c[01-56],h3c[01-56],h4c[01-56],h5c[01-56],h6c[01-57],h8c[01-56],h9c[01-54],h10c[01-57],h11c[01-56],h12c[01-56],h13c[01-52],h13m[01-04],h14c[01-52],h15c[01-56],h16c[01-56],h17c[01-56],h18c[01-56],h19c[01-56],h20c[01-56],h21c[01-56],h22c[01-56],h23c[01-56],h24c[01-56],h25c[01-52],h26n[01-16],h27n[01-16],h28n[01-16],h29n[01-16],h30n[01-18],h31n[01-18],hfe[01-12]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=55920 TotalNodes=1448 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=2300 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=cpu=1.0
Comment 8 Nate Rini 2019-09-26 10:09:34 MDT
Tony,

The debug, urgent, novel, and admin QOS all set MaxWall, and multiple partitions allow users a choice of several QOS. That rules out the simpler solution of just applying a wall limit per partition.

I believe the next best solution is a job_submit plugin that enforces the wall clock limits explicitly, per your site's rules. That avoids creating a separate QOS for every QOS/partition pair.
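
For illustration only, a job_submit/lua sketch of that idea. It assumes MaxWall is removed from the windfall QOS, that JobSubmitPlugins=lua is enabled in slurm.conf, and that the 8-hour default and the 30-hour fgewf cap from this ticket are the intended policy; the exact slurm.* constants and job_desc fields should be checked against your Slurm version:

```lua
-- job_submit.lua (sketch, not a drop-in): with MaxWall unset on the
-- windfall QOS, enforce per-partition wall limits here instead.
-- job_desc.time_limit is in minutes; slurm.NO_VAL means "not requested".

local DEFAULT_WINDFALL_CAP = 8 * 60    -- 08:00:00 on most partitions
local FGEWF_CAP            = 30 * 60   -- 1-06:00:00 on fgewf

function slurm_job_submit(job_desc, part_list, submit_uid)
   if job_desc.qos == "windfall" then
      local cap = DEFAULT_WINDFALL_CAP
      if job_desc.partition == "fgewf" then
         cap = FGEWF_CAP
      end
      -- Reject jobs with no explicit limit or one over the cap, so the
      -- site policy stays visible to the user at submit time.
      if job_desc.time_limit == slurm.NO_VAL or job_desc.time_limit > cap then
         slurm.log_user("windfall jobs are limited to %d minutes on %s",
                        cap, job_desc.partition or "the default partition")
         return slurm.ESLURM_INVALID_TIME_LIMIT
      end
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end
```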

How does that sound?

Thanks,
--Nate
Comment 9 Nate Rini 2019-10-07 15:14:07 MDT
Tony,

There hasn't been a response in over a week. Please reply to reopen this ticket.

Thanks,
--Nate