Ticket 2788

Summary: QOS's PartitionTimeLimit flag not honored anymore?
Product: Slurm
Component: Limits
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Version: 16.05.0
Version Fixed: 16.05.1
Hardware: Linux
OS: Linux
Site: Stanford
Reporter: Kilian Cavalotti <kilian>
Assignee: Tim Wickberg <tim>
CC: sthiell

Description Kilian Cavalotti 2016-06-01 03:23:02 MDT
Hi,

Just upgraded from 14.08.11 to 16.05, and our "long" QOS doesn't seem to work anymore.

We have a MaxTime of 2 days on our "normal" partition:

# scontrol show partition normal  | grep Time
   DefaultTime=02:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED


and a 7-day MaxWall on our "long" QOS, with the PartitionTimeLimit flag set so that the QOS can override the partition limit:

# sacctmgr show qos long format=name,flags%30,maxwall
      Name                          Flags     MaxWall
---------- ------------------------------ -----------
      long DenyOnLimit,PartitionTimeLimit  7-00:00:00


It worked perfectly fine in 14.11, users could submit jobs with a time limit greater than 2 days using the "long" QOS. Now, this fails:

$ srun --qos long -p normal --time=2-0:0:1  --pty bash
srun: error: Unable to allocate resources: Requested time limit is invalid (missing or exceeds some limit)

Works fine within the partition limits though:

$ srun --qos long -p normal --time=2-0:0:0  --pty bash
srun: job 8533628 queued and waiting for resources
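
For reference, the behavior expected from the two requests above can be sketched in a few lines of Python. This is only an illustration of the semantics described in this ticket, not Slurm's actual implementation (which is written in C). The names `parse_slurm_time`, `effective_time_limit` and `request_allowed` are hypothetical, and the parser only handles the `[days-]hours:minutes:seconds` forms that appear here.

```python
from datetime import timedelta


def parse_slurm_time(text: str) -> timedelta:
    """Parse a time string of the form [days-]hours:minutes:seconds."""
    # "2-0:0:1" -> days="2", clock="0:0:1"; "02:00:00" -> days="", clock="02:00:00"
    days, _, clock = text.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days or 0), hours=hours,
                     minutes=minutes, seconds=seconds)


def effective_time_limit(part_max: timedelta, qos_max: timedelta,
                         qos_flags: set) -> timedelta:
    """Assumed semantics: with PartitionTimeLimit set, the QOS MaxWall
    replaces the partition MaxTime; otherwise the stricter limit wins."""
    if "PartitionTimeLimit" in qos_flags:
        return qos_max
    return min(part_max, qos_max)


def request_allowed(requested: str, part_max: timedelta, qos_max: timedelta,
                    qos_flags: set) -> bool:
    """Return True if the requested time fits within the effective limit."""
    limit = effective_time_limit(part_max, qos_max, qos_flags)
    return parse_slurm_time(requested) <= limit


# Values taken from the scontrol and sacctmgr output in this ticket.
PART_MAX = parse_slurm_time("2-00:00:00")   # MaxTime of the "normal" partition
QOS_MAX = parse_slurm_time("7-00:00:00")    # MaxWall of the "long" QOS
QOS_FLAGS = {"DenyOnLimit", "PartitionTimeLimit"}

# Expected behavior: the flag lets the QOS override the partition limit.
print(request_allowed("2-0:0:1", PART_MAX, QOS_MAX, QOS_FLAGS))        # True
# If the flag is ignored, the partition limit applies and the job is rejected.
print(request_allowed("2-0:0:1", PART_MAX, QOS_MAX, {"DenyOnLimit"}))  # False
```

The second call matches what is observed on 16.05.0: the request is evaluated as if the PartitionTimeLimit flag were not set.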

We didn't change the rest of our configuration, and have:

# scontrol show config | grep -i PartLimits
EnforcePartLimits       = ANY

so this definitely looks like a behavior change between 14.11 and 15.06. Is this expected?

Thanks!

Comment 1 Kilian Cavalotti 2016-06-01 03:50:51 MDT
Quick correction about version numbers, sorry for the confusion: I meant we upgraded from 15.08.11 to 16.05, and it worked fine in 15.08.11.

Comment 2 Tim Wickberg 2016-06-01 03:56:04 MDT
I can reproduce this easily, I'm looking into a fix now.

As a possible temporary workaround, EnforcePartLimits=no looks like it'll do the right thing, except that you may end up with invalid jobs queued in the meantime.

Comment 3 Kilian Cavalotti 2016-06-01 04:02:25 MDT
(In reply to Tim Wickberg from comment #2)
> I can reproduce this easily, I'm looking into a fix now.

Great, thanks Tim!

> As a possible temporary workaround, EnforcePartLimits=no looks like it'll do
> the right thing, except that you may end up with invalid jobs queued in the
> meantime.

Thanks for the suggestion.

Comment 4 Tim Wickberg 2016-06-02 09:56:28 MDT
Fixed in commit 377b448a34f7b.

Patch is available here if you want to apply it ahead of 16.05.1 being released: https://github.com/SchedMD/slurm/commit/377b448a34f7bbb.patch

Comment 5 Kilian Cavalotti 2016-06-02 10:24:05 MDT
Hi Tim, 

Awesome! Applied the patch, and the issue looks resolved now. Thank you!