Ticket 2788 - QOS's PartitionTimeLimit flag not honored anymore?
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 16.05.0
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Tim Wickberg
 
Reported: 2016-06-01 03:23 MDT by Kilian Cavalotti
Modified: 2016-06-02 10:24 MDT
Site: Stanford
Version Fixed: 16.05.1


Description Kilian Cavalotti 2016-06-01 03:23:02 MDT
Hi,

Just upgraded from 14.08.11 to 16.05, and our "long" QOS doesn't seem to work anymore.

We have a MaxTime of 2 days on our "normal" partition:

# scontrol show partition normal  | grep Time
   DefaultTime=02:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED


and a 7-day MaxWall on our "long" QOS, with the PartitionTimeLimit flag set so that the QOS can override the partition limit:

# sacctmgr show qos long format=name,flags%30,maxwall
      Name                          Flags     MaxWall
---------- ------------------------------ -----------
      long DenyOnLimit,PartitionTimeLimit  7-00:00:00


It worked perfectly fine in 14.11, users could submit jobs with a time limit greater than 2 days using the "long" QOS. Now, this fails:

$ srun --qos long -p normal --time=2-0:0:1  --pty bash
srun: error: Unable to allocate resources: Requested time limit is invalid (missing or exceeds some limit)

Works fine within the partition limits though:

$ srun --qos long -p normal --time=2-0:0:0  --pty bash
srun: job 8533628 queued and waiting for resources
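
For reference, the behavior expected here can be written down as a small model. This is only a sketch of the intended rule (a QOS carrying PartitionTimeLimit lets its MaxWall take precedence over the partition's MaxTime), not Slurm's actual implementation. The function names parse_slurm_time and time_limit_accepted are made up for illustration, limits are compared in seconds as a simplification, and values such as UNLIMITED are not handled.

```python
def parse_slurm_time(spec: str) -> int:
    """Convert a Slurm-style time string to seconds.

    Accepted forms: MM, MM:SS, HH:MM:SS, D-HH, D-HH:MM, D-HH:MM:SS.
    Hypothetical helper; "UNLIMITED" and similar keywords are not handled.
    """
    days = 0
    if "-" in spec:
        # days-hours[:minutes[:seconds]]
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        fields += [0] * (3 - len(fields))
        hours, minutes, seconds = fields
    else:
        fields = [int(f) for f in spec.split(":")]
        if len(fields) == 1:      # minutes
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:    # minutes:seconds
            hours, minutes, seconds = 0, fields[0], fields[1]
        else:                     # hours:minutes:seconds
            hours, minutes, seconds = fields
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def time_limit_accepted(requested, partition_max_time, qos_max_wall, qos_flags):
    """Model of the expected check (an assumption, not Slurm source code)."""
    req = parse_slurm_time(requested)
    # The QOS MaxWall always applies.
    if req > parse_slurm_time(qos_max_wall):
        return False
    # With PartitionTimeLimit set, the QOS overrides the partition MaxTime.
    if "PartitionTimeLimit" in qos_flags:
        return True
    return req <= parse_slurm_time(partition_max_time)


flags = {"DenyOnLimit", "PartitionTimeLimit"}
# The request that is rejected on 16.05 should be accepted under this model.
print(time_limit_accepted("2-0:0:1", "2-00:00:00", "7-00:00:00", flags))
```

Under this model, the --time=2-0:0:1 request above is accepted with the "long" QOS, and would only be rejected if the flag were missing or the request exceeded the 7-day MaxWall.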

We didn't change the rest of our configuration, and have:

# scontrol show config | grep -i PartLimits
EnforcePartLimits       = ANY

so this definitely looks like a behavior change between 14.11 and 15.06. Is this expected?

Thanks!
Comment 1 Kilian Cavalotti 2016-06-01 03:50:51 MDT
Quick correction about version numbers, sorry for the confusion: I meant we upgraded from 15.08.11 to 16.05, and it worked fine in 15.08.11.
Comment 2 Tim Wickberg 2016-06-01 03:56:04 MDT
I can reproduce this easily, I'm looking into a fix now.

As a possible temporary workaround, EnforcePartLimits=no looks like it'll do the right thing, except that you may end up with invalid jobs queued in the meantime.
Comment 3 Kilian Cavalotti 2016-06-01 04:02:25 MDT
(In reply to Tim Wickberg from comment #2)
> I can reproduce this easily, I'm looking into a fix now.

Great, thanks Tim!

> As a possible temporary workaround, EnforcePartLimits=no looks like it'll do
> the right thing, except that you may end up with invalid jobs queued in the
> meantime.

Thanks for the suggestion.
Comment 4 Tim Wickberg 2016-06-02 09:56:28 MDT
Fixed in commit 377b448a34f7b.

Patch is available here if you want to apply it ahead of 16.05.1 being released: https://github.com/SchedMD/slurm/commit/377b448a34f7bbb.patch
Comment 5 Kilian Cavalotti 2016-06-02 10:24:05 MDT
Hi Tim, 

Awesome! Applied the patch, and the issue looks resolved now. Thank you!