| Summary: | QOS's PartitionTimeLimit flag not honored anymore? | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | Limits | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | sthiell |
| Version: | 16.05.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Stanford | Slinky Site: | --- |
| CLE Version: | | Version Fixed: | 16.05.1 |

Quick correction about version numbers, sorry for the confusion: I meant we upgraded from 15.08.11 to 16.05, and it worked fine in 15.08.11.

I can reproduce this easily, I'm looking into a fix now.

As a possible temporary workaround, EnforcePartLimits=no looks like it'll do the right thing, except that you may end up with invalid jobs queued in the meantime.

(In reply to Tim Wickberg from comment #2)
> I can reproduce this easily, I'm looking into a fix now.

Great, thanks Tim!

> As a possible temporary workaround, EnforcePartLimits=no looks like it'll do
> the right thing, except that you may end up with invalid jobs queued in the
> meantime.

Thanks for the suggestion.

Fixed in commit 377b448a34f7b. Patch is available here if you want to apply it ahead of 16.05.1 being released:

https://github.com/SchedMD/slurm/commit/377b448a34f7bbb.patch

Hi Tim,

Awesome! Applied the patch, and the issue looks resolved now. Thank you!

Hi,

Just upgraded from 14.08.11 to 16.05, and our "long" QOS doesn't seem to work anymore.

We have a MaxTime of 2 days on our "normal" partition:

    # scontrol show partition normal | grep Time
       DefaultTime=02:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
       MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED

and a 7 days MaxWall on our "long" QOS, with the PartitionTimeLimit flag set, so the QOS could override the partition limit:

    # sacctmgr show qos long format=name,flags%30,maxwall
          Name                          Flags     MaxWall
    ---------- ------------------------------ -----------
          long DenyOnLimit,PartitionTimeLimit  7-00:00:00

It worked perfectly fine in 14.11, users could submit jobs with a time limit greater than 2 days using the "long" QOS. Now, this fails:

    $ srun --qos long -p normal --time=2-0:0:1 --pty bash
    srun: error: Unable to allocate resources: Requested time limit is invalid (missing or exceeds some limit)

Works fine within the partition limits though:

    $ srun --qos long -p normal --time=2-0:0:0 --pty bash
    srun: job 8533628 queued and waiting for resources

We didn't change the rest of our configuration, and have:

    # scontrol show config | grep -i PartLimits
    EnforcePartLimits       = ANY

so this definitely looks like a behavior change between 14.11 and 15.06. Is this expected?

Thanks!
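
For readers who want to reason about the behavior the report expects, here is a minimal Python sketch of the submit-time check as the reporter describes it: the partition MaxTime applies unless the QOS carries the PartitionTimeLimit flag, and the QOS MaxWall applies either way. This is only a model of the expected outcome, not Slurm's actual implementation. The function names `parse_slurm_time` and `time_limit_allowed` are hypothetical, and the parser handles only the `days-hours:minutes:seconds` and `hours:minutes:seconds` forms that appear in this report.

```python
from typing import Optional


def parse_slurm_time(text: str) -> int:
    """Convert a Slurm-style time string to seconds.

    Assumption: only "D-H:M:S" and "H:M:S" are supported here, which
    covers the values in this report (e.g. "2-00:00:00", "2-0:0:1").
    """
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def time_limit_allowed(
    requested: str,
    partition_max_time: str,
    qos_max_wall: Optional[str],
    qos_flags: str,
) -> bool:
    """Model of the expected check (hypothetical, not Slurm source code).

    qos_flags is the comma-separated Flags column from sacctmgr,
    e.g. "DenyOnLimit,PartitionTimeLimit".
    """
    flags = {flag for flag in qos_flags.split(",") if flag}
    wanted = parse_slurm_time(requested)

    # The partition limit is enforced unless the QOS is allowed to override it.
    if "PartitionTimeLimit" not in flags:
        if wanted > parse_slurm_time(partition_max_time):
            return False

    # The QOS's own MaxWall still caps the job, with or without the flag.
    if qos_max_wall is not None and wanted > parse_slurm_time(qos_max_wall):
        return False

    return True


if __name__ == "__main__":
    # The failing submission from the report; under the expected behavior
    # it should be accepted because the "long" QOS overrides the partition.
    print(time_limit_allowed("2-0:0:1", "2-00:00:00", "7-00:00:00",
                             "DenyOnLimit,PartitionTimeLimit"))
```

With the values from the report, the sketch accepts `--time=2-0:0:1` under the "long" QOS and would reject the same request for a QOS that lacks the PartitionTimeLimit flag, which matches the behavior described for the earlier release.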