| Summary: | backfill scheduler not considering GrpNodes limit on partition qos [was bf_busy_node issue] | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | slurmctld | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | tim |
| Version: | 15.08.3 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | NERSC | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 15.08.5 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | slurm.conf | ||
|
Description
Doug Jacobsen
2015-11-10 06:56:32 MST
gah, I see now that bf_busy_nodes only works with select/cons_res, and we don't actually use cons_res -- we just have this other_cons_res capability of the select/cray plugin. So I guess bf_busy_nodes really won't work with select/cray -- is that right?

Can you attach the full slurm.conf? cons_res or linear is still used depending on your configuration; the select/cray plugin layers on top of those. select/linear is the default, but bf_busy_nodes should still work if you have other_cons_res set in SelectTypeParameters.

Can you better define "draining"? At some point the system is always going to hold off backfilling other jobs -- we have a strict backfill algorithm, and if jobs that could otherwise run now on open resources would delay the anticipated start time of the highest-priority job, we won't schedule them to start. I'm not sure if this is the issue you're describing or not, or exactly how you may want to adjust some of the parameters to suit. - Tim

The behavior we were seeing without bf_busy_nodes set (again, with select/cray using other_cons_res) is that a big job would be in the top priority spot. Nodes would be left idle to allow it to start. The time for it to start would come up, and it would get delayed. Sometimes a different job would start instead (lower priority, I think). Looking further down the backfill map, I'd also see a non-overlapping set of nodes selected for later jobs (e.g., job1 nid001-600, then job2 nid400-700). One concern I had about the job getting delayed was that something might not be playing well with GrpNodes set on the partition QOS (e.g., starting would exceed that limit, so it wouldn't start), but I couldn't gather enough evidence to determine whether this was the case. Turning on bf_busy_nodes, however, has the effect of keeping the system extremely busy -- but as far as I can tell no jobs are getting reservations at all; it's just picking some job that can start given the available resources.
I will turn off bf_busy_nodes tomorrow morning to once again try to get backfill scheduling working correctly. -Doug

Hi,
I removed bf_busy_nodes and backfill started planning again. It appears that the GrpNodes limit on the regular partition QOS is not being considered. The system has an 864 node job to schedule. There are 832 nodes available (in total -- but really only 632 of these would be accessible owing to the GrpNodes limit). So even with an additional 32-node job finishing, it will not be able to start:
nid00837:~ # squeue --start --sort=S | grep -v "N/A" | head -n 3
JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON)
19407 regular hmc_72_8 bjoo PD 2015-11-11T11:59:16 864 nid0[0025-0056,0111- (Resources)
20725 regular make_pro smeinel PD 2015-11-11T12:19:44 128 nid0[0059-0063,0080- (Priority)
nid00837:~ # squeue -t R --sort=L
JOBID USER ACCOUNT NAME PARTITION QOS NODES TIME_LIMIT TIME ST
23722 jcorrea ngbi sh regular normal 1 5:00 3:18 R
21061 psteinbr m2078 l328f21b64 regular normal 32 3:30:00 3:26:32 R
23720 baustin mpccc run_ep.sru regular normal 1 10:00 4:50 R
23724 dolmsted m1090 lbe_bi_v1_ debug normal 2 10:00 1:50 R
21065 psteinbr m2078 l328f21b64 regular normal 32 3:30:00 3:19:55 R
23718 tslo m1393 myJob debug normal 1 30:00 12:06 R
21066 psteinbr m2078 l328f21b64 regular normal 32 3:30:00 3:11:39 R
21067 psteinbr m2078 l328f21b64 regular normal 32 3:30:00 3:06:10 R
23719 dks mp27 test.scrip debug normal 36 30:00 6:04 R
23725 qyang m1867 Secwrf debug normal 35 30:00 1:32 R
21076 psteinbr m2078 l328f21b64 regular normal 32 3:30:00 2:37:55 R
21077 psteinbr m2078 l328f21b64 regular normal 32 3:30:00 2:37:24 R
23699 jllyons m1900 vpbs.com regular normal 4 2:30:00 6:21 R
20721 smeinel m789 make_props regular normal 128 6:00:00 1:02:43 R
20722 smeinel m789 make_props regular normal 128 6:00:00 1:02:43 R
20723 smeinel m789 make_props regular normal 128 6:00:00 1:02:43 R
20724 smeinel m789 make_props regular normal 128 6:00:00 1:02:43 R
23588 dolmsted m1090 bi_v2_d0 regular normal 2 12:00:00 3:59:42 R
22674 luisruiz m657 my_job regular normal 2 12:00:00 3:57:08 R
22676 luisruiz m657 my_job regular normal 2 12:00:00 3:57:08 R
23621 dolmsted m1090 lbe_bi_v1_ regular normal 2 12:00:00 3:05:01 R
nid00837:~ # sinfo -p regular
PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE NODELIST
regular up 1-infini 12:00:00 64 2:16:2 2 down* nid00[663,881]
regular up 1-infini 12:00:00 64 2:16:2 2 draining nid0[1338,1785]
regular up 1-infini 12:00:00 64 2:16:2 792 allocated nid0[0024,0057-0063,0080-0083,0088-0110,0238-0255,0272-0319,0336-0383,0408-0423,0561-0575,0596-0611,0660-0662,0696-0703,0720-0731,1056-1087,1104-1151,1172-1213,1232-1235,1240-1279,1300-1309,1326-1329,1489-1511,1564-1595,1786-1791,1808-1839,1844-1855,1872-1919,1940-1983,2000-2003,2008-2047,2068-2111,2128-2148,2201-2239,2256-2303]
regular up 1-infini 12:00:00 64 2:16:2 832 idle nid0[0025-0056,0111-0127,0148-0191,0208-0211,0216-0237,0424-0447,0464-0467,0472-0511,0532-0560,0612-0639,0656-0659,0664-0695,0732-0767,0788-0831,0848-0851,0856-0880,0882-0895,0916-0959,0980-1023,1040-1055,1214-1215,1310-1325,1330-1337,1339-1343,1364-1407,1424-1471,1488,1512-1535,1556-1563,1596-1599,1616-1619,1624-1663,1684-1727,1748-1784,1840-1843,2149-2175,2192-2200]
nid00837:~ #
It ended up starting jobs from further down the list -- so all the draining time was wasted and it has now scheduled the job for later -- so I assume we will drain more.
nid00837:~ # squeue --start --sort=S | grep -v "N/A" | head -n 3
JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON)
19407 regular hmc_72_8 bjoo PD 2015-11-11T16:53:05 864 nid0[0059-0063,0080- (Resources)
20727 regular make_pro smeinel PD 2015-11-11T16:53:05 128 nid0[0238-0255,0272- (AssocGrpCPURunMinutesLimit)
nid00837:~ #
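The group-limit arithmetic Doug describes can be sketched roughly as follows. This is an illustrative Python sketch using numbers from the report above, not slurmctld's actual accounting (which tracks per-QOS TRES usage in much more detail); the helper name `nodes_accessible` is an invention for this example.

```python
# Hedged sketch of the GrpNodes-style bookkeeping described above.
# Numbers come from this bug report; the real slurmctld accounting is
# more involved, so treat this purely as an illustration.

def nodes_accessible(grp_nodes_limit: int, nodes_in_use: int, idle_nodes: int) -> int:
    """Nodes a new job under this QOS could actually claim right now."""
    headroom = max(0, grp_nodes_limit - nodes_in_use)
    return min(idle_nodes, headroom)

# Figures from the report: part_reg has GrpTRES node=1428 (per the
# sacctmgr output), sinfo shows 792 allocated and 832 idle nodes in
# the regular partition, and job 19407 requests 864 nodes.
grp_limit = 1428
in_use = 792
idle = 832
job_size = 864

accessible = nodes_accessible(grp_limit, in_use, idle)
# Doug's 632 figure is in this neighborhood (1428 - 792 = 636); the
# exact count depends on which node states are charged against the limit.
print(f"accessible={accessible}, job needs {job_size}, "
      f"can start: {accessible >= job_size}")
```

Even with all 832 idle nodes, the headroom under the group limit is what actually bounds the job, which matches the behavior Doug reports.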
Created attachment 2406 [details]
slurm.conf
please find attached the slurm.conf

none of the jobs in the system right now are using the job-level qos that define OverPartQOS

nid00837:~ # sacctmgr show qos -p
Name|Priority|GraceTime|Preempt|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MinTRES|
normal|5000|00:00:00||cluster|DenyOnLimit||1.000000|||||||||||||||
premium|10000|00:00:00||cluster|DenyOnLimit||1.000000|||||||||||||||
low|1000|00:00:00||cluster|DenyOnLimit||1.000000|||||||||||||||
serialize|5000|00:00:00||cluster|DenyOnLimit,OverPartQOS||1.000000|node=1628|||1|||||||||||
scavenger|0|00:00:00||cluster|DenyOnLimit||1.000000|||||||||||||||
normal_regular_0|5000|00:00:00||cluster|DenyOnLimit,OverPartQOS||1.000000|node=1628|||||||||02:00:00||1|||
normal_regular_1|5000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||04:00:00||1|||
normal_regular_2|5000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||06:00:00|||||
normal_regular_3|5000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||06:00:00|||||
normal_regular_4|5000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||12:00:00|||||
premium_regular_0|10000|00:00:00||cluster|DenyOnLimit,OverPartQOS||1.000000|node=1628|||||||||02:00:00||1|||
premium_regular_1|10000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||04:00:00||1|||
premium_regular_2|10000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||06:00:00|||||
premium_regular_3|10000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||06:00:00|||||
premium_regular_4|10000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||12:00:00|||||
low_regular_0|1000|00:00:00||cluster|DenyOnLimit,OverPartQOS||1.000000|node=1628|||||||||02:00:00||1|||
low_regular_1|1000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||04:00:00||1|||
low_regular_2|1000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||06:00:00|||||
low_regular_3|1000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||06:00:00|||||
low_regular_4|1000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||12:00:00|||||
part_debug|0|00:00:00||cluster|DenyOnLimit||1.000000|node=160||||||node=128|||||1|||
part_reg|0|00:00:00||cluster|DenyOnLimit||1.000000|node=1428||||||||||||||
part_immed|0|00:00:00||cluster|DenyOnLimit||1.000000|node=32||||||||||||||
part_shared|0|00:00:00||cluster|DenyOnLimit||1.000000|||||||||||||||
killable|0|00:00:00||cluster|DenyOnLimit||1.000000|node=1628||||||||||||||
nid00837:~ #

Thanks, Alex and I are looking into this now. Please understand that Veterans' Day is a scheduled holiday for SchedMD, and as such most of our staff are out of the office today. - Tim

Just to make sure I'm understanding right: #19407 has the highest priority, so we start reserving and setting aside resources in anticipation of launch. But you don't think that the 864 nodes requested would ever become available to that job due to a lower GrpNodes limit? I don't see why you're saying "only 632 of these would be accessible owing to the GrpNodes limit" -- where do those numbers (632 and/or -200) come from? I see the part_reg qos is set to node=1628, which means it shouldn't be a factor in what you've described.

These jobs that "steal" resources out from under #19407 -- which partition are they submitted under? In the config, at least, both your "realtime" and "debug" partitions have nodes=all and a higher priority value, which would mean anything submitted to those would schedule and run ahead of anything under regular. (Although the priority for regular in your slurm.conf is "0000", which I'm assuming is a mistake.)

One thing I've been confused by before is that priority between partitions is considered first, at a whole different level than the calculated priority values within multifactor. If anything on a higher-priority partition would use nodes that overlap with a lower-priority partition, that higher priority always wins. The multifactor priorities are only compared between partitions of equal priority levels. That may explain the behavior you're seeing, unless you've reset those partition priorities through scontrol after the last restart.

Given that your config doesn't seem to match perfectly to what is currently running, would you be able to attach the output of

scontrol show assoc
scontrol show part

as well? - Tim

Doug Jacobsen <dmjacobsen@lbl.gov> changed:
Summary: bf_busy_nodes not reserving resources? -> backfill scheduler not considering GrpNodes limit on partition qos [was bf_busy_node issue]
Severity: 3 - Medium Impact -> 2 - High Impact

Hi Tim,
From Above:
part_reg|0|00:00:00||cluster|DenyOnLimit||1.000000|node=1428||||||||||||||
We have a QOS (job class) which can reach node=1628 (the full system), but none of those jobs are in the system at present.
We have over 4000 users in the system now, so I assume you only want the partitions and QOSes from the scontrol cache.
dmj@cori01:~> scontrol show part
PartitionName=debug
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=part_debug
DefaultTime=00:10:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=00:30:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=nid0[0024-0063,0080-0083,0088-0127,0148-0191,0208-0211,0216-0255,0272-0319,0336-0383,0408-0447,0464-0467,0472-0511,0532-0575,0596-0639,0656-0703,0720-0767,0788-0831,0848-0851,0856-0895,0916-0959,0980-1023,1040-1087,1104-1151,1172-1215,1232-1235,1240-1279,1300-1343,1364-1407,1424-1471,1488-1535,1556-1599,1616-1619,1624-1663,1684-1727,1748-1791,1808-1855,1872-1919,1940-1983,2000-2003,2008-2047,2068-2111,2128-2175,2192-2239,2256-2303]
Priority=2000 RootOnly=NO ReqResv=NO Shared=EXCLUSIVE PreemptMode=REQUEUE
State=UP TotalCPUs=104192 TotalNodes=1628 SelectTypeParameters=N/A
DefMemPerNode=UNLIMITED MaxMemPerNode=124928
PartitionName=regular
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=part_reg
DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=12:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=nid0[0024-0063,0080-0083,0088-0127,0148-0191,0208-0211,0216-0255,0272-0319,0336-0383,0408-0447,0464-0467,0472-0511,0532-0575,0596-0639,0656-0703,0720-0767,0788-0831,0848-0851,0856-0895,0916-0959,0980-1023,1040-1087,1104-1151,1172-1215,1232-1235,1240-1279,1300-1343,1364-1407,1424-1471,1488-1535,1556-1599,1616-1619,1624-1663,1684-1727,1748-1791,1808-1855,1872-1919,1940-1983,2000-2003,2008-2047,2068-2111,2128-2175,2192-2239,2256-2303]
Priority=0 RootOnly=NO ReqResv=NO Shared=EXCLUSIVE PreemptMode=REQUEUE
State=UP TotalCPUs=104192 TotalNodes=1628 SelectTypeParameters=N/A
DefMemPerNode=UNLIMITED MaxMemPerNode=124928
PartitionName=realtime
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=part_immed
DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=06:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=nid0[0024-0063,0080-0083,0088-0127,0148-0191,0208-0211,0216-0255,0272-0319,0336-0383,0408-0447,0464-0467,0472-0511,0532-0575,0596-0639,0656-0703,0720-0767,0788-0831,0848-0851,0856-0895,0916-0959,0980-1023,1040-1087,1104-1151,1172-1215,1232-1235,1240-1279,1300-1343,1364-1407,1424-1471,1488-1535,1556-1599,1616-1619,1624-1663,1684-1727,1748-1791,1808-1855,1872-1919,1940-1983,2000-2003,2008-2047,2068-2111,2128-2175,2192-2239,2256-2303]
Priority=3000 RootOnly=NO ReqResv=NO Shared=YES:32 PreemptMode=REQUEUE
State=DOWN TotalCPUs=104192 TotalNodes=1628 SelectTypeParameters=CR_CORE
DefMemPerCPU=3904 MaxMemPerCPU=3094
PartitionName=shared
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=part_shared
DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=1 MaxTime=12:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=nid001[88-91],nid0038[0-3],nid0057[2-5],nid0076[4-7],nid0095[6-9],nid011[48-51],nid0134[0-3],nid0153[2-5],nid0172[4-7],nid0191[6-9]
Priority=1000 RootOnly=NO ReqResv=NO Shared=FORCE:32 PreemptMode=REQUEUE
State=UP TotalCPUs=2560 TotalNodes=40 SelectTypeParameters=CR_CORE
DefMemPerCPU=1952 MaxMemPerCPU=1952
dmj@cori01:~>
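The Nodes= lists above use Slurm's compressed hostlist syntax. A minimal sketch for expanding them, assuming the simple `prefix[a-b,c,...]` patterns seen in this report (`scontrol show hostnames` is the authoritative tool and handles many more cases):

```python
import re

def expand_hostlist(hostlist: str) -> list[str]:
    """Expand a simple Slurm hostlist like 'nid0[0024-0026,0030]'.

    Only handles a single 'prefix[ranges]' group, which matches the
    lists shown in this report. Zero-padding is taken from the range
    bounds (e.g. '0024' stays four digits wide).
    """
    m = re.fullmatch(r"([^\[]+)\[([^\]]+)\]", hostlist)
    if not m:
        return [hostlist]  # plain hostname, no bracketed ranges
    prefix, ranges = m.groups()
    hosts = []
    for part in ranges.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)
            hosts.extend(f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1))
        else:
            hosts.append(f"{prefix}{part}")
    return hosts

# e.g. count part of the idle set from the sinfo output above
idle = expand_hostlist("nid0[0025-0056,0111-0127]")
print(len(idle))  # 32 + 17 = 49
```

This makes it straightforward to cross-check node counts (allocated, idle, per-partition totals) against the GrpTRES node limits quoted elsewhere in the ticket.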
dmj@cori01:~> scontrol show cache
Current Association Manager state
User Records
UserName=a0o(61087) DefAccount=m1820 DefWckey= AdminLevel=None
UserName=a2832ba(49508) DefAccount=m888 DefWckey= AdminLevel=None
UserName=a3uw(69297) DefAccount=m452 DefWckey= AdminLevel=None
UserName=aae109(57847) DefAccount=m1673 DefWckey= AdminLevel=None
UserName=aagaga(31122) DefAccount=mpccc DefWckey= AdminLevel=None
UserName=aakhan(4294967294) DefAccount=(null) DefWckey= AdminLevel=None
UserName=aalbaugh(57551) DefAccount=m1876 DefWckey= AdminLevel=None
UserName=aaronkim(61299) DefAccount=m1869 DefWckey= AdminLevel=None
UserName=aaronm(61043) DefAccount=lux DefWckey= AdminLevel=None
....
QOS Records
QOS=normal(1)
UsageRaw=331170672.457604
GrpJobs=N(1) GrpSubmitJobs=N(1) GrpWall=N(790714.41)
GrpTRES=cpu=N(3840),mem=N(7495680),energy=N(0),node=N(60),bb/cray=N(0)
GrpTRESMins=cpu=N(5519511),mem=N(10773220719),energy=N(0),node=N(843827),bb/cray=N(550317400)
GrpTRESRunMins=cpu=N(106880),mem=N(208629760),energy=N(0),node=N(1670),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium(6)
UsageRaw=243564.720693
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(7.85)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(4059),mem=N(7923972),energy=N(0),node=N(63),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low(7)
UsageRaw=1672867811.190714
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(3564.15)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(27881130),mem=N(54423966124),energy=N(0),node=N(435642),bb/cray=N(153147777)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=serialize(11)
UsageRaw=0.000000
GrpJobs=1(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=1628(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=scavenger(12)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=normal_regular_0(13)
UsageRaw=2451945063.593852
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(404.75)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=1628(0),bb/cray=N(0)
GrpTRESMins=cpu=N(40865751),mem=N(79769946068),energy=N(0),node=N(638527),bb/cray=N(1698879661)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs= MaxWallPJ=120
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=normal_regular_1(14)
UsageRaw=5231961807.901316
GrpJobs=N(0) GrpSubmitJobs=N(10) GrpWall=N(1689.23)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(87199363),mem=N(170213157483),energy=N(0),node=N(1362490),bb/cray=N(1448802875)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs= MaxWallPJ=240
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=normal_regular_2(15)
UsageRaw=2070800115.052875
GrpJobs=N(1) GrpSubmitJobs=N(1) GrpWall=N(1610.34)
GrpTRES=cpu=N(16512),mem=N(32231424),energy=N(0),node=N(258),bb/cray=N(0)
GrpTRESMins=cpu=N(34513335),mem=N(67370030409),energy=N(0),node=N(539270),bb/cray=N(1977480605)
GrpTRESRunMins=cpu=N(1008332),mem=N(1968265625),energy=N(0),node=N(15755),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=360
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=normal_regular_3(16)
UsageRaw=34471007135.085697
GrpJobs=N(7) GrpSubmitJobs=N(731) GrpWall=N(178855.57)
GrpTRES=cpu=N(29568),mem=N(57716736),energy=N(0),node=N(462),bb/cray=N(0)
GrpTRESMins=cpu=N(574516785),mem=N(1121456765461),energy=N(0),node=N(8976824),bb/cray=N(651658573)
GrpTRESRunMins=cpu=N(7858877),mem=N(15340529595),energy=N(0),node=N(122794),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=360
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=normal_regular_4(17)
UsageRaw=99116791.116665
GrpJobs=N(10) GrpSubmitJobs=N(18) GrpWall=N(14233.57)
GrpTRES=cpu=N(1216),mem=N(2373632),energy=N(0),node=N(19),bb/cray=N(0)
GrpTRESMins=cpu=N(1651946),mem=N(3224599604),energy=N(0),node=N(25811),bb/cray=N(1352098725)
GrpTRESRunMins=cpu=N(495053),mem=N(966345147),energy=N(0),node=N(7735),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=720
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium_regular_0(19)
UsageRaw=89485143.139266
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(14.31)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=1628(0),bb/cray=N(0)
GrpTRESMins=cpu=N(1491419),mem=N(2911249990),energy=N(0),node=N(23303),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs= MaxWallPJ=120
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium_regular_1(20)
UsageRaw=130201380.162831
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(42.87)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(2170023),mem=N(4235884901),energy=N(0),node=N(33906),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs= MaxWallPJ=240
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium_regular_2(21)
UsageRaw=2772314.787207
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(1.31)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(46205),mem=N(90192641),energy=N(0),node=N(721),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=360
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium_regular_3(22)
UsageRaw=165071227.514326
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(190.20)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(2751187),mem=N(5370317268),energy=N(0),node=N(42987),bb/cray=N(59895075)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=360
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium_regular_4(23)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=720
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low_regular_0(25)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=1628(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs= MaxWallPJ=120
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low_regular_1(26)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs= MaxWallPJ=240
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low_regular_2(27)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=360
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low_regular_3(28)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=360
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low_regular_4(29)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=720
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=part_debug(32)
UsageRaw=1910439247.828047
GrpJobs=N(1) GrpSubmitJobs=N(1) GrpWall=N(12263.23)
GrpTRES=cpu=N(3840),mem=N(7495680),energy=N(0),node=160(60),bb/cray=N(0)
GrpTRESMins=cpu=N(31840654),mem=N(62152091705),energy=N(0),node=N(497510),bb/cray=N(703465178)
GrpTRESRunMins=cpu=N(118843),mem=N(231982967),energy=N(0),node=N(1856),bb/cray=N(0)
MaxJobsPU=1(1) MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=node=128
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=part_reg(33)
UsageRaw=44737923138.446764
GrpJobs=N(18) GrpSubmitJobs=N(760) GrpWall=N(197250.85)
GrpTRES=cpu=N(47296),mem=N(92321792),energy=N(0),node=1428(739),bb/cray=N(0)
GrpTRESMins=cpu=N(745632052),mem=N(1455473766104),energy=N(0),node=N(11650500),bb/cray=N(7188815516)
GrpTRESRunMins=cpu=N(17640787),mem=N(34434816614),energy=N(0),node=N(275637),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=part_immed(35)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=32(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=part_shared(36)
UsageRaw=93875352.155751
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(782294.54)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(1564589),mem=N(3054078123),energy=N(0),node=N(782294),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=killable(37)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=1628(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
dmj@cori01:~>
My recollection is that the jobs that were started were also in the regular partition.

That is interesting about the subtle difference in the ways that partition priorities differ from other priorities. So, in general, to prevent these kinds of issues, should I set everything with equal partition priorities and perhaps use a more extensive mapping of QOSes (the mappings are done in the job submit plugin) to set baseline priorities for the different job classes?

In fact, it just did it again. That same job was scheduled to start at 17:00 and had drained the system to achieve that. When it failed, it instead started a bunch of lower-priority regular jobs:

dmj@cori03:~> squeue -t R
JOBID USER     ACCOUNT NAME       PARTITION QOS    NODES TIME_LIMIT TIME    ST
23851 inascime mpopn   killable_q regular   normal 10    2:00:00    1:44:31 R
20768 pankin   m1043   run.nimrod regular   normal 258   6:00:00    5:12:48 R
23978 masao    dessn   EXPNUM_401 regular   normal 16    2:00:00    11:10   R
21093 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    1:24:51 R
23588 dolmsted m1090   bi_v2_d0   regular   normal 2     12:00:00   9:13:35 R
22674 luisruiz m657    my_job     regular   normal 2     12:00:00   9:11:01 R
22676 luisruiz m657    my_job     regular   normal 2     12:00:00   9:11:01 R
21094 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    33:26   R
21095 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    11:11   R
21096 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    11:11   R
21097 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    11:11   R
21098 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    11:11   R
21099 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    10:40   R
21100 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    10:40   R
21101 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    10:40   R
21102 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    10:40   R
21103 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    10:09   R
23621 dolmsted m1090   lbe_bi_v1_ regular   normal 2     12:00:00   8:18:54 R
23814 jllyons  m1900   vpbs.com   regular   normal 4     6:00:00    1:25:26 R
20727 smeinel  m789    make_props regular   normal 128   6:00:00    1:25:20 R
20728 smeinel  m789    make_props regular   normal 128   6:00:00    1:25:20 R
20729 smeinel  m789    make_props regular   normal 128   6:00:00    1:25:16 R
20730 smeinel  m789    make_props regular   normal 128   6:00:00    11:11   R
20731 smeinel  m789    make_props regular   normal 128   6:00:00    11:11   R
20732 smeinel  m789    make_props regular   normal 128   6:00:00    11:11   R
23741 tslo     m1393   myJob      regular   normal 1     12:00:00   4:19:41 R
22677 luisruiz m657    my_job     regular   normal 2     12:00:00   3:19:47 R
22679 luisruiz m657    my_job     regular   normal 2     12:00:00   3:14:43 R
23624 dolmsted m1090   lbe_bi_v2_ regular   normal 2     12:00:00   2:54:53 R
23636 luisruiz m657    my_job     regular   normal 2     12:00:00   2:43:44 R
23638 luisruiz m657    my_job     regular   normal 2     12:00:00   1:26:39 R
dmj@cori03:~>

When I was looking at the jobs that would finish prior to starting job 19407, it was relying on the completion of a 128-node debug job. The issue is that there weren't enough regular jobs completing to allow it to start while keeping the GrpNodes sum of regular jobs below 1428.

part_reg is limited to 1428 nodes using GrpNodes. The only thing I can see to explain this is that the backfill scheduler is not considering the partition and job QOS Grp limits when planning out the job.

I can set the partition priorities equal if you think that will help for debugging purposes. I may also come up with node lists explicitly limiting regular to a 1428-node subset temporarily, just so we can see high utilization during this initial acceptance period.

I would, however, like these GrpNodes limits to work in the long term because I would like the flexibility to change the limits on the fly without adjusting slurm.conf or restarting slurmctld. One example of why we want this is to allow the GrpNodes of debug and regular to change between day and night so we can devote more or fewer resources to debug.
One thing that was disappointing about GrpNodes is that each job that shares a node contributes to the sum -- that is why shared is on a defined list of nodes. I'd prefer to have all nodes in all partitions and use these QOS GrpNodes barriers to create floating partitions that can adjust with node availability or policy changes.

-Doug

On 11/11/2015 05:17 PM, bugs@schedmd.com wrote:

> That is interesting about the subtle difference in ways that partition
> priorities differ from other priorities. So, in general, to prevent these
> kinds of issues should I set everything with equal partition priorities and use
> perhaps a more extensive mapping of QOSs (the mappings are done in the job
> submit plugin) to set baseline priorities for the different job classes?
>
> In fact it just did it again. That same job was scheduled to start at 17:00,
> had drained the system to achieve that. When it failed it instead started a
> bunch of lower priority regular jobs:

We should definitely improve the documentation around those partition priorities - I'm still suspicious that they may be causing some of these problems. Resetting those to the same level, or turning them off completely, may help in the short term. E.g., a brief debug job asking for 32 nodes and 2 hours could wreak havoc on the schedule planner, even if it only runs for one minute and then disappears.

I think I now get what you're saying about GrpNodes not being factored into the main schedule planner correctly, and will start looking into that tomorrow.

> part_reg is limited to 1428 nodes using GrpNodes. The only thing I can see to
> explain this is that the backfill scheduler is not considering the partition
> and job QOS Grp limits when planning out the job.
>
> I can set the partition priorities equal if you think that will help for
> debugging purposes.

Please do. It'll at least rule out one source of contention.

> One thing that was disappointing about GrpNodes is that each job that shares a
> node contributes to the sum -- that is why shared is on a defined list of
> nodes. I'd prefer to have all nodes in all partitions and use these QOS
> GrpNodes barriers to create floating partitions that can adjust with node
> availability or policy changes.

That's why those partition QOSes got added with 15.08. :) GrpCPUs may better handle what you're trying to do, although I'd caution against mixing it together with GrpNodes.

I've set all partitions to priority 1000 and restarted slurmctld. The job is scheduled to start at 21:44. If it fails to start this time, I'll reconfigure the partitions (temporarily!) to be on mutually exclusive sets of nodes and (temporarily!) not use GrpNodes until we figure this out. GrpCPUs may be a better option for the future.
I'll eventually want to figure out how this can be used on a partition spanning two different architectures (Haswell and KNL) -- e.g., GrpCPUs or GrpTRES=haswell=X,knl=y or similar -- but that can wait!

Thanks for looking at this,
Doug

Hello,

I'm just wondering if there is an update on this. I've temporarily modified cori to have static partitions for the major production partitions (debug, regular, shared) running on distinct nodes. I hope to be able to move back to GrpNodes (or GrpCpus) once we can get the backfill scheduler to honor the limits.

One additional thing I wanted to mention - the design I have in mind for edison (our much larger system, moving to Native SLURM next month) will have two major partitions: regular and debug. The "normal" jobs that would run on these will use a GrpCpus that limits the partitions to mutually exclusive counts (e.g., if the system is 5500 nodes, regular would have GrpNodes=5250 and debug no limit). Larger jobs would be put into a job QOS allowing a larger GrpNodes limit and defining OverPartQos.

Thus, it will be important for the backfill scheduler to honor not only the GrpNodes (GrpCpus) limit on the partition QOS, but the effective QOS for a job running in an OverPartQos QOS. Is that feasible?

Thanks,
Doug

> --- Comment #13 from Doug Jacobsen <dmjacobsen@lbl.gov> ---
> I'm just wondering if there is an update on this. I've temporarily modified
> cori to have static partitions for the major production partitions (debug,
> regular, shared) running on distinct nodes. I hope to be able to move back to
> GrpNodes (or GrpCpus) once we can get the backfill scheduler to honor the
> limits.

I'm still researching this; it's not a trivial fix, unfortunately. I take it disabling the partition priorities alone didn't solve the problem for you?

As you suspected, we use GrpNodes/GrpCpus only as a reason to defer job execution; we don't appear to consider it at all during the normal schedule plan.
I'm trying to see how difficult it would be to add the partition QOS constraints to the main schedule planner, but I'm going to have to loop in the developers on how best to structure this, or whether that's a change they'd want to take on.

> Thus, it will be important for the backfill scheduler to honor not only the
> GrpNodes (GrpCpus) limit on the partition QOS, but the effective QOS for a job
> running in an OverPartQos QOS. Is that feasible?

Possibly, although that certainly adds another complication for the scheduler to consider. I'll take this up internally and have to get back to you on that as well.

If you're out at SC15 please stop by our booth - I'd love to talk through some of this in person to better understand how you'd expect to tie all the components together.

- Tim

Hey Doug -

I think I finally understand exactly what you're seeing after putting together a reproducer for us to use internally. Can you take a look through the notes below and verify this is what you were seeing on cori?

- Tim

#### reproducer:

# I have 10 nodes in my test config, which is sufficient for
# problemjob + goodjob2 to run alongside each other at t=5min
# if the QOS didn't block it. Scale these values as needed.
scontrol create partitionname=confusion nodes=ALL
sacctmgr create qos part_confusion maxnodes=6
scontrol update partitionname=confusion qos=part_confusion

# start these two immediately
sbatch -J goodjob1 --wrap "sleep 600" -t 5 -p confusion -N 3 --exclusive
sbatch -J goodjob2 --wrap "sleep 600" -t 10 -p confusion -N 3 --exclusive

# sleep to ensure those two start up first
sleep 10

# this will be delayed by the two above, but slurm *thinks* it
# could start @ t=5min since we'd have 7 nodes available in the partition
# however, the qos means we're only allowed access to 4 of those 7 nodes!
# won't actually be eligible until t=10min
# note that it wants most nodes in the partition,
# so other jobs would have to drain for it
sbatch -J problemjob --wrap "sleep 600" -t 3 -p confusion -N 5 --exclusive

# this *should not* run at t=5min, since problemjob (with higher priority)
# would be delayed out to t=10min
# nice used to ensure problemjob has higher priority, so shouldnotrunyet
# will be attempting to backfill
sbatch -J shouldnotrunyet --wrap "sleep 600" -t 10 -p confusion -N 3 --exclusive --nice=200

#### example and further analysis:

# let full scheduler loop have a chance to run
sleep 100

### this is now t ~= 2 minutes:
scontrol show jobs|grep 'JobId\|Time\|Reas'
JobId=127 JobName=goodjob1
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:01:36 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2015-11-13T20:14:07 EligibleTime=2015-11-13T20:14:07
   StartTime=2015-11-13T20:14:08 EndTime=2015-11-13T20:19:08
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
JobId=128 JobName=goodjob2
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:01:36 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2015-11-13T20:14:07 EligibleTime=2015-11-13T20:14:07
   StartTime=2015-11-13T20:14:08 EndTime=2015-11-13T20:24:08
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
JobId=129 JobName=problemjob
   JobState=PENDING Reason=Resources Dependency=(null)
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   SubmitTime=2015-11-13T20:14:17 EligibleTime=2015-11-13T20:14:17
   StartTime=2015-11-13T20:19:08 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
JobId=130 JobName=shouldnotrunyet
   JobState=PENDING Reason=Priority Dependency=(null)
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2015-11-13T20:14:17 EligibleTime=2015-11-13T20:14:17
   StartTime=2015-11-13T20:22:00 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0

# goodjob1+2 are both running initially.
# goodjob1 will complete at t=5min, goodjob2 at t=10min.
# Note that StartTime[problemjob] == (EndTime[goodjob1] + 10 seconds)
# ~= t=5min.
sleep 300

### this is now t ~= 8 min:
scontrol show jobs|grep 'JobId\|Time\|Reas'
JobId=128 JobName=goodjob2
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:07:40 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2015-11-13T20:14:07 EligibleTime=2015-11-13T20:14:07
   StartTime=2015-11-13T20:14:08 EndTime=2015-11-13T20:24:08
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
JobId=129 JobName=problemjob
   JobState=PENDING Reason=Resources Dependency=(null)
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   SubmitTime=2015-11-13T20:14:17 EligibleTime=2015-11-13T20:14:17
   StartTime=2015-11-13T20:24:08 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
JobId=130 JobName=shouldnotrunyet
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:02:27 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2015-11-13T20:14:17 EligibleTime=2015-11-13T20:14:17
   StartTime=2015-11-13T20:19:21 EndTime=2015-11-13T20:29:21
   PreemptTime=None SuspendTime=None SecsPreSuspend=0

The QOS prevented problemjob from launching at t=5min. Instead, shouldnotrunyet started at t=5min, which will push problemjob's actual launch time back to t=15min. problemjob incorrectly recalculates its StartTime ~= t=10min ~= EndTime[goodjob2].
Not shown here is that, with additional jobs of various sizes and runtimes, Slurm will need to hold nodes idle for any large "problemjob". When the expected start time arrives and the large job is blocked by the QOS, more of the smaller, lower-priority jobs that were prevented from backfilling will launch, re-filling the nodes we will eventually need to satisfy the GrpNodes constraint for the large job. So we're stalling the queue for larger jobs while introducing seemingly random delays for smaller jobs, as the scheduler attempts (but fails) to free up additional nodes for that large job, only to then be blocked again by the QOS.

(In reply to Doug Jacobsen from comment #1)
> So, I guess bf_busy_nodes really won't work with select/cray -- is that
> right?

Tim is at SC15, so I'll take over this bug. I'm just starting to work down the comments now. Regarding comment 2, I just updated the documentation:

-This option is currently only supported by the select/cons_res plugin.
+This option is currently only supported by the select/cons_res plugin
+(or select/cray with SelectTypeParameters set to "OTHER_CONS_RES",
+which layers the select/cray plugin over the select/cons_res plugin).

(In reply to Doug Jacobsen from comment #0)
> and was finding that the system was draining --- a lot --- for high priority
> jobs, but that it was getting overlapping reservations of nodelists for
> large jobs (e.g. nid1-600 and nid400-800), so was effectively draining 800
> nodes instead of 600.

The backfill scheduler builds a map of resource allocations through time. Nodes nid1-600 could easily be reserved for a job expected to start in one hour with a one-hour time limit, while nodes nid400-800 are reserved for another job starting two hours in the future. Looking at the expected start times of jobs is critical for seeing what is expected to start both in time and in space (nodes).
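That "map of resource allocations through time" can be pictured with a small toy sketch. This is not Slurm's implementation; the function name, the flat node counts, and the reservation numbers are all illustrative assumptions, and job durations are ignored for brevity:

```python
# Toy sketch of a backfill-style "node map through time" (NOT Slurm code).
# Reservations are (start_min, end_min, nodes_reserved) tuples.

def earliest_start(total_nodes, reservations, need, horizon):
    """Return the earliest minute in [0, horizon) at which `need` nodes
    are simultaneously free, or None if no such time exists."""
    # Candidate start times: now, plus each moment a reservation releases nodes.
    candidates = sorted({0} | {end for _, end, _ in reservations})
    for t in candidates:
        if t >= horizon:
            break
        busy = sum(nodes for start, end, nodes in reservations
                   if start <= t < end)
        if total_nodes - busy >= need:
            return t
    return None

# Hypothetical analog of the report: 600 nodes busy for the first hour,
# 401 nodes reserved for a later job in hours 2-3 of the map.
resv = [(0, 60, 600), (120, 180, 401)]
print(earliest_start(800, resv, need=300, horizon=240))  # -> 60
```

The point of the sketch is that two reservations can overlap in node space without conflicting, because they occupy different windows of the time axis; that is why overlapping node lists in the backfill map are not by themselves a bug.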
There are a couple of Slurm debug flags that print (very verbose) details about what the backfill scheduler is doing. Let me go through more of this before suggesting turning those on, but for your reference, see DebugFlags=backfill and BackfillMap. They can be turned on and off using the scontrol command (e.g., "scontrol setdebugflags +backfill").

(In reply to Tim Wickberg from comment #2)
> we have a strict backfill algorithm and
> if jobs that could otherwise be run now on open resources would delay the
> anticipated start time of the highest priority job

Technically, Slurm uses what is called conservative backfill: no job will be started that delays the expected start time of ANY higher-priority job (not just the HIGHEST-priority job).

This commit will confirm the association and QOS node limits prior to reserving resources for a pending job:
https://github.com/SchedMD/slurm/commit/dcc943b7b37fca6b0ddfe67bc393b8547930555a

There are other limits which are not currently tested in the backfill scheduler, and I am studying those now. I will be adding more tests in the near future, but this should fix the GrpNodes limit problem reported in this bug.

I've added a second commit which complements the previous one, adding more association and QOS limit checks:
https://github.com/SchedMD/slurm/commit/94f0e9485b35af4e5749d2195820bb5805f14922

These two patches will bring the backfill scheduler's testing of limits into agreement with Slurm's main scheduling logic.

Hi Moe,

This is great; thank you for looking at this. I'll put these patches on alva tomorrow and try them out.

One question: Tim mentioned that GrpCPUs might be a good fit for our mix of node-exclusive and shared-node jobs. Will this patch also work with GrpCPUs?
If not, that's OK; I'll keep the shared jobs in a fixed partition for the time being. But I would eventually like all partitions to be non-static (all floating), both to prevent any particular partition being harmed too much by an outage and to allow us to adjust limits on the fly by adjusting GrpCPUs.

Thanks again,
Doug

(In reply to Doug Jacobsen from comment #21)
> One question: Tim mentioned that GrpCPUs might be a good fit for our mix of
> node-exclusive and shared-node jobs. Will this patch also work with GrpCPUs?

Not exactly. Let me explain how the logic works and its limitations.

The limits logic does support dozens of association and QOS limits, but all of the tests are based upon the _current_ configuration. They do not support the concept of something like "job 123 will end in 10 minutes, releasing CPUs/nodes/whatever so that job 125 will be able to begin then". In addition, some information is not available until the allocation takes place. For example, if a job requests a node count (or a node-count range) on a heterogeneous system, then Slurm will not know the CPU count until after resources are selected. These issues restrict the capabilities of the backfill scheduling logic.

Right now the backfill scheduling logic matches that of the main scheduling logic to the extent possible. Here's an outline of the logic:

1. Build a queue of pending jobs; validate dependencies, start time, and some other basic limits.
2. Sort the job queue by priority.
3. For each job in the queue:
   A. Test more limits (done here as this is more heavy-weight); this is newly added. If the job is not runnable NOW, go to the next job.
   B. Determine when/where the job can/will start.
   C. If the job can start now, validate more limits with the resources selected in step "B":
      a. If the limits are all good, then start it.
      b. Otherwise, skip to the next job.
   D. If the job can start later, reserve those resources at that point in the future.

The GrpNodes check happens in step A. The GrpCPUs check happens in step C.
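The outline above can be condensed into a runnable toy. Every name here is a simplifying assumption, not Slurm's actual code: a single GrpNodes-style limit stands in for the dozens of real limits, and a flat free-node count stands in for real node selection:

```python
# Runnable toy of the backfill outline (steps 1-3, A-D). NOT Slurm code.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    priority: int
    nodes: int

def backfill_pass(queue, free_nodes, grp_limit, grp_used):
    """One backfill pass: return (started, reserved) job-name lists."""
    started, reserved = [], []
    queue = sorted(queue, key=lambda j: j.priority, reverse=True)  # steps 1-2
    for job in queue:                                              # step 3
        # step A: cheap "runnable NOW" limit test (GrpNodes stand-in);
        # skipping here without reserving is exactly what let lower-priority
        # jobs jump ahead of a Grp-blocked large job.
        if grp_used + job.nodes > grp_limit:
            continue
        if job.nodes <= free_nodes:      # step B/C: resources available now
            started.append(job.name)     # step C.a: limits good, start it
            free_nodes -= job.nodes
            grp_used += job.nodes
        else:
            reserved.append(job.name)    # step D: reserve for a future start
    return started, reserved

# 10-node toy system: 3 nodes already counted against a 6-node Grp limit.
jobs = [Job("big", 100, 5), Job("small", 10, 2)]
print(backfill_pass(jobs, free_nodes=6, grp_limit=6, grp_used=3))
# -> (['small'], [])
```

Note how "big", despite its higher priority, gets neither a start nor a reservation once the Grp limit blocks it in step A, so "small" backfills ahead of it; that is the shape of the failure this bug tracks.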
On a different note, I just fixed a burst buffer issue that Cray considered high priority. The commit is here:
https://github.com/SchedMD/slurm/commit/20e0636537476395fce50efe140e5a4a55c2099b

I no longer have any test environments (because edison and its test system alva are moving). Thus I will upgrade cori to 15.08.4 following a maintenance today, apply this patch, and try to test this out today and tomorrow.

Is the 3rd burst buffer patch to be applied on top of 15.08.4 as well?

(In reply to Doug Jacobsen from comment #23)
> Is the 3rd burst buffer patch to be applied on top of 15.08.4 as well?

Yes. All of these changes will be in v15.08.5 when released, likely mid-December.

Any update on this? I'm dropping this from severity 2 to severity 3 since it should be fixed and we're just waiting for confirmation.

Hi Moe,

I applied the patches and ran a scenario wherein I queued many 200-node jobs that would end at a variety of times. The partition had a partition QOS limiting it to GrpNodes=1400. I then submitted a high-priority 1400-node job.

The job was assigned a particular start time, and the system began reserving resources for it. Once the system had drained a total of 1400 nodes (but still had one 200-node job running), the same issue recurred. Instead of waiting for that last 200-node job to finish, it started a bunch of lower-priority 200-node jobs and pushed back the start time for the 1400-node job.

As far as I can tell, the patch did not allow the scheduler to accurately drain for the needed node count when considering partition QOS limits.

Thanks for your continuing help with this,
Doug

I was just able to reproduce this and have detailed logs to study now.

This time I found a way to reproduce the failure and generate a fix. Similar logic existed in both the backfill and primary scheduling code; this change fixes both bugs.

In the case of the backfill logic, the algorithm is still imperfect. Ideally we would track all resources through time based upon when pending jobs are expected to start and end, which involves very high overhead. This new code is better than the original and seems to work fine, but it is not generating _ideal_ scheduling.

This change will be in version 15.08.5, which we should release either this week or next week. The commit with the fix is here:
https://github.com/SchedMD/slurm/commit/fd6a48a494ae0de2b282708c98f06fce3ba56a35

FYI: We just released version 15.08.5, which includes this fix (and quite a few other Cray- and burst-buffer-specific fixes). I'll close this now. Please re-open with details if necessary.