| Summary: | backfill scheduler not considering GrpNodes limit on partition qos [was bf_busy_node issue] | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | slurmctld | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | tim |
| Version: | 15.08.3 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | NERSC | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 15.08.5 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | slurm.conf | ||
|
Description
Doug Jacobsen
2015-11-10 06:56:32 MST
gah, I see now that bf_busy_nodes only works with select/cons_res, and we don't actually use cons_res -- we just have this other_cons_res capability of the select/cray plugin. So I guess bf_busy_nodes really won't work with select/cray -- is that right?

Can you attach the full slurm.conf? cons_res or linear is still used depending on your configuration; the select/cray plugin layers on top of those. select/linear is the default, but bf_busy_nodes should still work if you have other_cons_res set in SelectTypeParameters.

Can you better define "draining"? At some point the system is always going to hold off backfilling other jobs -- we have a strict backfill algorithm, and if jobs that could otherwise run now on open resources would delay the anticipated start time of the highest-priority job, we won't schedule them to start. I'm not sure if this is the issue you're describing or not, or exactly how you may want to adjust some of the parameters to suit. - Tim

The behavior we were seeing without bf_busy_nodes set (again, with select/cray using other_cons_res) is that a big job would be in the top priority spot. Nodes would be left idle to allow it to start. The time for it to start would come up, and it would get delayed. Sometimes a different job would start instead (lower priority, I think). Looking further down the backfill map, I'd also see a non-overlapping set of nodes selected for later jobs (e.g., job1 nid001-600, then job2 nid400-700). One concern I had about the job getting delayed was that something might not be playing well with GrpNodes set on the partition QOS (e.g., starting would exceed that limit, so it wouldn't start), but I couldn't gather enough evidence to determine whether this was the case. Turning on bf_busy_nodes, however, has the effect of keeping the system extremely busy -- but as far as I can tell no jobs are getting reservations at all; it's just picking some job that can start given the available resources.
I will turn off bf_busy_nodes tomorrow morning to once again try to get backfill scheduling working correctly. -Doug

Hi,
I removed bf_busy_nodes and backfill started planning again. It appears that the GrpNodes limit on the regular partition QOS is not being considered. The system has an 864 node job to schedule. There are 832 nodes available (in total -- but really only 632 of these would be accessible owing to the GrpNodes limit). So even with an additional 32-node job finishing, it will not be able to start:
nid00837:~ # squeue --start --sort=S | grep -v "N/A" | head -n 3
JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON)
19407 regular hmc_72_8 bjoo PD 2015-11-11T11:59:16 864 nid0[0025-0056,0111- (Resources)
20725 regular make_pro smeinel PD 2015-11-11T12:19:44 128 nid0[0059-0063,0080- (Priority)
nid00837:~ # squeue -t R --sort=L
JOBID USER ACCOUNT NAME PARTITION QOS NODES TIME_LIMIT TIME ST
23722 jcorrea ngbi sh regular normal 1 5:00 3:18 R
21061 psteinbr m2078 l328f21b64 regular normal 32 3:30:00 3:26:32 R
23720 baustin mpccc run_ep.sru regular normal 1 10:00 4:50 R
23724 dolmsted m1090 lbe_bi_v1_ debug normal 2 10:00 1:50 R
21065 psteinbr m2078 l328f21b64 regular normal 32 3:30:00 3:19:55 R
23718 tslo m1393 myJob debug normal 1 30:00 12:06 R
21066 psteinbr m2078 l328f21b64 regular normal 32 3:30:00 3:11:39 R
21067 psteinbr m2078 l328f21b64 regular normal 32 3:30:00 3:06:10 R
23719 dks mp27 test.scrip debug normal 36 30:00 6:04 R
23725 qyang m1867 Secwrf debug normal 35 30:00 1:32 R
21076 psteinbr m2078 l328f21b64 regular normal 32 3:30:00 2:37:55 R
21077 psteinbr m2078 l328f21b64 regular normal 32 3:30:00 2:37:24 R
23699 jllyons m1900 vpbs.com regular normal 4 2:30:00 6:21 R
20721 smeinel m789 make_props regular normal 128 6:00:00 1:02:43 R
20722 smeinel m789 make_props regular normal 128 6:00:00 1:02:43 R
20723 smeinel m789 make_props regular normal 128 6:00:00 1:02:43 R
20724 smeinel m789 make_props regular normal 128 6:00:00 1:02:43 R
23588 dolmsted m1090 bi_v2_d0 regular normal 2 12:00:00 3:59:42 R
22674 luisruiz m657 my_job regular normal 2 12:00:00 3:57:08 R
22676 luisruiz m657 my_job regular normal 2 12:00:00 3:57:08 R
23621 dolmsted m1090 lbe_bi_v1_ regular normal 2 12:00:00 3:05:01 R
nid00837:~ # sinfo -p regular
PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE NODELIST
regular up 1-infini 12:00:00 64 2:16:2 2 down* nid00[663,881]
regular up 1-infini 12:00:00 64 2:16:2 2 draining nid0[1338,1785]
regular up 1-infini 12:00:00 64 2:16:2 792 allocated nid0[0024,0057-0063,0080-0083,0088-0110,0238-0255,0272-0319,0336-0383,0408-0423,0561-0575,0596-0611,0660-0662,0696-0703,0720-0731,1056-1087,1104-1151,1172-1213,1232-1235,1240-1279,1300-1309,1326-1329,1489-1511,1564-1595,1786-1791,1808-1839,1844-1855,1872-1919,1940-1983,2000-2003,2008-2047,2068-2111,2128-2148,2201-2239,2256-2303]
regular up 1-infini 12:00:00 64 2:16:2 832 idle nid0[0025-0056,0111-0127,0148-0191,0208-0211,0216-0237,0424-0447,0464-0467,0472-0511,0532-0560,0612-0639,0656-0659,0664-0695,0732-0767,0788-0831,0848-0851,0856-0880,0882-0895,0916-0959,0980-1023,1040-1055,1214-1215,1310-1325,1330-1337,1339-1343,1364-1407,1424-1471,1488,1512-1535,1556-1563,1596-1599,1616-1619,1624-1663,1684-1727,1748-1784,1840-1843,2149-2175,2192-2200]
nid00837:~ #
It ended up starting jobs from further down the list -- so all the draining time was wasted and it has now scheduled the job for later -- so I assume we will drain more.
nid00837:~ # squeue --start --sort=S | grep -v "N/A" | head -n 3
JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON)
19407 regular hmc_72_8 bjoo PD 2015-11-11T16:53:05 864 nid0[0059-0063,0080- (Resources)
20727 regular make_pro smeinel PD 2015-11-11T16:53:05 128 nid0[0238-0255,0272- (AssocGrpCPURunMinutesLimit)
nid00837:~ #
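The group-limit arithmetic Doug describes can be sketched roughly as follows. This is an illustrative Python sketch using numbers from the report above, not slurmctld's actual accounting (which tracks per-QOS TRES usage in much more detail); the helper name `nodes_accessible` is an invention for this example.

```python
# Hedged sketch of the GrpNodes-style bookkeeping described above.
# Numbers come from this bug report; the real slurmctld accounting is
# more involved, so treat this purely as an illustration.

def nodes_accessible(grp_nodes_limit: int, nodes_in_use: int, idle_nodes: int) -> int:
    """Nodes a new job under this QOS could actually claim right now."""
    headroom = max(0, grp_nodes_limit - nodes_in_use)
    return min(idle_nodes, headroom)

# Figures from the report: part_reg has GrpTRES node=1428 (per the
# sacctmgr output), sinfo shows 792 allocated and 832 idle nodes in
# the regular partition, and job 19407 requests 864 nodes.
grp_limit = 1428
in_use = 792
idle = 832
job_size = 864

accessible = nodes_accessible(grp_limit, in_use, idle)
# Doug's 632 figure is in this neighborhood (1428 - 792 = 636); the
# exact count depends on which node states are charged against the limit.
print(f"accessible={accessible}, job needs {job_size}, "
      f"can start: {accessible >= job_size}")
```

Even with all 832 idle nodes, the headroom under the group limit is what actually bounds the job, which matches the behavior Doug reports.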
Created attachment 2406 [details]
slurm.conf
please find attached the slurm.conf

none of the jobs in the system right now are using the job-level qos that define OverPartQOS

nid00837:~ # sacctmgr show qos -p
Name|Priority|GraceTime|Preempt|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MinTRES|
normal|5000|00:00:00||cluster|DenyOnLimit||1.000000|||||||||||||||
premium|10000|00:00:00||cluster|DenyOnLimit||1.000000|||||||||||||||
low|1000|00:00:00||cluster|DenyOnLimit||1.000000|||||||||||||||
serialize|5000|00:00:00||cluster|DenyOnLimit,OverPartQOS||1.000000|node=1628|||1|||||||||||
scavenger|0|00:00:00||cluster|DenyOnLimit||1.000000|||||||||||||||
normal_regular_0|5000|00:00:00||cluster|DenyOnLimit,OverPartQOS||1.000000|node=1628|||||||||02:00:00||1|||
normal_regular_1|5000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||04:00:00||1|||
normal_regular_2|5000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||06:00:00|||||
normal_regular_3|5000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||06:00:00|||||
normal_regular_4|5000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||12:00:00|||||
premium_regular_0|10000|00:00:00||cluster|DenyOnLimit,OverPartQOS||1.000000|node=1628|||||||||02:00:00||1|||
premium_regular_1|10000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||04:00:00||1|||
premium_regular_2|10000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||06:00:00|||||
premium_regular_3|10000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||06:00:00|||||
premium_regular_4|10000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||12:00:00|||||
low_regular_0|1000|00:00:00||cluster|DenyOnLimit,OverPartQOS||1.000000|node=1628|||||||||02:00:00||1|||
low_regular_1|1000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||04:00:00||1|||
low_regular_2|1000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||06:00:00|||||
low_regular_3|1000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||06:00:00|||||
low_regular_4|1000|00:00:00||cluster|DenyOnLimit||1.000000||||||||||12:00:00|||||
part_debug|0|00:00:00||cluster|DenyOnLimit||1.000000|node=160||||||node=128|||||1|||
part_reg|0|00:00:00||cluster|DenyOnLimit||1.000000|node=1428||||||||||||||
part_immed|0|00:00:00||cluster|DenyOnLimit||1.000000|node=32||||||||||||||
part_shared|0|00:00:00||cluster|DenyOnLimit||1.000000|||||||||||||||
killable|0|00:00:00||cluster|DenyOnLimit||1.000000|node=1628||||||||||||||
nid00837:~ #

Thanks, Alex and I are looking into this now. Please understand that Veterans' Day is a scheduled holiday for SchedMD, and as such most of our staff are out of the office today. - Tim

Just to make sure I'm understanding right: #19407 has the highest priority, so we start reserving and setting aside resources in anticipation of launch. But you don't think that the 864 nodes requested would ever become available to that job due to a lower GrpNodes limit? I don't see why you're saying "only 632 of these would be accessible owing to the GrpNodes limit" -- where do those numbers (632 and/or -200) come from? I see the part_reg qos is set to node=1628, which means it shouldn't be a factor in what you've described.

These jobs that "steal" resources out from under #19407 -- which partition are they submitted under? In the config, at least, both your "realtime" and "debug" partitions have nodes=all and a higher priority value, which would mean anything submitted to those would schedule and run ahead of anything under regular. (Although the priority for regular in your slurm.conf is "0000", which I'm assuming is a mistake.)

One thing I've been confused by before is that priority between partitions is considered first, at a whole different level than the calculated priority values within multifactor. If anything on a higher-priority partition would use nodes that overlap with a lower-priority partition, that higher priority always wins. The multifactor priorities are only compared between partitions of equal priority levels. That may explain the behavior you're seeing, unless you've reset those partition priorities through scontrol after the last restart.

Given that your config doesn't seem to match perfectly to what is currently running, would you be able to attach the output of

scontrol show assoc
scontrol show part

as well? - Tim

Doug Jacobsen <dmjacobsen@lbl.gov> changed:
Summary: bf_busy_nodes not reserving resources? -> backfill scheduler not considering GrpNodes limit on partition qos [was bf_busy_node issue]
Severity: 3 - Medium Impact -> 2 - High Impact

Hi Tim,
From Above:
part_reg|0|00:00:00||cluster|DenyOnLimit||1.000000|node=1428||||||||||||||
We have a QOS (job class) which can reach node=1628 (the full system), but none of those jobs are in the system at present.
We have over 4000 users in the system now, so I assume you only want the partitions and QOSes from the scontrol cache.
dmj@cori01:~> scontrol show part
PartitionName=debug
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=part_debug
DefaultTime=00:10:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=00:30:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=nid0[0024-0063,0080-0083,0088-0127,0148-0191,0208-0211,0216-0255,0272-0319,0336-0383,0408-0447,0464-0467,0472-0511,0532-0575,0596-0639,0656-0703,0720-0767,0788-0831,0848-0851,0856-0895,0916-0959,0980-1023,1040-1087,1104-1151,1172-1215,1232-1235,1240-1279,1300-1343,1364-1407,1424-1471,1488-1535,1556-1599,1616-1619,1624-1663,1684-1727,1748-1791,1808-1855,1872-1919,1940-1983,2000-2003,2008-2047,2068-2111,2128-2175,2192-2239,2256-2303]
Priority=2000 RootOnly=NO ReqResv=NO Shared=EXCLUSIVE PreemptMode=REQUEUE
State=UP TotalCPUs=104192 TotalNodes=1628 SelectTypeParameters=N/A
DefMemPerNode=UNLIMITED MaxMemPerNode=124928
PartitionName=regular
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=part_reg
DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=12:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=nid0[0024-0063,0080-0083,0088-0127,0148-0191,0208-0211,0216-0255,0272-0319,0336-0383,0408-0447,0464-0467,0472-0511,0532-0575,0596-0639,0656-0703,0720-0767,0788-0831,0848-0851,0856-0895,0916-0959,0980-1023,1040-1087,1104-1151,1172-1215,1232-1235,1240-1279,1300-1343,1364-1407,1424-1471,1488-1535,1556-1599,1616-1619,1624-1663,1684-1727,1748-1791,1808-1855,1872-1919,1940-1983,2000-2003,2008-2047,2068-2111,2128-2175,2192-2239,2256-2303]
Priority=0 RootOnly=NO ReqResv=NO Shared=EXCLUSIVE PreemptMode=REQUEUE
State=UP TotalCPUs=104192 TotalNodes=1628 SelectTypeParameters=N/A
DefMemPerNode=UNLIMITED MaxMemPerNode=124928
PartitionName=realtime
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=part_immed
DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=06:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=nid0[0024-0063,0080-0083,0088-0127,0148-0191,0208-0211,0216-0255,0272-0319,0336-0383,0408-0447,0464-0467,0472-0511,0532-0575,0596-0639,0656-0703,0720-0767,0788-0831,0848-0851,0856-0895,0916-0959,0980-1023,1040-1087,1104-1151,1172-1215,1232-1235,1240-1279,1300-1343,1364-1407,1424-1471,1488-1535,1556-1599,1616-1619,1624-1663,1684-1727,1748-1791,1808-1855,1872-1919,1940-1983,2000-2003,2008-2047,2068-2111,2128-2175,2192-2239,2256-2303]
Priority=3000 RootOnly=NO ReqResv=NO Shared=YES:32 PreemptMode=REQUEUE
State=DOWN TotalCPUs=104192 TotalNodes=1628 SelectTypeParameters=CR_CORE
DefMemPerCPU=3904 MaxMemPerCPU=3094
PartitionName=shared
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=part_shared
DefaultTime=00:05:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=1 MaxTime=12:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=nid001[88-91],nid0038[0-3],nid0057[2-5],nid0076[4-7],nid0095[6-9],nid011[48-51],nid0134[0-3],nid0153[2-5],nid0172[4-7],nid0191[6-9]
Priority=1000 RootOnly=NO ReqResv=NO Shared=FORCE:32 PreemptMode=REQUEUE
State=UP TotalCPUs=2560 TotalNodes=40 SelectTypeParameters=CR_CORE
DefMemPerCPU=1952 MaxMemPerCPU=1952
dmj@cori01:~>
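The Nodes= lists above use Slurm's compressed hostlist syntax. A minimal sketch for expanding them, assuming the simple `prefix[a-b,c,...]` patterns seen in this report (`scontrol show hostnames` is the authoritative tool and handles many more cases):

```python
import re

def expand_hostlist(hostlist: str) -> list[str]:
    """Expand a simple Slurm hostlist like 'nid0[0024-0026,0030]'.

    Only handles a single 'prefix[ranges]' group, which matches the
    lists shown in this report. Zero-padding is taken from the range
    bounds (e.g. '0024' stays four digits wide).
    """
    m = re.fullmatch(r"([^\[]+)\[([^\]]+)\]", hostlist)
    if not m:
        return [hostlist]  # plain hostname, no bracketed ranges
    prefix, ranges = m.groups()
    hosts = []
    for part in ranges.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)
            hosts.extend(f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1))
        else:
            hosts.append(f"{prefix}{part}")
    return hosts

# e.g. count part of the idle set from the sinfo output above
idle = expand_hostlist("nid0[0025-0056,0111-0127]")
print(len(idle))  # 32 + 17 = 49
```

This makes it straightforward to cross-check node counts (allocated, idle, per-partition totals) against the GrpTRES node limits quoted elsewhere in the ticket.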
dmj@cori01:~> scontrol show cache
Current Association Manager state
User Records
UserName=a0o(61087) DefAccount=m1820 DefWckey= AdminLevel=None
UserName=a2832ba(49508) DefAccount=m888 DefWckey= AdminLevel=None
UserName=a3uw(69297) DefAccount=m452 DefWckey= AdminLevel=None
UserName=aae109(57847) DefAccount=m1673 DefWckey= AdminLevel=None
UserName=aagaga(31122) DefAccount=mpccc DefWckey= AdminLevel=None
UserName=aakhan(4294967294) DefAccount=(null) DefWckey= AdminLevel=None
UserName=aalbaugh(57551) DefAccount=m1876 DefWckey= AdminLevel=None
UserName=aaronkim(61299) DefAccount=m1869 DefWckey= AdminLevel=None
UserName=aaronm(61043) DefAccount=lux DefWckey= AdminLevel=None
....
QOS Records
QOS=normal(1)
UsageRaw=331170672.457604
GrpJobs=N(1) GrpSubmitJobs=N(1) GrpWall=N(790714.41)
GrpTRES=cpu=N(3840),mem=N(7495680),energy=N(0),node=N(60),bb/cray=N(0)
GrpTRESMins=cpu=N(5519511),mem=N(10773220719),energy=N(0),node=N(843827),bb/cray=N(550317400)
GrpTRESRunMins=cpu=N(106880),mem=N(208629760),energy=N(0),node=N(1670),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium(6)
UsageRaw=243564.720693
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(7.85)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(4059),mem=N(7923972),energy=N(0),node=N(63),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low(7)
UsageRaw=1672867811.190714
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(3564.15)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(27881130),mem=N(54423966124),energy=N(0),node=N(435642),bb/cray=N(153147777)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=serialize(11)
UsageRaw=0.000000
GrpJobs=1(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=1628(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=scavenger(12)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=normal_regular_0(13)
UsageRaw=2451945063.593852
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(404.75)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=1628(0),bb/cray=N(0)
GrpTRESMins=cpu=N(40865751),mem=N(79769946068),energy=N(0),node=N(638527),bb/cray=N(1698879661)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs= MaxWallPJ=120
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=normal_regular_1(14)
UsageRaw=5231961807.901316
GrpJobs=N(0) GrpSubmitJobs=N(10) GrpWall=N(1689.23)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(87199363),mem=N(170213157483),energy=N(0),node=N(1362490),bb/cray=N(1448802875)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs= MaxWallPJ=240
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=normal_regular_2(15)
UsageRaw=2070800115.052875
GrpJobs=N(1) GrpSubmitJobs=N(1) GrpWall=N(1610.34)
GrpTRES=cpu=N(16512),mem=N(32231424),energy=N(0),node=N(258),bb/cray=N(0)
GrpTRESMins=cpu=N(34513335),mem=N(67370030409),energy=N(0),node=N(539270),bb/cray=N(1977480605)
GrpTRESRunMins=cpu=N(1008332),mem=N(1968265625),energy=N(0),node=N(15755),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=360
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=normal_regular_3(16)
UsageRaw=34471007135.085697
GrpJobs=N(7) GrpSubmitJobs=N(731) GrpWall=N(178855.57)
GrpTRES=cpu=N(29568),mem=N(57716736),energy=N(0),node=N(462),bb/cray=N(0)
GrpTRESMins=cpu=N(574516785),mem=N(1121456765461),energy=N(0),node=N(8976824),bb/cray=N(651658573)
GrpTRESRunMins=cpu=N(7858877),mem=N(15340529595),energy=N(0),node=N(122794),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=360
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=normal_regular_4(17)
UsageRaw=99116791.116665
GrpJobs=N(10) GrpSubmitJobs=N(18) GrpWall=N(14233.57)
GrpTRES=cpu=N(1216),mem=N(2373632),energy=N(0),node=N(19),bb/cray=N(0)
GrpTRESMins=cpu=N(1651946),mem=N(3224599604),energy=N(0),node=N(25811),bb/cray=N(1352098725)
GrpTRESRunMins=cpu=N(495053),mem=N(966345147),energy=N(0),node=N(7735),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=720
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium_regular_0(19)
UsageRaw=89485143.139266
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(14.31)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=1628(0),bb/cray=N(0)
GrpTRESMins=cpu=N(1491419),mem=N(2911249990),energy=N(0),node=N(23303),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs= MaxWallPJ=120
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium_regular_1(20)
UsageRaw=130201380.162831
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(42.87)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(2170023),mem=N(4235884901),energy=N(0),node=N(33906),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs= MaxWallPJ=240
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium_regular_2(21)
UsageRaw=2772314.787207
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(1.31)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(46205),mem=N(90192641),energy=N(0),node=N(721),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=360
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium_regular_3(22)
UsageRaw=165071227.514326
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(190.20)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(2751187),mem=N(5370317268),energy=N(0),node=N(42987),bb/cray=N(59895075)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=360
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium_regular_4(23)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=720
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low_regular_0(25)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=1628(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs= MaxWallPJ=120
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low_regular_1(26)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs= MaxWallPJ=240
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low_regular_2(27)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=360
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low_regular_3(28)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=360
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low_regular_4(29)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=720
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=part_debug(32)
UsageRaw=1910439247.828047
GrpJobs=N(1) GrpSubmitJobs=N(1) GrpWall=N(12263.23)
GrpTRES=cpu=N(3840),mem=N(7495680),energy=N(0),node=160(60),bb/cray=N(0)
GrpTRESMins=cpu=N(31840654),mem=N(62152091705),energy=N(0),node=N(497510),bb/cray=N(703465178)
GrpTRESRunMins=cpu=N(118843),mem=N(231982967),energy=N(0),node=N(1856),bb/cray=N(0)
MaxJobsPU=1(1) MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=node=128
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=part_reg(33)
UsageRaw=44737923138.446764
GrpJobs=N(18) GrpSubmitJobs=N(760) GrpWall=N(197250.85)
GrpTRES=cpu=N(47296),mem=N(92321792),energy=N(0),node=1428(739),bb/cray=N(0)
GrpTRESMins=cpu=N(745632052),mem=N(1455473766104),energy=N(0),node=N(11650500),bb/cray=N(7188815516)
GrpTRESRunMins=cpu=N(17640787),mem=N(34434816614),energy=N(0),node=N(275637),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=part_immed(35)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=32(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=part_shared(36)
UsageRaw=93875352.155751
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(782294.54)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(1564589),mem=N(3054078123),energy=N(0),node=N(782294),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=killable(37)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=1628(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
dmj@cori01:~>
My recollection is that the jobs that were started were also in the regular partition.

That is interesting about the subtle difference in the ways that partition priorities differ from other priorities. So, in general, to prevent these kinds of issues, should I set everything with equal partition priorities and perhaps use a more extensive mapping of QOSes (the mappings are done in the job submit plugin) to set baseline priorities for the different job classes?

In fact, it just did it again. That same job was scheduled to start at 17:00 and had drained the system to achieve that. When it failed, it instead started a bunch of lower-priority regular jobs:

dmj@cori03:~> squeue -t R
JOBID USER     ACCOUNT NAME       PARTITION QOS    NODES TIME_LIMIT TIME    ST
23851 inascime mpopn   killable_q regular   normal 10    2:00:00    1:44:31 R
20768 pankin   m1043   run.nimrod regular   normal 258   6:00:00    5:12:48 R
23978 masao    dessn   EXPNUM_401 regular   normal 16    2:00:00    11:10   R
21093 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    1:24:51 R
23588 dolmsted m1090   bi_v2_d0   regular   normal 2     12:00:00   9:13:35 R
22674 luisruiz m657    my_job     regular   normal 2     12:00:00   9:11:01 R
22676 luisruiz m657    my_job     regular   normal 2     12:00:00   9:11:01 R
21094 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    33:26   R
21095 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    11:11   R
21096 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    11:11   R
21097 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    11:11   R
21098 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    11:11   R
21099 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    10:40   R
21100 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    10:40   R
21101 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    10:40   R
21102 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    10:40   R
21103 psteinbr m2078   l328f21b64 regular   normal 32    3:30:00    10:09   R
23621 dolmsted m1090   lbe_bi_v1_ regular   normal 2     12:00:00   8:18:54 R
23814 jllyons  m1900   vpbs.com   regular   normal 4     6:00:00    1:25:26 R
20727 smeinel  m789    make_props regular   normal 128   6:00:00    1:25:20 R
20728 smeinel  m789    make_props regular   normal 128   6:00:00    1:25:20 R
20729 smeinel  m789    make_props regular   normal 128   6:00:00    1:25:16 R
20730 smeinel  m789    make_props regular   normal 128   6:00:00    11:11   R
20731 smeinel  m789    make_props regular   normal 128   6:00:00    11:11   R
20732 smeinel  m789    make_props regular   normal 128   6:00:00    11:11   R
23741 tslo     m1393   myJob      regular   normal 1     12:00:00   4:19:41 R
22677 luisruiz m657    my_job     regular   normal 2     12:00:00   3:19:47 R
22679 luisruiz m657    my_job     regular   normal 2     12:00:00   3:14:43 R
23624 dolmsted m1090   lbe_bi_v2_ regular   normal 2     12:00:00   2:54:53 R
23636 luisruiz m657    my_job     regular   normal 2     12:00:00   2:43:44 R
23638 luisruiz m657    my_job     regular   normal 2     12:00:00   1:26:39 R
dmj@cori03:~>

When I was looking at the jobs that would finish prior to starting job 19407, it was relying on the completion of a 128-node debug job. The issue is that there weren't enough regular jobs completing to allow it to start while keeping the GrpNodes sum of regular jobs below 1428.

part_reg is limited to 1428 nodes using GrpNodes. The only thing I can see to explain this is that the backfill scheduler is not considering the partition and job QOS Grp limits when planning out the job.

I can set the partition priorities equal if you think that will help for debugging purposes. I may also come up with node lists explicitly limiting regular to a 1428-node subset temporarily, just so we can see high utilization during this initial acceptance period.

I would, however, like these GrpNodes limits to work in the long term because I would like the flexibility to change the limits on the fly without adjusting slurm.conf or restarting slurmctld. One example of why we want this is to allow the GrpNodes of debug and regular to change between day and night so we can devote more or fewer resources to debug.
One thing that was disappointing about GrpNodes is that each job that shares a node contributes to the sum -- that is why shared is on a defined list of nodes. I'd prefer to have all nodes in all partitions and use these QOS GrpNodes barriers to create floating partitions that can adjust with node availability or policy changes.

-Doug

On 11/11/2015 05:17 PM, bugs@schedmd.com wrote:

> That is interesting about the subtle difference in ways that partition
> priorities differ from other priorities. So, in general, to prevent these
> kinds of issues should I set everything with equal partition priorities and use
> perhaps a more extensive mapping of QOSs (the mappings are done in the job
> submit plugin) to set baseline priorities for the different job classes?
>
> In fact it just did it again. That same job was scheduled to start at 17:00,
> had drained the system to achieve that. When it failed it instead started a
> bunch of lower priority regular jobs:

We should definitely improve the documentation around those partition priorities - I'm still suspicious that they may be causing some of these problems. Resetting those to the same level, or turning them off completely, may help in the short term. E.g., a brief debug job asking for 32 nodes and 2 hours could wreak havoc on the schedule planner, even if it only runs for one minute and then disappears.

I think I now get what you're saying about GrpNodes not being factored into the main schedule planner correctly, and will start looking into that tomorrow.

> part_reg is limited to 1428 nodes using GrpNodes. The only thing I can see to
> explain this is that the backfill scheduler is not considering the partition
> and job QOS Grp limits when planning out the job.
>
> I can set the partition priorities equal if you think that will help for
> debugging purposes.

Please do. It'll at least rule out one source of contention.

> One thing that was disappointing about GrpNodes is that each job that shares a
> node contributes to the sum -- that is why shared is on a defined list of
> nodes. I'd prefer to have all nodes in all partitions and use these QOS
> GrpNodes barriers to create floating partitions that can adjust with node
> availability or policy changes.

That's why those partition QOSes got added with 15.08. :) GrpCPUs may better handle what you're trying to do, although I'd caution against mixing it together with GrpNodes.

I've set all partitions to priority 1000 and restarted slurmctld. The job is scheduled to start at 21:44. If it fails to start this time, I'll reconfigure the partitions (temporarily!) to be on mutually exclusive sets of nodes and (temporarily!) not use GrpNodes until we figure this out. GrpCPUs may be a better option for the future.
I'll eventually want to figure out how this can be used on a partition spanning two different architectures (Haswell and KNL) -- e.g., GrpCPUs or GrpTRES=haswell=X,knl=y or similar -- but that can wait!

Thanks for looking at this,
Doug

Hello,

I'm just wondering if there is an update on this. I've temporarily modified cori to have static partitions for the major production partitions (debug, regular, shared) running on distinct nodes. I hope to be able to move back to GrpNodes (or GrpCpus) once we can get the backfill scheduler to honor the limits.

One additional thing I wanted to mention - the design I have in mind for edison (our much larger system, moving to Native SLURM next month) will have two major partitions: regular and debug. The "normal" jobs that would run on these will use a GrpCpus that limits the partitions to mutually exclusive counts (e.g., if the system is 5500 nodes, regular would have GrpNodes=5250 and debug no limit). Larger jobs would be put into a job QOS allowing a larger GrpNodes limit and defining OverPartQos.

Thus, it will be important for the backfill scheduler to honor not only the GrpNodes (GrpCpus) limit on the partition QOS, but the effective QOS for a job running in an OverPartQos QOS. Is that feasible?

Thanks,
Doug

> --- Comment #13 from Doug Jacobsen <dmjacobsen@lbl.gov> ---
> I'm just wondering if there is an update on this. I've temporarily modified
> cori to have static partitions for the major production partitions (debug,
> regular, shared) running on distinct nodes. I hope to be able to move back to
> GrpNodes (or GrpCpus) once we can get the backfill scheduler to honor the
> limits.

I'm still researching this; it's not a trivial fix, unfortunately. I take it disabling the partition priorities alone didn't solve the problem for you?

As you suspected, we use GrpNodes/GrpCpus only as a reason to defer job execution; we don't appear to consider it at all during the normal schedule plan.
I'm trying to see how difficult it would be to add the partition QOS constraints to the main schedule planner, but I'm going to have to loop in the developers on how best to structure this, or whether that's a change they'd want to take on.

> Thus, it will be important for the backfill scheduler to honor not only the
> GrpNodes (GrpCpus) limit on the partition QOS, but the effective QOS for a job
> running in an OverPartQos QOS. Is that feasible?

Possibly, although that certainly adds another complication for the scheduler to consider. I'll take this up internally and have to get back to you on that as well.

If you're out at SC15 please stop by our booth - I'd love to talk through some of this in person to better understand how you'd expect to tie all the components together.

- Tim

Hey Doug -

I think I finally understand exactly what you're seeing after putting together a reproducer for us to use internally. Can you take a look through the notes below and verify this is what you were seeing on cori?

- Tim

#### reproducer:

# I have 10 nodes in my test config, which is sufficient for
# problemjob + goodjob2 to run alongside each other at t=5min
# if the QOS didn't block it. Scale these values as needed.
scontrol create partitionname=confusion nodes=ALL
sacctmgr create qos part_confusion maxnodes=6
scontrol update partitionname=confusion qos=part_confusion

# start these two immediately
sbatch -J goodjob1 --wrap "sleep 600" -t 5 -p confusion -N 3 --exclusive
sbatch -J goodjob2 --wrap "sleep 600" -t 10 -p confusion -N 3 --exclusive

# sleep to ensure those two start up first
sleep 10

# this will be delayed by the two above, but slurm *thinks* it
# could start @ t=5min since we'd have 7 nodes available in the partition
# however, the qos means we're only allowed access to 4 of those 7 nodes!
# won't actually be eligible until t=10min
# note that it wants most nodes in the partition,
# so other jobs would have to drain for it
sbatch -J problemjob --wrap "sleep 600" -t 3 -p confusion -N 5 --exclusive

# this *should not* run at t=5min, since problemjob (with higher priority)
# would be delayed out to t=10min
# nice used to ensure problemjob has higher priority, so shouldnotrunyet
# will be attempting to backfill
sbatch -J shouldnotrunyet --wrap "sleep 600" -t 10 -p confusion -N 3 --exclusive --nice=200

#### example and further analysis:

# let full scheduler loop have a chance to run
sleep 100

### this is now t ~= 2 minutes:
scontrol show jobs|grep 'JobId\|Time\|Reas'
JobId=127 JobName=goodjob1
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:01:36 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2015-11-13T20:14:07 EligibleTime=2015-11-13T20:14:07
   StartTime=2015-11-13T20:14:08 EndTime=2015-11-13T20:19:08
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
JobId=128 JobName=goodjob2
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:01:36 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2015-11-13T20:14:07 EligibleTime=2015-11-13T20:14:07
   StartTime=2015-11-13T20:14:08 EndTime=2015-11-13T20:24:08
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
JobId=129 JobName=problemjob
   JobState=PENDING Reason=Resources Dependency=(null)
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   SubmitTime=2015-11-13T20:14:17 EligibleTime=2015-11-13T20:14:17
   StartTime=2015-11-13T20:19:08 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
JobId=130 JobName=shouldnotrunyet
   JobState=PENDING Reason=Priority Dependency=(null)
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2015-11-13T20:14:17 EligibleTime=2015-11-13T20:14:17
   StartTime=2015-11-13T20:22:00 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0

# goodjob1+2 are both running initially.
# goodjob1 will complete at t=5min, goodjob2 at t=10min.
# Note that StartTime[problemjob] == (EndTime[goodjob1] + 10 seconds)
# ~= t=5min.
sleep 300

### this is now t ~= 8 min:
scontrol show jobs|grep 'JobId\|Time\|Reas'
JobId=128 JobName=goodjob2
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:07:40 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2015-11-13T20:14:07 EligibleTime=2015-11-13T20:14:07
   StartTime=2015-11-13T20:14:08 EndTime=2015-11-13T20:24:08
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
JobId=129 JobName=problemjob
   JobState=PENDING Reason=Resources Dependency=(null)
   RunTime=00:00:00 TimeLimit=00:03:00 TimeMin=N/A
   SubmitTime=2015-11-13T20:14:17 EligibleTime=2015-11-13T20:14:17
   StartTime=2015-11-13T20:24:08 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
JobId=130 JobName=shouldnotrunyet
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:02:27 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2015-11-13T20:14:17 EligibleTime=2015-11-13T20:14:17
   StartTime=2015-11-13T20:19:21 EndTime=2015-11-13T20:29:21
   PreemptTime=None SuspendTime=None SecsPreSuspend=0

The QOS prevented problemjob from launching at t=5min. Instead, shouldnotrunyet started at t=5min, which will push problemjob's actual launch time back to t=15min. problemjob incorrectly recalculates its StartTime ~= t=10min ~= EndTime[goodjob2].
Not shown here is that, with additional jobs of various sizes and runtimes, Slurm will need to hold nodes idle for any large "problemjob". When the expected start time arrives and the large job is blocked by the QOS, more of the smaller, lower-priority jobs that were prevented from backfilling will launch, re-filling the nodes we will eventually need to satisfy the GrpNodes constraint for the large job. So we're stalling the queue for larger jobs while introducing seemingly random delays for smaller jobs, as the scheduler attempts (but fails) to free up additional nodes for that large job, only to then be blocked again by the QOS.

(In reply to Doug Jacobsen from comment #1)
> So, I guess bf_busy_nodes really won't work with select/cray -- is that
> right?

Tim is at SC15, so I'll take over this bug. I'm just starting to work down the comments now. Regarding comment 2, I just updated the documentation:

-This option is currently only supported by the select/cons_res plugin.
+This option is currently only supported by the select/cons_res plugin
+(or select/cray with SelectTypeParameters set to "OTHER_CONS_RES",
+which layers the select/cray plugin over the select/cons_res plugin).

(In reply to Doug Jacobsen from comment #0)
> and was finding that the system was draining --- a lot --- for high priority
> jobs, but that it was getting overlapping reservations of nodelists for
> large jobs (e.g. nid1-600 and nid400-800), so was effectively draining 800
> nodes instead of 600.

The backfill scheduler builds a map of resource allocations through time. Nodes nid1-600 could easily be reserved for a job expected to start in one hour with a one-hour time limit, while nodes nid400-800 are reserved for another job starting two hours in the future. Looking at the expected start times of jobs is critical for seeing what is expected to start both in time and in space (nodes).
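That "map of resource allocations through time" can be pictured with a small toy sketch. This is not Slurm's implementation; the function name, the flat node counts, and the reservation numbers are all illustrative assumptions, and job durations are ignored for brevity:

```python
# Toy sketch of a backfill-style "node map through time" (NOT Slurm code).
# Reservations are (start_min, end_min, nodes_reserved) tuples.

def earliest_start(total_nodes, reservations, need, horizon):
    """Return the earliest minute in [0, horizon) at which `need` nodes
    are simultaneously free, or None if no such time exists."""
    # Candidate start times: now, plus each moment a reservation releases nodes.
    candidates = sorted({0} | {end for _, end, _ in reservations})
    for t in candidates:
        if t >= horizon:
            break
        busy = sum(nodes for start, end, nodes in reservations
                   if start <= t < end)
        if total_nodes - busy >= need:
            return t
    return None

# Hypothetical analog of the report: 600 nodes busy for the first hour,
# 401 nodes reserved for a later job in hours 2-3 of the map.
resv = [(0, 60, 600), (120, 180, 401)]
print(earliest_start(800, resv, need=300, horizon=240))  # -> 60
```

The point of the sketch is that two reservations can overlap in node space without conflicting, because they occupy different windows of the time axis; that is why overlapping node lists in the backfill map are not by themselves a bug.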
There are a couple of Slurm debug flags that print (very verbose) details about what the backfill scheduler is doing. Let me go through more of this before suggesting turning those on, but for your reference, see DebugFlags=backfill and BackfillMap. They can be turned on and off using the scontrol command (e.g., "scontrol setdebugflags +backfill").

(In reply to Tim Wickberg from comment #2)
> we have a strict backfill algorithm and
> if jobs that could otherwise be run now on open resources would delay the
> anticipated start time of the highest priority job

Technically, Slurm uses what is called conservative backfill: no job will be started that delays the expected start time of ANY higher-priority job (not just the HIGHEST-priority job).

This commit will confirm the association and QOS node limits prior to reserving resources for a pending job:
https://github.com/SchedMD/slurm/commit/dcc943b7b37fca6b0ddfe67bc393b8547930555a

There are other limits which are not currently tested in the backfill scheduler, and I am studying those now. I will be adding more tests in the near future, but this should fix the GrpNodes limit problem reported in this bug.

I've added a second commit which complements the previous one, adding more association and QOS limit checks:
https://github.com/SchedMD/slurm/commit/94f0e9485b35af4e5749d2195820bb5805f14922

These two patches will bring the backfill scheduler's testing of limits into agreement with Slurm's main scheduling logic.

Hi Moe,

This is great; thank you for looking at this. I'll put these patches on alva tomorrow and try them out.

One question: Tim mentioned that GrpCPUs might be a good fit for our mix of node-exclusive and shared-node jobs. Will this patch also work with GrpCPUs?
If not, that's OK; I'll keep the shared jobs in a fixed partition for the time being. But I would eventually like all partitions to be non-static (all floating), both to prevent any particular partition being harmed too much by an outage and to allow us to adjust limits on the fly by adjusting GrpCPUs.

Thanks again,
Doug

(In reply to Doug Jacobsen from comment #21)
> One question: Tim mentioned that GrpCPUs might be a good fit for our mix of
> node-exclusive and shared-node jobs. Will this patch also work with GrpCPUs?

Not exactly. Let me explain how the logic works and its limitations.

The limits logic does support dozens of association and QOS limits, but all of the tests are based upon the _current_ configuration. They do not support the concept of something like "job 123 will end in 10 minutes, releasing CPUs/nodes/whatever so that job 125 will be able to begin then". In addition, some information is not available until the allocation takes place. For example, if a job requests a node count (or a node-count range) on a heterogeneous system, then Slurm will not know the CPU count until after resources are selected. These issues restrict the capabilities of the backfill scheduling logic.

Right now the backfill scheduling logic matches that of the main scheduling logic to the extent possible. Here's an outline of the logic:

1. Build a queue of pending jobs; validate dependencies, start time, and some other basic limits.
2. Sort the job queue by priority.
3. For each job in the queue:
   A. Test more limits (done here as this is more heavy-weight); this is newly added. If the job is not runnable NOW, go to the next job.
   B. Determine when/where the job can/will start.
   C. If the job can start now, validate more limits with the resources selected in step "B":
      a. If the limits are all good, then start it.
      b. Otherwise, skip to the next job.
   D. If the job can start later, reserve those resources at that point in the future.

The GrpNodes check happens in step A. The GrpCPUs check happens in step C.
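The outline above can be condensed into a runnable toy. Every name here is a simplifying assumption, not Slurm's actual code: a single GrpNodes-style limit stands in for the dozens of real limits, and a flat free-node count stands in for real node selection:

```python
# Runnable toy of the backfill outline (steps 1-3, A-D). NOT Slurm code.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    priority: int
    nodes: int

def backfill_pass(queue, free_nodes, grp_limit, grp_used):
    """One backfill pass: return (started, reserved) job-name lists."""
    started, reserved = [], []
    queue = sorted(queue, key=lambda j: j.priority, reverse=True)  # steps 1-2
    for job in queue:                                              # step 3
        # step A: cheap "runnable NOW" limit test (GrpNodes stand-in);
        # skipping here without reserving is exactly what let lower-priority
        # jobs jump ahead of a Grp-blocked large job.
        if grp_used + job.nodes > grp_limit:
            continue
        if job.nodes <= free_nodes:      # step B/C: resources available now
            started.append(job.name)     # step C.a: limits good, start it
            free_nodes -= job.nodes
            grp_used += job.nodes
        else:
            reserved.append(job.name)    # step D: reserve for a future start
    return started, reserved

# 10-node toy system: 3 nodes already counted against a 6-node Grp limit.
jobs = [Job("big", 100, 5), Job("small", 10, 2)]
print(backfill_pass(jobs, free_nodes=6, grp_limit=6, grp_used=3))
# -> (['small'], [])
```

Note how "big", despite its higher priority, gets neither a start nor a reservation once the Grp limit blocks it in step A, so "small" backfills ahead of it; that is the shape of the failure this bug tracks.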
On a different note, I just fixed a burst buffer issue that Cray considered high priority. The commit is here:
https://github.com/SchedMD/slurm/commit/20e0636537476395fce50efe140e5a4a55c2099b

I no longer have any test environments (because edison and its test system alva are moving). Thus I will upgrade cori to 15.08.4 following a maintenance today, apply this patch, and try to test this out today and tomorrow.

Is the 3rd burst buffer patch to be applied on top of 15.08.4 as well?

(In reply to Doug Jacobsen from comment #23)
> Is the 3rd burst buffer patch to be applied on top of 15.08.4 as well?

Yes. All of these changes will be in v15.08.5 when released, likely mid-December.

Any update on this? I'm dropping this from severity 2 to severity 3 since it should be fixed and we're just waiting for confirmation.

Hi Moe,

I applied the patches and ran a scenario wherein I queued many 200-node jobs that would end at a variety of times. The partition had a partition QOS limiting it to GrpNodes=1400. I then submitted a high-priority 1400-node job.

The job was assigned a particular start time, and the system began reserving resources for it. Once the system had drained a total of 1400 nodes (but still had one 200-node job running), the same issue recurred. Instead of waiting for that last 200-node job to finish, it started a bunch of lower-priority 200-node jobs and pushed back the start time for the 1400-node job.

As far as I can tell, the patch did not allow the scheduler to accurately drain for the needed node count when considering partition QOS limits.

Thanks for your continuing help with this,
Doug

I was just able to reproduce this and have detailed logs to study now.

This time I found a way to reproduce the failure and generate a fix. Similar logic existed in both the backfill and primary scheduling code; this change fixes both bugs.

In the case of the backfill logic, the algorithm is still imperfect. Ideally we would track all resources through time based upon when pending jobs are expected to start and end, which involves very high overhead. This new code is better than the original and seems to work fine, but it is not generating _ideal_ scheduling.

This change will be in version 15.08.5, which we should release either this week or next week. The commit with the fix is here:
https://github.com/SchedMD/slurm/commit/fd6a48a494ae0de2b282708c98f06fce3ba56a35

FYI: We just released version 15.08.5, which includes this fix (and quite a few other Cray- and burst-buffer-specific fixes). I'll close this now. Please re-open with details if necessary.