Ticket 2285

Summary: many jobs have reason "Resources", seems to confuse scheduling
Product: Slurm    Reporter: Doug Jacobsen <dmjacobsen>
Component: slurmctld    Assignee: Moe Jette <jette>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Version: 15.08.5    Version Fixed: 15.08.7
Hardware: Cray XC    OS: Linux
Site: NERSC
See Also: https://bugs.schedmd.com/show_bug.cgi?id=2300
          https://bugs.schedmd.com/show_bug.cgi?id=8347
Attachments: cori slurm.conf
             terminal output showing the issue
             slurmctld log since last restart
             Fix for v15.08.5

Description Doug Jacobsen 2015-12-28 03:28:03 MST
Created attachment 2546 [details]
cori slurm.conf

Hello,

Since upgrading to 15.08.5 on cori I've been observing occasional instances where many hundreds of jobs are blocked with reason "Resources", instead of the two or three I would expect from the partial segmentation of the cori system (one each for the regular, debug, and shared partitions).

When so many jobs apparently have nodes reserved for them, the system becomes more idle than necessary.

I have not been able to identify the conditions that lead to this behavior, but setting the partition down, allowing a scheduling cycle to complete, and then setting the partition back up seems to temporarily correct the issue.
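For the record, a sketch of that workaround as commands ("regular" is an example partition name, not necessarily the one affected; this assumes admin access on a live cluster):

```shell
# Temporarily down the partition, let at least one main scheduling
# cycle pass, then bring it back up.  "regular" is an example name.
scontrol update PartitionName=regular State=DOWN
sleep 60    # long enough for a scheduling cycle to complete
scontrol update PartitionName=regular State=UP
```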

This has occurred four times since the 23rd, twice today.

I'll try to collect more information from the logs that might be of use, but the lab is shut down right now so I have limited time available.

The current slurm.conf for cori is attached.

-Doug
Comment 1 Doug Jacobsen 2015-12-28 03:28:44 MST
Created attachment 2547 [details]
terminal output showing the issue
Comment 2 Doug Jacobsen 2015-12-28 03:50:49 MST
This has happened two more times today on cori.  I've reduced the bf_max_job_user from 5 to 1 to see if that prevents some kind of bad interaction between the QOS limits and the new adjustments to the scheduling logic.
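A minimal sketch of how that change is applied (assuming bf_max_job_user lives on the SchedulerParameters line in slurm.conf, as in the attached config; unrelated options omitted):

```shell
# slurm.conf (before): SchedulerParameters=...,bf_max_job_user=5,...
# slurm.conf (after):  SchedulerParameters=...,bf_max_job_user=1,...
# then have the running slurmctld re-read the config:
scontrol reconfigure
```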
Comment 3 Tim Wickberg 2015-12-28 05:03:27 MST
Can you share your Partition QOS's? I'm guessing those are leading to this interaction, although I'm not yet certain how.

The output from `scontrol show assoc` would be plenty.
Comment 4 Doug Jacobsen 2015-12-28 05:33:17 MST
That would be a tremendous amount of output -- we have several thousand
associations.  Assuming you just want the QOS records:

QOS Records

QOS=normal(1)
    UsageRaw=29035864812.296301
    GrpJobs=N(757) GrpSubmitJobs=N(10370) GrpWall=N(38569165.57)

GrpTRES=cpu=N(11044),mem=N(19901760),energy=N(0),node=N(868),bb/cray=N(0)

GrpTRESMins=cpu=N(483931080),mem=N(914467304608),energy=N(0),node=N(43846004),bb/cray=N(2466900640)

GrpTRESRunMins=cpu=N(1542092),mem=N(2466990822),energy=N(0),node=N(385444),bb/cray=N(0)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=premium(6)
    UsageRaw=58570546.135811
    GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(2765.51)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(976175),mem=N(1905495100),energy=N(0),node=N(15349),bb/cray=N(467060952)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=low(7)
    UsageRaw=2966771927.483770
    GrpJobs=N(0) GrpSubmitJobs=N(5) GrpWall=N(6969.30)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(49446198),mem=N(96518980040),energy=N(0),node=N(772596),bb/cray=N(54061765)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=serialize(11)
    UsageRaw=131692149.509956
    GrpJobs=1(0) GrpSubmitJobs=N(0) GrpWall=N(67.20)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(2194869),mem=N(4284384597),energy=N(0),node=N(34294),bb/cray=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=scavenger(12)
    UsageRaw=64479816.589748
    GrpJobs=N(0) GrpSubmitJobs=N(6) GrpWall=N(47865.76)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(1074663),mem=N(2084784411),energy=N(0),node=N(47865),bb/cray=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=normal_regular_0(13)
    UsageRaw=1603426268.757679
    GrpJobs=N(0) GrpSubmitJobs=N(3) GrpWall=N(316.58)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(26723771),mem=N(52164801276),energy=N(0),node=N(417558),bb/cray=N(599147136)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobsPU=1(0) MaxSubmitJobs=4(3) MaxWallPJ=360
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=node=1628
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=normal_regular_1(14)
    UsageRaw=21783598846.431665
    GrpJobs=N(0) GrpSubmitJobs=N(31) GrpWall=N(6957.24)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(363059980),mem=N(708693082470),energy=N(0),node=N(5672812),bb/cray=N(1505843391)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobsPU=1(0) MaxSubmitJobs=4(31) MaxWallPJ=720
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=node=1024
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=normal_regular_2(15)
    UsageRaw=65269007585.668082
    GrpJobs=N(0) GrpSubmitJobs=N(124) GrpWall=N(47085.74)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(1087816793),mem=N(2123418380120),energy=N(0),node=N(16997137),bb/cray=N(1143539304)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobsPU=10(0) MaxSubmitJobs=100(124) MaxWallPJ=1440
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=node=1024
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=normal_regular_3(16)
    UsageRaw=167403609831.030172
    GrpJobs=N(67) GrpSubmitJobs=N(2434) GrpWall=N(1470414.32)

GrpTRES=cpu=N(80256),mem=N(156659712),energy=N(0),node=N(1254),bb/cray=N(0)

GrpTRESMins=cpu=N(2790060163),mem=N(5446109051293),energy=N(0),node=N(43594690),bb/cray=N(1580642408)

GrpTRESRunMins=cpu=N(27710726),mem=N(54091337932),energy=N(0),node=N(432980),bb/cray=N(0)
    MaxJobsPU=10(67) MaxSubmitJobs=250(2434) MaxWallPJ=1440
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=node=512
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=normal_regular_4(17)
    UsageRaw=19323189062.714505
    GrpJobs=N(101) GrpSubmitJobs=N(2555) GrpWall=N(3910499.22)
    GrpTRES=cpu=N(9600),mem=N(16959488),energy=N(0),node=N(150),bb/cray=N(0)

GrpTRESMins=cpu=N(322053151),mem=N(613441844570),energy=N(0),node=N(5032080),bb/cray=N(4167754646)

GrpTRESRunMins=cpu=N(6107728),mem=N(11238938999),energy=N(0),node=N(95433),bb/cray=N(0)
    MaxJobsPU=50(101) MaxSubmitJobs=500(2555) MaxWallPJ=1440
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=node=100
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=premium_regular_0(19)
    UsageRaw=62740426.670975
    GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(10.63)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(1045673),mem=N(2041155214),energy=N(0),node=N(16338),bb/cray=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobsPU=1(0) MaxSubmitJobs=4(0) MaxWallPJ=360
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=node=1628
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=premium_regular_1(20)
    UsageRaw=1105208059.015536
    GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(339.99)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(18420134),mem=N(35956102186),energy=N(0),node=N(287814),bb/cray=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobsPU=1(0) MaxSubmitJobs=4(0) MaxWallPJ=720
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=node=1024
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=premium_regular_2(21)
    UsageRaw=394514798.365269
    GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(200.45)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(6575246),mem=N(12834881440),energy=N(0),node=N(102738),bb/cray=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobsPU=10(0) MaxSubmitJobs=100(0) MaxWallPJ=1440
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=node=1024
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=premium_regular_3(22)
    UsageRaw=139942721.225102
    GrpJobs=N(1) GrpSubmitJobs=N(1) GrpWall=N(1846.17)
    GrpTRES=cpu=N(320),mem=N(624640),energy=N(0),node=N(5),bb/cray=N(0)

GrpTRESMins=cpu=N(2332378),mem=N(4552803197),energy=N(0),node=N(36443),bb/cray=N(5638522108)

GrpTRESRunMins=cpu=N(111642),mem=N(217926485),energy=N(0),node=N(1744),bb/cray=N(0)
    MaxJobsPU=10(1) MaxSubmitJobs=250(1) MaxWallPJ=1440
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=node=512
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=premium_regular_4(23)
    UsageRaw=1811337.955253
    GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(355.23)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(30188),mem=N(58928861),energy=N(0),node=N(471),bb/cray=N(35278242)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobsPU=50(0) MaxSubmitJobs=500(0) MaxWallPJ=1440
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=node=100
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=low_regular_0(25)
    UsageRaw=0.000000
    GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobsPU=1(0) MaxSubmitJobs=4(0) MaxWallPJ=360
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=node=1628
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=low_regular_1(26)
    UsageRaw=0.000000
    GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobsPU=1(0) MaxSubmitJobs=4(0) MaxWallPJ=720
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=node=1024
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=low_regular_2(27)
    UsageRaw=0.000000
    GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobsPU=10(0) MaxSubmitJobs=100(0) MaxWallPJ=1440
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=node=1024
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=low_regular_3(28)
    UsageRaw=0.000000
    GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobsPU=10(0) MaxSubmitJobs=250(0) MaxWallPJ=1440
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=node=512
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=low_regular_4(29)
    UsageRaw=0.000000
    GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobsPU=50(0) MaxSubmitJobs=500(0) MaxWallPJ=1440
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=node=100
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=part_debug(32)
    UsageRaw=23592245099.469097
    GrpJobs=N(9) GrpSubmitJobs=N(122) GrpWall=N(389416.66)
    GrpTRES=cpu=N(7680),mem=N(14991360),energy=N(0),node=N(120),bb/cray=N(0)

GrpTRESMins=cpu=N(393204084),mem=N(767360107327),energy=N(0),node=N(6143813),bb/cray=N(2619505647)

GrpTRESRunMins=cpu=N(232929),mem=N(454677538),energy=N(0),node=N(3639),bb/cray=N(0)
    MaxJobsPU=1(9) MaxSubmitJobs=10(122) MaxWallPJ=
    MaxTRESPJ=node=128
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=part_reg(33)
    UsageRaw=277289275824.041923
    GrpJobs=N(169) GrpSubmitJobs=N(5154) GrpWall=N(5454035.99)

GrpTRES=cpu=N(90176),mem=N(174243840),energy=N(0),node=N(1409),bb/cray=N(0)

GrpTRESMins=cpu=N(4621487930),mem=N(9005850145329),energy=N(0),node=N(72210748),bb/cray=N(14670727238)

GrpTRESRunMins=cpu=N(50370413),mem=N(97196104567),energy=N(0),node=N(787037),bb/cray=N(0)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=part_shared(36)
    UsageRaw=6167943337.439099
    GrpJobs=N(748) GrpSubmitJobs=N(10253) GrpWall=N(37922174.68)
    GrpTRES=cpu=N(3364),mem=N(4910400),energy=N(0),node=N(748),bb/cray=N(0)

GrpTRESMins=cpu=N(102799055),mem=N(170658900245),energy=N(0),node=N(37922174),bb/cray=N(368517711)

GrpTRESRunMins=cpu=N(4335376),mem=N(6226500412),energy=N(0),node=N(902159),bb/cray=N(0)
    MaxJobs= MaxSubmitJobs=25000(10253) MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=part_preempt(44)
    UsageRaw=1869181.523848
    GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(239.54)
    GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(31153),mem=N(60810705),energy=N(0),node=N(486),bb/cray=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=part_realtime(45)
    UsageRaw=63961418.292732
    GrpJobs=N(7) GrpSubmitJobs=N(11) GrpWall=N(47279.89)
    GrpTRES=cpu=2048(168),mem=N(327936),energy=N(0),node=N(7),bb/cray=N(0)

GrpTRESMins=cpu=N(1066023),mem=N(2075797392),energy=N(0),node=N(47329),bb/cray=N(10515235)

GrpTRESRunMins=cpu=N(6043),mem=N(11796326),energy=N(0),node=N(251),bb/cray=N(0)
    MaxJobsPU=8(7) MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=cpu=512
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=realtime_generic(46)
    UsageRaw=44140975.222532
    GrpJobs=8(0) GrpSubmitJobs=N(0) GrpWall=N(34077.98)
    GrpTRES=cpu=256(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(735682),mem=N(1436130664),energy=N(0),node=N(34098),bb/cray=N(6511657)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=realtime_lcls(47)
    UsageRaw=231370.796306
    GrpJobs=8(0) GrpSubmitJobs=N(0) GrpWall=N(122.47)
    GrpTRES=cpu=256(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(3856),mem=N(7527263),energy=N(0),node=N(122),bb/cray=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=realtime_openmsi(48)
    UsageRaw=5887143.595559
    GrpJobs=8(0) GrpSubmitJobs=N(0) GrpWall=N(2696.26)
    GrpTRES=cpu=256(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(98119),mem=N(191528404),energy=N(0),node=N(2725),bb/cray=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=realtime_ngbi(49)
    UsageRaw=3278.289890
    GrpJobs=8(0) GrpSubmitJobs=N(0) GrpWall=N(27.31)
    GrpTRES=cpu=256(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    GrpTRESMins=cpu=N(54),mem=N(106653),energy=N(0),node=N(27),bb/cray=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=realtime_als(50)
    UsageRaw=1033500.474004
    GrpJobs=8(0) GrpSubmitJobs=N(0) GrpWall=N(2179.16)
    GrpTRES=cpu=256(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(17225),mem=N(28523604),energy=N(0),node=N(2179),bb/cray=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=realtime_ptf(51)
    UsageRaw=12580964.763747
    GrpJobs=8(7) GrpSubmitJobs=N(7) GrpWall=N(8139.20)
    GrpTRES=cpu=256(168),mem=N(327936),energy=N(0),node=N(7),bb/cray=N(0)

GrpTRESMins=cpu=N(209682),mem=N(409241978),energy=N(0),node=N(8139),bb/cray=N(0)

GrpTRESRunMins=cpu=N(4771),mem=N(9313382),energy=N(0),node=N(198),bb/cray=N(0)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=
QOS=realtime_nstaff(53)
    UsageRaw=66024.381529
    GrpJobs=8(0) GrpSubmitJobs=N(0) GrpWall=N(27.84)
    GrpTRES=cpu=256(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)

GrpTRESMins=cpu=N(1100),mem=N(2147993),energy=N(0),node=N(27),bb/cray=N(4003577)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
    MaxJobs= MaxSubmitJobs= MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESPU=
    MaxTRESMinsPJ=
    MinTRESPJ=

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

Comment 5 Tim Wickberg 2015-12-28 07:27:34 MST
Sorry about that - I forgot "scontrol show assoc" can get rather large when you have a considerable number of accounts defined... my test systems usually only have a handful.

I don't see any obvious issues with your config, and it doesn't look like you're using the partition QOS to limit access.

For the jobs waiting on "Resources": can you confirm they're not waiting on the burst buffer, memory, or some other limit, but appear to be blocked only on available nodes?

I'd be curious what slurmctld.log indicates is happening. Are you able to grab debug logs from before and after you "clear" the problem by downing/resuming the partition?

I'd also be curious how the scheduler is performing on cori - sdiag's output may be of value, although I can't point to anything in particular there at the moment.
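One quick way to check is to tally pending-job reasons. A sketch of the pipeline: on a live system the input would come from `squeue --state=PENDING --noheader -o "%r"` (`%r` is squeue's reason field); canned sample lines stand in here:

```shell
# Count pending jobs by reason; the "Resources" count is what
# balloons when the bug strikes.  Replace the printf with:
#   squeue --state=PENDING --noheader -o "%r"
printf '%s\n' Resources Resources Resources Priority AssocGrpNodeLimit |
    sort | uniq -c | sort -rn
# top line for this sample input: "3 Resources"
```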
Comment 6 Doug Jacobsen 2015-12-29 05:16:42 MST
Hi Tim,

Yes, as far as I can tell these jobs are only waiting on nodes, but you did
trigger an idea.  Does Slurm have some sort of load sensor, like GridEngine's,
for determining available memory on nodes?  Or does it just dole out the
theoretical maximum specified for the node in slurm.conf and assume the
memory is available?  I ask because I can imagine a situation where a node
doesn't get fully cleaned and there is no longer sufficient memory to run
jobs given our DefMemPerNode settings.  However, I think this is more of an
academic point: our "regular" and "debug" partitions only give out the
maximum amount of memory we allow, and I don't see any particular group of
nodes stagnating.


Actually, this recurred last night -- I'll see if I can dig up the logs in
a few minutes (there are a LOT of logs...).

This time, on a whim, I left the shared partition down.  It comprises over
half our job queue by entry count and has generated scheduling issues in
the past (pathological failures where a job asking for all the memory on a
node but only one core would completely block all shared-partition jobs
from running, even on nodes the system wasn't planning to run that job on).

Anyway, with the shared partition down the issue has not recurred, so I'm
wondering if it is somehow related to that partition.

Regarding your request for performance numbers: I typically see our
backfill scheduler cycle around 30 s when shared is enabled.  I did update
the parameters yesterday to start preparing for our production
configuration starting on 1/11:

SchedulerParameters=no_backup_scheduling,bf_window=10080,bf_resolution=120,bf_max_job_array_resv=20,default_queue_depth=400,bf_max_job_test=6000,bf_max_job_user=1,bf_continue,nohold_on_prolog_fail,kill_invalid_depend

With the shared partition down, things seem smoother.  Obviously we need to
get that back online, but I want to let a few more big jobs run first
before signing up for more pain.


nid00837:~ # sdiag
*******************************************************
sdiag output at Tue Dec 29 11:07:48 2015
Data since      Mon Dec 28 16:00:00 2015
*******************************************************
Server thread count: 3
Agent queue size:    0

Jobs submitted: 8560
Jobs started:   5564
Jobs completed: 5289
Jobs canceled:  270
Jobs failed:    1

Main schedule statistics (microseconds):
Last cycle:   28172
Max cycle:    531631
Total cycles: 9372
Mean cycle:   26076
Mean depth cycle:  884
Cycles per minute: 8
Last queue length: 4164

Backfilling stats
Total backfilled jobs (since last slurm start): 4674
Total backfilled jobs (since last stats cycle start): 4328
Total cycles: 1451
Last cycle when: Tue Dec 29 11:07:01 2015
Last cycle: 14648298
Max cycle:  210382414
Mean cycle: 15108727
Last depth cycle: 4153
Last depth cycle (try sched): 216
Depth Mean: 4366
Depth Mean (try depth): 199
Last queue length: 4164
Queue length mean: 4538

Remote Procedure Call statistics by message type
REQUEST_JOB_STEP_CREATE                 ( 5001) count:51635
 ave_time:453031 total_time:23392303668
MESSAGE_EPILOG_COMPLETE                 ( 6012) count:45966
 ave_time:1442391 total_time:66300959109
REQUEST_COMPLETE_PROLOG                 ( 6018) count:44917
 ave_time:4917923 total_time:220898380128
REQUEST_PARTITION_INFO                  ( 2009) count:42389  ave_time:2603
  total_time:110346796
REQUEST_JOB_ALLOCATION_INFO_LITE        ( 4016) count:31971
 ave_time:1461398 total_time:46722363264
REQUEST_STEP_COMPLETE                   ( 5016) count:31662
 ave_time:1143450 total_time:36203931498
MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:31012  ave_time:77940
 total_time:2417090957
REQUEST_JOB_INFO                        ( 2003) count:28519
 ave_time:1655330 total_time:47208357850
REQUEST_JOB_USER_INFO                   ( 2039) count:12753
 ave_time:1119198 total_time:14273135712
REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:8311
ave_time:2595395 total_time:21570329270
REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:7527
ave_time:6497207 total_time:48904482394
REQUEST_PING                            ( 1008) count:1906   ave_time:194
 total_time:369931
REQUEST_JOB_INFO_SINGLE                 ( 2021) count:1551
ave_time:8941220 total_time:13867833675
REQUEST_NODE_INFO                       ( 2007) count:1479
ave_time:5262233 total_time:7782843909
REQUEST_CANCEL_JOB_STEP                 ( 5005) count:908
 ave_time:108256 total_time:98297288
REQUEST_KILL_JOB                        ( 5032) count:837
 ave_time:282130 total_time:236143105
REQUEST_BURST_BUFFER_INFO               ( 2037) count:254    ave_time:308
 total_time:78300
REQUEST_JOB_READY                       ( 4019) count:108
 ave_time:736117 total_time:79500701
REQUEST_RESOURCE_ALLOCATION             ( 4001) count:90
ave_time:6618577 total_time:595671996
REQUEST_UPDATE_JOB                      ( 3001) count:67
ave_time:125614 total_time:8416200
REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:62
ave_time:3288170 total_time:203866574
REQUEST_STATS_INFO                      ( 2035) count:17     ave_time:179
 total_time:3048
REQUEST_UPDATE_PARTITION                ( 3005) count:9
 ave_time:133969 total_time:1205728
REQUEST_PRIORITY_FACTORS                ( 2026) count:6
 ave_time:541223 total_time:3247339
REQUEST_NODE_INFO_SINGLE                ( 2040) count:5
 ave_time:3039054 total_time:15195270
REQUEST_UPDATE_NODE                     ( 3002) count:4
 ave_time:14166058 total_time:56664232
REQUEST_RESERVATION_INFO                ( 2024) count:1      ave_time:166
 total_time:166

Remote Procedure Call statistics by user

root            (       0) count:165638 ave_time:2358599
total_time:390673627512
guangsha        (   54808) count:57766  ave_time:383712
total_time:22165519601
aoliu           (   56679) count:48534  ave_time:750603
total_time:36429805003
lsq             (   63156) count:15374  ave_time:181018
total_time:2782984897
tang31          (   62615) count:3845   ave_time:381498
total_time:1466861482
rwelsch         (   69678) count:3417   ave_time:172162 total_time:588278710
malbon          (   58163) count:3270   ave_time:394690
total_time:1290638777
krach           (   58876) count:3120   ave_time:312076 total_time:973677142
berkowit        (   62817) count:3084   ave_time:551964
total_time:1702258279
gpzhang         (   41964) count:2711   ave_time:426136
total_time:1155255579
polkituser      (     108) count:2149   ave_time:1934544
total_time:4157336913
cemitch         (   34773) count:1359   ave_time:5043030
total_time:6853477894
ptfproc         (   62098) count:1074   ave_time:2773597
total_time:2978843540
lsqphot         (   62521) count:886    ave_time:2606250
total_time:2309138029
jiachen         (   52191) count:752    ave_time:6227583
total_time:4683143148
yfeng1          (   62716) count:699    ave_time:7381043
total_time:5159349145
fangyong        (   53958) count:588    ave_time:6475573
total_time:3807637109
jesutton        (   60057) count:579    ave_time:4026770
total_time:2331499933
dmj             (   56094) count:547    ave_time:2571703
total_time:1406721673
operator        (   34510) count:524    ave_time:3889977
total_time:2038348407
usgweb          (   33442) count:502    ave_time:2856711
total_time:1434069071
pdhuvad         (   59561) count:348    ave_time:1982374
total_time:689866240
u15013          (   15013) count:320    ave_time:142528 total_time:45609015
fwang26         (   49811) count:291    ave_time:1853384
total_time:539335032
fpaesani        (   33949) count:258    ave_time:255265 total_time:65858581
glock           (   69615) count:254    ave_time:308    total_time:78300
rch             (   52243) count:250    ave_time:254317 total_time:63579335
friesen         (   52244) count:249    ave_time:5646210
total_time:1405906476
jaehong         (   42915) count:232    ave_time:1525988
total_time:354029299
mycoy           (   62389) count:232    ave_time:549504 total_time:127484967
emiliord        (   56872) count:225    ave_time:3441861
total_time:774418761
jhc585          (   70382) count:213    ave_time:1994431
total_time:424813813
orginos         (   13909) count:210    ave_time:317101 total_time:66591297
yaowang         (   56583) count:206    ave_time:1175761
total_time:242206789
ysuleyma        (   60557) count:204    ave_time:6829919
total_time:1393303668
swu_ncsu        (   69821) count:201    ave_time:1032187
total_time:207469772
byujiang        (   63096) count:198    ave_time:993105 total_time:196634850
archs           (   69687) count:192    ave_time:233125 total_time:44760086
mlubin          (   62217) count:183    ave_time:1589433
total_time:290866336
rsakidja        (   55248) count:183    ave_time:6333642
total_time:1159056492
skcheng         (   68361) count:173    ave_time:264116 total_time:45692212
rcane           (   58910) count:173    ave_time:7606778
total_time:1315972759
szg142          (   62766) count:172    ave_time:4065274
total_time:699227201
chlee10         (   45580) count:170    ave_time:170497 total_time:28984523
knam            (   41118) count:169    ave_time:134244 total_time:22687395
sghosh28        (   68990) count:157    ave_time:294877 total_time:46295836
luzhixin        (   58710) count:152    ave_time:7329518
total_time:1114086778
schenke         (   52653) count:140    ave_time:1447775
total_time:202688593
xgli            (   48514) count:139    ave_time:15994864
total_time:2223286133
startsev        (   16891) count:137    ave_time:5434589
total_time:744538738
qyang           (   56661) count:131    ave_time:2863608
total_time:375132672
wangyu          (   49739) count:129    ave_time:395237 total_time:50985694
yy293           (   61446) count:127    ave_time:585155 total_time:74314778
weichen         (   51381) count:126    ave_time:164564 total_time:20735176
mkunz           (   57597) count:123    ave_time:3103359
total_time:381713272
jchowdhu        (   49411) count:122    ave_time:1496306
total_time:182549365
stpi            (   68888) count:118    ave_time:716532 total_time:84550828
aike            (   34983) count:116    ave_time:3682853
total_time:427210973
xiey            (   57446) count:107    ave_time:23232515
total_time:2485879195
rncahn          (   42003) count:105    ave_time:2969432
total_time:311790379
s7z             (   69292) count:90     ave_time:1223935
total_time:110154216
sokseiha        (   60723) count:87     ave_time:4439473
total_time:386234185
slz839          (   68508) count:87     ave_time:3222288
total_time:280339138
bmarco          (   49744) count:85     ave_time:648784 total_time:55146653
phychem         (   61270) count:83     ave_time:181968 total_time:15103359
akara           (   40227) count:82     ave_time:2399977
total_time:196798117
dingjun         (   57157) count:78     ave_time:5066085
total_time:395154665
ninghai         (   51797) count:76     ave_time:995114 total_time:75628722
zrsun           (   63135) count:75     ave_time:283932 total_time:21294922
psteinbr        (   62610) count:74     ave_time:11636502
total_time:861101165
sselcuk         (   55180) count:70     ave_time:426590 total_time:29861324
sburrows        (   56392) count:68     ave_time:834995 total_time:56779665
saunders        (   56320) count:67     ave_time:6344775
total_time:425099955
sivanr          (   55792) count:66     ave_time:324165 total_time:21394911
dorislee        (   64581) count:64     ave_time:4843782
total_time:310002055
rotureau        (   41524) count:64     ave_time:2234647
total_time:143017452
wangjp          (   59411) count:63     ave_time:31893940
total_time:2009318222
tslo            (   44437) count:63     ave_time:140462 total_time:8849167
bkang           (   70639) count:62     ave_time:722314 total_time:44783479
ckerr           (   13601) count:61     ave_time:437294 total_time:26674983
masha           (   12880) count:58     ave_time:21679175
total_time:1257392185
vancho          (   69590) count:56     ave_time:3097429
total_time:173456062
jihwang         (   69202) count:55     ave_time:6754315
total_time:371487364
ayonge          (   63348) count:54     ave_time:617843 total_time:33363550
dpetesch        (   51668) count:50     ave_time:185625 total_time:9281290
yuan_pin        (   44577) count:49     ave_time:507326 total_time:24859012
brightzh        (   49011) count:49     ave_time:636386 total_time:31182959
dbrout          (   58732) count:48     ave_time:1103036 total_time:52945771
yhzhao          (   70137) count:46     ave_time:366928 total_time:16878702
haobin          (   12588) count:45     ave_time:791601 total_time:35622085
bojana          (   45227) count:45     ave_time:824766 total_time:37114471
binchen         (   69475) count:45     ave_time:869099 total_time:39109461
huang26         (   69507) count:44     ave_time:4798425 total_time:211130737
scoh            (   54290) count:44     ave_time:879761 total_time:38709495
apurkaya        (   50086) count:42     ave_time:296766 total_time:12464207
szuchia         (   70542) count:42     ave_time:730901 total_time:30697882
divalent        (   58435) count:37     ave_time:10076267 total_time:372821907
vorberg         (   68395) count:36     ave_time:155281 total_time:5590117
jptrinas        (   55511) count:35     ave_time:20860364 total_time:730112761
vetinari        (   56108) count:32     ave_time:856092 total_time:27394964
hergert         (   66183) count:32     ave_time:1093159 total_time:34981102
vih173          (   61676) count:31     ave_time:15414454 total_time:477848081
dcantu          (   59635) count:31     ave_time:381786 total_time:11835389
gpau            (   43040) count:29     ave_time:405953 total_time:11772649
vijaysr         (   61136) count:29     ave_time:326426 total_time:9466359
lpyu            (   47763) count:29     ave_time:3359420 total_time:97423180
vfung           (   69515) count:29     ave_time:1542510 total_time:44732790
monoue          (   56205) count:27     ave_time:531624 total_time:14353850
huiyufen        (   70591) count:27     ave_time:428552 total_time:11570915
aryal           (   44957) count:26     ave_time:230997 total_time:6005941
janina          (   63054) count:26     ave_time:256841 total_time:6677870
shpark          (   70066) count:26     ave_time:19839359 total_time:515823337
jddenlin        (   46195) count:25     ave_time:827131 total_time:20678293
ebraun          (   60799) count:24     ave_time:468404 total_time:11241719
loryza          (   68452) count:24     ave_time:41678603 total_time:1000286480
euniv           (   35016) count:23     ave_time:4951549 total_time:113885630
linj7           (   55679) count:21     ave_time:320613 total_time:6732889
staimour        (   61277) count:20     ave_time:80891022 total_time:1617820440
schrier         (   33338) count:19     ave_time:53185  total_time:1010525
mtreagan        (   55441) count:19     ave_time:1032315 total_time:19613989
dkitch          (   60923) count:18     ave_time:3616284 total_time:65093118
songliu         (   62095) count:17     ave_time:612109 total_time:10405861
kenmc           (   59827) count:17     ave_time:16510408 total_time:280676948
ameisner        (   68391) count:17     ave_time:777231 total_time:13212931
tyson           (   41570) count:16     ave_time:28144  total_time:450304
mandrade        (   69505) count:16     ave_time:3977005 total_time:63632090
jj1             (   63053) count:16     ave_time:2956263 total_time:47300212
ravish          (   59487) count:16     ave_time:575726 total_time:9211620
kjwlou          (   68953) count:16     ave_time:2556664 total_time:40906631
bln             (   62356) count:14     ave_time:163746 total_time:2292450
ppetrov         (   54943) count:14     ave_time:7199064 total_time:100786906
sisir           (   64001) count:14     ave_time:271684 total_time:3803587
bsingh          (   51922) count:14     ave_time:295149 total_time:4132096
wangjl          (   56483) count:14     ave_time:465903 total_time:6522655
pyhuang         (   70607) count:14     ave_time:763101 total_time:10683425
xinzhang        (   70290) count:12     ave_time:164348 total_time:1972176
u10198          (   10198) count:12     ave_time:10871316 total_time:130455799
bravenec        (   32825) count:11     ave_time:251121 total_time:2762333
rtsyshev        (   50700) count:11     ave_time:62201  total_time:684211
ajinich         (   70532) count:11     ave_time:472611 total_time:5198731
pjfeibe         (   52605) count:11     ave_time:21141403 total_time:232555443
pankin          (   33880) count:10     ave_time:3019392 total_time:30193921
mniesen         (   58371) count:10     ave_time:855176 total_time:8551764
mgalib          (   64941) count:10     ave_time:806979 total_time:8069797
sfischer        (   65263) count:10     ave_time:1654892 total_time:16548921
mewang          (   69389) count:10     ave_time:1357193 total_time:13571931
smhagos         (   52024) count:10     ave_time:60779  total_time:607797
yaping          (   62471) count:10     ave_time:103049 total_time:1030496
saif            (   70526) count:9      ave_time:34614  total_time:311534
hoa84           (   69035) count:9      ave_time:2090018 total_time:18810169
mtnguyen        (   70592) count:9      ave_time:13011413 total_time:117102717
smirzaei        (   62064) count:9      ave_time:366272 total_time:3296453
geniav          (   55362) count:8      ave_time:229781 total_time:1838251
cashman         (   55766) count:8      ave_time:312038 total_time:2496306
dgold           (   58888) count:8      ave_time:79941  total_time:639533
dks             (   12735) count:8      ave_time:1302715 total_time:10421723
jhyoon          (   51879) count:8      ave_time:454561 total_time:3636493
shizhong        (   61675) count:8      ave_time:495150 total_time:3961202
paganol         (   47088) count:7      ave_time:203840 total_time:1426885
dbowring        (   54298) count:7      ave_time:28776  total_time:201438
ruchen          (   62289) count:6      ave_time:14811  total_time:88869
sabuda          (   69930) count:6      ave_time:6644786 total_time:39868719
cenko           (   49323) count:6      ave_time:829963 total_time:4979780
taibui          (   65462) count:6      ave_time:190463 total_time:1142782
alexand         (   32910) count:6      ave_time:51024  total_time:306146
mwhite          (   31845) count:6      ave_time:553    total_time:3323
jihankim        (   47675) count:6      ave_time:52533  total_time:315198
samolyuk        (   51792) count:6      ave_time:1110883 total_time:6665298
rtumkur         (   63583) count:6      ave_time:656338 total_time:3938030
krad            (   69112) count:6      ave_time:186925 total_time:1121555
lslivins        (   69829) count:5      ave_time:24927  total_time:124636
tutchton        (   59274) count:5      ave_time:24824148 total_time:124120740
nishino         (   46822) count:5      ave_time:741407 total_time:3707036
mastriko        (   45620) count:5      ave_time:2684   total_time:13421
cdiaz           (   58183) count:5      ave_time:468331 total_time:2341659
canon           (   16907) count:5      ave_time:6795   total_time:33977
yunl            (   57742) count:4      ave_time:129959 total_time:519837
mdfowler        (   69296) count:4      ave_time:39404  total_time:157616
vlcek           (   68560) count:4      ave_time:1335161 total_time:5340646
otresca         (   61059) count:4      ave_time:297226 total_time:1188906
holod           (   34809) count:4      ave_time:92674  total_time:370697
yihe            (   55756) count:4      ave_time:575641 total_time:2302565
ngnedin         (   54589) count:4      ave_time:624076 total_time:2496304
syuk            (   70145) count:4      ave_time:384192 total_time:1536769
tianq           (   63881) count:4      ave_time:83826388 total_time:335305554
toussain        (   10173) count:4      ave_time:68295  total_time:273183
karol           (   68859) count:4      ave_time:551    total_time:2204
dajiang         (   41744) count:4      ave_time:15885126 total_time:63540506
u232            (     232) count:2      ave_time:722560 total_time:1445120
u16621          (   16621) count:2      ave_time:4878741 total_time:9757483
szyang          (   70310) count:2      ave_time:19517654 total_time:39035309
samli           (   61845) count:2      ave_time:19341  total_time:38682
hyeongk         (   68594) count:2      ave_time:475    total_time:951
fgygi           (   40699) count:2      ave_time:368595 total_time:737190
morgak          (   57524) count:2      ave_time:366171 total_time:732342
istet           (   44164) count:2      ave_time:19766  total_time:39533
jyoti           (   63106) count:2      ave_time:1214022 total_time:2428045
xueling         (   63083) count:2      ave_time:2988804 total_time:5977609
gandolfi        (   47676) count:1      ave_time:90515  total_time:90515
jschlup         (   70509) count:1      ave_time:44571  total_time:44571
nid00837:~ #
Comment 7 Tim Wickberg 2015-12-29 06:28:13 MST
(In reply to Doug Jacobsen from comment #6)
> Hi Tim,
> 
> Yes, as far as I can tell these jobs are only waiting on nodes, but you did
> trigger an idea.  Does SLURM have some sort of load sensor like GridEngine
> for determining available memory on nodes?  Or does it just dole out the
> theoretical max of memory as specified in the slurm.conf for the node, and
> just assume the memory is available?  I ask because I can imagine a
> situation wherein the node doesn't get fully cleaned and there no longer is
> sufficient memory to be running jobs based on our DefMemPerNode settings.
> However, I think this is more of an academic point, our "regular" and
> "debug" partitions only give out the maximum amount of memory we allow, and
> I don't see any particular group of nodes stagnating.

There's no load monitoring à la SGE for memory; Slurm schedules based on the memory defined for the node versus the total requested when running with cons_res and CR_socket_memory. (This avoids any potential over-subscription; I've always been suspicious of that behavior in other schedulers. There is a way to forcibly over-provision nodes with memory, but we recommend against it for obvious reasons.)
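To illustrate the model Tim describes, here is a minimal sketch (not Slurm source; the function name is hypothetical) of the fit check: a request is compared against the node's configured memory minus what is already allocated, with no live measurement of what is actually free on the node.

```python
# Sketch of Slurm's memory accounting model (hypothetical helper, not Slurm code).
# The scheduler tracks configured memory (e.g. RealMemory in slurm.conf) and the
# sum of memory already allocated to jobs; it never probes the node's real free
# memory. If the node wasn't fully cleaned, this bookkeeping can diverge from
# reality -- exactly the scenario Doug raises above.

def memory_fits(configured_mb: int, allocated_mb: int, request_mb: int) -> bool:
    """True if the request fits within the node's *configured* memory."""
    return allocated_mb + request_mb <= configured_mb

# Node configured with 128000 MB, 96000 MB already allocated:
print(memory_fits(128000, 96000, 32000))  # fits exactly -> True
print(memory_fits(128000, 96000, 64000))  # would over-subscribe -> False
```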

> Actually, this reoccured last night -- I'll see if I can dig up the logs in
> a few minutes (there are a LOT of logs...)
> 
> This time, on a whim, I left the shared partition down, which comprises
> over half our job queue in terms of entry count, and has generated
> scheduling issues in the past (pathological failures wherein some jerk
> asking for all the memory on a node but only 1 core would completely block
> all shared-partition jobs from running, even on nodes the system wasn't
> planning on running the job on).
> 
> Anyway, with the shared partition down this issue has not reoccurred - so
> I'm wondering if this is somehow related to that partition.

What are you trying to do with the shared partition?

I could see Shared=FORCE:32 causing some odd behavior - it does do some load monitoring when deciding which nodes to over-subscribe. (Shared=FORCE oversubscribes, in your case up to 32x. Hopefully that's what you expect; I know the nomenclature behind some of those options isn't obvious - I've looked at it myself expecting it to share sockets but still allocate individual cores properly, which is not what it does.)
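For reference, the Shared setting is configured per partition in slurm.conf; a hypothetical partition definition (node range and partition names are placeholders, not taken from the attached cori config) allowing up to 32-way oversubscription would look like:

```
# Hypothetical slurm.conf fragment: shared partition oversubscribed up to 32x,
# regular partition with exclusive (non-shared) node use.
PartitionName=shared  Nodes=nid00[016-063] Shared=FORCE:32 State=UP
PartitionName=regular Nodes=nid0[0064-1467] Shared=EXCLUSIVE State=UP
```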

Also note that any sharing is per-partition: Slurm will not co-mingle jobs from separate partitions within a single node. This may be leading to some of the resource contention you're seeing - it looked like you'd only sent squeue output for regular, but I'm guessing there may have been some relatively large jobs pending in shared that caused resources to be reserved while awaiting a larger job launching in a separate partition.

Before I joined, I'd submitted a feature request to mark nodes as "earmarked" or something similar - some mechanism of noting that they aren't "idle" but are instead being kept empty in order to launch some future job. I'll see if I can get that done for 16.05 to at least help indicate the current status.

> Regarding your request for performance numbers.  I typically see our
> backfill scheduler cycle around 30s when shared is enabled. I did update
> the parameters yesterday to start preparing for our production
> configuration starting on 1/11.
> SchedulerParameters     =
> no_backup_scheduling,bf_window=10080,bf_resolution=120,
> bf_max_job_array_resv=20,default_queue_depth=400,bf_max_job_test=6000,
> bf_max_job_user=1,bf_continue,nohold_on_prolog_fail,kill_invalid_depend
> 
> With the shared partition down, things seem smoother.  Obviously we need to
> get that back online, but I want to let a few more big jobs run first
> before signing up for more pain.

That looks fine; I don't see any obvious anomalies. I'll await further logs when available.
Comment 10 Tim Wickberg 2015-12-29 07:52:48 MST
(In reply to Tim Wickberg from comment #7)
> Also note that any sharing is per-partition, Slurm will not co-mingle jobs
> from separate partitions within a single node. This may be leading to some
> of the resource contention you're seeing - it looked like you'd only sent
> squeue for regular, but I'm guessing there may have been some relatively
> large jobs pending in shared that could have caused the resources to be
> reserved awaiting a larger job launching in a separate partition.

I misspoke on part of this - splitting nodes between partitions does work as expected. The distinction, and the issues with partitions not splitting nodes, would arise only in certain cases when using gang scheduling or preemption, and you have neither of those here. My apologies for any confusion.

- Tim
Comment 11 Doug Jacobsen 2015-12-29 08:05:09 MST
Created attachment 2553 [details]
slurmctld log since last restart

slurmctld log. Look for statements after the partition is marked up, in particular after the shared partition is marked up.
Comment 12 Doug Jacobsen 2015-12-29 08:20:15 MST
I sent this in email, but it didn't seem to get put in:


I just modified some configs based on our edison experience (set explicit srun ports, KillOnBadExit, and such) - basically things that don't involve this issue. After restarting slurmctld, I up'd all the partitions (including shared) and the same thing happened again.

Really, only job 550452 should be blocked for Resources at this point.

nid00837:~ # squeue --start --sort=Q | grep "Resources"
            730556     debug   my_job    vfung PD                 N/A     40 (null)               (Resources)
            739990     debug     LiTi    weihu PD                 N/A     20 (null)               (Resources)
            743246     debug   my_job  ninghai PD                 N/A     64 nid00[024-051,062-06 (Resources)
            683567     debug I805_315  ppetrov PD                 N/A    128 (null)               (Resources)
            743330     debug   runner gandolfi PD                 N/A     32 nid0[0107-0108,0113- (Resources)
            736356    shared multi_0. jkretchm PD                 N/A      1 (null)               (Resources)
            674260   regular Run_0251   jjunum PD                 N/A      8 (null)               (Resources)
            674261   regular Run_0252   jjunum PD                 N/A      8 (null)               (Resources)
            674262   regular Run_0253   jjunum PD                 N/A      8 (null)               (Resources)
            674263   regular Run_0254   jjunum PD                 N/A      8 (null)               (Resources)
            674264   regular Run_0255   jjunum PD                 N/A      8 (null)               (Resources)
            674265   regular Run_0256   jjunum PD                 N/A      8 (null)               (Resources)
            674266   regular Run_0257   jjunum PD                 N/A      8 (null)               (Resources)
            674267   regular Run_0258   jjunum PD                 N/A      8 (null)               (Resources)
            674268   regular Run_0259   jjunum PD                 N/A      8 (null)               (Resources)
            674269   regular Run_0260   jjunum PD                 N/A      8 (null)               (Resources)
            674270   regular Run_0261   jjunum PD                 N/A      8 (null)               (Resources)
            674271   regular Run_0262   jjunum PD                 N/A      8 (null)               (Resources)
            674272   regular Run_0263   jjunum PD                 N/A      8 (null)               (Resources)
            674273   regular Run_0264   jjunum PD                 N/A      8 (null)               (Resources)
            674274   regular Run_0265   jjunum PD                 N/A      8 (null)               (Resources)
            674275   regular Run_0266   jjunum PD                 N/A      8 (null)               (Resources)
            674276   regular Run_0267   jjunum PD                 N/A      8 (null)               (Resources)
            674277   regular Run_0268   jjunum PD                 N/A      8 (null)               (Resources)
            674278   regular Run_0269   jjunum PD                 N/A      8 (null)               (Resources)
            674279   regular Run_0270   jjunum PD                 N/A      8 (null)               (Resources)
            674280   regular Run_0271   jjunum PD                 N/A      8 (null)               (Resources)
            674281   regular Run_0272   jjunum PD                 N/A      8 (null)               (Resources)
            674282   regular Run_0273   jjunum PD                 N/A      8 (null)               (Resources)
            674283   regular Run_0274   jjunum PD                 N/A      8 (null)               (Resources)
            674284   regular Run_0275   jjunum PD                 N/A      8 (null)               (Resources)
            674285   regular Run_0276   jjunum PD                 N/A      8 (null)               (Resources)
            674286   regular Run_0277   jjunum PD                 N/A      8 (null)               (Resources)
            674287   regular Run_0278   jjunum PD                 N/A      8 (null)               (Resources)
            674288   regular Run_0279   jjunum PD                 N/A      8 (null)               (Resources)
            674289   regular Run_0280   jjunum PD                 N/A      8 (null)               (Resources)
            674290   regular Run_0281   jjunum PD                 N/A      8 (null)               (Resources)
            674291   regular Run_0282   jjunum PD                 N/A      8 (null)               (Resources)
            674292   regular Run_0283   jjunum PD                 N/A      8 (null)               (Resources)
            674293   regular Run_0284   jjunum PD                 N/A      8 (null)               (Resources)
            674294   regular Run_0285   jjunum PD                 N/A      8 (null)               (Resources)
            674295   regular Run_0286   jjunum PD                 N/A      8 (null)               (Resources)
            675149   regular  GDB5L_L     bzhu PD                 N/A     32 (null)               (Resources)
            675150   regular  GDB5L_L     bzhu PD                 N/A     32 (null)               (Resources)
            675151   regular  GDB5L_L     bzhu PD                 N/A     32 (null)               (Resources)
            673738   regular trinity_  jungpyo PD 2015-12-31T02:20:00   1024 nid0[0209-0211,0216- (Resources)
            648729   regular mbd_rela   farren PD 2015-12-30T14:20:00    128 nid0[0209,0218-0222, (Resources)
            673922   regular     STw3  drhatch PD 2015-12-30T11:19:27     16 nid0[0231,0413,0501- (Resources)
            651551   regular Run_0683  gmcfarq PD                 N/A      9 (null)               (Resources)
            651552   regular Run_0684  gmcfarq PD                 N/A      9 (null)               (Resources)
            651553   regular Run_0685  gmcfarq PD                 N/A      9 (null)               (Resources)
            651554   regular Run_0686  gmcfarq PD                 N/A      9 (null)               (Resources)
            651555   regular Run_0687  gmcfarq PD                 N/A      9 (null)               (Resources)
            651556   regular Run_0688  gmcfarq PD                 N/A      9 (null)               (Resources)
            651557   regular Run_0689  gmcfarq PD                 N/A      9 (null)               (Resources)
            651558   regular Run_0690  gmcfarq PD                 N/A      9 (null)               (Resources)
            651559   regular Run_0691  gmcfarq PD                 N/A      9 (null)               (Resources)
            651560   regular Run_0692  gmcfarq PD                 N/A      9 (null)               (Resources)
            651561   regular Run_0693  gmcfarq PD                 N/A      9 (null)               (Resources)
            651562   regular Run_0694  gmcfarq PD                 N/A      9 (null)               (Resources)
            651563   regular Run_0695  gmcfarq PD                 N/A      9 (null)               (Resources)
            651564   regular Run_0696  gmcfarq PD                 N/A      9 (null)               (Resources)
            651565   regular Run_0697  gmcfarq PD                 N/A      9 (null)               (Resources)
            651566   regular Run_0698  gmcfarq PD                 N/A      9 (null)               (Resources)
            651567   regular Run_0699  gmcfarq PD                 N/A      9 (null)               (Resources)
            651568   regular Run_0700  gmcfarq PD                 N/A      9 (null)               (Resources)
            526702   regular      416    zfliu PD 2015-12-30T13:20:00    360 nid0[0208,0281-0287, (Resources)
            672510   regular ITER12MA     izzo PD 2015-12-30T10:54:00     22 nid00[446-447,464-46 (Resources)
            673750   regular HEAT_UCL  chhabra PD                 N/A     31 (null)               (Resources)
            669504   regular     test mastriko PD 2015-12-30T10:54:00     16 nid00[294-309]       (Resources)
            673200   regular    DCLL2  chhabra PD 2015-12-30T10:54:00     28 nid0[1440-1467]      (Resources)
            672765   regular ch3nh3pb  abdalla PD 2015-12-30T10:54:00     12 nid0[0998-1009]      (Resources)
            672766   regular ch3nh3pb  abdalla PD                 N/A     12 (null)               (Resources)
            672767   regular ch3nh3pb  abdalla PD                 N/A     12 (null)               (Resources)
            672768   regular ch3nh3pb  abdalla PD                 N/A     12 (null)               (Resources)
            672769   regular ch3nh3pb  abdalla PD                 N/A     12 (null)               (Resources)
            647441   regular usgsmega chunzhao PD                 N/A     50 (null)               (Resources)
            646874   regular  htmegan chunzhao PD 2015-12-30T10:54:00     50 nid0[0749,1616-1619, (Resources)
            669872   regular   zgoubi vranjbar PD 2015-12-30T10:54:00     32 nid00[701-703,720-74 (Resources)
            651548   regular Run_0682   jjunum PD                 N/A      9 (null)               (Resources)
            651485   regular Run_0619   jjunum PD                 N/A      9 (null)               (Resources)
            651486   regular Run_0620   jjunum PD                 N/A      9 (null)               (Resources)
            651487   regular Run_0621   jjunum PD                 N/A      9 (null)               (Resources)
            651488   regular Run_0622   jjunum PD                 N/A      9 (null)               (Resources)
            651489   regular Run_0623   jjunum PD                 N/A      9 (null)               (Resources)
            651490   regular Run_0624   jjunum PD                 N/A      9 (null)               (Resources)
            651491   regular Run_0625   jjunum PD                 N/A      9 (null)               (Resources)
            651492   regular Run_0626   jjunum PD                 N/A      9 (null)               (Resources)
            651493   regular Run_0627   jjunum PD                 N/A      9 (null)               (Resources)
            651494   regular Run_0628   jjunum PD                 N/A      9 (null)               (Resources)
            651495   regular Run_0629   jjunum PD                 N/A      9 (null)               (Resources)
            651496   regular Run_0630   jjunum PD                 N/A      9 (null)               (Resources)
            651497   regular Run_0631   jjunum PD                 N/A      9 (null)               (Resources)
            651498   regular Run_0632   jjunum PD                 N/A      9 (null)               (Resources)
            651499   regular Run_0633   jjunum PD                 N/A      9 (null)               (Resources)
            651500   regular Run_0634   jjunum PD                 N/A      9 (null)               (Resources)
            651501   regular Run_0635   jjunum PD                 N/A      9 (null)               (Resources)
            651502   regular Run_0636   jjunum PD                 N/A      9 (null)               (Resources)
            651503   regular Run_0637   jjunum PD                 N/A      9 (null)               (Resources)
            651504   regular Run_0638   jjunum PD                 N/A      9 (null)               (Resources)
            651505   regular Run_0639   jjunum PD                 N/A      9 (null)               (Resources)
            651506   regular Run_0640   jjunum PD                 N/A      9 (null)               (Resources)
            651507   regular Run_0641   jjunum PD                 N/A      9 (null)               (Resources)
            651508   regular Run_0642   jjunum PD                 N/A      9 (null)               (Resources)
            651509   regular Run_0643   jjunum PD                 N/A      9 (null)               (Resources)
            651510   regular Run_0644   jjunum PD                 N/A      9 (null)               (Resources)
            651511   regular Run_0645   jjunum PD                 N/A      9 (null)               (Resources)
            651512   regular Run_0646   jjunum PD                 N/A      9 (null)               (Resources)
            651513   regular Run_0647   jjunum PD                 N/A      9 (null)               (Resources)
            651514   regular Run_0648   jjunum PD                 N/A      9 (null)               (Resources)
            651515   regular Run_0649   jjunum PD                 N/A      9 (null)               (Resources)
            651516   regular Run_0650   jjunum PD                 N/A      9 (null)               (Resources)
            651517   regular Run_0651   jjunum PD                 N/A      9 (null)               (Resources)
            651518   regular Run_0652   jjunum PD                 N/A      9 (null)               (Resources)
            651519   regular Run_0653   jjunum PD                 N/A      9 (null)               (Resources)
            651520   regular Run_0654   jjunum PD                 N/A      9 (null)               (Resources)
            651521   regular Run_0655   jjunum PD                 N/A      9 (null)               (Resources)
            651522   regular Run_0656   jjunum PD                 N/A      9 (null)               (Resources)
            651523   regular Run_0657   jjunum PD                 N/A      9 (null)               (Resources)
            651524   regular Run_0658   jjunum PD                 N/A      9 (null)               (Resources)
            651525   regular Run_0659   jjunum PD                 N/A      9 (null)               (Resources)
            651526   regular Run_0660   jjunum PD                 N/A      9 (null)               (Resources)
            651527   regular Run_0661   jjunum PD                 N/A      9 (null)               (Resources)
            651528   regular Run_0662   jjunum PD                 N/A      9 (null)               (Resources)
            651529   regular Run_0663   jjunum PD                 N/A      9 (null)               (Resources)
            651530   regular Run_0664   jjunum PD                 N/A      9 (null)               (Resources)
            651531   regular Run_0665   jjunum PD                 N/A      9 (null)               (Resources)
            651532   regular Run_0666   jjunum PD                 N/A      9 (null)               (Resources)
            651533   regular Run_0667   jjunum PD                 N/A      9 (null)               (Resources)
            651534   regular Run_0668   jjunum PD                 N/A      9 (null)               (Resources)
            651535   regular Run_0669   jjunum PD                 N/A      9 (null)               (Resources)
            651536   regular Run_0670   jjunum PD                 N/A      9 (null)               (Resources)
            651537   regular Run_0671   jjunum PD                 N/A      9 (null)               (Resources)
            651538   regular Run_0672   jjunum PD                 N/A      9 (null)               (Resources)
            651539   regular Run_0673   jjunum PD                 N/A      9 (null)               (Resources)
            651540   regular Run_0674   jjunum PD                 N/A      9 (null)               (Resources)
            651541   regular Run_0675   jjunum PD                 N/A      9 (null)               (Resources)
            651542   regular Run_0676   jjunum PD                 N/A      9 (null)               (Resources)
            651543   regular Run_0677   jjunum PD                 N/A      9 (null)               (Resources)
            651544   regular Run_0678   jjunum PD                 N/A      9 (null)               (Resources)
            651545   regular Run_0679   jjunum PD                 N/A      9 (null)               (Resources)
            651546   regular Run_0680   jjunum PD                 N/A      9 (null)               (Resources)
            651547   regular Run_0681   jjunum PD                 N/A      9 (null)               (Resources)
            328241   regular   D23_re    wangw PD 2015-12-30T10:54:00     64 nid0[2070-2111,2128- (Resources)
            651948   regular     GENE   nbonan PD                 N/A    255 (null)               (Resources)
            651381   regular     NEB4 mgsensoy PD 2015-12-30T10:54:00     80 nid0[0210-0211,0216- (Resources)
            528855   regular run.nimr   pankin PD                 N/A    258 (null)               (Resources)
            644338   regular Run_0579  gmcfarq PD 2015-12-30T10:32:05      9 nid00[598-599,920-92 (Resources)
            644339   regular Run_0580  gmcfarq PD                 N/A      9 (null)               (Resources)
            644340   regular Run_0581  gmcfarq PD                 N/A      9 (null)               (Resources)
            644341   regular Run_0582  gmcfarq PD                 N/A      9 (null)               (Resources)
            644342   regular Run_0583  gmcfarq PD                 N/A      9 (null)               (Resources)
            644343   regular Run_0584  gmcfarq PD                 N/A      9 (null)               (Resources)
            644344   regular Run_0585  gmcfarq PD                 N/A      9 (null)               (Resources)
            644345   regular Run_0586  gmcfarq PD                 N/A      9 (null)               (Resources)
            644346   regular Run_0587  gmcfarq PD                 N/A      9 (null)               (Resources)
            644347   regular Run_0588  gmcfarq PD                 N/A      9 (null)               (Resources)
            644348   regular Run_0589  gmcfarq PD                 N/A      9 (null)               (Resources)
            649026   regular   n_1120   heidih PD 2015-12-30T10:54:00     35 nid0[0571,0596-0597, (Resources)
            649033   regular    n_960   heidih PD                 N/A     30 (null)               (Resources)
            644382   regular Run_0592   jjunum PD 2015-12-30T10:32:05      9 nid00[891-895,916-91 (Resources)
            644384   regular Run_0594   jjunum PD                 N/A      9 (null)               (Resources)
            644385   regular Run_0595   jjunum PD                 N/A      9 (null)               (Resources)
            644386   regular Run_0596   jjunum PD                 N/A      9 (null)               (Resources)
            644387   regular Run_0597   jjunum PD                 N/A      9 (null)               (Resources)
            644388   regular Run_0598   jjunum PD                 N/A      9 (null)               (Resources)
            644389   regular Run_0599   jjunum PD                 N/A      9 (null)               (Resources)
            644390   regular Run_0600   jjunum PD                 N/A      9 (null)               (Resources)
            644391   regular Run_0601   jjunum PD                 N/A      9 (null)               (Resources)
            644392   regular Run_0602   jjunum PD                 N/A      9 (null)               (Resources)
            644393   regular Run_0603   jjunum PD                 N/A      9 (null)               (Resources)
            644394   regular Run_0604   jjunum PD                 N/A      9 (null)               (Resources)
            644395   regular Run_0605   jjunum PD                 N/A      9 (null)               (Resources)
            644396   regular Run_0606   jjunum PD                 N/A      9 (null)               (Resources)
            644397   regular Run_0607   jjunum PD                 N/A      9 (null)               (Resources)
            644398   regular Run_0608   jjunum PD                 N/A      9 (null)               (Resources)
            644399   regular Run_0609   jjunum PD                 N/A      9 (null)               (Resources)
            644400   regular Run_0610   jjunum PD                 N/A      9 (null)               (Resources)
            644401   regular Run_0611   jjunum PD                 N/A      9 (null)               (Resources)
            644402   regular Run_0612   jjunum PD                 N/A      9 (null)               (Resources)
            644403   regular Run_0613   jjunum PD                 N/A      9 (null)               (Resources)
            644404   regular Run_0614   jjunum PD                 N/A      9 (null)               (Resources)
            644405   regular Run_0615   jjunum PD                 N/A      9 (null)               (Resources)
            644406   regular Run_0616   jjunum PD                 N/A      9 (null)               (Resources)
            644407   regular Run_0617   jjunum PD                 N/A      9 (null)               (Resources)
            644408   regular Run_0618   jjunum PD                 N/A      9 (null)               (Resources)
            502782   regular P2228_NT mhumbert PD                 N/A    100 (null)               (Resources)
            502780   regular C4C1PIP_ mhumbert PD                 N/A    100 (null)               (Resources)
            502776   regular C4C1IM_N mhumbert PD                 N/A    100 (null)               (Resources)
            502778   regular C4C1IM_O mhumbert PD                 N/A    100 (null)               (Resources)
            502775   regular C4C1IM_4 mhumbert PD 2015-12-30T10:54:00    100 nid0[0223-0225,0233, (Resources)
            643870   regular  GDB5L_L     bzhu PD                 N/A     32 (null)               (Resources)
            643871   regular  GDB5L_L     bzhu PD                 N/A     32 (null)               (Resources)
            643872   regular  GDB5L_L     bzhu PD                 N/A     32 (null)               (Resources)
            643856   regular  GDB5L_L     bzhu PD                 N/A     32 (null)               (Resources)
            643848   regular  GDB5L_L     bzhu PD 2015-12-30T10:01:15     32 nid0[0358-0359,0494, (Resources)
            535908   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535909   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535910   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535915   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535916   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535918   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535919   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535920   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535921   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535922   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535923   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535924   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535925   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535926   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535927   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535928   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535929   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535930   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535403   regular heavy_me  smeinel PD 2015-12-30T07:02:09     32 nid0[0229-0230,0235- (Resources)
            535404   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535405   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535406   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535407   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535408   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535409   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535410   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535411   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            535412   regular heavy_me  smeinel PD                 N/A     32 (null)               (Resources)
            525578   regular 4kmNCEP_   yuxing PD 2015-12-30T05:20:59    100 nid0[0209,0218,0221- (Resources)
            543194   regular rawMPI84  fnrizzi PD                 N/A    831 (null)               (Resources)
            543196   regular rawMPI98  fnrizzi PD                 N/A   1050 (null)               (Resources)
            551687   regular  Cu_45_4   ayonge PD                 N/A      2 (null)               (Resources)
            551665   regular Cu_45_3_   ayonge PD                 N/A      2 (null)               (Resources)
            551474   regular Cu_45_2_   ayonge PD                 N/A      2 (null)               (Resources)
            551232   regular Ag_45_4_   ayonge PD                 N/A      2 (null)               (Resources)
            551159   regular Ag_45_3_   ayonge PD                 N/A      2 (null)               (Resources)
            550867   regular Ag_45_2_   ayonge PD                 N/A      2 (null)               (Resources)
            549905   regular Cu_44_2_   ayonge PD                 N/A      2 (null)               (Resources)
            549830   regular Cu_O_2_h   ayonge PD                 N/A      2 (null)               (Resources)
            549695   regular Cu_43_on   ayonge PD                 N/A      2 (null)               (Resources)
            549460   regular Ag_43_on   ayonge PD                 N/A      2 (null)               (Resources)
            547747   regular Cu_42_on   ayonge PD                 N/A      2 (null)               (Resources)
            547677   regular Ag_42_on   ayonge PD                 N/A      2 (null)               (Resources)
            547186   regular Cu_41_on   ayonge PD                 N/A      2 (null)               (Resources)
            547112   regular Ag_41_on   ayonge PD                 N/A      2 (null)               (Resources)
            546845   regular  Cu_40_4   ayonge PD                 N/A      2 (null)               (Resources)
            546818   regular  Cu_40_3   ayonge PD                 N/A      2 (null)               (Resources)
            582336   regular eX10B-x1     jmay PD 2015-12-30T10:54:00    400 nid0[0252-0255,0272- (Resources)
            544800   regular  Cu_40_2   ayonge PD                 N/A      2 (null)               (Resources)
            544793   regular  Ag_40_4   ayonge PD                 N/A      2 (null)               (Resources)
            544789   regular   Ag_h-3   ayonge PD                 N/A      2 (null)               (Resources)
            544783   regular   Ag_h_1   ayonge PD                 N/A      2 (null)               (Resources)
            544765   regular  Cu_39_4   ayonge PD                 N/A      2 (null)               (Resources)
            544761   regular  Cu_39_3   ayonge PD                 N/A      2 (null)               (Resources)
            544756   regular  Cu_39_2   ayonge PD                 N/A      2 (null)               (Resources)
            544753   regular  Cu_39_1   ayonge PD                 N/A      2 (null)               (Resources)
            544748   regular  Ag_39_4   ayonge PD                 N/A      2 (null)               (Resources)
            544744   regular Ag_39_3_   ayonge PD                 N/A      2 (null)               (Resources)
            544740   regular  Ag_39_2   ayonge PD                 N/A      2 (null)               (Resources)
            544735   regular Ag_39_1_   ayonge PD                 N/A      2 (null)               (Resources)
            544594   regular  Cu_36_4   ayonge PD                 N/A      2 (null)               (Resources)
            544586   regular  Cu_36_3   ayonge PD                 N/A      2 (null)               (Resources)
            544566   regular Ag_36_4_   ayonge PD                 N/A      2 (null)               (Resources)
            544561   regular Ag_36_3_   ayonge PD                 N/A      2 (null)               (Resources)
            544552   regular  Cu_36_2   ayonge PD                 N/A      2 (null)               (Resources)
            544517   regular Ag_36_2_   ayonge PD                 N/A      2 (null)               (Resources)
            544494   regular  Cu_35_5   ayonge PD                 N/A      2 (null)               (Resources)
            544486   regular  Cu_35_3   ayonge PD                 N/A      2 (null)               (Resources)
            544487   regular  Cu_35_4   ayonge PD                 N/A      2 (null)               (Resources)
            544481   regular  Cu_35_2   ayonge PD                 N/A      2 (null)               (Resources)
            544474   regular  Cu_35_1   ayonge PD                 N/A      2 (null)               (Resources)
            544463   regular  Ag_35_5   ayonge PD                 N/A      2 (null)               (Resources)
            544457   regular  Ag_35_4   ayonge PD                 N/A      2 (null)               (Resources)
            544439   regular  Ag_35_3   ayonge PD                 N/A      2 (null)               (Resources)
            544432   regular  Ag_35_2   ayonge PD                 N/A      2 (null)               (Resources)
            544425   regular  Ag_35_1   ayonge PD                 N/A      2 (null)               (Resources)
            543284   regular Cu_33_1_   ayonge PD                 N/A      2 (null)               (Resources)
            543266   regular Ag_33_1_   ayonge PD                 N/A      2 (null)               (Resources)
            543251   regular  Cu_32_6   ayonge PD                 N/A      2 (null)               (Resources)
            543240   regular  Cu_32_5   ayonge PD                 N/A      2 (null)               (Resources)
            543230   regular  Cu_32_4   ayonge PD                 N/A      2 (null)               (Resources)
            543220   regular Ag_32_6_   ayonge PD                 N/A      2 (null)               (Resources)
            543213   regular Ag_32_5_   ayonge PD                 N/A      2 (null)               (Resources)
            543203   regular Ag_32_4_   ayonge PD                 N/A      2 (null)               (Resources)
            543109   regular Cu_32_1_   ayonge PD                 N/A      2 (null)               (Resources)
            543107   regular Cu_32_2_   ayonge PD                 N/A      2 (null)               (Resources)
            543100   regular Cu_32_1_   ayonge PD                 N/A      2 (null)               (Resources)
            543088   regular Ag_32_2_   ayonge PD                 N/A      2 (null)               (Resources)
            543077   regular Ag_32_1_   ayonge PD                 N/A      2 (null)               (Resources)
            543034   regular Cu_30_2_   ayonge PD                 N/A      2 (null)               (Resources)
            543033   regular Cu_30_1_   ayonge PD                 N/A      2 (null)               (Resources)
            543027   regular Ag_30_2_   ayonge PD                 N/A      2 (null)               (Resources)
            543013   regular Ag_30_1_   ayonge PD                 N/A      2 (null)               (Resources)
            542968   regular Cu_29_2_   ayonge PD                 N/A      2 (null)               (Resources)
            542962   regular Ag_29_2_   ayonge PD                 N/A      2 (null)               (Resources)
            542917   regular Cu_28_3_   ayonge PD                 N/A      2 (null)               (Resources)
            542845   regular Cu_28_2_   ayonge PD                 N/A      2 (null)               (Resources)
            542843   regular Cu_28_1_   ayonge PD                 N/A      2 (null)               (Resources)
            542807   regular Ag_28_3_   ayonge PD                 N/A      2 (null)               (Resources)
            542800   regular Ag_28_2_   ayonge PD                 N/A      2 (null)               (Resources)
            542793   regular Ag_28_1_   ayonge PD 2015-12-29T14:39:43      2 nid0[1840,1960]      (Resources)
            478346   regular run.nimr   pankin PD 2015-12-30T01:20:08    258 nid0[0208,0219-0220, (Resources)
            549228   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549229   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549230   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549231   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549232   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549233   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549234   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549235   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549236   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549237   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549238   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549239   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549240   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549241   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549242   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549243   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549244   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549245   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549246   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549247   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549248   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549249   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549250   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549251   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549252   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549253   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549254   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549255   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549256   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549257   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549258   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549259   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549260   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549261   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549262   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549263   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549264   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549265   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549266   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549267   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549268   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549269   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549270   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549271   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549272   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549273   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549274   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549275   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549276   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549277   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549278   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549279   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549280   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549281   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549282   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549283   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549284   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549285   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549286   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549287   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549288   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549289   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549290   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            549291   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545556   regular  big_job berkowit PD 2015-12-29T22:54:00    512 nid0[0223-0225,0233, (Resources)
            545557   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545558   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545559   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545560   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545561   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545562   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545563   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545564   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545565   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545566   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545567   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545568   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545569   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545570   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545571   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545572   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545573   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545574   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545575   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545576   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545577   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545578   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545579   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545580   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545581   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545582   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545583   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545584   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            545585   regular  big_job berkowit PD                 N/A    512 (null)               (Resources)
            499409   regular     GENE    dtold PD 2015-12-29T22:54:00    320 nid0[0210-0211,0216- (Resources)
            513398   regular       05      ocs PD 2015-12-29T22:54:00    128 nid0[0252-0255,0272- (Resources)
            513399   regular       06      ocs PD                 N/A    128 (null)               (Resources)
            513400   regular       07      ocs PD                 N/A    128 (null)               (Resources)
            513401   regular       08      ocs PD                 N/A    128 (null)               (Resources)
            513402   regular       09      ocs PD                 N/A    128 (null)               (Resources)
            513403   regular       10      ocs PD                 N/A    128 (null)               (Resources)
            513404   regular       11      ocs PD                 N/A    128 (null)               (Resources)
            513405   regular       12      ocs PD                 N/A    128 (null)               (Resources)
            513406   regular       13      ocs PD                 N/A    128 (null)               (Resources)
            513407   regular       14      ocs PD                 N/A    128 (null)               (Resources)
            513408   regular       15      ocs PD                 N/A    128 (null)               (Resources)
            509895   regular rawMPI36  fnrizzi PD                 N/A   1424 (null)               (Resources)
            509887   regular rawMPI30  fnrizzi PD 2015-12-29T20:17:36    989 nid0[0208,0210-0211, (Resources)
            550452   regular    ucan2    u1103 PD 2015-12-29T22:24:17   1024 nid0[0208,0210-0211, (Resources)
nid00837:~ # sprio -j 550452
          JOBID   PRIORITY        AGE  FAIRSHARE  PARTITION        QOS
         550452      33680      17223       7097       2160       7200
nid00837:~ # sprio -j 509887
          JOBID   PRIORITY        AGE  FAIRSHARE  PARTITION        QOS
         509887      33622      20002       4260       2160       7200
nid00837:~ # scontrol show job 550452
JobId=550452 JobName=ucan2
   UserId=u1103(1103) GroupId=u1103(1001103)
   Priority=33685 Nice=0 Account=m616 QOS=normal_regular_1
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2015-12-17T14:32:12 EligibleTime=2015-12-17T14:32:12
   StartTime=2015-12-29T22:24:17 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=regular AllocNode:Sid=cori10:50398
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=nid0[0208,0210-0211,0216-0217,0223-0225,0232-0234,0237-0238,0241-0248,0252-0255,0272-0279,0294-0319,0336-0341,0350-0352,0360-0363,0365-0368,0376-0383,0408-0412,0414-0447,0464-0467,0472-0493,0495-0502,0504-0511,0532-0535,0537-0539,0541-0548,0559-0571,0574-0575,0596-0597,0600-0604,0606-0607,0629-0636,0663-0666,0668-0669,0681-0688,0701-0703,0720-0767,0788-0789,0800-0802,0817-0821,0827-0830,0848-0851,0861-0863,0865-0884,0990-1016,1042-1083,1085-1087,1104-1128,1143-1148,1150-1151,1172-1184,1190-1215,1232-1235,1240-1247,1270-1278,1301-1324,1328-1343,1364-1366,1379-1385,1405-1407,1424-1438,1440-1471,1488-1492,1498-1535,1556-1565,1567-1599,1616-1619,1624-1663,1684-1688,1690-1696,1704-1717,1719-1727,1748-1789,1808-1823,1880-1913,1946-1952,1954-1957,1970-1983,2000-2003,2008-2009,2016-2019,2027-2031,2034-2047,2068-2111,2128-2151,2156-2167,2170-2172,2174-2175,2192-2195,2197-2216,2219-2239]
   NumNodes=1024-1024 NumCPUs=1024 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1024,node=1024
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=craynetwork:1 Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/global/cscratch1/sd/u1103/ducan2/dtest/d32768/run.cori
   WorkDir=/global/cscratch1/sd/u1103/ducan2/dtest/d32768
   StdErr=/global/cscratch1/sd/u1103/ducan2/dtest/d32768/ucan2-32768.err
   StdIn=/dev/null
   StdOut=/global/cscratch1/sd/u1103/ducan2/dtest/d32768/ucan2-32768.out
   Power= SICP=0

nid00837:~ #
nid00837:~ # scontrol show job 509887
JobId=509887 JobName=rawMPI30x30cori
   UserId=fnrizzi(60679) GroupId=fnrizzi(60679)
   Priority=33622 Nice=0 Account=m1882 QOS=normal_regular_1
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:40:00 TimeMin=N/A
   SubmitTime=2015-12-15T16:18:38 EligibleTime=2015-12-15T16:18:38
   StartTime=2015-12-29T20:17:36 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=regular AllocNode:Sid=cori07:43248
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=nid0[0208,0210-0211,0216-0217,0223-0225,0228,0232-0234,0237-0238,0241-0248,0252-0255,0272-0279,0287,0292,0294-0319,0336-0341,0350-0352,0357,0360-0363,0365-0367,0376-0383,0408-0412,0446-0447,0464-0467,0472-0493,0497-0502,0504-0511,0532-0535,0537-0539,0541-0548,0559-0571,0574-0575,0596-0597,0600-0604,0606-0607,0629-0636,0663-0666,0668-0669,0678-0679,0681-0688,0701-0703,0720-0767,0788-0789,0800-0801,0817-0821,0823-0824,0827-0830,0848-0851,0861-0863,0865-0884,0987,0993-1016,1042-1083,1085-1087,1104-1128,1138,1145-1148,1150-1151,1172-1184,1190-1215,1232-1235,1240-1247,1270-1278,1301-1323,1328-1343,1364-1366,1375-1376,1379-1385,1405-1407,1424-1438,1440-1471,1488-1492,1498-1535,1556-1565,1567-1599,1616-1619,1624-1663,1684-1688,1690-1696,1704-1717,1719-1727,1748-1789,1808-1823,1837-1838,1880-1913,1946-1952,1954-1957,1970-1971,1973-1983,2000-2003,2008-2009,2016-2019,2024-2025,2027-2031,2034-2047,2068-2111,2128-2151,2164-2167,2170-2172,2174-2175,2192-2195,2197-2216,2219-2239]
   NumNodes=989-989 NumCPUs=989 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=989,node=989
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=craynetwork:1 Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/global/u1/f/fnrizzi/coriRuns/run30x30.cori
   WorkDir=/global/u1/f/fnrizzi/coriRuns
   StdErr=/global/u1/f/fnrizzi/coriRuns/slurm-509887.out
   StdIn=/dev/null
   StdOut=/global/u1/f/fnrizzi/coriRuns/slurm-509887.out
   Power= SICP=0

nid00837:~ #
Comment 13 Doug Jacobsen 2015-12-29 08:26:15 MST
I just modified some configs based on our edison experience (setting explicit
srun ports, KillOnBadExit, and so on) -- basically changes unrelated to this
issue. After restarting slurmctld, I set all the partitions (including
shared) back up, and the same thing happened again.

Really, only job 550452 should be blocked on Resources at this point.
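
As a side note, for tracking how often this recurs, a quick way to tally pending-job reasons (so a spike in "Resources" stands out) is to parse output like `squeue -h -t PD -o "%i %r"`. A minimal sketch -- the squeue call is stubbed out with sample data here, and the job IDs are just illustrative:

```python
from collections import Counter

# In practice this text would come from running:
#   squeue -h -t PD -o "%i %r"
# which emits one "JOBID REASON" pair per line; sample data stands in here.
sample = """\
550452 Resources
509887 Resources
651948 Resources
600001 Priority
600002 Dependency
"""

def tally_reasons(squeue_output):
    """Count pending-job reasons from 'JOBID REASON' lines."""
    reasons = Counter()
    for line in squeue_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            reasons[parts[1]] += 1
    return reasons

counts = tally_reasons(sample)
print(counts["Resources"])  # 3 with the sample above
```

With real squeue output, anything much above the expected two or three "Resources" jobs (one per partition segment) would flag the bad state.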

nid00837:~ # squeue --start --sort=Q | grep "Resources"
            730556     debug   my_job    vfung PD                 N/A     40 (null)               (Resources)
            739990     debug     LiTi    weihu PD                 N/A     20 (null)               (Resources)
            743246     debug   my_job  ninghai PD                 N/A     64 nid00[024-051,062-06 (Resources)
            683567     debug I805_315  ppetrov PD                 N/A    128 (null)               (Resources)
            743330     debug   runner gandolfi PD                 N/A     32 nid0[0107-0108,0113- (Resources)
            736356    shared multi_0. jkretchm PD                 N/A      1 (null)               (Resources)
            674260   regular Run_0251   jjunum PD                 N/A      8 (null)               (Resources)
            674261   regular Run_0252   jjunum PD                 N/A      8 (null)               (Resources)
            674262   regular Run_0253   jjunum PD                 N/A      8 (null)               (Resources)
            674263   regular Run_0254   jjunum PD                 N/A      8 (null)               (Resources)
            674264   regular Run_0255   jjunum PD                 N/A      8 (null)               (Resources)
            674265   regular Run_0256   jjunum PD                 N/A      8 (null)               (Resources)
            674266   regular Run_0257   jjunum PD                 N/A      8 (null)               (Resources)
            674267   regular Run_0258   jjunum PD                 N/A      8 (null)               (Resources)
            674268   regular Run_0259   jjunum PD                 N/A      8 (null)               (Resources)
            674269   regular Run_0260   jjunum PD                 N/A      8 (null)               (Resources)
            674270   regular Run_0261   jjunum PD                 N/A      8 (null)               (Resources)
            674271   regular Run_0262   jjunum PD                 N/A      8 (null)               (Resources)
            674272   regular Run_0263   jjunum PD                 N/A      8 (null)               (Resources)
            674273   regular Run_0264   jjunum PD                 N/A
 8 (null)               (Resources)
            674274   regular Run_0265   jjunum PD                 N/A
 8 (null)               (Resources)
            674275   regular Run_0266   jjunum PD                 N/A
 8 (null)               (Resources)
            674276   regular Run_0267   jjunum PD                 N/A
 8 (null)               (Resources)
            674277   regular Run_0268   jjunum PD                 N/A
 8 (null)               (Resources)
            674278   regular Run_0269   jjunum PD                 N/A
 8 (null)               (Resources)
            674279   regular Run_0270   jjunum PD                 N/A
 8 (null)               (Resources)
            674280   regular Run_0271   jjunum PD                 N/A
 8 (null)               (Resources)
            674281   regular Run_0272   jjunum PD                 N/A
 8 (null)               (Resources)
            674282   regular Run_0273   jjunum PD                 N/A
 8 (null)               (Resources)
            674283   regular Run_0274   jjunum PD                 N/A
 8 (null)               (Resources)
            674284   regular Run_0275   jjunum PD                 N/A
 8 (null)               (Resources)
            674285   regular Run_0276   jjunum PD                 N/A
 8 (null)               (Resources)
            674286   regular Run_0277   jjunum PD                 N/A
 8 (null)               (Resources)
            674287   regular Run_0278   jjunum PD                 N/A
 8 (null)               (Resources)
            674288   regular Run_0279   jjunum PD                 N/A
 8 (null)               (Resources)
            674289   regular Run_0280   jjunum PD                 N/A
 8 (null)               (Resources)
            674290   regular Run_0281   jjunum PD                 N/A
 8 (null)               (Resources)
            674291   regular Run_0282   jjunum PD                 N/A
 8 (null)               (Resources)
            674292   regular Run_0283   jjunum PD                 N/A
 8 (null)               (Resources)
            674293   regular Run_0284   jjunum PD                 N/A
 8 (null)               (Resources)
            674294   regular Run_0285   jjunum PD                 N/A
 8 (null)               (Resources)
            674295   regular Run_0286   jjunum PD                 N/A
 8 (null)               (Resources)
            675149   regular  GDB5L_L     bzhu PD                 N/A
32 (null)               (Resources)
            675150   regular  GDB5L_L     bzhu PD                 N/A
32 (null)               (Resources)
            675151   regular  GDB5L_L     bzhu PD                 N/A
32 (null)               (Resources)
            673738   regular trinity_  jungpyo PD 2015-12-31T02:20:00
1024 nid0[0209-0211,0216- (Resources)
            648729   regular mbd_rela   farren PD 2015-12-30T14:20:00
 128 nid0[0209,0218-0222, (Resources)
            673922   regular     STw3  drhatch PD 2015-12-30T11:19:27
16 nid0[0231,0413,0501- (Resources)
            651551   regular Run_0683  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651552   regular Run_0684  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651553   regular Run_0685  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651554   regular Run_0686  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651555   regular Run_0687  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651556   regular Run_0688  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651557   regular Run_0689  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651558   regular Run_0690  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651559   regular Run_0691  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651560   regular Run_0692  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651561   regular Run_0693  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651562   regular Run_0694  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651563   regular Run_0695  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651564   regular Run_0696  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651565   regular Run_0697  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651566   regular Run_0698  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651567   regular Run_0699  gmcfarq PD                 N/A
 9 (null)               (Resources)
            651568   regular Run_0700  gmcfarq PD                 N/A
 9 (null)               (Resources)
            526702   regular      416    zfliu PD 2015-12-30T13:20:00
 360 nid0[0208,0281-0287, (Resources)
            672510   regular ITER12MA     izzo PD 2015-12-30T10:54:00
22 nid00[446-447,464-46 (Resources)
            673750   regular HEAT_UCL  chhabra PD                 N/A
31 (null)               (Resources)
            669504   regular     test mastriko PD 2015-12-30T10:54:00
16 nid00[294-309]       (Resources)
            673200   regular    DCLL2  chhabra PD 2015-12-30T10:54:00
28 nid0[1440-1467]      (Resources)
            672765   regular ch3nh3pb  abdalla PD 2015-12-30T10:54:00
12 nid0[0998-1009]      (Resources)
            672766   regular ch3nh3pb  abdalla PD                 N/A
12 (null)               (Resources)
            672767   regular ch3nh3pb  abdalla PD                 N/A
12 (null)               (Resources)
            672768   regular ch3nh3pb  abdalla PD                 N/A
12 (null)               (Resources)
            672769   regular ch3nh3pb  abdalla PD                 N/A
12 (null)               (Resources)
            647441   regular usgsmega chunzhao PD                 N/A
50 (null)               (Resources)
            646874   regular  htmegan chunzhao PD 2015-12-30T10:54:00
50 nid0[0749,1616-1619, (Resources)
            669872   regular   zgoubi vranjbar PD 2015-12-30T10:54:00
32 nid00[701-703,720-74 (Resources)
            651548   regular Run_0682   jjunum PD                 N/A
 9 (null)               (Resources)
            651485   regular Run_0619   jjunum PD                 N/A
 9 (null)               (Resources)
            651486   regular Run_0620   jjunum PD                 N/A
 9 (null)               (Resources)
            651487   regular Run_0621   jjunum PD                 N/A
 9 (null)               (Resources)
            651488   regular Run_0622   jjunum PD                 N/A
 9 (null)               (Resources)
            651489   regular Run_0623   jjunum PD                 N/A
 9 (null)               (Resources)
            651490   regular Run_0624   jjunum PD                 N/A
 9 (null)               (Resources)
            651491   regular Run_0625   jjunum PD                 N/A
 9 (null)               (Resources)
            651492   regular Run_0626   jjunum PD                 N/A
 9 (null)               (Resources)
            651493   regular Run_0627   jjunum PD                 N/A
 9 (null)               (Resources)
            651494   regular Run_0628   jjunum PD                 N/A
 9 (null)               (Resources)
            651495   regular Run_0629   jjunum PD                 N/A
 9 (null)               (Resources)
            651496   regular Run_0630   jjunum PD                 N/A
 9 (null)               (Resources)
            651497   regular Run_0631   jjunum PD                 N/A
 9 (null)               (Resources)
            651498   regular Run_0632   jjunum PD                 N/A
 9 (null)               (Resources)
            651499   regular Run_0633   jjunum PD                 N/A
 9 (null)               (Resources)
            651500   regular Run_0634   jjunum PD                 N/A
 9 (null)               (Resources)
            651501   regular Run_0635   jjunum PD                 N/A
 9 (null)               (Resources)
            651502   regular Run_0636   jjunum PD                 N/A
 9 (null)               (Resources)
            651503   regular Run_0637   jjunum PD                 N/A
 9 (null)               (Resources)
            651504   regular Run_0638   jjunum PD                 N/A
 9 (null)               (Resources)
            651505   regular Run_0639   jjunum PD                 N/A
 9 (null)               (Resources)
            651506   regular Run_0640   jjunum PD                 N/A
 9 (null)               (Resources)
            651507   regular Run_0641   jjunum PD                 N/A
 9 (null)               (Resources)
            651508   regular Run_0642   jjunum PD                 N/A
 9 (null)               (Resources)
            651509   regular Run_0643   jjunum PD                 N/A
 9 (null)               (Resources)
            651510   regular Run_0644   jjunum PD                 N/A
 9 (null)               (Resources)
            651511   regular Run_0645   jjunum PD                 N/A
 9 (null)               (Resources)
            651512   regular Run_0646   jjunum PD                 N/A
 9 (null)               (Resources)
            651513   regular Run_0647   jjunum PD                 N/A
 9 (null)               (Resources)
            651514   regular Run_0648   jjunum PD                 N/A
 9 (null)               (Resources)
            651515   regular Run_0649   jjunum PD                 N/A
 9 (null)               (Resources)
            651516   regular Run_0650   jjunum PD                 N/A
 9 (null)               (Resources)
            651517   regular Run_0651   jjunum PD                 N/A
 9 (null)               (Resources)
            651518   regular Run_0652   jjunum PD                 N/A
 9 (null)               (Resources)
            651519   regular Run_0653   jjunum PD                 N/A
 9 (null)               (Resources)
            651520   regular Run_0654   jjunum PD                 N/A
 9 (null)               (Resources)
            651521   regular Run_0655   jjunum PD                 N/A
 9 (null)               (Resources)
            651522   regular Run_0656   jjunum PD                 N/A
 9 (null)               (Resources)
            651523   regular Run_0657   jjunum PD                 N/A
 9 (null)               (Resources)
            651524   regular Run_0658   jjunum PD                 N/A
 9 (null)               (Resources)
            651525   regular Run_0659   jjunum PD                 N/A
 9 (null)               (Resources)
            651526   regular Run_0660   jjunum PD                 N/A
 9 (null)               (Resources)
            651527   regular Run_0661   jjunum PD                 N/A
 9 (null)               (Resources)
            651528   regular Run_0662   jjunum PD                 N/A
 9 (null)               (Resources)
            651529   regular Run_0663   jjunum PD                 N/A
 9 (null)               (Resources)
            651530   regular Run_0664   jjunum PD                 N/A
 9 (null)               (Resources)
            651531   regular Run_0665   jjunum PD                 N/A
 9 (null)               (Resources)
            651532   regular Run_0666   jjunum PD                 N/A
 9 (null)               (Resources)
            651533   regular Run_0667   jjunum PD                 N/A
 9 (null)               (Resources)
            651534   regular Run_0668   jjunum PD                 N/A
 9 (null)               (Resources)
            651535   regular Run_0669   jjunum PD                 N/A
 9 (null)               (Resources)
            651536   regular Run_0670   jjunum PD                 N/A
 9 (null)               (Resources)
            651537   regular Run_0671   jjunum PD                 N/A
 9 (null)               (Resources)
            651538   regular Run_0672   jjunum PD                 N/A
 9 (null)               (Resources)
            651539   regular Run_0673   jjunum PD                 N/A
 9 (null)               (Resources)
            651540   regular Run_0674   jjunum PD                 N/A
 9 (null)               (Resources)
            651541   regular Run_0675   jjunum PD                 N/A
 9 (null)               (Resources)
            651542   regular Run_0676   jjunum PD                 N/A
 9 (null)               (Resources)
            651543   regular Run_0677   jjunum PD                 N/A
 9 (null)               (Resources)
            651544   regular Run_0678   jjunum PD                 N/A
 9 (null)               (Resources)
            651545   regular Run_0679   jjunum PD                 N/A
 9 (null)               (Resources)
            651546   regular Run_0680   jjunum PD                 N/A
 9 (null)               (Resources)
            651547   regular Run_0681   jjunum PD                 N/A
 9 (null)               (Resources)
            328241   regular   D23_re    wangw PD 2015-12-30T10:54:00
64 nid0[2070-2111,2128- (Resources)
            651948   regular     GENE   nbonan PD                 N/A
 255 (null)               (Resources)
            651381   regular     NEB4 mgsensoy PD 2015-12-30T10:54:00
80 nid0[0210-0211,0216- (Resources)
            528855   regular run.nimr   pankin PD                 N/A
 258 (null)               (Resources)
            644338   regular Run_0579  gmcfarq PD 2015-12-30T10:32:05
 9 nid00[598-599,920-92 (Resources)
            644339   regular Run_0580  gmcfarq PD                 N/A
 9 (null)               (Resources)
            644340   regular Run_0581  gmcfarq PD                 N/A
 9 (null)               (Resources)
            644341   regular Run_0582  gmcfarq PD                 N/A
 9 (null)               (Resources)
            644342   regular Run_0583  gmcfarq PD                 N/A
 9 (null)               (Resources)
            644343   regular Run_0584  gmcfarq PD                 N/A
 9 (null)               (Resources)
            644344   regular Run_0585  gmcfarq PD                 N/A
 9 (null)               (Resources)
            644345   regular Run_0586  gmcfarq PD                 N/A
 9 (null)               (Resources)
            644346   regular Run_0587  gmcfarq PD                 N/A
 9 (null)               (Resources)
            644347   regular Run_0588  gmcfarq PD                 N/A
 9 (null)               (Resources)
            644348   regular Run_0589  gmcfarq PD                 N/A
 9 (null)               (Resources)
            649026   regular   n_1120   heidih PD 2015-12-30T10:54:00
35 nid0[0571,0596-0597, (Resources)
            649033   regular    n_960   heidih PD                 N/A
30 (null)               (Resources)
            644382   regular Run_0592   jjunum PD 2015-12-30T10:32:05
 9 nid00[891-895,916-91 (Resources)
            644384   regular Run_0594   jjunum PD                 N/A
 9 (null)               (Resources)
            644385   regular Run_0595   jjunum PD                 N/A
 9 (null)               (Resources)
            644386   regular Run_0596   jjunum PD                 N/A
 9 (null)               (Resources)
            644387   regular Run_0597   jjunum PD                 N/A
 9 (null)               (Resources)
            644388   regular Run_0598   jjunum PD                 N/A
 9 (null)               (Resources)
            644389   regular Run_0599   jjunum PD                 N/A
 9 (null)               (Resources)
            644390   regular Run_0600   jjunum PD                 N/A
 9 (null)               (Resources)
            644391   regular Run_0601   jjunum PD                 N/A
 9 (null)               (Resources)
            644392   regular Run_0602   jjunum PD                 N/A
 9 (null)               (Resources)
            644393   regular Run_0603   jjunum PD                 N/A
 9 (null)               (Resources)
            644394   regular Run_0604   jjunum PD                 N/A
 9 (null)               (Resources)
            644395   regular Run_0605   jjunum PD                 N/A
 9 (null)               (Resources)
            644396   regular Run_0606   jjunum PD                 N/A
 9 (null)               (Resources)
            644397   regular Run_0607   jjunum PD                 N/A
 9 (null)               (Resources)
            644398   regular Run_0608   jjunum PD                 N/A
 9 (null)               (Resources)
            644399   regular Run_0609   jjunum PD                 N/A
 9 (null)               (Resources)
            644400   regular Run_0610   jjunum PD                 N/A
 9 (null)               (Resources)
            644401   regular Run_0611   jjunum PD                 N/A
 9 (null)               (Resources)
            644402   regular Run_0612   jjunum PD                 N/A
 9 (null)               (Resources)
            644403   regular Run_0613   jjunum PD                 N/A
 9 (null)               (Resources)
            644404   regular Run_0614   jjunum PD                 N/A
 9 (null)               (Resources)
            644405   regular Run_0615   jjunum PD                 N/A
 9 (null)               (Resources)
            644406   regular Run_0616   jjunum PD                 N/A
 9 (null)               (Resources)
            644407   regular Run_0617   jjunum PD                 N/A
 9 (null)               (Resources)
            644408   regular Run_0618   jjunum PD                 N/A
 9 (null)               (Resources)
            502782   regular P2228_NT mhumbert PD                 N/A
 100 (null)               (Resources)
            502780   regular C4C1PIP_ mhumbert PD                 N/A
 100 (null)               (Resources)
            502776   regular C4C1IM_N mhumbert PD                 N/A
 100 (null)               (Resources)
            502778   regular C4C1IM_O mhumbert PD                 N/A
 100 (null)               (Resources)
            502775   regular C4C1IM_4 mhumbert PD 2015-12-30T10:54:00
 100 nid0[0223-0225,0233, (Resources)
            643870   regular  GDB5L_L     bzhu PD                 N/A
32 (null)               (Resources)
            643871   regular  GDB5L_L     bzhu PD                 N/A
32 (null)               (Resources)
            643872   regular  GDB5L_L     bzhu PD                 N/A
32 (null)               (Resources)
            643856   regular  GDB5L_L     bzhu PD                 N/A
32 (null)               (Resources)
            643848   regular  GDB5L_L     bzhu PD 2015-12-30T10:01:15
32 nid0[0358-0359,0494, (Resources)
            535908   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535909   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535910   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535915   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535916   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535918   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535919   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535920   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535921   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535922   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535923   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535924   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535925   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535926   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535927   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535928   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535929   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535930   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535403   regular heavy_me  smeinel PD 2015-12-30T07:02:09
32 nid0[0229-0230,0235- (Resources)
            535404   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535405   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535406   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535407   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535408   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535409   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535410   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535411   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            535412   regular heavy_me  smeinel PD                 N/A
32 (null)               (Resources)
            525578   regular 4kmNCEP_   yuxing PD 2015-12-30T05:20:59
 100 nid0[0209,0218,0221- (Resources)
            543194   regular rawMPI84  fnrizzi PD                 N/A
 831 (null)               (Resources)
            543196   regular rawMPI98  fnrizzi PD                 N/A
1050 (null)               (Resources)
            551687   regular  Cu_45_4   ayonge PD                 N/A
 2 (null)               (Resources)
            551665   regular Cu_45_3_   ayonge PD                 N/A
 2 (null)               (Resources)
            551474   regular Cu_45_2_   ayonge PD                 N/A
 2 (null)               (Resources)
            551232   regular Ag_45_4_   ayonge PD                 N/A
 2 (null)               (Resources)
            551159   regular Ag_45_3_   ayonge PD                 N/A
 2 (null)               (Resources)
            550867   regular Ag_45_2_   ayonge PD                 N/A
 2 (null)               (Resources)
            549905   regular Cu_44_2_   ayonge PD                 N/A
 2 (null)               (Resources)
            549830   regular Cu_O_2_h   ayonge PD                 N/A
 2 (null)               (Resources)
            549695   regular Cu_43_on   ayonge PD                 N/A
 2 (null)               (Resources)
            549460   regular Ag_43_on   ayonge PD                 N/A
 2 (null)               (Resources)
            547747   regular Cu_42_on   ayonge PD                 N/A
 2 (null)               (Resources)
            547677   regular Ag_42_on   ayonge PD                 N/A
 2 (null)               (Resources)
            547186   regular Cu_41_on   ayonge PD                 N/A
 2 (null)               (Resources)
            547112   regular Ag_41_on   ayonge PD                 N/A
 2 (null)               (Resources)
            546845   regular  Cu_40_4   ayonge PD                 N/A
 2 (null)               (Resources)
            546818   regular  Cu_40_3   ayonge PD                 N/A
 2 (null)               (Resources)
            582336   regular eX10B-x1     jmay PD 2015-12-30T10:54:00
 400 nid0[0252-0255,0272- (Resources)
            544800   regular  Cu_40_2   ayonge PD                 N/A
 2 (null)               (Resources)
            544793   regular  Ag_40_4   ayonge PD                 N/A
 2 (null)               (Resources)
            544789   regular   Ag_h-3   ayonge PD                 N/A
 2 (null)               (Resources)
            544783   regular   Ag_h_1   ayonge PD                 N/A
 2 (null)               (Resources)
            544765   regular  Cu_39_4   ayonge PD                 N/A
 2 (null)               (Resources)
            544761   regular  Cu_39_3   ayonge PD                 N/A
 2 (null)               (Resources)
            544756   regular  Cu_39_2   ayonge PD                 N/A
 2 (null)               (Resources)
            544753   regular  Cu_39_1   ayonge PD                 N/A
 2 (null)               (Resources)
            544748   regular  Ag_39_4   ayonge PD                 N/A
 2 (null)               (Resources)
            544744   regular Ag_39_3_   ayonge PD                 N/A
 2 (null)               (Resources)
            544740   regular  Ag_39_2   ayonge PD                 N/A
 2 (null)               (Resources)
            544735   regular Ag_39_1_   ayonge PD                 N/A
 2 (null)               (Resources)
            544594   regular  Cu_36_4   ayonge PD                 N/A
 2 (null)               (Resources)
            544586   regular  Cu_36_3   ayonge PD                 N/A
 2 (null)               (Resources)
            544566   regular Ag_36_4_   ayonge PD                 N/A
 2 (null)               (Resources)
            544561   regular Ag_36_3_   ayonge PD                 N/A
 2 (null)               (Resources)
            544552   regular  Cu_36_2   ayonge PD                 N/A
 2 (null)               (Resources)
            544517   regular Ag_36_2_   ayonge PD                 N/A
 2 (null)               (Resources)
            544494   regular  Cu_35_5   ayonge PD                 N/A
 2 (null)               (Resources)
            544486   regular  Cu_35_3   ayonge PD                 N/A
 2 (null)               (Resources)
            544487   regular  Cu_35_4   ayonge PD                 N/A
 2 (null)               (Resources)
            544481   regular  Cu_35_2   ayonge PD                 N/A
 2 (null)               (Resources)
            544474   regular  Cu_35_1   ayonge PD                 N/A
 2 (null)               (Resources)
            544463   regular  Ag_35_5   ayonge PD                 N/A
 2 (null)               (Resources)
            544457   regular  Ag_35_4   ayonge PD                 N/A
 2 (null)               (Resources)
            544439   regular  Ag_35_3   ayonge PD                 N/A
 2 (null)               (Resources)
            544432   regular  Ag_35_2   ayonge PD                 N/A
 2 (null)               (Resources)
            544425   regular  Ag_35_1   ayonge PD                 N/A
 2 (null)               (Resources)
            543284   regular Cu_33_1_   ayonge PD                 N/A
 2 (null)               (Resources)
            543266   regular Ag_33_1_   ayonge PD                 N/A
 2 (null)               (Resources)
            543251   regular  Cu_32_6   ayonge PD                 N/A
 2 (null)               (Resources)
            543240   regular  Cu_32_5   ayonge PD                 N/A
 2 (null)               (Resources)
            543230   regular  Cu_32_4   ayonge PD                 N/A
 2 (null)               (Resources)
            543220   regular Ag_32_6_   ayonge PD                 N/A
 2 (null)               (Resources)
            543213   regular Ag_32_5_   ayonge PD                 N/A
 2 (null)               (Resources)
            543203   regular Ag_32_4_   ayonge PD                 N/A
 2 (null)               (Resources)
            543109   regular Cu_32_1_   ayonge PD                 N/A
 2 (null)               (Resources)
            543107   regular Cu_32_2_   ayonge PD                 N/A
 2 (null)               (Resources)
            543100   regular Cu_32_1_   ayonge PD                 N/A
 2 (null)               (Resources)
            543088   regular Ag_32_2_   ayonge PD                 N/A
 2 (null)               (Resources)
            543077   regular Ag_32_1_   ayonge PD                 N/A
 2 (null)               (Resources)
            543034   regular Cu_30_2_   ayonge PD                 N/A
 2 (null)               (Resources)
            543033   regular Cu_30_1_   ayonge PD                 N/A
 2 (null)               (Resources)
            543027   regular Ag_30_2_   ayonge PD                 N/A
 2 (null)               (Resources)
            543013   regular Ag_30_1_   ayonge PD                 N/A
 2 (null)               (Resources)
            542968   regular Cu_29_2_   ayonge PD                 N/A
 2 (null)               (Resources)
            542962   regular Ag_29_2_   ayonge PD                 N/A
 2 (null)               (Resources)
            542917   regular Cu_28_3_   ayonge PD                 N/A
 2 (null)               (Resources)
            542845   regular Cu_28_2_   ayonge PD                 N/A
 2 (null)               (Resources)
            542843   regular Cu_28_1_   ayonge PD                 N/A
 2 (null)               (Resources)
            542807   regular Ag_28_3_   ayonge PD                 N/A
 2 (null)               (Resources)
            542800   regular Ag_28_2_   ayonge PD                 N/A
 2 (null)               (Resources)
            542793   regular Ag_28_1_   ayonge PD 2015-12-29T14:39:43
 2 nid0[1840,1960]      (Resources)
            478346   regular run.nimr   pankin PD 2015-12-30T01:20:08
 258 nid0[0208,0219-0220, (Resources)
            549228   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549229   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549230   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549231   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549232   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549233   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549234   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549235   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549236   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549237   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549238   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549239   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549240   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549241   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549242   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549243   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549244   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549245   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549246   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549247   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549248   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549249   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549250   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549251   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549252   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549253   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549254   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549255   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549256   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549257   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549258   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549259   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549260   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549261   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549262   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549263   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549264   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549265   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549266   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549267   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549268   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549269   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549270   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549271   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549272   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549273   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549274   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549275   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549276   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549277   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549278   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549279   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549280   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549281   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549282   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549283   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549284   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549285   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549286   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549287   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549288   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549289   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549290   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            549291   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545556   regular  big_job berkowit PD 2015-12-29T22:54:00
 512 nid0[0223-0225,0233, (Resources)
            545557   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545558   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545559   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545560   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545561   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545562   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545563   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545564   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545565   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545566   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545567   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545568   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545569   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545570   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545571   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545572   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545573   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545574   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545575   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545576   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545577   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545578   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545579   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545580   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545581   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545582   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545583   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545584   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            545585   regular  big_job berkowit PD                 N/A
 512 (null)               (Resources)
            499409   regular     GENE    dtold PD 2015-12-29T22:54:00
 320 nid0[0210-0211,0216- (Resources)
            513398   regular       05      ocs PD 2015-12-29T22:54:00
 128 nid0[0252-0255,0272- (Resources)
            513399   regular       06      ocs PD                 N/A
 128 (null)               (Resources)
            513400   regular       07      ocs PD                 N/A
 128 (null)               (Resources)
            513401   regular       08      ocs PD                 N/A
 128 (null)               (Resources)
            513402   regular       09      ocs PD                 N/A
 128 (null)               (Resources)
            513403   regular       10      ocs PD                 N/A
 128 (null)               (Resources)
            513404   regular       11      ocs PD                 N/A
 128 (null)               (Resources)
            513405   regular       12      ocs PD                 N/A
 128 (null)               (Resources)
            513406   regular       13      ocs PD                 N/A
 128 (null)               (Resources)
            513407   regular       14      ocs PD                 N/A
 128 (null)               (Resources)
            513408   regular       15      ocs PD                 N/A
 128 (null)               (Resources)
            509895   regular rawMPI36  fnrizzi PD                 N/A
1424 (null)               (Resources)
            509887   regular rawMPI30  fnrizzi PD 2015-12-29T20:17:36
 989 nid0[0208,0210-0211, (Resources)
            550452   regular    ucan2    u1103 PD 2015-12-29T22:24:17
1024 nid0[0208,0210-0211, (Resources)
nid00837:~ # sprio -j 550452
          JOBID   PRIORITY        AGE  FAIRSHARE  PARTITION        QOS
         550452      33680      17223       7097       2160       7200
nid00837:~ # sprio -j 509887
          JOBID   PRIORITY        AGE  FAIRSHARE  PARTITION        QOS
         509887      33622      20002       4260       2160       7200
nid00837:~ # scontrol show job 550452
JobId=550452 JobName=ucan2
   UserId=u1103(1103) GroupId=u1103(1001103)
   Priority=33685 Nice=0 Account=m616 QOS=normal_regular_1
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2015-12-17T14:32:12 EligibleTime=2015-12-17T14:32:12
   StartTime=2015-12-29T22:24:17 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=regular AllocNode:Sid=cori10:50398
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
SchedNodeList=nid0[0208,0210-0211,0216-0217,0223-0225,0232-0234,0237-0238,0241-0248,0252-0255,0272-0279,0294-0319,0336-0341,0350-0352,0360-0363,0365-0368,0376-0383,0408-0412,0414-0447,0464-0467,0472-0493,0495-0502,0504-0511,0532-0535,0537-0539,0541-0548,0559-0571,0574-0575,0596-0597,0600-0604,0606-0607,0629-0636,0663-0666,0668-0669,0681-0688,0701-0703,0720-0767,0788-0789,0800-0802,0817-0821,0827-0830,0848-0851,0861-0863,0865-0884,0990-1016,1042-1083,1085-1087,1104-1128,1143-1148,1150-1151,1172-1184,1190-1215,1232-1235,1240-1247,1270-1278,1301-1324,1328-1343,1364-1366,1379-1385,1405-1407,1424-1438,1440-1471,1488-1492,1498-1535,1556-1565,1567-1599,1616-1619,1624-1663,1684-1688,1690-1696,1704-1717,1719-1727,1748-1789,1808-1823,1880-1913,1946-1952,1954-1957,1970-1983,2000-2003,2008-2009,2016-2019,2027-2031,2034-2047,2068-2111,2128-2151,2156-2167,2170-2172,2174-2175,2192-2195,2197-2216,2219-2239]
   NumNodes=1024-1024 NumCPUs=1024 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1024,node=1024
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=craynetwork:1 Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/global/cscratch1/sd/u1103/ducan2/dtest/d32768/run.cori
   WorkDir=/global/cscratch1/sd/u1103/ducan2/dtest/d32768
   StdErr=/global/cscratch1/sd/u1103/ducan2/dtest/d32768/ucan2-32768.err
   StdIn=/dev/null
   StdOut=/global/cscratch1/sd/u1103/ducan2/dtest/d32768/ucan2-32768.out
   Power= SICP=0

nid00837:~ #
nid00837:~ #
nid00837:~ #
nid00837:~ # scontrol show job 509887
JobId=509887 JobName=rawMPI30x30cori
   UserId=fnrizzi(60679) GroupId=fnrizzi(60679)
   Priority=33622 Nice=0 Account=m1882 QOS=normal_regular_1
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:40:00 TimeMin=N/A
   SubmitTime=2015-12-15T16:18:38 EligibleTime=2015-12-15T16:18:38
   StartTime=2015-12-29T20:17:36 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=regular AllocNode:Sid=cori07:43248
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
SchedNodeList=nid0[0208,0210-0211,0216-0217,0223-0225,0228,0232-0234,0237-0238,0241-0248,0252-0255,0272-0279,0287,0292,0294-0319,0336-0341,0350-0352,0357,0360-0363,0365-0367,0376-0383,0408-0412,0446-0447,0464-0467,0472-0493,0497-0502,0504-0511,0532-0535,0537-0539,0541-0548,0559-0571,0574-0575,0596-0597,0600-0604,0606-0607,0629-0636,0663-0666,0668-0669,0678-0679,0681-0688,0701-0703,0720-0767,0788-0789,0800-0801,0817-0821,0823-0824,0827-0830,0848-0851,0861-0863,0865-0884,0987,0993-1016,1042-1083,1085-1087,1104-1128,1138,1145-1148,1150-1151,1172-1184,1190-1215,1232-1235,1240-1247,1270-1278,1301-1323,1328-1343,1364-1366,1375-1376,1379-1385,1405-1407,1424-1438,1440-1471,1488-1492,1498-1535,1556-1565,1567-1599,1616-1619,1624-1663,1684-1688,1690-1696,1704-1717,1719-1727,1748-1789,1808-1823,1837-1838,1880-1913,1946-1952,1954-1957,1970-1971,1973-1983,2000-2003,2008-2009,2016-2019,2024-2025,2027-2031,2034-2047,2068-2111,2128-2151,2164-2167,2170-2172,2174-2175,2192-2195,2197-2216,2219-2239]
   NumNodes=989-989 NumCPUs=989 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=989,node=989
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=craynetwork:1 Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/global/u1/f/fnrizzi/coriRuns/run30x30.cori
   WorkDir=/global/u1/f/fnrizzi/coriRuns
   StdErr=/global/u1/f/fnrizzi/coriRuns/slurm-509887.out
   StdIn=/dev/null
   StdOut=/global/u1/f/fnrizzi/coriRuns/slurm-509887.out
   Power= SICP=0

nid00837:~ #
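The symptom in the listing above (hundreds of jobs stuck on "Resources" instead of the expected two or three) can be summarized quickly by counting pending jobs per reason, e.g. with squeue -t PD -h -o %r piped through sort and uniq. A sketch, run here against a fabricated three-job sample of that output rather than a live squeue:

```shell
# Fabricated `squeue -t PD -h -o %r` output, one pending reason per job;
# on a live system you would pipe squeue itself.
reasons='Resources
Resources
Priority'

# Count pending jobs per reason; on this system a healthy queue should
# show only a handful of jobs with reason "Resources".
counts=$(printf '%s\n' "$reasons" | sort | uniq -c | sort -rn)
printf '%s\n' "$counts"
```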

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________


Comment 14 Doug Jacobsen 2015-12-29 08:35:18 MST
So, here is something interesting. I don't know if it is related to this Resources issue or not, but it seems that a shared-partition (Shared=FORCE) job requesting all the CPUs on a node causes a major slowdown in the backfill scheduler.  We've had anecdotal evidence of this before, but I've tried to document it below.  Note that in the sdiag | grep "Last cycle" output below, the first instance is the primary scheduler and the second is the backfill scheduler timing.  You can see that when the shared partition is down, the backfill scheduler takes about 13s per cycle.  When I bring the shared partition back up, it takes nearly 90s.  When I then put the offending job (requesting all the CPUs on a node in the shared partition) on hold, the cycle time drops back to about 18s.

-Doug


nid00837:~ # sinfo -o "%R %a"
PARTITION AVAIL
system up
debug up
regular up
preempt down
realtime up
shared down
nid00837:~ #
nid00837:~ #
nid00837:~ # sdiag | grep "Last cycle"
	Last cycle:   26938
	Last cycle when: Tue Dec 29 14:25:46 2015
	Last cycle: 12758773
nid00837:~ #
nid00837:~ #
nid00837:~ #
nid00837:~ #
nid00837:~ # scontrol update partition=shared state=up
nid00837:~ # sinfo -o "%R %a"
PARTITION AVAIL
system up
debug up
regular up
preempt down
realtime up
shared up
nid00837:~ #
nid00837:~ #
nid00837:~ #
nid00837:~ #
nid00837:~ #
nid00837:~ # sdiag | grep "Last cycle"
	Last cycle:   25997
	Last cycle when: Tue Dec 29 14:28:40 2015
	Last cycle: 90586400
nid00837:~ #
nid00837:~ #
nid00837:~ #
nid00837:~ # scontrol show job 734373
JobId=734373 JobName=AIMD
   UserId=rsakidja(55248) GroupId=rsakidja(55248)
   Priority=10919 Nice=0 Account=m1491 QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A
   SubmitTime=2015-12-28T19:41:46 EligibleTime=2015-12-28T19:41:46
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=shared AllocNode:Sid=cori08:111015
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,mem=124928,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1952M MinTmpDiskNode=0
   Features=(null) Gres=craynetwork:0 Reservation=(null)
   Shared=1 Contiguous=0 Licenses=(null) Network=(null)
   Command=/global/cscratch1/sd/rsakidja/CTE/Tb5SiO5/1400K_111/take2/job_stampede
   WorkDir=/global/cscratch1/sd/rsakidja/CTE/Tb5SiO5/1400K_111/take2
   StdErr=/global/cscratch1/sd/rsakidja/CTE/Tb5SiO5/1400K_111/take2/AIMD.o734373
   StdIn=/dev/null
   StdOut=/global/cscratch1/sd/rsakidja/CTE/Tb5SiO5/1400K_111/take2/AIMD.o734373
   Power= SICP=0

nid00837:~ # scontrol hold 734373
nid00837:~ # sdiag | grep "Last cycle"
	Last cycle:   27634
	Last cycle when: Tue Dec 29 14:31:17 2015
	Last cycle: 18127373
nid00837:~ #
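For reference, the cycle times in sdiag's "Last cycle" lines are reported in microseconds. A small shell sketch (run here against a pasted sample of the output above rather than a live sdiag) converts both cycle times to seconds:

```shell
# Sample of `sdiag | grep "Last cycle"` output copied from above. The
# first "Last cycle" is the main scheduler, the second is the backfill
# scheduler, both in microseconds.
sample='  Last cycle:   25997
  Last cycle when: Tue Dec 29 14:28:40 2015
  Last cycle: 90586400'

# Skip the "when" timestamp line and convert microseconds to seconds.
result=$(printf '%s\n' "$sample" | awk '/Last cycle:/ && !/when/ {
    n++
    printf "%s %.2fs\n", (n == 1 ? "main" : "backfill"), $3 / 1e6
}')
printf '%s\n' "$result"   # main 0.03s / backfill 90.59s
```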
Comment 15 Tim Wickberg 2015-12-29 08:46:03 MST
You did warn me that the log would be big...

There are 1096 error messages in there, and that's excluding the 
warnings about the slurmds having a different config value. (You may 
want to set NO_CONF_HASH just to keep the logs a bit more manageable 
long-term if the slurmds aren't all seeing the same file.)

Most of them look harmless, but it'd still be nice to clean them up at 
some point - are you expecting those error messages from your 
job_submit.lua plugin? (564 of 'em)

The errors related to gres/craynetwork (518) are possibly more 
interesting, I'm looking into those further now.

Aside from those, there are these 14:

> [2015-12-29T13:25:26.039] error: Can't find parent id 751 for assoc 4958, this should never happen.
> [2015-12-29T13:25:26.039] error: Can't find parent id 751 for assoc 4958, this should never happen.
> [2015-12-29T13:25:26.539] error: Can't find parent id 43 for assoc 4959, this should never happen.
> [2015-12-29T13:25:26.539] error: Can't find parent id 43 for assoc 4959, this should never happen.
> [2015-12-29T13:26:03.201] error: cons_res: node nid01146 memory is under-allocated (0-124928) for job 742992
> [2015-12-29T13:26:06.321] error: cons_res: node nid00666 memory is under-allocated (0-124928) for job 743317
> [2015-12-29T13:26:34.457] error: _start_stage_in: setup for job 743319 status:256 response:dwpost - failed client status code %s 409
> [2015-12-29T13:31:53.760] error: slurm_receive_msg: Zero Bytes were transmitted or received
> [2015-12-29T13:31:53.770] error: slurm_receive_msg: Zero Bytes were transmitted or received
> [2015-12-29T13:41:53.746] error: slurm_receive_msg: Zero Bytes were transmitted or received
> [2015-12-29T13:41:53.756] error: slurm_receive_msg: Zero Bytes were transmitted or received
> [2015-12-29T13:43:49.703] error: cons_res: node nid00163 memory is under-allocated (0-124928) for job 743307
> [2015-12-29T13:51:53.754] error: slurm_receive_msg: Zero Bytes were transmitted or received
> [2015-12-29T13:51:53.764] error: slurm_receive_msg: Zero Bytes were transmitted or received

Any idea why those associations can't match to a parent? You might need 
to dig into MySQL to sort those out, but I doubt those are related to 
your scheduling issues.
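The per-category counts above (564 from job_submit.lua, 518 gres/craynetwork, 14 others) can be reproduced by grouping error lines on the token that follows "error:". A sketch, shown against a three-line fabricated excerpt instead of the real slurmctld log:

```shell
# Fabricated excerpt standing in for the slurmctld log; in practice you
# would read the log file itself instead of this variable.
log='[2015-12-29T13:25:26.039] error: Can'\''t find parent id 751 for assoc 4958, this should never happen.
[2015-12-29T13:26:03.201] error: cons_res: node nid01146 memory is under-allocated (0-124928) for job 742992
[2015-12-29T13:31:53.760] error: slurm_receive_msg: Zero Bytes were transmitted or received'

# Keep only the first word after "error:" as a rough category key,
# then count occurrences of each key.
summary=$(printf '%s\n' "$log" |
    sed -n 's/.*error: \([A-Za-z_]*\).*/\1/p' |
    sort | uniq -c | sort -rn)
printf '%s\n' "$summary"
```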
Comment 16 Doug Jacobsen 2015-12-29 08:51:53 MST
Yeah, sorry, the message from the job submit lua should have been tagged as a
debug message, not an error.  I'll fix those the next time I update the job
submit filter (for example, to prevent jerks from requesting whole nodes in
the shared partition).



Comment 17 Tim Wickberg 2015-12-29 09:10:42 MST
Not a problem, I was just curious. I started skimming the error messages 
first and those popped out.

IIRC, the job_submit plugin may be doing something with the craynetwork 
GRES? Is every job in shared still asking for one unit of that GRES, 
even though there are only four available per node?

The constant underflow warnings are curious, I'm still digging into why 
those may be getting generated that frequently.

Comment 18 Doug Jacobsen 2015-12-29 09:35:06 MST
The job_submit/cray plugin is adding craynetwork:1 to all jobs if
craynetwork is not specified.  Our job_submit.lua script is setting
craynetwork:0 for all jobs in the shared partition.
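One way to spot-check that the lua override is actually landing on jobs is to dump each job's partition and GRES field (squeue's %b format specifier prints the gres field here, though the exact format string is worth double-checking against your version) and flag any shared-partition job still carrying craynetwork:1. A sketch against fabricated squeue output; the job IDs are illustrative:

```shell
# Fabricated `squeue -h -o "%i %P %b"` output (jobid, partition, gres);
# the job IDs and %b format spec are illustrative, not from the live
# system.
jobs='734373 shared craynetwork:0
550452 regular craynetwork:1
509887 regular craynetwork:1'

# Count shared-partition jobs whose gres is not craynetwork:0; anything
# nonzero would suggest the job_submit.lua override is not being applied.
bad=$(printf '%s\n' "$jobs" |
    awk '$2 == "shared" && $3 != "craynetwork:0"' | wc -l)
echo "shared jobs with nonzero craynetwork: $bad"
```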

Comment 19 Doug Jacobsen 2015-12-29 10:06:52 MST
Regarding the gres, we don't use the weird bind-mounted gres.conf that the
cray wlm_switch init script tries to set up.  I disabled all that and simply
have this as my gres.conf:

NodeName=nid0[0024-0063,0080-0083,0088-0127,0148-0191,0208-0211,0216-0255,0272-0319,0336-0383,0408-0447,0464-0467,0472-0511,0532-0575,0596-0639,0656-0703,0720-0767,0788-0831,0848-0851,0856-0895,0916-0959,0980-1023,1040-1087,1104-1151,1172-1215,1232-1235,1240-1279,1300-1343,1364-1407,1424-1471,1488-1535,1556-1599,1616-1619,1624-1663,1684-1727,1748-1791,1808-1855,1872-1919,1940-1983,2000-2003,2008-2047,2068-2111,2128-2175,2192-2239,2256-2303]
Name=craynetwork Count=4



Comment 20 Doug Jacobsen 2015-12-30 23:03:18 MST
Hello,

This happened again yesterday morning and again this morning.  The full-node shared partition jobs have been blocked since Tuesday, so that is not involved.

I checked, and a lot of realtime jobs had come in.  The realtime partition has a higher partition priority than the rest, and it has access to all the nodes in the system.  There is also a group of eight nodes, nid[02256-02263], that debug, shared, and regular do not have access to, and which have a lower weight than all the other nodes.

So, the idea is that if a realtime job comes in, it will get high priority in the scheduling algorithm owing to its high partition priority score, but will most likely be placed on the low-weight nodes that other partitions are not using.  Only if those nodes are fully occupied will it then move on to consuming idle (possibly resource-reserved) nodes that other partitions can access.
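
A hypothetical slurm.conf sketch of that layout (node ranges, weights, and priority values here are illustrative, not our actual configuration; a lower Weight means the node is preferred when allocating):

```
# Realtime-only nodes get the lowest weight, so realtime jobs land there first
NodeName=nid[02256-02263] Weight=10
# Nodes shared with the other partitions
NodeName=nid[00024-02255] Weight=100
# realtime overlaps everything but carries a higher partition priority
PartitionName=realtime Nodes=nid[00024-02263] Priority=100
PartitionName=regular  Nodes=nid[00024-02255] Priority=1
```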

Yesterday we got 133 realtime job requests, all of which were scheduled on the realtime-only nodes (nid[02256-02263]), but I'm wondering whether the disruption in job ordering is causing this behavior.

As an experiment, I've lowered the partition priority of realtime to match the other partitions (and have given it slightly more resources, just in case).

-Doug
Comment 21 Doug Jacobsen 2016-01-01 04:50:46 MST
Hello,

Just a status update.  Since making the realtime partition's priority the same as the rest, this issue has not recurred.  So it seems that configuration difference may help track down the issue.

I can get away with realtime not having a higher partition priority for now, but at some point I would like to get back to that, if possible.

I'll continue to monitor this and let you know if I see any changes.

-Doug
Comment 22 Tim Wickberg 2016-01-01 05:57:53 MST
My apologies for some unplanned delay on our end, the holidays and some 
other matters have complicated things this week.

I should have asked about the mixed partition priority levels before - I 
have seen other cases leading to this type of fragmentation.

The partition priority is evaluated at an entirely different level than 
fairshare. It predates the fairshare work in Slurm, and has some 
implicit assumptions about system management that are the likely source 
of the issue you're seeing: if any jobs are in a higher priority 
partition then we will schedule those ASAP, regardless of the impact on 
lower priority partitions. As you've seen, this can cause some 
significant and unexpected impacts on throughput and utilization.

I will make sure the documentation is more explicit that - unless you're 
looking at a preemption model or similar - setting partition priorities 
will likely cause problems when the partitions overlap. At the very least 
we should warn about this with the PriorityWeightPartition setting.

Using the new PartitionQOS settings + TRES should allow you to reweight 
different aspects of the various partitions, while still subjecting 
everything to the normal fairshare operation.
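
A sketch of that approach (the QOS name and priority value are hypothetical; `sacctmgr` and the Partition QOS option are as of 15.08):

```
# Create a QOS that carries the priority boost (hypothetical name/value):
#   sacctmgr add qos realtime_boost priority=100000
# Attach it to the partition in slurm.conf, leaving partition Priority alone:
PartitionName=realtime Nodes=ALL QOS=realtime_boost
```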
Comment 23 Doug Jacobsen 2016-01-01 07:05:33 MST
Hi Tim,

I'll go ahead and re-work the priorities so that the realtime jobs get
highest possible priority via the QOS rather than via the partition method.

I do think, however, that it is odd that this results in jobs in the
lower-priority partitions incorrectly getting assigned a "Resources" reason;
that probably still represents a bug.

In the meantime, however, structuring our priorities using just the QOSes
will probably be easier.  The main use of fairshare will go away for NERSC
on January 11, when we enter full production (we're in "free" time mode
right now, and so are using fairshare as a way to weight priorities towards
users that didn't make full use of their allocation last year).

Thank you for looking at this and getting back to me,
Doug

Comment 25 Moe Jette 2016-01-04 07:46:44 MST
This problem does _not_ appear related to your Slurm upgrade. What I'm seeing is that Slurm's logic fails to reset the job's reason to "Priority" when the initial state is "Resources". I see similar flawed logic in a couple of places. Here is a portion of the flawed logic in src/slurmctld/job_scheduler.c:

		} else if (_failed_partition(job_ptr->part_ptr, failed_parts,
					     failed_part_cnt)) {
			if ((job_ptr->state_reason == WAIT_NODE_NOT_AVAIL) ||
			    (job_ptr->state_reason == WAIT_NO_REASON)) {
				job_ptr->state_reason = WAIT_PRIORITY;
				xfree(job_ptr->state_desc);
				last_job_update = now;
			}

I believe the job's "reason" should be getting reset from most, if not all, initial "reason" states. I don't believe this is causing any jobs to not be scheduled when they should be, but it is clearly a confusing situation.
Comment 26 Moe Jette 2016-01-04 08:32:43 MST
Created attachment 2569 [details]
Fix for v15.08.5
Comment 27 Moe Jette 2016-01-04 08:35:16 MST
Here is the commit for version 15.08.7, likely to be released mid-January:

https://github.com/SchedMD/slurm/commit/65bb07dc13065c245e2aa02f9efc6eedda7d236b
Comment 29 Doug Jacobsen 2016-01-04 09:49:41 MST
I can't see bug 2300 (access denied).

Thanks for getting to the bottom of this Tim and Moe!

-Doug

Comment 30 Doug Jacobsen 2016-01-19 04:23:14 MST
Hello,

This issue just occurred on edison when we started the "regularx" partition
-- a partition which has access to ALL resources, to allow full-scale jobs
to run.  There is a top-priority job in regularx, and so far it seems to be
scheduling well.

Unfortunately almost all jobs in the system now have a reason of
(Resources).  Scheduling appears to be proceeding correctly, but it is hard
from a user perspective to understand what is going on.

I applied the patch from this bug and restarted slurmctld.  All jobs still
have reason (Resources).

What information can I provide that might help?

-Doug
