| Summary: | many jobs have reason "Resources", seems to confuse scheduling | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | slurmctld | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 15.08.5 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| See Also: | http://bugs.schedmd.com/show_bug.cgi?id=2300, https://bugs.schedmd.com/show_bug.cgi?id=8347 | | |
| Site: | NERSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 15.08.7 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | cori slurm.conf; terminal output showing the issue; slurmctld log since last restart; Fix for v15.08.5 | | |
Created attachment 2547 [details]
terminal output showing the issue
This has happened two more times today on cori. I've reduced bf_max_job_user from 5 to 1 to see if that prevents some kind of bad interaction between the QOS limits and the new adjustments to the scheduling logic.

Can you share your Partition QOS's? I'm guessing those are leading to this interaction, although I'm not yet certain how. The output from `scontrol show assoc` would be plenty.

That would be a tremendous amount of output -- we have several thousand associations. Assuming you just want the QOS's:
QOS Records
QOS=normal(1)
UsageRaw=29035864812.296301
GrpJobs=N(757) GrpSubmitJobs=N(10370) GrpWall=N(38569165.57)
GrpTRES=cpu=N(11044),mem=N(19901760),energy=N(0),node=N(868),bb/cray=N(0)
GrpTRESMins=cpu=N(483931080),mem=N(914467304608),energy=N(0),node=N(43846004),bb/cray=N(2466900640)
GrpTRESRunMins=cpu=N(1542092),mem=N(2466990822),energy=N(0),node=N(385444),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium(6)
UsageRaw=58570546.135811
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(2765.51)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(976175),mem=N(1905495100),energy=N(0),node=N(15349),bb/cray=N(467060952)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low(7)
UsageRaw=2966771927.483770
GrpJobs=N(0) GrpSubmitJobs=N(5) GrpWall=N(6969.30)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(49446198),mem=N(96518980040),energy=N(0),node=N(772596),bb/cray=N(54061765)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=serialize(11)
UsageRaw=131692149.509956
GrpJobs=1(0) GrpSubmitJobs=N(0) GrpWall=N(67.20)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(2194869),mem=N(4284384597),energy=N(0),node=N(34294),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=scavenger(12)
UsageRaw=64479816.589748
GrpJobs=N(0) GrpSubmitJobs=N(6) GrpWall=N(47865.76)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(1074663),mem=N(2084784411),energy=N(0),node=N(47865),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=normal_regular_0(13)
UsageRaw=1603426268.757679
GrpJobs=N(0) GrpSubmitJobs=N(3) GrpWall=N(316.58)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(26723771),mem=N(52164801276),energy=N(0),node=N(417558),bb/cray=N(599147136)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs=4(3) MaxWallPJ=360
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=node=1628
MaxTRESMinsPJ=
MinTRESPJ=
QOS=normal_regular_1(14)
UsageRaw=21783598846.431665
GrpJobs=N(0) GrpSubmitJobs=N(31) GrpWall=N(6957.24)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(363059980),mem=N(708693082470),energy=N(0),node=N(5672812),bb/cray=N(1505843391)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs=4(31) MaxWallPJ=720
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=node=1024
MaxTRESMinsPJ=
MinTRESPJ=
QOS=normal_regular_2(15)
UsageRaw=65269007585.668082
GrpJobs=N(0) GrpSubmitJobs=N(124) GrpWall=N(47085.74)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(1087816793),mem=N(2123418380120),energy=N(0),node=N(16997137),bb/cray=N(1143539304)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=10(0) MaxSubmitJobs=100(124) MaxWallPJ=1440
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=node=1024
MaxTRESMinsPJ=
MinTRESPJ=
QOS=normal_regular_3(16)
UsageRaw=167403609831.030172
GrpJobs=N(67) GrpSubmitJobs=N(2434) GrpWall=N(1470414.32)
GrpTRES=cpu=N(80256),mem=N(156659712),energy=N(0),node=N(1254),bb/cray=N(0)
GrpTRESMins=cpu=N(2790060163),mem=N(5446109051293),energy=N(0),node=N(43594690),bb/cray=N(1580642408)
GrpTRESRunMins=cpu=N(27710726),mem=N(54091337932),energy=N(0),node=N(432980),bb/cray=N(0)
MaxJobsPU=10(67) MaxSubmitJobs=250(2434) MaxWallPJ=1440
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=node=512
MaxTRESMinsPJ=
MinTRESPJ=
QOS=normal_regular_4(17)
UsageRaw=19323189062.714505
GrpJobs=N(101) GrpSubmitJobs=N(2555) GrpWall=N(3910499.22)
GrpTRES=cpu=N(9600),mem=N(16959488),energy=N(0),node=N(150),bb/cray=N(0)
GrpTRESMins=cpu=N(322053151),mem=N(613441844570),energy=N(0),node=N(5032080),bb/cray=N(4167754646)
GrpTRESRunMins=cpu=N(6107728),mem=N(11238938999),energy=N(0),node=N(95433),bb/cray=N(0)
MaxJobsPU=50(101) MaxSubmitJobs=500(2555) MaxWallPJ=1440
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=node=100
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium_regular_0(19)
UsageRaw=62740426.670975
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(10.63)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(1045673),mem=N(2041155214),energy=N(0),node=N(16338),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs=4(0) MaxWallPJ=360
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=node=1628
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium_regular_1(20)
UsageRaw=1105208059.015536
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(339.99)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(18420134),mem=N(35956102186),energy=N(0),node=N(287814),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs=4(0) MaxWallPJ=720
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=node=1024
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium_regular_2(21)
UsageRaw=394514798.365269
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(200.45)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(6575246),mem=N(12834881440),energy=N(0),node=N(102738),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=10(0) MaxSubmitJobs=100(0) MaxWallPJ=1440
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=node=1024
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium_regular_3(22)
UsageRaw=139942721.225102
GrpJobs=N(1) GrpSubmitJobs=N(1) GrpWall=N(1846.17)
GrpTRES=cpu=N(320),mem=N(624640),energy=N(0),node=N(5),bb/cray=N(0)
GrpTRESMins=cpu=N(2332378),mem=N(4552803197),energy=N(0),node=N(36443),bb/cray=N(5638522108)
GrpTRESRunMins=cpu=N(111642),mem=N(217926485),energy=N(0),node=N(1744),bb/cray=N(0)
MaxJobsPU=10(1) MaxSubmitJobs=250(1) MaxWallPJ=1440
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=node=512
MaxTRESMinsPJ=
MinTRESPJ=
QOS=premium_regular_4(23)
UsageRaw=1811337.955253
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(355.23)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(30188),mem=N(58928861),energy=N(0),node=N(471),bb/cray=N(35278242)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=50(0) MaxSubmitJobs=500(0) MaxWallPJ=1440
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=node=100
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low_regular_0(25)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs=4(0) MaxWallPJ=360
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=node=1628
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low_regular_1(26)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=1(0) MaxSubmitJobs=4(0) MaxWallPJ=720
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=node=1024
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low_regular_2(27)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=10(0) MaxSubmitJobs=100(0) MaxWallPJ=1440
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=node=1024
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low_regular_3(28)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=10(0) MaxSubmitJobs=250(0) MaxWallPJ=1440
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=node=512
MaxTRESMinsPJ=
MinTRESPJ=
QOS=low_regular_4(29)
UsageRaw=0.000000
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobsPU=50(0) MaxSubmitJobs=500(0) MaxWallPJ=1440
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=node=100
MaxTRESMinsPJ=
MinTRESPJ=
QOS=part_debug(32)
UsageRaw=23592245099.469097
GrpJobs=N(9) GrpSubmitJobs=N(122) GrpWall=N(389416.66)
GrpTRES=cpu=N(7680),mem=N(14991360),energy=N(0),node=N(120),bb/cray=N(0)
GrpTRESMins=cpu=N(393204084),mem=N(767360107327),energy=N(0),node=N(6143813),bb/cray=N(2619505647)
GrpTRESRunMins=cpu=N(232929),mem=N(454677538),energy=N(0),node=N(3639),bb/cray=N(0)
MaxJobsPU=1(9) MaxSubmitJobs=10(122) MaxWallPJ=
MaxTRESPJ=node=128
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=part_reg(33)
UsageRaw=277289275824.041923
GrpJobs=N(169) GrpSubmitJobs=N(5154) GrpWall=N(5454035.99)
GrpTRES=cpu=N(90176),mem=N(174243840),energy=N(0),node=N(1409),bb/cray=N(0)
GrpTRESMins=cpu=N(4621487930),mem=N(9005850145329),energy=N(0),node=N(72210748),bb/cray=N(14670727238)
GrpTRESRunMins=cpu=N(50370413),mem=N(97196104567),energy=N(0),node=N(787037),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=part_shared(36)
UsageRaw=6167943337.439099
GrpJobs=N(748) GrpSubmitJobs=N(10253) GrpWall=N(37922174.68)
GrpTRES=cpu=N(3364),mem=N(4910400),energy=N(0),node=N(748),bb/cray=N(0)
GrpTRESMins=cpu=N(102799055),mem=N(170658900245),energy=N(0),node=N(37922174),bb/cray=N(368517711)
GrpTRESRunMins=cpu=N(4335376),mem=N(6226500412),energy=N(0),node=N(902159),bb/cray=N(0)
MaxJobs= MaxSubmitJobs=25000(10253) MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=part_preempt(44)
UsageRaw=1869181.523848
GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(239.54)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(31153),mem=N(60810705),energy=N(0),node=N(486),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=part_realtime(45)
UsageRaw=63961418.292732
GrpJobs=N(7) GrpSubmitJobs=N(11) GrpWall=N(47279.89)
GrpTRES=cpu=2048(168),mem=N(327936),energy=N(0),node=N(7),bb/cray=N(0)
GrpTRESMins=cpu=N(1066023),mem=N(2075797392),energy=N(0),node=N(47329),bb/cray=N(10515235)
GrpTRESRunMins=cpu=N(6043),mem=N(11796326),energy=N(0),node=N(251),bb/cray=N(0)
MaxJobsPU=8(7) MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=cpu=512
MaxTRESMinsPJ=
MinTRESPJ=
QOS=realtime_generic(46)
UsageRaw=44140975.222532
GrpJobs=8(0) GrpSubmitJobs=N(0) GrpWall=N(34077.98)
GrpTRES=cpu=256(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(735682),mem=N(1436130664),energy=N(0),node=N(34098),bb/cray=N(6511657)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=realtime_lcls(47)
UsageRaw=231370.796306
GrpJobs=8(0) GrpSubmitJobs=N(0) GrpWall=N(122.47)
GrpTRES=cpu=256(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(3856),mem=N(7527263),energy=N(0),node=N(122),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=realtime_openmsi(48)
UsageRaw=5887143.595559
GrpJobs=8(0) GrpSubmitJobs=N(0) GrpWall=N(2696.26)
GrpTRES=cpu=256(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(98119),mem=N(191528404),energy=N(0),node=N(2725),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=realtime_ngbi(49)
UsageRaw=3278.289890
GrpJobs=8(0) GrpSubmitJobs=N(0) GrpWall=N(27.31)
GrpTRES=cpu=256(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(54),mem=N(106653),energy=N(0),node=N(27),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=realtime_als(50)
UsageRaw=1033500.474004
GrpJobs=8(0) GrpSubmitJobs=N(0) GrpWall=N(2179.16)
GrpTRES=cpu=256(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(17225),mem=N(28523604),energy=N(0),node=N(2179),bb/cray=N(0)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=realtime_ptf(51)
UsageRaw=12580964.763747
GrpJobs=8(7) GrpSubmitJobs=N(7) GrpWall=N(8139.20)
GrpTRES=cpu=256(168),mem=N(327936),energy=N(0),node=N(7),bb/cray=N(0)
GrpTRESMins=cpu=N(209682),mem=N(409241978),energy=N(0),node=N(8139),bb/cray=N(0)
GrpTRESRunMins=cpu=N(4771),mem=N(9313382),energy=N(0),node=N(198),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
QOS=realtime_nstaff(53)
UsageRaw=66024.381529
GrpJobs=8(0) GrpSubmitJobs=N(0) GrpWall=N(27.84)
GrpTRES=cpu=256(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
GrpTRESMins=cpu=N(1100),mem=N(2147993),energy=N(0),node=N(27),bb/cray=N(4003577)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),bb/cray=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
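The per-user caps in the records above (e.g. normal_regular_2's MaxJobsPU=10 and MaxTRESPU=node=1024) can be sketched as a simple start-time gate. This is a toy model for illustration only, not Slurm's implementation; the function name and job representation are invented here, and the reason strings merely mimic the style Slurm reports in squeue:

```python
# Toy model of per-user QOS gating -- illustration only, not Slurm's code.
# Limit values mirror the normal_regular_2 QOS record above.

def qos_allows_start(user, qos, running_jobs, node_request):
    """Return (ok, reason) for starting one more job for `user` under per-user caps."""
    user_jobs = [j for j in running_jobs if j["user"] == user]
    # MaxJobsPU: cap on concurrently running jobs per user
    max_jobs = qos.get("MaxJobsPU")
    if max_jobs is not None and len(user_jobs) >= max_jobs:
        return False, "QOSMaxJobsPerUserLimit"
    # MaxTRESPU node=N: cap on total nodes allocated per user
    node_cap = qos.get("MaxTRESPU", {}).get("node")
    if node_cap is not None and sum(j["nodes"] for j in user_jobs) + node_request > node_cap:
        return False, "QOSMaxNodePerUserLimit"
    return True, "ok"

normal_regular_2 = {"MaxJobsPU": 10, "MaxTRESPU": {"node": 1024}}
running = [{"user": "alice", "nodes": 600}]
print(qos_allows_start("alice", normal_regular_2, running, 500))  # node cap exceeded
print(qos_allows_start("alice", normal_regular_2, running, 400))  # allowed
```

A job that fails such a check stays pending with a QOS-limit reason rather than "Resources", which is one way to distinguish limit-blocked jobs from the behavior reported in this ticket.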
----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov
Sorry about that - I forgot "scontrol show assoc" can get rather large when you have a considerable number of accounts defined... my test systems usually only have a handful.

I don't see any obvious issues with your config, and it doesn't look like you're using the Partition QOS to limit access. Those jobs waiting for Resources - can you confirm they're not waiting for a burst buffer, memory, or some other limit, but appear to be stuck waiting on available nodes only? I'd be curious what slurmctld.log indicates is happening. Are you able to grab debug logs from before/after you "clear" the problem by draining/resuming the partition? I'd also be curious how the scheduler is performing for cori - sdiag's output may be of value, although I can't point to anything in particular there at the moment.

Hi Tim,

Yes, as far as I can tell these jobs are only waiting on nodes, but you did trigger an idea. Does SLURM have some sort of load sensor like GridEngine for determining available memory on nodes? Or does it just dole out the theoretical max of memory as specified in the slurm.conf for the node, and just assume the memory is available? I ask because I can imagine a situation wherein the node doesn't get fully cleaned and there is no longer sufficient memory to run jobs based on our DefMemPerNode settings. However, I think this is more of an academic point; our "regular" and "debug" partitions only give out the maximum amount of memory we allow, and I don't see any particular group of nodes stagnating.

Actually, this recurred last night -- I'll see if I can dig up the logs in a few minutes (there are a LOT of logs...)
This time, on a whim, I left the shared partition down. It comprises over half our job queue in terms of entry count, and has generated scheduling issues in the past (pathological failures wherein some jerk asking for all the memory on a node but only 1 core would completely block all shared-partition jobs from running, even on nodes the system wasn't planning on running the job on). Anyway, with the shared partition down this issue has not recurred - so I'm wondering if this is somehow related to that partition.

Regarding your request for performance numbers: I typically see our backfill scheduler cycle around 30s when shared is enabled. I did update the parameters yesterday to start preparing for our production configuration starting on 1/11:

SchedulerParameters=no_backup_scheduling,bf_window=10080,bf_resolution=120,bf_max_job_array_resv=20,default_queue_depth=400,bf_max_job_test=6000,bf_max_job_user=1,bf_continue,nohold_on_prolog_fail,kill_invalid_depend

With the shared partition down, things seem smoother. Obviously we need to get shared back online, but I want to let a few more big jobs run first before signing up for more pain.
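The effect of bf_max_job_user=1 in the parameters above can be sketched as a per-user filter over the priority-ordered pending queue: each backfill pass considers at most that many jobs per user. This is an illustrative toy model only, not the actual slurmctld backfill loop, and the job representation is invented:

```python
# Sketch of bf_max_job_user: limit how many of each user's pending jobs
# one backfill pass will even consider. Illustration only, not Slurm's code.

def backfill_candidates(pending, bf_max_job_user):
    seen = {}      # user -> jobs already taken in this pass
    picked = []
    for job in pending:                        # pending is in priority order
        user = job["user"]
        if seen.get(user, 0) >= bf_max_job_user:
            continue                           # user's quota for this pass is used up
        seen[user] = seen.get(user, 0) + 1
        picked.append(job["id"])
    return picked

queue = [{"id": 1, "user": "a"}, {"id": 2, "user": "a"},
         {"id": 3, "user": "b"}, {"id": 4, "user": "a"}]
print(backfill_candidates(queue, 1))  # [1, 3]
```

Lowering the value from 5 to 1, as described earlier in the ticket, shrinks each user's footprint in a backfill cycle at the cost of considering fewer of their jobs per pass.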
nid00837:~ # sdiag
*******************************************************
sdiag output at Tue Dec 29 11:07:48 2015
Data since      Mon Dec 28 16:00:00 2015
*******************************************************
Server thread count: 3
Agent queue size:    0

Jobs submitted: 8560
Jobs started:   5564
Jobs completed: 5289
Jobs canceled:  270
Jobs failed:    1

Main schedule statistics (microseconds):
        Last cycle:   28172
        Max cycle:    531631
        Total cycles: 9372
        Mean cycle:   26076
        Mean depth cycle:  884
        Cycles per minute: 8
        Last queue length: 4164

Backfilling stats
        Total backfilled jobs (since last slurm start): 4674
        Total backfilled jobs (since last stats cycle start): 4328
        Total cycles: 1451
        Last cycle when: Tue Dec 29 11:07:01 2015
        Last cycle: 14648298
        Max cycle:  210382414
        Mean cycle: 15108727
        Last depth cycle: 4153
        Last depth cycle (try sched): 216
        Depth Mean: 4366
        Depth Mean (try depth): 199
        Last queue length: 4164
        Queue length mean: 4538

Remote Procedure Call statistics by message type
        REQUEST_JOB_STEP_CREATE ( 5001) count:51635 ave_time:453031 total_time:23392303668
        MESSAGE_EPILOG_COMPLETE ( 6012) count:45966 ave_time:1442391 total_time:66300959109
        REQUEST_COMPLETE_PROLOG ( 6018) count:44917 ave_time:4917923 total_time:220898380128
        REQUEST_PARTITION_INFO ( 2009) count:42389 ave_time:2603 total_time:110346796
        REQUEST_JOB_ALLOCATION_INFO_LITE ( 4016) count:31971 ave_time:1461398 total_time:46722363264
        REQUEST_STEP_COMPLETE ( 5016) count:31662 ave_time:1143450 total_time:36203931498
        MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:31012 ave_time:77940 total_time:2417090957
        REQUEST_JOB_INFO ( 2003) count:28519 ave_time:1655330 total_time:47208357850
        REQUEST_JOB_USER_INFO ( 2039) count:12753 ave_time:1119198 total_time:14273135712
        REQUEST_SUBMIT_BATCH_JOB ( 4003) count:8311 ave_time:2595395 total_time:21570329270
        REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:7527 ave_time:6497207 total_time:48904482394
        REQUEST_PING ( 1008) count:1906 ave_time:194 total_time:369931
        REQUEST_JOB_INFO_SINGLE ( 2021) count:1551 ave_time:8941220 total_time:13867833675
        REQUEST_NODE_INFO ( 2007) count:1479 ave_time:5262233 total_time:7782843909
        REQUEST_CANCEL_JOB_STEP ( 5005) count:908 ave_time:108256 total_time:98297288
        REQUEST_KILL_JOB ( 5032) count:837 ave_time:282130 total_time:236143105
        REQUEST_BURST_BUFFER_INFO ( 2037) count:254 ave_time:308 total_time:78300
        REQUEST_JOB_READY ( 4019) count:108 ave_time:736117 total_time:79500701
        REQUEST_RESOURCE_ALLOCATION ( 4001) count:90 ave_time:6618577 total_time:595671996
        REQUEST_UPDATE_JOB ( 3001) count:67 ave_time:125614 total_time:8416200
        REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:62 ave_time:3288170 total_time:203866574
        REQUEST_STATS_INFO ( 2035) count:17 ave_time:179 total_time:3048
        REQUEST_UPDATE_PARTITION ( 3005) count:9 ave_time:133969 total_time:1205728
        REQUEST_PRIORITY_FACTORS ( 2026) count:6 ave_time:541223 total_time:3247339
        REQUEST_NODE_INFO_SINGLE ( 2040) count:5 ave_time:3039054 total_time:15195270
        REQUEST_UPDATE_NODE ( 3002) count:4 ave_time:14166058 total_time:56664232
        REQUEST_RESERVATION_INFO ( 2024) count:1 ave_time:166 total_time:166

Remote Procedure Call statistics by user
        root ( 0) count:165638 ave_time:2358599 total_time:390673627512
        guangsha ( 54808) count:57766 ave_time:383712 total_time:22165519601
        aoliu ( 56679) count:48534 ave_time:750603 total_time:36429805003
        lsq ( 63156) count:15374 ave_time:181018 total_time:2782984897
        tang31 ( 62615) count:3845 ave_time:381498 total_time:1466861482
        rwelsch ( 69678) count:3417 ave_time:172162 total_time:588278710
        malbon ( 58163) count:3270 ave_time:394690 total_time:1290638777
        krach ( 58876) count:3120 ave_time:312076 total_time:973677142
        berkowit ( 62817) count:3084 ave_time:551964 total_time:1702258279
        gpzhang ( 41964) count:2711 ave_time:426136 total_time:1155255579
        polkituser ( 108) count:2149 ave_time:1934544 total_time:4157336913
        cemitch ( 34773) count:1359 ave_time:5043030 total_time:6853477894
        ptfproc ( 62098) count:1074 ave_time:2773597
total_time:2978843540 lsqphot ( 62521) count:886 ave_time:2606250 total_time:2309138029 jiachen ( 52191) count:752 ave_time:6227583 total_time:4683143148 yfeng1 ( 62716) count:699 ave_time:7381043 total_time:5159349145 fangyong ( 53958) count:588 ave_time:6475573 total_time:3807637109 jesutton ( 60057) count:579 ave_time:4026770 total_time:2331499933 dmj ( 56094) count:547 ave_time:2571703 total_time:1406721673 operator ( 34510) count:524 ave_time:3889977 total_time:2038348407 usgweb ( 33442) count:502 ave_time:2856711 total_time:1434069071 pdhuvad ( 59561) count:348 ave_time:1982374 total_time:689866240 u15013 ( 15013) count:320 ave_time:142528 total_time:45609015 fwang26 ( 49811) count:291 ave_time:1853384 total_time:539335032 fpaesani ( 33949) count:258 ave_time:255265 total_time:65858581 glock ( 69615) count:254 ave_time:308 total_time:78300 rch ( 52243) count:250 ave_time:254317 total_time:63579335 friesen ( 52244) count:249 ave_time:5646210 total_time:1405906476 jaehong ( 42915) count:232 ave_time:1525988 total_time:354029299 mycoy ( 62389) count:232 ave_time:549504 total_time:127484967 emiliord ( 56872) count:225 ave_time:3441861 total_time:774418761 jhc585 ( 70382) count:213 ave_time:1994431 total_time:424813813 orginos ( 13909) count:210 ave_time:317101 total_time:66591297 yaowang ( 56583) count:206 ave_time:1175761 total_time:242206789 ysuleyma ( 60557) count:204 ave_time:6829919 total_time:1393303668 swu_ncsu ( 69821) count:201 ave_time:1032187 total_time:207469772 byujiang ( 63096) count:198 ave_time:993105 total_time:196634850 archs ( 69687) count:192 ave_time:233125 total_time:44760086 mlubin ( 62217) count:183 ave_time:1589433 total_time:290866336 rsakidja ( 55248) count:183 ave_time:6333642 total_time:1159056492 skcheng ( 68361) count:173 ave_time:264116 total_time:45692212 rcane ( 58910) count:173 ave_time:7606778 total_time:1315972759 szg142 ( 62766) count:172 ave_time:4065274 total_time:699227201 chlee10 ( 45580) count:170 ave_time:170497 
total_time:28984523 knam ( 41118) count:169 ave_time:134244 total_time:22687395 sghosh28 ( 68990) count:157 ave_time:294877 total_time:46295836 luzhixin ( 58710) count:152 ave_time:7329518 total_time:1114086778 schenke ( 52653) count:140 ave_time:1447775 total_time:202688593 xgli ( 48514) count:139 ave_time:15994864 total_time:2223286133 startsev ( 16891) count:137 ave_time:5434589 total_time:744538738 qyang ( 56661) count:131 ave_time:2863608 total_time:375132672 wangyu ( 49739) count:129 ave_time:395237 total_time:50985694 yy293 ( 61446) count:127 ave_time:585155 total_time:74314778 weichen ( 51381) count:126 ave_time:164564 total_time:20735176 mkunz ( 57597) count:123 ave_time:3103359 total_time:381713272 jchowdhu ( 49411) count:122 ave_time:1496306 total_time:182549365 stpi ( 68888) count:118 ave_time:716532 total_time:84550828 aike ( 34983) count:116 ave_time:3682853 total_time:427210973 xiey ( 57446) count:107 ave_time:23232515 total_time:2485879195 rncahn ( 42003) count:105 ave_time:2969432 total_time:311790379 s7z ( 69292) count:90 ave_time:1223935 total_time:110154216 sokseiha ( 60723) count:87 ave_time:4439473 total_time:386234185 slz839 ( 68508) count:87 ave_time:3222288 total_time:280339138 bmarco ( 49744) count:85 ave_time:648784 total_time:55146653 phychem ( 61270) count:83 ave_time:181968 total_time:15103359 akara ( 40227) count:82 ave_time:2399977 total_time:196798117 dingjun ( 57157) count:78 ave_time:5066085 total_time:395154665 ninghai ( 51797) count:76 ave_time:995114 total_time:75628722 zrsun ( 63135) count:75 ave_time:283932 total_time:21294922 psteinbr ( 62610) count:74 ave_time:11636502 total_time:861101165 sselcuk ( 55180) count:70 ave_time:426590 total_time:29861324 sburrows ( 56392) count:68 ave_time:834995 total_time:56779665 saunders ( 56320) count:67 ave_time:6344775 total_time:425099955 sivanr ( 55792) count:66 ave_time:324165 total_time:21394911 dorislee ( 64581) count:64 ave_time:4843782 total_time:310002055 rotureau ( 41524) 
count:64 ave_time:2234647 total_time:143017452 wangjp ( 59411) count:63 ave_time:31893940 total_time:2009318222 tslo ( 44437) count:63 ave_time:140462 total_time:8849167 bkang ( 70639) count:62 ave_time:722314 total_time:44783479 ckerr ( 13601) count:61 ave_time:437294 total_time:26674983 masha ( 12880) count:58 ave_time:21679175 total_time:1257392185 vancho ( 69590) count:56 ave_time:3097429 total_time:173456062 jihwang ( 69202) count:55 ave_time:6754315 total_time:371487364 ayonge ( 63348) count:54 ave_time:617843 total_time:33363550 dpetesch ( 51668) count:50 ave_time:185625 total_time:9281290 yuan_pin ( 44577) count:49 ave_time:507326 total_time:24859012 brightzh ( 49011) count:49 ave_time:636386 total_time:31182959 dbrout ( 58732) count:48 ave_time:1103036 total_time:52945771 yhzhao ( 70137) count:46 ave_time:366928 total_time:16878702 haobin ( 12588) count:45 ave_time:791601 total_time:35622085 bojana ( 45227) count:45 ave_time:824766 total_time:37114471 binchen ( 69475) count:45 ave_time:869099 total_time:39109461 huang26 ( 69507) count:44 ave_time:4798425 total_time:211130737 scoh ( 54290) count:44 ave_time:879761 total_time:38709495 apurkaya ( 50086) count:42 ave_time:296766 total_time:12464207 szuchia ( 70542) count:42 ave_time:730901 total_time:30697882 divalent ( 58435) count:37 ave_time:10076267 total_time:372821907 vorberg ( 68395) count:36 ave_time:155281 total_time:5590117 jptrinas ( 55511) count:35 ave_time:20860364 total_time:730112761 vetinari ( 56108) count:32 ave_time:856092 total_time:27394964 hergert ( 66183) count:32 ave_time:1093159 total_time:34981102 vih173 ( 61676) count:31 ave_time:15414454 total_time:477848081 dcantu ( 59635) count:31 ave_time:381786 total_time:11835389 gpau ( 43040) count:29 ave_time:405953 total_time:11772649 vijaysr ( 61136) count:29 ave_time:326426 total_time:9466359 lpyu ( 47763) count:29 ave_time:3359420 total_time:97423180 vfung ( 69515) count:29 ave_time:1542510 total_time:44732790 monoue ( 56205) count:27 
ave_time:531624 total_time:14353850 huiyufen ( 70591) count:27 ave_time:428552 total_time:11570915 aryal ( 44957) count:26 ave_time:230997 total_time:6005941 janina ( 63054) count:26 ave_time:256841 total_time:6677870 shpark ( 70066) count:26 ave_time:19839359 total_time:515823337 jddenlin ( 46195) count:25 ave_time:827131 total_time:20678293 ebraun ( 60799) count:24 ave_time:468404 total_time:11241719 loryza ( 68452) count:24 ave_time:41678603 total_time:1000286480 euniv ( 35016) count:23 ave_time:4951549 total_time:113885630 linj7 ( 55679) count:21 ave_time:320613 total_time:6732889 staimour ( 61277) count:20 ave_time:80891022 total_time:1617820440 schrier ( 33338) count:19 ave_time:53185 total_time:1010525 mtreagan ( 55441) count:19 ave_time:1032315 total_time:19613989 dkitch ( 60923) count:18 ave_time:3616284 total_time:65093118 songliu ( 62095) count:17 ave_time:612109 total_time:10405861 kenmc ( 59827) count:17 ave_time:16510408 total_time:280676948 ameisner ( 68391) count:17 ave_time:777231 total_time:13212931 tyson ( 41570) count:16 ave_time:28144 total_time:450304 mandrade ( 69505) count:16 ave_time:3977005 total_time:63632090 jj1 ( 63053) count:16 ave_time:2956263 total_time:47300212 ravish ( 59487) count:16 ave_time:575726 total_time:9211620 kjwlou ( 68953) count:16 ave_time:2556664 total_time:40906631 bln ( 62356) count:14 ave_time:163746 total_time:2292450 ppetrov ( 54943) count:14 ave_time:7199064 total_time:100786906 sisir ( 64001) count:14 ave_time:271684 total_time:3803587 bsingh ( 51922) count:14 ave_time:295149 total_time:4132096 wangjl ( 56483) count:14 ave_time:465903 total_time:6522655 pyhuang ( 70607) count:14 ave_time:763101 total_time:10683425 xinzhang ( 70290) count:12 ave_time:164348 total_time:1972176 u10198 ( 10198) count:12 ave_time:10871316 total_time:130455799 bravenec ( 32825) count:11 ave_time:251121 total_time:2762333 rtsyshev ( 50700) count:11 ave_time:62201 total_time:684211 ajinich ( 70532) count:11 ave_time:472611 
total_time:5198731 pjfeibe ( 52605) count:11 ave_time:21141403 total_time:232555443 pankin ( 33880) count:10 ave_time:3019392 total_time:30193921 mniesen ( 58371) count:10 ave_time:855176 total_time:8551764 mgalib ( 64941) count:10 ave_time:806979 total_time:8069797 sfischer ( 65263) count:10 ave_time:1654892 total_time:16548921 mewang ( 69389) count:10 ave_time:1357193 total_time:13571931 smhagos ( 52024) count:10 ave_time:60779 total_time:607797 yaping ( 62471) count:10 ave_time:103049 total_time:1030496 saif ( 70526) count:9 ave_time:34614 total_time:311534 hoa84 ( 69035) count:9 ave_time:2090018 total_time:18810169 mtnguyen ( 70592) count:9 ave_time:13011413 total_time:117102717 smirzaei ( 62064) count:9 ave_time:366272 total_time:3296453 geniav ( 55362) count:8 ave_time:229781 total_time:1838251 cashman ( 55766) count:8 ave_time:312038 total_time:2496306 dgold ( 58888) count:8 ave_time:79941 total_time:639533 dks ( 12735) count:8 ave_time:1302715 total_time:10421723 jhyoon ( 51879) count:8 ave_time:454561 total_time:3636493 shizhong ( 61675) count:8 ave_time:495150 total_time:3961202 paganol ( 47088) count:7 ave_time:203840 total_time:1426885 dbowring ( 54298) count:7 ave_time:28776 total_time:201438 ruchen ( 62289) count:6 ave_time:14811 total_time:88869 sabuda ( 69930) count:6 ave_time:6644786 total_time:39868719 cenko ( 49323) count:6 ave_time:829963 total_time:4979780 taibui ( 65462) count:6 ave_time:190463 total_time:1142782 alexand ( 32910) count:6 ave_time:51024 total_time:306146 mwhite ( 31845) count:6 ave_time:553 total_time:3323 jihankim ( 47675) count:6 ave_time:52533 total_time:315198 samolyuk ( 51792) count:6 ave_time:1110883 total_time:6665298 rtumkur ( 63583) count:6 ave_time:656338 total_time:3938030 krad ( 69112) count:6 ave_time:186925 total_time:1121555 lslivins ( 69829) count:5 ave_time:24927 total_time:124636 tutchton ( 59274) count:5 ave_time:24824148 total_time:124120740 nishino ( 46822) count:5 ave_time:741407 total_time:3707036 
mastriko ( 45620) count:5 ave_time:2684 total_time:13421 cdiaz ( 58183) count:5 ave_time:468331 total_time:2341659 canon ( 16907) count:5 ave_time:6795 total_time:33977 yunl ( 57742) count:4 ave_time:129959 total_time:519837 mdfowler ( 69296) count:4 ave_time:39404 total_time:157616 vlcek ( 68560) count:4 ave_time:1335161 total_time:5340646 otresca ( 61059) count:4 ave_time:297226 total_time:1188906 holod ( 34809) count:4 ave_time:92674 total_time:370697 yihe ( 55756) count:4 ave_time:575641 total_time:2302565 ngnedin ( 54589) count:4 ave_time:624076 total_time:2496304 syuk ( 70145) count:4 ave_time:384192 total_time:1536769 tianq ( 63881) count:4 ave_time:83826388 total_time:335305554 toussain ( 10173) count:4 ave_time:68295 total_time:273183 karol ( 68859) count:4 ave_time:551 total_time:2204 dajiang ( 41744) count:4 ave_time:15885126 total_time:63540506 u232 ( 232) count:2 ave_time:722560 total_time:1445120 u16621 ( 16621) count:2 ave_time:4878741 total_time:9757483 szyang ( 70310) count:2 ave_time:19517654 total_time:39035309 samli ( 61845) count:2 ave_time:19341 total_time:38682 hyeongk ( 68594) count:2 ave_time:475 total_time:951 fgygi ( 40699) count:2 ave_time:368595 total_time:737190 morgak ( 57524) count:2 ave_time:366171 total_time:732342 istet ( 44164) count:2 ave_time:19766 total_time:39533 jyoti ( 63106) count:2 ave_time:1214022 total_time:2428045 xueling ( 63083) count:2 ave_time:2988804 total_time:5977609 gandolfi ( 47676) count:1 ave_time:90515 total_time:90515 jschlup ( 70509) count:1 ave_time:44571 total_time:44571 nid00837:~ #

(In reply to Doug Jacobsen from comment #6)
> Hi Tim,
>
> Yes, as far as I can tell these jobs are only waiting on nodes, but you did
> trigger an idea. Does SLURM have some sort of load sensor like GridEngine
> for determining available memory on nodes? Or does it just dole out the
> theoretical max of memory as specified in the slurm.conf for the node, and
> just assume the memory is available? I ask because I can imagine a
> situation wherein the node doesn't get fully cleaned and there no longer is
> sufficient memory to be running jobs based on our DefMemPerNode settings.
> However, I think this is more of an academic point, our "regular" and
> "debug" partitions only give out the maximum amount of memory we allow, and
> I don't see any particular group of nodes stagnating.

There's no load monitoring à la SGE for memory; Slurm schedules based on the memory defined for the node versus the total requested, when running with cons_res and CR_Socket_Memory. (This avoids any potential over-subscription; I've always been suspicious of that behavior on other schedulers. There is a way to forcibly over-provision nodes with memory, but we recommend against it for obvious reasons.)

> Actually, this reoccurred last night -- I'll see if I can dig up the logs in
> a few minutes (there are a LOT of logs...)
>
> This time, on a whim, I left the shared partition down, which comprises
> over half our job queue in terms of entry count, and has generated
> scheduling issues in the past (pathological failures wherein some jerk
> asking for all the memory on a node but only 1 core would completely block
> all shared-partition jobs from running, even on nodes the system wasn't
> planning on running the job on).
>
> Anyway, with the shared partition down this issue has not reoccurred - so
> I'm wondering if this is somehow related to that partition.

What are you trying to do with the shared partition? I could see Shared=FORCE:32 causing some odd behavior - it does do some load monitoring when deciding which nodes to over-subscribe. (Shared=FORCE oversubscribes, in your case up to 32x. Hopefully that's what you expect; I know the nomenclature behind some of those options isn't obvious - I've looked at it myself expecting it to share sockets but still allocate individual cores properly, which is not what it does.)
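For reference, the behavior Tim describes corresponds roughly to slurm.conf settings like the following. This is a hypothetical illustration only (the node names, memory sizes, and partition line are made up; Cori's actual configuration is in the attached slurm.conf):

```
# select/cons_res with memory tracking: the scheduler compares each job's
# memory request against the node's configured RealMemory; there is no
# live load sensing of actual free memory on the node.
SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory

# Hypothetical node and partition definitions. Shared=FORCE:32 allows up
# to 32 jobs from this partition to over-subscribe each node.
NodeName=nid00[001-100] Sockets=2 CoresPerSocket=16 RealMemory=128000
PartitionName=shared Nodes=nid00[001-100] Shared=FORCE:32 DefMemPerNode=4000
```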
Also note that any sharing is per-partition; Slurm will not co-mingle jobs from separate partitions within a single node. This may be leading to some of the resource contention you're seeing - it looked like you'd only sent squeue output for regular, but I'm guessing there may have been some relatively large jobs pending in shared that could have caused resources to be reserved awaiting a larger job launching in a separate partition.

Before I joined, I'd submitted a feature request to mark nodes as "earmarked" or something similar - some mechanism of noting that they aren't "idle" but are instead being kept empty in order to launch some future job. I'll see if I can get that done for 16.05 to at least help indicate the current status.

> Regarding your request for performance numbers. I typically see our
> backfill scheduler cycle around 30s when shared is enabled. I did update
> the parameters yesterday to start preparing for our production
> configuration starting on 1/11.
> SchedulerParameters =
> no_backup_scheduling,bf_window=10080,bf_resolution=120,
> bf_max_job_array_resv=20,default_queue_depth=400,bf_max_job_test=6000,
> bf_max_job_user=1,bf_continue,nohold_on_prolog_fail,kill_invalid_depend
>
> With the shared partition down, things seem smoother. Obviously we need to
> get that back online, but I want to let a few more big jobs run first
> before signing up for more pain.

That looks fine; I don't see any obvious anomalies. I await further logs when available.

(In reply to Tim Wickberg from comment #7)
> Also note that any sharing is per-partition, Slurm will not co-mingle jobs
> from separate partitions within a single node. This may be leading to some
> of the resource contention you're seeing - it looked like you'd only sent
> squeue for regular, but I'm guessing there may have been some relatively
> large jobs pending in shared that could have caused the resources to be
> reserved awaiting a larger job launching in a separate partition.
I misspoke on part of this - splitting nodes between partitions does work as expected. The distinction, and the issues with partitions not splitting nodes, would arise only in certain cases when using gang scheduling or preemption, and you have neither of those here. My apologies for any confusion.

- Tim

Created attachment 2553 [details]
slurmctld log since last restart
slurmctld log; look for statements after the partitions are marked up, in particular after the shared partition is marked up.
I sent this in email, but it didn't seem to get put in:
I just modified some configs based on our edison experience (set explicit srun ports, KillOnBadExit, and such) -- basically things that don't involve this. After restarting slurmctld, I up'd all the partitions (including shared) and the same thing happened again.
Really, only job 550452 should be blocked for Resources at this point.
nid00837:~ # squeue --start --sort=Q | grep "Resources"
730556 debug my_job vfung PD N/A 40 (null) (Resources)
739990 debug LiTi weihu PD N/A 20 (null) (Resources)
743246 debug my_job ninghai PD N/A 64 nid00[024-051,062-06 (Resources)
683567 debug I805_315 ppetrov PD N/A 128 (null) (Resources)
743330 debug runner gandolfi PD N/A 32 nid0[0107-0108,0113- (Resources)
736356 shared multi_0. jkretchm PD N/A 1 (null) (Resources)
674260 regular Run_0251 jjunum PD N/A 8 (null) (Resources)
674261 regular Run_0252 jjunum PD N/A 8 (null) (Resources)
674262 regular Run_0253 jjunum PD N/A 8 (null) (Resources)
674263 regular Run_0254 jjunum PD N/A 8 (null) (Resources)
674264 regular Run_0255 jjunum PD N/A 8 (null) (Resources)
674265 regular Run_0256 jjunum PD N/A 8 (null) (Resources)
674266 regular Run_0257 jjunum PD N/A 8 (null) (Resources)
674267 regular Run_0258 jjunum PD N/A 8 (null) (Resources)
674268 regular Run_0259 jjunum PD N/A 8 (null) (Resources)
674269 regular Run_0260 jjunum PD N/A 8 (null) (Resources)
674270 regular Run_0261 jjunum PD N/A 8 (null) (Resources)
674271 regular Run_0262 jjunum PD N/A 8 (null) (Resources)
674272 regular Run_0263 jjunum PD N/A 8 (null) (Resources)
674273 regular Run_0264 jjunum PD N/A 8 (null) (Resources)
674274 regular Run_0265 jjunum PD N/A 8 (null) (Resources)
674275 regular Run_0266 jjunum PD N/A 8 (null) (Resources)
674276 regular Run_0267 jjunum PD N/A 8 (null) (Resources)
674277 regular Run_0268 jjunum PD N/A 8 (null) (Resources)
674278 regular Run_0269 jjunum PD N/A 8 (null) (Resources)
674279 regular Run_0270 jjunum PD N/A 8 (null) (Resources)
674280 regular Run_0271 jjunum PD N/A 8 (null) (Resources)
674281 regular Run_0272 jjunum PD N/A 8 (null) (Resources)
674282 regular Run_0273 jjunum PD N/A 8 (null) (Resources)
674283 regular Run_0274 jjunum PD N/A 8 (null) (Resources)
674284 regular Run_0275 jjunum PD N/A 8 (null) (Resources)
674285 regular Run_0276 jjunum PD N/A 8 (null) (Resources)
674286 regular Run_0277 jjunum PD N/A 8 (null) (Resources)
674287 regular Run_0278 jjunum PD N/A 8 (null) (Resources)
674288 regular Run_0279 jjunum PD N/A 8 (null) (Resources)
674289 regular Run_0280 jjunum PD N/A 8 (null) (Resources)
674290 regular Run_0281 jjunum PD N/A 8 (null) (Resources)
674291 regular Run_0282 jjunum PD N/A 8 (null) (Resources)
674292 regular Run_0283 jjunum PD N/A 8 (null) (Resources)
674293 regular Run_0284 jjunum PD N/A 8 (null) (Resources)
674294 regular Run_0285 jjunum PD N/A 8 (null) (Resources)
674295 regular Run_0286 jjunum PD N/A 8 (null) (Resources)
675149 regular GDB5L_L bzhu PD N/A 32 (null) (Resources)
675150 regular GDB5L_L bzhu PD N/A 32 (null) (Resources)
675151 regular GDB5L_L bzhu PD N/A 32 (null) (Resources)
673738 regular trinity_ jungpyo PD 2015-12-31T02:20:00 1024 nid0[0209-0211,0216- (Resources)
648729 regular mbd_rela farren PD 2015-12-30T14:20:00 128 nid0[0209,0218-0222, (Resources)
673922 regular STw3 drhatch PD 2015-12-30T11:19:27 16 nid0[0231,0413,0501- (Resources)
651551 regular Run_0683 gmcfarq PD N/A 9 (null) (Resources)
651552 regular Run_0684 gmcfarq PD N/A 9 (null) (Resources)
651553 regular Run_0685 gmcfarq PD N/A 9 (null) (Resources)
651554 regular Run_0686 gmcfarq PD N/A 9 (null) (Resources)
651555 regular Run_0687 gmcfarq PD N/A 9 (null) (Resources)
651556 regular Run_0688 gmcfarq PD N/A 9 (null) (Resources)
651557 regular Run_0689 gmcfarq PD N/A 9 (null) (Resources)
651558 regular Run_0690 gmcfarq PD N/A 9 (null) (Resources)
651559 regular Run_0691 gmcfarq PD N/A 9 (null) (Resources)
651560 regular Run_0692 gmcfarq PD N/A 9 (null) (Resources)
651561 regular Run_0693 gmcfarq PD N/A 9 (null) (Resources)
651562 regular Run_0694 gmcfarq PD N/A 9 (null) (Resources)
651563 regular Run_0695 gmcfarq PD N/A 9 (null) (Resources)
651564 regular Run_0696 gmcfarq PD N/A 9 (null) (Resources)
651565 regular Run_0697 gmcfarq PD N/A 9 (null) (Resources)
651566 regular Run_0698 gmcfarq PD N/A 9 (null) (Resources)
651567 regular Run_0699 gmcfarq PD N/A 9 (null) (Resources)
651568 regular Run_0700 gmcfarq PD N/A 9 (null) (Resources)
526702 regular 416 zfliu PD 2015-12-30T13:20:00 360 nid0[0208,0281-0287, (Resources)
672510 regular ITER12MA izzo PD 2015-12-30T10:54:00 22 nid00[446-447,464-46 (Resources)
673750 regular HEAT_UCL chhabra PD N/A 31 (null) (Resources)
669504 regular test mastriko PD 2015-12-30T10:54:00 16 nid00[294-309] (Resources)
673200 regular DCLL2 chhabra PD 2015-12-30T10:54:00 28 nid0[1440-1467] (Resources)
672765 regular ch3nh3pb abdalla PD 2015-12-30T10:54:00 12 nid0[0998-1009] (Resources)
672766 regular ch3nh3pb abdalla PD N/A 12 (null) (Resources)
672767 regular ch3nh3pb abdalla PD N/A 12 (null) (Resources)
672768 regular ch3nh3pb abdalla PD N/A 12 (null) (Resources)
672769 regular ch3nh3pb abdalla PD N/A 12 (null) (Resources)
647441 regular usgsmega chunzhao PD N/A 50 (null) (Resources)
646874 regular htmegan chunzhao PD 2015-12-30T10:54:00 50 nid0[0749,1616-1619, (Resources)
669872 regular zgoubi vranjbar PD 2015-12-30T10:54:00 32 nid00[701-703,720-74 (Resources)
651548 regular Run_0682 jjunum PD N/A 9 (null) (Resources)
651485 regular Run_0619 jjunum PD N/A 9 (null) (Resources)
651486 regular Run_0620 jjunum PD N/A 9 (null) (Resources)
651487 regular Run_0621 jjunum PD N/A 9 (null) (Resources)
651488 regular Run_0622 jjunum PD N/A 9 (null) (Resources)
651489 regular Run_0623 jjunum PD N/A 9 (null) (Resources)
651490 regular Run_0624 jjunum PD N/A 9 (null) (Resources)
651491 regular Run_0625 jjunum PD N/A 9 (null) (Resources)
651492 regular Run_0626 jjunum PD N/A 9 (null) (Resources)
651493 regular Run_0627 jjunum PD N/A 9 (null) (Resources)
651494 regular Run_0628 jjunum PD N/A 9 (null) (Resources)
651495 regular Run_0629 jjunum PD N/A 9 (null) (Resources)
651496 regular Run_0630 jjunum PD N/A 9 (null) (Resources)
651497 regular Run_0631 jjunum PD N/A 9 (null) (Resources)
651498 regular Run_0632 jjunum PD N/A 9 (null) (Resources)
651499 regular Run_0633 jjunum PD N/A 9 (null) (Resources)
651500 regular Run_0634 jjunum PD N/A 9 (null) (Resources)
651501 regular Run_0635 jjunum PD N/A 9 (null) (Resources)
651502 regular Run_0636 jjunum PD N/A 9 (null) (Resources)
651503 regular Run_0637 jjunum PD N/A 9 (null) (Resources)
651504 regular Run_0638 jjunum PD N/A 9 (null) (Resources)
651505 regular Run_0639 jjunum PD N/A 9 (null) (Resources)
651506 regular Run_0640 jjunum PD N/A 9 (null) (Resources)
651507 regular Run_0641 jjunum PD N/A 9 (null) (Resources)
651508 regular Run_0642 jjunum PD N/A 9 (null) (Resources)
651509 regular Run_0643 jjunum PD N/A 9 (null) (Resources)
651510 regular Run_0644 jjunum PD N/A 9 (null) (Resources)
651511 regular Run_0645 jjunum PD N/A 9 (null) (Resources)
651512 regular Run_0646 jjunum PD N/A 9 (null) (Resources)
651513 regular Run_0647 jjunum PD N/A 9 (null) (Resources)
651514 regular Run_0648 jjunum PD N/A 9 (null) (Resources)
651515 regular Run_0649 jjunum PD N/A 9 (null) (Resources)
651516 regular Run_0650 jjunum PD N/A 9 (null) (Resources)
651517 regular Run_0651 jjunum PD N/A 9 (null) (Resources)
651518 regular Run_0652 jjunum PD N/A 9 (null) (Resources)
651519 regular Run_0653 jjunum PD N/A 9 (null) (Resources)
651520 regular Run_0654 jjunum PD N/A 9 (null) (Resources)
651521 regular Run_0655 jjunum PD N/A 9 (null) (Resources)
651522 regular Run_0656 jjunum PD N/A 9 (null) (Resources)
651523 regular Run_0657 jjunum PD N/A 9 (null) (Resources)
651524 regular Run_0658 jjunum PD N/A 9 (null) (Resources)
651525 regular Run_0659 jjunum PD N/A 9 (null) (Resources)
651526 regular Run_0660 jjunum PD N/A 9 (null) (Resources)
651527 regular Run_0661 jjunum PD N/A 9 (null) (Resources)
651528 regular Run_0662 jjunum PD N/A 9 (null) (Resources)
651529 regular Run_0663 jjunum PD N/A 9 (null) (Resources)
651530 regular Run_0664 jjunum PD N/A 9 (null) (Resources)
651531 regular Run_0665 jjunum PD N/A 9 (null) (Resources)
651532 regular Run_0666 jjunum PD N/A 9 (null) (Resources)
651533 regular Run_0667 jjunum PD N/A 9 (null) (Resources)
651534 regular Run_0668 jjunum PD N/A 9 (null) (Resources)
651535 regular Run_0669 jjunum PD N/A 9 (null) (Resources)
651536 regular Run_0670 jjunum PD N/A 9 (null) (Resources)
651537 regular Run_0671 jjunum PD N/A 9 (null) (Resources)
651538 regular Run_0672 jjunum PD N/A 9 (null) (Resources)
651539 regular Run_0673 jjunum PD N/A 9 (null) (Resources)
651540 regular Run_0674 jjunum PD N/A 9 (null) (Resources)
651541 regular Run_0675 jjunum PD N/A 9 (null) (Resources)
651542 regular Run_0676 jjunum PD N/A 9 (null) (Resources)
651543 regular Run_0677 jjunum PD N/A 9 (null) (Resources)
651544 regular Run_0678 jjunum PD N/A 9 (null) (Resources)
651545 regular Run_0679 jjunum PD N/A 9 (null) (Resources)
651546 regular Run_0680 jjunum PD N/A 9 (null) (Resources)
651547 regular Run_0681 jjunum PD N/A 9 (null) (Resources)
328241 regular D23_re wangw PD 2015-12-30T10:54:00 64 nid0[2070-2111,2128- (Resources)
651948 regular GENE nbonan PD N/A 255 (null) (Resources)
651381 regular NEB4 mgsensoy PD 2015-12-30T10:54:00 80 nid0[0210-0211,0216- (Resources)
528855 regular run.nimr pankin PD N/A 258 (null) (Resources)
644338 regular Run_0579 gmcfarq PD 2015-12-30T10:32:05 9 nid00[598-599,920-92 (Resources)
644339 regular Run_0580 gmcfarq PD N/A 9 (null) (Resources)
644340 regular Run_0581 gmcfarq PD N/A 9 (null) (Resources)
644341 regular Run_0582 gmcfarq PD N/A 9 (null) (Resources)
644342 regular Run_0583 gmcfarq PD N/A 9 (null) (Resources)
644343 regular Run_0584 gmcfarq PD N/A 9 (null) (Resources)
644344 regular Run_0585 gmcfarq PD N/A 9 (null) (Resources)
644345 regular Run_0586 gmcfarq PD N/A 9 (null) (Resources)
644346 regular Run_0587 gmcfarq PD N/A 9 (null) (Resources)
644347 regular Run_0588 gmcfarq PD N/A 9 (null) (Resources)
644348 regular Run_0589 gmcfarq PD N/A 9 (null) (Resources)
649026 regular n_1120 heidih PD 2015-12-30T10:54:00 35 nid0[0571,0596-0597, (Resources)
649033 regular n_960 heidih PD N/A 30 (null) (Resources)
644382 regular Run_0592 jjunum PD 2015-12-30T10:32:05 9 nid00[891-895,916-91 (Resources)
644384 regular Run_0594 jjunum PD N/A 9 (null) (Resources)
644385 regular Run_0595 jjunum PD N/A 9 (null) (Resources)
644386 regular Run_0596 jjunum PD N/A 9 (null) (Resources)
644387 regular Run_0597 jjunum PD N/A 9 (null) (Resources)
644388 regular Run_0598 jjunum PD N/A 9 (null) (Resources)
644389 regular Run_0599 jjunum PD N/A 9 (null) (Resources)
644390 regular Run_0600 jjunum PD N/A 9 (null) (Resources)
644391 regular Run_0601 jjunum PD N/A 9 (null) (Resources)
644392 regular Run_0602 jjunum PD N/A 9 (null) (Resources)
644393 regular Run_0603 jjunum PD N/A 9 (null) (Resources)
644394 regular Run_0604 jjunum PD N/A 9 (null) (Resources)
644395 regular Run_0605 jjunum PD N/A 9 (null) (Resources)
644396 regular Run_0606 jjunum PD N/A 9 (null) (Resources)
644397 regular Run_0607 jjunum PD N/A 9 (null) (Resources)
644398 regular Run_0608 jjunum PD N/A 9 (null) (Resources)
644399 regular Run_0609 jjunum PD N/A 9 (null) (Resources)
644400 regular Run_0610 jjunum PD N/A 9 (null) (Resources)
644401 regular Run_0611 jjunum PD N/A 9 (null) (Resources)
644402 regular Run_0612 jjunum PD N/A 9 (null) (Resources)
644403 regular Run_0613 jjunum PD N/A 9 (null) (Resources)
644404 regular Run_0614 jjunum PD N/A 9 (null) (Resources)
644405 regular Run_0615 jjunum PD N/A 9 (null) (Resources)
644406 regular Run_0616 jjunum PD N/A 9 (null) (Resources)
644407 regular Run_0617 jjunum PD N/A 9 (null) (Resources)
644408 regular Run_0618 jjunum PD N/A 9 (null) (Resources)
502782 regular P2228_NT mhumbert PD N/A 100 (null) (Resources)
502780 regular C4C1PIP_ mhumbert PD N/A 100 (null) (Resources)
502776 regular C4C1IM_N mhumbert PD N/A 100 (null) (Resources)
502778 regular C4C1IM_O mhumbert PD N/A 100 (null) (Resources)
502775 regular C4C1IM_4 mhumbert PD 2015-12-30T10:54:00 100 nid0[0223-0225,0233, (Resources)
643870 regular GDB5L_L bzhu PD N/A 32 (null) (Resources)
643871 regular GDB5L_L bzhu PD N/A 32 (null) (Resources)
643872 regular GDB5L_L bzhu PD N/A 32 (null) (Resources)
643856 regular GDB5L_L bzhu PD N/A 32 (null) (Resources)
643848 regular GDB5L_L bzhu PD 2015-12-30T10:01:15 32 nid0[0358-0359,0494, (Resources)
535908 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535909 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535910 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535915 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535916 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535918 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535919 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535920 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535921 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535922 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535923 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535924 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535925 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535926 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535927 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535928 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535929 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535930 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535403 regular heavy_me smeinel PD 2015-12-30T07:02:09 32 nid0[0229-0230,0235- (Resources)
535404 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535405 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535406 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535407 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535408 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535409 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535410 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535411 regular heavy_me smeinel PD N/A 32 (null) (Resources)
535412 regular heavy_me smeinel PD N/A 32 (null) (Resources)
525578 regular 4kmNCEP_ yuxing PD 2015-12-30T05:20:59 100 nid0[0209,0218,0221- (Resources)
543194 regular rawMPI84 fnrizzi PD N/A 831 (null) (Resources)
543196 regular rawMPI98 fnrizzi PD N/A 1050 (null) (Resources)
551687 regular Cu_45_4 ayonge PD N/A 2 (null) (Resources)
551665 regular Cu_45_3_ ayonge PD N/A 2 (null) (Resources)
551474 regular Cu_45_2_ ayonge PD N/A 2 (null) (Resources)
551232 regular Ag_45_4_ ayonge PD N/A 2 (null) (Resources)
551159 regular Ag_45_3_ ayonge PD N/A 2 (null) (Resources)
550867 regular Ag_45_2_ ayonge PD N/A 2 (null) (Resources)
549905 regular Cu_44_2_ ayonge PD N/A 2 (null) (Resources)
549830 regular Cu_O_2_h ayonge PD N/A 2 (null) (Resources)
549695 regular Cu_43_on ayonge PD N/A 2 (null) (Resources)
549460 regular Ag_43_on ayonge PD N/A 2 (null) (Resources)
547747 regular Cu_42_on ayonge PD N/A 2 (null) (Resources)
547677 regular Ag_42_on ayonge PD N/A 2 (null) (Resources)
547186 regular Cu_41_on ayonge PD N/A 2 (null) (Resources)
547112 regular Ag_41_on ayonge PD N/A 2 (null) (Resources)
546845 regular Cu_40_4 ayonge PD N/A 2 (null) (Resources)
546818 regular Cu_40_3 ayonge PD N/A 2 (null) (Resources)
582336 regular eX10B-x1 jmay PD 2015-12-30T10:54:00 400 nid0[0252-0255,0272- (Resources)
544800 regular Cu_40_2 ayonge PD N/A 2 (null) (Resources)
544793 regular Ag_40_4 ayonge PD N/A 2 (null) (Resources)
544789 regular Ag_h-3 ayonge PD N/A 2 (null) (Resources)
544783 regular Ag_h_1 ayonge PD N/A 2 (null) (Resources)
544765 regular Cu_39_4 ayonge PD N/A 2 (null) (Resources)
544761 regular Cu_39_3 ayonge PD N/A 2 (null) (Resources)
544756 regular Cu_39_2 ayonge PD N/A 2 (null) (Resources)
544753 regular Cu_39_1 ayonge PD N/A 2 (null) (Resources)
544748 regular Ag_39_4 ayonge PD N/A 2 (null) (Resources)
544744 regular Ag_39_3_ ayonge PD N/A 2 (null) (Resources)
544740 regular Ag_39_2 ayonge PD N/A 2 (null) (Resources)
544735 regular Ag_39_1_ ayonge PD N/A 2 (null) (Resources)
544594 regular Cu_36_4 ayonge PD N/A 2 (null) (Resources)
544586 regular Cu_36_3 ayonge PD N/A 2 (null) (Resources)
544566 regular Ag_36_4_ ayonge PD N/A 2 (null) (Resources)
544561 regular Ag_36_3_ ayonge PD N/A 2 (null) (Resources)
544552 regular Cu_36_2 ayonge PD N/A 2 (null) (Resources)
544517 regular Ag_36_2_ ayonge PD N/A 2 (null) (Resources)
544494 regular Cu_35_5 ayonge PD N/A 2 (null) (Resources)
544486 regular Cu_35_3 ayonge PD N/A 2 (null) (Resources)
544487 regular Cu_35_4 ayonge PD N/A 2 (null) (Resources)
544481 regular Cu_35_2 ayonge PD N/A 2 (null) (Resources)
544474 regular Cu_35_1 ayonge PD N/A 2 (null) (Resources)
544463 regular Ag_35_5 ayonge PD N/A 2 (null) (Resources)
544457 regular Ag_35_4 ayonge PD N/A 2 (null) (Resources)
544439 regular Ag_35_3 ayonge PD N/A 2 (null) (Resources)
544432 regular Ag_35_2 ayonge PD N/A 2 (null) (Resources)
544425 regular Ag_35_1 ayonge PD N/A 2 (null) (Resources)
543284 regular Cu_33_1_ ayonge PD N/A 2 (null) (Resources)
543266 regular Ag_33_1_ ayonge PD N/A 2 (null) (Resources)
543251 regular Cu_32_6 ayonge PD N/A 2 (null) (Resources)
543240 regular Cu_32_5 ayonge PD N/A 2 (null) (Resources)
543230 regular Cu_32_4 ayonge PD N/A 2 (null) (Resources)
543220 regular Ag_32_6_ ayonge PD N/A 2 (null) (Resources)
543213 regular Ag_32_5_ ayonge PD N/A 2 (null) (Resources)
543203 regular Ag_32_4_ ayonge PD N/A 2 (null) (Resources)
543109 regular Cu_32_1_ ayonge PD N/A 2 (null) (Resources)
543107 regular Cu_32_2_ ayonge PD N/A 2 (null) (Resources)
543100 regular Cu_32_1_ ayonge PD N/A 2 (null) (Resources)
543088 regular Ag_32_2_ ayonge PD N/A 2 (null) (Resources)
543077 regular Ag_32_1_ ayonge PD N/A 2 (null) (Resources)
543034 regular Cu_30_2_ ayonge PD N/A 2 (null) (Resources)
543033 regular Cu_30_1_ ayonge PD N/A 2 (null) (Resources)
543027 regular Ag_30_2_ ayonge PD N/A 2 (null) (Resources)
543013 regular Ag_30_1_ ayonge PD N/A 2 (null) (Resources)
542968 regular Cu_29_2_ ayonge PD N/A 2 (null) (Resources)
542962 regular Ag_29_2_ ayonge PD N/A 2 (null) (Resources)
542917 regular Cu_28_3_ ayonge PD N/A 2 (null) (Resources)
542845 regular Cu_28_2_ ayonge PD N/A 2 (null) (Resources)
542843 regular Cu_28_1_ ayonge PD N/A 2 (null) (Resources)
542807 regular Ag_28_3_ ayonge PD N/A 2 (null) (Resources)
542800 regular Ag_28_2_ ayonge PD N/A 2 (null) (Resources)
542793 regular Ag_28_1_ ayonge PD 2015-12-29T14:39:43 2 nid0[1840,1960] (Resources)
478346 regular run.nimr pankin PD 2015-12-30T01:20:08 258 nid0[0208,0219-0220, (Resources)
549228 regular big_job berkowit PD N/A 512 (null) (Resources)
549229 regular big_job berkowit PD N/A 512 (null) (Resources)
549230 regular big_job berkowit PD N/A 512 (null) (Resources)
549231 regular big_job berkowit PD N/A 512 (null) (Resources)
549232 regular big_job berkowit PD N/A 512 (null) (Resources)
549233 regular big_job berkowit PD N/A 512 (null) (Resources)
549234 regular big_job berkowit PD N/A 512 (null) (Resources)
549235 regular big_job berkowit PD N/A 512 (null) (Resources)
549236 regular big_job berkowit PD N/A 512 (null) (Resources)
549237 regular big_job berkowit PD N/A 512 (null) (Resources)
549238 regular big_job berkowit PD N/A 512 (null) (Resources)
549239 regular big_job berkowit PD N/A 512 (null) (Resources)
549240 regular big_job berkowit PD N/A 512 (null) (Resources)
549241 regular big_job berkowit PD N/A 512 (null) (Resources)
549242 regular big_job berkowit PD N/A 512 (null) (Resources)
549243 regular big_job berkowit PD N/A 512 (null) (Resources)
549244 regular big_job berkowit PD N/A 512 (null) (Resources)
549245 regular big_job berkowit PD N/A 512 (null) (Resources)
549246 regular big_job berkowit PD N/A 512 (null) (Resources)
549247 regular big_job berkowit PD N/A 512 (null) (Resources)
549248 regular big_job berkowit PD N/A 512 (null) (Resources)
549249 regular big_job berkowit PD N/A 512 (null) (Resources)
549250 regular big_job berkowit PD N/A 512 (null) (Resources)
549251 regular big_job berkowit PD N/A 512 (null) (Resources)
549252 regular big_job berkowit PD N/A 512 (null) (Resources)
549253 regular big_job berkowit PD N/A 512 (null) (Resources)
549254 regular big_job berkowit PD N/A 512 (null) (Resources)
549255 regular big_job berkowit PD N/A 512 (null) (Resources)
549256 regular big_job berkowit PD N/A 512 (null) (Resources)
549257 regular big_job berkowit PD N/A 512 (null) (Resources)
549258 regular big_job berkowit PD N/A 512 (null) (Resources)
549259 regular big_job berkowit PD N/A 512 (null) (Resources)
549260 regular big_job berkowit PD N/A 512 (null) (Resources)
549261 regular big_job berkowit PD N/A 512 (null) (Resources)
549262 regular big_job berkowit PD N/A 512 (null) (Resources)
549263 regular big_job berkowit PD N/A 512 (null) (Resources)
549264 regular big_job berkowit PD N/A 512 (null) (Resources)
549265 regular big_job berkowit PD N/A 512 (null) (Resources)
549266 regular big_job berkowit PD N/A 512 (null) (Resources)
549267 regular big_job berkowit PD N/A 512 (null) (Resources)
549268 regular big_job berkowit PD N/A 512 (null) (Resources)
549269 regular big_job berkowit PD N/A 512 (null) (Resources)
549270 regular big_job berkowit PD N/A 512 (null) (Resources)
549271 regular big_job berkowit PD N/A 512 (null) (Resources)
549272 regular big_job berkowit PD N/A 512 (null) (Resources)
549273 regular big_job berkowit PD N/A 512 (null) (Resources)
549274 regular big_job berkowit PD N/A 512 (null) (Resources)
549275 regular big_job berkowit PD N/A 512 (null) (Resources)
549276 regular big_job berkowit PD N/A 512 (null) (Resources)
549277 regular big_job berkowit PD N/A 512 (null) (Resources)
549278 regular big_job berkowit PD N/A 512 (null) (Resources)
549279 regular big_job berkowit PD N/A 512 (null) (Resources)
549280 regular big_job berkowit PD N/A 512 (null) (Resources)
549281 regular big_job berkowit PD N/A 512 (null) (Resources)
549282 regular big_job berkowit PD N/A 512 (null) (Resources)
549283 regular big_job berkowit PD N/A 512 (null) (Resources)
549284 regular big_job berkowit PD N/A 512 (null) (Resources)
549285 regular big_job berkowit PD N/A 512 (null) (Resources)
549286 regular big_job berkowit PD N/A 512 (null) (Resources)
549287 regular big_job berkowit PD N/A 512 (null) (Resources)
549288 regular big_job berkowit PD N/A 512 (null) (Resources)
549289 regular big_job berkowit PD N/A 512 (null) (Resources)
549290 regular big_job berkowit PD N/A 512 (null) (Resources)
549291 regular big_job berkowit PD N/A 512 (null) (Resources)
545556 regular big_job berkowit PD 2015-12-29T22:54:00 512 nid0[0223-0225,0233, (Resources)
545557 regular big_job berkowit PD N/A 512 (null) (Resources)
545558 regular big_job berkowit PD N/A 512 (null) (Resources)
545559 regular big_job berkowit PD N/A 512 (null) (Resources)
545560 regular big_job berkowit PD N/A 512 (null) (Resources)
545561 regular big_job berkowit PD N/A 512 (null) (Resources)
545562 regular big_job berkowit PD N/A 512 (null) (Resources)
545563 regular big_job berkowit PD N/A 512 (null) (Resources)
545564 regular big_job berkowit PD N/A 512 (null) (Resources)
545565 regular big_job berkowit PD N/A 512 (null) (Resources)
545566 regular big_job berkowit PD N/A 512 (null) (Resources)
545567 regular big_job berkowit PD N/A 512 (null) (Resources)
545568 regular big_job berkowit PD N/A 512 (null) (Resources)
545569 regular big_job berkowit PD N/A 512 (null) (Resources)
545570 regular big_job berkowit PD N/A 512 (null) (Resources)
545571 regular big_job berkowit PD N/A 512 (null) (Resources)
545572 regular big_job berkowit PD N/A 512 (null) (Resources)
545573 regular big_job berkowit PD N/A 512 (null) (Resources)
545574 regular big_job berkowit PD N/A 512 (null) (Resources)
545575 regular big_job berkowit PD N/A 512 (null) (Resources)
545576 regular big_job berkowit PD N/A 512 (null) (Resources)
545577 regular big_job berkowit PD N/A 512 (null) (Resources)
545578 regular big_job berkowit PD N/A 512 (null) (Resources)
545579 regular big_job berkowit PD N/A 512 (null) (Resources)
545580 regular big_job berkowit PD N/A 512 (null) (Resources)
545581 regular big_job berkowit PD N/A 512 (null) (Resources)
545582 regular big_job berkowit PD N/A 512 (null) (Resources)
545583 regular big_job berkowit PD N/A 512 (null) (Resources)
545584 regular big_job berkowit PD N/A 512 (null) (Resources)
545585 regular big_job berkowit PD N/A 512 (null) (Resources)
499409 regular GENE dtold PD 2015-12-29T22:54:00 320 nid0[0210-0211,0216- (Resources)
513398 regular 05 ocs PD 2015-12-29T22:54:00 128 nid0[0252-0255,0272- (Resources)
513399 regular 06 ocs PD N/A 128 (null) (Resources)
513400 regular 07 ocs PD N/A 128 (null) (Resources)
513401 regular 08 ocs PD N/A 128 (null) (Resources)
513402 regular 09 ocs PD N/A 128 (null) (Resources)
513403 regular 10 ocs PD N/A 128 (null) (Resources)
513404 regular 11 ocs PD N/A 128 (null) (Resources)
513405 regular 12 ocs PD N/A 128 (null) (Resources)
513406 regular 13 ocs PD N/A 128 (null) (Resources)
513407 regular 14 ocs PD N/A 128 (null) (Resources)
513408 regular 15 ocs PD N/A 128 (null) (Resources)
509895 regular rawMPI36 fnrizzi PD N/A 1424 (null) (Resources)
509887 regular rawMPI30 fnrizzi PD 2015-12-29T20:17:36 989 nid0[0208,0210-0211, (Resources)
550452 regular ucan2 u1103 PD 2015-12-29T22:24:17 1024 nid0[0208,0210-0211, (Resources)
nid00837:~ # sprio -j 550452
JOBID PRIORITY AGE FAIRSHARE PARTITION QOS
550452 33680 17223 7097 2160 7200
nid00837:~ # sprio -j 509887
JOBID PRIORITY AGE FAIRSHARE PARTITION QOS
509887 33622 20002 4260 2160 7200
nid00837:~ # scontrol show job 550452
JobId=550452 JobName=ucan2
UserId=u1103(1103) GroupId=u1103(1001103)
Priority=33685 Nice=0 Account=m616 QOS=normal_regular_1
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:30:00 TimeMin=N/A
SubmitTime=2015-12-17T14:32:12 EligibleTime=2015-12-17T14:32:12
StartTime=2015-12-29T22:24:17 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=regular AllocNode:Sid=cori10:50398
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=nid0[0208,0210-0211,0216-0217,0223-0225,0232-0234,0237-0238,0241-0248,0252-0255,0272-0279,0294-0319,0336-0341,0350-0352,0360-0363,0365-0368,0376-0383,0408-0412,0414-0447,0464-0467,0472-0493,0495-0502,0504-0511,0532-0535,0537-0539,0541-0548,0559-0571,0574-0575,0596-0597,0600-0604,0606-0607,0629-0636,0663-0666,0668-0669,0681-0688,0701-0703,0720-0767,0788-0789,0800-0802,0817-0821,0827-0830,0848-0851,0861-0863,0865-0884,0990-1016,1042-1083,1085-1087,1104-1128,1143-1148,1150-1151,1172-1184,1190-1215,1232-1235,1240-1247,1270-1278,1301-1324,1328-1343,1364-1366,1379-1385,1405-1407,1424-1438,1440-1471,1488-1492,1498-1535,1556-1565,1567-1599,1616-1619,1624-1663,1684-1688,1690-1696,1704-1717,1719-1727,1748-1789,1808-1823,1880-1913,1946-1952,1954-1957,1970-1983,2000-2003,2008-2009,2016-2019,2027-2031,2034-2047,2068-2111,2128-2151,2156-2167,2170-2172,2174-2175,2192-2195,2197-2216,2219-2239]
NumNodes=1024-1024 NumCPUs=1024 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1024,node=1024
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=craynetwork:1 Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/global/cscratch1/sd/u1103/ducan2/dtest/d32768/run.cori
WorkDir=/global/cscratch1/sd/u1103/ducan2/dtest/d32768
StdErr=/global/cscratch1/sd/u1103/ducan2/dtest/d32768/ucan2-32768.err
StdIn=/dev/null
StdOut=/global/cscratch1/sd/u1103/ducan2/dtest/d32768/ucan2-32768.out
Power= SICP=0
nid00837:~ #
nid00837:~ #
nid00837:~ #
nid00837:~ # scontrol show job 509887
JobId=509887 JobName=rawMPI30x30cori
UserId=fnrizzi(60679) GroupId=fnrizzi(60679)
Priority=33622 Nice=0 Account=m1882 QOS=normal_regular_1
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:40:00 TimeMin=N/A
SubmitTime=2015-12-15T16:18:38 EligibleTime=2015-12-15T16:18:38
StartTime=2015-12-29T20:17:36 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=regular AllocNode:Sid=cori07:43248
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=nid0[0208,0210-0211,0216-0217,0223-0225,0228,0232-0234,0237-0238,0241-0248,0252-0255,0272-0279,0287,0292,0294-0319,0336-0341,0350-0352,0357,0360-0363,0365-0367,0376-0383,0408-0412,0446-0447,0464-0467,0472-0493,0497-0502,0504-0511,0532-0535,0537-0539,0541-0548,0559-0571,0574-0575,0596-0597,0600-0604,0606-0607,0629-0636,0663-0666,0668-0669,0678-0679,0681-0688,0701-0703,0720-0767,0788-0789,0800-0801,0817-0821,0823-0824,0827-0830,0848-0851,0861-0863,0865-0884,0987,0993-1016,1042-1083,1085-1087,1104-1128,1138,1145-1148,1150-1151,1172-1184,1190-1215,1232-1235,1240-1247,1270-1278,1301-1323,1328-1343,1364-1366,1375-1376,1379-1385,1405-1407,1424-1438,1440-1471,1488-1492,1498-1535,1556-1565,1567-1599,1616-1619,1624-1663,1684-1688,1690-1696,1704-1717,1719-1727,1748-1789,1808-1823,1837-1838,1880-1913,1946-1952,1954-1957,1970-1971,1973-1983,2000-2003,2008-2009,2016-2019,2024-2025,2027-2031,2034-2047,2068-2111,2128-2151,2164-2167,2170-2172,2174-2175,2192-2195,2197-2216,2219-2239]
NumNodes=989-989 NumCPUs=989 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=989,node=989
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=craynetwork:1 Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/global/u1/f/fnrizzi/coriRuns/run30x30.cori
WorkDir=/global/u1/f/fnrizzi/coriRuns
StdErr=/global/u1/f/fnrizzi/coriRuns/slurm-509887.out
StdIn=/dev/null
StdOut=/global/u1/f/fnrizzi/coriRuns/slurm-509887.out
Power= SICP=0
nid00837:~ #
I just modified some configs based on our edison experience (set explicit
srun ports, KillOnBadExit, and such) -- basically things that don't involve
this issue. After restarting slurmctld, I brought all the partitions
(including shared) back up, and the same thing happened again.
Really, only job 550452 should be blocked for Resources at this point.
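For context, the edison-style settings mentioned above correspond to slurm.conf entries of roughly this shape (parameter values here are illustrative, not the actual cori values):

```
# Illustrative slurm.conf fragment -- values are placeholders
SrunPortRange=60001-63000   # pin srun to an explicit port range
KillOnBadExit=1             # kill the step if any task exits non-zero
```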
nid00837:~ # squeue --start --sort=Q | grep "Resources"
730556 debug my_job vfung PD N/A
40 (null) (Resources)
739990 debug LiTi weihu PD N/A
20 (null) (Resources)
743246 debug my_job ninghai PD N/A
64 nid00[024-051,062-06 (Resources)
683567 debug I805_315 ppetrov PD N/A
128 (null) (Resources)
743330 debug runner gandolfi PD N/A
32 nid0[0107-0108,0113- (Resources)
736356 shared multi_0. jkretchm PD N/A
1 (null) (Resources)
674260 regular Run_0251 jjunum PD N/A
8 (null) (Resources)
674261 regular Run_0252 jjunum PD N/A
8 (null) (Resources)
674262 regular Run_0253 jjunum PD N/A
8 (null) (Resources)
674263 regular Run_0254 jjunum PD N/A
8 (null) (Resources)
674264 regular Run_0255 jjunum PD N/A
8 (null) (Resources)
674265 regular Run_0256 jjunum PD N/A
8 (null) (Resources)
674266 regular Run_0257 jjunum PD N/A
8 (null) (Resources)
674267 regular Run_0258 jjunum PD N/A
8 (null) (Resources)
674268 regular Run_0259 jjunum PD N/A
8 (null) (Resources)
674269 regular Run_0260 jjunum PD N/A
8 (null) (Resources)
674270 regular Run_0261 jjunum PD N/A
8 (null) (Resources)
674271 regular Run_0262 jjunum PD N/A
8 (null) (Resources)
674272 regular Run_0263 jjunum PD N/A
8 (null) (Resources)
674273 regular Run_0264 jjunum PD N/A
8 (null) (Resources)
674274 regular Run_0265 jjunum PD N/A
8 (null) (Resources)
674275 regular Run_0266 jjunum PD N/A
8 (null) (Resources)
674276 regular Run_0267 jjunum PD N/A
8 (null) (Resources)
674277 regular Run_0268 jjunum PD N/A
8 (null) (Resources)
674278 regular Run_0269 jjunum PD N/A
8 (null) (Resources)
674279 regular Run_0270 jjunum PD N/A
8 (null) (Resources)
674280 regular Run_0271 jjunum PD N/A
8 (null) (Resources)
674281 regular Run_0272 jjunum PD N/A
8 (null) (Resources)
674282 regular Run_0273 jjunum PD N/A
8 (null) (Resources)
674283 regular Run_0274 jjunum PD N/A
8 (null) (Resources)
674284 regular Run_0275 jjunum PD N/A
8 (null) (Resources)
674285 regular Run_0276 jjunum PD N/A
8 (null) (Resources)
674286 regular Run_0277 jjunum PD N/A
8 (null) (Resources)
674287 regular Run_0278 jjunum PD N/A
8 (null) (Resources)
674288 regular Run_0279 jjunum PD N/A
8 (null) (Resources)
674289 regular Run_0280 jjunum PD N/A
8 (null) (Resources)
674290 regular Run_0281 jjunum PD N/A
8 (null) (Resources)
674291 regular Run_0282 jjunum PD N/A
8 (null) (Resources)
674292 regular Run_0283 jjunum PD N/A
8 (null) (Resources)
674293 regular Run_0284 jjunum PD N/A
8 (null) (Resources)
674294 regular Run_0285 jjunum PD N/A
8 (null) (Resources)
674295 regular Run_0286 jjunum PD N/A
8 (null) (Resources)
675149 regular GDB5L_L bzhu PD N/A
32 (null) (Resources)
675150 regular GDB5L_L bzhu PD N/A
32 (null) (Resources)
675151 regular GDB5L_L bzhu PD N/A
32 (null) (Resources)
673738 regular trinity_ jungpyo PD 2015-12-31T02:20:00
1024 nid0[0209-0211,0216- (Resources)
648729 regular mbd_rela farren PD 2015-12-30T14:20:00
128 nid0[0209,0218-0222, (Resources)
673922 regular STw3 drhatch PD 2015-12-30T11:19:27
16 nid0[0231,0413,0501- (Resources)
651551 regular Run_0683 gmcfarq PD N/A
9 (null) (Resources)
651552 regular Run_0684 gmcfarq PD N/A
9 (null) (Resources)
651553 regular Run_0685 gmcfarq PD N/A
9 (null) (Resources)
651554 regular Run_0686 gmcfarq PD N/A
9 (null) (Resources)
651555 regular Run_0687 gmcfarq PD N/A
9 (null) (Resources)
651556 regular Run_0688 gmcfarq PD N/A
9 (null) (Resources)
651557 regular Run_0689 gmcfarq PD N/A
9 (null) (Resources)
651558 regular Run_0690 gmcfarq PD N/A
9 (null) (Resources)
651559 regular Run_0691 gmcfarq PD N/A
9 (null) (Resources)
651560 regular Run_0692 gmcfarq PD N/A
9 (null) (Resources)
651561 regular Run_0693 gmcfarq PD N/A
9 (null) (Resources)
651562 regular Run_0694 gmcfarq PD N/A
9 (null) (Resources)
651563 regular Run_0695 gmcfarq PD N/A
9 (null) (Resources)
651564 regular Run_0696 gmcfarq PD N/A
9 (null) (Resources)
651565 regular Run_0697 gmcfarq PD N/A
9 (null) (Resources)
651566 regular Run_0698 gmcfarq PD N/A
9 (null) (Resources)
651567 regular Run_0699 gmcfarq PD N/A
9 (null) (Resources)
651568 regular Run_0700 gmcfarq PD N/A
9 (null) (Resources)
526702 regular 416 zfliu PD 2015-12-30T13:20:00
360 nid0[0208,0281-0287, (Resources)
672510 regular ITER12MA izzo PD 2015-12-30T10:54:00
22 nid00[446-447,464-46 (Resources)
673750 regular HEAT_UCL chhabra PD N/A
31 (null) (Resources)
669504 regular test mastriko PD 2015-12-30T10:54:00
16 nid00[294-309] (Resources)
673200 regular DCLL2 chhabra PD 2015-12-30T10:54:00
28 nid0[1440-1467] (Resources)
672765 regular ch3nh3pb abdalla PD 2015-12-30T10:54:00
12 nid0[0998-1009] (Resources)
672766 regular ch3nh3pb abdalla PD N/A
12 (null) (Resources)
672767 regular ch3nh3pb abdalla PD N/A
12 (null) (Resources)
672768 regular ch3nh3pb abdalla PD N/A
12 (null) (Resources)
672769 regular ch3nh3pb abdalla PD N/A
12 (null) (Resources)
647441 regular usgsmega chunzhao PD N/A
50 (null) (Resources)
646874 regular htmegan chunzhao PD 2015-12-30T10:54:00
50 nid0[0749,1616-1619, (Resources)
669872 regular zgoubi vranjbar PD 2015-12-30T10:54:00
32 nid00[701-703,720-74 (Resources)
651548 regular Run_0682 jjunum PD N/A
9 (null) (Resources)
651485 regular Run_0619 jjunum PD N/A
9 (null) (Resources)
651486 regular Run_0620 jjunum PD N/A
9 (null) (Resources)
651487 regular Run_0621 jjunum PD N/A
9 (null) (Resources)
651488 regular Run_0622 jjunum PD N/A
9 (null) (Resources)
651489 regular Run_0623 jjunum PD N/A
9 (null) (Resources)
651490 regular Run_0624 jjunum PD N/A
9 (null) (Resources)
651491 regular Run_0625 jjunum PD N/A
9 (null) (Resources)
651492 regular Run_0626 jjunum PD N/A
9 (null) (Resources)
651493 regular Run_0627 jjunum PD N/A
9 (null) (Resources)
651494 regular Run_0628 jjunum PD N/A
9 (null) (Resources)
651495 regular Run_0629 jjunum PD N/A
9 (null) (Resources)
651496 regular Run_0630 jjunum PD N/A
9 (null) (Resources)
651497 regular Run_0631 jjunum PD N/A
9 (null) (Resources)
651498 regular Run_0632 jjunum PD N/A
9 (null) (Resources)
651499 regular Run_0633 jjunum PD N/A
9 (null) (Resources)
651500 regular Run_0634 jjunum PD N/A
9 (null) (Resources)
651501 regular Run_0635 jjunum PD N/A
9 (null) (Resources)
651502 regular Run_0636 jjunum PD N/A
9 (null) (Resources)
651503 regular Run_0637 jjunum PD N/A
9 (null) (Resources)
651504 regular Run_0638 jjunum PD N/A
9 (null) (Resources)
651505 regular Run_0639 jjunum PD N/A
9 (null) (Resources)
651506 regular Run_0640 jjunum PD N/A
9 (null) (Resources)
651507 regular Run_0641 jjunum PD N/A
9 (null) (Resources)
651508 regular Run_0642 jjunum PD N/A
9 (null) (Resources)
651509 regular Run_0643 jjunum PD N/A
9 (null) (Resources)
651510 regular Run_0644 jjunum PD N/A
9 (null) (Resources)
651511 regular Run_0645 jjunum PD N/A
9 (null) (Resources)
651512 regular Run_0646 jjunum PD N/A
9 (null) (Resources)
651513 regular Run_0647 jjunum PD N/A
9 (null) (Resources)
651514 regular Run_0648 jjunum PD N/A
9 (null) (Resources)
651515 regular Run_0649 jjunum PD N/A
9 (null) (Resources)
651516 regular Run_0650 jjunum PD N/A
9 (null) (Resources)
651517 regular Run_0651 jjunum PD N/A
9 (null) (Resources)
651518 regular Run_0652 jjunum PD N/A
9 (null) (Resources)
651519 regular Run_0653 jjunum PD N/A
9 (null) (Resources)
651520 regular Run_0654 jjunum PD N/A
9 (null) (Resources)
651521 regular Run_0655 jjunum PD N/A
9 (null) (Resources)
651522 regular Run_0656 jjunum PD N/A
9 (null) (Resources)
651523 regular Run_0657 jjunum PD N/A
9 (null) (Resources)
651524 regular Run_0658 jjunum PD N/A
9 (null) (Resources)
651525 regular Run_0659 jjunum PD N/A
9 (null) (Resources)
651526 regular Run_0660 jjunum PD N/A
9 (null) (Resources)
651527 regular Run_0661 jjunum PD N/A
9 (null) (Resources)
651528 regular Run_0662 jjunum PD N/A
9 (null) (Resources)
651529 regular Run_0663 jjunum PD N/A
9 (null) (Resources)
651530 regular Run_0664 jjunum PD N/A
9 (null) (Resources)
651531 regular Run_0665 jjunum PD N/A
9 (null) (Resources)
651532 regular Run_0666 jjunum PD N/A
9 (null) (Resources)
651533 regular Run_0667 jjunum PD N/A
9 (null) (Resources)
651534 regular Run_0668 jjunum PD N/A
9 (null) (Resources)
651535 regular Run_0669 jjunum PD N/A
9 (null) (Resources)
651536 regular Run_0670 jjunum PD N/A
9 (null) (Resources)
651537 regular Run_0671 jjunum PD N/A
9 (null) (Resources)
651538 regular Run_0672 jjunum PD N/A
9 (null) (Resources)
651539 regular Run_0673 jjunum PD N/A
9 (null) (Resources)
651540 regular Run_0674 jjunum PD N/A
9 (null) (Resources)
651541 regular Run_0675 jjunum PD N/A
9 (null) (Resources)
651542 regular Run_0676 jjunum PD N/A
9 (null) (Resources)
651543 regular Run_0677 jjunum PD N/A
9 (null) (Resources)
651544 regular Run_0678 jjunum PD N/A
9 (null) (Resources)
651545 regular Run_0679 jjunum PD N/A
9 (null) (Resources)
651546 regular Run_0680 jjunum PD N/A
9 (null) (Resources)
651547 regular Run_0681 jjunum PD N/A
9 (null) (Resources)
328241 regular D23_re wangw PD 2015-12-30T10:54:00
64 nid0[2070-2111,2128- (Resources)
651948 regular GENE nbonan PD N/A
255 (null) (Resources)
651381 regular NEB4 mgsensoy PD 2015-12-30T10:54:00
80 nid0[0210-0211,0216- (Resources)
528855 regular run.nimr pankin PD N/A
258 (null) (Resources)
644338 regular Run_0579 gmcfarq PD 2015-12-30T10:32:05
9 nid00[598-599,920-92 (Resources)
644339 regular Run_0580 gmcfarq PD N/A
9 (null) (Resources)
644340 regular Run_0581 gmcfarq PD N/A
9 (null) (Resources)
644341 regular Run_0582 gmcfarq PD N/A
9 (null) (Resources)
644342 regular Run_0583 gmcfarq PD N/A
9 (null) (Resources)
644343 regular Run_0584 gmcfarq PD N/A
9 (null) (Resources)
644344 regular Run_0585 gmcfarq PD N/A
9 (null) (Resources)
644345 regular Run_0586 gmcfarq PD N/A
9 (null) (Resources)
644346 regular Run_0587 gmcfarq PD N/A
9 (null) (Resources)
644347 regular Run_0588 gmcfarq PD N/A
9 (null) (Resources)
644348 regular Run_0589 gmcfarq PD N/A
9 (null) (Resources)
649026 regular n_1120 heidih PD 2015-12-30T10:54:00
35 nid0[0571,0596-0597, (Resources)
649033 regular n_960 heidih PD N/A
30 (null) (Resources)
644382 regular Run_0592 jjunum PD 2015-12-30T10:32:05
9 nid00[891-895,916-91 (Resources)
644384 regular Run_0594 jjunum PD N/A
9 (null) (Resources)
644385 regular Run_0595 jjunum PD N/A
9 (null) (Resources)
644386 regular Run_0596 jjunum PD N/A
9 (null) (Resources)
644387 regular Run_0597 jjunum PD N/A
9 (null) (Resources)
644388 regular Run_0598 jjunum PD N/A
9 (null) (Resources)
644389 regular Run_0599 jjunum PD N/A
9 (null) (Resources)
644390 regular Run_0600 jjunum PD N/A
9 (null) (Resources)
644391 regular Run_0601 jjunum PD N/A
9 (null) (Resources)
644392 regular Run_0602 jjunum PD N/A
9 (null) (Resources)
644393 regular Run_0603 jjunum PD N/A
9 (null) (Resources)
644394 regular Run_0604 jjunum PD N/A
9 (null) (Resources)
644395 regular Run_0605 jjunum PD N/A
9 (null) (Resources)
644396 regular Run_0606 jjunum PD N/A
9 (null) (Resources)
644397 regular Run_0607 jjunum PD N/A
9 (null) (Resources)
644398 regular Run_0608 jjunum PD N/A
9 (null) (Resources)
644399 regular Run_0609 jjunum PD N/A
9 (null) (Resources)
644400 regular Run_0610 jjunum PD N/A
9 (null) (Resources)
644401 regular Run_0611 jjunum PD N/A
9 (null) (Resources)
644402 regular Run_0612 jjunum PD N/A
9 (null) (Resources)
644403 regular Run_0613 jjunum PD N/A
9 (null) (Resources)
644404 regular Run_0614 jjunum PD N/A
9 (null) (Resources)
644405 regular Run_0615 jjunum PD N/A
9 (null) (Resources)
644406 regular Run_0616 jjunum PD N/A
9 (null) (Resources)
644407 regular Run_0617 jjunum PD N/A
9 (null) (Resources)
644408 regular Run_0618 jjunum PD N/A
9 (null) (Resources)
502782 regular P2228_NT mhumbert PD N/A
100 (null) (Resources)
502780 regular C4C1PIP_ mhumbert PD N/A
100 (null) (Resources)
502776 regular C4C1IM_N mhumbert PD N/A
100 (null) (Resources)
502778 regular C4C1IM_O mhumbert PD N/A
100 (null) (Resources)
502775 regular C4C1IM_4 mhumbert PD 2015-12-30T10:54:00
100 nid0[0223-0225,0233, (Resources)
643870 regular GDB5L_L bzhu PD N/A
32 (null) (Resources)
643871 regular GDB5L_L bzhu PD N/A
32 (null) (Resources)
643872 regular GDB5L_L bzhu PD N/A
32 (null) (Resources)
643856 regular GDB5L_L bzhu PD N/A
32 (null) (Resources)
643848 regular GDB5L_L bzhu PD 2015-12-30T10:01:15
32 nid0[0358-0359,0494, (Resources)
535908 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535909 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535910 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535915 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535916 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535918 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535919 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535920 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535921 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535922 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535923 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535924 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535925 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535926 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535927 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535928 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535929 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535930 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535403 regular heavy_me smeinel PD 2015-12-30T07:02:09
32 nid0[0229-0230,0235- (Resources)
535404 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535405 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535406 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535407 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535408 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535409 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535410 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535411 regular heavy_me smeinel PD N/A
32 (null) (Resources)
535412 regular heavy_me smeinel PD N/A
32 (null) (Resources)
525578 regular 4kmNCEP_ yuxing PD 2015-12-30T05:20:59
100 nid0[0209,0218,0221- (Resources)
543194 regular rawMPI84 fnrizzi PD N/A
831 (null) (Resources)
543196 regular rawMPI98 fnrizzi PD N/A
1050 (null) (Resources)
551687 regular Cu_45_4 ayonge PD N/A
2 (null) (Resources)
551665 regular Cu_45_3_ ayonge PD N/A
2 (null) (Resources)
551474 regular Cu_45_2_ ayonge PD N/A
2 (null) (Resources)
551232 regular Ag_45_4_ ayonge PD N/A
2 (null) (Resources)
551159 regular Ag_45_3_ ayonge PD N/A
2 (null) (Resources)
550867 regular Ag_45_2_ ayonge PD N/A
2 (null) (Resources)
549905 regular Cu_44_2_ ayonge PD N/A
2 (null) (Resources)
549830 regular Cu_O_2_h ayonge PD N/A
2 (null) (Resources)
549695 regular Cu_43_on ayonge PD N/A
2 (null) (Resources)
549460 regular Ag_43_on ayonge PD N/A
2 (null) (Resources)
547747 regular Cu_42_on ayonge PD N/A
2 (null) (Resources)
547677 regular Ag_42_on ayonge PD N/A
2 (null) (Resources)
547186 regular Cu_41_on ayonge PD N/A
2 (null) (Resources)
547112 regular Ag_41_on ayonge PD N/A
2 (null) (Resources)
546845 regular Cu_40_4 ayonge PD N/A
2 (null) (Resources)
546818 regular Cu_40_3 ayonge PD N/A
2 (null) (Resources)
582336 regular eX10B-x1 jmay PD 2015-12-30T10:54:00
400 nid0[0252-0255,0272- (Resources)
544800 regular Cu_40_2 ayonge PD N/A
2 (null) (Resources)
544793 regular Ag_40_4 ayonge PD N/A
2 (null) (Resources)
544789 regular Ag_h-3 ayonge PD N/A
2 (null) (Resources)
544783 regular Ag_h_1 ayonge PD N/A
2 (null) (Resources)
544765 regular Cu_39_4 ayonge PD N/A
2 (null) (Resources)
544761 regular Cu_39_3 ayonge PD N/A
2 (null) (Resources)
544756 regular Cu_39_2 ayonge PD N/A
2 (null) (Resources)
544753 regular Cu_39_1 ayonge PD N/A
2 (null) (Resources)
544748 regular Ag_39_4 ayonge PD N/A
2 (null) (Resources)
544744 regular Ag_39_3_ ayonge PD N/A
2 (null) (Resources)
544740 regular Ag_39_2 ayonge PD N/A
2 (null) (Resources)
544735 regular Ag_39_1_ ayonge PD N/A
2 (null) (Resources)
544594 regular Cu_36_4 ayonge PD N/A
2 (null) (Resources)
544586 regular Cu_36_3 ayonge PD N/A
2 (null) (Resources)
544566 regular Ag_36_4_ ayonge PD N/A
2 (null) (Resources)
544561 regular Ag_36_3_ ayonge PD N/A
2 (null) (Resources)
544552 regular Cu_36_2 ayonge PD N/A
2 (null) (Resources)
544517 regular Ag_36_2_ ayonge PD N/A
2 (null) (Resources)
544494 regular Cu_35_5 ayonge PD N/A
2 (null) (Resources)
544486 regular Cu_35_3 ayonge PD N/A
2 (null) (Resources)
544487 regular Cu_35_4 ayonge PD N/A
2 (null) (Resources)
544481 regular Cu_35_2 ayonge PD N/A
2 (null) (Resources)
544474 regular Cu_35_1 ayonge PD N/A
2 (null) (Resources)
544463 regular Ag_35_5 ayonge PD N/A
2 (null) (Resources)
544457 regular Ag_35_4 ayonge PD N/A
2 (null) (Resources)
544439 regular Ag_35_3 ayonge PD N/A
2 (null) (Resources)
544432 regular Ag_35_2 ayonge PD N/A
2 (null) (Resources)
544425 regular Ag_35_1 ayonge PD N/A
2 (null) (Resources)
543284 regular Cu_33_1_ ayonge PD N/A
2 (null) (Resources)
543266 regular Ag_33_1_ ayonge PD N/A
2 (null) (Resources)
543251 regular Cu_32_6 ayonge PD N/A
2 (null) (Resources)
543240 regular Cu_32_5 ayonge PD N/A
2 (null) (Resources)
543230 regular Cu_32_4 ayonge PD N/A
2 (null) (Resources)
543220 regular Ag_32_6_ ayonge PD N/A
2 (null) (Resources)
543213 regular Ag_32_5_ ayonge PD N/A
2 (null) (Resources)
543203 regular Ag_32_4_ ayonge PD N/A
2 (null) (Resources)
543109 regular Cu_32_1_ ayonge PD N/A
2 (null) (Resources)
543107 regular Cu_32_2_ ayonge PD N/A
2 (null) (Resources)
543100 regular Cu_32_1_ ayonge PD N/A
2 (null) (Resources)
543088 regular Ag_32_2_ ayonge PD N/A
2 (null) (Resources)
543077 regular Ag_32_1_ ayonge PD N/A
2 (null) (Resources)
543034 regular Cu_30_2_ ayonge PD N/A
2 (null) (Resources)
543033 regular Cu_30_1_ ayonge PD N/A
2 (null) (Resources)
543027 regular Ag_30_2_ ayonge PD N/A
2 (null) (Resources)
543013 regular Ag_30_1_ ayonge PD N/A
2 (null) (Resources)
542968 regular Cu_29_2_ ayonge PD N/A
2 (null) (Resources)
542962 regular Ag_29_2_ ayonge PD N/A
2 (null) (Resources)
542917 regular Cu_28_3_ ayonge PD N/A
2 (null) (Resources)
542845 regular Cu_28_2_ ayonge PD N/A
2 (null) (Resources)
542843 regular Cu_28_1_ ayonge PD N/A
2 (null) (Resources)
542807 regular Ag_28_3_ ayonge PD N/A
2 (null) (Resources)
542800 regular Ag_28_2_ ayonge PD N/A
2 (null) (Resources)
542793 regular Ag_28_1_ ayonge PD 2015-12-29T14:39:43
2 nid0[1840,1960] (Resources)
478346 regular run.nimr pankin PD 2015-12-30T01:20:08
258 nid0[0208,0219-0220, (Resources)
549228 regular big_job berkowit PD N/A
512 (null) (Resources)
549229 regular big_job berkowit PD N/A
512 (null) (Resources)
549230 regular big_job berkowit PD N/A
512 (null) (Resources)
549231 regular big_job berkowit PD N/A
512 (null) (Resources)
549232 regular big_job berkowit PD N/A
512 (null) (Resources)
549233 regular big_job berkowit PD N/A
512 (null) (Resources)
549234 regular big_job berkowit PD N/A
512 (null) (Resources)
549235 regular big_job berkowit PD N/A
512 (null) (Resources)
549236 regular big_job berkowit PD N/A
512 (null) (Resources)
549237 regular big_job berkowit PD N/A
512 (null) (Resources)
549238 regular big_job berkowit PD N/A
512 (null) (Resources)
549239 regular big_job berkowit PD N/A
512 (null) (Resources)
549240 regular big_job berkowit PD N/A
512 (null) (Resources)
549241 regular big_job berkowit PD N/A
512 (null) (Resources)
549242 regular big_job berkowit PD N/A
512 (null) (Resources)
549243 regular big_job berkowit PD N/A
512 (null) (Resources)
549244 regular big_job berkowit PD N/A
512 (null) (Resources)
549245 regular big_job berkowit PD N/A
512 (null) (Resources)
549246 regular big_job berkowit PD N/A
512 (null) (Resources)
549247 regular big_job berkowit PD N/A
512 (null) (Resources)
549248 regular big_job berkowit PD N/A
512 (null) (Resources)
549249 regular big_job berkowit PD N/A
512 (null) (Resources)
549250 regular big_job berkowit PD N/A
512 (null) (Resources)
549251 regular big_job berkowit PD N/A
512 (null) (Resources)
549252 regular big_job berkowit PD N/A
512 (null) (Resources)
549253 regular big_job berkowit PD N/A
512 (null) (Resources)
549254 regular big_job berkowit PD N/A
512 (null) (Resources)
549255 regular big_job berkowit PD N/A
512 (null) (Resources)
549256 regular big_job berkowit PD N/A
512 (null) (Resources)
549257 regular big_job berkowit PD N/A
512 (null) (Resources)
549258 regular big_job berkowit PD N/A
512 (null) (Resources)
549259 regular big_job berkowit PD N/A
512 (null) (Resources)
549260 regular big_job berkowit PD N/A
512 (null) (Resources)
549261 regular big_job berkowit PD N/A
512 (null) (Resources)
549262 regular big_job berkowit PD N/A
512 (null) (Resources)
549263 regular big_job berkowit PD N/A
512 (null) (Resources)
549264 regular big_job berkowit PD N/A
512 (null) (Resources)
549265 regular big_job berkowit PD N/A
512 (null) (Resources)
549266 regular big_job berkowit PD N/A
512 (null) (Resources)
549267 regular big_job berkowit PD N/A
512 (null) (Resources)
549268 regular big_job berkowit PD N/A
512 (null) (Resources)
549269 regular big_job berkowit PD N/A
512 (null) (Resources)
549270 regular big_job berkowit PD N/A
512 (null) (Resources)
549271 regular big_job berkowit PD N/A
512 (null) (Resources)
549272 regular big_job berkowit PD N/A
512 (null) (Resources)
549273 regular big_job berkowit PD N/A
512 (null) (Resources)
549274 regular big_job berkowit PD N/A
512 (null) (Resources)
549275 regular big_job berkowit PD N/A
512 (null) (Resources)
549276 regular big_job berkowit PD N/A
512 (null) (Resources)
549277 regular big_job berkowit PD N/A
512 (null) (Resources)
549278 regular big_job berkowit PD N/A
512 (null) (Resources)
549279 regular big_job berkowit PD N/A
512 (null) (Resources)
549280 regular big_job berkowit PD N/A
512 (null) (Resources)
549281 regular big_job berkowit PD N/A
512 (null) (Resources)
549282 regular big_job berkowit PD N/A
512 (null) (Resources)
549283 regular big_job berkowit PD N/A
512 (null) (Resources)
549284 regular big_job berkowit PD N/A
512 (null) (Resources)
549285 regular big_job berkowit PD N/A
512 (null) (Resources)
549286 regular big_job berkowit PD N/A
512 (null) (Resources)
549287 regular big_job berkowit PD N/A
512 (null) (Resources)
549288 regular big_job berkowit PD N/A
512 (null) (Resources)
549289 regular big_job berkowit PD N/A
512 (null) (Resources)
549290 regular big_job berkowit PD N/A
512 (null) (Resources)
549291 regular big_job berkowit PD N/A
512 (null) (Resources)
545556 regular big_job berkowit PD 2015-12-29T22:54:00
512 nid0[0223-0225,0233, (Resources)
545557 regular big_job berkowit PD N/A
512 (null) (Resources)
545558 regular big_job berkowit PD N/A
512 (null) (Resources)
545559 regular big_job berkowit PD N/A
512 (null) (Resources)
545560 regular big_job berkowit PD N/A
512 (null) (Resources)
545561 regular big_job berkowit PD N/A
512 (null) (Resources)
545562 regular big_job berkowit PD N/A
512 (null) (Resources)
545563 regular big_job berkowit PD N/A
512 (null) (Resources)
545564 regular big_job berkowit PD N/A
512 (null) (Resources)
545565 regular big_job berkowit PD N/A
512 (null) (Resources)
545566 regular big_job berkowit PD N/A
512 (null) (Resources)
545567 regular big_job berkowit PD N/A
512 (null) (Resources)
545568 regular big_job berkowit PD N/A
512 (null) (Resources)
545569 regular big_job berkowit PD N/A
512 (null) (Resources)
545570 regular big_job berkowit PD N/A
512 (null) (Resources)
545571 regular big_job berkowit PD N/A
512 (null) (Resources)
545572 regular big_job berkowit PD N/A
512 (null) (Resources)
545573 regular big_job berkowit PD N/A
512 (null) (Resources)
545574 regular big_job berkowit PD N/A
512 (null) (Resources)
545575 regular big_job berkowit PD N/A
512 (null) (Resources)
545576 regular big_job berkowit PD N/A
512 (null) (Resources)
545577 regular big_job berkowit PD N/A
512 (null) (Resources)
545578 regular big_job berkowit PD N/A
512 (null) (Resources)
545579 regular big_job berkowit PD N/A
512 (null) (Resources)
545580 regular big_job berkowit PD N/A
512 (null) (Resources)
545581 regular big_job berkowit PD N/A
512 (null) (Resources)
545582 regular big_job berkowit PD N/A
512 (null) (Resources)
545583 regular big_job berkowit PD N/A
512 (null) (Resources)
545584 regular big_job berkowit PD N/A
512 (null) (Resources)
545585 regular big_job berkowit PD N/A
512 (null) (Resources)
499409 regular GENE dtold PD 2015-12-29T22:54:00
320 nid0[0210-0211,0216- (Resources)
513398 regular 05 ocs PD 2015-12-29T22:54:00
128 nid0[0252-0255,0272- (Resources)
513399 regular 06 ocs PD N/A
128 (null) (Resources)
513400 regular 07 ocs PD N/A
128 (null) (Resources)
513401 regular 08 ocs PD N/A
128 (null) (Resources)
513402 regular 09 ocs PD N/A
128 (null) (Resources)
513403 regular 10 ocs PD N/A
128 (null) (Resources)
513404 regular 11 ocs PD N/A
128 (null) (Resources)
513405 regular 12 ocs PD N/A
128 (null) (Resources)
513406 regular 13 ocs PD N/A
128 (null) (Resources)
513407 regular 14 ocs PD N/A
128 (null) (Resources)
513408 regular 15 ocs PD N/A
128 (null) (Resources)
509895 regular rawMPI36 fnrizzi PD N/A
1424 (null) (Resources)
509887 regular rawMPI30 fnrizzi PD 2015-12-29T20:17:36
989 nid0[0208,0210-0211, (Resources)
550452 regular ucan2 u1103 PD 2015-12-29T22:24:17
1024 nid0[0208,0210-0211, (Resources)
nid00837:~ # sprio -j 550452
JOBID PRIORITY AGE FAIRSHARE PARTITION QOS
550452 33680 17223 7097 2160 7200
nid00837:~ # sprio -j 509887
JOBID PRIORITY AGE FAIRSHARE PARTITION QOS
509887 33622 20002 4260 2160 7200
nid00837:~ # scontrol show job 550452
JobId=550452 JobName=ucan2
UserId=u1103(1103) GroupId=u1103(1001103)
Priority=33685 Nice=0 Account=m616 QOS=normal_regular_1
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:30:00 TimeMin=N/A
SubmitTime=2015-12-17T14:32:12 EligibleTime=2015-12-17T14:32:12
StartTime=2015-12-29T22:24:17 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=regular AllocNode:Sid=cori10:50398
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
SchedNodeList=nid0[0208,0210-0211,0216-0217,0223-0225,0232-0234,0237-0238,0241-0248,0252-0255,0272-0279,0294-0319,0336-0341,0350-0352,0360-0363,0365-0368,0376-0383,0408-0412,0414-0447,0464-0467,0472-0493,0495-0502,0504-0511,0532-0535,0537-0539,0541-0548,0559-0571,0574-0575,0596-0597,0600-0604,0606-0607,0629-0636,0663-0666,0668-0669,0681-0688,0701-0703,0720-0767,0788-0789,0800-0802,0817-0821,0827-0830,0848-0851,0861-0863,0865-0884,0990-1016,1042-1083,1085-1087,1104-1128,1143-1148,1150-1151,1172-1184,1190-1215,1232-1235,1240-1247,1270-1278,1301-1324,1328-1343,1364-1366,1379-1385,1405-1407,1424-1438,1440-1471,1488-1492,1498-1535,1556-1565,1567-1599,1616-1619,1624-1663,1684-1688,1690-1696,1704-1717,1719-1727,1748-1789,1808-1823,1880-1913,1946-1952,1954-1957,1970-1983,2000-2003,2008-2009,2016-2019,2027-2031,2034-2047,2068-2111,2128-2151,2156-2167,2170-2172,2174-2175,2192-2195,2197-2216,2219-2239]
NumNodes=1024-1024 NumCPUs=1024 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1024,node=1024
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=craynetwork:1 Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/global/cscratch1/sd/u1103/ducan2/dtest/d32768/run.cori
WorkDir=/global/cscratch1/sd/u1103/ducan2/dtest/d32768
StdErr=/global/cscratch1/sd/u1103/ducan2/dtest/d32768/ucan2-32768.err
StdIn=/dev/null
StdOut=/global/cscratch1/sd/u1103/ducan2/dtest/d32768/ucan2-32768.out
Power= SICP=0
nid00837:~ #
nid00837:~ # scontrol show job 509887
JobId=509887 JobName=rawMPI30x30cori
UserId=fnrizzi(60679) GroupId=fnrizzi(60679)
Priority=33622 Nice=0 Account=m1882 QOS=normal_regular_1
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:40:00 TimeMin=N/A
SubmitTime=2015-12-15T16:18:38 EligibleTime=2015-12-15T16:18:38
StartTime=2015-12-29T20:17:36 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=regular AllocNode:Sid=cori07:43248
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
SchedNodeList=nid0[0208,0210-0211,0216-0217,0223-0225,0228,0232-0234,0237-0238,0241-0248,0252-0255,0272-0279,0287,0292,0294-0319,0336-0341,0350-0352,0357,0360-0363,0365-0367,0376-0383,0408-0412,0446-0447,0464-0467,0472-0493,0497-0502,0504-0511,0532-0535,0537-0539,0541-0548,0559-0571,0574-0575,0596-0597,0600-0604,0606-0607,0629-0636,0663-0666,0668-0669,0678-0679,0681-0688,0701-0703,0720-0767,0788-0789,0800-0801,0817-0821,0823-0824,0827-0830,0848-0851,0861-0863,0865-0884,0987,0993-1016,1042-1083,1085-1087,1104-1128,1138,1145-1148,1150-1151,1172-1184,1190-1215,1232-1235,1240-1247,1270-1278,1301-1323,1328-1343,1364-1366,1375-1376,1379-1385,1405-1407,1424-1438,1440-1471,1488-1492,1498-1535,1556-1565,1567-1599,1616-1619,1624-1663,1684-1688,1690-1696,1704-1717,1719-1727,1748-1789,1808-1823,1837-1838,1880-1913,1946-1952,1954-1957,1970-1971,1973-1983,2000-2003,2008-2009,2016-2019,2024-2025,2027-2031,2034-2047,2068-2111,2128-2151,2164-2167,2170-2172,2174-2175,2192-2195,2197-2216,2219-2239]
NumNodes=989-989 NumCPUs=989 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=989,node=989
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=craynetwork:1 Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/global/u1/f/fnrizzi/coriRuns/run30x30.cori
WorkDir=/global/u1/f/fnrizzi/coriRuns
StdErr=/global/u1/f/fnrizzi/coriRuns/slurm-509887.out
StdIn=/dev/null
StdOut=/global/u1/f/fnrizzi/coriRuns/slurm-509887.out
Power= SICP=0
nid00837:~ #
----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov
------------- __o
---------- _ '\<,_
----------(_)/ (_)__________________________
On Tue, Dec 29, 2015 at 12:28 PM, <bugs@schedmd.com> wrote:
> *Comment # 7 <http://bugs.schedmd.com/show_bug.cgi?id=2285#c7> on bug 2285
> <http://bugs.schedmd.com/show_bug.cgi?id=2285> from Tim Wickberg
> <tim@schedmd.com> *
>
> (In reply to Doug Jacobsen from comment #6 <http://bugs.schedmd.com/show_bug.cgi?id=2285#c6>)
> > Hi Tim,
> >
> > Yes, as far as I can tell these jobs are only waiting on nodes, but you did
> > trigger an idea. Does SLURM have some sort of load sensor like GridEngine
> > for determining available memory on nodes? Or does it just dole out the
> > theoretical max of memory as specified in the slurm.conf for the node, and
> > just assume the memory is available? I ask because I can imagine a
> > situation wherein the node doesn't get fully cleaned and there no longer is
> > sufficient memory to be running jobs based on our DefMemPerNode settings.
> > However, I think this is more of an academic point, our "regular" and
> > "debug" partitions only give out the maximum amount of memory we allow, and
> > I don't see any particular group of nodes stagnating.
>
> There's no load-monitoring ala SGE for memory, Slurm schedules based on the
> memory defined for the node versus the total requested when running with
> cons_res and CR_socket_memory. (This avoids any potential over-subscription,
> I've always been suspicious of that behavior on other schedulers. There is a
> way to forcibly over-provision nodes with memory, but we recommend against it
> for obvious reasons.)
> > Actually, this reoccured last night -- I'll see if I can dig up the logs in
> > a few minutes (there are a LOT of logs...)
> >
> > This time, on a whim, I left the shared partition down, which comprises
> > over half our job queue in terms of entry count, and has generated
> > scheduling issues in the past (pathological failures wherein some jerk
> > asking for all the memory on a node but only 1 core would completely block
> > all shared-partition jobs from running, even on nodes the system wasn't
> > planning on running the job on).
> >
> > Anyway, with the shared partition down this issue has not reoccurred - so
> > I'm wondering if this is somehow related to that partition.
>
> What are you trying to do with the shared partition?
>
> I could see Shared=FORCE:32 causing some odd behavior - it does do some load
> monitoring when deciding which nodes to over-subscribe. (Shared=FORCE
> oversubscribes, in your case up to 32x. Hopefully that's what you expect, I
> know the nomenclature behind some of those options isn't obvious - I know I've
> looked at it expecting it to share sockets but still allocate individual cores
> properly which is not what that does.)
>
> Also note that any sharing is per-partition, Slurm will not co-mingle jobs from
> separate partitions within a single node. This may be leading to some of the
> resource contention you're seeing - it looked like you'd only sent squeue for
> regular, but I'm guessing there may have been some relatively large jobs
> pending in shared that could have caused the resources to be reserved awaiting
> a larger job launching in a separate partition.
>
> Before I joined, I'd submitted a feature request to mark nodes as "earmarked"
> or something similar - some mechanism of noting that they aren't "idle" but are
> instead being kept empty in order to launch some future job. I'll see if I
> can get that done for 16.05 to at least help indicate the current status.
> > Regarding your request for performance numbers. I typically see our
> > backfill scheduler cycle around 30s when shared is enabled. I did update
> > the parameters yesterday to start preparing for our production
> > configuration starting on 1/11.
> > SchedulerParameters =
> > no_backup_scheduling,bf_window=10080,bf_resolution=120,
> > bf_max_job_array_resv=20,default_queue_depth=400,bf_max_job_test=6000,
> > bf_max_job_user=1,bf_continue,nohold_on_prolog_fail,kill_invalid_depend
> >
> > With the shared partition down, things seem smoother. Obviously we need to
> > get that back online, but I want to let a few more big jobs run first
> > before signing up for more pain.
>
> That looks fine, I don't see any obvious anomalies. I await further logs when
> available.
>
> ------------------------------
> You are receiving this mail because:
>
> - You reported the bug.
>
>
So, here is something interesting. I don't know if it is related to this Resources issue or not, but it seems that a forced, shared job that requests all the cpus on a node causes a major slowdown in the backfill scheduler. We've had anecdotal evidence of this before, but I've tried to document it below. Note that in the `sdiag | grep "Last cycle"` output below, the first instance is the primary scheduler timing and the second instance is the backfill scheduler timing. You can see that when the shared partition is down it takes about 13s to run the backfill scheduler. When I bring the shared partition up, it takes nearly 90s. When I then put a nasty job requesting all the cpus on a node in the shared partition on hold, the scheduler cycle time is back down to about 18s.

-Doug

nid00837:~ # sinfo -o "%R %a"
PARTITION AVAIL
system up
debug up
regular up
preempt down
realtime up
shared down
nid00837:~ # sdiag | grep "Last cycle"
Last cycle: 26938
Last cycle when: Tue Dec 29 14:25:46 2015
Last cycle: 12758773
nid00837:~ # scontrol update partition=shared state=up
nid00837:~ # sinfo -o "%R %a"
PARTITION AVAIL
system up
debug up
regular up
preempt down
realtime up
shared up
nid00837:~ # sdiag | grep "Last cycle"
Last cycle: 25997
Last cycle when: Tue Dec 29 14:28:40 2015
Last cycle: 90586400
nid00837:~ # scontrol show job 734373
JobId=734373 JobName=AIMD
UserId=rsakidja(55248) GroupId=rsakidja(55248)
Priority=10919 Nice=0 Account=m1491 QOS=normal
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A
SubmitTime=2015-12-28T19:41:46 EligibleTime=2015-12-28T19:41:46
StartTime=Unknown EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=shared AllocNode:Sid=cori08:111015
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1 NumCPUs=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=64,mem=124928,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=1952M MinTmpDiskNode=0
Features=(null) Gres=craynetwork:0 Reservation=(null)
Shared=1 Contiguous=0 Licenses=(null) Network=(null)
Command=/global/cscratch1/sd/rsakidja/CTE/Tb5SiO5/1400K_111/take2/job_stampede
WorkDir=/global/cscratch1/sd/rsakidja/CTE/Tb5SiO5/1400K_111/take2
StdErr=/global/cscratch1/sd/rsakidja/CTE/Tb5SiO5/1400K_111/take2/AIMD.o734373
StdIn=/dev/null
StdOut=/global/cscratch1/sd/rsakidja/CTE/Tb5SiO5/1400K_111/take2/AIMD.o734373
Power= SICP=0
nid00837:~ # scontrol hold 734373
nid00837:~ # sdiag | grep "Last cycle"
Last cycle: 27634
Last cycle when: Tue Dec 29 14:31:17 2015
Last cycle: 18127373
nid00837:~ #

You did warn me that the log would be big...
There are 1096 error messages in there, and that's excluding the
warnings about the slurmd's having a different config value. (You may
want to set NO_CONF_HASH just to keep the logs a bit more manageable
long-term if the slurmd's aren't seeing the same file.)
Most of them look harmless, but it'd still be nice to clean them up at
some point - are you expecting those error messages from your
job_submit.lua plugin? (564 of 'em)
The errors related to gres/craynetwork (518) are possibly more
interesting, I'm looking into those further now.
Aside from those, there are these 14:
> [2015-12-29T13:25:26.039] error: Can't find parent id 751 for assoc 4958, this should never happen.
> [2015-12-29T13:25:26.039] error: Can't find parent id 751 for assoc 4958, this should never happen.
> [2015-12-29T13:25:26.539] error: Can't find parent id 43 for assoc 4959, this should never happen.
> [2015-12-29T13:25:26.539] error: Can't find parent id 43 for assoc 4959, this should never happen.
> [2015-12-29T13:26:03.201] error: cons_res: node nid01146 memory is under-allocated (0-124928) for job 742992
> [2015-12-29T13:26:06.321] error: cons_res: node nid00666 memory is under-allocated (0-124928) for job 743317
> [2015-12-29T13:26:34.457] error: _start_stage_in: setup for job 743319 status:256 response:dwpost - failed client status code %s 409
> [2015-12-29T13:31:53.760] error: slurm_receive_msg: Zero Bytes were transmitted or received
> [2015-12-29T13:31:53.770] error: slurm_receive_msg: Zero Bytes were transmitted or received
> [2015-12-29T13:41:53.746] error: slurm_receive_msg: Zero Bytes were transmitted or received
> [2015-12-29T13:41:53.756] error: slurm_receive_msg: Zero Bytes were transmitted or received
> [2015-12-29T13:43:49.703] error: cons_res: node nid00163 memory is under-allocated (0-124928) for job 743307
> [2015-12-29T13:51:53.754] error: slurm_receive_msg: Zero Bytes were transmitted or received
> [2015-12-29T13:51:53.764] error: slurm_receive_msg: Zero Bytes were transmitted or received
Any idea why those associations can't match to a parent? You might need
to dig into MySQL to sort those out, but I doubt those are related to
your scheduling issues.
Yeah, sorry, the thing from the job submit lua should have been tagged as a
debug message, not an error. I'll fix those next time I update the job submit
filter (for example, to prevent jerks from requesting whole nodes in the
shared partition).

-Doug

Not a problem, just was curious. I started skimming the error messages
first and those popped out.

IIRC, the job_submit plugin may be doing something with the craynetwork
GRES? Is every job in shared still asking for one unit of that GRES,
even though there are only four available per node?

The constant underflow warnings are curious, I'm still digging into why
those may be getting generated that frequently.
The job_submit/cray plugin is adding craynetwork:1 to all jobs if
craynetwork is not specified. Our job_submit.lua script is setting
craynetwork:0 for all jobs in the shared partition.

Regarding the gres, we don't use the weird bind-mounted gres.conf that the
cray wlm_switch init script tries to set up. I disabled all that and simply
have this as my gres.conf:

NodeName=nid0[0024-0063,0080-0083,0088-0127,0148-0191,0208-0211,0216-0255,0272-0319,0336-0383,0408-0447,0464-0467,0472-0511,0532-0575,0596-0639,0656-0703,0720-0767,0788-0831,0848-0851,0856-0895,0916-0959,0980-1023,1040-1087,1104-1151,1172-1215,1232-1235,1240-1279,1300-1343,1364-1407,1424-1471,1488-1535,1556-1599,1616-1619,1624-1663,1684-1727,1748-1791,1808-1855,1872-1919,1940-1983,2000-2003,2008-2047,2068-2111,2128-2175,2192-2239,2256-2303] Name=craynetwork Count=4

-Doug
Hello,

This happened again yesterday morning and again this morning. The full-node
shared partition jobs have been blocked since Tuesday, so they are not involved.

I checked, and a lot of realtime jobs had come in. The realtime partition has
higher partition priority than the rest, and it has access to all the nodes in
the system. There is a group of 8 nodes (that debug, shared, and regular do
not have access to), nid[02256-02263], which have lower weight than all the
other nodes. The idea is that if a realtime job comes in, it will get high
priority in the scheduling algorithm owing to its high partition priority
score, but will most likely be placed on the low-weight nodes that other
partitions are not using. Only if those nodes are fully occupied will it then
move on to consuming idle (possibly resource-reserved) nodes that other
partitions can access.

Yesterday we got 133 realtime job requests, and all scheduled on the
realtime-only nodes (nid[02256-02263]), but I'm wondering if the disruption in
ordering of the jobs is causing this behavior.

As an experiment, I've lowered the partition priority of realtime to match
the other partitions (and have given it slightly more resources, just in
case).

-Doug

Hello,

Just a status update: since making the realtime partition have the same
partition priority as the rest, this issue has not re-occurred, so that type
of configuration change may help track down the issue. I can get away with
realtime not having a higher partition priority for now, but at some point I
would like to get back to that, if possible. I'll continue to monitor this and
let you know if I see any changes.

-Doug

My apologies for some unplanned delay on our end; the holidays and some
other matters have complicated things this week.

I should have asked about the mixed partition priority levels before - I
have seen other cases leading to this type of fragmentation.

The partition priority is evaluated at an entirely different level than
fairshare. It predates the fairshare work in Slurm, and has some implicit
assumptions about system management that are the likely source of the issue
you're seeing: if any jobs are in a higher-priority partition, then we will
schedule those ASAP, regardless of the impact on lower-priority partitions.
As you've seen, this can cause some significant and unexpected impacts on
throughput and utilization.

I will make sure this is made more explicit in the documentation - that,
unless you're looking at a preemption model or similar, setting partition
priorities will likely cause problems when the partitions overlap. At the
very least we should warn about this with the PriorityWeightPartition
setting.

Using the new PartitionQOS settings + TRES should allow you to reweight
different aspects of the various partitions, while still subjecting
everything to the normal fairshare operation.

Hi Tim,

I'll go ahead and re-work the priorities so that the realtime jobs get the
highest possible priority via the QOS rather than via the partition method.

I do think, however, it is odd that this should result in jobs in the
lower-priority partitions incorrectly getting assigned a "Resources" reason,
and it probably still represents somewhat of a bug. For the meantime, however,
structuring our priorities using just the QOS's will probably be easier. The
main use of fairshare will go away for NERSC on January 11, when we enter full
production (we're in "free" time mode right now, and so are using fairshare as
a way to weight priorities towards users that didn't make full use of their
allocation last year).

Thank you for looking at this and getting back to me,
Doug
This problem does _not_ appear related to your Slurm upgrade. What I'm seeing is that Slurm logic fails to reset the job's reason to "Priority" if the initial state is "Resources". I see similar flawed logic in a couple of places. Here is a portion of the flawed logic in src/slurmctld/job_scheduler.c:
} else if (_failed_partition(job_ptr->part_ptr, failed_parts,
failed_part_cnt)) {
if ((job_ptr->state_reason == WAIT_NODE_NOT_AVAIL) ||
(job_ptr->state_reason == WAIT_NO_REASON)) {
job_ptr->state_reason = WAIT_PRIORITY;
xfree(job_ptr->state_desc);
last_job_update = now;
}
I believe the job's "reason" should be getting reset from most, if not all, initial "reason" states. I don't believe this is causing any jobs to not be scheduled when they should be, but it is clearly a confusing situation.
Created attachment 2569 [details]
Fix for v15.08.5
Here is the commit for version 15.08.7, likely to be released mid-January:
https://github.com/SchedMD/slurm/commit/65bb07dc13065c245e2aa02f9efc6eedda7d236b

I can't see bug 2300 (access denied).

Thanks for getting to the bottom of this Tim and Moe!

-Doug

Hello,

This issue just occurred on edison when we started the "regularx" partition -
a partition which has access to ALL resources, to allow full-scale jobs to
run. There is a top-priority job in regularx, and so far it seems to be
scheduling well. Unfortunately, almost all jobs in the system now have a
reason of (Resources). Scheduling appears to be proceeding correctly, but it
is hard from a user perspective to understand what is going on.

I applied the patch from this bug and restarted slurmctld. All jobs still
have reason (Resources). What information can I provide that might help?

-Doug
Created attachment 2546 [details]
cori slurm.conf

Hello,

Since upgrading to 15.08.5 on cori, I've been observing occasional instances
where many hundreds of jobs are blocked on reason "Resources", instead of the
two or three corresponding to the partial segmentation of the cori system
(one each for the regular, debug, and shared partitions). When so many jobs
are apparently getting nodes reserved for resources, the system starts
becoming more idle than necessary.

I have not been able to identify the conditions that lead to this behavior,
but setting the partition to down, allowing a scheduling cycle to complete,
then setting the partition back up seems to temporarily correct the issue.
This has occurred four times since the 23rd, twice today.

I'll try to collect more information from the logs that might be of use, but
the lab is in shutdown right now so I have limited time available to me. The
current slurm.conf for cori is attached.

-Doug