Created attachment 2619 [details]
slurmctld.log for this morning

Hello,

I've seen this issue on cori as well, but edison has a much simpler queue, so I'm hoping it will be easier to understand what is going on right now. Yesterday someone submitted a large job under our "premium" QOS. It became the top-priority job and was scheduled to run at 6:15PM last night, then 8:30PM, then 11:30PM. At that point I manually increased its priority to 2000000, a value much higher than anything our multifactor priority plugin configuration can produce without manual intervention. It is definitely the highest-priority job (by job priority) in the system, and no newly submitted job can beat it.

We do not use partition priorities, only QOS. The job submit filter does, however, often change a user-requested QOS to a "real" QOS; our job_submit.lua does this to assign priority based on a couple of different factors (node count, customer priority request, customer standing, etc.).

Next the job was delayed to 3:30AM, then 7:30AM, then 8:30AM. At that point I rotated the slurmctld.log (attached) and enabled the backfill DebugFlags. Now it is scheduled to start at 11:05AM.

The problem, beyond the job not starting, is that many nodes (>1000) are being kept idle, presumably to start this job. The job requires 4096 nodes, which is slightly less than 75% of the system and 77% of the partition it was submitted to (our "regular" partition). Only one node is down, and I haven't seen issues with nodes spending a long time in the completing state.

Job 27432 is the target job here, with priority 2000000. Looking at the logs, it appears that until about 5:05AM there are some minor adjustments forward or backward in the schedule for this job, but mostly it is set to start at 8:30:12. At 5:05AM, job 32122, a newly submitted job with priority 17368, is started by the backfill scheduler. After that the schedule gets delayed further.
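To quantify the drift of the projected start time, a small helper can pull each "Job ... to start at" report for the job out of the log and count how often each value appears (a sketch; `drift_report` is a hypothetical name, and it assumes the log line format produced with the backfill DebugFlags enabled, as quoted below):

```shell
# Extract every projected start time reported for job 27432 and count
# how many times each value was logged; a changing list shows the
# backfill scheduler repeatedly pushing the job back.
drift_report() {
    grep -o 'Job 27432 to start at [^,]*' "$1" | sort | uniq -c
}
```

Run as, e.g., `drift_report /var/tmp/slurm/slurmctld.log`.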
I'm attaching the slurm.conf and the slurmctld.log from this morning up to this point. I really appreciate any help or advice you can give here. Thanks, Doug nid01605:/var/tmp/slurm # cat slurmctld.log | egrep 'Started|Allocate|Job 27432 to start' ... ... [2016-01-15T05:03:41.001] Job 27432 to start at 2016-01-15T08:30:12, end at 2016-01-15T09:00:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0542-0554,0579,0584-0605,0610-0625,0627-0648,0651-0654,0667-0703,0746-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0947-0956,1001-1034,1093-1133,1213-1219,1224-1226,1239-1279,1284,1287-1535,1540-1603,1608-1663,1668-1804,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2153,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2465,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4247,4305-4538,4541-4607,4992-5100,5103-5191,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5531-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6141] [2016-01-15T05:04:14.884] sched: Allocate JobID=32121 NodeList=nid00[102-127,136-173] #CPUs=3072 [2016-01-15T05:04:21.710] Job 27432 to start at 2016-01-15T08:30:12, end at 2016-01-15T09:00:00 on 
nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0542-0554,0579,0584-0605,0610-0625,0627-0648,0651-0654,0667-0703,0746-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0947-0956,1001-1034,1093-1133,1213-1219,1224-1226,1239-1279,1284,1287-1535,1540-1603,1608-1663,1668-1804,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2153,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2465,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4247,4305-4538,4541-4607,4992-5100,5103-5191,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5531-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6141] [2016-01-15T05:04:22.258] backfill: Started JobId=27724 in regular on nid0[0655,2593] [2016-01-15T05:05:03.761] Job 27432 to start at 2016-01-15T08:30:12, end at 2016-01-15T09:00:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0542-0554,0579,0584-0605,0610-0625,0627-0648,0651-0654,0667-0703,0746-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0947-0956,1001-1034,1093-1133,1213-1219,1224-1226,1239-1279,1284,1287-1535,1540-1603,1608-1663,1668-1804,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2153,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2465,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4247,4305-4538,4541-4607,4992-5100,5103-5191,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5531-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6141] [2016-01-15T05:05:10.687] backfill: Started JobId=32122 in regular on nid00[927-928] [2016-01-15T05:05:46.890] Job 27432 to start at 2016-01-15T11:35:19, end at 2016-01-15T12:04:00 
on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0575-0576,0579,0584-0605,0610-0649,0651-0654,0667-0703,0732-0735,0790-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1193-1201,1213-1219,1224-1226,1285-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4229-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:05:47.537] backfill: Started JobId=27727 in regular on nid0[1239-1240] [2016-01-15T05:05:54.315] backfill: Started JobId=32118 in regular on nid0[1241-1272] [2016-01-15T05:06:27.303] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:07:07.891] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on 
nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:07:48.391] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:08:28.802] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on 
nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:09:09.217] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:09:50.188] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on 
nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:10:31.168] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:11:12.142] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on 
nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:11:53.159] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:12:34.044] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on 
nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:13:14.310] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:13:24.790] sched: Allocate JobID=32124 NodeList=nid000[14-63,72-85] #CPUs=3072 [2016-01-15T05:13:53.769] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on 
nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:13:54.273] backfill: Started JobId=27728 in regular on nid00[362-363] [2016-01-15T05:13:54.316] backfill: Started JobId=27730 in regular on nid0[0368,4167] [2016-01-15T05:14:33.963] Job 27432 to start at 2016-01-15T11:35:19, end at 2016-01-15T12:04:00 on 
nid0[0311-0323,0328-0339,0342-0357,0366-0367,0426-0451,0456-0461,0480-0491,0540-0554,0575-0576,0579,0584-0605,0610-0649,0651-0654,0664,0667-0703,0732-0735,0744-0767,0772-0790,0797-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,0993,1001-1034,1093-1133,1193-1201,1213-1219,1224-1226,1235-1236,1241-1279,1284-1535,1540-1603,1608-1663,1668-1756,1766-1772,1777-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2047,2052-2155,2185-2197,2243-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2474,2525-2572,2594-2595,2603-2687,2692-2734,2766-2815,2820-2894,2909,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3622,3632-3645,3654-3670,3673-3817,3825-3839,3844-3856,3861-3899,3920-3927,3954-4139,4186-4223,4229-4249,4303-4310,4316-4538,4541-4607,4992-5100,5103-5112,5124-5193,5199-5218,5229-5230,5239-5244,5326-5443,5448-5478,5489-5498,5529-5539,5556-5557,5567-5590,5598-5599,5602-5609,5619-5661,5680-5693,5701-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:14:37.287] backfill: Started JobId=31395 in regular on nid0[1001-1032] [2016-01-15T05:14:37.352] backfill: Started JobId=31455 in regular on nid0[2243-2253] [2016-01-15T05:14:37.416] backfill: Started JobId=31616 in regular on nid0[0886-0895,0900-0919,2525-2572] [2016-01-15T05:14:37.475] backfill: Started JobId=31692 in regular on nid0[0667-0703,0746-0767,0772-0789,1093-1116,1777-1804,2440-2465,3825-3839,3844-3856,5531-5539] [2016-01-15T05:14:37.535] backfill: Started JobId=31695 in regular on nid0[0342-0353,0542-0554,0579,0584-0605,0627-0648,2185-2197,2314-2330,3632-3645,3806-3817,3954-3964,4186-4202,5567-5590,5680-5693] [2016-01-15T05:14:37.592] backfill: Started JobId=31696 in regular on nid0[1127-1133,4213-4223,5428-5439] [2016-01-15T05:14:38.073] backfill: Started JobId=31918 in regular on nid00[610-619] [2016-01-15T05:14:38.130] backfill: Started JobId=32039 in regular on nid0[0947-0956,5103-5112] [2016-01-15T05:14:38.948] backfill: Started JobId=32101 in regular on 
nid0[0651,5635-5644,5701-5709] [2016-01-15T05:14:39.046] backfill: Started JobId=32103 in regular on nid0[1849-1898] [2016-01-15T05:14:39.103] backfill: Started JobId=32104 in regular on nid0[0823-0830,0931-0934,1273-1279,1284] [2016-01-15T05:14:39.199] backfill: Started JobId=32106 in regular on nid0[3195-3210] [2016-01-15T05:14:39.297] backfill: Started JobId=32107 in regular on nid0[1734-1749] [2016-01-15T05:14:39.392] backfill: Started JobId=32108 in regular on nid00[311-323,328-330] [2016-01-15T05:14:39.489] backfill: Started JobId=32109 in regular on nid0[1899-1914] [2016-01-15T05:14:39.548] backfill: Started JobId=32110 in regular on nid0[5199-5202] [2016-01-15T05:14:39.646] backfill: Started JobId=32111 in regular on nid0[1947-1962] [2016-01-15T05:14:39.745] backfill: Started JobId=32112 in regular on nid0[2103-2118] [2016-01-15T05:14:39.846] backfill: Started JobId=32114 in regular on nid0[2119-2134] [2016-01-15T05:14:39.904] backfill: Started JobId=32116 in regular on nid0[0927-0928,1766-1772,3657-3663,3920-3927,5602-5609] [2016-01-15T05:15:12.064] Job 27432 to start at 2016-01-15T11:14:37, end at 2016-01-15T11:44:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0620-0649,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0926,0929-0934,0945-0958,1001-1034,1093-1133,1193-1201,1213-1219,1224-1226,1241-1279,1284-1535,1540-1603,1608-1663,1668-1765,1773-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3656,3664-3670,3673-3765,3796-3817,3825-3839,3844-3899,3954-4139,4186-4223,4229-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5610-5661,5680-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:15:12.705] backfill: 
Started JobId=27731 in regular on nid00[528,654] [2016-01-15T05:15:20.230] backfill: Started JobId=32125 in debug on nid00[086-127,136-157] [2016-01-15T05:15:50.955] Job 27432 to start at 2016-01-15T11:14:37, end at 2016-01-15T11:44:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0620-0649,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0926,0929-0934,0945-0958,1001-1034,1093-1133,1193-1201,1213-1219,1224-1226,1241-1279,1284-1535,1540-1603,1608-1663,1668-1765,1773-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3656,3664-3670,3673-3765,3796-3817,3825-3839,3844-3899,3954-4139,4186-4223,4229-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5610-5661,5680-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:16:29.995] Job 27432 to start at 2016-01-15T11:14:37, end at 2016-01-15T11:44:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0620-0649,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0926,0929-0934,0945-0958,1001-1034,1093-1133,1193-1201,1213-1219,1224-1226,1241-1279,1284-1535,1540-1603,1608-1663,1668-1765,1773-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3656,3664-3670,3673-3765,3796-3817,3825-3839,3844-3899,3954-4139,4186-4223,4229-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5610-5661,5680-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:17:09.017] Job 27432 to 
start at 2016-01-15T11:14:37, end at 2016-01-15T11:44:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0620-0649,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0926,0929-0934,0945-0958,1001-1034,1093-1133,1193-1201,1213-1219,1224-1226,1241-1279,1284-1535,1540-1603,1608-1663,1668-1765,1773-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3656,3664-3670,3673-3765,3796-3817,3825-3839,3844-3899,3954-4139,4186-4223,4229-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5610-5661,5680-5759,5764-5827,5832-5951,5956-6143] ... slurm.conf:
Created attachment 2620 [details] slurm.conf for edison
The QOSs; this job was submitted to premium, but job_submit.lua moved it to premium_regular_0, the highest-priority bin in the system.

scontrol show assoc | tail -n 500
...
QOS Records
QOS=normal(1) UsageRaw=3641173138.117583 GrpJobs=N(5) GrpSubmitJobs=N(9) GrpWall=N(62601.78) GrpTRES=cpu=N(6288),mem=N(8452513),energy=N(0),node=N(131) GrpTRESMins=cpu=N(60686218),mem=N(81574679163),energy=N(0),node=N(1264296) GrpTRESRunMins=cpu=N(98419),mem=N(132297959),energy=N(0),node=N(2050) MaxJobs= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=part_regx(5) UsageRaw=0.000000 GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobs= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=part_reg(6) UsageRaw=106407596338.459007 GrpJobs=N(195) GrpSubmitJobs=N(308) GrpWall=N(1240015.39) GrpTRES=cpu=N(132240),mem=N(177760865),energy=N(0),node=N(2755) GrpTRESMins=cpu=N(1773459938),mem=N(2381492269692),energy=N(0),node=N(36947082) GrpTRESRunMins=cpu=N(136017413),mem=N(182797082839),energy=N(0),node=N(2833696) MaxJobs= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=part_debug(7) UsageRaw=4470722333.674494 GrpJobs=N(5) GrpSubmitJobs=N(9) GrpWall=N(65017.08) GrpTRES=cpu=N(6288),mem=N(8452513),energy=N(0),node=N(131) GrpTRESMins=cpu=N(74512038),mem=N(100159749560),energy=N(0),node=N(1552334) GrpTRESRunMins=cpu=N(99151),mem=N(133281934),energy=N(0),node=N(2065) MaxJobsPU=1(5) MaxSubmitJobs=10(9) MaxWallPJ=30 MaxTRESPJ=node=512 MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=scavenger(9) UsageRaw=1373370756.785909 GrpJobs=N(1) GrpSubmitJobs=N(4) GrpWall=N(8213.85) GrpTRES=cpu=N(4800),mem=N(6452300),energy=N(0),node=N(100) GrpTRESMins=cpu=N(22889512),mem=N(30768750465),energy=N(0),node=N(476864) GrpTRESRunMins=cpu=N(700480),mem=N(941605646),energy=N(0),node=N(14593) MaxJobsPU=8(1) MaxSubmitJobs=100(4) MaxWallPJ=2160 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=normal_regular_0(10) UsageRaw=22577055475.016092 GrpJobs=N(0) GrpSubmitJobs=N(5) GrpWall=N(5448.02) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(376284257),mem=N(505812274449),energy=N(0),node=N(7839255) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobsPU=8(0) MaxSubmitJobs=100(5) MaxWallPJ=2160 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=normal_regular_1(11) UsageRaw=64476715273.811566 GrpJobs=N(162) GrpSubmitJobs=N(260) GrpWall=N(876737.62) GrpTRES=cpu=N(118464),mem=N(159242764),energy=N(0),node=N(2468) GrpTRESMins=cpu=N(1074611921),mem=N(1442827814349),energy=N(0),node=N(22387748) GrpTRESRunMins=cpu=N(79996553),mem=N(107533700581),energy=N(0),node=N(1666594) MaxJobsPU=24(162) MaxSubmitJobs=100(260) MaxWallPJ=2160 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=normal_regular_2(12) UsageRaw=16211605528.842383 GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(279783.40) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(270193425),mem=N(362454449997),energy=N(0),node=N(5629029) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobsPU=6(0) MaxSubmitJobs=20(0) MaxWallPJ=720 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=premium_regular_0(15) UsageRaw=53891622.414630 GrpJobs=N(0) GrpSubmitJobs=N(2) GrpWall=N(10.62) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(898193),mem=N(1207378178),energy=N(0),node=N(18712) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobsPU=2(0) MaxSubmitJobs=20(2) MaxWallPJ=2160 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=premium_regular_1(16) UsageRaw=217220562.337340 GrpJobs=N(9) GrpSubmitJobs=N(9) GrpWall=N(13203.96) GrpTRES=cpu=N(2592),mem=N(3484242),energy=N(0),node=N(54) GrpTRESMins=cpu=N(3620342),mem=N(4866570258),energy=N(0),node=N(75423) GrpTRESRunMins=cpu=N(1435225),mem=N(1929272112),energy=N(0),node=N(29900) MaxJobsPU=8(9) MaxSubmitJobs=20(9) MaxWallPJ=2160 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=premium_regular_2(17) UsageRaw=1827367.205807 GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(79.30) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(30456),mem=N(40940004),energy=N(0),node=N(634) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobsPU=6(0) MaxSubmitJobs=20(0) MaxWallPJ=720 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=low_regular_0(18) UsageRaw=8140472.684091 GrpJobs=N(0) GrpSubmitJobs=N(2) GrpWall=N(2.76) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(135674),mem=N(182377680),energy=N(0),node=N(2826) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobsPU=8(0) MaxSubmitJobs=100(2) MaxWallPJ=2160 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=low_regular_1(19) UsageRaw=1537022988.639393 GrpJobs=N(23) GrpSubmitJobs=N(26) GrpWall=N(50916.60) GrpTRES=cpu=N(6384),mem=N(8581559),energy=N(0),node=N(133) GrpTRESMins=cpu=N(25617049),mem=N(34435185519),energy=N(0),node=N(533688) GrpTRESRunMins=cpu=N(4704555),mem=N(6324000316),energy=N(0),node=N(98011) MaxJobsPU=24(23) MaxSubmitJobs=100(26) MaxWallPJ=2160 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=low_regular_2(20) UsageRaw=18240353.343954 GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(6317.90) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(304005),mem=N(408653582),energy=N(0),node=N(6333) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobsPU=6(0) MaxSubmitJobs=20(0) MaxWallPJ=720 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=low(21) UsageRaw=759773698.618572 GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(1207.77) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(12662894),mem=N(17021832762),energy=N(0),node=N(263810) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobs= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=premium(22) UsageRaw=1472412.503318 GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(273.83) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(24540),mem=N(32987663),energy=N(0),node=N(511) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobs= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=serialize(23) UsageRaw=0.000000 GrpJobs=1(0) GrpSubmitJobs=N(0) GrpWall=N(0.00) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobsPU=1(0) MaxSubmitJobs=10(0) MaxWallPJ=720 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
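To double-check a dump like this for Grp limits that are actually set (Slurm prints "N" for an unlimited value), a quick filter over a saved copy of the output can help. This is only a sketch: `grep_set_grp_limits` is a hypothetical name, `assoc.txt` is an assumed saved copy of the `scontrol show assoc` output, and the pattern only covers the scalar GrpJobs/GrpSubmitJobs/GrpWall fields, not GrpTRES.

```shell
# Print any GrpJobs/GrpSubmitJobs/GrpWall field whose limit is a number
# rather than "N" (unlimited); the value in parentheses is current usage.
grep_set_grp_limits() {
    grep -oE 'Grp(Jobs|SubmitJobs|Wall)=[0-9]+\([0-9.]+\)' "$1"
}
```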
The job that won't start:

nid01605:/var/tmp/slurm # scontrol show job 27432
JobId=27432 JobName=fmppic3.slurm8192
UserId=decyk(13195) GroupId=decyk(1013195)
Priority=2000000 Nice=0 Account=mp113 QOS=premium_regular_0
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:30:00 TimeMin=N/A
SubmitTime=2016-01-14T08:03:57 EligibleTime=2016-01-14T08:03:57
StartTime=2016-01-15T11:05:54 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=regular AllocNode:Sid=edison05:19579
ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
SchedNodeList=nid0[0311-0323,0328-0339,0342-0357,0366-0367,0426-0451,0456-0461,0486-0491,0540-0554,0575-0576,0579,0584-0605,0610-0619,0626-0649,0651-0653,0667-0703,0732-0735,0744-0767,0772-0820,0823-0835,0840-0844,0853-0880,0886-0895,0900-0920,0929-0934,0940-0958,1033-1034,1093-1133,1193-1201,1213-1219,1224-1226,1235-1236,1241-1279,1284-1535,1540-1603,1608-1663,1668-1765,1773-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2594-2595,2603-2687,2692-2734,2766-2815,2820-2894,2909-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3656,3664-3670,3673-3765,3796-3817,3821-3822,3825-3839,3844-3899,3954-4139,4173-4175,4186-4223,4229-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5229-5230,5239-5271,5326-5443,5448-5478,5489-5498,5523-5527,5529-5539,5567-5590,5594-5599,5610-5661,5680-5759,5764-5827,5832-5951,5956-6143]
NumNodes=4096-4096 NumCPUs=4096 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4096,node=4096
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=craynetwork:1 Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/global/u1/d/decyk/PICodes/mppic3/fmppic3.slurm8192
WorkDir=/global/u1/d/decyk/PICodes/mppic3
StdErr=/global/u1/d/decyk/PICodes/mppic3/slurm-27432.out
StdIn=/dev/null
StdOut=/global/u1/d/decyk/PICodes/mppic3/slurm-27432.out
Power= SICP=0
nid01605:/var/tmp/slurm #
Does the user with this stuck job have other jobs in the queue? Is there a chance that Slurm is back-filling some other jobs of theirs while waiting for enough nodes to launch this one? My loose theory: if that's the case, there could be an issue where we backfill enough of their jobs to exceed one of the QOS limits, and are then forced to push the large job back until they drop below the limit, which could in turn allow even more of their smaller jobs to sneak in. Just a guess, though; I haven't had a chance to look through the logs in further detail yet, but you might be able to get a better picture of that by checking what they've run recently through sacct.
at this point the user has this job and its twin (another 4096-node job) in the queue. The other job is the 2nd highest priority job.

We are not making use of any association limits on edison (on cori just MaxTRES for cray/bb). The QOS limits on edison do not define any Grp limits.

nid01605:/var/tmp/slurm # sacct --start=2016-01-01 -u decyk --format=job,user,start,end,partition,qos%20,nnodes -X
       JobID      User               Start                 End  Partition                  QOS   NNodes
------------ --------- ------------------- ------------------- ---------- -------------------- --------
8620         decyk     2016-01-08T14:52:00 2016-01-08T14:54:24 debug      low                  512
9135         decyk     2016-01-08T19:58:35 2016-01-08T20:03:08 debug      low                  256
9818         decyk     2016-01-08T20:23:47 2016-01-08T20:32:19 debug      low                  128
10250        decyk     2016-01-08T23:44:01 2016-01-09T00:01:13 debug      normal               64
10317        decyk     2016-01-09T00:44:05 2016-01-09T01:14:26 debug      normal               32
10551        decyk     2016-01-09T05:43:18 2016-01-09T05:46:16 debug      low                  512
10575        decyk     2016-01-09T06:05:46 2016-01-09T06:08:08 regular    normal_regular_1     1024
10585        decyk     2016-01-09T06:02:43 2016-01-09T06:35:19 regular    normal_regular_2     32
10598        decyk     2016-01-09T06:40:49 2016-01-09T06:42:20 regular    normal_regular_1     1024
10601        decyk     2016-01-09T07:07:08 2016-01-09T07:08:21 regular    normal_regular_1     2K
10609        decyk     2016-01-14T05:26:35 2016-01-14T05:28:07 regular    normal_regular_0     4K
10690        decyk     2016-01-14T05:28:25 2016-01-14T05:30:46 regular    normal_regular_0     4K
10721        decyk     2016-01-09T09:50:43 2016-01-09T09:53:07 regular    normal_regular_2     16
10903        decyk     2016-01-09T19:26:42 2016-01-09T19:27:18 debug      low                  512
11787        decyk     2016-01-09T19:57:35 2016-01-09T19:58:29 debug      low                  256
11868        decyk     2016-01-09T20:16:35 2016-01-09T20:18:07 debug      low                  128
11889        decyk     2016-01-09T20:28:37 2016-01-09T20:31:32 debug      normal               64
11905        decyk     2016-01-09T20:34:41 2016-01-09T20:40:10 debug      normal               32
11920        decyk     2016-01-09T20:45:44 2016-01-09T20:48:07 debug      normal               16
11929        decyk     2016-01-09T21:01:51 2016-01-09T21:02:25 regular    normal_regular_1     1024
11930        decyk     2016-01-12T21:57:59 2016-01-12T21:58:43 regular    normal_regular_0     2K
12221        decyk     2016-01-10T01:19:24 2016-01-10T01:20:02 debug      low                  512
12598        decyk     2016-01-10T06:42:36 2016-01-10T06:45:20 debug      low                  512
12618        decyk     2016-01-10T07:09:02 2016-01-10T07:14:02 debug      low                  256
12640        decyk     2016-01-10T07:17:11 2016-01-10T07:26:54 debug      low                  128
12654        decyk     2016-01-10T07:32:10 2016-01-10T07:41:57 debug      low                  128
12675        decyk     2016-01-10T07:48:00 2016-01-10T08:07:07 debug      normal               64
12702        decyk     2016-01-10T08:12:47 2016-01-10T08:49:40 regular    normal_regular_2     32
12797        decyk     2016-01-10T21:56:06 2016-01-10T21:58:51 debug      low                  512
15138        decyk     2016-01-11T10:05:58 2016-01-11T10:06:40 debug      low                  512
15524        decyk     2016-01-11T11:23:17 2016-01-11T11:25:40 debug      low                  512
15908        decyk     2016-01-11T21:37:15 2016-01-11T21:42:05 debug      low                  512
20798        decyk     2016-01-13T06:40:26 2016-01-13T06:42:15 regular    normal_regular_0     1024
21504        decyk     2016-01-13T07:52:29 2016-01-13T07:53:09 regular    normal_regular_0     1024
21964        decyk     2016-01-13T11:06:00 2016-01-13T11:07:54 regular    normal_regular_0     2K
21967        decyk     2016-01-13T08:56:13 2016-01-13T08:57:46 regular    normal_regular_0     1024
22214        decyk     2016-01-13T09:52:18 2016-01-13T09:54:04 regular    normal_regular_0     1024
24036        decyk     2016-01-13T15:02:14 2016-01-13T15:03:36 regular    premium_regular_0    2K
24043        decyk     2016-01-13T15:03:49 2016-01-13T15:04:30 regular    premium_regular_0    2K
24055        decyk     2016-01-13T15:04:47 2016-01-13T15:06:14 regular    premium_regular_0    2K
24538        decyk     2016-01-13T19:52:41 2016-01-13T19:54:36 regular    premium_regular_0    2K
27432        decyk     Unknown             Unknown             regular    premium_regular_0    4K
27436        decyk     Unknown             Unknown             regular    premium_regular_0    4K
28532        decyk     2016-01-14T12:24:02 2016-01-14T12:24:02 debug      normal               1
28646        decyk     2016-01-14T19:26:46 2016-01-14T19:30:21 debug      normal               342
30579        decyk     2016-01-14T20:24:36 2016-01-14T20:26:49 regular    premium_regular_0    683
30597        decyk     2016-01-14T20:55:30 2016-01-14T20:57:31 regular    premium_regular_0    1366
31349        decyk     2016-01-14T23:33:16 2016-01-14T23:35:06 regular    premium_regular_0    2731
nid01605:/var/tmp/slurm #
just to verify, the user's associations from sacctmgr:

nid01605:/var/tmp/slurm # sacctmgr show assoc where user=decyk cluster=edison -p
Cluster|Account|User|Partition|Share|GrpJobs|GrpTRES|GrpSubmit|GrpWall|GrpTRESMins|MaxJobs|MaxTRES|MaxTRESPerNode|MaxSubmit|MaxWall|MaxTRESMins|QOS|Def QOS|GrpTRESRunMins|
edison|m1157|decyk||1||||||||||||low,low_regular_0,low_regular_1,low_regular_2,normal,normal_regular_0,normal_regular_1,normal_regular_2,premium,premium_regular_0,premium_regular_1,premium_regular_2,scavenger,serialize|||
edison|mp113|decyk||1||||||||||||low,low_regular_0,low_regular_1,low_regular_2,normal,normal_regular_0,normal_regular_1,normal_regular_2,premium,premium_regular_0,premium_regular_1,premium_regular_2,scavenger,serialize|||
nid01605:/var/tmp/slurm #
On 01/15/2016 09:24 AM, bugs@schedmd.com wrote:
> at this point the user has this job and its twin (another 4096 node job) in the
> queue. The other job is the 2nd highest priority job.
>
> We are not making use of any association limits on edison (on cori just MaxTRES
> for cray/bb)
>
> The QOS limits on edison do not define and Grp limits

You do have MaxJobsPU set to 2, which I think could have caused what I describe:

> 30579 decyk 2016-01-14T20:24:36 2016-01-14T20:26:49 regular premium_regular_0 683
> 30597 decyk 2016-01-14T20:55:30 2016-01-14T20:57:31 regular premium_regular_0 1366
> 31349 decyk 2016-01-14T23:33:16 2016-01-14T23:35:06 regular premium_regular_0 2731
> nid01605:/var/tmp/slurm #

It looks like he'd have two running up until 21:00, which may have punted back the start time for 27432. Although that wouldn't explain further delays at "3:30AM, then 7:30AM, and 8:30AM." When did those delays kick in?
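The MaxJobsPU theory above can be sketched with a toy model. This is not Slurm code, and the job end times and limit below are illustrative, not taken from the logs:

```python
# Toy model of a MaxJobsPU-style per-user limit on concurrently running
# jobs: a new job by the user cannot be planned to start before enough
# of their running jobs have ended that one more fits under the limit.

def earliest_start_under_maxjobs(running_job_ends, max_jobs_per_user):
    """Earliest time a new job by this user may be planned to start,
    given the end times of their currently running jobs."""
    ends = sorted(running_job_ends)
    if len(ends) < max_jobs_per_user:
        return 0  # already under the limit: plannable immediately
    # Wait until enough running jobs end that starting one more
    # stays within the limit.
    return ends[len(ends) - max_jobs_per_user]

# Two jobs running, ending at t=5 and t=9, with MaxJobsPU=2:
# the pending job cannot be planned before t=5, when the first one ends.
print(earliest_start_under_maxjobs([5, 9], 2))  # prints 5
```

Each time another of the user's small jobs backfills in, the end-time list grows and the planned start of the big job slips further, which is the "sneaking in" effect described above.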
Hi Tim, I can't say exactly because I saw them using squeue --start output; I didn't have the backfill debugflags on then, though I think I'll leave them on into the future --- too useful. The user doesn't have any jobs now, nor since midnight, so maybe the case where the start time changed from 8:30 to 11:05 at 5:05 this morning would be informative (or the smaller movements around 4:04ish). -Doug ---- Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer National Energy Research Scientific Computing Center <http://www.nersc.gov> dmjacobsen@lbl.gov ------------- __o ---------- _ '\<,_ ----------(_)/ (_)__________________________
I checked with Jay (my supervisor), and he asked me to increase the priority of this issue. Owing to this problem, our machine utilization was 75% yesterday. Unfortunately, a long, large job was allowed to start, and it seems to have pushed this highest-priority job to the end of the schedule. I'll send updated logs.
It will take me a bit of time to study the logs and code. In the meantime, what you could do is create an advanced reservation for this particular job with sufficient resources for it to start, then modify the job to associate it with that reservation (scontrol update jobid=27432 reservation=WHATEVER). That should keep other jobs from starting on those nodes. Be sure to create the reservation with "flags=ignore_jobs". Advanced reservation documentation here: http://slurm.schedmd.com/reservations.html
Here is an update. I don't have a solution yet, but I found several interesting things in the log.
================================================================================================
All of the jobs that I see started in this log (I didn't check them all) started almost immediately after job submission (usually within a few seconds). I'm not sure if that is relevant, but it is odd.
================================================================================================
Here's an excerpt of the slurmctld log file:

[2016-01-15T05:13:24.790] sched: Allocate JobID=32124 NodeList=nid000[14-63,72-85] #CPUs=3072
...
[2016-01-15T05:13:53.649] backfill: beginning
[2016-01-15T05:13:53.649] debug: backfill: 107 jobs to backfill
[2016-01-15T05:13:53.649] backfill test for JobID=27432 Prio=2000000 Partition=regular
[2016-01-15T05:13:53.769] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on nid0[0311-0323...]
[2016-01-15T05:13:53.771] backfill test for JobID=27436 Prio=25743 Partition=regular
[2016-01-15T05:13:53.771] debug: backfill: user 13195: #jobs 2
...
[2016-01-15T05:14:33.817] debug: backfill: 105 jobs to backfill
[2016-01-15T05:14:33.817] backfill test for JobID=27432 Prio=2000000 Partition=regular
[2016-01-15T05:14:33.963] Job 27432 to start at 2016-01-15T11:35:19, end at 2016-01-15T12:04:00 on nid0[0311-0323,..]
[2016-01-15T05:14:33.965] backfill test for JobID=27436 Prio=25748 Partition=regular
[2016-01-15T05:14:33.965] debug: backfill: user 13195: #jobs 2
...
[2016-01-15T05:19:43.565] sched: Allocate JobID=32128 NodeList=nid000[08-63,72-79] #CPUs=3072

What this shows is that job 27432 had its expected start time pushed back from 08:54:33 to 11:35:19, even though no jobs were started in that interval.
================================================================================================
Something else that I do see in the logs is this:

[2016-01-15T04:05:42.016] Time limit exhausted for JobId=30731
[2016-01-15T04:05:42.021] debug: backup controller responding
[2016-01-15T04:05:42.258] job_complete: JobID=30731 State=0x8006 NodeCnt=20 WTERMSIG 15
[2016-01-15T04:05:42.914] debug: freed ports 63023 for step 31976.89
[2016-01-15T04:05:43.010] error: gres/craynetwork: job 28127 node nid00636 gres count underflow
[2016-01-15T04:05:43.010] error: cons_res: node nid00636 memory is under-allocated (0-64523) for job 28127
[2016-01-15T04:05:43.010] error: gres/craynetwork: job 28127 node nid00645 gres count underflow
[2016-01-15T04:05:43.010] error: cons_res: node nid00645 memory is under-allocated (0-64523) for job 28127
[2016-01-15T04:05:43.010] error: gres/craynetwork: job 28127 node nid00646 gres count underflow
[2016-01-15T04:05:43.010] error: cons_res: node nid00646 memory is under-allocated (0-64523) for job 28127
[2016-01-15T04:05:43.010] error: gres/craynetwork: job 28127 node nid00647 gres count underflow

For some unknown reason, when job 30731 ended at 04:05:42, a bunch of counters (memory and gres:craynetwork on many nodes) reported underflows. Slurm will set those counters to zero rather than go negative, but this seems to indicate that resources allocated to one or more jobs got decremented multiple times. This could be the root problem, but I'm not sure yet. If this issue is not resolved as part of this ticket, I'll open a new trouble ticket for it.
Doug, when you get a chance, could you check the accounting records for some of the started jobs and tell me if they ran in the "regular" partition or some other partition? Also, could you check if they happen to be associated with some advanced reservation (I expect that's a long shot):

[2016-01-15T05:01:12.566] sched: Allocate JobID=32117 NodeList=nid00[102-127,136-173] #CPUs=3072
[2016-01-15T05:04:14.884] sched: Allocate JobID=32121 NodeList=nid00[102-127,136-173] #CPUs=3072
[2016-01-15T05:13:24.790] sched: Allocate JobID=32124 NodeList=nid000[14-63,72-85] #CPUs=3072
[2016-01-15T05:19:43.565] sched: Allocate JobID=32128 NodeList=nid000[08-63,72-79] #CPUs=3072
[2016-01-15T05:19:53.023] sched: Allocate JobID=32129 NodeList=nid000[80-90] #CPUs=528
nid01605:~ # sacct -j 32117,32121,32128,32129 --format=job,user,account,partition,qos,alloccpus,nnodes,start,end,elapsed,state,exitcode
       JobID      User    Account  Partition        QOS  AllocCPUS   NNodes               Start                 End    Elapsed      State ExitCode
------------ --------- ---------- ---------- ---------- ---------- -------- ------------------- ------------------- ---------- ---------- --------
32117        liaoxx    m808       debug      normal     3072       64       2016-01-15T05:01:12 2016-01-15T05:03:10 00:01:58   COMPLETED  0:0
32117.batch            m808                             48         1        2016-01-15T05:01:12 2016-01-15T05:03:10 00:01:58   COMPLETED  0:0
32117.0                m808                             1536       64       2016-01-15T05:01:15 2016-01-15T05:03:07 00:01:52   CANCELLED+ 0:9
32121        liaoxx    m808       debug      normal     3072       64       2016-01-15T05:04:14 2016-01-15T05:06:15 00:02:01   TIMEOUT    1:0
32121.batch            m808                             48         1        2016-01-15T05:04:14 2016-01-15T05:06:15 00:02:01   COMPLETED  0:0
32121.0                m808                             1536       64       2016-01-15T05:04:17 2016-01-15T05:06:12 00:01:55   CANCELLED+ 0:9
32128        liaoxx    m808       debug      normal     3072       64       2016-01-15T05:19:43 2016-01-15T05:21:43 00:02:00   COMPLETED  0:0
32128.batch            m808                             48         1        2016-01-15T05:19:43 2016-01-15T05:21:43 00:02:00   COMPLETED  0:0
32128.0                m808                             1536       64       2016-01-15T05:19:46 2016-01-15T05:21:41 00:01:55   CANCELLED+ 0:9
32129        jbao      m808       debug      normal     528        11       2016-01-15T05:19:52 2016-01-15T05:50:03 00:30:11   TIMEOUT    1:0
32129.batch            m808                             48         1        2016-01-15T05:19:52 2016-01-15T05:50:04 00:30:12   CANCELLED  0:15
32129.0                m808                             256        11       2016-01-15T05:19:55 2016-01-15T05:50:06 00:30:11   CANCELLED  0:15
nid01605:~ #
nid01605:~ # sacctmgr show assoc where user=liaoxx user=jbao -p
Cluster|Account|User|Partition|Share|GrpJobs|GrpTRES|GrpSubmit|GrpWall|GrpTRESMins|MaxJobs|MaxTRES|MaxTRESPerNode|MaxSubmit|MaxWall|MaxTRESMins|QOS|Def QOS|GrpTRESRunMins|
edison|m808|jbao||1||||||||||||low,low_regular_0,low_regular_1,low_regular_2,normal,normal_regular_0,normal_regular_1,normal_regular_2,premium,premium_regular_0,premium_regular_1,premium_regular_2,scavenger,serialize|||
esedison|m808|jbao||1||||||||||||low,low_regular_0,low_regular_1,low_regular_2,normal,normal_regular_0,normal_regular_1,normal_regular_2,premium,premium_regular_0,premium_regular_1,premium_regular_2,scavenger,serialize|||
edison|m808|liaoxx||1||||||||||||low,low_regular_0,low_regular_1,low_regular_2,normal,normal_regular_0,normal_regular_1,normal_regular_2,premium,premium_regular_0,premium_regular_1,premium_regular_2,scavenger,serialize|||
esedison|m808|liaoxx||1||||||||||||low,low_regular_0,low_regular_1,low_regular_2,normal,normal_regular_0,normal_regular_1,normal_regular_2,premium,premium_regular_0,premium_regular_1,premium_regular_2,scavenger,serialize|||
nid01605:~ #
Hi Moe, Just from the node list I can tell those were in the debug partition -- which regular does not have access to. I'll send the detailed sacct records shortly. -Doug
It looks like every job started by the normal scheduler logic was in the debug partition, while all of the jobs started in the "regular" partition were started by backfill scheduling. I believe the underlying problem is that the backfill scheduler avoids using nodes which are in a "completing" state. That makes sense if we are trying to start a job immediately, but if we are trying to pick resources for a job to use an hour in the future, those are exactly the type of nodes that we want to be able to make use of. I would recommend making use of an advanced reservation for now. I'll need to study the logic some more and do some testing, but I'm guessing that the patch below will fix the problem. "non_cg_bitmap" is a bitmap of the nodes NOT in a completing state.

diff --git a/src/plugins/sched/backfill/backfill.c b/src/plugins/sched/backfill/backfill.c
index cde86e7..c52bd71 100644
--- a/src/plugins/sched/backfill/backfill.c
+++ b/src/plugins/sched/backfill/backfill.c
@@ -1292,7 +1292,7 @@ next_task:
 		/* Identify usable nodes for this job */
 		bit_and(avail_bitmap, part_ptr->node_bitmap);
 		bit_and(avail_bitmap, up_node_bitmap);
-		bit_and(avail_bitmap, non_cg_bitmap);
+//		bit_and(avail_bitmap, non_cg_bitmap);
 		for (j=0; ; ) {
 			if ((node_space[j].end_time > start_res) &&
 			    node_space[j].next && (later_start == 0))
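The effect of that bit_and() can be illustrated with a toy model using Python integers as bitmaps (this is not Slurm's bitstring API; node names and masks are made up for illustration):

```python
# Toy bitmap model of the backfill node-filtering step. Bit i set means
# node i is usable. Mirrors the three bit_and() calls in backfill.c.

NODE0, NODE1, NODE2, NODE3 = (1 << i for i in range(4))

def usable_nodes(part_bitmap, up_bitmap, non_cg_bitmap, include_completing):
    """Restrict to the partition's nodes, drop down nodes, and (unless
    include_completing) also drop nodes still completing a prior job."""
    avail = part_bitmap & up_bitmap
    if not include_completing:
        avail &= non_cg_bitmap
    return avail

part   = NODE0 | NODE1 | NODE2 | NODE3   # all four nodes in the partition
up     = part & ~NODE0                   # node 0 is down
non_cg = part & ~NODE3                   # node 3 is COMPLETING

# Stock behavior: the completing node 3 is invisible to backfill planning.
assert usable_nodes(part, up, non_cg, include_completing=False) == NODE1 | NODE2

# With the patch idea above: node 3 remains plannable for a future start.
assert usable_nodes(part, up, non_cg, include_completing=True) == NODE1 | NODE2 | NODE3
```

For an immediate start the stock filter is right; for planning a start an hour out, dropping completing nodes needlessly shrinks the candidate set.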
I created a separate trouble ticket for the underflow problem described in comment 11. See: http://bugs.schedmd.com/show_bug.cgi?id=2353
Thank you for looking into this, Moe and Tim. Regarding the completing state of the nodes, I had wondered about this in the past, and had manipulated CompleteWait to see if it would help (thinking that delaying the time at which the scheduler considered the completing nodes unavailable would help prevent this). On cori, we've seen similar effects, but the queue was so complicated I wasn't able to find a clean enough example, and have used the reservation trick in the past. I'm glad you were able to get to the bottom of this! Thank you. I have put these two large jobs into a reservation and I think they should run this morning. This may just encourage the user to submit more of them... =) -Doug
Hi Moe, Encouraged by the success of the user's jobs running in a reservation, the user submitted more, now job IDs 37358 and 37386. In response, I decided to try the patch you sent (I know you said you were still thinking about it). Anyway, I thought I'd let you know that these jobs have still been getting delayed with the patch applied. I'm sending the updated logs from the patched version of slurmctld. Thank you for your help with this, Doug
Created attachment 2621 [details] patched slurmctld logs with more top-job delays
(btw I've put these newest jobs in a reservation to get them to go). Turns out the user runs very short jobs. It would be useful to have a reservation flag that would cancel an active reservation once no jobs were queued/running or otherwise attached to it.
I was able to replicate the backfill scheduling bug based upon your configuration files and logs. Here's a detailed description with an example.

The fundamental issue is that nodes which are in COMPLETING state (in the process of terminating a job) are avoided when it comes to scheduling. For example, let's consider a simple system with four nodes, each with one job running, with the expected end times listed below:

Node name   End time
nid00000    16:00
nid00001    17:00
nid00002    18:00
nid00003    19:00

Now let's say the highest priority pending job needs 3 full nodes. Its expected start time will be 18:00, making use of nid0000[0-2]. Now suppose the job on nid00000 reaches its time limit and is in the process of getting killed. That node will be placed in a COMPLETING state, and if the backfill scheduler runs while it is completing, nid00000 will be removed from consideration and the pending job's expected start time will become 19:00, using nid0000[1-3]. We just slipped the job's expected start time by one hour.

My patch described in comment 15 was a partial, but incomplete, solution. There also needed to be changes in the resource selection plugin (select/cons_res). I've committed a complete solution to GitHub here (append ".patch" to the URL to generate a patch file):
https://github.com/SchedMD/slurm/commit/1a4b5983b13900302a114eb4a61d7b908c0fa2cf

I anticipate tagging a new release of Slurm (version 15.08.7) with this and a number of other bug fixes around January 20. Thank you for your patience. I'm closing the ticket based upon this patch. Please re-open if necessary.
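The four-node example above can be sketched as a small simulation. This is a toy model, not Slurm code; node names and times come from the example:

```python
# Toy model of how excluding COMPLETING nodes from backfill planning
# pushes back a large pending job's expected start time.

def expected_start(node_end_times, nodes_needed, excluded=()):
    """Earliest time at which `nodes_needed` nodes are simultaneously
    free, considering only nodes not in `excluded`."""
    usable = sorted(t for name, t in node_end_times.items()
                    if name not in excluded)
    if len(usable) < nodes_needed:
        return None  # the job can never be planned
    # The job can start once the Nth-soonest usable node frees up.
    return usable[nodes_needed - 1]

ends = {"nid00000": 16, "nid00001": 17, "nid00002": 18, "nid00003": 19}

# All four nodes considered: the 3-node job starts at 18:00.
print(expected_start(ends, 3))                           # prints 18

# nid00000 goes COMPLETING and is filtered out, as
# bit_and(avail_bitmap, non_cg_bitmap) does: start slips to 19:00.
print(expected_start(ends, 3, excluded={"nid00000"}))    # prints 19
```

Since a large job on a busy system almost always has some node in COMPLETING state at each backfill pass, the planned start time can slip indefinitely, which matches the repeated delays in the logs.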
(In reply to Doug Jacobsen from comment #20) > (btw I've put these newest jobs in a reservation to get them to go). Turns > out the user runs very short jobs. It would be useful to have a reservation > flag that would cancel an active reservation once no jobs were > queued/running or otherwise attached to it. That should be pretty simple. I've created a new trouble ticket for that: http://bugs.schedmd.com/show_bug.cgi?id=2355
Hi Moe, Thank you so much for continuing to look at this and for sending me this patch. I applied it, rebuilt Slurm with it in place, and then restarted. While the partitions were down I moved the big jobs out of the reservation and then deleted the reservation. Unfortunately it seems to have happened again: at the termination of a large job (1152 nodes), the 4096-node job was delayed in the next backfill scheduler run (within a few seconds). The primary event for this was at [2016-01-17T05:01:05.284]. I'll send the logs in a few minutes. -Doug
Created attachment 2622 [details] 2nd patch slurmctld.log
Almost certainly a distinct problem/bug. I will not be able to work on this until tonight at the earliest.
Of course, that's understandable :) I just wanted to get the info to you as early as possible.
Now that you mention it, I believe that I often see edison nodes transit through a "mixed" state upon job completion (some still say completing). I wonder if that is related to the gres decrement issue you noticed... I'll see if I can trap any sinfo output that demonstrates this.
(In reply to Doug Jacobsen from comment #27)
> Now that you mention it, I believe that I often see edison nodes transit
> through a "mixed" state upon job completion (some still say completing).

"mixed" is indicative of a node with both allocated and idle CPUs. If there are any jobs in "completing" state on the node, no matter how many CPUs are allocated or idle, its state will be reported as "completing". The good news is that the "underflow" messages from the first log are gone. They got cleaned up on restart, as expected. I did open a separate trouble ticket on that issue. The log here looks very similar to the previous one, in that the top priority job's expected start time gets pushed back when another job is in "completing" state. Forgive me, but I have to ask: are you sure this is running with the new sched/backfill and select/cons_res plugins rather than possibly old/cached versions?
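The state-reporting rule described above can be summarized in a small sketch. This is an assumption-labeled toy model of the precedence, not Slurm source:

```python
# Toy model of per-node state reporting precedence: a completing job on
# the node wins over CPU-count-based states, regardless of allocation.

def reported_state(alloc_cpus, idle_cpus, has_completing_job):
    if has_completing_job:
        return "completing"  # any completing job dominates
    if alloc_cpus and idle_cpus:
        return "mixed"       # both allocated and idle CPUs
    return "allocated" if alloc_cpus else "idle"

print(reported_state(24, 24, False))  # prints mixed
print(reported_state(24, 24, True))   # prints completing
```

On a system with no node sharing, a transient "mixed" report at job completion time would therefore be surprising, which is why the sinfo captures below are of interest.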
Hi Moe, Yeah I was worried about that too, but everything checks out in terms of the installation. The new BACKFILL flags are present in its slurm.h:

nid01605:/var/tmp/slurm # ls -l /opt/slurm/default
lrwxrwxrwx 1 root root 56 Jan 17 01:16 /opt/slurm/default -> 15.08.6_fixsched2350_15.08.6_fixsched2350_20160117010314
nid01605:/var/tmp/slurm # grep BACKFILL /opt/slurm/default/include/slurm/slurm.h
#define BACKFILL_TEST 0x00000008 /* Backfill test in progress */
#define DEBUG_FLAG_BACKFILL 0x0000000000001000 /* debug for
#define DEBUG_FLAG_BACKFILL_MAP 0x0000000008000000 /* Backfill scheduler node
nid01605:/var/tmp/slurm # ls -l /opt/slurm/default/lib/slurm/*cons*
-rw-r--r-- 1 root root 410748 Jan 17 01:06 /opt/slurm/default/lib/slurm/select_cons_res.a
-rwxr-xr-x 1 root root 1034 Jan 17 01:06 /opt/slurm/default/lib/slurm/select_cons_res.la
-rwxr-xr-x 1 root root 251569 Jan 17 01:06 /opt/slurm/default/lib/slurm/select_cons_res.so
nid01605:/var/tmp/slurm # ls -l /opt/slurm/default/lib/slurm/*backfill*
-rw-r--r-- 1 root root 168800 Jan 17 01:06 /opt/slurm/default/lib/slurm/sched_backfill.a
-rwxr-xr-x 1 root root 1027 Jan 17 01:06 /opt/slurm/default/lib/slurm/sched_backfill.la
-rwxr-xr-x 1 root root 109748 Jan 17 01:06 /opt/slurm/default/lib/slurm/sched_backfill.so
nid01605:/var/tmp/slurm #

Regarding the "mixed" state, we do not have any shared nodes or allow over subscription on edison, so the transit through "mixed" state has been confusing to me.
I captured this output earlier, I'll try to get a relevant section of the logs in a bit: running "sinfo -p system" every 10s, I caught this: Sun Jan 17 13:08:29 PST 2016 PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE NODELIST system up 1-infini 30:00 48 2:12:2 2 down* nid[00666,04250] system up 1-infini 30:00 48 2:12:2 5113 allocated nid[00008-00023,00233-00255,00264-00287,00297-00313,00334-00337,00350-00383,00388-00451,00456-00511,00516-00533,00537-00579,00584-00665,00667-00767,00772-00834,00849-00895,00900-00963,00968-01132,01143-01151,01156-01219,01224-01279,01284-01483,01502-01535,01540-01603,01608-01663,01668-01804,01809-01905,01916-01919,01924-01987,01992-02000,02007-02047,02052-02070,02077-02295,02301-02303,02308-02371,02376-02431,02436-02446,02456-02472,02490-02687,02692-02755,02760-02815,02820-02887,02893-03071,03076-03139,03144-03330,03363-03455,03460-03523,03528-03578,03589-03656,03667-03839,03844-03872,03883-03907,03912-03952,03964-04223,04228-04249,04251-04291,04296-04572,04587-04607,04992-05179,05185-05218,05226-05277,05280-05443,05448-05616,05633-05677,05697-05759,05764-05827,05832-05951,05956-06143] system up 1-infini 30:00 48 2:12:2 461 idle nid[00024-00063,00072-00127,00136-00191,00200-00232,00288-00296,00314-00323,00328-00333,00338-00349,00534-00536,00835,00840-00848,01133-01142,01484-01501,01805-01808,01906-01915,02001-02006,02071-02076,02296-02300,02447-02455,02473-02489,02888-02892,03331-03362,03579-03588,03657-03666,03873-03882,03953-03963,04573-04586,05180-05184,05219-05225,05278-05279,05617-05632,05678-05696] Sun Jan 17 13:08:39 PST 2016 PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE NODELIST system up 1-infini 30:00 48 2:12:2 2 down* nid[00666,04250] system up 1-infini 30:00 48 2:12:2 5026 allocated 
nid[00008-00023,00233-00255,00264-00287,00297-00313,00334-00337,00350-00383,00388-00451,00456-00511,00516-00533,00537-00579,00584-00665,00667-00767,00772-00806,00821-00834,00849-00895,00900-00963,00968-00992,00994-01132,01143-01151,01156-01219,01224-01279,01284-01483,01502-01535,01540-01603,01608-01663,01668-01804,01809-01905,01916-01919,01924-01962,02007-02047,02052-02070,02077-02295,02301-02303,02308-02371,02376-02431,02436-02446,02456-02472,02490-02687,02692-02755,02760-02815,02820-02887,02893-02894,02933-03071,03076-03139,03144-03330,03363-03455,03460-03523,03528-03578,03589-03656,03667-03839,03844-03872,03883-03907,03912-03952,03964-04223,04228-04249,04251-04291,04296-04572,04587-04607,04992-05179,05185-05218,05226-05277,05280-05443,05448-05616,05633-05677,05697-05759,05764-05827,05832-05951,05956-06143] system up 1-infini 30:00 48 2:12:2 548 mixed nid[00024-00063,00072-00127,00136-00191,00200-00232,00288-00296,00314-00323,00328-00333,00338-00349,00534-00536,00807-00820,00835,00840-00848,00993,01133-01142,01484-01501,01805-01808,01906-01915,01963-01987,01992-02006,02071-02076,02296-02300,02447-02455,02473-02489,02888-02892,02895-02932,03331-03362,03579-03588,03657-03666,03873-03882,03953-03963,04573-04586,05180-05184,05219-05225,05278-05279,05617-05632,05678-05696] Sun Jan 17 13:08:49 PST 2016 PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE NODELIST system up 1-infini 30:00 48 2:12:2 2 down* nid[00666,04250] system up 1-infini 30:00 48 2:12:2 5025 allocated 
nid[00008-00023,00233-00255,00264-00287,00297-00313,00334-00337,00350-00383,00388-00451,00456-00511,00516-00533,00537-00579,00584-00665,00667-00767,00772-00806,00821-00834,00849-00895,00900-00963,00968-00992,00994-01132,01143-01151,01156-01219,01224-01279,01284-01483,01502-01535,01540-01603,01608-01663,01668-01804,01809-01905,01916-01919,01924-01962,02007-02047,02052-02070,02077-02295,02301-02303,02308-02371,02376-02431,02436-02446,02456-02472,02490-02687,02692-02755,02760-02815,02820-02887,02893-02894,02933-03071,03076-03139,03144-03330,03363-03455,03460-03523,03528-03578,03589-03656,03667-03839,03844-03872,03883-03907,03912-03952,03964-04223,04228-04249,04251-04291,04296-04572,04587-04607,04992-05179,05185-05218,05226-05277,05280-05443,05448-05597,05599-05616,05633-05677,05697-05759,05764-05827,05832-05951,05956-06143] system up 1-infini 30:00 48 2:12:2 549 idle nid[00024-00063,00072-00127,00136-00191,00200-00232,00288-00296,00314-00323,00328-00333,00338-00349,00534-00536,00807-00820,00835,00840-00848,00993,01133-01142,01484-01501,01805-01808,01906-01915,01963-01987,01992-02006,02071-02076,02296-02300,02447-02455,02473-02489,02888-02892,02895-02932,03331-03362,03579-03588,03657-03666,03873-03882,03953-03963,04573-04586,05180-05184,05219-05225,05278-05279,05598,05617-05632,05678-05696] ---- Doug Jacobsen, Ph.D. 
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/ (_)__________________________

On Sun, Jan 17, 2016 at 7:11 PM, <bugs@schedmd.com> wrote:

> Comment #28 <http://bugs.schedmd.com/show_bug.cgi?id=2350#c28> on bug 2350 <http://bugs.schedmd.com/show_bug.cgi?id=2350> from Moe Jette <jette@schedmd.com>
>
> (In reply to Doug Jacobsen from comment #27)
> > Now that you mention it, I believe that I often see edison nodes transit
> > through a "mixed" state upon job completion (some still say completing).
>
> "mixed" is indicative of allocated and idle CPUs on a node. If there are any
> jobs in "completing" state on the node, no matter how many CPUs are allocated
> or idle, its state will be reported as "completing".
>
> The good news is that the "underflow" messages from the first log are gone. They
> got cleaned up on restart, as expected. I did open a separate trouble ticket on
> that issue.
>
> The log here looks very similar to the previous one in that the top priority
> job's expected start time gets pushed back when another job is in "completing"
> state. Forgive me, but I have to ask: are you sure this is running with the new
> sched/backfill and select/cons_res plugins rather than possibly old/cached
> versions?
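Eyeballing successive captures like the ones above is painful; a small awk summary of nodes per state makes the shifts easier to spot. This is a sketch (the `summarize` helper and sample file are mine; column positions follow the sinfo header above, where NODES is field 7 and STATE is field 8):

```shell
# Sum the NODES column per STATE from captured "sinfo -p system" output.
summarize() {
    awk '$1 == "system" { count[$8] += $7 }
         END { for (s in count) print s, count[s] }' "$1" | sort
}

# Abbreviated sample mirroring the first capture above:
cat > /tmp/sinfo.sample <<'EOF'
system up 1-infini 30:00 48 2:12:2 2 down* nid[00666,04250]
system up 1-infini 30:00 48 2:12:2 5113 allocated nid[00008-00023]
system up 1-infini 30:00 48 2:12:2 461 idle nid[00024-00063]
EOF

summarize /tmp/sinfo.sample
# prints:
#   allocated 5113
#   down* 2
#   idle 461
```

On a live system one would pipe `sinfo -p system` output straight into the same awk script instead of a capture file.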
(In reply to Doug Jacobsen from comment #30)
> Regarding the "mixed" state, we do not have any shared nodes or allow over
> subscription on edison, so the transit through "mixed" state has been
> confusing to me.
>
> I captured this output earlier, I'll try to get a relevant section of the
> logs in a bit:

The mixed state would be indicative of a node being in an ALLOCATED state, but with some of the CPUs flagged as IDLE in a data structure. Clearly not a healthy situation. I have been studying the code and have not been able to reproduce or identify the source of this problem.

The patch that you have already applied definitely fixes a bug with respect to backfill scheduling while a job is in COMPLETING state, but this is a different problem. Please attach those logs when you have a chance.
Hi Moe,

I'm offsite at an all-day thing today, so I will get back to you with logs tonight. I can definitely say that the patch is helping significantly and I'm not seeing the level of issues I was before the fix -- thank you! edison achieved >95% utilization yesterday, which is a vast improvement. This patch will go onto cori tomorrow (along with a rapid upgrade to 15.08.7 once it comes out).

I have a top-priority 98%-scale job in the queue for edison right now, so that should be a fairly good test to see how this works. It is scheduled to run tomorrow at 19:22.

Thanks again,
Doug
(In reply to Doug Jacobsen from comment #32) > I'm not seeing the level of issues I was before the fix -- thank you! > edison achieved >95% utilization yesterday which is a vast improvement. Excellent! While I have not been able to reproduce the "mixed" node state, I suspect that it is related to core specialization. If I'm correct, the "mixed" state should be changed to "allocated" in the sinfo output, but I do not believe that is adversely impacting scheduling of pending jobs (which was my original concern). In any case, I'm continuing to work on this...
Update: I don't have a fix for you yet, but I understand what is happening and will not require additional logs. This is a distinct bug from the one previously diagnosed, for which you already have a patch. There is a race condition in the job termination logic with respect to how Slurm runs the Cray Node Health Check (NHC). Not that it helps you, but this backfill bug will affect only Cray systems with NHC enabled. It is likely to be most significant when a job ends abnormally (non-zero exit code, reaching its time limit, etc.).
We do want to tag Slurm version 15.08.7 today with a multitude of bug fixes, but this is going to require a fairly complex change that will likely not be complete today. We will provide you with a patch when available. You might consider disabling NHC, which would eliminate this race condition. At Cray's request, that will be the default behaviour in the next major release of Slurm (version 16.05). NHC can be disabled by adding "NHC_NO_STEPS,NHC_NO" to your SelectTypeParameters value in slurm.conf.
Ah! That makes sense! I greatly appreciate it.

-Doug
Sounds good -- I'm looking forward to a number of the items in 15.08.7 and am awaiting its release anxiously.
What if we called the NHC from a job epilog instead, and prevented that epilog from completing until either the node is deemed clean or the suspect period expires and the node is marked admindown? Would that allow us to continue using the NHC without generating this race condition?
I have almost no interest in running the post-step NHC and will recommend that we disable that, at least.
Doing what happens in the NHC in the job epilog would be the dream of us all, but it won't work as you would expect. NHC has to be run from the node where the slurmctld runs. It wouldn't work from the EpilogSlurmctld either, as the nodes have already been given back to the system at that point, which would cause a different race condition.

I would also suggest just disabling NHC. NHC_NO implies NHC_NO_STEPS, so you would only need to add NHC_NO to SelectTypeParameters to completely disable it. If you start seeing issues, it is rather easy to turn back on, with hopefully minimal issues, if any. As Moe said, Cray has asked that it be disabled by default, so it apparently isn't that needed to begin with.
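For reference, the corrected form of that change would look like this in slurm.conf (a sketch only; any SelectTypeParameters options already set on this system would need to be merged in, not replaced):

```
# slurm.conf: disable the Cray NHC entirely.
# NHC_NO implies NHC_NO_STEPS, so the single flag suffices.
SelectTypeParameters=NHC_NO
```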
Does that mean that Cray will also stop using the NHC for ALPS? Or do they expect native WLM sites to come up with some other mechanism for monitoring node health?
I don't know. With Slurm there are definitely ways to monitor things outside of NHC. But I don't know what the end game is for this.
I'm not 100% sure, but I think much of the end-of-job node cleanup and memory compaction is initiated by the NHC. I'm going to have to take a pretty serious look at the impact before disabling it. Speaking with a few of the others in my group, there does seem to be a generally negative feeling towards disabling the end-of-job NHC check.

I'll look forward to the race condition fix, and will use the new purge_comp capability for reservations in the meantime. Though purge_comp didn't work for me just now (invalid flag); I will have to check whether the patch made it all the way in.

Thanks so much!
-Doug
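For the record, the reservation workaround being referred to would look roughly like this (the reservation name is hypothetical; an "invalid flag" error from scontrol, as seen above, would suggest the commit adding the flag is not in this build):

```
scontrol update ReservationName=big_job_resv Flags=PURGE_COMP
```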
I am just the messenger from what Cray asked us to do for 16.05 :). I would be very interested if there does appear to be a negative impact from not running NHC.
A fix for the Cray-specific backfill bug (the NHC race condition) is now available. This fix will be in version 15.08.8 when released, so you'll need to manage this as a patch for now. This is a second, Cray-specific patch, in addition to the patch described in comment 21, which was in the version 15.08.7 release. The commit of the new patch is here:

https://github.com/SchedMD/slurm/commit/79a21bd697cf2fd365e497872387628a1c670b39

Please re-open the ticket if you encounter any more problems.
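One way to carry this as a local patch until 15.08.8 ships (a sketch; GitHub serves any commit with a .patch suffix appended to its URL, and the source directory name assumes a 15.08.7 tarball):

```
curl -LO https://github.com/SchedMD/slurm/commit/79a21bd697cf2fd365e497872387628a1c670b39.patch
cd slurm-15.08.7
patch -p1 < ../79a21bd697cf2fd365e497872387628a1c670b39.patch
```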