Created attachment 2619 [details]
slurmctld.log for this morning

Hello,

I've seen this issue on cori as well, but edison has a much simpler queue, so I'm hoping it will be easier to understand what is going on right now. Yesterday someone submitted a large job under our "premium" QOS. It became the top-priority job and was scheduled to run at 6:15PM last night, then 8:30PM, then 11:30PM. At that point I manually increased its priority to 2000000, a value much higher than anything our multifactor priority plugin configuration can produce without manual intervention. It is definitely the highest-priority job (by job priority) in the system, and no newly submitted job can beat it.

We do not use partition priorities, only QOS. The job submit filter does, however, often change a user-requested QOS to a "real" QOS; our job_submit.lua does this to assign priority based on a couple of different factors (node count, customer priority request, customer standing, etc.).

Next the job was delayed to 3:30AM, then 7:30AM, then 8:30AM. At that point I rotated the slurmctld.log (attached) and enabled the backfill DebugFlags. Now it is scheduled to start at 11:05AM.

The problem, beyond the job not starting, is that many nodes (>1000) are being kept idle, presumably to start this job. The job requires 4096 nodes, which is slightly less than 75% of the system and 77% of the partition it was submitted to (our "regular" partition). Only one node is down, and I haven't seen issues with nodes spending a long time in the completing state.

Job 27432 is the target job here, with priority 2000000. Looking at the logs, it appears that until about 5:05AM there are some minor adjustments forward or backward in the schedule for this job, but mostly it is set to start at 8:30:12. At 5:05AM, job 32122, a newly submitted job with priority 17368, is started by the backfill scheduler. After that the schedule gets delayed further.
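To quantify the drift of the projected start time, a small helper can pull each "Job ... to start at" report for the job out of the log and count how often each value appears (a sketch; `drift_report` is a hypothetical name, and it assumes the log line format produced with the backfill DebugFlags enabled, as quoted below):

```shell
# Extract every projected start time reported for job 27432 and count
# how many times each value was logged; a changing list shows the
# backfill scheduler repeatedly pushing the job back.
drift_report() {
    grep -o 'Job 27432 to start at [^,]*' "$1" | sort | uniq -c
}
```

Run as, e.g., `drift_report /var/tmp/slurm/slurmctld.log`.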
I'm attaching the slurm.conf and the slurmctld.log from this morning up to this point. I really appreciate any help or advice you can give here. Thanks, Doug nid01605:/var/tmp/slurm # cat slurmctld.log | egrep 'Started|Allocate|Job 27432 to start' ... ... [2016-01-15T05:03:41.001] Job 27432 to start at 2016-01-15T08:30:12, end at 2016-01-15T09:00:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0542-0554,0579,0584-0605,0610-0625,0627-0648,0651-0654,0667-0703,0746-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0947-0956,1001-1034,1093-1133,1213-1219,1224-1226,1239-1279,1284,1287-1535,1540-1603,1608-1663,1668-1804,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2153,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2465,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4247,4305-4538,4541-4607,4992-5100,5103-5191,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5531-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6141] [2016-01-15T05:04:14.884] sched: Allocate JobID=32121 NodeList=nid00[102-127,136-173] #CPUs=3072 [2016-01-15T05:04:21.710] Job 27432 to start at 2016-01-15T08:30:12, end at 2016-01-15T09:00:00 on 
nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0542-0554,0579,0584-0605,0610-0625,0627-0648,0651-0654,0667-0703,0746-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0947-0956,1001-1034,1093-1133,1213-1219,1224-1226,1239-1279,1284,1287-1535,1540-1603,1608-1663,1668-1804,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2153,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2465,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4247,4305-4538,4541-4607,4992-5100,5103-5191,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5531-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6141] [2016-01-15T05:04:22.258] backfill: Started JobId=27724 in regular on nid0[0655,2593] [2016-01-15T05:05:03.761] Job 27432 to start at 2016-01-15T08:30:12, end at 2016-01-15T09:00:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0542-0554,0579,0584-0605,0610-0625,0627-0648,0651-0654,0667-0703,0746-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0947-0956,1001-1034,1093-1133,1213-1219,1224-1226,1239-1279,1284,1287-1535,1540-1603,1608-1663,1668-1804,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2153,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2465,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4247,4305-4538,4541-4607,4992-5100,5103-5191,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5531-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6141] [2016-01-15T05:05:10.687] backfill: Started JobId=32122 in regular on nid00[927-928] [2016-01-15T05:05:46.890] Job 27432 to start at 2016-01-15T11:35:19, end at 2016-01-15T12:04:00 
on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0575-0576,0579,0584-0605,0610-0649,0651-0654,0667-0703,0732-0735,0790-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1193-1201,1213-1219,1224-1226,1285-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4229-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:05:47.537] backfill: Started JobId=27727 in regular on nid0[1239-1240] [2016-01-15T05:05:54.315] backfill: Started JobId=32118 in regular on nid0[1241-1272] [2016-01-15T05:06:27.303] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:07:07.891] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on 
nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:07:48.391] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:08:28.802] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on 
nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:09:09.217] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:09:50.188] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on 
nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:10:31.168] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:11:12.142] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on 
nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:11:53.159] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:12:34.044] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on 
nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:13:14.310] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:13:24.790] sched: Allocate JobID=32124 NodeList=nid000[14-63,72-85] #CPUs=3072 [2016-01-15T05:13:53.769] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on 
nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0610-0649,0651-0654,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,1001-1034,1093-1133,1213-1219,1224-1226,1273-1279,1284-1535,1540-1603,1608-1663,1668-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2003,2006-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3670,3673-3765,3796-3817,3825-3839,3844-3899,3920-3927,3954-4139,4186-4223,4231-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5602-5661,5680-5713,5715-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:13:54.273] backfill: Started JobId=27728 in regular on nid00[362-363] [2016-01-15T05:13:54.316] backfill: Started JobId=27730 in regular on nid0[0368,4167] [2016-01-15T05:14:33.963] Job 27432 to start at 2016-01-15T11:35:19, end at 2016-01-15T12:04:00 on 
nid0[0311-0323,0328-0339,0342-0357,0366-0367,0426-0451,0456-0461,0480-0491,0540-0554,0575-0576,0579,0584-0605,0610-0649,0651-0654,0664,0667-0703,0732-0735,0744-0767,0772-0790,0797-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0934,0945-0958,0993,1001-1034,1093-1133,1193-1201,1213-1219,1224-1226,1235-1236,1241-1279,1284-1535,1540-1603,1608-1663,1668-1756,1766-1772,1777-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2047,2052-2155,2185-2197,2243-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2474,2525-2572,2594-2595,2603-2687,2692-2734,2766-2815,2820-2894,2909,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3622,3632-3645,3654-3670,3673-3817,3825-3839,3844-3856,3861-3899,3920-3927,3954-4139,4186-4223,4229-4249,4303-4310,4316-4538,4541-4607,4992-5100,5103-5112,5124-5193,5199-5218,5229-5230,5239-5244,5326-5443,5448-5478,5489-5498,5529-5539,5556-5557,5567-5590,5598-5599,5602-5609,5619-5661,5680-5693,5701-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:14:37.287] backfill: Started JobId=31395 in regular on nid0[1001-1032] [2016-01-15T05:14:37.352] backfill: Started JobId=31455 in regular on nid0[2243-2253] [2016-01-15T05:14:37.416] backfill: Started JobId=31616 in regular on nid0[0886-0895,0900-0919,2525-2572] [2016-01-15T05:14:37.475] backfill: Started JobId=31692 in regular on nid0[0667-0703,0746-0767,0772-0789,1093-1116,1777-1804,2440-2465,3825-3839,3844-3856,5531-5539] [2016-01-15T05:14:37.535] backfill: Started JobId=31695 in regular on nid0[0342-0353,0542-0554,0579,0584-0605,0627-0648,2185-2197,2314-2330,3632-3645,3806-3817,3954-3964,4186-4202,5567-5590,5680-5693] [2016-01-15T05:14:37.592] backfill: Started JobId=31696 in regular on nid0[1127-1133,4213-4223,5428-5439] [2016-01-15T05:14:38.073] backfill: Started JobId=31918 in regular on nid00[610-619] [2016-01-15T05:14:38.130] backfill: Started JobId=32039 in regular on nid0[0947-0956,5103-5112] [2016-01-15T05:14:38.948] backfill: Started JobId=32101 in regular on 
nid0[0651,5635-5644,5701-5709] [2016-01-15T05:14:39.046] backfill: Started JobId=32103 in regular on nid0[1849-1898] [2016-01-15T05:14:39.103] backfill: Started JobId=32104 in regular on nid0[0823-0830,0931-0934,1273-1279,1284] [2016-01-15T05:14:39.199] backfill: Started JobId=32106 in regular on nid0[3195-3210] [2016-01-15T05:14:39.297] backfill: Started JobId=32107 in regular on nid0[1734-1749] [2016-01-15T05:14:39.392] backfill: Started JobId=32108 in regular on nid00[311-323,328-330] [2016-01-15T05:14:39.489] backfill: Started JobId=32109 in regular on nid0[1899-1914] [2016-01-15T05:14:39.548] backfill: Started JobId=32110 in regular on nid0[5199-5202] [2016-01-15T05:14:39.646] backfill: Started JobId=32111 in regular on nid0[1947-1962] [2016-01-15T05:14:39.745] backfill: Started JobId=32112 in regular on nid0[2103-2118] [2016-01-15T05:14:39.846] backfill: Started JobId=32114 in regular on nid0[2119-2134] [2016-01-15T05:14:39.904] backfill: Started JobId=32116 in regular on nid0[0927-0928,1766-1772,3657-3663,3920-3927,5602-5609] [2016-01-15T05:15:12.064] Job 27432 to start at 2016-01-15T11:14:37, end at 2016-01-15T11:44:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0620-0649,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0926,0929-0934,0945-0958,1001-1034,1093-1133,1193-1201,1213-1219,1224-1226,1241-1279,1284-1535,1540-1603,1608-1663,1668-1765,1773-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3656,3664-3670,3673-3765,3796-3817,3825-3839,3844-3899,3954-4139,4186-4223,4229-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5610-5661,5680-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:15:12.705] backfill: 
Started JobId=27731 in regular on nid00[528,654] [2016-01-15T05:15:20.230] backfill: Started JobId=32125 in debug on nid00[086-127,136-157] [2016-01-15T05:15:50.955] Job 27432 to start at 2016-01-15T11:14:37, end at 2016-01-15T11:44:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0620-0649,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0926,0929-0934,0945-0958,1001-1034,1093-1133,1193-1201,1213-1219,1224-1226,1241-1279,1284-1535,1540-1603,1608-1663,1668-1765,1773-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3656,3664-3670,3673-3765,3796-3817,3825-3839,3844-3899,3954-4139,4186-4223,4229-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5610-5661,5680-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:16:29.995] Job 27432 to start at 2016-01-15T11:14:37, end at 2016-01-15T11:44:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0620-0649,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0926,0929-0934,0945-0958,1001-1034,1093-1133,1193-1201,1213-1219,1224-1226,1241-1279,1284-1535,1540-1603,1608-1663,1668-1765,1773-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3656,3664-3670,3673-3765,3796-3817,3825-3839,3844-3899,3954-4139,4186-4223,4229-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5610-5661,5680-5759,5764-5827,5832-5951,5956-6143] [2016-01-15T05:17:09.017] Job 27432 to 
start at 2016-01-15T11:14:37, end at 2016-01-15T11:44:00 on nid0[0311-0323,0328-0339,0342-0357,0426-0451,0456-0461,0480-0491,0540-0554,0579,0584-0605,0620-0649,0667-0703,0744-0767,0772-0820,0823-0835,0840-0844,0849-0880,0886-0895,0900-0926,0929-0934,0945-0958,1001-1034,1093-1133,1193-1201,1213-1219,1224-1226,1241-1279,1284-1535,1540-1603,1608-1663,1668-1765,1773-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2603-2687,2692-2734,2766-2815,2820-2894,2915-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3656,3664-3670,3673-3765,3796-3817,3825-3839,3844-3899,3954-4139,4186-4223,4229-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5239-5271,5326-5443,5448-5478,5489-5498,5529-5539,5567-5590,5610-5661,5680-5759,5764-5827,5832-5951,5956-6143] ... slurm.conf:
Created attachment 2620 [details] slurm.conf for edison
The QOSs; this job was submitted to premium, but job_submit.lua moved it to premium_regular_0, the highest-priority bin in the system.

scontrol show assoc | tail -n 500
...
QOS Records
QOS=normal(1) UsageRaw=3641173138.117583 GrpJobs=N(5) GrpSubmitJobs=N(9) GrpWall=N(62601.78) GrpTRES=cpu=N(6288),mem=N(8452513),energy=N(0),node=N(131) GrpTRESMins=cpu=N(60686218),mem=N(81574679163),energy=N(0),node=N(1264296) GrpTRESRunMins=cpu=N(98419),mem=N(132297959),energy=N(0),node=N(2050) MaxJobs= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=part_regx(5) UsageRaw=0.000000 GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(0.00) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobs= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=part_reg(6) UsageRaw=106407596338.459007 GrpJobs=N(195) GrpSubmitJobs=N(308) GrpWall=N(1240015.39) GrpTRES=cpu=N(132240),mem=N(177760865),energy=N(0),node=N(2755) GrpTRESMins=cpu=N(1773459938),mem=N(2381492269692),energy=N(0),node=N(36947082) GrpTRESRunMins=cpu=N(136017413),mem=N(182797082839),energy=N(0),node=N(2833696) MaxJobs= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=part_debug(7) UsageRaw=4470722333.674494 GrpJobs=N(5) GrpSubmitJobs=N(9) GrpWall=N(65017.08) GrpTRES=cpu=N(6288),mem=N(8452513),energy=N(0),node=N(131) GrpTRESMins=cpu=N(74512038),mem=N(100159749560),energy=N(0),node=N(1552334) GrpTRESRunMins=cpu=N(99151),mem=N(133281934),energy=N(0),node=N(2065) MaxJobsPU=1(5) MaxSubmitJobs=10(9) MaxWallPJ=30 MaxTRESPJ=node=512 MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=scavenger(9) UsageRaw=1373370756.785909 GrpJobs=N(1) GrpSubmitJobs=N(4) GrpWall=N(8213.85) GrpTRES=cpu=N(4800),mem=N(6452300),energy=N(0),node=N(100) GrpTRESMins=cpu=N(22889512),mem=N(30768750465),energy=N(0),node=N(476864) GrpTRESRunMins=cpu=N(700480),mem=N(941605646),energy=N(0),node=N(14593) MaxJobsPU=8(1) MaxSubmitJobs=100(4) MaxWallPJ=2160 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=normal_regular_0(10) UsageRaw=22577055475.016092 GrpJobs=N(0) GrpSubmitJobs=N(5) GrpWall=N(5448.02) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(376284257),mem=N(505812274449),energy=N(0),node=N(7839255) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobsPU=8(0) MaxSubmitJobs=100(5) MaxWallPJ=2160 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=normal_regular_1(11) UsageRaw=64476715273.811566 GrpJobs=N(162) GrpSubmitJobs=N(260) GrpWall=N(876737.62) GrpTRES=cpu=N(118464),mem=N(159242764),energy=N(0),node=N(2468) GrpTRESMins=cpu=N(1074611921),mem=N(1442827814349),energy=N(0),node=N(22387748) GrpTRESRunMins=cpu=N(79996553),mem=N(107533700581),energy=N(0),node=N(1666594) MaxJobsPU=24(162) MaxSubmitJobs=100(260) MaxWallPJ=2160 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=normal_regular_2(12) UsageRaw=16211605528.842383 GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(279783.40) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(270193425),mem=N(362454449997),energy=N(0),node=N(5629029) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobsPU=6(0) MaxSubmitJobs=20(0) MaxWallPJ=720 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=premium_regular_0(15) UsageRaw=53891622.414630 GrpJobs=N(0) GrpSubmitJobs=N(2) GrpWall=N(10.62) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(898193),mem=N(1207378178),energy=N(0),node=N(18712) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobsPU=2(0) MaxSubmitJobs=20(2) MaxWallPJ=2160 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=premium_regular_1(16) UsageRaw=217220562.337340 GrpJobs=N(9) GrpSubmitJobs=N(9) GrpWall=N(13203.96) GrpTRES=cpu=N(2592),mem=N(3484242),energy=N(0),node=N(54) GrpTRESMins=cpu=N(3620342),mem=N(4866570258),energy=N(0),node=N(75423) GrpTRESRunMins=cpu=N(1435225),mem=N(1929272112),energy=N(0),node=N(29900) MaxJobsPU=8(9) MaxSubmitJobs=20(9) MaxWallPJ=2160 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=premium_regular_2(17) UsageRaw=1827367.205807 GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(79.30) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(30456),mem=N(40940004),energy=N(0),node=N(634) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobsPU=6(0) MaxSubmitJobs=20(0) MaxWallPJ=720 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=low_regular_0(18) UsageRaw=8140472.684091 GrpJobs=N(0) GrpSubmitJobs=N(2) GrpWall=N(2.76) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(135674),mem=N(182377680),energy=N(0),node=N(2826) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobsPU=8(0) MaxSubmitJobs=100(2) MaxWallPJ=2160 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=low_regular_1(19) UsageRaw=1537022988.639393 GrpJobs=N(23) GrpSubmitJobs=N(26) GrpWall=N(50916.60) GrpTRES=cpu=N(6384),mem=N(8581559),energy=N(0),node=N(133) GrpTRESMins=cpu=N(25617049),mem=N(34435185519),energy=N(0),node=N(533688) GrpTRESRunMins=cpu=N(4704555),mem=N(6324000316),energy=N(0),node=N(98011) MaxJobsPU=24(23) MaxSubmitJobs=100(26) MaxWallPJ=2160 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=low_regular_2(20) UsageRaw=18240353.343954 GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(6317.90) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(304005),mem=N(408653582),energy=N(0),node=N(6333) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobsPU=6(0) MaxSubmitJobs=20(0) MaxWallPJ=720 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=low(21) UsageRaw=759773698.618572 GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(1207.77) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(12662894),mem=N(17021832762),energy=N(0),node=N(263810) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobs= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=premium(22) UsageRaw=1472412.503318 GrpJobs=N(0) GrpSubmitJobs=N(0) GrpWall=N(273.83) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(24540),mem=N(32987663),energy=N(0),node=N(511) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobs= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
QOS=serialize(23) UsageRaw=0.000000 GrpJobs=1(0) GrpSubmitJobs=N(0) GrpWall=N(0.00) GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0) MaxJobsPU=1(0) MaxSubmitJobs=10(0) MaxWallPJ=720 MaxTRESPJ= MaxTRESPN= MaxTRESPU= MaxTRESMinsPJ= MinTRESPJ=
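To double-check a dump like this for Grp limits that are actually set (Slurm prints "N" for an unlimited value), a quick filter over a saved copy of the output can help. This is only a sketch: `grep_set_grp_limits` is a hypothetical name, `assoc.txt` is an assumed saved copy of the `scontrol show assoc` output, and the pattern only covers the scalar GrpJobs/GrpSubmitJobs/GrpWall fields, not GrpTRES.

```shell
# Print any GrpJobs/GrpSubmitJobs/GrpWall field whose limit is a number
# rather than "N" (unlimited); the value in parentheses is current usage.
grep_set_grp_limits() {
    grep -oE 'Grp(Jobs|SubmitJobs|Wall)=[0-9]+\([0-9.]+\)' "$1"
}
```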
The job that won't start:

nid01605:/var/tmp/slurm # scontrol show job 27432
JobId=27432 JobName=fmppic3.slurm8192
UserId=decyk(13195) GroupId=decyk(1013195)
Priority=2000000 Nice=0 Account=mp113 QOS=premium_regular_0
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:30:00 TimeMin=N/A
SubmitTime=2016-01-14T08:03:57 EligibleTime=2016-01-14T08:03:57
StartTime=2016-01-15T11:05:54 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=regular AllocNode:Sid=edison05:19579
ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
SchedNodeList=nid0[0311-0323,0328-0339,0342-0357,0366-0367,0426-0451,0456-0461,0486-0491,0540-0554,0575-0576,0579,0584-0605,0610-0619,0626-0649,0651-0653,0667-0703,0732-0735,0744-0767,0772-0820,0823-0835,0840-0844,0853-0880,0886-0895,0900-0920,0929-0934,0940-0958,1033-1034,1093-1133,1193-1201,1213-1219,1224-1226,1235-1236,1241-1279,1284-1535,1540-1603,1608-1663,1668-1765,1773-1804,1822-1826,1849-1919,1924-1929,1947-1987,1992-2047,2052-2072,2103-2155,2185-2197,2220-2253,2314-2330,2364-2371,2376-2431,2436-2437,2440-2470,2525-2572,2594-2595,2603-2687,2692-2734,2766-2815,2820-2894,2909-3071,3076-3139,3144-3172,3195-3455,3460-3523,3528-3584,3611-3656,3664-3670,3673-3765,3796-3817,3821-3822,3825-3839,3844-3899,3954-4139,4173-4175,4186-4223,4229-4249,4303-4538,4541-4607,4992-5100,5103-5193,5199-5224,5229-5230,5239-5271,5326-5443,5448-5478,5489-5498,5523-5527,5529-5539,5567-5590,5594-5599,5610-5661,5680-5759,5764-5827,5832-5951,5956-6143]
NumNodes=4096-4096 NumCPUs=4096 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4096,node=4096
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=craynetwork:1 Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/global/u1/d/decyk/PICodes/mppic3/fmppic3.slurm8192
WorkDir=/global/u1/d/decyk/PICodes/mppic3
StdErr=/global/u1/d/decyk/PICodes/mppic3/slurm-27432.out
StdIn=/dev/null
StdOut=/global/u1/d/decyk/PICodes/mppic3/slurm-27432.out
Power= SICP=0
nid01605:/var/tmp/slurm #
Does the user with this stuck job have other jobs in the queue? Is there a chance that Slurm is back-filling some other jobs of theirs while waiting for enough nodes to launch this one? My loose theory: if that's the case, there could be an issue where we backfill enough of their jobs to exceed one of the QOS limits, and are then forced to push the large job back until they drop below the limit, which could in turn allow even more of their smaller jobs to sneak in. Just a guess, though; I haven't had a chance to look through the logs in further detail yet, but you might be able to get a better picture of that by checking what they've run recently through sacct.
at this point the user has this job and its twin (another 4096-node job) in the queue. The other job is the 2nd highest priority job.

We are not making use of any association limits on edison (on cori just MaxTRES for cray/bb). The QOS limits on edison do not define any Grp limits.

nid01605:/var/tmp/slurm # sacct --start=2016-01-01 -u decyk --format=job,user,start,end,partition,qos%20,nnodes -X
       JobID      User               Start                 End  Partition                  QOS   NNodes
------------ --------- ------------------- ------------------- ---------- -------------------- --------
8620         decyk     2016-01-08T14:52:00 2016-01-08T14:54:24 debug      low                  512
9135         decyk     2016-01-08T19:58:35 2016-01-08T20:03:08 debug      low                  256
9818         decyk     2016-01-08T20:23:47 2016-01-08T20:32:19 debug      low                  128
10250        decyk     2016-01-08T23:44:01 2016-01-09T00:01:13 debug      normal               64
10317        decyk     2016-01-09T00:44:05 2016-01-09T01:14:26 debug      normal               32
10551        decyk     2016-01-09T05:43:18 2016-01-09T05:46:16 debug      low                  512
10575        decyk     2016-01-09T06:05:46 2016-01-09T06:08:08 regular    normal_regular_1     1024
10585        decyk     2016-01-09T06:02:43 2016-01-09T06:35:19 regular    normal_regular_2     32
10598        decyk     2016-01-09T06:40:49 2016-01-09T06:42:20 regular    normal_regular_1     1024
10601        decyk     2016-01-09T07:07:08 2016-01-09T07:08:21 regular    normal_regular_1     2K
10609        decyk     2016-01-14T05:26:35 2016-01-14T05:28:07 regular    normal_regular_0     4K
10690        decyk     2016-01-14T05:28:25 2016-01-14T05:30:46 regular    normal_regular_0     4K
10721        decyk     2016-01-09T09:50:43 2016-01-09T09:53:07 regular    normal_regular_2     16
10903        decyk     2016-01-09T19:26:42 2016-01-09T19:27:18 debug      low                  512
11787        decyk     2016-01-09T19:57:35 2016-01-09T19:58:29 debug      low                  256
11868        decyk     2016-01-09T20:16:35 2016-01-09T20:18:07 debug      low                  128
11889        decyk     2016-01-09T20:28:37 2016-01-09T20:31:32 debug      normal               64
11905        decyk     2016-01-09T20:34:41 2016-01-09T20:40:10 debug      normal               32
11920        decyk     2016-01-09T20:45:44 2016-01-09T20:48:07 debug      normal               16
11929        decyk     2016-01-09T21:01:51 2016-01-09T21:02:25 regular    normal_regular_1     1024
11930        decyk     2016-01-12T21:57:59 2016-01-12T21:58:43 regular    normal_regular_0     2K
12221        decyk     2016-01-10T01:19:24 2016-01-10T01:20:02 debug      low                  512
12598        decyk     2016-01-10T06:42:36 2016-01-10T06:45:20 debug      low                  512
12618        decyk     2016-01-10T07:09:02 2016-01-10T07:14:02 debug      low                  256
12640        decyk     2016-01-10T07:17:11 2016-01-10T07:26:54 debug      low                  128
12654        decyk     2016-01-10T07:32:10 2016-01-10T07:41:57 debug      low                  128
12675        decyk     2016-01-10T07:48:00 2016-01-10T08:07:07 debug      normal               64
12702        decyk     2016-01-10T08:12:47 2016-01-10T08:49:40 regular    normal_regular_2     32
12797        decyk     2016-01-10T21:56:06 2016-01-10T21:58:51 debug      low                  512
15138        decyk     2016-01-11T10:05:58 2016-01-11T10:06:40 debug      low                  512
15524        decyk     2016-01-11T11:23:17 2016-01-11T11:25:40 debug      low                  512
15908        decyk     2016-01-11T21:37:15 2016-01-11T21:42:05 debug      low                  512
20798        decyk     2016-01-13T06:40:26 2016-01-13T06:42:15 regular    normal_regular_0     1024
21504        decyk     2016-01-13T07:52:29 2016-01-13T07:53:09 regular    normal_regular_0     1024
21964        decyk     2016-01-13T11:06:00 2016-01-13T11:07:54 regular    normal_regular_0     2K
21967        decyk     2016-01-13T08:56:13 2016-01-13T08:57:46 regular    normal_regular_0     1024
22214        decyk     2016-01-13T09:52:18 2016-01-13T09:54:04 regular    normal_regular_0     1024
24036        decyk     2016-01-13T15:02:14 2016-01-13T15:03:36 regular    premium_regular_0    2K
24043        decyk     2016-01-13T15:03:49 2016-01-13T15:04:30 regular    premium_regular_0    2K
24055        decyk     2016-01-13T15:04:47 2016-01-13T15:06:14 regular    premium_regular_0    2K
24538        decyk     2016-01-13T19:52:41 2016-01-13T19:54:36 regular    premium_regular_0    2K
27432        decyk     Unknown             Unknown             regular    premium_regular_0    4K
27436        decyk     Unknown             Unknown             regular    premium_regular_0    4K
28532        decyk     2016-01-14T12:24:02 2016-01-14T12:24:02 debug      normal               1
28646        decyk     2016-01-14T19:26:46 2016-01-14T19:30:21 debug      normal               342
30579        decyk     2016-01-14T20:24:36 2016-01-14T20:26:49 regular    premium_regular_0    683
30597        decyk     2016-01-14T20:55:30 2016-01-14T20:57:31 regular    premium_regular_0    1366
31349        decyk     2016-01-14T23:33:16 2016-01-14T23:35:06 regular    premium_regular_0    2731
nid01605:/var/tmp/slurm #
just to verify, the user's associations from sacctmgr:

nid01605:/var/tmp/slurm # sacctmgr show assoc where user=decyk cluster=edison -p
Cluster|Account|User|Partition|Share|GrpJobs|GrpTRES|GrpSubmit|GrpWall|GrpTRESMins|MaxJobs|MaxTRES|MaxTRESPerNode|MaxSubmit|MaxWall|MaxTRESMins|QOS|Def QOS|GrpTRESRunMins|
edison|m1157|decyk||1||||||||||||low,low_regular_0,low_regular_1,low_regular_2,normal,normal_regular_0,normal_regular_1,normal_regular_2,premium,premium_regular_0,premium_regular_1,premium_regular_2,scavenger,serialize|||
edison|mp113|decyk||1||||||||||||low,low_regular_0,low_regular_1,low_regular_2,normal,normal_regular_0,normal_regular_1,normal_regular_2,premium,premium_regular_0,premium_regular_1,premium_regular_2,scavenger,serialize|||
nid01605:/var/tmp/slurm #
On 01/15/2016 09:24 AM, bugs@schedmd.com wrote:
> at this point the user has this job and its twin (another 4096 node job) in the
> queue. The other job is the 2nd highest priority job.
>
> We are not making use of any association limits on edison (on cori just MaxTRES
> for cray/bb)
>
> The QOS limits on edison do not define and Grp limits

You do have MaxJobsPU set to 2, which I think could have caused what I describe:

> 30579 decyk 2016-01-14T20:24:36 2016-01-14T20:26:49 regular premium_regular_0 683
> 30597 decyk 2016-01-14T20:55:30 2016-01-14T20:57:31 regular premium_regular_0 1366
> 31349 decyk 2016-01-14T23:33:16 2016-01-14T23:35:06 regular premium_regular_0 2731
> nid01605:/var/tmp/slurm #

It looks like he'd have two running up until 21:00, which may have punted back the start time for 27432. Although that wouldn't explain further delays at "3:30AM, then 7:30AM, and 8:30AM." When did those delays kick in?
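The MaxJobsPU theory above can be sketched with a toy model. This is not Slurm code, and the job end times and limit below are illustrative, not taken from the logs:

```python
# Toy model of a MaxJobsPU-style per-user limit on concurrently running
# jobs: a new job by the user cannot be planned to start before enough
# of their running jobs have ended that one more fits under the limit.

def earliest_start_under_maxjobs(running_job_ends, max_jobs_per_user):
    """Earliest time a new job by this user may be planned to start,
    given the end times of their currently running jobs."""
    ends = sorted(running_job_ends)
    if len(ends) < max_jobs_per_user:
        return 0  # already under the limit: plannable immediately
    # Wait until enough running jobs end that starting one more
    # stays within the limit.
    return ends[len(ends) - max_jobs_per_user]

# Two jobs running, ending at t=5 and t=9, with MaxJobsPU=2:
# the pending job cannot be planned before t=5, when the first one ends.
print(earliest_start_under_maxjobs([5, 9], 2))  # prints 5
```

Each time another of the user's small jobs backfills in, the end-time list grows and the planned start of the big job slips further, which is the "sneaking in" effect described above.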
Hi Tim, I can't say exactly because I saw them using squeue --start output; I didn't have the backfill debugflags on then, though I think I'll leave them on into the future --- too useful. The user doesn't have any jobs now, nor since midnight, so maybe the case where the start time changed from 8:30 to 11:05 at 5:05 this morning would be informative (or the smaller movements around 4:04ish). -Doug ---- Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer National Energy Research Scientific Computing Center <http://www.nersc.gov> dmjacobsen@lbl.gov ------------- __o ---------- _ '\<,_ ----------(_)/ (_)__________________________
I checked with Jay (my supervisor), and he asked me to increase the priority of this issue. Owing to this problem, our machine utilization was 75% yesterday. Unfortunately, a long, large job was allowed to start, and it seems to have pushed this highest-priority job to the end of the schedule. I'll send updated logs.
It will take me a bit of time to study the logs and code. In the meantime, what you could do is create an advanced reservation for this particular job with sufficient resources for it to start, then modify the job to associate it with that reservation (scontrol update jobid=27432 reservation=WHATEVER). That should keep other jobs from starting on those nodes. Be sure to create the reservation with "flags=ignore_jobs". Advanced reservation documentation here: http://slurm.schedmd.com/reservations.html
Here is an update. I don't have a solution yet, but I found several interesting things in the log.
================================================================================================
All of the jobs that I see started in this log (I didn't check them all) started almost immediately after job submission (usually within a few seconds). I'm not sure if that is relevant, but it is odd.
================================================================================================
Here's an excerpt of the slurmctld log file:

[2016-01-15T05:13:24.790] sched: Allocate JobID=32124 NodeList=nid000[14-63,72-85] #CPUs=3072
...
[2016-01-15T05:13:53.649] backfill: beginning
[2016-01-15T05:13:53.649] debug: backfill: 107 jobs to backfill
[2016-01-15T05:13:53.649] backfill test for JobID=27432 Prio=2000000 Partition=regular
[2016-01-15T05:13:53.769] Job 27432 to start at 2016-01-15T08:54:33, end at 2016-01-15T09:24:00 on nid0[0311-0323...]
[2016-01-15T05:13:53.771] backfill test for JobID=27436 Prio=25743 Partition=regular
[2016-01-15T05:13:53.771] debug: backfill: user 13195: #jobs 2
...
[2016-01-15T05:14:33.817] debug: backfill: 105 jobs to backfill
[2016-01-15T05:14:33.817] backfill test for JobID=27432 Prio=2000000 Partition=regular
[2016-01-15T05:14:33.963] Job 27432 to start at 2016-01-15T11:35:19, end at 2016-01-15T12:04:00 on nid0[0311-0323,..]
[2016-01-15T05:14:33.965] backfill test for JobID=27436 Prio=25748 Partition=regular
[2016-01-15T05:14:33.965] debug: backfill: user 13195: #jobs 2
...
[2016-01-15T05:19:43.565] sched: Allocate JobID=32128 NodeList=nid000[08-63,72-79] #CPUs=3072

What this shows is that job 27432 had its expected start time pushed back from 08:54:33 to 11:35:19, even though no jobs were started in that interval.
================================================================================================
Something else that I do see in the logs is this:

[2016-01-15T04:05:42.016] Time limit exhausted for JobId=30731
[2016-01-15T04:05:42.021] debug: backup controller responding
[2016-01-15T04:05:42.258] job_complete: JobID=30731 State=0x8006 NodeCnt=20 WTERMSIG 15
[2016-01-15T04:05:42.914] debug: freed ports 63023 for step 31976.89
[2016-01-15T04:05:43.010] error: gres/craynetwork: job 28127 node nid00636 gres count underflow
[2016-01-15T04:05:43.010] error: cons_res: node nid00636 memory is under-allocated (0-64523) for job 28127
[2016-01-15T04:05:43.010] error: gres/craynetwork: job 28127 node nid00645 gres count underflow
[2016-01-15T04:05:43.010] error: cons_res: node nid00645 memory is under-allocated (0-64523) for job 28127
[2016-01-15T04:05:43.010] error: gres/craynetwork: job 28127 node nid00646 gres count underflow
[2016-01-15T04:05:43.010] error: cons_res: node nid00646 memory is under-allocated (0-64523) for job 28127
[2016-01-15T04:05:43.010] error: gres/craynetwork: job 28127 node nid00647 gres count underflow

For some unknown reason, when job 30731 ended at 04:05:42, a bunch of counters (memory and gres:craynetwork on many nodes) reported underflows. Slurm will set those counters to zero rather than go negative, but this seems to indicate that resources allocated to one or more jobs got decremented multiple times. This could be the root problem, but I'm not sure yet. If this issue is not resolved as part of this ticket, I'll open a new trouble ticket for it.
Doug, when you get a chance, could you check the accounting records for some of the started jobs and tell me if they ran in the "regular" partition or some other partition? Also, could you check if they happen to be associated with some advanced reservation (I expect that's a long shot):

[2016-01-15T05:01:12.566] sched: Allocate JobID=32117 NodeList=nid00[102-127,136-173] #CPUs=3072
[2016-01-15T05:04:14.884] sched: Allocate JobID=32121 NodeList=nid00[102-127,136-173] #CPUs=3072
[2016-01-15T05:13:24.790] sched: Allocate JobID=32124 NodeList=nid000[14-63,72-85] #CPUs=3072
[2016-01-15T05:19:43.565] sched: Allocate JobID=32128 NodeList=nid000[08-63,72-79] #CPUs=3072
[2016-01-15T05:19:53.023] sched: Allocate JobID=32129 NodeList=nid000[80-90] #CPUs=528
nid01605:~ # sacct -j 32117,32121,32128,32129 --format=job,user,account,partition,qos,alloccpus,nnodes,start,end,elapsed,state,exitcode
       JobID      User    Account  Partition        QOS  AllocCPUS   NNodes               Start                 End    Elapsed      State ExitCode
------------ --------- ---------- ---------- ---------- ---------- -------- ------------------- ------------------- ---------- ---------- --------
32117        liaoxx    m808       debug      normal     3072       64       2016-01-15T05:01:12 2016-01-15T05:03:10 00:01:58   COMPLETED  0:0
32117.batch            m808                             48         1        2016-01-15T05:01:12 2016-01-15T05:03:10 00:01:58   COMPLETED  0:0
32117.0                m808                             1536       64       2016-01-15T05:01:15 2016-01-15T05:03:07 00:01:52   CANCELLED+ 0:9
32121        liaoxx    m808       debug      normal     3072       64       2016-01-15T05:04:14 2016-01-15T05:06:15 00:02:01   TIMEOUT    1:0
32121.batch            m808                             48         1        2016-01-15T05:04:14 2016-01-15T05:06:15 00:02:01   COMPLETED  0:0
32121.0                m808                             1536       64       2016-01-15T05:04:17 2016-01-15T05:06:12 00:01:55   CANCELLED+ 0:9
32128        liaoxx    m808       debug      normal     3072       64       2016-01-15T05:19:43 2016-01-15T05:21:43 00:02:00   COMPLETED  0:0
32128.batch            m808                             48         1        2016-01-15T05:19:43 2016-01-15T05:21:43 00:02:00   COMPLETED  0:0
32128.0                m808                             1536       64       2016-01-15T05:19:46 2016-01-15T05:21:41 00:01:55   CANCELLED+ 0:9
32129        jbao      m808       debug      normal     528        11       2016-01-15T05:19:52 2016-01-15T05:50:03 00:30:11   TIMEOUT    1:0
32129.batch            m808                             48         1        2016-01-15T05:19:52 2016-01-15T05:50:04 00:30:12   CANCELLED  0:15
32129.0                m808                             256        11       2016-01-15T05:19:55 2016-01-15T05:50:06 00:30:11   CANCELLED  0:15
nid01605:~ #
nid01605:~ # sacctmgr show assoc where user=liaoxx user=jbao -p
Cluster|Account|User|Partition|Share|GrpJobs|GrpTRES|GrpSubmit|GrpWall|GrpTRESMins|MaxJobs|MaxTRES|MaxTRESPerNode|MaxSubmit|MaxWall|MaxTRESMins|QOS|Def QOS|GrpTRESRunMins|
edison|m808|jbao||1||||||||||||low,low_regular_0,low_regular_1,low_regular_2,normal,normal_regular_0,normal_regular_1,normal_regular_2,premium,premium_regular_0,premium_regular_1,premium_regular_2,scavenger,serialize|||
esedison|m808|jbao||1||||||||||||low,low_regular_0,low_regular_1,low_regular_2,normal,normal_regular_0,normal_regular_1,normal_regular_2,premium,premium_regular_0,premium_regular_1,premium_regular_2,scavenger,serialize|||
edison|m808|liaoxx||1||||||||||||low,low_regular_0,low_regular_1,low_regular_2,normal,normal_regular_0,normal_regular_1,normal_regular_2,premium,premium_regular_0,premium_regular_1,premium_regular_2,scavenger,serialize|||
esedison|m808|liaoxx||1||||||||||||low,low_regular_0,low_regular_1,low_regular_2,normal,normal_regular_0,normal_regular_1,normal_regular_2,premium,premium_regular_0,premium_regular_1,premium_regular_2,scavenger,serialize|||
nid01605:~ #
Hi Moe, Just from the node list I can tell those were in the debug partition -- which regular does not have access to. I'll send the detailed sacct records shortly. -Doug
It looks like every job started by the normal scheduler logic was in the debug partition, while all of the jobs started in the "regular" partition were started by backfill scheduling. I believe the underlying problem is that the backfill scheduler avoids using nodes which are in a "completing" state. That makes sense if we are trying to start a job immediately, but if we are trying to pick resources for a job to use an hour in the future, those are exactly the type of nodes that we want to be able to make use of. I would recommend making use of an advanced reservation for now. I'll need to study the logic some more and do some testing, but I'm guessing that the patch below will fix the problem. "non_cg_bitmap" is a bitmap of the nodes NOT in a completing state.

diff --git a/src/plugins/sched/backfill/backfill.c b/src/plugins/sched/backfill/backfill.c
index cde86e7..c52bd71 100644
--- a/src/plugins/sched/backfill/backfill.c
+++ b/src/plugins/sched/backfill/backfill.c
@@ -1292,7 +1292,7 @@ next_task:
 		/* Identify usable nodes for this job */
 		bit_and(avail_bitmap, part_ptr->node_bitmap);
 		bit_and(avail_bitmap, up_node_bitmap);
-		bit_and(avail_bitmap, non_cg_bitmap);
+//		bit_and(avail_bitmap, non_cg_bitmap);
 		for (j=0; ; ) {
 			if ((node_space[j].end_time > start_res) &&
 			    node_space[j].next && (later_start == 0))
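The effect of that bit_and() can be illustrated with a toy model using Python integers as bitmaps (this is not Slurm's bitstring API; node names and masks are made up for illustration):

```python
# Toy bitmap model of the backfill node-filtering step. Bit i set means
# node i is usable. Mirrors the three bit_and() calls in backfill.c.

NODE0, NODE1, NODE2, NODE3 = (1 << i for i in range(4))

def usable_nodes(part_bitmap, up_bitmap, non_cg_bitmap, include_completing):
    """Restrict to the partition's nodes, drop down nodes, and (unless
    include_completing) also drop nodes still completing a prior job."""
    avail = part_bitmap & up_bitmap
    if not include_completing:
        avail &= non_cg_bitmap
    return avail

part   = NODE0 | NODE1 | NODE2 | NODE3   # all four nodes in the partition
up     = part & ~NODE0                   # node 0 is down
non_cg = part & ~NODE3                   # node 3 is COMPLETING

# Stock behavior: the completing node 3 is invisible to backfill planning.
assert usable_nodes(part, up, non_cg, include_completing=False) == NODE1 | NODE2

# With the patch idea above: node 3 remains plannable for a future start.
assert usable_nodes(part, up, non_cg, include_completing=True) == NODE1 | NODE2 | NODE3
```

For an immediate start the stock filter is right; for planning a start an hour out, dropping completing nodes needlessly shrinks the candidate set.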
I created a separate trouble ticket for the underflow problem described in comment 11. See: http://bugs.schedmd.com/show_bug.cgi?id=2353
Thank you for looking into this, Moe and Tim. Regarding the completing state of the nodes, I had wondered about this in the past, and had manipulated CompleteWait to see if it would help (thinking that delaying the time at which the scheduler considered the completing nodes unavailable would help prevent this). On cori, we've seen similar effects, but the queue was so complicated I wasn't able to find a clean enough example, and have used the reservation trick in the past. I'm glad you were able to get to the bottom of this! Thank you. I have put these two large jobs into a reservation and I think they should run this morning. This may just encourage the user to submit more of them... =) -Doug
Hi Moe, Encouraged by the success of the user's jobs running in a reservation, the user submitted more, now job IDs 37358 and 37386. In response, I decided to try the patch you sent (I know you said you were still thinking about it). Anyway, I thought I'd let you know that these jobs have still been getting delayed with the patch applied. I'm sending the updated logs from the patched version of slurmctld. Thank you for your help with this, Doug
Created attachment 2621 [details] patched slurmctld logs with more top-job delays
(btw I've put these newest jobs in a reservation to get them to go). Turns out the user runs very short jobs. It would be useful to have a reservation flag that would cancel an active reservation once no jobs were queued/running or otherwise attached to it.
I was able to replicate the backfill scheduling bug based upon your configuration files and logs. Here's a detailed description with an example.

The fundamental issue is that nodes which are in COMPLETING state (in the process of terminating a job) are avoided when it comes to scheduling. For example, let's consider a simple system with four nodes, each with one job running, with the expected end times listed below:

Node name   End time
nid00000    16:00
nid00001    17:00
nid00002    18:00
nid00003    19:00

Now let's say the highest priority pending job needs 3 full nodes. Its expected start time will be 18:00, making use of nid0000[0-2]. Now suppose the job on nid00000 reaches its time limit and is in the process of getting killed. That node will be placed in a COMPLETING state, and if the backfill scheduler runs while it is completing, nid00000 will be removed from consideration and the pending job's expected start time will become 19:00, using nid0000[1-3]. We just slipped the job's expected start time by one hour.

My patch described in comment 15 was a partial, but incomplete, solution. There also needed to be changes in the resource selection plugin (select/cons_res). I've committed a complete solution to GitHub here (append ".patch" to the URL to generate a patch file):
https://github.com/SchedMD/slurm/commit/1a4b5983b13900302a114eb4a61d7b908c0fa2cf

I anticipate tagging a new release of Slurm (version 15.08.7) with this and a number of other bug fixes around January 20. Thank you for your patience. I'm closing the ticket based upon this patch. Please re-open if necessary.
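The four-node example above can be sketched as a small simulation. This is a toy model, not Slurm code; node names and times come from the example:

```python
# Toy model of how excluding COMPLETING nodes from backfill planning
# pushes back a large pending job's expected start time.

def expected_start(node_end_times, nodes_needed, excluded=()):
    """Earliest time at which `nodes_needed` nodes are simultaneously
    free, considering only nodes not in `excluded`."""
    usable = sorted(t for name, t in node_end_times.items()
                    if name not in excluded)
    if len(usable) < nodes_needed:
        return None  # the job can never be planned
    # The job can start once the Nth-soonest usable node frees up.
    return usable[nodes_needed - 1]

ends = {"nid00000": 16, "nid00001": 17, "nid00002": 18, "nid00003": 19}

# All four nodes considered: the 3-node job starts at 18:00.
print(expected_start(ends, 3))                           # prints 18

# nid00000 goes COMPLETING and is filtered out, as
# bit_and(avail_bitmap, non_cg_bitmap) does: start slips to 19:00.
print(expected_start(ends, 3, excluded={"nid00000"}))    # prints 19
```

Since a large job on a busy system almost always has some node in COMPLETING state at each backfill pass, the planned start time can slip indefinitely, which matches the repeated delays in the logs.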
(In reply to Doug Jacobsen from comment #20) > (btw I've put these newest jobs in a reservation to get them to go). Turns > out the user runs very short jobs. It would be useful to have a reservation > flag that would cancel an active reservation once no jobs were > queued/running or otherwise attached to it. That should be pretty simple. I've created a new trouble ticket for that: http://bugs.schedmd.com/show_bug.cgi?id=2355
Hi Moe, Thank you so much for continuing to look at this and for sending me this patch. I applied it, rebuilt Slurm with it in place, and then restarted. While the partitions were down I moved the big jobs out of the reservation and then deleted the reservation. Unfortunately it seems to have happened again: at the termination of a large job (1152 nodes), the 4096-node job was delayed in the next backfill scheduler run (within a few seconds). The primary event for this was at [2016-01-17T05:01:05.284]. I'll send the logs in a few minutes. -Doug
Created attachment 2622 [details] 2nd patch slurmctld.log
Almost certainly a distinct problem/bug. I will not be able to work on this until tonight at the earliest.
Of course, that's understandable :) I just wanted to get the info to you as early as possible.
Now that you mention it, I believe that I often see edison nodes transit through a "mixed" state upon job completion (some still say completing). I wonder if that is related to the gres decrement issue you noticed... I'll see if I can trap any sinfo output that demonstrates this.
(In reply to Doug Jacobsen from comment #27)
> Now that you mention it, I believe that I often see edison nodes transit
> through a "mixed" state upon job completion (some still say completing).

"mixed" is indicative of a node with both allocated and idle CPUs. If there are any jobs in "completing" state on the node, no matter how many CPUs are allocated or idle, its state will be reported as "completing". The good news is that the "underflow" messages from the first log are gone. They got cleaned up on restart, as expected. I did open a separate trouble ticket on that issue. The log here looks very similar to the previous one, in that the top priority job's expected start time gets pushed back when another job is in "completing" state. Forgive me, but I have to ask: are you sure this is running with the new sched/backfill and select/cons_res plugins rather than possibly old/cached versions?
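The state-reporting rule described above can be summarized in a small sketch. This is an assumption-labeled toy model of the precedence, not Slurm source:

```python
# Toy model of per-node state reporting precedence: a completing job on
# the node wins over CPU-count-based states, regardless of allocation.

def reported_state(alloc_cpus, idle_cpus, has_completing_job):
    if has_completing_job:
        return "completing"  # any completing job dominates
    if alloc_cpus and idle_cpus:
        return "mixed"       # both allocated and idle CPUs
    return "allocated" if alloc_cpus else "idle"

print(reported_state(24, 24, False))  # prints mixed
print(reported_state(24, 24, True))   # prints completing
```

On a system with no node sharing, a transient "mixed" report at job completion time would therefore be surprising, which is why the sinfo captures below are of interest.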
Hi Moe, Yeah I was worried about that too, but everything checks out in terms of the installation. The new BACKFILL flags are present in its slurm.h:

nid01605:/var/tmp/slurm # ls -l /opt/slurm/default
lrwxrwxrwx 1 root root 56 Jan 17 01:16 /opt/slurm/default -> 15.08.6_fixsched2350_15.08.6_fixsched2350_20160117010314
nid01605:/var/tmp/slurm # grep BACKFILL /opt/slurm/default/include/slurm/slurm.h
#define BACKFILL_TEST 0x00000008 /* Backfill test in progress */
#define DEBUG_FLAG_BACKFILL 0x0000000000001000 /* debug for
#define DEBUG_FLAG_BACKFILL_MAP 0x0000000008000000 /* Backfill scheduler node
nid01605:/var/tmp/slurm # ls -l /opt/slurm/default/lib/slurm/*cons*
-rw-r--r-- 1 root root 410748 Jan 17 01:06 /opt/slurm/default/lib/slurm/select_cons_res.a
-rwxr-xr-x 1 root root 1034 Jan 17 01:06 /opt/slurm/default/lib/slurm/select_cons_res.la
-rwxr-xr-x 1 root root 251569 Jan 17 01:06 /opt/slurm/default/lib/slurm/select_cons_res.so
nid01605:/var/tmp/slurm # ls -l /opt/slurm/default/lib/slurm/*backfill*
-rw-r--r-- 1 root root 168800 Jan 17 01:06 /opt/slurm/default/lib/slurm/sched_backfill.a
-rwxr-xr-x 1 root root 1027 Jan 17 01:06 /opt/slurm/default/lib/slurm/sched_backfill.la
-rwxr-xr-x 1 root root 109748 Jan 17 01:06 /opt/slurm/default/lib/slurm/sched_backfill.so
nid01605:/var/tmp/slurm #

Regarding the "mixed" state, we do not have any shared nodes or allow over subscription on edison, so the transit through "mixed" state has been confusing to me.
I captured this output earlier, I'll try to get a relevant section of the logs in a bit: running "sinfo -p system" every 10s, I caught this: Sun Jan 17 13:08:29 PST 2016 PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE NODELIST system up 1-infini 30:00 48 2:12:2 2 down* nid[00666,04250] system up 1-infini 30:00 48 2:12:2 5113 allocated nid[00008-00023,00233-00255,00264-00287,00297-00313,00334-00337,00350-00383,00388-00451,00456-00511,00516-00533,00537-00579,00584-00665,00667-00767,00772-00834,00849-00895,00900-00963,00968-01132,01143-01151,01156-01219,01224-01279,01284-01483,01502-01535,01540-01603,01608-01663,01668-01804,01809-01905,01916-01919,01924-01987,01992-02000,02007-02047,02052-02070,02077-02295,02301-02303,02308-02371,02376-02431,02436-02446,02456-02472,02490-02687,02692-02755,02760-02815,02820-02887,02893-03071,03076-03139,03144-03330,03363-03455,03460-03523,03528-03578,03589-03656,03667-03839,03844-03872,03883-03907,03912-03952,03964-04223,04228-04249,04251-04291,04296-04572,04587-04607,04992-05179,05185-05218,05226-05277,05280-05443,05448-05616,05633-05677,05697-05759,05764-05827,05832-05951,05956-06143] system up 1-infini 30:00 48 2:12:2 461 idle nid[00024-00063,00072-00127,00136-00191,00200-00232,00288-00296,00314-00323,00328-00333,00338-00349,00534-00536,00835,00840-00848,01133-01142,01484-01501,01805-01808,01906-01915,02001-02006,02071-02076,02296-02300,02447-02455,02473-02489,02888-02892,03331-03362,03579-03588,03657-03666,03873-03882,03953-03963,04573-04586,05180-05184,05219-05225,05278-05279,05617-05632,05678-05696] Sun Jan 17 13:08:39 PST 2016 PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE NODELIST system up 1-infini 30:00 48 2:12:2 2 down* nid[00666,04250] system up 1-infini 30:00 48 2:12:2 5026 allocated 
nid[00008-00023,00233-00255,00264-00287,00297-00313,00334-00337,00350-00383,00388-00451,00456-00511,00516-00533,00537-00579,00584-00665,00667-00767,00772-00806,00821-00834,00849-00895,00900-00963,00968-00992,00994-01132,01143-01151,01156-01219,01224-01279,01284-01483,01502-01535,01540-01603,01608-01663,01668-01804,01809-01905,01916-01919,01924-01962,02007-02047,02052-02070,02077-02295,02301-02303,02308-02371,02376-02431,02436-02446,02456-02472,02490-02687,02692-02755,02760-02815,02820-02887,02893-02894,02933-03071,03076-03139,03144-03330,03363-03455,03460-03523,03528-03578,03589-03656,03667-03839,03844-03872,03883-03907,03912-03952,03964-04223,04228-04249,04251-04291,04296-04572,04587-04607,04992-05179,05185-05218,05226-05277,05280-05443,05448-05616,05633-05677,05697-05759,05764-05827,05832-05951,05956-06143] system up 1-infini 30:00 48 2:12:2 548 mixed nid[00024-00063,00072-00127,00136-00191,00200-00232,00288-00296,00314-00323,00328-00333,00338-00349,00534-00536,00807-00820,00835,00840-00848,00993,01133-01142,01484-01501,01805-01808,01906-01915,01963-01987,01992-02006,02071-02076,02296-02300,02447-02455,02473-02489,02888-02892,02895-02932,03331-03362,03579-03588,03657-03666,03873-03882,03953-03963,04573-04586,05180-05184,05219-05225,05278-05279,05617-05632,05678-05696] Sun Jan 17 13:08:49 PST 2016 PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE NODELIST system up 1-infini 30:00 48 2:12:2 2 down* nid[00666,04250] system up 1-infini 30:00 48 2:12:2 5025 allocated 
nid[00008-00023,00233-00255,00264-00287,00297-00313,00334-00337,00350-00383,00388-00451,00456-00511,00516-00533,00537-00579,00584-00665,00667-00767,00772-00806,00821-00834,00849-00895,00900-00963,00968-00992,00994-01132,01143-01151,01156-01219,01224-01279,01284-01483,01502-01535,01540-01603,01608-01663,01668-01804,01809-01905,01916-01919,01924-01962,02007-02047,02052-02070,02077-02295,02301-02303,02308-02371,02376-02431,02436-02446,02456-02472,02490-02687,02692-02755,02760-02815,02820-02887,02893-02894,02933-03071,03076-03139,03144-03330,03363-03455,03460-03523,03528-03578,03589-03656,03667-03839,03844-03872,03883-03907,03912-03952,03964-04223,04228-04249,04251-04291,04296-04572,04587-04607,04992-05179,05185-05218,05226-05277,05280-05443,05448-05597,05599-05616,05633-05677,05697-05759,05764-05827,05832-05951,05956-06143] system up 1-infini 30:00 48 2:12:2 549 idle nid[00024-00063,00072-00127,00136-00191,00200-00232,00288-00296,00314-00323,00328-00333,00338-00349,00534-00536,00807-00820,00835,00840-00848,00993,01133-01142,01484-01501,01805-01808,01906-01915,01963-01987,01992-02006,02071-02076,02296-02300,02447-02455,02473-02489,02888-02892,02895-02932,03331-03362,03579-03588,03657-03666,03873-03882,03953-03963,04573-04586,05180-05184,05219-05225,05278-05279,05598,05617-05632,05678-05696] ---- Doug Jacobsen, Ph.D. 
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/ (_)__________________________

On Sun, Jan 17, 2016 at 7:11 PM, <bugs@schedmd.com> wrote:

> Comment #28 <http://bugs.schedmd.com/show_bug.cgi?id=2350#c28> on bug 2350 <http://bugs.schedmd.com/show_bug.cgi?id=2350> from Moe Jette <jette@schedmd.com>
>
> (In reply to Doug Jacobsen from comment #27)
> > Now that you mention it, I believe that I often see edison nodes transit
> > through a "mixed" state upon job completion (some still say completing).
>
> "mixed" is indicative of allocated and idle CPUs on a node. If there are any
> jobs in "completing" state on the node, no matter how many CPUs are allocated
> or idle, its state will be reported as "completing".
>
> The good news is that the "underflow" messages from the first log are gone. They
> got cleaned up on restart, as expected. I did open a separate trouble ticket on
> that issue.
>
> The log here looks very similar to the previous one in that the top priority
> job's expected start time gets pushed back when another job is in "completing"
> state. Forgive me, but I have to ask: are you sure this is running with the new
> sched/backfill and select/cons_res plugins rather than possibly old/cached
> versions?
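Eyeballing successive captures like the ones above is painful; a small awk summary of nodes per state makes the shifts easier to spot. This is a sketch (the `summarize` helper and sample file are mine; column positions follow the sinfo header above, where NODES is field 7 and STATE is field 8):

```shell
# Sum the NODES column per STATE from captured "sinfo -p system" output.
summarize() {
    awk '$1 == "system" { count[$8] += $7 }
         END { for (s in count) print s, count[s] }' "$1" | sort
}

# Abbreviated sample mirroring the first capture above:
cat > /tmp/sinfo.sample <<'EOF'
system up 1-infini 30:00 48 2:12:2 2 down* nid[00666,04250]
system up 1-infini 30:00 48 2:12:2 5113 allocated nid[00008-00023]
system up 1-infini 30:00 48 2:12:2 461 idle nid[00024-00063]
EOF

summarize /tmp/sinfo.sample
# prints:
#   allocated 5113
#   down* 2
#   idle 461
```

On a live system one would pipe `sinfo -p system` output straight into the same awk script instead of a capture file.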
(In reply to Doug Jacobsen from comment #30)
> Regarding the "mixed" state, we do not have any shared nodes or allow over
> subscription on edison, so the transit through "mixed" state has been
> confusing to me.
>
> I captured this output earlier, I'll try to get a relevant section of the
> logs in a bit:

The mixed state would be indicative of a node being in an ALLOCATED state, but with some of the CPUs flagged as IDLE in a data structure. Clearly not a healthy situation. I have been studying the code and have not been able to reproduce or identify the source of this problem.

The patch that you have already applied definitely fixes a bug with respect to backfill scheduling while a job is in COMPLETING state, but this is a different problem. Please attach those logs when you have a chance.
Hi Moe,

I'm offsite at an all-day thing today, so I will get back to you with logs tonight. I can definitely say that the patch is helping significantly and I'm not seeing the level of issues I was before the fix -- thank you! edison achieved >95% utilization yesterday, which is a vast improvement. This patch will go onto cori tomorrow (along with a rapid upgrade to 15.08.7 once it comes out).

I have a top-priority 98%-scale job in the queue for edison right now, so that should be a fairly good test to see how this works. It is scheduled to run tomorrow at 19:22.

Thanks again,
Doug
(In reply to Doug Jacobsen from comment #32) > I'm not seeing the level of issues I was before the fix -- thank you! > edison achieved >95% utilization yesterday which is a vast improvement. Excellent! While I have not been able to reproduce the "mixed" node state, I suspect that it is related to core specialization. If I'm correct, the "mixed" state should be changed to "allocated" in the sinfo output, but I do not believe that is adversely impacting scheduling of pending jobs (which was my original concern). In any case, I'm continuing to work on this...
Update: I don't have a fix for you yet, but I understand what is happening and will not require additional logs. This is a distinct bug from the one previously diagnosed, for which you already have a patch. There is a race condition in the job termination logic with respect to how Slurm runs the Cray Node Health Check (NHC). Not that it helps you, but this backfill bug will affect only Cray systems with NHC enabled. It is likely to be most significant when a job ends abnormally (non-zero exit code, reaching its time limit, etc.).
We do want to tag Slurm version 15.08.7 today with a multitude of bug fixes, but this is going to require a fairly complex change that will likely not be complete today. We will provide you with a patch when available. You might consider disabling NHC, which would eliminate this race condition. At Cray's request, that will be the default behaviour in the next major release of Slurm (version 16.05). NHC can be disabled by adding "NHC_NO_STEPS,NHC_NO" to your SelectTypeParameters value in slurm.conf.
Ah! That makes sense! I greatly appreciate it.

-Doug
Sounds good -- I'm looking forward to a number of the items in 15.08.7 and am awaiting its release anxiously.
What if we called the NHC from a job epilog instead, and prevented that epilog from completing until either the node is deemed clean or the suspect period expires and the node is marked admindown? Would that allow us to continue using the NHC without generating this race condition?
I have almost no interest in running the post-step NHC and will recommend that we disable that, at least.
Doing what happens in the NHC in the job epilog would be the dream of us all, but it won't work as you would expect. NHC has to be run from the node where the slurmctld runs. It wouldn't work from the EpilogSlurmctld either, as the nodes have already been given back to the system at that point, which would cause a different race condition.

I would also suggest just disabling NHC. NHC_NO implies NHC_NO_STEPS, so you would only need to add NHC_NO to SelectTypeParameters to completely disable it. If you start seeing issues, it is rather easy to turn back on, with hopefully minimal issues, if any. As Moe said, Cray has asked that it be disabled by default, so it apparently isn't that needed to begin with.
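For reference, the corrected form of that change would look like this in slurm.conf (a sketch only; any SelectTypeParameters options already set on this system would need to be merged in, not replaced):

```
# slurm.conf: disable the Cray NHC entirely.
# NHC_NO implies NHC_NO_STEPS, so the single flag suffices.
SelectTypeParameters=NHC_NO
```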
Does that mean that Cray will also stop using the NHC for ALPS? Or do they expect native WLM sites to come up with some other mechanism for monitoring node health?
I don't know. With Slurm there are definitely ways to monitor things outside of NHC. But I don't know what the end game is for this.
I'm not 100% sure, but I think much of the end-of-job node cleanup and memory compaction is initiated by the NHC. I'm going to have to take a pretty serious look at the impact before disabling it. Speaking with a few of the others in my group, there does seem to be a generally negative feeling towards disabling the end-of-job NHC check.

I'll look forward to the race condition fix, and will use the new purge_comp capability for reservations in the meantime. Though purge_comp didn't work for me just now (invalid flag); I will have to check whether the patch made it all the way in.

Thanks so much!
-Doug
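For the record, the reservation workaround being referred to would look roughly like this (the reservation name is hypothetical; an "invalid flag" error from scontrol, as seen above, would suggest the commit adding the flag is not in this build):

```
scontrol update ReservationName=big_job_resv Flags=PURGE_COMP
```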
I am just the messenger from what Cray asked us to do for 16.05 :). I would be very interested if there does appear to be a negative impact from not running NHC.
A fix for the Cray-specific backfill bug (the NHC race condition) is now available. This fix will be in version 15.08.8 when released, so you'll need to manage this as a patch for now. This is a second, Cray-specific patch, in addition to the patch described in comment 21, which was in the version 15.08.7 release. The commit of the new patch is here:

https://github.com/SchedMD/slurm/commit/79a21bd697cf2fd365e497872387628a1c670b39

Please re-open the ticket if you encounter any more problems.
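One way to carry this as a local patch until 15.08.8 ships (a sketch; GitHub serves any commit with a .patch suffix appended to its URL, and the source directory name assumes a 15.08.7 tarball):

```
curl -LO https://github.com/SchedMD/slurm/commit/79a21bd697cf2fd365e497872387628a1c670b39.patch
cd slurm-15.08.7
patch -p1 < ../79a21bd697cf2fd365e497872387628a1c670b39.patch
```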