Ticket 1774

Summary: Jobs in a lower priority queue get onto nodes before jobs in a higher priority queue
Product: Slurm Reporter: Akmal Madzlan <akmalm>
Component: Scheduling Assignee: Moe Jette <jette>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 2 - High Impact    
Priority: --- CC: brian, da, hajek, jbjohnston
Version: 14.11.5   
Hardware: Linux   
OS: Linux   
Site: DownUnder GeoSolutions
Version Fixed: 14.11.8
Attachments: slurmctld log, config and some squeue output attached
scontrol show jobs 14121153
slurmctld.log

Description Akmal Madzlan 2015-07-01 17:06:12 MDT
Created attachment 2004 [details]
slurmctld log, config and some squeue output attached

[root@pud22 ~]# squeue -aw clus[081-084,086-103,109-116,241-254],pnod[0666-0707]
             JOBID PARTITION     NAME   USER   ST
      14125227_157 teamcottG 230_SI_R   user   R
       14125227_45 teamcottG 230_SI_R   user   R
     14121150_1665 teamcottG 221_Asim   user1  R
       14121153_72 teamcottG 221_Asim   user1  R
       14121153_71 teamcottG 221_Asim   user1  R
       14121153_70 teamcottG 221_Asim   user1  R
       14121153_68 teamcottG 221_Asim   user1  R
       14121153_67 teamcottG 221_Asim   user1  R
       14121153_66 teamcottG 221_Asim   user1  R
      14112808_229   idleGpu rt_sten0   user2  R
       14121153_65 teamcottG 221_Asim   user1  R
       14121153_64 teamcottG 221_Asim   user1  R
       14121153_63 teamcottG 221_Asim   user1  R
       14121153_62 teamcottG 221_Asim   user1  R

user2's job (14112808_229) should not have gotten onto the node, because many of user1's jobs are still in PD state. Those should get higher priority, since they are in the higher priority queue.

One thing I noticed is that the job was scheduled by backfill. Possibly a backfill bug?

slurmctld log, config and some squeue output attached
Comment 1 Brian Christiansen 2015-07-01 17:52:11 MDT
Can you attach "scontrol show jobs" output as well? I see that a lot of the pending jobs are being held back because of dependencies. It's possible that the dependencies created a hole for the jobs to be backfilled into. I was able to recreate a similar situation.
Comment 2 Akmal Madzlan 2015-07-01 18:11:33 MDT
Created attachment 2005 [details]
scontrol show jobs 14121153

This is the scontrol output for jobs 14121153 (user1).
All of their jobs are waiting for Resources.
Comment 3 Akmal Madzlan 2015-07-01 18:50:02 MDT
Another thing I notice is that the jobs that 'steal' the node always have lower JobIDs, meaning they were submitted earlier?

Maybe they were already scheduled onto that node before the jobs in the higher priority queue were submitted?
Comment 4 Moe Jette 2015-07-02 04:24:52 MDT
(In reply to Akmal Madzlan from comment #3)
> Another thing I notice is the job that 'steal' the node always have a lower
> JobID, meaning that they are submitted earlier?
> 
> Maybe they are already scheduled for that node before the job on the higher
> priority queue is submitted?

You have "SchedulerParameters=bf_continue" configured. That means the backfill scheduler builds a priority-ordered queue of pending jobs, then proceeds through the full queue (up to 50000 jobs in your case, due to "bf_max_job_test=50000"), periodically releasing locks so that other events can take place, such as submitting new jobs. Depending upon your configuration, the backfill scheduler could take minutes to work its way through the full queue. If new higher-priority jobs become eligible during that interval, they will not be considered by the backfill scheduler, which will continue to work its way through the old queue and start jobs.
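A toy model of that behavior (hypothetical Python, not Slurm source) shows why a lower-priority job can start ahead of a higher-priority one: the backfill pass iterates over a queue snapshot taken at cycle start, so a job submitted mid-cycle simply is not in the list being walked.

```python
# Hypothetical sketch of a bf_continue-style backfill pass: it schedules
# only from the snapshot taken when the cycle began.

def backfill_cycle(snapshot, free_nodes):
    """Start every job in the snapshot (highest priority first) that fits."""
    started = []
    for job in sorted(snapshot, key=lambda j: -j["prio"]):
        if job["nodes"] <= free_nodes:
            free_nodes -= job["nodes"]
            started.append(job["id"])
    return started

# Queue as it looked when the (long) cycle began:
snapshot = [{"id": "idleGpu_job", "prio": 100, "nodes": 1}]

# A priority-500 job arrives while the cycle is still running, but it is
# not in `snapshot`, so the low-priority job takes the node.
print(backfill_cycle(snapshot, free_nodes=1))  # -> ['idleGpu_job']
```

The longer a single cycle runs, the larger the window in which newly eligible higher-priority jobs are invisible to it.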

Could you run the "sdiag" command and append its output? That will provide some information about how long the backfill scheduler runs.
Comment 5 Akmal Madzlan 2015-07-02 09:59:11 MDT
Here is sdiag output.
I didn't include the RPC stats because the output is too long and I don't think you need it.

[root@pud22 ~]# sdiag
*******************************************************
sdiag output at Fri Jul  3 05:54:51 2015
Data since      Thu Jul  2 08:00:02 2015
*******************************************************
Server thread count: 3
Agent queue size:    0

Jobs submitted: 2699
Jobs started:   16959
Jobs completed: 16870
Jobs canceled:  718
Jobs failed:    0

Main schedule statistics (microseconds):
	Last cycle:   99384
	Max cycle:    318869
	Total cycles: 7650
	Mean cycle:   80156
	Mean depth cycle:  1170
	Cycles per minute: 5
	Last queue length: 2425

Backfilling stats
	Total backfilled jobs (since last slurm start): 48341
	Total backfilled jobs (since last stats cycle start): 6962
	Total cycles: 1338
	Last cycle when: Fri Jul  3 05:53:55 2015
	Last cycle: 3247904657
	Max cycle:  3980170579
	Mean cycle: 1384795
	Last depth cycle: 2428
	Last depth cycle (try sched): 2305
	Depth Mean: 2396
	Depth Mean (try depth): 2287
	Last queue length: 2419
	Queue length mean: 2392
Comment 6 Moe Jette 2015-07-02 10:12:05 MDT
(In reply to Akmal Madzlan from comment #5)
> Here is sdiag output.
> I didnt include RPC stats because it is too long and I think you dont need
> it?

What you sent is exactly what I wanted to see.

Here's the interesting bit; the cycle times are in microseconds:
> Backfilling stats
> 	Last cycle: 3247904657
> 	Max cycle:  3980170579
> 	Mean cycle: 1384795

So you're averaging about 1.4 seconds, which is pretty reasonable, but the worst case was 66 minutes. That isn't 66 minutes of continuous run time, but roughly 2 seconds of running alternating with at least 0.5 seconds of sleeping so that other operations can happen. Still, 66 minutes is certainly sufficient time for a lot of other events (jobs coming and going, etc.). In any case, I'm used to seeing times well under 1 minute in the worst case. You've only got a couple of thousand jobs, so it's difficult for me to imagine where all of that time is going. Does anything come to mind?
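The conversion behind those figures is straightforward; sdiag reports cycle times in microseconds, and the quoted values come straight from the sdiag output in comment 5:

```python
# sdiag backfill cycle times are reported in microseconds.
mean_cycle_us = 1_384_795      # "Mean cycle" from comment 5
max_cycle_us = 3_980_170_579   # "Max cycle" from comment 5

print(f"mean cycle: {mean_cycle_us / 1e6:.1f} s")        # -> mean cycle: 1.4 s
print(f"max cycle:  {max_cycle_us / 1e6 / 60:.0f} min")  # -> max cycle:  66 min
```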
Comment 7 Akmal Madzlan 2015-07-02 10:22:52 MDT
(In reply to Moe Jette from comment #6)
> Does anything come to your mind?
I honestly have no idea.

This is another occurrence.

[root@pud22 ~]# squeue -aw clus[001-030]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      14099126_869      idle pt2_030_   user99  R    1:28:02      1 clus025
      14099126_868      idle pt2_030_   user99  R    1:37:16      1 clus028
      14099126_867      idle pt2_030_   user99  R    1:46:17      1 clus022
      14099126_866      idle pt2_030_   user99  R    2:04:58      1 clus006
      14099126_864      idle pt2_030_   user99  R    2:18:55      1 clus003
      14099126_863      idle pt2_030_   user99  R    2:31:44      1 clus026
      14099126_860      idle pt2_030_   user99  R    2:49:28      1 clus030
     14121153_1368  teamcott 221_Asim   user1   R       8:15      1 clus007
     14121153_1339  teamcott 221_Asim   user1   R      30:21      1 clus008
     14121153_1333  teamcott 221_Asim   user1   R      33:36      1 clus029
     14121153_1332  teamcott 221_Asim   user1   R      33:44      1 clus015
     14121153_1329  teamcott 221_Asim   user1   R      34:44      1 clus004
     14121153_1318  teamcott 221_Asim   user1   R      41:05      1 clus010
     14121153_1317  teamcott 221_Asim   user1   R      44:05      1 clus027
     14121153_1306  teamcott 221_Asim   user1   R      56:48      1 clus002
     14121153_1298  teamcott 221_Asim   user1   R    1:11:53      1 clus005
     14121153_1299  teamcott 221_Asim   user1   R    1:11:53      1 clus024
     14121153_1300  teamcott 221_Asim   user1   R    1:11:53      1 clus001
     14121153_1297  teamcott 221_Asim   user1   R    1:12:27      1 clus019
     14121153_1284  teamcott 221_Asim   user1   R    1:18:33      1 clus018
     14121153_1280  teamcott 221_Asim   user1   R    1:21:33      1 clus023
     14121153_1231  teamcott 221_Asim   user1   R    1:39:16      1 clus016
     14121153_1228  teamcott 221_Asim   user1   R    1:41:37      1 clus012
     14121153_1227  teamcott 221_Asim   user1   R    1:43:37      1 clus013
     14121153_1226  teamcott 221_Asim   user1   R    1:44:37      1 clus014
     14121153_1220  teamcott 221_Asim   user1   R    1:46:37      1 clus020
     14121153_1208  teamcott 221_Asim   user1   R    1:51:56      1 clus009
     14121153_1166  teamcott 221_Asim   user1   R    2:04:10      1 clus011
      14121153_990  teamcott 221_Asim   user1   R    3:26:50      1 clus017

Upon further inspection, not all jobs that steal the node are allocated by backfill:

[root@pque0001 ~]# cat /var/log/slurm/slurmctld.log | grep 14099126_869
[2015-07-03T04:32:09.122] backfill: Started JobId=14099126_869 (14138523) on clus025

[root@pque0001 ~]# cat /var/log/slurm/slurmctld.log | grep 14099126_868
[2015-07-03T04:22:55.672] sched: Allocate JobId=14099126_868 (14138406) NodeList=clus028 #CPUs=64

[root@pque0001 ~]# cat /var/log/slurm/slurmctld.log | grep 14099126_867
[2015-07-03T04:13:54.698] sched: Allocate JobId=14099126_867 (14138333) NodeList=clus022 #CPUs=64

[root@pque0001 ~]# cat /var/log/slurm/slurmctld.log | grep 14099126_866
[2015-07-03T03:55:13.466] backfill: Started JobId=14099126_866 (14138171) on clus006

[root@pque0001 ~]# cat /var/log/slurm/slurmctld.log | grep 14099126_864
[2015-07-03T03:41:16.084] backfill: Started JobId=14099126_864 (14138067) on clus003

[root@pque0001 ~]# cat /var/log/slurm/slurmctld.log | grep 14099126_863
[2015-07-03T03:28:27.864] backfill: Started JobId=14099126_863 (14137965) on clus026

[root@pque0001 ~]# cat /var/log/slurm/slurmctld.log | grep 14099126_860
[2015-07-03T03:10:43.169] sched: Allocate JobId=14099126_860 (14137853) NodeList=clus030 #CPUs=64
Comment 8 Moe Jette 2015-07-02 11:16:16 MDT
What I would like to do is concentrate on getting the worst-case backfill scheduling cycle time down to something reasonable. Having it run with hour-old data is definitely going to cause scheduling anomalies.

I'd like to add a new parameter that will exit the backfill scheduler loop after reaching some timeout, say 10 seconds, even if it has not tested all of the jobs, but that will not help you right now.

Here are a couple of suggested additions/changes to the SchedulerParameters configuration parameter to help now:

bf_resolution=600   (Increase from 300)
bf_max_job_part=100  (see slurm.conf for details, perhaps a different value would be better for your workload)
bf_max_job_test=1000  (see slurm.conf, perhaps decrease from 50000)
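Taken together, and assuming they are merged with whatever other options are already on the existing SchedulerParameters line (bf_continue is known to be set at this site; anything else configured there must be preserved), the suggestion would look something like this in slurm.conf:

```
SchedulerParameters=bf_continue,bf_resolution=600,bf_max_job_part=100,bf_max_job_test=1000
```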

I would periodically run the sdiag command and look specifically at these items:
Backfilling stats
	Last cycle: 3247904657
	Max cycle:  3980170579
	Mean cycle: 1384795

You can use "sdiag --reset" as root to reset the counters.
Comment 9 Akmal Madzlan 2015-07-02 15:15:09 MDT
I've made those scheduler parameter changes, and this is our current sdiag output:

[root@pud22 ~]# sdiag
*******************************************************
sdiag output at Fri Jul  3 11:14:00 2015
Data since      Fri Jul  3 09:23:12 2015
*******************************************************
Server thread count: 4
Agent queue size:    0

Jobs submitted: 152
Jobs started:   1671
Jobs completed: 1642
Jobs canceled:  41
Jobs failed:    0

Main schedule statistics (microseconds):
	Last cycle:   41422
	Max cycle:    278514
	Total cycles: 651
	Mean cycle:   69400
	Mean depth cycle:  1095
	Cycles per minute: 5
	Last queue length: 2222

Backfilling stats
	Total backfilled jobs (since last slurm start): 662
	Total backfilled jobs (since last stats cycle start): 662
	Total cycles: 111
	Last cycle when: Fri Jul  3 11:13:49 2015
	Last cycle: 1701825
	Max cycle:  2513933656
	Mean cycle: 30731377
	Last depth cycle: 2222
	Last depth cycle (try sched): 617
	Depth Mean: 2131
	Depth Mean (try depth): 566
	Last queue length: 2222
	Queue length mean: 2128
Comment 10 Akmal Madzlan 2015-07-02 15:30:08 MDT
Another occurrence.
This one is a bit different because it is within the same queue.
Job 14143084 should not start until jobs 14121153 and 14144213 are done, am I right? Both of them are waiting for Resources.

[root@pud22 ~]# squeue -p teamcott -o "%20i %20P %10u %5t %10M %10Q"
JOBID                PARTITION            USER      ST    TIME       PRIORITY  
14144213_[265-288]   idle,teamcott        user      PD    0:00       500       
14121153_[2149-2182, teamcott,teamcottGpu user1     PD    0:00       500       
14143084_[379,381,38 teamcott,idle        user99    PD    0:00       100       
14121153_2145        teamcott             user1     R     1:32       500       
14121153_2138        teamcott             user1     R     10:32      500       
14121153_2119        teamcott             user1     R     21:32      500       
14121153_2115        teamcott             user1     R     22:19      500       
14121153_2081        teamcott             user1     R     36:32      500       
14121153_2082        teamcott             user1     R     36:32      500       
14121153_2036        teamcott             user1     R     48:57      500       
14121153_2035        teamcott             user1     R     49:12      500       
14121153_1930        teamcott             user1     R     1:10:33    500       
14121153_1926        teamcott             user1     R     1:12:53    500       
14121153_1924        teamcott             user1     R     1:13:41    500       
14121153_1894        teamcott             user1     R     1:32:13    500       
14121153_1577        teamcott             user1     R     1:32:53    500       
14121153_1887        teamcott             user1     R     1:35:33    500       
14121153_1882        teamcott             user1     R     1:37:33    500       
14121153_1876        teamcott             user1     R     1:40:33    500       
14121153_1868        teamcott             user1     R     1:44:54    500       
14121153_1390        teamcott             user1     R     1:55:14    500       
14121153_1846        teamcott             user1     R     1:56:54    500       
14121153_1834        teamcott             user1     R     2:02:55    500       
14121153_1819        teamcott             user1     R     2:11:36    500       
14121153_1818        teamcott             user1     R     2:12:16    500       
14121153_1812        teamcott             user1     R     2:13:33    500       
14121153_1792        teamcott             user1     R     2:20:17    500       
14121153_1793        teamcott             user1     R     2:20:17    500       
14143084_377         teamcott             user99    R     24:32      100       
14143084_375         teamcott             user99    R     38:52      100       
14143084_371         teamcott             user99    R     48:32      100       
14121153_1783        teamcott             user1     R     49:32      100       
14129605_42          teamcott,idle        user3     SE    0:00       0         
14121153_2058        teamcott,teamcottGpu user1     SE    0:00       0         
14121153_2065        teamcott,teamcottGpu user1     SE    0:00       0
Comment 11 Akmal Madzlan 2015-07-02 19:29:19 MDT
I think those changes do improve things.
I haven't seen any jobs from a lower priority queue get onto nodes before jobs in a higher priority queue today.
Comment 12 Moe Jette 2015-07-06 05:14:13 MDT
I concentrated on getting the worst-case backfill scheduling cycle time down to something reasonable. Having it run with hour-old data is definitely going to cause scheduling anomalies. What I've done is modify the code to treat bf_interval as a maximum run time for the backfill scheduler. In your case, that's 50 seconds, which includes the run time plus the sleep time (2 seconds of run time alternating with 0.5 seconds of sleeping for other operations). The downside is that the backfill scheduler will go through many, but perhaps not all, of the pending jobs.

The commit/patch with the change is here:
https://github.com/SchedMD/slurm/commit/7e9442207a61b66f7428f91ea97dbc68b53c881d.patch

We will almost certainly tag a new version of Slurm this week with bug fixes, so you might want to use that rather than the patch if you can wait an extra day or two.
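The shape of that change can be sketched as follows (a hypothetical illustration of the described behavior, not the actual patch): one backfill pass stops once a wall-clock cap is reached, leaving the rest of the queue for the next cycle.

```python
# Hypothetical sketch: bf_interval as a wall-clock cap on one backfill pass.
import time

def bounded_backfill(jobs, try_start, interval=50, clock=time.monotonic):
    """Test jobs in queue order until `interval` seconds have elapsed."""
    deadline = clock() + interval
    tested = 0
    for job in jobs:
        if clock() >= deadline:  # cap reached: leave the rest for next cycle
            break
        try_start(job)
        tested += 1
    return tested  # number of jobs actually examined this cycle
```

With a 50-second cap and thousands of pending jobs, each cycle examines only as many jobs as fit in the window, but it restarts from fresh (not hour-old) state every time.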

I suspect that the really long times are related to your configuration parameter of "OverTimeLimit=UNLIMITED". I need to study the ramifications of that some more.
Comment 13 Moe Jette 2015-07-06 06:13:26 MDT
(In reply to Moe Jette from comment #12)
> I suspect that the really long times are related to your configuration
> parameter of "OverTimeLimit=UNLIMITED". I need to study the ramifications of
> that some more.

This is almost certainly the reason for the really long run times of the backfill scheduler and other scheduling anomalies. I've added logic so the backfill scheduler considers the OverTimeLimit and KillWait configuration parameters, but only for jobs that have exceeded their time limit. The commit is here:
https://github.com/SchedMD/slurm/commit/edfbabe6125336bbb526e3bd6fa12155e0b1bf9d.patch

This change should also improve the responsiveness of Slurm for you, at least when there are jobs which have exceeded their time limits.
Comment 14 Akmal Madzlan 2015-07-06 14:16:15 MDT
Thanks Moe,
If I understand correctly, the first patch will make bf_interval be treated as OverTimeLimit? So if we use OverTimeLimit=UNLIMITED, will the second patch still work?

Akmal
Comment 15 Moe Jette 2015-07-06 15:12:46 MDT
(In reply to Akmal Madzlan from comment #14)
> Thanks Moe,
> If I understand correctly, the first patch will make bf_interval be treated
> as OverTimeLimit? So if we use OverTimeLimit=UNLIMITED, the second patch
> will still works?
> 
> Akmal

No. The first patch will cause the backfill scheduler to stop its current cycle after running for bf_interval (50 seconds in your case). It will then sleep for bf_interval, build a new job queue, and start over.

The second patch fixes the real problem that caused the backfill scheduler to run for so long.

The second patch addresses jobs that have exceeded their time limit. With the current code (i.e. without the patch), Slurm assumes the job will end within the next second, creates a reservation for it, and goes on to the next job. Since the backfill scheduler might run for longer than that, this messes up the planned schedule for pending jobs. With the second patch, a job that has exceeded its time limit gets a new estimated end time: its scheduled end time (start time plus time limit) plus OverTimeLimit plus KillWait. With OverTimeLimit=UNLIMITED, that pushes the expected end time out beyond anything the backfill scheduler will reserve resources for. A finite value for OverTimeLimit would probably help, but this seems like the best option to me: it prevents the backfill scheduler from running for a long time without making real progress.
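That end-time arithmetic can be written out directly (an assumed sketch of the described logic, not the actual Slurm code; the times and limits here are illustrative values):

```python
# Sketch: expected end time for a job past its limit, per the second patch:
# start + TimeLimit + OverTimeLimit + KillWait, instead of "one second from now".
from datetime import datetime, timedelta

def expected_end(start, time_limit, over_time_limit, kill_wait):
    if over_time_limit is None:  # models OverTimeLimit=UNLIMITED
        return datetime.max      # effectively never, for planning purposes
    return start + time_limit + over_time_limit + kill_wait

# A 1-hour job started at 13:02, with OverTimeLimit=30 min and KillWait=30 s:
print(expected_end(datetime(2015, 8, 4, 13, 2), timedelta(hours=1),
                   timedelta(minutes=30), timedelta(seconds=30)))
# -> 2015-08-04 14:32:30
```

With OverTimeLimit=UNLIMITED the estimate is pushed out indefinitely, so backfill never plans around that job's resources freeing up.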
Comment 16 Moe Jette 2015-07-08 03:47:54 MDT
Slurm version 14.11.8 has been released with this bug fix.
Comment 17 Akmal Madzlan 2015-08-05 20:30:58 MDT
I think we are seeing this issue again.
Jobs seem to be scheduled based on the job priority alone, ignoring the partition priority.

User A submits job1 with priority=100 to partitionA with priority=21.
User B submits job2 with priority=1000 to partitionB with priority=20.

User B's job goes first.

Both jobs are submitted to multiple partitions with mixed priorities.
Comment 18 Akmal Madzlan 2015-08-05 20:43:32 MDT
- Lower priority queue

[akmalm@pud22 ~]$ scontrol show partition idleGpu
PartitionName=idleGpu
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=01:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=clus[081-160,241-320,353-362,626-665],pnod[0666-0867]
   Priority=11 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=12608 TotalNodes=412 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

- Higher Priority queue

[akmalm@pud22 ~]$ scontrol show partition teamrottoGpu
PartitionName=teamrottoGpu
   AllowGroups=teamrotto,geodev AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=01:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=pnod[0748-0772],clus[136-154,268-270]
   Priority=21 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=1456 TotalNodes=47 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

- Job running in teamrottoGpu

[akmalm@pud22 ~]$ squeue -aw pnod[0748-0772],clus[136-154,268-270]
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON) JOBID
idleGpu     1000       rt_tomo4a_TTI_Panaus     zayn  R    1:13:04      1 pnod0763        14795781_272
idleGpu     1000       rt_tomo4a_TTI_Panaus     zayn  R    1:03:42      1 pnod0767        14795781_325
idle        800        pt2_000_reg01_3x5_pr     tomj  R    3:35:52      1 pnod0750        14751438_134
teamrottoGp 400        5804_SRME_South_8136  joshuaz  R      16:07      1 pnod0771        14799464_2284
teamrottoGp 400        5804_SRME_South_8136  joshuaz  R      16:07      1 pnod0766        14799464_2285
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      42:33      1 pnod0761        14786431_405
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      42:33      1 pnod0769        14786431_406
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      40:00      1 pnod0764        14786431_415
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      34:54      1 pnod0757        14786431_424
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      34:54      1 pnod0749        14786431_425
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      31:31      1 pnod0765        14786431_435
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      23:53      1 pnod0753        14786431_479
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      21:20      1 pnod0759        14786431_500
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      21:20      1 pnod0758        14786431_501
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      14:26      1 pnod0752        14786431_543
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      14:26      1 pnod0754        14786431_544
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      13:35      1 pnod0770        14786431_553
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      12:44      1 pnod0762        14786431_558
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      12:44      1 pnod0755        14786431_559
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      11:03      1 pnod0751        14786431_571
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      11:03      1 pnod0760        14786431_572
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R       8:30      1 pnod0768        14786431_586
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R       8:30      1 pnod0772        14786431_587
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R       5:08      1 pnod0748        14786431_604
idleGpu     200        180_SWAMP_dcepticon_ katnisss  R       0:53      1 pnod0756        14786431_624
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    6:18:07      1 clus268         14746919_238
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    6:14:47      1 clus270         14746919_240
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    4:39:08      1 clus146         14746919_242
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    4:38:22      1 clus140         14746919_243
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    4:36:46      1 clus143         14746919_244
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    4:32:28      1 clus149         14746919_245
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    4:28:31      1 clus152         14746919_246
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    4:26:41      1 clus138         14746919_247
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    4:26:30      1 clus153         14746919_248
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    4:25:03      1 clus147         14746919_249
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    4:07:28      1 clus151         14746919_250
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    4:02:49      1 clus137         14746919_251
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    4:00:27      1 clus139         14746919_252
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    3:33:39      1 clus145         14746919_253
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    3:16:55      1 clus141         14746919_254
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    3:13:58      1 clus142         14746919_255
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    2:35:54      1 clus148         14746919_257
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    2:28:33      1 clus136         14746919_258
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    2:17:41      1 clus150         14746919_259
teamrottoGp 100        dp_SWaMP_Nania         mishka  R    1:50:36      1 clus144         14746919_260
teamrottoGp 100        dp_SWaMP_Nania         mishka  R      37:53      1 clus269         14746919_261

- Jobs from idleGpu got onto the nodes while there are still pending jobs in teamrottoGpu:

[akmalm@pud22 ~]$ squeue -p teamrottoGpu -tPD
PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON) JOBID
teamrottoGp 100        dp_SWaMP_Nania         mishka PD       0:00      1 (Resources)     14746919_162
teamrottoGp 100        dp_SWaMP_Nania         mishka PD       0:00      1 (Resources)     14746919_165
teamrottoGp 100        dp_SWaMP_Nania         mishka PD       0:00      1 (Resources)     14746919_166
teamrottoGp 100        dp_SWaMP_Nania         mishka PD       0:00      1 (Resources)     14746919_167
teamrottoGp 100        dp_SWaMP_Nania         mishka PD       0:00      1 (Resources)     14746919_168
teamrottoGp 100        dp_SWaMP_Nania         mishka PD       0:00      1 (Resources)     14746919_169
teamrottoGp 100        dp_SWaMP_Nania         mishka PD       0:00      1 (Resources)     14746919_[262-990]

teamrottoGpu and idleGpu have overlapping nodes.
Comment 19 David Bigagli 2015-08-05 21:35:20 MDT
Let me investigate and get back to you on this.

David
Comment 20 Akmal Madzlan 2015-08-05 21:39:34 MDT
Thanks David
Comment 21 David Bigagli 2015-08-05 22:56:51 MDT
Have you applied the patch for the 1892 problem?

David
Comment 22 Akmal Madzlan 2015-08-06 00:36:04 MDT
Do you mean bug 1829?
Yes, we applied that patch yesterday.
Comment 23 Moe Jette 2015-08-06 03:03:08 MDT
Please attach output from "scontrol show job" for some of the affected jobs (both pending and running), plus the slurmctld log file covering the period when jobs were started that should not have been.
Comment 24 Akmal Madzlan 2015-08-06 13:41:38 MDT
Created attachment 2102 [details]
slurmctld.log

- scontrol show output for the jobs that should have priority

JobId=14811805 ArrayJobId=14746919 ArrayTaskId=290 JobName=dp_SWaMP_asd
   UserId=mashak(1376) GroupId=teamrotto(2116)                           
   Priority=100 Nice=0 Account=(null) QOS=normal                         
   JobState=RUNNING Reason=None Dependency=(null)                        
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0                
   RunTime=07:01:12 TimeLimit=01:00:00 TimeMin=N/A                       
   SubmitTime=2015-08-04T13:02:08 EligibleTime=2015-08-04T13:02:09       
   StartTime=2015-08-07T02:34:31 EndTime=2015-08-07T03:34:31             
   PreemptTime=None SuspendTime=None SecsPreSuspend=0                    
   Partition=teamrottoGpu AllocNode:Sid=pud153:9997                      
   ReqNodeList=(null) ExcNodeList=(null)                                 
   NodeList=clus147                                                      
   BatchHost=clus147                                                     
   NumNodes=1 NumCPUs=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*                  
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*                      
   MinCPUsNode=1 MinMemoryNode=27244M MinTmpDiskNode=16000M              
   Features=gpu&localdisk Gres=(null) Reservation=(null)                 
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)                  
   Command=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/rj.dp_SWaMP_asd.NsIHW3
   WorkDir=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST                         
   Comment=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/production.job          
   StdErr=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/logs_cluster/dp_SWaMP_asd/dp_SWaMP_asd.o14746919.290
   StdIn=/dev/null                                                                                                                                                                       
   StdOut=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/logs_cluster/dp_SWaMP_asd/dp_SWaMP_asd.o14746919.290

JobId=14774568 ArrayJobId=14746919 ArrayTaskId=166 JobName=dp_SWaMP_asd
   UserId=mishka(1376) GroupId=teamrotto(2116)                           
   Priority=100 Nice=0 Account=(null) QOS=normal                         
   JobState=PENDING Reason=Resources Dependency=(null)                   
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0                
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A                       
   SubmitTime=2015-08-05T14:42:57 EligibleTime=2015-08-05T14:43:07       
   StartTime=Unknown EndTime=Unknown                                     
   PreemptTime=None SuspendTime=None SecsPreSuspend=0                    
   Partition=teamrottoGpu,idleGpu AllocNode:Sid=pud153:9997              
   ReqNodeList=(null) ExcNodeList=(null)                                 
   NodeList=(null)                                                       
   BatchHost=clus270                                                     
   NumNodes=1-1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*                 
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*                      
   MinCPUsNode=1 MinMemoryNode=27244M MinTmpDiskNode=16000M              
   Features=gpu&localdisk Gres=(null) Reservation=(null)                 
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)                  
   Command=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/rj.dp_SWaMP_asd.NsIHW3
   WorkDir=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST                         
   Comment=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/production.job          
   StdErr=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/logs_cluster/dp_SWaMP_asd/dp_SWaMP_asd.o14746919.166
   StdIn=/dev/null                                                                                                                                                                       
   StdOut=/p3/qqq/quoBonPr_011/seiTimeProc/prod/000scratch/008_asd.dugio/workflow_2015.08.04_13.01.42_WST/logs_cluster/asd/dp_SWaMP_asd.o14746919.166

- I don't have scontrol output for the jobs that stole the node; they have all finished running.

- These are the times when the lower priority jobs were scheduled, taken from the logs:

[root@pque0001 ~]# squeue -aw clus[136-153,268-270],pnod[0748-0772] | grep ^idle | awk '{print $9}' | xargs -Ijobid grep jobid /var/log/slurm/slurmctld.log | grep -E "Allocate|Started"
[2015-08-06T12:19:30.429] sched: Allocate JobId=14795781_272 (14796635) NodeList=clus263 #CPUs=32
[2015-08-06T15:20:27.460] backfill: Started JobId=14795781_272 (14796635) on pnod0763
[2015-08-06T12:21:23.547] sched: Allocate JobId=14795781_325 (14796711) NodeList=pnod0727 #CPUs=24
[2015-08-06T15:29:49.682] backfill: Started JobId=14795781_325 (14796711) on pnod0767
[2015-08-06T12:22:14.478] sched: Allocate JobId=14795781_349 (14796755) NodeList=pnod0723 #CPUs=24
[2015-08-06T15:34:55.806] backfill: Started JobId=14795781_349 (14796755) on pnod0756
[2015-08-06T12:57:39.983] backfill: Started JobId=14751438_134 (14764471) on pnod0750
[2015-08-06T15:50:58.286] backfill: Started JobId=14786431_405 (14799194) on pnod0761
[2015-08-06T15:50:58.293] backfill: Started JobId=14786431_406 (14799195) on pnod0769
[2015-08-06T15:53:31.380] backfill: Started JobId=14786431_415 (14799215) on pnod0764
[2015-08-06T15:58:37.829] backfill: Started JobId=14786431_424 (14799258) on pnod0757
[2015-08-06T15:58:37.834] backfill: Started JobId=14786431_425 (14799259) on pnod0749
[2015-08-06T16:02:00.939] backfill: Started JobId=14786431_435 (14799292) on pnod0765
[2015-08-06T16:09:38.038] backfill: Started JobId=14786431_479 (14799358) on pnod0753
[2015-08-06T16:12:11.106] backfill: Started JobId=14786431_500 (14799414) on pnod0759
[2015-08-06T16:12:11.112] backfill: Started JobId=14786431_501 (14799415) on pnod0758
[2015-08-06T16:19:05.525] backfill: Started JobId=14786431_543 (14799480) on pnod0752
[2015-08-06T16:19:05.531] backfill: Started JobId=14786431_544 (14799481) on pnod0754
[2015-08-06T16:19:56.507] backfill: Started JobId=14786431_553 (14799491) on pnod0770
[2015-08-06T16:20:47.624] backfill: Started JobId=14786431_558 (14799496) on pnod0762
[2015-08-06T16:20:47.628] backfill: Started JobId=14786431_559 (14799497) on pnod0755
[2015-08-06T16:22:28.653] backfill: Started JobId=14786431_571 (14799509) on pnod0751
[2015-08-06T16:22:28.658] backfill: Started JobId=14786431_572 (14799510) on pnod0760
[2015-08-06T11:40:05.268] sched: Allocate JobId=14758925_6816 (14758925) NodeList=clus129 #CPUs=40
[2015-08-06T16:24:10.650] backfill: Started JobId=14758925_6816 (14758925) on pnod0748

- I've attached the full log
Comment 25 Moe Jette 2015-08-07 11:37:34 MDT
This will not impact scheduling, but it could cause other problems: The slurm.conf files are different on quite a few of your nodes. Run "grep HASH" on your slurmctld log file to identify the nodes:

$ grep HASH slurmctld.log | head -3
[2015-08-06T03:47:26.196] error: Node bud34 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2015-08-06T04:20:47.079] error: Node bud34 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2015-08-06T04:54:11.535] error: Node bud34 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
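The check behind these errors can be sketched as follows. This is an illustration only, not slurmctld's actual code or hash algorithm: the controller compares a digest of its own slurm.conf against the digest each node reports, and logs the error above on any mismatch. The config strings here are hypothetical.

```python
# Illustrative sketch (not slurmctld's real implementation) of the
# slurm.conf hash check: a digest of the controller's config is compared
# against the digest reported by each slurmd.
import hashlib

def conf_hash(text: str) -> str:
    """Digest of a config file's contents (md5 chosen for illustration)."""
    return hashlib.md5(text.encode()).hexdigest()

# Hypothetical configs: the controller's copy and a drifted copy on a node.
ctld_conf = "SchedulerType=sched/backfill\n"
node_conf = "SchedulerType=sched/builtin\n"

# A mismatch here is what triggers the "different slurm.conf" error.
print(conf_hash(ctld_conf) == conf_hash(node_conf))  # False
```

As the log notes, the mismatch can be silenced with DebugFlags=NO_CONF_HASH if intentional, but the usual fix is distributing one identical slurm.conf to every node.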
Comment 26 Moe Jette 2015-08-07 11:59:51 MDT
(In reply to Akmal Madzlan from comment #24)
> - scontrol show for jobs that should have priority
> 
> JobId=14811805 ArrayJobId=14746919 ArrayTaskId=290 JobName=dp_SWaMP_asd  
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0                              
>    Features=gpu&localdisk Gres=(null) Reservation=(null)                 
> 
> JobId=14774568 ArrayJobId=14746919 ArrayTaskId=166 JobName=dp_SWaMP_asd                 
>    Partition=teamrottoGpu,idleGpu AllocNode:Sid=pud153:9997                         
>    Features=gpu&localdisk Gres=(null) Reservation=(null)                 

Note that these two jobs have "Features=gpu&localdisk" while the nodes pnod[0666-0867] don't have those features:

> NodeName=pnod[0666-0867] CPUs=24 RealMemory=128000 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN Feature=gpu,4phi,4gpu

So while they might have a higher priority than other jobs, that's not going to stop other jobs from using those nodes:
> [akmalm@pud22 ~]$ squeue -aw pnod[0748-0772],clus[136-154,268-270]
> PARTITION   PRIORITY   NAME                     USER ST       TIME  NODES NODELIST(REASON) JOBID
> idleGpu     1000       rt_tomo4a_TTI_Panaus     zayn  R    1:13:04      1 pnod0763        14795781_272
> idleGpu     1000       rt_tomo4a_TTI_Panaus     zayn  R    1:03:42      1 pnod0767        14795781_325
> idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      42:33      1 pnod0761        14786431_405
> idleGpu     200        180_SWAMP_dcepticon_ katnisss  R      42:33      1 pnod0769        14786431_406

There definitely was a bug with respect to scheduling job arrays that you now have the fix for. I don't have a complete picture of what is happening with respect to the jobs/priorities/queues/etc., but I'm not seeing any evidence that anything is currently wrong.

Does the job's "Feature" specification explain the anomaly? If not, could you provide matching "scontrol show job" and slurmctld log files demonstrating the anomaly?
Comment 27 Akmal Madzlan 2015-08-07 14:00:20 MDT
I think you are right. That job doesn't get onto those nodes because it requests
localdisk, which doesn't exist on pnod*.

Thanks Moe!
Comment 28 Moe Jette 2015-08-08 01:36:12 MDT
(In reply to Akmal Madzlan from comment #27)
> I think you are right. That job doesn't get onto those nodes because it requests
> localdisk, which doesn't exist on pnod*.
> 
> Thanks Moe!

Closing this ticket.