| Summary: | Jobs stop accruing age priority | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Martins Innus <minnus> |
| Component: | Scheduling | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | Priority: | --- |
| Version: | 14.11.4 | CC: | brian, da |
| Hardware: | Linux | OS: | Linux |
| Site: | University of Buffalo (SUNY) | | |
| Version Fixed: | 14.11.6, 15.08.0pre4 | | |
Attachments:
- slurm-cluster.conf
- slurm.conf
- Sprio post slurm restart
- slurmctl log
- debug patch
- slurmctld log with priority debug patch set
- debug patch 2
- slurmctl log with more debugging
- Unoptimize patch
Description
Martins Innus
2015-03-20 01:41:29 MDT
Brian Christiansen
Will you send your slurm.conf? Do you see this on new jobs since the
upgrade as well?

Thanks,
Brian

Martins Innus
Created attachment 1746 [details]
slurm-cluster.conf
Here you go. It looks like any jobs submitted since the upgrade have an
age of "0". Jobs that were pending in the queue during the upgrade have
the age that they had prior to the upgrade.

And actually, looking further, it looks like no priorities are being
changed after a job is submitted. We run fairshare, and for users who
have jobs pending in the queue, their fairshare is not going down even
though they also have jobs running.

Martins
Martins Innus
Created attachment 1747 [details]
slurm.conf
Brian Christiansen
I'm not able to reproduce this yet. Will you run with the Priority debug
flag (DebugFlags=Priority)? You can use sview to set it on the fly too if
you want. You should see lines like this in the logs every five minutes,
since PriorityCalcPeriod is set to the default:

[2015-03-20T11:52:00.645] Fairshare priority of job 24599 for user brian in acct bubu is 2**(-1.000000/0.145937) = 0.386770
[2015-03-20T11:52:00.645] Weighted Age priority is 0.018638 * 50000 = 931.88
[2015-03-20T11:52:00.645] Weighted Fairshare priority is 0.386770 * 50000 = 19338.50
[2015-03-20T11:52:00.646] Weighted JobSize priority is 0.100000 * 200000 = 20000.00
[2015-03-20T11:52:00.646] Weighted Partition priority is 1.000000 * 1000000 = 1000000.00
[2015-03-20T11:52:00.646] Weighted QOS priority is 0.000000 * 0 = 0.00

Will you then send the logs, pointing me to a job that you think should
be increasing in priority age?
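For reference, a minimal sketch of the two ways to turn this flag on.
Both the slurm.conf option and the scontrol subcommand are standard
Slurm; the runtime toggle is the command-line equivalent of flipping it
in sview:

    # Persistent: add to slurm.conf, then restart or reconfigure slurmctld
    #   DebugFlags=Priority

    # On the fly, without a restart:
    scontrol setdebugflags +Priority

    # And to turn it back off later:
    scontrol setdebugflags -Priority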
Martins Innus
Created attachment 1748 [details]
Sprio post slurm restart
We added the debug parameter and restarted slurm. See in the attached
sprio output that all jobs that existed before the slurm restart have
-10000 nice and all the rest zero. Jobs submitted since the restart have
whatever priority was calculated at submit time, and it never seems to
change. Will send the other logs after a few scheduling cycles have run.

Brian Christiansen
Will you confirm that the priorities of 10000 are job arrays? This was
fixed in 14.11.5:
https://github.com/SchedMD/slurm/commit/423029d8364e8856c6c32b019cb177e92ea18665

Martins Innus
Some of them maybe, but certainly not all.

Created attachment 1749 [details]
slurmctl log
Log attached. One job that is not accruing age is 3505711, but I think
they are all the same.

Brian Christiansen
Will you get the output of "scontrol show job 3505711"?

Martins Innus
That's probably a bad one since it has since started. How about this one?
It is in the log I sent you. Let me know if you want the other one.

[minnus@rush:~]$ sprio -l | grep 3505675
3505675   tiangebi        612        157        216        240          0          0          0
[minnus@rush:~]$ sprio -l | grep 3505675
3505675   tiangebi        612        157        216        240          0          0          0

Those sprios are long enough apart that they should increment. The
numbers are no longer 0 since we have done a scontrol job release on all
jobs to try to get things moving. The "release" seems to increment the
counters when the command is run, but then they stay the same after that.

[minnus@rush:~]$ scontrol show job 3505675
JobId=3505675 JobName=0GPa_2FU_vdW-1x6-3
   UserId=tiangebi(423120) GroupId=ezurek(104475)
   Priority=612 Nice=0 Account=ezurek QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2015-03-20T15:39:56 EligibleTime=2015-03-20T15:39:57
   StartTime=2015-03-22T04:32:00 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=general-compute AllocNode:Sid=k07n14:78666
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=d13n06
   NumNodes=1-1 NumCPUs=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=8:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryCPU=3000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/gpfs/projects/ezurek/tiangebi/PH3/2FU/0GPa_2FU_vdW/00001x00006/job.slurm
   WorkDir=/gpfs/projects/ezurek/tiangebi/PH3/2FU/0GPa_2FU_vdW/00001x00006
   StdErr=/projects/ezurek/tiangebi/PH3/2FU/0GPa_2FU_vdW/00001x00006//../1x6-3.out
   StdIn=/dev/null
   StdOut=/projects/ezurek/tiangebi/PH3/2FU/0GPa_2FU_vdW/00001x00006//../1x6-3.out

Brian Christiansen
Thanks. I'll get you a patch with additional logging. It's probably the
end of your day, so we can pick it up on Monday.

Created attachment 1752 [details]
debug patch
Will you run with the attached patch and get logs with at least two instances of the following log line for a job?
[2015-03-20T15:43:54.317] 24636->priority_age=0.029980 diff:18132 start_time:1426891434 use_time:1426873302 max_age:604800 weight_age:50000
Thanks,
Brian
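For what it's worth, the arithmetic in that sample line checks out if the
age factor is read as elapsed eligible time divided by the maximum age
(an interpretation of the patch's output, not official documentation):

    # diff     = start_time - use_time = 1426891434 - 1426873302 = 18132 s
    # age      = diff / max_age        = 18132 / 604800          = 0.029980
    # weighted = age * weight_age      = 0.029980 * 50000        = 1499
    echo 'scale=6; (1426891434 - 1426873302) / 604800' | bc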
Brian Christiansen
I forgot to mention that the extra logs are turned on with the Priority
debug flag.

Martins Innus
Created attachment 1754 [details]
slurmctld log with priority debug patch set
OK, attached. It looks like it only outputs when a job is submitted. I
let it run through 3 cycles.

Just a clarification on the last comment: that means there are no jobs
that get that line more than once. It only happens at job submit time and
never for jobs already in the queue.

Brian Christiansen
Created attachment 1755 [details]
debug patch 2
I've attached a new patch with more logging. It includes the previous logs as well. Will you rerun the test with this patch? Thanks for being patient and working with us.
Martins Innus
Created attachment 1756 [details]
slurmctl log with more debugging
Output attached. The output is not likely helpful, since with this patch
and the priority debugging turned on, the priorities are all calculated
correctly!! That is, sprio shows correct output.

So then, I did the following: when I leave this patch compiled in but
turn priority debugging off, we get the priorities not calculated
correctly. sprio shows "0" for all age priority. Some sort of race
condition?

Brian Christiansen
Hmm.. interesting. I'll look at the possibility of a race condition.
Thanks for your help.

I just want to confirm one thing. With the patch compiled in but the
debugging off, does sprio show an age of 0 for all jobs, or is it just
new jobs? If you do scontrol release on a job, does the age change or
does it stay at 0? And what Linux distro are you using on your
controller?

Martins Innus
I replied to these yesterday by email, but it looks like they don't show
up. Trying again through the website.

When restarting slurmctld with this patch applied but debugging turned
off, existing jobs keep their priorities and new jobs show up as "0".
Nothing increments after this time. I am fairly certain that on one of
these tests everything was "0", but I am unable to reproduce that. If we
do a "release", the priorities are calculated at that time, but do not
increment after that time. We have been running "release" on all jobs a
couple times a day to keep things sane.

CentOS 6.6 on the controller and everywhere else.
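A minimal sketch of that kind of blanket release (the squeue and scontrol
options are standard, but the exact invocation is an assumption, not the
site's actual script):

    # Release every pending job so slurmctld recalculates its priority once.
    squeue --noheader --states=PENDING --format='%i' | xargs -r -n1 scontrol release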
Brian Christiansen
Created attachment 1772 [details]
Unoptimize patch
I was finally able to reproduce this. I can reproduce it when the
binaries are compiled with optimizations (-O2) and I'm using CentOS 6.6.
I couldn't reproduce it when using CentOS 6.5 or Ubuntu 14.10. Placing
debug statements around the function that doesn't get called fixes it. It
appears that CentOS 6.6's gcc is optimizing out a function.
I also saw where all of the priorities were 10000 and the factors were 0s
after a restart. This makes sense because the function that calculates
the priorities and factors wasn't being called (it was optimized out).
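One way to check that kind of suspicion is to look for the function in
the symbol table of the built plugin. This is only a sketch: the plugin
path is an assumption, the ticket never names the affected function, and
the grep pattern here is a guess:

    # List symbols in the multifactor priority plugin; a static function
    # that gcc inlined or dropped at -O2 may simply be absent from the
    # symbol table (and a stripped binary shows nothing at all).
    nm /usr/lib64/slurm/priority_multifactor.so | grep -i priority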
Will you apply this patch and see if it fixes it for you? It fixes it for me.
Or the other option is to turn off optimization altogether
(CFLAGS="-O0"). It's something that we may do as a default in the future
to avoid things like this.

Martins Innus
Unfortunately the patch didn't work. On slurmctl restart, existing jobs
have the 10000/0 status, and new jobs show up with 0 age while the rest
of the priority is whatever it would be at submit time. We are going to
try with -O0 next. It seems strange that it would be an optimizer bug,
since in your last debug patch, toggling the debug variable changed the
behavior after the code was already compiled.

Rebuilding with -O0 fixed the issue. Do you expect any performance hits
from doing this? Our size is roughly:

~1000 nodes
~1500 users, but ~200 active at a time
5-10K jobs a day

Anything else we should try, or do you think we can safely run this way
moving forward? Thanks for the help!

Brian Christiansen
Great! We don't expect any significant performance hits. I would run with
-O0 and also add the -g flag to get symbols (see the build sketch at the
end of this ticket). This will allow us to better diagnose any problems
you may encounter. Our plan is to disable optimizations and add symbols
going forward.

I found the main cause of the priority issue. It had to do with checking
return values from void functions -- which short-circuited calculating
the priorities for the rest of the jobs. It was still possible to see the
issue even without optimizations, as you had seen. With your help I was
able to make progress on Bug 1469, which was seeing the same problem.

It is fixed in:
https://github.com/SchedMD/slurm/commit/b2be6159517b197188b474a9477698f6f5edf480

And we've enabled debugging symbols and no optimizations by default in:
https://github.com/SchedMD/slurm/commit/614b9770ba1a7fc2b639768566849168aa883da4

Thanks for your help. Let me know if you have any questions.

Thanks,
Brian

Martins Innus
Brian,

Thanks! That makes sense. Do you have an expectation of when 14.11.6 will
be out? Trying to figure out if we should just patch this in or wait for
the next release.

Thanks,
Martins

Brian Christiansen
14.11.6 will probably be 2-3 weeks out.
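For reference, a rebuild along the lines suggested above might look like
this (the source directory and install prefix are assumptions; adjust to
the local layout):

    # Configure with optimizations off and debug symbols on, as suggested.
    cd slurm-14.11.5                                        # path assumed
    ./configure CFLAGS="-O0 -g" --prefix=/usr/local/slurm   # prefix assumed
    make -j8
    make install
    # Restart slurmctld afterwards so the new binary takes effect.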