Ticket 1469 - priority strangeness
Summary: priority strangeness
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 14.11.4
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Brian Christiansen
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-02-19 04:03 MST by wettstein
Modified: 2015-04-01 06:03 MDT
CC List: 3 users

See Also:
Site: University of Chicago
Version Fixed: 14.11.6 15.08.0pre4


Attachments
scontrol show config (6.29 KB, text/plain)
2015-02-19 04:03 MST, wettstein
Details
slurm.conf (8.76 KB, text/plain)
2015-02-19 05:31 MST, wettstein
Details
fair_tree disabled and weird priority (4.26 MB, text/plain)
2015-02-20 04:26 MST, wettstein
Details
fair_tree disabled slurm log (612.75 KB, text/plain)
2015-02-20 04:26 MST, wettstein
Details
fair_tree enabled and weird priority (5.04 MB, text/plain)
2015-02-20 04:27 MST, wettstein
Details
fair_tree enabled and weird priority after a few minutes (2.06 MB, text/plain)
2015-02-20 04:30 MST, wettstein
Details
fair_tree enabled slurm log (2.39 MB, text/plain)
2015-02-20 04:32 MST, wettstein
Details
gdb watch points (14.14 KB, patch)
2015-02-24 03:14 MST, Brian Christiansen
Details | Diff
patch (297 bytes, patch)
2015-03-27 07:08 MDT, Brian Christiansen
Details | Diff
fair_tree fix (820 bytes, patch)
2015-03-30 05:54 MDT, Brian Christiansen
Details | Diff

Description wettstein 2015-02-19 04:03:15 MST
Created attachment 1653 [details]
scontrol show config

I just updated to slurm 14.11.4 last Friday. I'm seeing some strangeness with the priority of several users' jobs. sprio shows something like this:

# sprio -l -j 11651503
          JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS   NICE
       11651503 stevetse          1          0          0          0          1          0      0


If I submit a new job, the priority in sprio looks OK, but after a few minutes the sprio output for the still-pending job looks the same as the output above.


# date; sprio -l -j 11671959
Thu Feb 19 11:55:18 CST 2015
          JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS   NICE
       11671959 wettstei     180401          0      80171        231     100000          0      0

# date; sprio -l -j 11671959
Thu Feb 19 11:58:44 CST 2015
          JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS   NICE
       11671959 wettstei          1          0          0          0          1          0      0

Here is the scontrol show jobid output, captured after sprio started showing a priority of 1. The priority shown here is the initial priority and it never updates; I don't know whether that is the correct behavior or not:

# scontrol show jobid=11671959
JobId=11671959 JobName=_interactive
   UserId=wettstein(891783663) GroupId=wettstein(891783663)
   Priority=180401 Nice=0 Account=rcc-staff QOS=sandyb
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2015-02-19T11:54:26 EligibleTime=2015-02-19T11:54:26
   StartTime=2015-02-19T14:06:50 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=sandyb AllocNode:Sid=midway-login1:1470
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=midway[036,069-071,083,086-087,091,098,109,195,197-199,220,222,225-226,260,306,321,323,326,329-330,334-336,351,369,380,382]
   NumNodes=32-32 NumCPUs=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/software/bin/_interactive
   WorkDir=/home/wettstein/redmine/pubsw/software/modulefiles/comsol
   StdErr=/dev/null
   StdIn=/dev/null
   StdOut=/dev/null


I've attached the output of scontrol show config.
Comment 1 Brian Christiansen 2015-02-19 04:30:03 MST
Are you using PriorityFlags=FAIR_TREE? scontrol doesn't display FAIR_TREE if it is set -- fixed in 14.11.5. This looks similar to Bug 1454.
Comment 2 Brian Christiansen 2015-02-19 05:25:50 MST
Will you attach your slurm.conf? 

Does it happen for all jobs?
Comment 3 wettstein 2015-02-19 05:31:54 MST
Created attachment 1655 [details]
slurm.conf
Comment 4 wettstein 2015-02-19 05:34:44 MST
(In reply to Brian Christiansen from comment #2)
> Will you attach your slurm.conf? 
> 
> Does it happen for all jobs?

All jobs that we had pending that were older than a few minutes had priority=1 in the sprio output. 

I've attached my slurm.conf. I turned off FAIR_TREE for now.
Comment 5 Brian Christiansen 2015-02-19 05:38:14 MST
Thanks. Let us know if turning off FAIR_TREE fixes the sprio output.
Comment 6 wettstein 2015-02-19 05:38:15 MST
Yes,

I am using FAIR_TREE and it does look similar to that bug. I turned
off FAIR_TREE and priority for new jobs stayed normal. The existing jobs
all looked like this in sprio:

   JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS   NICE
11651503 stevetse      10000          0          0          0          0          0 -10000


I did an scontrol release on all of those jobs and the priorities became
normal; scontrol show jobid now shows the updated priority correctly.



Comment 7 Brian Christiansen 2015-02-20 02:16:10 MST
In Bug 1454, the problem went away after restarting. Could you try turning fair_tree back on and seeing if the problem occurs again?

If it does, will you do the following:
1. Get controller logs of at least "info" with the "Priority" debug flag set
2. outputs of:
 a. sshare -l
 b. sprio -l
 c. scontrol show jobs
 d. sacctmgr show assoc tree

Thanks,
Brian
Comment 8 wettstein 2015-02-20 04:23:41 MST
Yes, it does seem to be intermittent. I restarted with FAIR_TREE on and priorities were normal. I enabled and disabled FAIR_TREE a few times and managed to make the priorities go weird with FAIR_TREE both enabled and disabled.

All pending jobs looked like this with FAIR_TREE disabled:

          JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS   NICE
       11591807  dingpan      10000          0          0          0          0          0 -10000

I did capture the output of those commands with the Priority debug flag set. I'll attach that output.

When I did the FAIR_TREE switch, I also switched users from fairshare=parent to fairshare=1. I switched that back and forth a few times this morning as well, but as far as I could tell that didn't trigger anything.

I've currently got FAIR_TREE turned on and it is setting jobs to priority=1 after a few minutes in the queue. I can leave it this way for a little while if you want me to collect additional information while it is happening.
Comment 9 wettstein 2015-02-20 04:26:19 MST
Created attachment 1657 [details]
fair_tree disabled and weird priority

sshare -l
sprio -l
scontrol show jobs
sacctmgr show assoc tree
Comment 10 wettstein 2015-02-20 04:26:56 MST
Created attachment 1658 [details]
fair_tree disabled slurm log
Comment 11 wettstein 2015-02-20 04:27:57 MST
Created attachment 1659 [details]
fair_tree enabled and weird priority

sshare -l
sprio -l
scontrol show jobs
sacctmgr show assoc tree
Comment 12 wettstein 2015-02-20 04:30:03 MST
Created attachment 1660 [details]
fair_tree enabled and weird priority after a few minutes

sprio -l
scontrol show job

Jobs 11677655 11677656 priority goes to 1.
Comment 13 wettstein 2015-02-20 04:32:22 MST
Created attachment 1661 [details]
fair_tree enabled slurm log
Comment 14 Brian Christiansen 2015-02-20 04:33:52 MST
Awesome, thanks! I'll mull over all of it. You can disable fair_tree now if you want.
Comment 15 Ryan Cox 2015-02-23 09:16:25 MST
We just started seeing this today with 14.11.4.  A quick check with the Priority debug flag suggests that the following part of the code is not hit for some jobs (this was one of the few with non-zero info in sprio):
[2015-02-23T13:54:35.319] Job 5659119 priority: 0.00 + 1298.48 + 0.00 + 0.00 + 0.00 - 0 = 1298.48
[2015-02-23T13:54:35.319] Fairhare priority of job 5659124 for user klbell9 in acct jsk24 is 0.129848
[2015-02-23T13:54:35.319] Weighted Age priority is 0.000000 * 0 = 0.00
[2015-02-23T13:54:35.319] Weighted Fairshare priority is 0.129848 * 10000 = 1298.48
[2015-02-23T13:54:35.319] Weighted JobSize priority is 0.000000 * 0 = 0.00
[2015-02-23T13:54:35.319] Weighted Partition priority is 0.000000 * 0 = 0.00
[2015-02-23T13:54:35.319] Weighted QOS priority is 0.000000 * 1000000 = 0.00


It looks like it is skipped for jobs with zeroes in sprio.  I think the jobs are getting an initial priority set and then never hitting _get_priority_internal() again.  Why that would be, I have no idea.  It also seems that the priority value stays at whatever it was before the bug was triggered, but I'm not completely certain of that yet.

I'm adding some extra info statements to see where the problem is but a restart of slurmctld seems to temporarily fix the problem, so progress may be slow.

My question is how the jobs 1) got an initial priority, 2) still have a priority at all times, 3) have individual factors reset to 0 at some point, 4) never seem to hit _get_priority_internal() again, and 5) successfully get scheduled at some point.
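
For reference, the debug lines above appear to be printing the weighted multifactor sum (age + fairshare + jobsize + partition + qos - nice). Below is a minimal standalone sketch of that arithmetic, with the factor values and weights copied from those log lines; the variable names are illustrative only and not Slurm code.

#include <stdio.h>

int main(void)
{
    /* Factor values and weights as printed in the debug lines above. */
    double age_factor  = 0.000000; double age_weight  = 0;
    double fs_factor   = 0.129848; double fs_weight   = 10000;
    double js_factor   = 0.000000; double js_weight   = 0;
    double part_factor = 0.000000; double part_weight = 0;
    double qos_factor  = 0.000000; double qos_weight  = 1000000;
    double nice        = 0;

    /* "priority: 0.00 + 1298.48 + 0.00 + 0.00 + 0.00 - 0 = 1298.48" */
    double priority = age_factor  * age_weight
                    + fs_factor   * fs_weight
                    + js_factor   * js_weight
                    + part_factor * part_weight
                    + qos_factor  * qos_weight
                    - nice;

    printf("priority = %.2f\n", priority);   /* prints 1298.48 */
    return 0;
}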
Comment 16 Ryan Cox 2015-02-23 09:37:43 MST
One other piece of information is that scontrol show job and squeue showed incorrect priority values.  They were likely incorrect because the values weren't being updated anymore.  sshare, on the other hand, showed proper values for Level FS and Fairshare.  The distinction there is associations versus jobs.

It seems like the jobs aren't making it to the job list in the call to decay_apply_weighted_factors at https://github.com/SchedMD/slurm/blob/slurm-14.11/src/plugins/priority/multifactor/fair_tree.c#L70.  That's a pretty straight shot from https://github.com/SchedMD/slurm/blob/slurm-14.11/src/plugins/priority/multifactor/priority_multifactor.c#L1322.  That job_list is a global variable.  It's highly likely that something is corrupting job_list such that it only affects the multifactor plugin, if I'm reading the code correctly.
Comment 17 Ryan Cox 2015-02-23 09:47:09 MST
Odd... I have a script checking once a minute to see if sprio has zeroes.  It triggered once but the zeroes are gone now.  It happened to be at the same second that decay_apply_weighted_factors was being called.  Maybe there's a locking issue.
Comment 18 Ryan Cox 2015-02-23 09:57:30 MST
Now I have the script checking sprio continuously with only one second sleep in between.  It's showing zeroes much more often now and it's always the same second when the log shows decay_apply_weighted_factors is being called on jobs.  It should be locked at that point by https://github.com/SchedMD/slurm/blob/slurm-14.11/src/plugins/priority/multifactor/fair_tree.c#L69, unless that isn't the right lock to use.  Locking in Slurm is not my forte but it certainly seems like there's some kind of race condition due to locks not being used correctly.
Comment 19 Ryan Cox 2015-02-23 10:02:45 MST
One more thing to add before I leave for the day is that the values do exist again in sprio except while decay_apply_weighted_factors is being called.  This is different from the issue of the values disappearing until a slurmctld restart, but it seems related.
Comment 20 Brian Christiansen 2015-02-23 10:09:54 MST
Thanks, Ryan, for looking into this as well. I'm still investigating on my end.
Comment 21 Ryan Cox 2015-02-24 02:43:42 MST
Based on the logging I added, it looks like the priority_fs value can get set to zero after some job array tasks get started but others are still pending.
Comment 22 Brian Christiansen 2015-02-24 03:14:49 MST
Created attachment 1673 [details]
gdb watch points

I can reproduce this one now when running array jobs:

brian@compy:~/slurm/14.11/compy$ sprio -l -j 12063
          JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS   NICE
          12063    brian      10000          0          0          0          0          0 -10000


I've attached my gdb output of setting watch points on job->priority and job->prio_factors->priority_part. It appears the prio_factors are being recalculated inside the locks in fair_tree.
Comment 23 Ryan Cox 2015-02-24 03:51:57 MST
> I've attached my gdb output of setting watch points on job->priority and
> job->prio_factors->priority_part. It appears the prio_factors are being
> recalculated inside the locks in fair_tree.

That's intentional and similar to how the other algorithms do this, AFAICT.  Based on the wording of your comment I can't tell if you are saying "good that it's in locks", "it shouldn't be doing that there", or "just an observation".

set_priority_factors() may be the source of the zeroes (https://github.com/SchedMD/slurm/blob/slurm-14.11/src/plugins/priority/multifactor/priority_multifactor.c#L1901), but it is inside of a lock.  It zeroes the factors and then immediately sets the values.  However, since it is inside the locks, nothing should see the zeroes, right?  Could there be something else in the code that should take a read lock but doesn't?

Of course this may have nothing to do with locks but that's my only guess at the moment besides job array entries losing their priority factors when a new array task starts running.
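
To make the locking question concrete, here is a small standalone sketch of the zero-then-set pattern inside a write lock. All names are hypothetical and this is not Slurm code; the point is that a reader which takes the matching read lock can never observe the transient zeros, so seeing zeros would imply some reader is not taking that lock.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a job's priority factors. */
struct factors { double fs, age; };

static struct factors job = { 25000.0, 15.3 };
static pthread_rwlock_t job_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Writer: zero the factors, then immediately recompute them, all under
 * the write lock -- the zero-then-set pattern discussed above. */
static void recompute_factors(void)
{
    pthread_rwlock_wrlock(&job_lock);
    memset(&job, 0, sizeof(job));   /* transient zeros... */
    job.fs  = 25000.0;              /* ...overwritten before the lock drops */
    job.age = 15.3;
    pthread_rwlock_unlock(&job_lock);
}

/* Reader holding the read lock: it can only run before or after the
 * whole write section, so it never sees the intermediate zeros. */
static void read_factors_locked(void)
{
    pthread_rwlock_rdlock(&job_lock);
    printf("fs=%.1f age=%.1f\n", job.fs, job.age);
    pthread_rwlock_unlock(&job_lock);
}

int main(void)
{
    recompute_factors();
    read_factors_locked();
    /* A reader that skipped the rdlock (or locked different data) could
     * interleave with the memset above and report zeros -- hence the
     * question of whether everything that reads the factors takes the
     * right lock. */
    return 0;
}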
Comment 24 Brian Christiansen 2015-02-24 04:13:49 MST
Just an observation. Because the factors are being set within the locks, I wouldn't expect to see the zeros. If only I could reproduce it...
Comment 25 Ryan Cox 2015-02-24 04:53:55 MST
This is what I have been using to hit the one race condition fairly often:
while true; do out=$(time sprio -l | awk '$5 == 0 {print}'); if [[ "x$out" != "x" ]]; then echo -e "$(date)\n\n$out" | mail -s zeroes "$email"; echo Found at $(date); fi; sleep 1; done

If you submit a job array where some tasks start and others don't, you'll very likely see some weirdness almost immediately.  I still haven't encountered the initial condition I saw with all zeroes for a prolonged period.
Comment 26 Brian Christiansen 2015-02-24 05:48:36 MST
OK, I have a pretty good reproducer for the zeros issue. I was hoping to also catch the issue where the priorities are all 1's. I'll dive into this reproducer; hopefully the two scenarios are related.

brian@compy:/tmp$ sprio -l
          JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS   NICE
          18789    brian      10000          0          0          0          0          0 -10000
Comment 27 Brian Christiansen 2015-02-24 09:01:03 MST
This commit fixes the job array issue.
https://github.com/SchedMD/slurm/commit/423029d8364e8856c6c32b019cb177e92ea18665

Ryan, have you seen the priorities all be 1 like Chicago has seen? I suspect there is still something lurking, because a lot of the Chicago jobs were not job arrays.
Comment 28 Ryan Cox 2015-02-24 09:40:45 MST
(In reply to Brian Christiansen from comment #27)
> This commit fixes the job array issue.
> https://github.com/SchedMD/slurm/commit/
> 423029d8364e8856c6c32b019cb177e92ea18665
> 

I had actually looked at that function earlier and figured that the source of the job array issues might be in there somewhere :).  The patch looks like it will fix that problem and we applied it.  So far the issue seems to have gone away with job arrays but we do still get zeroes occasionally when sprio and calls to decay_apply_weighted_factors occur at the same time (more below).

> Ryan, have you seen the priorities all be 1 like Chicago has seen? I'm
> thinking that there is something lurking still because a lot of the Chicago
> jobs were not job arrays.

I don't recall ever seeing lines with a 1 for priority.  I just recall seeing this ticket and thinking we were seeing the exact same thing.  Unfortunately I accidentally lost the sprio output after restarting slurmctld (oops).  I do remember that it had zero for fairshare, and I think it had a zero for Partition, unlike theirs.  Our partition weight is set to zero, so maybe that's why.  I really don't remember what the priority was set to.

The jobs' priorities according to squeue and scontrol show job were definitely not being updated, even though sshare showed correct values.


As far as the all zeroes issue, I see lines like this with sprio -l every once in a while:
        5759600 lundgrer          0          0          0          0          0          0      0
        5759601 lundgrer          0          0          0          0          0          0      0
        5765644 jtfuller          0          0          0          0          0          0      0
        5765649 jtfuller          0          0          0          0          0          0      0
        5765663 jtfuller          0          0          0          0          0          0      0
        5766071 jtfuller          0          0          0          0          0          0      0
        5766901 jtfuller          0          0          0          0          0          0      0
        5766903 njensen6          0          0          0          0          0          0      0
        5766928 njensen6          0          0          0          0          0          0      0

It's easier to trigger if I make the calculations in the multifactor plugin take longer, such as by outputting hundreds of lines of debug statements per second to the logs.  I added some info statements that I activate by adding the priority debug flag (that's how I narrowed down the one issue to job arrays).  I bet if you put a sleep statement in decay_apply_weighted_factors or its calling function you could trigger this issue very easily.  You can then do a while loop with sprio like the one I mentioned in comment 25, maybe without the sleep.  I guess I haven't done a while loop with squeue and scontrol show job to look for zero priority values there, so that might be something else for you to look at if you can reproduce the issues I'm seeing.

Maybe at some point a race condition causes that zero priority to get copied into the job's priority field.  After that, it would skip almost all calculations since it's "held" and it wouldn't change from then on, except maybe after a slurmctld restart.  However, I don't think that explains why sprio shows a 1 in Priority for Chicago, so maybe that's not a good theory.
Comment 29 Ryan Cox 2015-02-24 10:03:36 MST
One other interesting thing I noticed is that my extra debug statements in the logs show priority_p_get_priority_factors_list() being called in the middle of the Fair Tree debug output that is part of _apply_priority_fs().  That, however, does not consider jobs at all so it should be safe to happen simultaneously.

_apply_priority_fs() is called here after locking associations:
https://github.com/SchedMD/slurm/blob/slurm-14.11/src/plugins/priority/multifactor/fair_tree.c#L65.

priority_p_get_priority_factors_list() obtains a job read lock here: https://github.com/SchedMD/slurm/blob/slurm-14.11/src/plugins/priority/multifactor/priority_multifactor.c#L1784.

priority_p_get_priority_factors_list() also does some stuff with associations without a lock (that I can see) but I don't know if that's a problem.  It probably obtains locks in the function it calls but I'm taking off now and don't have time to check: https://github.com/SchedMD/slurm/blob/slurm-14.11/src/plugins/priority/multifactor/priority_multifactor.c#L1815 

I do not think that this is an issue but I figured I should mention it since I could be wrong. Since I'm calling sprio about once a second while testing, it's inevitable that the calls would happen simultaneously sometimes if they don't depend on locks on the same data structures.
Comment 30 Brian Christiansen 2015-02-27 07:05:55 MST
I haven't been able to reproduce the all-zeros case, even with a sleep added and multiple while loops running sprio -l every 0.1 seconds.
Comment 31 Brian Christiansen 2015-03-05 04:10:13 MST
Having David put a second pair of eyes on this.
Comment 32 Ryan Cox 2015-03-11 05:03:37 MDT
FYI we still get zeroes pretty reliably as described before.
Comment 33 wettstein 2015-03-18 03:04:22 MDT
With fair_tree scheduling turned off, I have noticed that the Age priority factor also seems to stop increasing at some point. One user had several jobs with the age priority factor stuck at 766. I did an scontrol release on all of those jobs and the age priority immediately went to 1798. This seems to affect all jobs when it happens.

There also continues to be a smaller number of jobs that end up with a priority that looks like this:

   JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS   NICE
11872918 simonfre      10000          0          0          0          0          0 -10000

These also get corrected with scontrol release.
Comment 34 Brian Christiansen 2015-03-18 03:47:02 MDT
Thanks for the information. Did you apply the following patch?

https://github.com/SchedMD/slurm/commit/423029d8364e8856c6c32b019cb177e92ea18665

This should resolve the priorities like:
   JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS   NICE
11872918 simonfre      10000          0          0          0          0          0 -10000

It was an issue with array jobs.
Comment 35 Brian Christiansen 2015-03-26 04:29:54 MDT
Which OS/distro version are you using? In Bug 1550, I was able to reproduce the priorities not being calculated -- which caused the 10000/0 priority outputs.

Will you try compiling without optimizations?
Comment 36 Ryan Cox 2015-03-26 04:35:03 MDT
I'll try to give that a shot today (and look at some of the other bug reports I've opened too).
Comment 37 wettstein 2015-03-26 04:39:29 MDT
(In reply to Brian Christiansen from comment #35)
> What version OS/distro are you using? In Bug 1550, I was able to reproduce
> the priorities not being calculated -- which caused the 10000/0 priority
> outputs.
> 
> Will you try compiling without optimizations?

I'm compiling on RHEL 6.6. I'm rebuilding now with -O0.
Comment 38 Ryan Cox 2015-03-26 05:05:24 MDT
I forgot to mention that the ctld and dbd run on Debian 6 (to be upgraded in the somewhat near future), with gcc --version: gcc (Debian 4.4.5-8) 4.4.5.
Comment 39 Ryan Cox 2015-03-26 05:59:51 MDT
I just had the zeroes happen again while running with -O0.  It seems to happen less often now though.
Comment 40 Ryan Cox 2015-03-26 10:37:48 MDT
It turns out I didn't recompile all the code with -O0. Trying again now.
Comment 41 Ryan Cox 2015-03-26 11:02:16 MDT
I still get zeroes.  Unless I'm missing something, the following should work, right?

make clean
env CFLAGS=-O0 ./configure --sysconfdir=/etc/slurm
make install
Comment 42 Brian Christiansen 2015-03-26 11:21:57 MDT
Ya, that'll do it. I was running with fair_tree today while running your email command, and I got one email where everything was:

3469    brian          2          0          0          1          1          0      0

I'll have to see if I can catch it again. I'm curious to see what Chicago sees.
Comment 43 Brian Christiansen 2015-03-27 07:08:44 MDT
Created attachment 1782 [details]
patch

Ryan, will you apply this patch? I believe this is the root cause of the issues. list_for_each checks the return value of these functions, but they are declared void, so the value it sees is garbage and is sometimes less than 0. When the return value is <0, list_for_each stops processing the rest of the jobs in the list. I was able to see this happen on my side. Let me know how it works for you.
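
For illustration, here is a standalone sketch of the failure mode described above; it is not Slurm source, and each_fn_t, my_list_for_each, and apply_factors are hypothetical names. A callback written as void is pushed through an iterator that expects an int-returning callback and stops as soon as the return value it sees is negative.

#include <stdio.h>

typedef int (*each_fn_t)(void *item);

/* Mimics the documented list_for_each behavior: stop early as soon as
 * the callback's return value is < 0. */
static int my_list_for_each(void **items, int count, each_fn_t fn)
{
    int i;
    for (i = 0; i < count; i++) {
        if (fn(items[i]) < 0)   /* garbage if fn is really a void function */
            break;              /* ...so the remaining jobs are skipped */
    }
    return i;                   /* number of items actually visited */
}

/* The callback was written as a void function, returning nothing... */
static void apply_factors(void *item)
{
    printf("processed job %d\n", *(int *)item);
}

int main(void)
{
    int ids[] = { 1, 2, 3, 4 };
    void *items[] = { &ids[0], &ids[1], &ids[2], &ids[3] };

    /* ...but it is cast to the int-returning type the iterator expects.
     * Calling it through that pointer is undefined behavior: whatever
     * happens to be in the return register is treated as the return
     * value, and whenever it is negative the loop above silently stops,
     * leaving the rest of the jobs with stale priority factors. */
    int visited = my_list_for_each(items, 4, (each_fn_t) apply_factors);
    printf("visited %d of 4 jobs\n", visited);
    return 0;
}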
Comment 44 Ryan Cox 2015-03-27 07:27:42 MDT
Wow. That's a subtle bug.

I applied the patch and am still on -O0. I'll let you know how it goes.
Comment 45 Ryan Cox 2015-03-27 08:02:16 MDT
Looking through that some more, I can see where Fair Tree makes mistakes with list_for_each.  Those functions should have returned an int, even if it's worthless, rather than being void functions.  We missed that ListForF expects a function returning int rather than void.  Essentially the void callback ends up "returning" whatever value was left behind on the stack.  I wonder if there's anywhere else that uses list_for_each and makes that assumption.

I can submit a patch next week myself, but I imagine you'll want to solve it sooner by redefining those few functions used by Fair Tree to be int rather than void and just return SLURM_SUCCESS or 0 or whatever.

All that said, removing the break statement did not solve our zeroes problem, but I can see how it would solve other problems.  Our issue might be different from Andy's after all.
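
A matching sketch of the fix being discussed, using the same hypothetical iterator as the previous sketch: give the callback the int-returning signature the iterator expects and always return a non-negative value. SLURM_SUCCESS is 0 in Slurm; it is defined locally here only so the sketch stands alone.

#include <stdio.h>

#define SLURM_SUCCESS 0

typedef int (*each_fn_t)(void *item);

static int my_list_for_each(void **items, int count, each_fn_t fn)
{
    int i;
    for (i = 0; i < count && fn(items[i]) >= 0; i++)
        ;
    return i;
}

/* Corrected callback: declared with the int return type the iterator
 * expects, and it never returns a negative value, so iteration always
 * covers every job in the list. */
static int apply_factors(void *item)
{
    printf("processed job %d\n", *(int *)item);
    return SLURM_SUCCESS;
}

int main(void)
{
    int ids[] = { 1, 2, 3, 4 };
    void *items[] = { &ids[0], &ids[1], &ids[2], &ids[3] };
    printf("visited %d of 4 jobs\n",
           my_list_for_each(items, 4, apply_factors));
    return 0;
}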
Comment 46 Brian Christiansen 2015-03-27 08:46:06 MDT
You totally ruined my weekend. :)

I made the changes in:
https://github.com/SchedMD/slurm/commit/b2be6159517b197188b474a9477698f6f5edf480

I'll see if I can see zeros again with the commit.

Andy, will you let me know if this patch fixes it for you as well?
Comment 47 Ryan Cox 2015-03-27 09:02:20 MDT
It was a great find nonetheless :)

One other idea I had for tracking it down is to have the code core dump if it ever returns a zero to the sprio RPC for a value that should never be zero, such as fairshare.  That should give a pretty good view into a potential race condition by showing which threads were where at the time.

I've been meaning to try replicating this in a test instance and then I had the crazy thought to try it in production then rely on a failover to keep things alive.  I haven't had the time to try the first idea and I'm not feeling too crazy about the second idea at the moment, but maybe I'll try it next week if I still haven't updated our test environment and I am feeling crazy.
Comment 48 Brian Christiansen 2015-03-30 05:54:35 MDT
Created attachment 1784 [details]
fair_tree fix

Ryan, I was able to reproduce it and find the problem -- hopefully for the last time :). The problem happens in fair_tree_decay(). 

First, _ft_decay_apply_new_usage() is called for each job. Inside, it calls set_priority_factors(), which just sets the priority factors without applying the weights -- so you get small numbers.

ex.
(gdb) p *job->prio_factors
$7 = {job_id = 0, user_id = 0, priority_age = 0.00019345238095238096, 
  priority_fs = 0.5, priority_js = 1, priority_part = 1, priority_qos = 0, 
  nice = 10000}


The job locks are released and any pending sprio's can then read the job's priority factors -- showing the zeros because the weights haven't been applied. 

Then decay_apply_weighted_factors() is called for each job; it calls _get_priority_internal(), which calls set_priority_factors() again and then applies the weights.


It doesn't appear that fair_tree uses the priority factors, so the attached patch removes the call to set_priority_factors() from _ft_decay_apply_new_usage(). Will you apply this patch and let me know how it works for you?
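
As a runnable toy timeline of the sequence described above: the function names below are stand-ins, not the real fair_tree.c symbols, and the example factor numbers are taken from the gdb dumps in this ticket.

#include <stdio.h>

/* Stand-ins for the real locking and priority calls. */
static void lock_jobs_write(void)   { puts("job write lock taken"); }
static void unlock_jobs_write(void) { puts("job write lock released"); }

int main(void)
{
    /* Phase 1 -- mirrors _ft_decay_apply_new_usage(): the unweighted
     * factors are stored, then the write lock is dropped. */
    lock_jobs_write();
    puts("  set_priority_factors(): fs=0.5 js=1 part=1 (no weights)");
    unlock_jobs_write();

    /* Window: an sprio RPC holding the job read lock here reads the
     * unweighted factors; cast to int for display they show as 0 or 1. */
    puts("<-- sprio can read here and shows zeros -->");

    /* Phase 2 -- mirrors decay_apply_weighted_factors() ->
     * _get_priority_internal(): the factors are recomputed and the
     * weights applied, so the numbers look normal again. */
    lock_jobs_write();
    puts("  weights applied: fs=25000 js=200000 part=1000000");
    unlock_jobs_write();
    return 0;
}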
Comment 49 Ryan Cox 2015-03-31 08:37:23 MDT
Unfortunately I'm having a hard time going back and remembering all the calls into preexisting parts of the multifactor plugin.  It really would have been nice to have the decay part separated out into another plugin, as well as individual plugins for each of the factors in the multifactor plugin... :)

I have to admit that I'm a little confused about how this patch would work.  set_priority_factors() works per job and calls xmalloc or memset at the start to wipe out the priority factors.  There are no return statements in the function, so it runs to completion, all inside of a write lock.

set_priority_factors() is also per job.  There would have to be an epic race condition for sprio to read the job data in the same order, at the same rate, in between the memset and the priorities being set (assuming the locking is broken in the first place).  I just can't figure out how sprio could see all zeroes due to this.  If the locking isn't right, sure, an entry or two may be zeroed out during sprio (or other calls).  But all of them (hundreds or thousands of jobs) being zeroed out at the exact same moment?

The call to set_priority_factors in _ft_decay_apply_new_usage may indeed be in error, but I can't see how it would result in zeroes for all jobs.  The locking around the call to _ft_decay_apply_new_usage (via list_for_each in fair_tree_decay) looks correct.

By the way, if this does end up being the correct solution you'll probably want to add a comment saying that you can't directly call decay_apply_new_usage in list_for_each because it has the wrong return type.

I'm still looking through the implications of removing the call to set_priority_factors there.  I know that we based that call on what other algorithms did.  It's quite possible that we got it wrong so I'm trying to check on it.  Since there are several entry points into set_priority_factors I'm validating that this one is redundant in some way.  I seem to recall that it was there due to applying decay somehow.
Comment 50 Ryan Cox 2015-03-31 08:46:44 MDT
Well, I just discovered something right after submitting.  My while loop has a problem!  It only outputs the rows with zeroes.  It should output ALL rows.

This is what I should have done:
while true; do time sprio_out=$(sprio -l); zeroes=$(echo "$sprio_out" | awk '$5 == 0 {print}'); if [[ "x$zeroes" != "x" ]]; then echo -e "$(date)\n\n$sprio_out" | mail -s zeroes "$email"; echo Found at $(date); fi; sleep 1; done


With that fix in place, I now see that I get all zeroes for everyone *except* one user.  He has a "1" in the fairshare column and zeroes in all other columns (except jobid), but he should have a fairshare value of around 6250.  It has happened several times now and it's always user masaki.  I can see nothing odd about his jobs so far but I'm still checking.

I'm trying to figure out where the 1 in the fairshare column could come from.  This seems to be user-based which would imply something to do with assocs rather than jobs.  Maybe the part of the code assigning values to jobs should have an assoc mgr lock but doesn't?  I'm taking a look.
Comment 51 Brian Christiansen 2015-03-31 09:07:04 MDT
Let me try to explain it better. I'll show it to you with gdb.


If you look at the job before it enters the first set of locks in fair_tree_decay() you'll see that the job's priority factors are correct from the last calculation:

Breakpoint 1, fair_tree_decay (jobs=0x1871d90, start=1427835011)                     
    at ../../../../../../slurm/src/plugins/priority/multifactor/fair_tree.c:54       
54                      { NO_LOCK, WRITE_LOCK, READ_LOCK, READ_LOCK };               
(gdb) set $job = (struct job_record *)jobs->head->next->data
(gdb) p $job->job_id                                                                 
$1 = 11018                                                                           
(gdb) p $job->priority                                                               
$2 = 1225015                                                                                                            
(gdb) p *$job->prio_factors                                                          
$4 = {job_id = 0, user_id = 0, priority_age = 15.294312169312169,                    
  priority_fs = 25000, priority_js = 200000, priority_part = 1000000,                
  priority_qos = 0, nice = 10000}                                                    
(gdb) c                                                                              
Continuing.                                                      


The jobs then go through _ft_decay_apply_new_usage(), which calls set_priority_factors() on all of the pending jobs. set_priority_factors() only sets the factors and doesn't apply the weights, so the numbers are really small.
                    
ex.                                                                                  
Breakpoint 2, fair_tree_decay (jobs=0x1871d90, start=1427835011)                     
    at ../../../../../../slurm/src/plugins/priority/multifactor/fair_tree.c:64       
64              assoc_mgr_lock(&locks);                                              
(gdb) p *$job->prio_factors                                                          
$5 = {job_id = 0, user_id = 0, priority_age = 0.00040509259259259258,                
  priority_fs = 0.5, priority_js = 1, priority_part = 1, priority_qos = 0,           
  nice = 10000}                                                                      
(gdb) c                                                                              
Continuing.                                                                          

The job write lock is released, and at this point any sprio requests waiting on a read lock can read the current priority factors. For this job, age and fairshare will show zero because they are cast to ints, and sprio will show a priority of 2 for the job (job size + partition).

The job then goes through decay_apply_weighted_factors(), which calls _get_priority_internal(); that sets the factors and then applies the weights, so the factors now look like you would expect.
                                                                                     
Breakpoint 3, fair_tree_decay (jobs=0x1871d90, start=1427835011)
    at ../../../../../../slurm/src/plugins/priority/multifactor/fair_tree.c:71
71              unlock_slurmctld(job_write_lock);
(gdb) p *$job->prio_factors
$6 = {job_id = 0, user_id = 0, priority_age = 20.25462962962963, 
  priority_fs = 25000, priority_js = 200000, priority_part = 1000000, 
  priority_qos = 0, nice = 10000}


Not releasing the write lock until the end of fair_tree_decay() would also fix it, but set_priority_factors() doesn't need to be called in _ft_decay_apply_new_usage() at all, so it can simply be removed.

You can put a sleep after the first release of the job write lock (line 62) and you'll see it happen more.

Do you see how it happens now?
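
As a trivial standalone check of the display arithmetic in the walkthrough above, using the unweighted factors from the gdb dump: the cast to int is what makes sprio show zeros and a priority of 2.

#include <stdio.h>

int main(void)
{
    /* Unweighted factors from the gdb output above. */
    double age = 0.00040509259259259258, fs = 0.5, js = 1, part = 1, qos = 0;

    /* sprio displays integer columns, so the unweighted values show as: */
    printf("AGE=%d FAIRSHARE=%d JOBSIZE=%d PARTITION=%d QOS=%d\n",
           (int)age, (int)fs, (int)js, (int)part, (int)qos);
    /* -> AGE=0 FAIRSHARE=0 JOBSIZE=1 PARTITION=1 QOS=0 */

    printf("PRIORITY=%d\n", (int)(age + fs + js + part + qos));
    /* -> PRIORITY=2, the "priority of 2 ... (job size + partition)"
     *    described above */
    return 0;
}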
Comment 52 Ryan Cox 2015-03-31 09:16:18 MDT
(In reply to Brian Christiansen from comment #51)

> Do you see how it happens now?

Ah.  That makes sense now.  I know you said in a previous comment that the weights weren't applied yet but somehow I forgot that when I read through the code.

We're trying out the patch now.
Comment 53 Ryan Cox 2015-03-31 10:56:11 MDT
The testing is very promising and conceptually this seems like the right solution.  No zeroes yet.  It looks like you fixed several very subtle issues with this one ticket:  a job array issue, void functions mistakenly cast to int-returning function pointers, and now this one (my bad on the last two).  Thanks!

I'm not expecting to see any more problems at this point but I'll let you know tomorrow how the overnight testing went.
Comment 54 Ryan Cox 2015-04-01 03:54:46 MDT
It worked!  I've had a loop running sprio every second since yesterday at 4pm.  There have been no instances of zeroes at all since then, whereas I normally would have seen dozens.
Comment 55 Brian Christiansen 2015-04-01 06:03:58 MDT
Awesome! I've committed the patch:

https://github.com/SchedMD/slurm/commit/f61475e8005bb9d834f8e227b0dc4282cc1f0aee

I'll close the ticket. Please reopen if you see anything else. Thanks for everyone's help!