| Summary: | Jobs stop accruing age priority | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Martins Innus <minnus> |
| Component: | Scheduling | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | Priority: | --- |
| Version: | 14.11.4 | CC: | brian, da |
| Hardware: | Linux | OS: | Linux |
| Site: | University of Buffalo (SUNY) | | |
| Version Fixed: | 14.11.6, 15.08.0pre4 | | |
Attachments:
- slurm-cluster.conf
- slurm.conf
- Sprio post slurm restart
- slurmctl log
- debug patch
- slurmctld log with priority debug patch set
- debug patch 2
- slurmctl log with more debugging
- Unoptimize patch
Description
Martins Innus
2015-03-20 01:41:29 MDT
Brian Christiansen
Will you send your slurm.conf? Do you see this on new jobs since the
upgrade as well?

Thanks,
Brian

Martins Innus
Created attachment 1746 [details]
slurm-cluster.conf
Here you go. It looks like any jobs submitted since the upgrade have an
age of "0". Jobs that were pending in the queue during the upgrade have
the age that they had prior to the upgrade.

And actually, looking further, it looks like no priorities are being
changed after a job is submitted. We run fairshare, and for users who
have jobs pending in the queue, their fairshare is not going down even
though they also have jobs running.

Martins
Martins Innus
Created attachment 1747 [details]
slurm.conf
Brian Christiansen
I'm not able to reproduce this yet. Will you run with the Priority debug
flag (DebugFlags=Priority)? You can use sview to set it on the fly too if
you want. You should see lines like this in the logs every five minutes,
since PriorityCalcPeriod is set to the default:

[2015-03-20T11:52:00.645] Fairshare priority of job 24599 for user brian in acct bubu is 2**(-1.000000/0.145937) = 0.386770
[2015-03-20T11:52:00.645] Weighted Age priority is 0.018638 * 50000 = 931.88
[2015-03-20T11:52:00.645] Weighted Fairshare priority is 0.386770 * 50000 = 19338.50
[2015-03-20T11:52:00.646] Weighted JobSize priority is 0.100000 * 200000 = 20000.00
[2015-03-20T11:52:00.646] Weighted Partition priority is 1.000000 * 1000000 = 1000000.00
[2015-03-20T11:52:00.646] Weighted QOS priority is 0.000000 * 0 = 0.00

Will you then send the logs, pointing me to a job that you think should
be increasing in priority age?
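For reference, a minimal sketch of the two ways to turn this flag on.
Both the slurm.conf option and the scontrol subcommand are standard
Slurm; the runtime toggle is the command-line equivalent of flipping it
in sview:

    # Persistent: add to slurm.conf, then restart or reconfigure slurmctld
    #   DebugFlags=Priority

    # On the fly, without a restart:
    scontrol setdebugflags +Priority

    # And to turn it back off later:
    scontrol setdebugflags -Priority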
Martins Innus
Created attachment 1748 [details]
Sprio post slurm restart
We added the debug parameter and restarted slurm. See in the attached
sprio output that all jobs that existed before the slurm restart have
-10000 nice and all the rest zero. Jobs submitted since the restart have
whatever priority was calculated at submit time, and it never seems to
change. Will send the other logs after a few scheduling cycles have run.

Brian Christiansen
Will you confirm that the priorities of 10000 are job arrays? This was
fixed in 14.11.5:
https://github.com/SchedMD/slurm/commit/423029d8364e8856c6c32b019cb177e92ea18665

Martins Innus
Some of them maybe, but certainly not all.

Created attachment 1749 [details]
slurmctl log
Log attached. One job that is not accruing age is 3505711, but I think
they are all the same.

Brian Christiansen
Will you get the output of "scontrol show job 3505711"?

Martins Innus
That's probably a bad one since it has since started. How about this one?
It is in the log I sent you. Let me know if you want the other one.

[minnus@rush:~]$ sprio -l | grep 3505675
3505675   tiangebi        612        157        216        240          0          0          0
[minnus@rush:~]$ sprio -l | grep 3505675
3505675   tiangebi        612        157        216        240          0          0          0

Those sprios are long enough apart that they should increment. The
numbers are no longer 0 since we have done a scontrol job release on all
jobs to try to get things moving. The "release" seems to increment the
counters when the command is run, but then they stay the same after that.

[minnus@rush:~]$ scontrol show job 3505675
JobId=3505675 JobName=0GPa_2FU_vdW-1x6-3
   UserId=tiangebi(423120) GroupId=ezurek(104475)
   Priority=612 Nice=0 Account=ezurek QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2015-03-20T15:39:56 EligibleTime=2015-03-20T15:39:57
   StartTime=2015-03-22T04:32:00 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=general-compute AllocNode:Sid=k07n14:78666
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=d13n06
   NumNodes=1-1 NumCPUs=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=8:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryCPU=3000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/gpfs/projects/ezurek/tiangebi/PH3/2FU/0GPa_2FU_vdW/00001x00006/job.slurm
   WorkDir=/gpfs/projects/ezurek/tiangebi/PH3/2FU/0GPa_2FU_vdW/00001x00006
   StdErr=/projects/ezurek/tiangebi/PH3/2FU/0GPa_2FU_vdW/00001x00006//../1x6-3.out
   StdIn=/dev/null
   StdOut=/projects/ezurek/tiangebi/PH3/2FU/0GPa_2FU_vdW/00001x00006//../1x6-3.out

Brian Christiansen
Thanks. I'll get you a patch with additional logging. It's probably the
end of your day, so we can pick it up on Monday.

Created attachment 1752 [details]
debug patch
Will you run with the attached patch and get logs with at least two instances of the following log line for a job?
[2015-03-20T15:43:54.317] 24636->priority_age=0.029980 diff:18132 start_time:1426891434 use_time:1426873302 max_age:604800 weight_age:50000
Thanks,
Brian
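For what it's worth, the arithmetic in that sample line checks out if the
age factor is read as elapsed eligible time divided by the maximum age
(an interpretation of the patch's output, not official documentation):

    # diff     = start_time - use_time = 1426891434 - 1426873302 = 18132 s
    # age      = diff / max_age        = 18132 / 604800          = 0.029980
    # weighted = age * weight_age      = 0.029980 * 50000        = 1499
    echo 'scale=6; (1426891434 - 1426873302) / 604800' | bc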
Brian Christiansen
I forgot to mention that the extra logs are turned on with the Priority
debug flag.

Martins Innus
Created attachment 1754 [details]
slurmctld log with priority debug patch set
OK, attached. It looks like it only outputs when a job is submitted. I
let it run through 3 cycles.

Just a clarification on the last comment: that means there are no jobs
that get that line more than once. It only happens at job submit time and
never for jobs already in the queue.

Brian Christiansen
Created attachment 1755 [details]
debug patch 2
I've attached a new patch with more logging. It includes the previous logs as well. Will you rerun the test with this patch? Thanks for being patient and working with us.
Martins Innus
Created attachment 1756 [details]
slurmctl log with more debugging
Output attached. The output is not likely helpful, since with this patch
and the priority debugging turned on, the priorities are all calculated
correctly!! That is, sprio shows correct output.

So then, I did the following: when I leave this patch compiled in but
turn priority debugging off, we get the priorities not calculated
correctly. sprio shows "0" for all age priority. Some sort of race
condition?

Brian Christiansen
Hmm.. interesting. I'll look at the possibility of a race condition.
Thanks for your help.

I just want to confirm one thing. With the patch compiled in but the
debugging off, does sprio show an age of 0 for all jobs, or is it just
new jobs? If you do scontrol release on a job, does the age change or
does it stay at 0? And what Linux distro are you using on your
controller?

Martins Innus
I replied to these yesterday by email, but it looks like they don't show
up. Trying again through the website.

When restarting slurmctld with this patch applied but debugging turned
off, existing jobs keep their priorities and new jobs show up as "0".
Nothing increments after this time. I am fairly certain that on one of
these tests everything was "0", but I am unable to reproduce that. If we
do a "release", the priorities are calculated at that time, but do not
increment after that time. We have been running "release" on all jobs a
couple times a day to keep things sane.

CentOS 6.6 on the controller and everywhere else.
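A minimal sketch of that kind of blanket release (the squeue and scontrol
options are standard, but the exact invocation is an assumption, not the
site's actual script):

    # Release every pending job so slurmctld recalculates its priority once.
    squeue --noheader --states=PENDING --format='%i' | xargs -r -n1 scontrol release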
Brian Christiansen
Created attachment 1772 [details]
Unoptimize patch
I was finally able to reproduce this. I can reproduce it when the
binaries are compiled with optimizations (-O2) and I'm using CentOS 6.6.
I couldn't reproduce it when using CentOS 6.5 or Ubuntu 14.10. Placing
debug statements around the function that doesn't get called fixes it. It
appears that CentOS 6.6's gcc is optimizing out a function.
I also saw where all of the priorities were 10000 and the factors were 0s
after a restart. This makes sense because the function that calculates
the priorities and factors wasn't being called (it was optimized out).
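One way to check that kind of suspicion is to look for the function in
the symbol table of the built plugin. This is only a sketch: the plugin
path is an assumption, the ticket never names the affected function, and
the grep pattern here is a guess:

    # List symbols in the multifactor priority plugin; a static function
    # that gcc inlined or dropped at -O2 may simply be absent from the
    # symbol table (and a stripped binary shows nothing at all).
    nm /usr/lib64/slurm/priority_multifactor.so | grep -i priority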
Will you apply this patch and see if it fixes it for you? It fixes it for me.
Or the other option is to turn off optimization altogether
(CFLAGS="-O0"). It's something that we may do as a default in the future
to avoid things like this.

Martins Innus
Unfortunately the patch didn't work. On slurmctl restart, existing jobs
have the 10000/0 status, and new jobs show up with 0 age while the rest
of the priority is whatever it would be at submit time. We are going to
try with -O0 next. It seems strange that it would be an optimizer bug,
since in your last debug patch, toggling the debug variable changed the
behavior after the code was already compiled.

Rebuilding with -O0 fixed the issue. Do you expect any performance hits
from doing this? Our size is roughly:

~1000 nodes
~1500 users, but ~200 active at a time
5-10K jobs a day

Anything else we should try, or do you think we can safely run this way
moving forward? Thanks for the help!

Brian Christiansen
Great! We don't expect any significant performance hits. I would run with
-O0 and also add the -g flag to get symbols (see the build sketch at the
end of this ticket). This will allow us to better diagnose any problems
you may encounter. Our plan is to disable optimizations and add symbols
going forward.

I found the main cause of the priority issue. It had to do with checking
return values from void functions -- which short-circuited calculating
the priorities for the rest of the jobs. It was still possible to see the
issue even without optimizations, as you had seen. With your help I was
able to make progress on Bug 1469, which was seeing the same problem.

It is fixed in:
https://github.com/SchedMD/slurm/commit/b2be6159517b197188b474a9477698f6f5edf480

And we've enabled debugging symbols and no optimizations by default in:
https://github.com/SchedMD/slurm/commit/614b9770ba1a7fc2b639768566849168aa883da4

Thanks for your help. Let me know if you have any questions.

Thanks,
Brian

Martins Innus
Brian,

Thanks! That makes sense. Do you have an expectation of when 14.11.6 will
be out? Trying to figure out if we should just patch this in or wait for
the next release.

Thanks,
Martins

Brian Christiansen
14.11.6 will probably be 2-3 weeks out.
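For reference, a rebuild along the lines suggested above might look like
this (the source directory and install prefix are assumptions; adjust to
the local layout):

    # Configure with optimizations off and debug symbols on, as suggested.
    cd slurm-14.11.5                                        # path assumed
    ./configure CFLAGS="-O0 -g" --prefix=/usr/local/slurm   # prefix assumed
    make -j8
    make install
    # Restart slurmctld afterwards so the new binary takes effect.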