Ticket 3844 - Is there a good way to emulate maui's soft limits?
Summary: Is there a good way to emulate maui's soft limits?
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 16.05.7
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-05-26 15:58 MDT by NASA JSC Aerolab
Modified: 2018-08-23 02:40 MDT

See Also:
Site: Johnson Space Center
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 17.11.10, 18.08.0rc1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
current slurm.conf (4.12 KB, text/plain)
2017-05-26 15:59 MDT, NASA JSC Aerolab
Details
hipri (151.97 KB, application/gzip)
2018-07-23 15:35 MDT, NASA JSC Aerolab
Details
sprio output (1.70 KB, text/plain)
2018-07-25 10:12 MDT, NASA JSC Aerolab
Details
squeue output (4.60 KB, text/plain)
2018-07-25 10:13 MDT, NASA JSC Aerolab
Details
Diagnostic files (442.13 KB, application/gzip)
2018-07-26 15:39 MDT, NASA JSC Aerolab
Details
patch not reviewed (1.46 KB, patch)
2018-08-08 02:48 MDT, Alejandro Sanchez
Details | Diff
Diagnostic files (273.66 KB, application/gzip)
2018-08-09 07:11 MDT, NASA JSC Aerolab
Details

Description NASA JSC Aerolab 2017-05-26 15:58:51 MDT
Before switching to slurm we used torque and maui.  We had maui configured to put a soft limit of 768 cores on users' jobs, which meant that all of a user's jobs were considered "idle" until going over the soft limit, at which point the user's jobs were considered "blocked".  Idle jobs were considered for scheduling before blocked jobs.  Is there a good way to emulate this with slurm?  I think QOS's could help with this but I haven't had the chance to understand them well enough.

What the above configuration really accomplished was to let all users have high priority on a small number of jobs (~10% of the system capacity in our case).  The end result was that if a lot of users were in the queue, everyone would get a few jobs started right away.  I'm trying to get back to the same point with slurm.  We tend to have times when some users' jobs get starved for long periods.
Comment 1 NASA JSC Aerolab 2017-05-26 15:59:43 MDT
Created attachment 4655 [details]
current slurm.conf
Comment 3 Alejandro Sanchez 2017-05-29 03:36:46 MDT
Slurm has no soft limits. You can take a look at the resource limits guide on the Slurm documentation webpage to get a feel for what types of limits can be imposed on jobs:

https://slurm.schedmd.com/resource_limits.html

The PriorityType=priority/multifactor plugin could also be used together with the FairShare factor. The fair-share component of a job's priority influences the order in which a user's queued jobs are scheduled to run, based on the portion of the computing resources they have been allocated and the resources their jobs have already consumed.

https://slurm.schedmd.com/priority_multifactor.html

You might be accustomed to soft limits, but I would strongly encourage you to look at Slurm's Quality Of Service (QOS) capability as a better solution. QOS lets you establish different job limits and supports job preemption. The preemption is especially important in that you can preempt lower priority jobs at will rather than finding your machine full of low priority jobs all Monday morning just because the system went idle on the weekend. A typical configuration would be to establish a "standby" QOS with large time/size limits, but preemptable by normal QOS jobs on demand. See: 

https://slurm.schedmd.com/qos.html
https://slurm.schedmd.com/preempt.html
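As a rough sketch of that kind of standby setup (the QOS names, preempt mode and the idea of letting "normal" preempt "standby" are illustrative here, not a prescription for your site):

# slurm.conf
PreemptType=preempt/qos
PreemptMode=REQUEUE

# sacctmgr: create the preemptable QOS and let the normal QOS preempt it
sacctmgr add qos name=standby
sacctmgr modify qos name=normal set Preempt=standby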

I've taken the liberty of reviewing your slurm.conf. You might consider making these changes:

ProctrackType=proctrack/cgroup # This helps with job cleanup. When the job finishes, anything spawned in the cgroup will be cleaned up. This prevents runaway jobs (i.e. jobs that double-forked themselves). NOTE that the 'pgid' mechanism (your current setting) is not entirely reliable for process tracking.

You have:
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageLoc=/tmp/slurm_job_accounting.txt

This fully qualified path name to a txt file would make sense if you had accounting_storage/filetxt. We usually recommend a setup with accounting_storage/slurmdbd and an underlying database like MySQL or MariaDB. If, besides the database, you'd also like the information about your finished jobs stored somewhere else, you could consider using a JobComp plugin to complement the accounting, although you might find most of that information redundant since it can be retrieved with sacct once you have slurmdbd plus the database in place.

I see you have:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
and all of your partitions defined with OverSubscribe=Exclusive. If you want to allocate entire nodes to jobs, you might want to consider switching to select/linear and avoiding the overhead created by the use of select/cons_res.

# Backfill
If you haven't seen it, we think Doug Jacobsen did an excellent job of walking
people through how NERSC approaches some of their priority and scheduler
tuning. The presentation is here:

https://slurm.schedmd.com/SLUG16/NERSC.pdf

and may provide some insights. We find his mapping of priority to units of time
rather inspiring.

Our usual starting points for tuning are:
bf_continue
bf_window=(enough minutes to cover the highest MaxTime on the cluster.)
bf_resolution=(usually at least 600)
bf_min_prio_reserve may actually suit you well depending on your queue depth,
although you'd need to jump to a 16.05 / 17.02 release to get that. The idea
behind it is to only test whether the lower priority jobs can launch
immediately, and not bother trying to slot them into the backfill map
otherwise. That has *huge* performance gains for them, and lets them keep their
systems 95%+ occupied.
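For example, those starting points might translate to something like the following in slurm.conf (a sketch only; the 2880-minute window assumes a two-day maximum walltime, so adjust bf_window to your own highest MaxTime, and bf_resolution to your needs):

SchedulerParameters=bf_continue,bf_window=2880,bf_resolution=600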

We highly encourage staying updated to the latest Slurm stable release (currently 17.02.3 at the time of writing this comment), or at least the latest 16.05 (currently 16.05.10-2). A lot of bugs have been fixed since 16.05.7, and in my experience we are usually able to reproduce and troubleshoot bugs faster when the Slurm version is up to date. Please let us know if you have further questions. I know it's kind of a lot of information to absorb, but feel free to ask anything.
Comment 4 NASA JSC Aerolab 2017-05-30 10:39:48 MDT
Yes, it is a lot of info to absorb.  I do have several comments/questions.  

I've been meaning to switch to ProctrackType=proctrack/cgroup for a while but haven't had the time to test this.  Do you have any recommendations for a good cgroup.conf configuration?  I think we want both ConstrainCores=yes and ConstrainRAMSpace=yes.  Anything else you recommend?

I think the AccountingStorageLoc being set is a leftover from our previous trials when first configuring slurm.  We are currently using slurmdbd with a mysql DB so I think we're set there.  

We switched from select/linear to select/cons_res in bug 3818.  I do intend to switch back to select/linear.  But we learned the hard way in that bug that we need to drain the jobs before switching this.  

I would also like to update to 17.02 soon and intend to do that.  

When I first got slurm running on our cluster (at the end of 2016), I spent a lot of time trying to understand how to use QOS and/or TRES to emulate the maui soft limit behavior I'm looking for.  I spent some more time looking at QOS and TRES this morning and I'm still not clear on how to utilize these.  Can you please point me in the right direction?  How can I set up priority/multifactor such that a user's priority would be high when they have less than X number of procs running but low when they have more than X procs running?
Comment 5 Alejandro Sanchez 2017-05-31 03:56:13 MDT
(In reply to NASA JSC Aerolab from comment #4)
> Yes, it is a lot of info to absorb.  I do have several comments/questions.  
> 
> I've been meaning to switch to ProctrackType=proctrack/cgroup for a while
> but haven't had the time to test this.  Do you have any recommendations for
> a good cgroup.conf configuation?  I think we want both ConstrainCores=yes
> and ConstrainRAMSpace=yes.  Anything else you recommend?

We usually recommend:

ProctrackType=proctrack/cgroup
and
TaskPlugin=task/affinity,task/cgroup <-- with TaskAffinity=no in cgroup.conf. This combines the best of the two task plugins. task/cgroup will be used to fence jobs into the specified memory, gpus, etc., and task/affinity will handle the task binding/layout/affinity. The affinity logic in task/affinity is better than that in task/cgroup.

Regarding the cgroup.conf, a good setup could be:

CgroupMountpoint=/sys/fs/cgroup # or your cgroup mountpoint
CgroupAutomount=yes
CgroupReleaseAgentDir=/path/to/yours
ConstrainCores=no
ConstrainDevices=yes
ConstrainRAMSpace=yes
AllowedRAMSpace=100
ConstrainSwapSpace=yes
AllowedSwapSpace=0
TaskAffinity=no

Notes:

- Let me double check internally if ConstrainCores should be 'no' with TaskAffinity=no. I'll come back to you.
- Regarding the ReleaseAgent: since 16.05.5 Slurm was supposedly able to automatically remove the cpuset and devices subsystems without the need for a release_agent. Later we discovered that for some specific configs/use-cases there were still step hierarchies not being cleaned up, so a fix was added in 17.02.3, and since that version all cleanups seem to work great. So in short, for your 16.05.7 version I'd still use the release agent.
- You can also use MemSpecLimit in your node definition to set aside some memory in the nodes for slurmd/slurmstepd usage. That memory won't be available for job allocations, thus leaving ((RealMemory - MemSpecLimit) * AllowedRAMSpace)/100 for job allocations.
- Memory can be enforced by two mechanisms in Slurm. One is by sampling memory statistics at a frequent interval through the JobAcctGather plugin, and the other is by using the task/cgroup part of TaskPlugin=task/cgroup,task/affinity. If you use the JobAcctGather plugin, we recommend jobacct_gather/linux. We also encourage not using the two mechanisms for enforcement at the same time. If you have both plugins enabled (JobAcctGather and Task), then we suggest adjusting:

JobAcctGatherParams=NoOverMemoryKill # disables JobAcct... mem enforcement.
If the job is truly over its memory limits the cgroup enforcement is what should be killing it, and is not affected by this setting.

You may also want to set UsePSS as well - this changes the data collection from
RSS to PSS. If the application is heavily threaded it might be getting the
shared memory space from the application counted against it once per
thread/process, which could explain the apparently high usage from summing the
RSS values together. PSS divvies up the shared mem usage between all the
separate processes, so when summed back together you get a more realistic view
of the memory consumption.
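Put together, a sketch of the relevant slurm.conf lines under those recommendations (assuming you keep both plugins enabled and let the cgroup do the enforcement) would be:

JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=NoOverMemoryKill,UsePss
TaskPlugin=task/affinity,task/cgroup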

> I think the AccountingStorageLoc being set is a leftover from our previous
> trials when first configuring slurm.  We are currently using slurmdbd with a
> mysql DB so I think we're set there.  

Ok.
 
> We switched from select/linear to select/cons_res in bug 3818.  I do intend
> to switch back to select/linear.  But we learned the hard way in that bug
> that we need to drain the jobs before switching this.  

I've just noticed you changed from linear to cons_res in that bug. Regarding the jobs being killed... yes, that was accidentally bad advice. The slurm.conf man page for SelectType does warn, though: "Changing this value can only be done by restarting the slurmctld daemon and will result in the loss of all job information (running and pending) since the job state save format used by each plugin is different."

Anyhow and back to the point, if you were recommended to stick to select/cons_res in that bug, do not change back. Also I saw Danny pointed to a commit which is included since 16.05.8+.

> I would also like to update to 17.02 soon and intend to do that.  
> 
> When I first got slurm running on our cluster (at the end of 2016), I spent
> a lot of time trying to understand how to use QOS and/or TRES to emulate the
> maui soft limit behavior I'm looking for.  I spend some more time looking at
> QOS and TRES this morning and I'm still not clear to to utilize these.  Can
> you please point me in the right direction?  How can I set up
> priority/multifactor such that a users priority would be high when they have
> less than X number of procs running but low when they have more than X procs
> running?

Let me do some tests and come back to you for this question.
Comment 10 NASA JSC Aerolab 2017-05-31 09:55:44 MDT
The cgroup info is very helpful - thanks.  

No worries about the cons_res mishap - I missed it too.  

Looking forward to your recommendations for the multifactor setup we are trying to achieve.
Comment 11 Alejandro Sanchez 2017-05-31 10:22:42 MDT
(In reply to NASA JSC Aerolab from comment #4)
> When I first got slurm running on our cluster (at the end of 2016), I spent
> a lot of time trying to understand how to use QOS and/or TRES to emulate the
> maui soft limit behavior I'm looking for.  I spend some more time looking at
> QOS and TRES this morning and I'm still not clear to to utilize these.  Can
> you please point me in the right direction?  How can I set up
> priority/multifactor such that a users priority would be high when they have
> less than X number of procs running but low when they have more than X procs
> running?

I've been doing some tests today and also discussed this internally. As I said before, Slurm doesn't support the concept of soft limits.

A brief description of how scheduling works in Slurm follows. In any case, I'd recommend reading the docs and this scheduling tutorial:
https://slurm.schedmd.com/SUG14/sched_tutorial.pdf

When not using FIFO scheduling, jobs are prioritized in the following order:
              1. Jobs that can preempt
              2. Jobs with an advanced reservation
              3. Partition Priority Tier
              4. Job Priority
              5. Job Id

Point 4) is where priority/multifactor comes into play. A Job Priority is the result of the sum of various factors multiplied by admin defined weights. One of the factors is the FairShare which looks at the past usage. This usage can be decomposed so the admin can say how much to charge for every different TRES. The admin can also clear past usage by using either PriorityDecayHalfLife or PriorityUsageResetPeriod. Anyhow, if the PriorityWeightFairshare isn't high enough as compared to the rest of multifactor weights, the FairShare won't have much impact on the final job priority, which, let's keep in mind, is the 4th point above to be considered when Scheduling.

Besides that, a bunch of SchedulerParameters are used to make further scheduling decisions; these can be consulted in the man page.

So again, Slurm has no soft limits, and the closest approaches we've come up with to cover that use-case would be:

a) PriorityFavorSmall=yes, so the smaller the job the higher the priority.  But this is not an aggregate; it works on a per-job basis.
b) GrpTRES=cpu=X: all running jobs combined for the association and its children can consume up to X cpus at the same time. If the limit is reached, jobs will be left pending (PD) with reason AssocGrpCpuLimit, EVEN IF THERE ARE FREE RESOURCES. I think what you would want is for new jobs from the association that reached the limit to be left pending only if jobs from other associations want to consume the resources (those would be favored since this association already used X cpus), but currently the Grp* limits are hard limits, and this is the way Slurm works today. (A minimal sacctmgr sketch of this option follows the list.)
c) Do not set any Grp* limits, and just let the FairShare factor affect the final job priority. You can specify through TRESBillingWeights which weight to assign to each TRES and PriorityWeightTRES sets the degree each TRES Type contributes to the job's priority.
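For reference, option (b) would be set on the association with something like the following (the account name 'aero' and the cpu count are only illustrative):

sacctmgr modify account name=aero set GrpTRES=cpu=768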

Any solution we can think of would require new code to make anything like this happen. There isn't anything in Slurm that says "you are running X, so beyond X don't give your jobs priority". If you're interested in sponsoring something like that, let us know and we can discuss it further.
Comment 12 Alejandro Sanchez 2017-05-31 10:32:41 MDT
Btw, following up on my comment #5: set ConstrainCores=yes. If you're doing task binding to cpus we recommend enabling this; it'll force the tasks to only run on the CPUs explicitly assigned to them. Otherwise they may run on whichever cores the Linux kernel schedules them on automatically, and they can use more than their share of the CPUs.
Comment 13 Alejandro Sanchez 2017-06-19 06:39:09 MDT
Hi, is there anything else we can assist you with on this bug? Thanks.
Comment 14 NASA JSC Aerolab 2017-06-19 08:29:25 MDT
Yes, please keep this open.  We would still like help tweaking our scheduling to achieve something like soft limits.  We are trying to get our cluster emulated on one of our workstations but we've been distracted by other things and haven't finished that yet.
Comment 15 Alejandro Sanchez 2017-07-11 07:48:09 MDT
Hi, do you need anything else from this bug? Thanks.
Comment 16 NASA JSC Aerolab 2017-07-11 07:49:43 MDT
Yes, still working on this.  Please keep it open.
Comment 17 Alejandro Sanchez 2017-08-01 03:05:10 MDT
Please reopen if anything more is required on this.
Comment 18 NASA JSC Aerolab 2018-07-20 08:50:11 MDT
Hello.  I'm reopening this bug as I think I have an approach for what we are trying to accomplish, but I could still use some help.  Kind of a lot has changed since I opened this bug.  We've been adding users from other groups and so, by necessity, we've become a lot more familiar with the options to limit users and resources.  Here is what I have in mind.  

We currently have 5 partitions set up - normal, idle, long, debug and twoday.  The normal and idle partitions are the two main ones and are really about job priority.  Almost all of our jobs run in normal.  We restrict most jobs to 8 hours or less as a means of naturally cycling jobs through the queue and keeping maintenance more accessible (replacing failed memory, etc.).  The long and twoday queues are restricted in various ways.

What I propose is creating a partition called hipri with a Priority 100x larger than normal.  This would also be attached to a corresponding hipri QOS that has MaxTRESPerUser set to some small fraction of the machine, probably CPU=768.  We would use our job_submit.lua to automatically place jobs submitted to normal into both normal and hipri.  The end result should be what we are trying to achieve - each user will get a small number of jobs that start above other jobs, then they effectively fall into another priority range.  Please let me know if this sounds reasonable or if you see any issues with that approach.
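For what it's worth, a minimal (untested) sketch of the job_submit.lua piece of that proposal could look like this; it assumes the partition names above and only rewrites jobs that asked solely for normal:

function slurm_job_submit(job_desc, part_list, submit_uid)
   -- if the job only requested the normal partition, also submit it to hipri
   if job_desc.partition == "normal" then
      job_desc.partition = "normal,hipri"
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end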

Secondly, it would be nice if this could be accomplished directly with QOS's, without the need for the extra partition.  I think this could work if a job could request multiple QOS's (i.e. --qos=normal,hipri) and we could also use job_submit.lua to automatically do this for the user.  But it doesn't look like you can currently do multiple QOS's in --qos?
Comment 19 Alejandro Sanchez 2018-07-23 06:00:29 MDT
(In reply to NASA JSC Aerolab from comment #18)
> Hello.  I'm reopening this bug as I think I have an approach for what we are
> trying to accomplish, but I could still use some help.  Kind of a lot has
> changed since I opened this bug.  We've been adding users from other groups
> and so, by necessity, we've become a lot more familiar with the options to
> limit users and resources.  Here is what I have in mind.  
> 
> We currently have 5 partitions setup - normal, idle, long, debug and twoday.
> The normal and idle partitions are the two main ones and are really about
> job priority.  Almost all of our jobs run in normal.  We restrict most jobs
> to 8 hours or less as a means of naturally cycling jobs through the queue
> and keeping maintenance more accessible (replacing failed memory, etc.). 
> The long and twoday queues are restricted in various ways.  
> 
> What I propose is creating a partition called hipri with a Priority 100x
> larger than normal.  This would also be attached to a corresponding hipri
> QOS that has MaxTRESPerUser set to some small fraction of the machine,
> probably CPU=768.  We would use our job_submit.lua to automatically place
> jobs submitted to normal into both normal and hipri.  The end result should
> be what we are trying to achieve - each user will get a small number of jobs
> that start above other jobs, then the effectively fall into another priority
> range.  Please let me know if this sounds reasonable or if you see any
> issues with that approach.  

Note that the partition Priority option might not be doing what you think it is [1]. If you want to create a higher priority partition, I'd increase the value of the partition's PriorityTier option instead of PriorityJobFactor, since PriorityJobFactor only contributes to one of the multiple factors for the Job Priority (point 4 from comment 11).

Your approach might work, although I think another approach would be to just create a higher priority, limited QOS that users are aware of and can submit jobs to when needed, knowing that it is limited. Once they hit the limit they can submit to the regular QOS. That avoids creating a higher priority partition attached to another QOS.
 
> Secondly, it would be nice if this could be accomplished directly with
> QOS's, without the need for the extra partition.  I think this could work if
> a job could request multiple QOS's (i.e. --qos=normal,hipri) and we could
> also use job_submit.lua to automatically do this for the user.  But it
> doesn't look like you can currently do multiple QOS's in --qos?

You can't submit to more than one QOS at present.

[1] https://slurm.schedmd.com/SLUG17/FieldNotes.pdf look for Partition Priority.
Comment 20 NASA JSC Aerolab 2018-07-23 09:46:44 MDT
It would be preferable to make this as transparent to the users as possible.  Also, the idea of users manually submitting to the high priority QOS for some jobs and the lower QOS for others is a no go.  This has to be something that happens automatically.  We'll try the partition approach and see how it goes.
Comment 21 NASA JSC Aerolab 2018-07-23 15:35:01 MDT
Created attachment 7376 [details]
hipri

I've tried to implement this but something isn't right.  I created the qos with:

sacctmgr create qos name=hipri
sacctmgr modify qos name=hipri set MaxTRESPerUser=CPU=768

Then added a corresponding hipri partition. 

PartitionName=hipri  Nodes=r1i[0-2]n[0-35] Priority=50000 State=UP Qos=hipri

That's not all the options for the partition, so see the attached slurm.conf for the full details.  I then had the user aschwing (Alan) manually add both the normal and hipri partitions to their job scripts.  The way the multifactor was working out prior to the hipri changes, only a few of his jobs (~300 cores worth) were running.  User ema was dominating the queue.  So I expected Alan to get 768 jobs running before the MaxTRESPerUser=CPU=768 limit kicked in and his jobs competed in the normal queue again.  This is not happening.  It's like the hipri TRES limit is not taking effect and Alan's jobs are starving all other jobs now.  This is easiest to see in the q.txt file I've attached, which is sorted by priority.  Any idea what's going on here?
Comment 22 Alejandro Sanchez 2018-07-24 04:17:53 MDT
User aschwing has these 5 jobs running in the hipri partition:

Job ID   Username Queue   Jobname              N:ppn Proc Wall  S Elap    Prio     Reason Features  
-------- -------- ------- -------------------- ----- ---- ----- - ----- ------ ---------- --------------
20497    aschwing hipri   m1.10a10_dx0.06_umb2  6:24  144 04:00 R 01:44   3941       None [BRO|sky] 
20514    aschwing hipri   m1.10a0b-10_dx0.06_u  6:24  144 04:00 R 00:09   3941       None [BRO|sky] 
20496    aschwing hipri   m1.20a-10_dx0.06_umb  6:24  144 04:00 R 01:50   3943       None [BRO|sky] 
20513    aschwing hipri   m3.00a20_dx0.25_umb1  5:32  160 04:00 R 00:11   3946       None [bro|SKY] 
20498    aschwing hipri   m0.50a-20_dx0.25_umb  6:24  144 04:00 R 00:24   3995       None [BRO|sky] 

They sum to a total of 144*4 + 160 = 736 Proc. There's another job submitted by aschwing to both normal,hipri which isn't running:

20515    aschwing normal,h m1.10a-7.11b-7.05_dx  5:28  144 04:00 Q 00:09   3946   Priority [bro|sky] 

since it requests 144 Proc, and the 736 already running plus 144 would total 880 Proc, which is greater than the hipri MaxTRESPerUser=cpu=768. That's why it falls back to the normal partition, but the normal partition doesn't have such a high priority, so this job is waiting with Reason Priority.

You mentioned that:

"So I expected Alan to get 768 jobs running before the MaxTRESPerUser=CPU=768 limit kicked in and his jobs competed in the normal queue again."

If you want to limit the number of jobs per user you should configure the MaxJobsPerUser limit instead.
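For example (the limit value here is illustrative only, mirroring the sacctmgr syntax you already used):

sacctmgr modify qos name=hipri set MaxJobsPerUser=5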

In any case, if in general terms what you want to accomplish is that no user dominates the cluster, what I'd do is give more weight to the PriorityWeightFairshare, increasing it a few orders of magnitude over the rest of the factors, and set PriorityFlags=FAIR_TREE (a rough sketch follows the links below).

https://slurm.schedmd.com/priority_multifactor.html#fairshare

and

https://slurm.schedmd.com/fair_tree.html
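As a rough illustration only (the exact number would need tuning against the other PriorityWeight* values in your slurm.conf):

PriorityFlags=FAIR_TREE
PriorityWeightFairshare=1000000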

Note also that, as I said in my previous comment, you have your partitions set with the deprecated Priority option. Perhaps you have it set on purpose, but I'd encourage you to read, if you haven't already, slide 12 onwards from this presentation:

https://slurm.schedmd.com/SLUG17/FieldNotes.pdf

which explains the difference between PriorityTier, PriorityJobFactor and (the obsolete) Priority.
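For illustration, your hipri definition rewritten with those options might look like this (the tier and factor values are only examples, not a recommendation for your site):

PartitionName=hipri Nodes=r1i[0-2]n[0-35] PriorityTier=2 PriorityJobFactor=50000 QOS=hipri State=UP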

Does it make sense?
Comment 23 NASA JSC Aerolab 2018-07-24 07:09:17 MDT
Sorry, I didn't mean 768 jobs, I meant 768 cores worth of jobs.  This is what we want.

I will look into using PriorityTier and PriorityJobFactor instead of just Priority.  But does this explain why these jobs are now starving all other jobs in the queue?  I was expecting this to work such that jobs submitted to both normal and hipri would run ahead of other jobs that are just in normal until 768 cores for that user are running.  Then a mix of users' jobs would run in normal, according to the multifactor weights we have set up.  But that is not happening.  User aschwing is getting all 2880 cores on this cluster.  Another aspect of this is confusing to me.  See below, but the list is sorted by job priority.  All the running jobs at the bottom of the list have a very high priority, which should correspond to getting into the hipri partition.  But most of those are listed as normal.  The jobs at the top are listed as hipri but have a low priority.  Why is that?  And why are those jobs at the top of the list (lowest priority in the queue) running before the jobs in the 4000's from ema and lhalstro?

Will using PriorityTier and PriorityJobFactor fix this?  


Job ID   Username Queue   Jobname              N:ppn Proc Wall  S Elap    Prio     Reason Features  
-------- -------- ------- -------------------- ----- ---- ----- - ----- ------ ---------- --------------
20685    aschwing hipri   m3.00a0b-20_dx0.25_u  6:24  144 04:00 R 00:57   3940       None [BRO|sky] 
20673    aschwing hipri   m3.00a-20_dx0.25_umb  5:32  160 04:00 R 01:36   3941       None [bro|SKY] 
20681    aschwing hipri   m4.00a20_dx0.25_umb1  6:24  144 04:00 R 01:05   3941       None [BRO|sky] 
20697    aschwing normal  m0.50a0b-20_dx0.25_u  5:28  144 04:00 Q 00:05   3943   Priority [bro|sky] 
20696    aschwing normal,h m0.50a0b-20_dx0.25_u  5:28  144 04:00 Q 00:07   3945   Priority [bro|sky] 
20687    aschwing hipri   m3.00a-20_dx0.25_umb  6:24  144 04:00 R 00:07   3973       None [BRO|sky] 
20494    lhalstro normal  m0.65a0.0_fso3_dcfpe 12:32  384 08:00 Q 18:04   4609   Priority [sky|bro] 
20482    lhalstro normal  m0.80a0.0_fso3_dcfpe 16:24  384 08:00 Q 18:52   4643   Priority [bro|sky] 
20490    ema      normal  m0.50a80r90_upwind_d  3:32   96 08:00 Q 18:22   4722   Priority [bro|sky] 
20481    ema      normal  m0.20a70r20_lowspeed  4:24   96 05:00 Q 19:25   4766   Priority [bro|sky] 
20479    ema      normal  m0.30a80r70_lowspeed  3:32   96 08:00 Q 19:45   4780   Priority [bro|sky] 
20480    ema      normal  m0.30a70r60_lowspeed  4:24   96 08:00 Q 19:45   4780   Priority [bro|sky] 
20478    ema      normal  m0.30a90r60_lowspeed  4:24   96 08:00 Q 19:46   4781   Priority [bro|sky] 
20477    ema      normal  m0.30a80r80_lowspeed  4:24   96 08:00 Q 19:49   4782   Priority [bro|sky] 
20476    ema      normal  m0.30a60r70_lowspeed  4:24   96 08:00 Q 19:53   4785   Priority [bro|sky] 
20431    lhalstro normal  m0.40a0.0_fso3_dcfpe 12:32  384 08:00 Q 22:18   4786   Priority [sky]     
20475    ema      normal  m0.30a70r80_lowspeed  4:24   96 08:00 Q 19:55   4786   Priority [bro|sky] 
20474    ema      normal  m0.30a90r80_lowspeed  4:24   96 08:00 Q 20:13   4799   Priority [bro|sky] 
20472    ema      normal  m0.30a60r60_lowspeed  4:24   96 08:00 Q 20:18   4802   Priority [bro|sky] 
20470    ema      normal  m0.30a80r50_lowspeed  3:32   96 08:00 Q 20:24   4806   Priority [bro|sky] 
20468    ema      normal  m0.50a70r30_upwind_d  4:24   96 08:00 Q 20:28   4809   Priority [bro|sky] 
20467    ema      normal  m0.50a70r60_upwind_d  4:24   96 08:00 Q 20:32   4812   Priority [bro|sky] 
20465    ema      normal  m0.30a60r10_lowspeed  3:32   96 05:00 Q 20:34   4813   Priority [bro|sky] 
20466    ema      normal  m0.30a60r20_lowspeed  4:24   96 08:00 Q 20:33   4813   Priority [bro|sky] 
20464    ema      normal  m0.50a60r80_upwind_d  4:24   96 08:00 Q 20:39   4817   Priority [bro|sky] 
20463    ema      normal  m0.30a90r50_lowspeed  4:24   96 08:00 Q 20:46   4822   Priority [bro|sky] 
20462    ema      normal  m0.50a60r60_upwind_d  3:32   96 08:00 Q 20:47   4823   Priority [bro|sky] 
20456    ema      normal  m0.50a70r90_upwind_d  4:24   96 08:00 Q 21:01   4832   Priority [bro|sky] 
20458    ema      normal  m0.30a70r50_lowspeed  3:32   96 08:00 Q 21:00   4832   Priority [bro|sky] 
20459    ema      normal  m0.50a80r70_upwind_d  3:32   96 08:00 Q 21:00   4832   Priority [bro|sky] 
20454    ema      normal  m0.20a80r60_lowspeed  4:24   96 08:00 Q 21:04   4834   Priority [bro|sky] 
20453    ema      normal  m0.20a90r70_lowspeed  4:24   96 08:00 Q 21:16   4842   Priority [bro|sky] 
20452    ema      normal  m0.50a90r70_upwind_d  4:24   96 08:00 Q 21:20   4846   Priority [bro|sky] 
20451    ema      normal  m0.20a70r60_lowspeed  4:24   96 08:00 Q 21:23   4848   Priority [bro|sky] 
20449    ema      normal  m0.50a80r80_upwind_d  3:32   96 08:00 Q 21:27   4850   Priority [bro|sky] 
20448    ema      normal  m0.20a90r40_lowspeed  3:32   96 05:00 Q 21:30   4852   Priority [bro|sky] 
20447    ema      normal  m0.50a70r80_upwind_d  4:24   96 08:00 Q 21:31   4853   Priority [bro|sky] 
20443    ema      normal  m0.50a70r70_upwind_d  3:32   96 08:00 Q 21:35   4856   Priority [bro|sky] 
20444    ema      normal  m0.20a90r60_lowspeed  3:32   96 08:00 Q 21:35   4856   Priority [bro|sky] 
20442    ema      normal  m0.20a60r10_lowspeed  3:32   96 05:00 Q 21:45   4863   Priority [bro|sky] 
20440    ema      normal  m0.50a80r20_upwind_d  3:32   96 08:00 Q 21:48   4865   Priority [bro|sky] 
20441    ema      normal  m0.50a60r70_upwind_d  3:32   96 08:00 Q 21:48   4865   Priority [bro|sky] 
20437    ema      normal  m0.20a80r70_lowspeed  3:32   96 08:00 Q 21:57   4871   Priority [bro|sky] 
20434    ema      normal  m0.20a70r10_lowspeed  4:24   96 05:00 Q 22:14   4883   Priority [bro|sky] 
20432    ema      normal  m0.20a80r40_lowspeed  3:32   96 05:00 Q 22:17   4885   Priority [bro|sky] 
20428    ema      normal  m0.20a80r20_lowspeed  4:24   96 05:00 Q 22:32   4896   Priority [bro|sky] 
20427    ema      normal  m0.50a80r60_upwind_d  4:24   96 08:00 Q 22:35   4898   Priority [bro|sky] 
20425    ema      normal  m0.20a70r40_lowspeed  4:24   96 05:00 Q 22:43   4903   Priority [bro|sky] 
20419    ema      normal  m0.30a80r60_lowspeed  3:32   96 08:00 R 00:05   4917       None [bro|SKY] 
20420    ema      normal  m0.30a70r70_lowspeed  3:32   96 08:00 Q 23:08   4920  Resources [bro|sky] 
20418    ema      normal  m0.30a90r70_lowspeed  3:32   96 08:00 R 00:05   4928       None [bro|SKY] 
20677    aschwing normal  m0.50a20_dx0.25_umb1  6:24  144 04:00 R 01:25  11940       None [BRO|sky] 
20678    aschwing normal  m0.50a0b-20_dx0.25_u  5:32  160 04:00 C 01:14  11940       None [bro|SKY] 
20679    aschwing normal  m0.50a0b-20_dx0.25_u  5:32  160 04:00 R 01:08  11940       None [bro|SKY] 
20680    aschwing normal  m4.00a20_dx0.25_umb1  6:24  144 04:00 R 01:08  11940       None [BRO|sky] 
20682    aschwing hipri   m4.00a-20_dx0.25_umb  6:24  144 04:00 R 01:05  11940       None [BRO|sky] 
20683    aschwing normal  m0.50a-20_dx0.25_umb  6:24  144 04:00 R 01:02  11940       None [BRO|sky] 
20686    aschwing normal  m4.00a0b-20_dx0.25_u  6:24  144 04:00 R 00:58  11940       None [BRO|sky] 
20688    aschwing normal  m4.00a0b-20_dx0.25_u  5:32  160 04:00 R 00:51  11940       None [bro|SKY] 
20689    aschwing normal  m0.50a20_dx0.25_umb3  6:24  144 04:00 R 00:48  11940       None [BRO|sky] 
20690    aschwing normal  m0.50a-20_dx0.25_umb  5:32  160 04:00 R 00:35  11940       None [bro|SKY] 
20691    aschwing normal  m4.00a-20_dx0.25_umb  6:24  144 04:00 R 00:34  11940       None [BRO|sky] 
20692    aschwing normal  m0.50a20_dx0.25_umb3  5:32  160 04:00 R 00:25  11940       None [bro|SKY] 
20693    aschwing normal  m0.50a-20_dx0.25_umb  6:24  144 04:00 R 00:23  11940       None [BRO|sky] 
20694    aschwing normal  m3.00a0b-20_dx0.25_u  5:32  160 04:00 R 00:19  11940       None [bro|SKY] 
20695    aschwing normal  m3.00a20_dx0.25_umb1  6:24  144 04:00 R 00:08  11940       None [BRO|sky]
Comment 24 NASA JSC Aerolab 2018-07-24 07:22:06 MDT
Sorry - a couple of other things I forgot to mention.

That job list in the last comment is from this morning.  Only aschwing is using hipri at the moment.  We want to verify that only 768 cores worth of jobs per user will run in hipri before using hipri more widely.  I'd like to get that fixed ASAP.  

Fairshare/fairtree doesn't do exactly what we want.  We want each user running on the system to have a very high priority and get a certain number of cores running before dropping to a lower priority.  Fairshare alone won't do that.  That's why we are trying to get these high and low priority partitions working with the enforced limits.
Comment 25 NASA JSC Aerolab 2018-07-24 08:46:38 MDT
I've read through the Partition Priority info in the field notes presentation.  That all makes sense and the different tiers are pretty much what we are after here.  As described in the doc, only setting the Priority value applies that value to both the tier and job factor parameters.  


PartitionName=normal
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=04:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=08:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=r1i[0-2]n[0-35]
   PriorityJobFactor=10000 PriorityTier=10000 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2880 TotalNodes=108 SelectTypeParameters=NONE
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED


PartitionName=hipri
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=hipri
   DefaultTime=04:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=08:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=r1i[0-2]n[0-35]
   PriorityJobFactor=50000 PriorityTier=50000 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2880 TotalNodes=108 SelectTypeParameters=NONE
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED


With the way we have our weights set:

PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityWeightFairshare=5000
PriorityWeightAge=1000
PriorityWeightPartition=10000
PriorityWeightJobSize=2000
PriorityWeightQOS=0
PriorityMaxAge=1-0
PriorityFavorSmall=YES

This should achieve what we are after - hipri jobs run first.  But the QOS limits are not being honored, which is what I'd like to fix.  

It's been my observation that the job that will run next has always been the one with Reason=Resources.  This usually (always?) is the pending job with the highest priority, which makes sense.  If you look at the latest job listing I sent, one of ema's jobs is the one listed with Resources.  But this job keeps getting starved.
Comment 26 Jason Booth 2018-07-25 09:01:12 MDT
Hi Darby,

Jess reached out to me and mentioned that you would like more frequent updates on this ticket. I have read over the ticket and believe that Alejandro has been very responsive to your requests. Alejandro is looking at this ticket, but it does take time to respond since the changes you are asking for are more site-specific and require some testing, so we ask for your patience while he works this issue.

I also wanted to point out that we actively try to meet or exceed our service level agreements. This issue is a Severity 4 issue (Minor Issue), so it is entitled to the following:

● Initial Response (during normal work hours) - As available
● Status Updates - As available
● Work Schedule - As available


Alejandro should be following up with you shortly with an update.

Best regards,
Jason 
Director of Support
Comment 27 Alejandro Sanchez 2018-07-25 09:27:41 MDT
Hi,

Jobs submitted to both the regular and highprio partitions will be ordered from highest PriorityTier to lowest, so the scheduler will first try to schedule the job through highprio and, if that's not possible (for instance because a limit has been reached), will try the regular partition. Example:

test@ibiza:~/t$ scontrol show part highprio | egrep "Priority|QoS"
   AllocNodes=ALL Default=NO QoS=highprio
   PriorityJobFactor=1000 PriorityTier=1000 RootOnly=NO ReqResv=NO OverSubscribe=NO
test@ibiza:~/t$ sacctmgr show qos format=name,maxtrespu
      Name     MaxTRESPU 
---------- ------------- 
    normal               
  highprio         cpu=2 
test@ibiza:~/t$ sbatch -p regular,highprio -c2 --wrap "sleep 9999"
Submitted batch job 20034
test@ibiza:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20034  highprio     wrap     test  R       0:01      1 compute1
test@ibiza:~/t$ sbatch -p regular,highprio -c2 --wrap "sleep 9999"
Submitted batch job 20035
test@ibiza:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20035   regular     wrap     test  R       0:00      1 compute1
             20034  highprio     wrap     test  R       0:03      1 compute1
test@ibiza:~/t$ sbatch -p regular,highprio -c2 --wrap "sleep 9999"
Submitted batch job 20036
test@ibiza:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20036   regular     wrap     test  R       0:01      1 compute1
             20035   regular     wrap     test  R       0:06      1 compute1
             20034  highprio     wrap     test  R       0:09      1 compute1
test@ibiza:~/t$

Note job 20034 is scheduled in the highprio partition, so the 2-cpu max limit is reached, and the subsequent jobs 20035 and 20036, also submitted to both partitions, are scheduled but run under the regular partition because highprio has already reached the configured limit.

I'd also encourage you to set AccountingStorageEnforce to 'safe,qos' instead of your current 'limits'.

(In reply to NASA JSC Aerolab from comment #23)
> Sorry, I didn't mean 768 jobs, I meant 768 cores worth of jobs.  This is
> want we want.  

Ok, no problem.
 
> I will look into using PriorityTier and PriorityJobFactor instead of just
> Priority.  But does this explain why these jobs are now starving all other
> jobs in the queue?  I was expecting this to work such that jobs that were
> submitted to both normal and hipri will run over other jobs that are just in
> normal until 768 cores for that user are running. Then a mix of users jobs
> will run in normal, according to the multifactor weights we have set up. 
> But that is not happening.  User aschwing is getting All 2880 cores on this
> cluster.  

aschwing's jobs may sum to 2880 cores in total, but the subset of jobs running under the hipri partition shouldn't sum to more than 768 cores, which is the configured limit. Grepping your list of jobs for hipri, I see this:

alex@ibiza:~/t$ cat t | grep hip
20685    aschwing hipri   m3.00a0b-20_dx0.25_u  6:24  144 04:00 R 00:57   3940       None [BRO|sky] 
20673    aschwing hipri   m3.00a-20_dx0.25_umb  5:32  160 04:00 R 01:36   3941       None [bro|SKY] 
20681    aschwing hipri   m4.00a20_dx0.25_umb1  6:24  144 04:00 R 01:05   3941       None [BRO|sky] 
20687    aschwing hipri   m3.00a-20_dx0.25_umb  6:24  144 04:00 R 00:07   3973       None [BRO|sky] 
20682    aschwing hipri   m4.00a-20_dx0.25_umb  6:24  144 04:00 R 01:05  11940       None [BRO|sky] 
alex@ibiza:~/t$

144*4 + 160 = 736 < MaxTresPerUser=cpu=768. The rest of aschwing's running jobs are under the normal partition, which isn't cpu constrained as far as I know. So this all seems consistent to me.

> Another aspect of this is confusing to me.  See below, but the
> list is sorted by job priority.  All the running jobs at the bottom of the
> list have a very high priority, that should correspond to getting into the
> hipri partition.  But most of those are listed as normal.  The jobs at the
> top are listed as hipri but have a low priority.  Why is that?  And why are
> those jobs at the top of the list (lowest priority in the queue) running
> before the jobs in the 4000's from ema and lhalstro?  

Can you please attach the output of

$ squeue -O jobid,partition,prioritylong,username,state,starttime,schednodes,nodelist,reason --sort=S

and

$ sprio -l

I'm not sure which utility you're using to report the jobs.
Comment 28 NASA JSC Aerolab 2018-07-25 10:11:32 MDT
I've changed AccountingStorageEnforce=safe,qos and restarted slurmctld.  

I'll attach these files:

[root@europa ~]# squeue -O jobid,partition,prioritylong,username,state,starttime,schednodes,nodelist,reason --sort=S > squeue.txt
[root@europa ~]# sprio -l > sprio.txt

I'm not sure how useful that is right now though, since we removed the hipri partition from a bunch of aschwing's jobs.  We had to do this because everyone was getting starved.

The output I sent earlier is from a script we wrote that combines data from several commands into a qstat-like output.  But it's just getting info from squeue and scontrol.

I agree that, strictly speaking, the limits are being enforced properly and each user is getting only 768 cores or less of jobs running in hipri.  But something else is happening to the scheduling that is causing jobs submitted to both partitions to be favored.  Before we created the hipri partition, ema's jobs were getting most of the cluster (~75%) and the other two users were getting the remaining cores.  After creating the hipri partition and having aschwing use it for all his jobs (i.e. nobody else using hipri), his jobs starved all others in the queue.  That's the part that doesn't make sense to me.  Once he reached the 768 cpu limit, his jobs should compete in normal with all the others and there should be a mix of jobs from all users.  But instead, even his normal jobs ran ahead of all other jobs in the queue.  The "Prio" column from comment 23 is just the Priority field that "scontrol show job" displays.  So I don't understand why aschwing's lower priority jobs are starting over higher priority jobs from other users.  Can you explain that?  The output in comment 23 is from when all of aschwing's jobs were using both normal and hipri.
Comment 29 NASA JSC Aerolab 2018-07-25 10:12:19 MDT
Created attachment 7404 [details]
sprio output
Comment 30 NASA JSC Aerolab 2018-07-25 10:13:04 MDT
Created attachment 7405 [details]
squeue output
Comment 31 Alejandro Sanchez 2018-07-26 05:16:31 MDT
The backfill scheduler builds an ordered queue of (job, partition) pairs sorted as follows:

1. Job can preempt
2. Job with an advanced reservation
3. Job partition PriorityTier
4. Job priority (sum of multifactor plugin terms)
5. Job submission time
6. JobId
7. ArrayTaskId

For instance

User 'test2' submits a job '20083' only to partition 'normal', so this job will have only one entry in the list the scheduler builds:

(20083, test2, normal)

User 'test' submits job '20086' to both the 'highprio' and 'normal' partitions, so it will have two entries in the list:

(20086, test, highprio)
(20086, test, normal)

When the scheduler sorts the list, it will place the highprio entry first, because it has a higher PriorityTier than the rest (even if it has a lower Job Priority):

(20086, test, highprio)

and then the scheduler needs to know which entry goes next: (20083, test2, normal) or (20086, test, normal). Since both of them have the same PriorityTier, the scheduler decides based on the Job Priority (from multifactor). sprio -l can be used to show how each of these factors contributes to the Job Priority (point 4 above):

alex@ibiza:~/t$ sprio
          JOBID PARTITION   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION
          20083 normal          4498         47       1076       1375       2000
          20086 highprio       11513         47         92       1375      10000
          20086 normal          3513         47         92       1375       2000
alex@ibiza:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20083    normal     wrap    test2 PD       0:00      1 (Resources)
             20086 highprio,     wrap     test PD       0:00      1 (Priority)
             20076    normal     wrap     test  R    1:12:15      1 compute1
             20074    normal     wrap     test  R    1:12:47      1 compute1
             20073  highprio     wrap     test  R    1:15:04      1 compute1
             20079    normal     wrap    test2  R    1:11:59      1 compute2
             20077    normal     wrap    test2  R    1:12:10      1 compute2
             20075    normal     wrap    test2  R    1:12:21      1 compute1
             20081    normal     wrap    test2  R    1:08:46      1 compute2
             20082    normal     wrap    test2  R    1:02:44      1 compute2

Since (20083, test2, normal) has a Job Priority of 4498, which is higher than the 3513 of (20086, test, normal), the former will be next in the sorted scheduler queue, which will finally be:

(20086, test, highprio)
(20083, test2, normal)
(20086, test, normal)

Now if I enable DebugFlags=Backfill (scontrol setdebugflags +backfill), I can see backfill attempts to schedule the jobs in this order:

slurmctld: backfill: beginning
slurmctld: backfill test for JobID=20086 Prio=11518 Partition=highprio
slurmctld: backfill test for JobID=20083 Prio=4494 Partition=normal
slurmctld: backfill test for JobID=20086 Prio=3518 Partition=normal

Currently, there are no resources available so none of them can start. If I scancel the running job 20074:

alex@ibiza:~/t$ scancel 20074
alex@ibiza:~/t$

let's see what backfill does now:

slurmctld: backfill: beginning
slurmctld: debug:  backfill: 3 jobs to backfill
slurmctld: backfill test for JobID=20086 Prio=11522 Partition=highprio
(it first tries (20086, test, highprio))
slurmctld: debug2: job 20086 being held, if allowed the job request will exceed QOS highprio max tres(cpu) per user limit 2 with already used 2 + requested 2
slurmctld: backfill: adding reservation for job 20086 blocked by acct_policy_job_runnable_post_select
(it can't, because I have MaxTRESPerUser=CPU=2 configured in my highprio QoS):
      Name     MaxTRESPU 
---------- ------------- 
    normal               
  highprio         cpu=2 
and user 'test' already has job 20073 running under highprio consuming 2 cpus, so backfill continues to the next entry in the sorted list:

slurmctld: backfill test for JobID=20083 Prio=4489 Partition=normal
slurmctld: Job 20083 to start at 2018-07-26T12:54:47, end at 2019-07-26T12:54:00 on compute1
So the second entry in the sorted list, (20083, test2, normal), starts: competing with the (20086, test, normal) entry, it has more priority.

alex@ibiza:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20086 highprio,     wrap     test PD       0:00      1 (Resources)
             20076    normal     wrap     test  R    1:25:22      1 compute1
             20073  highprio     wrap     test  R    1:28:11      1 compute1
             20079    normal     wrap    test2  R    1:25:06      1 compute2
             20077    normal     wrap    test2  R    1:25:17      1 compute2
             20075    normal     wrap    test2  R    1:25:28      1 compute1
             20081    normal     wrap    test2  R    1:21:53      1 compute2
             20082    normal     wrap    test2  R    1:15:51      1 compute2
             20083    normal     wrap    test2  R       6:23      1 compute1
alex@ibiza:~/t$ sprio
          JOBID PARTITION   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION
          20086 highprio       11526         57         94       1375      10000
          20086 normal          3526         57         94       1375       2000
alex@ibiza:~/t$


As you can see, even though user 'test' submitted job 20086 to both the highprio and normal partitions, the 'highprio' entry couldn't be scheduled due to limits, so the 'normal' entry competed with job 20083 from user 'test2' (which was submitted only to the normal partition), and job 20086 has _not_ starved job 20083. Since both entries have the same PriorityTier, the scheduler checked the next stage in the precedence order, the Job Priority, and job 20083 had a higher Job Priority (4498) than 20086's normal entry (3513), so job 20083 was started first.

Does it make sense?
Comment 32 NASA JSC Aerolab 2018-07-26 10:17:04 MDT
While I appreciate the detailed explanation of what's going on, I'm having a hard time absorbing all that and understanding why it translates into the behavior we are seeing.  

I'm about to start logging the following information on our system every 10 minutes:

~dvicker/bin/q -pn > q.txt.$date.$time
/software/x86_64/bin/nodeinfo.pl > nodeinfo.out.$date.$time
sprio -l > sprio.out.$date.$time
scontrol -a show job > scontrol_show_jobs.out.$date.$time
squeue -a > squeue.out.$date.$time
sinfo -a -N -o "%.20n %.15C %.10t %.10e %.15P %.15f" > sinfo.out.$date.$time
sdiag > sdiag.out.$date.$time

I've also done this: "scontrol setdebugflags +backfill"

I intend to demonstrate how when a single user starts using hipri for all their jobs, it starves all other jobs in the system.  Please let me know if there are other commands you'd like me to log.  

You understand what we are trying to accomplish, right?  Can you please help me determine the right scheduler configuration to achieve this?
Comment 33 NASA JSC Aerolab 2018-07-26 11:13:21 MDT
Here is the baseline.  We removed hipri from everyone's batch scripts and let the queues run long enough that Alan's (aschwing) jobs are queued in normal.  There are still some of lhalstro's jobs running in hipri but those will resubmit to normal only.  We have now added normal and hipri back to all of Alan's batch scripts.  I'm expecting all of Alan's jobs to start.  What we want to happen is for a few of them to start (<= 768 cores) and then a mix of jobs to run after that.

[root@europa slurm_data]# date
Thu Jul 26 11:56:43 CDT 2018
[root@europa slurm_data]# q -p
Job ID   Username Queue   Jobname              N:ppn Proc Wall  S Elap    Prio     Reason Features  
-------- -------- ------- -------------------- ----- ---- ----- - ----- ------ ---------- --------------
21203    aschwing normal  m0.50a0b-20_dx0.25_u  5:28  144 04:00 Q 00:10   3947   Priority [bro|sky] 
21202    aschwing normal  m0.50a0b-20_dx0.25_u  5:28  144 04:00 Q 00:22   3955   Priority [bro|sky] 
21201    aschwing normal  m3.00a20_dx0.25_umb3  6:24  144 04:00 Q 00:25   3958   Priority [bro|sky] 
21181    aschwing normal  m0.50a-20_dx0.25_umb  5:32  160 04:00 C 01:57   4019       None [bro|SKY] 
21191    aschwing normal  m0.50a20_dx0.25_umb3  6:24  144 04:00 Q 02:01   4024   Priority [bro|sky] 
21192    aschwing normal  m4.00a20_dx0.25_umb1  6:24  144 04:00 Q 02:00   4024   Priority [bro|sky] 
21188    aschwing normal  m3.00a0b-20_dx0.25_u  6:24  144 04:00 R 00:10   4032       None [BRO|sky] 
21190    aschwing normal  m3.00a-20_dx0.25_umb  6:24  144 04:00 Q 02:11   4032  Resources [bro|sky] 
21173    aschwing normal  m3.00a-20_dx0.25_umb  6:24  144 04:00 C 02:03   4083       None [BRO|sky] 
21182    aschwing normal  m0.50a0b-20_dx0.25_u  5:32  160 04:00 R 00:21   4083       None [bro|SKY] 
21197    stuart   normal  r04                   8:24  192 08:00 R 00:25   4109       None [sky|BRO] 
21198    stuart   normal  r06                   8:24  192 08:00 R 00:22   4112       None [sky|BRO] 
21199    stuart   normal  r24                   6:32  192 08:00 R 00:22   4112       None [SKY|bro] 
21183    lhalstro normal  m0.40a0.0_fso3_dcfpe 12:32  384 08:00 R 03:47  11857       None [SKY]     
21187    lhalstro hipri   m0.80a0.0_fso3_dcfpe 16:24  384 08:00 R 02:34  11857       None [BRO|sky] 
21189    lhalstro hipri   m0.65a0.0_fso3_dcfpe 16:24  384 08:00 R 02:14  11857       None [sky|BRO] 
21195    aschwing hipri   m4.00a10_dx0.18_umb3  6:24  144 05:00 R 01:27  11940       None [BRO|sky] 
21196    aschwing hipri   m4.00a7.11b-7.05_dx0  5:32  160 05:00 R 01:20  11940       None [bro|SKY] 
21200    aschwing hipri   m3.00a20_dx0.25_umb3  6:24  144 04:00 R 01:03  11940       None [BRO|sky] 
21204    aschwing hipri   m0.50a-20_dx0.25_umb  5:32  160 04:00 R 00:02  11940       None [bro|SKY] 
21205    aschwing hipri   m3.00a-20_dx0.25_umb  6:24  144 04:00 R 00:02  11940       None [BRO|sky] 

Stats:
 total (108/2880) ||    bro ( 72/1728) |    sky ( 36/1152) 
S Node   CPU  Job || S Node   CPU  Job | S Node   CPU  Job 
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ---- 
C   11   304    2 || C    6   144    1 | C    5   160    1 
Q   60  1728    6 || Q   36   864    6 | Q   24   864    6 
R  105  2784   13 || R   72  1728    8 | R   33  1056    5 
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ---- 
C  10%   10%      || C   8%    0%      | C  13%    0%      
Q  55%   60%      || Q  50%    2%      | Q  66%    2%      
R  97%   96%      || R 100%    4%      | R  91%    2%      
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ---- 
F    3    96      || F    0     0      | F    3    96      

User Stats:
            ------------ cores ------------  ------------ jobs  ------------
      User  # running  # pending      total  # running  # pending      total
  aschwing       1056        864       2224          7          6         15
  lhalstro       1152          0       1152          3          0          3
    stuart        576          0        576          3          0          3
---------- ---------- ---------- ---------- ---------- ---------- ----------
     total       2784        864       3952
Comment 34 NASA JSC Aerolab 2018-07-26 13:49:53 MDT
This isn't the best example of the problem (all jobs are running).  But it does show that a couple of Alan's jobs that are running in normal have a Priority as if they are running in hipri.  This seems like the root of the problem to me.  Let me know if you want me to upload the slurmctld log file or the other files I mentioned above.  

[dvicker@europa run]% date
Thu Jul 26 14:45:05 CDT 2018
[dvicker@europa run]% qp
Job ID   Username Queue   Jobname              N:ppn Proc Wall  S Elap    Prio     Reason Features  
-------- -------- ------- -------------------- ----- ---- ----- - ----- ------ ---------- --------------
21210    lhalstro normal  m0.80a0.0_fso3_dcfpe 12:32  384 08:00 R 00:45   3884       None [bro|SKY] 
21213    lhalstro normal  m0.65a0.0_fso3_dcfpe 12:32  384 08:00 R 00:04   3896       None [SKY|bro] 
21212    aschwing normal  m0.50a0b-20_dx0.25_u  6:24  144 04:00 R 01:03   3940       None [BRO|sky] 
21201    aschwing normal  m3.00a20_dx0.25_umb3  6:24  144 04:00 R 01:26   4010       None [BRO|sky] 
21203    aschwing normal  m0.50a0b-20_dx0.25_u  5:32  160 04:00 R 01:07   4013       None [bro|SKY] 
21202    aschwing normal  m0.50a0b-20_dx0.25_u  6:24  144 04:00 R 01:11   4018       None [BRO|sky] 
21197    stuart   normal  r04                   8:24  192 08:00 R 03:08   4109       None [sky|BRO] 
21198    stuart   normal  r06                   8:24  192 08:00 R 03:05   4112       None [sky|BRO] 
21199    stuart   normal  r24                   6:32  192 08:00 C 03:00   4112       None [SKY|bro] 
21208    aschwing hipri   m3.00a20_dx0.25_umb3  6:24  144 04:00 R 01:41  11940       None [BRO|sky] 
21211    aschwing normal  m3.00a0b-20_dx0.25_u  6:24  144 04:00 R 01:13  11940       None [BRO|sky] 
21214    aschwing normal  m3.00a-20_dx0.25_umb  6:24  144 04:00 R 01:00  11940       None [BRO|sky] 
21215    aschwing hipri   m4.00a10_dx0.18_umb3  6:24  144 05:00 R 00:49  11940       None [BRO|sky] 
21216    aschwing hipri   m0.50a-20_dx0.25_umb  6:24  144 04:00 R 00:45  11940       None [BRO|sky] 
21217    aschwing hipri   m3.00a-20_dx0.25_umb  6:24  144 04:00 R 00:45  11940       None [BRO|sky] 
21218    aschwing hipri   m0.50a20_dx0.25_umb3  5:32  160 04:00 R 00:40  11940       None [bro|SKY] 

Stats:
 total (108/2880) ||    bro ( 72/1728) |    sky ( 36/1152) 
S Node   CPU  Job || S Node   CPU  Job | S Node   CPU  Job 
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ---- 
C    6   192    1 || C    0     0    0 | C    6   192    1 
R  104  2768   15 || R   70  1680   11 | R   34  1088    4 
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ---- 
C   5%    6%      || C   0%    0%      | C  16%    0%      
R  96%   96%      || R  97%    4%      | R  94%    2%      
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ---- 
F    4   112      || F    2    48      | F    2    64      

User Stats:
            ------------ cores ------------  ------------ jobs  ------------
      User  # running  # pending      total  # running  # pending      total
  aschwing       1616          0       1616         11          0         11
  lhalstro        768          0        768          2          0          2
    stuart        384          0        576          2          0          3
---------- ---------- ---------- ---------- ---------- ---------- ----------
     total       2768          0       2960         15          0         16
[dvicker@europa run]%
Comment 35 NASA JSC Aerolab 2018-07-26 15:38:41 MDT
Current state below.  

[dvicker@europa run]% date
Thu Jul 26 16:23:45 CDT 2018
[dvicker@europa run]% qp
Job ID   Username Queue   Jobname              N:ppn Proc Wall  S Elap    Prio     Reason Features  
-------- -------- ------- -------------------- ----- ---- ----- - ----- ------ ---------- --------------
21210    lhalstro normal  m0.80a0.0_fso3_dcfpe 12:32  384 08:00 R 02:24   3884       None [bro|SKY] 
21213    lhalstro normal  m0.65a0.0_fso3_dcfpe 12:32  384 08:00 R 01:43   3896       None [SKY|bro] 
21247    aschwing normal,h m0.50a-10_dx0.25_umb  6:24  144 05:00 Q 00:13   3948   Priority [bro|sky] 
21245    aschwing normal,h m3.00a-20_dx0.25_umb  6:24  144 04:00 Q 00:17   3951   Priority [bro|sky] 
21243    aschwing normal,h m1.20a0b-10_dx0.25_u  6:24  144 05:00 Q 00:20   3954   Priority [bro|sky] 
21241    aschwing normal,h m0.50a20_dx0.25_umb3  6:24  144 04:00 Q 00:22   3955   Priority [bro|sky] 
21224    aschwing hipri   m0.50a0b-10_dx0.25_u  6:24  144 05:00 R 00:50   3962       None [BRO|sky] 
21235    aschwing normal,h m4.00a10_dx0.18_umb3  6:24  144 05:00 Q 00:50   3974   Priority [bro|sky] 
21225    aschwing hipri   m0.50a10_dx0.25_umb0  5:32  160 05:00 R 00:22   3980       None [bro|SKY] 
21226    aschwing hipri   m1.10a-10_dx0.25_umb  6:24  144 05:00 R 00:20   3983       None [BRO|sky] 
21227    aschwing hipri   m1.10a0b-10_dx0.25_u  6:24  144 05:00 R 00:17   3983       None [BRO|sky] 
21228    aschwing hipri   m1.10a10_dx0.25_umb0  6:24  144 05:00 R 00:13   3986       None [BRO|sky] 
21232    aschwing normal,h m3.00a20_dx0.25_umb3  6:24  144 04:00 Q 01:10   3988   Priority [bro|sky] 
21229    aschwing normal,h m1.10a-10_dx0.25_umb  6:24  144 05:00 Q 01:23   3997   Priority [bro|sky] 
21230    aschwing normal,h m1.10a0b-10_dx0.25_u  6:24  144 05:00 Q 01:23   3997   Priority [bro|sky] 
21246    stuart   normal  r07                   8:24  192 08:00 Q 00:17   4027   Priority [bro]     
21244    stuart   normal  r06                   6:32  192 08:00 Q 00:19   4029   Priority [sky|bro] 
21242    stuart   normal  r04r                  8:24  192 00:05 Q 00:21   4030  Resources [bro]     
21219    aschwing normal  m3.00a20_dx0.25_umb3  6:24  144 04:00 R 01:24  11940       None [BRO|sky] 
21231    aschwing normal  m3.00a0b-20_dx0.25_u  6:24  144 04:00 R 01:11  11940       None [BRO|sky] 
21234    aschwing normal  m3.00a-20_dx0.25_umb  6:24  144 04:00 R 00:59  11940       None [BRO|sky] 
21236    aschwing normal  m0.50a0b-20_dx0.25_u  5:32  160 04:00 R 00:45  11940       None [bro|SKY] 
21237    aschwing normal  m1.20a-10_dx0.25_umb  6:24  144 05:00 R 00:40  11940       None [BRO|sky] 
21238    aschwing normal  m0.50a-10_dx0.25_umb  6:24  144 05:00 R 00:39  11940       None [BRO|sky] 
21239    aschwing normal  m0.50a10_dx0.25_umb0  6:24  144 05:00 R 00:38  11940       None [BRO|sky] 
21240    aschwing normal  m0.50a0b-10_dx0.25_u  6:24  144 05:00 R 00:37  11940       None [BRO|sky] 
21248    aschwing normal  m1.10a10_dx0.25_umb0  6:24  144 05:00 R 00:05  11940       None [BRO|sky] 

Stats:
 total (108/2880) ||    bro ( 72/1728) |    sky ( 36/1152) 
S Node   CPU  Job || S Node   CPU  Job | S Node   CPU  Job 
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ---- 
Q  110  3072   11 || Q   72  1728   11 | Q   38  1344    9 
R  106  2816   16 || R   72  1728   12 | R   34  1088    4 
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ---- 
Q 101%  106%      || Q 100%    4%      | Q 105%    3%      
R  98%   97%      || R 100%    4%      | R  94%    2%      
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ---- 
F    2    64      || F    0     0      | F    2    64      

User Stats:
            ------------ cores ------------  ------------ jobs  ------------
      User  # running  # pending      total  # running  # pending      total
  aschwing       2048       1152       3200         14          8         22
  lhalstro        768          0        768          2          0          2
    stuart          0        576        576          0          3          3
---------- ---------- ---------- ---------- ---------- ---------- ----------
     total       2816       1728       4544         16         11         27
[dvicker@europa run]% 



Again, I think the problem is that Alan's normal jobs all have a high priority, but his hipri jobs have a low priority.  From sprio:


[dvicker@europa run]% sprio -l
          JOBID PARTITION     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS        NICE                 TRES
          21229 hipri     aschwing      11997         57          0       1941      10000          0           0                     
          21229 normal    aschwing       3997         57          0       1941       2000          0           0                     
          21230 hipri     aschwing      11997         57          0       1941      10000          0           0                     
          21230 normal    aschwing       3997         57          0       1941       2000          0           0                     
          21232 hipri     aschwing      11988         48          0       1941      10000          0           0                     
          21232 normal    aschwing       3988         48          0       1941       2000          0           0                     
          21235 hipri     aschwing      11974         34          0       1941      10000          0           0                     
          21235 normal    aschwing       3974         34          0       1941       2000          0           0                     
          21241 hipri     aschwing      11955         15          0       1941      10000          0           0                     
          21241 normal    aschwing       3955         15          0       1941       2000          0           0                     
          21242 normal      stuart       4030         14         93       1924       2000          0           0                     
          21243 hipri     aschwing      11954         13          0       1941      10000          0           0                     
          21243 normal    aschwing       3954         13          0       1941       2000          0           0                     
          21244 normal      stuart       4029         13         93       1924       2000          0           0                     
          21245 hipri     aschwing      11951         11          0       1941      10000          0           0                     
          21245 normal    aschwing       3951         11          0       1941       2000          0           0                     
          21246 normal      stuart       4027         11         93       1924       2000          0           0                     
          21247 hipri     aschwing      11948          8          0       1941      10000          0           0                     
          21247 normal    aschwing       3948          8          0       1941       2000          0           0                     
[dvicker@europa run]% 


You can see that all of Alan's pending jobs are listed twice, once for each of the partitions.  The priorities listed there look right: hipri has the high priority and normal has the low priority.  But this is reversed in the actual running jobs.  For example:

[root@europa slurm_data]# scontrol show job 21248 | grep -e Priority -e Part
   Priority=11940 Nice=0 Account=aerolab QOS=normal
   Partition=normal AllocNode:Sid=r1i0n30:24340
[root@europa slurm_data]# 


This job is running in the normal queue (and Qos=normal), but it got the hipri Priority value.  This sure seems like a bug to me.  Please help me understand if I'm wrong.  

I'm going to upload the diagnostic files from today too, including slurmctld_log with "scontrol setdebugflags +backfill".
Comment 36 NASA JSC Aerolab 2018-07-26 15:39:47 MDT
Created attachment 7432 [details]
Diagnostic files
Comment 37 Alejandro Sanchez 2018-07-27 08:13:41 MDT
I think I see what the problem is. I believe this is not a scheduling problem but a limitation in how squeue and/or scontrol show job display a job's priority, which I understand might be causing you some confusion.

A job's record has these two members:

        uint32_t priority;              /* relative priority of the job,
                                         * zero == held (don't initiate) */
        uint32_t *priority_array;       /* partition based priority */

The sprio command disaggregates a job's priority, showing each of the values stored in priority_array, one per partition. squeue and scontrol show job, on the other hand, display Priority based on the current value of the priority member and do not disaggregate by partition. Because that value fluctuates as the scheduler considers the job in one partition or the other, the priority reported by squeue or scontrol depends on the moment you query and may correspond to either partition.
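
For example, using one of the pending job IDs from your output above (the squeue format string here is just an illustration):

# one row per partition for a pending job
sprio -l -j 21247

# a single Priority value, taken from whichever partition is currently being considered
squeue -j 21247 -o "%.8i %.10u %.10P %.10Q %.12r"

The first command should show both a hipri row and a normal row for the job, while the second prints a single priority value that can correspond to either partition depending on when you run it.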

I'm gonna see what I find and come back to you.
Comment 43 NASA JSC Aerolab 2018-07-30 09:52:28 MDT
Any updates on this today?  We've continued to log info over the weekend if you want more data.  I'm not convinced this is just a reporting issue, since the entire cluster will drain in preference for one person's jobs if they are the only person using hipri.  But I'm anxious to hear what you find out.
Comment 44 Jason Booth 2018-07-30 12:01:34 MDT
Greetings,

 Alejandro is currently not in the office this week so I have asked Felip to look over your latest update and respond.

Best regards,
Jason
Comment 45 Felip Moll 2018-07-31 11:16:56 MDT
Hi,

This is a long thread, so I am taking some time to get up to speed on everything.

What Alex wrote in comment 37 is correct. When a job is submitted to multiple partitions and you query its information, Slurm internally reports the data for whichever partition is currently being considered for scheduling, so the displayed priority may not match the displayed partition.

In Slurm 18.08 this is addressed by creating an array of priorities that matches the list of partitions.

To summarize, the priorities you see may be wrong and not match the partition that's being shown.


I am still reading through everything. If I am not misunderstanding, you are still pursuing your initial goal of emulating Maui's soft limits, i.e. you want:

A user's first jobs, up to 768 cores in use, get high priority; any further jobs get low priority.
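
For reference, one way to express that in Slurm terms is roughly the sketch below. It is only an illustration of the idea, not your exact configuration: the QOS name, the 768-core cap, the priority factors and Nodes=ALL are placeholders to tune for your site, and enforcing the QOS cap assumes AccountingStorageEnforce includes "limits".

# per-user cap on how many cores may run at once through the high-priority path
sacctmgr add qos hipri
sacctmgr modify qos hipri set MaxTRESPerUser=cpu=768

# slurm.conf: overlapping partitions on the same nodes, hipri weighted higher
# (PriorityWeightPartition must be non-zero for PriorityJobFactor to take effect)
PartitionName=normal Nodes=ALL Default=YES PriorityJobFactor=2000
PartitionName=hipri  Nodes=ALL QOS=hipri   PriorityJobFactor=10000

Users then submit each job to both partitions (e.g. sbatch -p hipri,normal ...), so roughly the first 768 cores of a user's work run through hipri at high priority and anything beyond that falls back to normal at low priority.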



I am still analyzing your comments 33, 34, 35 and 36, and will come back as soon as I have relevant feedback.
Comment 46 NASA JSC Aerolab 2018-07-31 12:49:22 MDT
Thanks for the update, Felip.  A couple of comments:

> To summarize, the priorities you see may be wrong and not match the
> partition that's being shown.

I still think this is deeper than just a reporting issue, because of the following:

- several jobs submitted from various users using only normal
- many jobs submitted from a single user using both normal and hipri
- all of the normal+hipri jobs will run ahead of the jobs using only normal

This shouldn't happen.  Once 768 cores of that user's hipri jobs are running, scheduling should go back to a mix of normal jobs from various users (like we get when everyone uses only normal).  That isn't happening.
  
> I am still reading through everything. If I am not misunderstanding, you
> are still pursuing your initial goal of emulating Maui's soft limits, i.e.
> you want:
> 
> A user's first jobs, up to 768 cores in use, get high priority; any further
> jobs get low priority.

That is correct.
Comment 47 Felip Moll 2018-08-02 05:51:46 MDT
> I still think this is deeper than just a reporting issue, because of the
> following:

I see what you are saying. I will try to reproduce it.

By the way, you are on 17.11.5, right?
Comment 48 NASA JSC Aerolab 2018-08-02 07:54:54 MDT
Thanks.  That is correct, we are on 17.11.5.
Comment 49 Alejandro Sanchez 2018-08-06 05:29:08 MDT
I've been able to reproduce this and can confirm it isn't just a display issue but a scheduling issue.
Comment 51 NASA JSC Aerolab 2018-08-07 12:01:13 MDT
Thanks for the confirmation.  Do you have an estimate of when you might have a fix?  Part of the reason I ask is that the person who has been testing this on our end is leaving soon (last day is Thursday).  It would be really nice to test a fix before that happens.
Comment 52 Alejandro Sanchez 2018-08-08 02:48:09 MDT
Created attachment 7535 [details]
patch not reviewed

In my local tests, this patch fixes the scheduling stall issue, although it is still pending review. I asked Moe to review it, but we're a bit overloaded these days with the 18.08 release candidate. Feel free to try it ahead of the review, or wait for Moe's opinion on it.
Comment 55 NASA JSC Aerolab 2018-08-08 09:38:21 MDT
Excellent - thanks.  I'll give this a try and let you know how it works for us.
Comment 56 NASA JSC Aerolab 2018-08-09 07:10:05 MDT
I upgraded to 17.11.8 with the patch you supplied yesterday morning.  

Should that patch have fixed the priority values displayed by "scontrol show job"?  If so, we are still not seeing that.  See below for the output.  Note that any job running in hipri should have a priority value at or above 10000, while the jobs running in the normal queue should be ~4000.



Job ID   Username Queue   Jobname              N:ppn Proc Wall  S Elap    Prio     Reason Features  
-------- -------- ------- -------------------- ----- ---- ----- - ----- ------ ---------- --------------
23616    lhalstro normal  m0.80a0.0_fso3_dcfpe 12:32  384 08:00 R 03:05   3897       None [bro|SKY] 
23620    lhalstro normal  m0.40a0.0_fso3_dcfpe 12:32  384 08:00 R 02:37   3897       None [SKY]     
23635    lhalstro normal,h m0.40a0.0_dcfpeg5rdx 16:24  384 08:00 Q 00:35   3922  Resources [bro]     
23606    lhalstro normal  m0.40a0.0_fso3_dcfpe 12:32  384 08:00 R 03:17   3948       None [SKY]     
23627    aschwing normal  m3.00a3.54b-3.53_AMC  6:24  144 05:00 R 01:44   3957       None [BRO|sky] 
23628    aschwing normal  m5.00a3.54b-3.53_AMC  6:24  144 05:00 R 01:39   3957       None [BRO|sky] 
23630    aschwing normal  m4.00a3.54b-3.53_AMC  6:24  144 05:00 R 01:16   3957       None [BRO|sky] 
23631    aschwing hipri   m2.00a3.54b-3.53_AMC  6:24  144 05:00 R 00:54   3957       None [BRO|sky] 
23632    aschwing hipri   m0.80a3.54b-3.53_AMC  6:24  144 05:00 R 00:50   3957       None [BRO|sky] 
23633    aschwing hipri   m0.90a3.54b-3.53_AMC  6:24  144 05:00 R 00:45   3957       None [BRO|sky] 
23634    aschwing normal  m1.20a3.54b-3.53_AMC  5:32  160 05:00 R 00:36   3957       None [bro|SKY] 
23636    aschwing hipri   m0.50a3.54b-3.53_AMC  6:24  144 05:00 R 00:19   3957       None [BRO|sky] 
23637    aschwing hipri   m1.60a3.54b-3.53_AMC  5:32  160 05:00 R 00:19   3957       None [bro|SKY] 
23638    aschwing normal  m1.40a3.54b-3.53_AMC  6:24  144 05:00 R 00:05   3957       None [BRO|sky] 
23639    aschwing normal  m1.10a3.54b-3.53_AMC  6:24  144 05:00 R 00:03   3957       None [BRO|sky] 
23622    lhalstro normal  m0.40a0.0_dcfpeg5rdx 16:24  384 08:00 R 00:35   3966       None [BRO]     
23629    lhalstro hipri   m0.40a0.0_fso3_dcfpe 12:32  384 08:00 R 01:18  11897       None [SKY]     
23600    lhalstro hipri   m0.40a0.0_fso3_dcfpe 12:32  384 08:00 R 04:32  11921       None [SKY]     


I'm going to upload another batch of diagnostic data from yesterday as well, including our slurmctld_log.
Comment 57 NASA JSC Aerolab 2018-08-09 07:11:59 MDT
Created attachment 7552 [details]
Diagnostic files
Comment 58 Alejandro Sanchez 2018-08-09 07:17:27 MDT
No, the patch only fixes the scheduler portion, so that a job submitted to multiple partitions is assigned the correct priority value in the job queue built for scheduling purposes. You should no longer see the earlier behavior, where jobs submitted to both hipri,normal that could not run on hipri (and so ran only in normal) stalled other users' normal-only jobs.

On the display side, scontrol show job and squeue will still show the priority for the partition currently being considered; it is not disaggregated per partition as in sprio. That would require another patch, but at least you shouldn't have the stalling problem you reported before.
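
For example, a quick check using standard squeue/sprio options (the format string is just an example):

# pending jobs with their reason: other users' normal-only jobs should no longer
# sit behind a single user's hipri,normal submissions
squeue --state=PENDING -o "%.8i %.10u %.12P %.10Q %.12r"

# per-partition priorities for everything still pending
sprio -l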

Can you verify the stalling problem isn't happening anymore?
Comment 59 Alejandro Sanchez 2018-08-09 07:49:48 MDT
Note also that the patch only affects newly submitted jobs, not the ones submitted before the patch was applied.
Comment 63 Alejandro Sanchez 2018-08-16 03:28:22 MDT
Hi,

The proposed patch was checked in to 17.11 (with some subtle modifications) in the following three commits:

(17.11.9)
https://github.com/SchedMD/slurm/commit/d2a1a96c54a6d556f2d91c1f24845a8f4089b41f

(17.11.9-2, due to an accidental bad cast introduced during review)
https://github.com/SchedMD/slurm/commit/21d2ab6ed1694bf7b12e824ce41b66bb143e22b3

(17.11.10)
https://github.com/SchedMD/slurm/commit/67a82c369a7530ce7838e6294973af0082d8905b

The patch I attached and you applied has the correct cast, so it is fine to keep running with it. I'd like to check in with you and confirm you're no longer experiencing the stalling problems you were reporting before. I'm aware that client commands like squeue and scontrol show job could potentially change the way they list/display job info so that priorities are disaggregated by partition the way sprio does, and we can talk further about addressing that later. But first I want to make sure that, since you applied the patch, jobs submitted against hipri,normal that can't run on hipri (only on normal) are no longer stalling older jobs that were submitted only to normal. I'd highly appreciate any new feedback on this question. Thanks!
Comment 64 Alejandro Sanchez 2018-08-22 03:30:36 MDT
Hi, any updates on this?  Thank you.
Comment 65 NASA JSC Aerolab 2018-08-22 09:36:35 MDT
We've had the patch and our normal/hipri partitions in place in our environment for the last two weeks.  Based on day-to-day usage, things appear to be behaving as intended and we have not seen 'hipri' jobs starving 'normal' jobs like we did in the past.  We also performed a number of controlled tests that would previously have manifested the issue and did not see any stalling after applying the patch.  Our cluster has only been under relatively light load, so there is the possibility of issues showing up later when contention increases, but based on our test cases we feel that is unlikely.

Thanks for the help; I'm glad we could find and help diagnose this bug.
Comment 66 Alejandro Sanchez 2018-08-23 02:40:24 MDT
(In reply to NASA JSC Aerolab from comment #65)
> We've had the patch and our normal/hipri partitions in place in our
> environment for the last two weeks.  Based on day-to-day usage, things
> appear to be behaving as intended and we have not seen 'hipri' jobs starving
> 'normal' jobs like we did in the past.  We also performed a number of
> controlled tests that would previously have manifested the issue and did not
> see any stalling after applying the patch.  Our cluster has only been under
> relatively light load, so there is the possibility of issues showing up
> later when contention increases, but based on our test cases we feel that is
> unlikely.
> 
> Thanks for the help; I'm glad we could find and help diagnose this bug.

Great, thanks for your feedback. I've opened a separate sev-5 enhancement request[1] to track modifying the rest of the user commands so they can display job priority information disaggregated by partition, as sprio currently does.

I'm going to go ahead and close this bug as fixed. Please reopen it if you encounter further issues. Thank you.

[1] https://bugs.schedmd.com/show_bug.cgi?id=5614