Before switching to slurm we used torque and maui. We had maui configured to put a soft limit of 768 cores on users' jobs, which meant that all of a user's jobs were considered "idle" until going over the soft limit, at which point the user's jobs were considered "blocked". Idle jobs were considered for scheduling before blocked jobs. Is there a good way to emulate this with slurm? I think QOS's could help with this, but I haven't had the chance to understand them well enough. What the above configuration really accomplished was to let all users have high priority on a small number of jobs (~10% of the system capacity in our case). The end result was that if a lot of users were in the queue, everyone would get a few jobs started right away. I'm trying to get back to the same point with slurm. We tend to have times when some users' jobs get starved for long periods.
Created attachment 4655 [details] current slurm.conf
Slurm has no soft limits. You can take a look at the resource limits guide in the Slurm documentation to get a feel for what types of limits can be imposed on jobs: https://slurm.schedmd.com/resource_limits.html

The PriorityType=priority/multifactor plugin could also be used together with the FairShare factor. The fair-share component of a job's priority influences the order in which a user's queued jobs are scheduled to run, based on the portion of the computing resources they have been allocated and the resources their jobs have already consumed. https://slurm.schedmd.com/priority_multifactor.html

You might be accustomed to soft limits, but I would strongly encourage you to look at Slurm's Quality Of Service (QOS) capability as a better solution. QOS lets you establish different job limits and supports job preemption. Preemption is especially important in that you can preempt lower priority jobs at will rather than finding your machine full of low priority jobs all Monday morning just because the system went idle over the weekend. A typical configuration would be to establish a "standby" QOS with large time/size limits, but preemptable by normal QOS jobs on demand. See:
https://slurm.schedmd.com/qos.html
https://slurm.schedmd.com/preempt.html

I've taken the liberty of reviewing your slurm.conf. You might consider making these changes:

ProctrackType=proctrack/cgroup
# This helps in job cleanup. When the job finishes, anything spawned in the cgroup will be cleaned up.
# This prevents runaway jobs (i.e. jobs that double-forked themselves). NOTE that the 'pgid' mechanism
# (your current setting) is not entirely reliable for process tracking.

You have:

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageLoc=/tmp/slurm_job_accounting.txt

That fully qualified path name to a txt file would make sense if you had accounting_storage/filetxt. We usually recommend a setup with accounting_storage/slurmdbd and an underlying database like MySQL or MariaDB. If, besides the database, you'd like the information on your finished jobs to be stored somewhere else, you could also consider a JobComp plugin to complement the accounting, although you might find most of that information redundant, since it can already be retrieved with sacct once you have slurmdbd plus the database in place.

I see you have:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU

and all of your partitions defined with OverSubscribe=Exclusive. If you want to allocate entire nodes to jobs, you might consider switching to select/linear to avoid the overhead created by select/cons_res.

# Backfill

If you haven't seen it, we think Doug Jacobsen did an excellent job of walking people through how NERSC approaches some of their priority and scheduler tuning. The presentation is here: https://slurm.schedmd.com/SLUG16/NERSC.pdf and may provide some insights. We find his mapping of priority to units of time rather inspiring. Our usual starting points for tuning are:

bf_continue
bf_window=(enough minutes to cover the highest MaxTime on the cluster)
bf_resolution=(usually at least 600)

bf_min_prio_reserve may also suit you well depending on your queue depth, although you'd need to jump to a 16.05 / 17.02 release to get that. The idea behind it is to only test whether the lower priority jobs can launch immediately, and not bother trying to slot them into the backfill map otherwise. That has *huge* performance gains for them, and lets them keep their systems 95%+ occupied.
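As a concrete illustration of those starting points, here is a minimal slurm.conf sketch, assuming you run the backfill scheduler and your longest MaxTime is around 48 hours (the specific numbers are assumptions for illustration, not values taken from your config):

SchedulerType=sched/backfill
# bf_window is in minutes and should cover the highest MaxTime on the cluster (48 h = 2880 min assumed here)
# bf_resolution is in seconds; a coarser backfill map (600 s) reduces scheduler overhead
SchedulerParameters=bf_continue,bf_window=2880,bf_resolution=600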
We highly encourage staying up to date with the latest stable Slurm release (currently 17.02.3 at the time of writing this comment), or at least the latest 16.05 (currently 16.05.10-2). A lot of bugs have been fixed since 16.05.7, and in my experience we can reproduce and troubleshoot issues much faster when the reported Slurm version is up to date. Please let us know if you have further questions. I know it's a lot of information to absorb, but feel free to ask anything.
Yes, it is a lot of info to absorb. I do have several comments/questions.

I've been meaning to switch to ProctrackType=proctrack/cgroup for a while but haven't had the time to test this. Do you have any recommendations for a good cgroup.conf configuration? I think we want both ConstrainCores=yes and ConstrainRAMSpace=yes. Anything else you recommend?

I think the AccountingStorageLoc being set is a leftover from our previous trials when first configuring slurm. We are currently using slurmdbd with a mysql DB so I think we're set there.

We switched from select/linear to select/cons_res in bug 3818. I do intend to switch back to select/linear. But we learned the hard way in that bug that we need to drain the jobs before switching this.

I would also like to update to 17.02 soon and intend to do that.

When I first got slurm running on our cluster (at the end of 2016), I spent a lot of time trying to understand how to use QOS and/or TRES to emulate the maui soft limit behavior I'm looking for. I spent some more time looking at QOS and TRES this morning and I'm still not clear on how to utilize these. Can you please point me in the right direction? How can I set up priority/multifactor such that a user's priority would be high when they have less than X number of procs running but low when they have more than X procs running?
(In reply to NASA JSC Aerolab from comment #4)
> Yes, it is a lot of info to absorb. I do have several comments/questions.
>
> I've been meaning to switch to ProctrackType=proctrack/cgroup for a while but haven't had the time to test this. Do you have any recommendations for a good cgroup.conf configuration? I think we want both ConstrainCores=yes and ConstrainRAMSpace=yes. Anything else you recommend?

We usually recommend:

ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup   <-- with TaskAffinity=no in cgroup.conf

This combines the best of the two task plugins: task/cgroup is used to fence jobs into the specified memory, gpus, etc., and task/affinity handles the task binding/layout/affinity best. The affinity logic in task/affinity is better than that in task/cgroup.

Regarding cgroup.conf, a good setup could be:

CgroupMountpoint=/sys/fs/cgroup   # or your cgroup mountpoint
CgroupAutomount=yes
CgroupReleaseAgentDir=/path/to/yours
ConstrainCores=no
ConstrainDevices=yes
ConstrainRAMSpace=yes
AllowedRAMSpace=100
ConstrainSwapSpace=yes
AllowedSwapSpace=0
TaskAffinity=no

Notes:

- Let me double check internally whether ConstrainCores should be 'no' with TaskAffinity=no. I'll come back to you.

- Regarding the ReleaseAgent: since 16.05.5 Slurm was supposedly able to remove the cpuset and devices subsystems automatically without the need for a release_agent. We later discovered that for some specific configs/use cases there were still step hierarchies not cleaned up, so a fix was added in 17.02.3, and since that version all cleanups seem to work well. So in short, for your 16.05.7 version I'd still use the release agent.

- You can also use MemSpecLimit in your node definitions to set aside some memory on the nodes for slurmd/slurmstepd usage. That memory won't be available for job allocations, leaving ((RealMemory - MemSpecLimit) * AllowedRAMSpace)/100 for job allocations.

- Memory can be enforced by two mechanisms in Slurm: one samples memory statistics at a frequent interval through the JobAcctGather plugin, and the other uses the task/cgroup part of TaskPlugin=task/cgroup,task/affinity. If you use the JobAcctGather plugin, we recommend jobacct_gather/linux. We also encourage not using both enforcement mechanisms at the same time. If you have both plugins enabled (JobAcctGather and Task), then we suggest setting:

JobAcctGatherParams=NoOverMemoryKill   # disables JobAcctGather memory enforcement

If the job is truly over its memory limit, the cgroup enforcement is what should be killing it, and that is not affected by this setting. You may also want to set UsePSS as well; this changes the data collection from RSS to PSS. If the application is heavily threaded, the shared memory space may be counted against it once per thread/process, which could explain apparently high usage when the RSS values are summed together. PSS divvies up the shared memory usage among all the separate processes, so when summed back together you get a more realistic view of memory consumption.

> I think the AccountingStorageLoc being set is a leftover from our previous trials when first configuring slurm. We are currently using slurmdbd with a mysql DB so I think we're set there.

Ok.

> We switched from select/linear to select/cons_res in bug 3818. I do intend to switch back to select/linear. But we learned the hard way in that bug that we need to drain the jobs before switching this.

I've just noticed you changed from linear to cons_res in that bug.
Regarding the jobs being killed... yes, that was accidental bad advice. The slurm.conf man page for SelectType does warn, though:

"Changing this value can only be done by restarting the slurmctld daemon and will result in the loss of all job information (running and pending) since the job state save format used by each plugin is different."

Anyhow, back to the point: if you were advised to stick with select/cons_res in that bug, do not change back. Also, I saw Danny pointed to a commit which is included since 16.05.8+.

> I would also like to update to 17.02 soon and intend to do that.
>
> When I first got slurm running on our cluster (at the end of 2016), I spent a lot of time trying to understand how to use QOS and/or TRES to emulate the maui soft limit behavior I'm looking for. I spent some more time looking at QOS and TRES this morning and I'm still not clear on how to utilize these. Can you please point me in the right direction? How can I set up priority/multifactor such that a user's priority would be high when they have less than X number of procs running but low when they have more than X procs running?

Let me do some tests and come back to you on this question.
The cgroup info is very helpful - thanks. No worries about the cons_res mishap - I missed it too. Looking forward to your recommendations for the multifactor setup we are trying to achieve.
(In reply to NASA JSC Aerolab from comment #4)
> When I first got slurm running on our cluster (at the end of 2016), I spent a lot of time trying to understand how to use QOS and/or TRES to emulate the maui soft limit behavior I'm looking for. I spent some more time looking at QOS and TRES this morning and I'm still not clear on how to utilize these. Can you please point me in the right direction? How can I set up priority/multifactor such that a user's priority would be high when they have less than X number of procs running but low when they have more than X procs running?

I've been doing some tests today and also discussed this internally. As I anticipated before, Slurm doesn't support the concept of soft limits. A brief description of how scheduling works in Slurm follows; in any case, I'd recommend reading the docs and this scheduling tutorial: https://slurm.schedmd.com/SUG14/sched_tutorial.pdf

When not using FIFO scheduling, jobs are prioritized in the following order:

1. Jobs that can preempt
2. Jobs with an advanced reservation
3. Partition priority tier
4. Job priority
5. Job id

Point 4) is where priority/multifactor comes into play. A job's priority is the sum of various factors multiplied by admin-defined weights. One of the factors is FairShare, which looks at past usage. This usage can be decomposed so the admin can decide how much to charge for each different TRES. The admin can also clear past usage using either PriorityDecayHalfLife or PriorityUsageResetPeriod. If PriorityWeightFairshare isn't high enough compared to the rest of the multifactor weights, FairShare won't have much impact on the final job priority, which, let's keep in mind, is only point 4 above in the order considered when scheduling. Besides that, a number of SchedulerParameters are used to make further scheduling decisions; they can be consulted in the man page.

So again, Slurm has no soft limits, and the closest approaches we've come up with to cover that use case would be:

a) PriorityFavorSmall=yes, so the smaller the job the higher the priority. This is not an aggregate, though, but applies on a per-job basis.

b) GrpTRES=cpu=X: all running jobs combined for the association and its children can consume up to X CPUs at the same time. If the limit is reached, the job will be PD with reason AssocGrpCpuLimit, EVEN IF THERE ARE FREE RESOURCES. I think what you would like is for new jobs from the association that reached the limit to be left PD only if no other jobs from other associations want the resources (those other associations being favored, since this association has already used X CPUs). But currently the Grp* limits are hard limits, and this is the way Slurm works today.

c) Do not set any Grp* limits and just let the FairShare factor affect the final job priority. You can specify through TRESBillingWeights which weight to assign to each TRES, and PriorityWeightTRES sets the degree to which each TRES type contributes to the job's priority.

Any solution we can think of would require code changes to make something like this happen. There isn't anything in Slurm that says "you are running X, so after X+1 don't give priority". If you're interested in sponsoring something like that, let us know and we can discuss it further.
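As a sketch of options b) and c), these are the kinds of settings involved (the account name and the weight values below are illustrative assumptions, not taken from your configuration):

# option b): hard cap on an association; jobs beyond it go PD with reason AssocGrpCpuLimit
sacctmgr modify account name=aerolab set GrpTRES=cpu=768

# option c): no Grp* limits; weight how TRES usage is charged and prioritized instead (slurm.conf;
# add TRESBillingWeights to your existing PartitionName lines)
PriorityWeightTRES=CPU=1000
TRESBillingWeights="CPU=1.0,Mem=0.25G"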
Btw, following up on my comment #5: set ConstrainCores=yes. If you're doing task binding to CPUs, we recommend enabling this; it will force the tasks to run only on the CPUs explicitly assigned to them. Otherwise they may run on whichever cores the Linux kernel schedules them on, and they can use more than their share of the CPUs.
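Pulling the cgroup and memory-enforcement notes above together with this correction, a minimal sketch of the combined setup (the mountpoint, release-agent path and MemSpecLimit value are assumptions for illustration; adjust them to your systems):

# cgroup.conf
CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes
CgroupReleaseAgentDir=/etc/slurm/cgroup   # assumed path; still needed on 16.05.x as noted above
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
AllowedRAMSpace=100
ConstrainSwapSpace=yes
AllowedSwapSpace=0
TaskAffinity=no

# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=NoOverMemoryKill,UsePSS
# optionally reserve some node memory for slurmd/slurmstepd by adding, e.g., MemSpecLimit=2048 (MB)
# to your existing NodeName definitions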
Hi, is there anything else we can assist you with on this bug? Thanks.
Yes, please keep this open. We would still like help tweaking our scheduling to achieve something like soft limits. We are trying to get our cluster emulated on one of our workstations but we've been distracted by other things and haven't finished that yet.
Hi, do you need anything else on this bug? Thanks.
Yes, still working on this. Please keep it open.
Please reopen if any more is required on this.
Hello. I'm reopening this bug as I think I have an approach for what we are trying to accomplish, but I could still use some help. Kind of a lot has changed since I opened this bug. We've been adding users from other groups and so, by necessity, we've become a lot more familiar with the options to limit users and resources. Here is what I have in mind.

We currently have 5 partitions set up: normal, idle, long, debug and twoday. The normal and idle partitions are the two main ones and are really about job priority. Almost all of our jobs run in normal. We restrict most jobs to 8 hours or less as a means of naturally cycling jobs through the queue and keeping maintenance more accessible (replacing failed memory, etc.). The long and twoday queues are restricted in various ways.

What I propose is creating a partition called hipri with a Priority 100x larger than normal. This would also be attached to a corresponding hipri QOS that has MaxTRESPerUser set to some small fraction of the machine, probably CPU=768. We would use our job_submit.lua to automatically place jobs submitted to normal into both normal and hipri. The end result should be what we are trying to achieve: each user will get a small number of jobs that start above other jobs, then they effectively fall into another priority range. Please let me know if this sounds reasonable or if you see any issues with that approach.

Secondly, it would be nice if this could be accomplished directly with QOS's, without the need for the extra partition. I think this could work if a job could request multiple QOS's (i.e. --qos=normal,hipri) and we could also use job_submit.lua to do this automatically for the user. But it doesn't look like you can currently specify multiple QOS's in --qos?
(In reply to NASA JSC Aerolab from comment #18)
> Hello. I'm reopening this bug as I think I have an approach for what we are trying to accomplish, but I could still use some help. Kind of a lot has changed since I opened this bug. We've been adding users from other groups and so, by necessity, we've become a lot more familiar with the options to limit users and resources. Here is what I have in mind.
>
> We currently have 5 partitions set up: normal, idle, long, debug and twoday. The normal and idle partitions are the two main ones and are really about job priority. Almost all of our jobs run in normal. We restrict most jobs to 8 hours or less as a means of naturally cycling jobs through the queue and keeping maintenance more accessible (replacing failed memory, etc.). The long and twoday queues are restricted in various ways.
>
> What I propose is creating a partition called hipri with a Priority 100x larger than normal. This would also be attached to a corresponding hipri QOS that has MaxTRESPerUser set to some small fraction of the machine, probably CPU=768. We would use our job_submit.lua to automatically place jobs submitted to normal into both normal and hipri. The end result should be what we are trying to achieve: each user will get a small number of jobs that start above other jobs, then they effectively fall into another priority range. Please let me know if this sounds reasonable or if you see any issues with that approach.

Note that the partition Priority option might not be doing what you think it is [1]. If you want to create a higher priority partition, I'd increase the value of the partition's PriorityTier option rather than PriorityJobFactor, since PriorityJobFactor only contributes to one of the multiple factors of the job priority (point 4 from comment 11).

Your approach might work, although another option would be simply to create a higher-priority, limited QOS that users are aware of and can submit to when needed, knowing that it is limited; once they hit the limit they can submit to the regular QOS. That would avoid creating a higher-priority partition attached to another QOS.

> Secondly, it would be nice if this could be accomplished directly with QOS's, without the need for the extra partition. I think this could work if a job could request multiple QOS's (i.e. --qos=normal,hipri) and we could also use job_submit.lua to do this automatically for the user. But it doesn't look like you can currently specify multiple QOS's in --qos?

You can't submit to more than one QOS at present.

[1] https://slurm.schedmd.com/SLUG17/FieldNotes.pdf (look for Partition Priority).
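As a sketch of the PriorityTier suggestion, something along these lines in slurm.conf (the tier values and the trimmed-down option list are assumptions for illustration; keep your existing partition options):

# hipri entries are examined first because of the higher PriorityTier; the hipri QOS then caps per-user usage
PartitionName=normal Nodes=r1i[0-2]n[0-35] Default=YES MaxTime=08:00:00 State=UP PriorityTier=1
PartitionName=hipri  Nodes=r1i[0-2]n[0-35] Default=NO  MaxTime=08:00:00 State=UP PriorityTier=10 QOS=hipri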
It would be preferable to make this as transparent to the users as possible. Also, the idea of users manually submitting to the high priority QOS for some jobs and the lower QOS for others is a no go. This has to be something that happens automatically. We'll try the partition approach and see how it goes.
Created attachment 7376 [details] current slurm.conf and q.txt

I've tried to implement this but something isn't right. I created the QOS with:

sacctmgr create qos name=hipri
sacctmgr modify qos name=hipri set MaxTRESPerUser=CPU=768

Then I added a corresponding hipri partition:

PartitionName=hipri Nodes=r1i[0-2]n[0-35] Priority=50000 State=UP Qos=hipri

That's not all the options for the partition; see the attached slurm.conf for the full details. I then had the user aschwing (Alan) manually add both the normal and hipri partitions to his job scripts. The way the multifactor was working out prior to the hipri changes, only a few of his jobs (~300 cores worth) were running; user ema was dominating the queue. So I expected Alan to get 768 jobs running before the MaxTRESPerUser=CPU=768 limit kicked in and his jobs competed in the normal queue again. This is not happening. It's like the hipri TRES limit is not taking effect, and Alan's jobs are now starving all other jobs. This is easiest to see in the q.txt file I've attached, which is sorted by priority. Any idea what's going on here?
User aschwing has these 5 jobs running in the hipri partition:

Job ID Username Queue Jobname N:ppn Proc Wall S Elap Prio Reason Features
-------- -------- ------- -------------------- ----- ---- ----- - ----- ------ ---------- --------------
20497 aschwing hipri m1.10a10_dx0.06_umb2 6:24 144 04:00 R 01:44 3941 None [BRO|sky]
20514 aschwing hipri m1.10a0b-10_dx0.06_u 6:24 144 04:00 R 00:09 3941 None [BRO|sky]
20496 aschwing hipri m1.20a-10_dx0.06_umb 6:24 144 04:00 R 01:50 3943 None [BRO|sky]
20513 aschwing hipri m3.00a20_dx0.25_umb1 5:32 160 04:00 R 00:11 3946 None [bro|SKY]
20498 aschwing hipri m0.50a-20_dx0.25_umb 6:24 144 04:00 R 00:24 3995 None [BRO|sky]

They sum to a total of 144*4 + 160 = 736 Proc. There's another job submitted by aschwing to both normal,hipri which isn't running:

20515 aschwing normal,h m1.10a-7.11b-7.05_dx 5:28 144 04:00 Q 00:09 3946 Priority [bro|sky]

It requests 144 Proc, and the 736 already running plus 144 would total 880 Proc, which is greater than the hipri MaxTRESPerUser=cpu=768. That's why it falls back to the normal partition; but the normal partition doesn't have such a high priority, so this job is waiting with reason Priority.

You mentioned that: "So I expected Alan to get 768 jobs running before the MaxTRESPerUser=CPU=768 limit kicked in and his jobs competed in the normal queue again." If you want to limit the number of jobs per user, you should configure the MaxJobsPerUser limit instead.

In any case, if what you want to accomplish in general terms is that no user dominates the cluster, what I'd do is give more importance to PriorityWeightFairshare, increasing it a few orders of magnitude over the rest of the factors, and set PriorityFlags=FAIR_TREE. See https://slurm.schedmd.com/priority_multifactor.html#fairshare and https://slurm.schedmd.com/fair_tree.html

Note also that, as I said in my previous comment, you have your partitions set with the deprecated Priority option. Perhaps you have it set on purpose, but I'd encourage you to read (if you haven't already) slide 12 onwards of this presentation: https://slurm.schedmd.com/SLUG17/FieldNotes.pdf which explains the difference between PriorityTier, PriorityJobFactor and the obsolete Priority.

Does it make sense?
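For illustration, a fair-share-dominated weighting could look something like this (the specific numbers are illustrative assumptions, not a tested recommendation for your cluster):

PriorityType=priority/multifactor
PriorityFlags=FAIR_TREE
PriorityDecayHalfLife=7-0
# a few orders of magnitude above the other factors so past usage dominates the ordering
PriorityWeightFairshare=1000000
PriorityWeightAge=1000
PriorityWeightPartition=10000
PriorityWeightJobSize=2000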
Sorry, I didn't mean 768 jobs, I meant 768 cores worth of jobs. This is what we want.

I will look into using PriorityTier and PriorityJobFactor instead of just Priority. But does this explain why these jobs are now starving all other jobs in the queue? I was expecting this to work such that jobs that were submitted to both normal and hipri will run ahead of other jobs that are just in normal until 768 cores for that user are running. Then a mix of users' jobs will run in normal, according to the multifactor weights we have set up. But that is not happening. User aschwing is getting all 2880 cores on this cluster.

Another aspect of this is confusing to me. See below, but the list is sorted by job priority. All the running jobs at the bottom of the list have a very high priority, which should correspond to getting into the hipri partition. But most of those are listed as normal. The jobs at the top are listed as hipri but have a low priority. Why is that? And why are those jobs at the top of the list (lowest priority in the queue) running before the jobs in the 4000's from ema and lhalstro? Will using PriorityTier and PriorityJobFactor fix this?

Job ID Username Queue Jobname N:ppn Proc Wall S Elap Prio Reason Features
-------- -------- ------- -------------------- ----- ---- ----- - ----- ------ ---------- --------------
20685 aschwing hipri m3.00a0b-20_dx0.25_u 6:24 144 04:00 R 00:57 3940 None [BRO|sky]
20673 aschwing hipri m3.00a-20_dx0.25_umb 5:32 160 04:00 R 01:36 3941 None [bro|SKY]
20681 aschwing hipri m4.00a20_dx0.25_umb1 6:24 144 04:00 R 01:05 3941 None [BRO|sky]
20697 aschwing normal m0.50a0b-20_dx0.25_u 5:28 144 04:00 Q 00:05 3943 Priority [bro|sky]
20696 aschwing normal,h m0.50a0b-20_dx0.25_u 5:28 144 04:00 Q 00:07 3945 Priority [bro|sky]
20687 aschwing hipri m3.00a-20_dx0.25_umb 6:24 144 04:00 R 00:07 3973 None [BRO|sky]
20494 lhalstro normal m0.65a0.0_fso3_dcfpe 12:32 384 08:00 Q 18:04 4609 Priority [sky|bro]
20482 lhalstro normal m0.80a0.0_fso3_dcfpe 16:24 384 08:00 Q 18:52 4643 Priority [bro|sky]
20490 ema normal m0.50a80r90_upwind_d 3:32 96 08:00 Q 18:22 4722 Priority [bro|sky]
20481 ema normal m0.20a70r20_lowspeed 4:24 96 05:00 Q 19:25 4766 Priority [bro|sky]
20479 ema normal m0.30a80r70_lowspeed 3:32 96 08:00 Q 19:45 4780 Priority [bro|sky]
20480 ema normal m0.30a70r60_lowspeed 4:24 96 08:00 Q 19:45 4780 Priority [bro|sky]
20478 ema normal m0.30a90r60_lowspeed 4:24 96 08:00 Q 19:46 4781 Priority [bro|sky]
20477 ema normal m0.30a80r80_lowspeed 4:24 96 08:00 Q 19:49 4782 Priority [bro|sky]
20476 ema normal m0.30a60r70_lowspeed 4:24 96 08:00 Q 19:53 4785 Priority [bro|sky]
20431 lhalstro normal m0.40a0.0_fso3_dcfpe 12:32 384 08:00 Q 22:18 4786 Priority [sky]
20475 ema normal m0.30a70r80_lowspeed 4:24 96 08:00 Q 19:55 4786 Priority [bro|sky]
20474 ema normal m0.30a90r80_lowspeed 4:24 96 08:00 Q 20:13 4799 Priority [bro|sky]
20472 ema normal m0.30a60r60_lowspeed 4:24 96 08:00 Q 20:18 4802 Priority [bro|sky]
20470 ema normal m0.30a80r50_lowspeed 3:32 96 08:00 Q 20:24 4806 Priority [bro|sky]
20468 ema normal m0.50a70r30_upwind_d 4:24 96 08:00 Q 20:28 4809 Priority [bro|sky]
20467 ema normal m0.50a70r60_upwind_d 4:24 96 08:00 Q 20:32 4812 Priority [bro|sky]
20465 ema normal m0.30a60r10_lowspeed 3:32 96 05:00 Q 20:34 4813 Priority [bro|sky]
20466 ema normal m0.30a60r20_lowspeed 4:24 96 08:00 Q 20:33 4813 Priority [bro|sky]
20464 ema normal m0.50a60r80_upwind_d 4:24 96 08:00 Q 20:39 4817 Priority [bro|sky]
20463 ema normal m0.30a90r50_lowspeed 4:24 96 08:00 Q 20:46 4822 Priority [bro|sky]
20462 ema normal m0.50a60r60_upwind_d 3:32 96 08:00 Q 20:47 4823 Priority [bro|sky]
20456 ema normal m0.50a70r90_upwind_d 4:24 96 08:00 Q 21:01 4832 Priority [bro|sky]
20458 ema normal m0.30a70r50_lowspeed 3:32 96 08:00 Q 21:00 4832 Priority [bro|sky]
20459 ema normal m0.50a80r70_upwind_d 3:32 96 08:00 Q 21:00 4832 Priority [bro|sky]
20454 ema normal m0.20a80r60_lowspeed 4:24 96 08:00 Q 21:04 4834 Priority [bro|sky]
20453 ema normal m0.20a90r70_lowspeed 4:24 96 08:00 Q 21:16 4842 Priority [bro|sky]
20452 ema normal m0.50a90r70_upwind_d 4:24 96 08:00 Q 21:20 4846 Priority [bro|sky]
20451 ema normal m0.20a70r60_lowspeed 4:24 96 08:00 Q 21:23 4848 Priority [bro|sky]
20449 ema normal m0.50a80r80_upwind_d 3:32 96 08:00 Q 21:27 4850 Priority [bro|sky]
20448 ema normal m0.20a90r40_lowspeed 3:32 96 05:00 Q 21:30 4852 Priority [bro|sky]
20447 ema normal m0.50a70r80_upwind_d 4:24 96 08:00 Q 21:31 4853 Priority [bro|sky]
20443 ema normal m0.50a70r70_upwind_d 3:32 96 08:00 Q 21:35 4856 Priority [bro|sky]
20444 ema normal m0.20a90r60_lowspeed 3:32 96 08:00 Q 21:35 4856 Priority [bro|sky]
20442 ema normal m0.20a60r10_lowspeed 3:32 96 05:00 Q 21:45 4863 Priority [bro|sky]
20440 ema normal m0.50a80r20_upwind_d 3:32 96 08:00 Q 21:48 4865 Priority [bro|sky]
20441 ema normal m0.50a60r70_upwind_d 3:32 96 08:00 Q 21:48 4865 Priority [bro|sky]
20437 ema normal m0.20a80r70_lowspeed 3:32 96 08:00 Q 21:57 4871 Priority [bro|sky]
20434 ema normal m0.20a70r10_lowspeed 4:24 96 05:00 Q 22:14 4883 Priority [bro|sky]
20432 ema normal m0.20a80r40_lowspeed 3:32 96 05:00 Q 22:17 4885 Priority [bro|sky]
20428 ema normal m0.20a80r20_lowspeed 4:24 96 05:00 Q 22:32 4896 Priority [bro|sky]
20427 ema normal m0.50a80r60_upwind_d 4:24 96 08:00 Q 22:35 4898 Priority [bro|sky]
20425 ema normal m0.20a70r40_lowspeed 4:24 96 05:00 Q 22:43 4903 Priority [bro|sky]
20419 ema normal m0.30a80r60_lowspeed 3:32 96 08:00 R 00:05 4917 None [bro|SKY]
20420 ema normal m0.30a70r70_lowspeed 3:32 96 08:00 Q 23:08 4920 Resources [bro|sky]
20418 ema normal m0.30a90r70_lowspeed 3:32 96 08:00 R 00:05 4928 None [bro|SKY]
20677 aschwing normal m0.50a20_dx0.25_umb1 6:24 144 04:00 R 01:25 11940 None [BRO|sky]
20678 aschwing normal m0.50a0b-20_dx0.25_u 5:32 160 04:00 C 01:14 11940 None [bro|SKY]
20679 aschwing normal m0.50a0b-20_dx0.25_u 5:32 160 04:00 R 01:08 11940 None [bro|SKY]
20680 aschwing normal m4.00a20_dx0.25_umb1 6:24 144 04:00 R 01:08 11940 None [BRO|sky]
20682 aschwing hipri m4.00a-20_dx0.25_umb 6:24 144 04:00 R 01:05 11940 None [BRO|sky]
20683 aschwing normal m0.50a-20_dx0.25_umb 6:24 144 04:00 R 01:02 11940 None [BRO|sky]
20686 aschwing normal m4.00a0b-20_dx0.25_u 6:24 144 04:00 R 00:58 11940 None [BRO|sky]
20688 aschwing normal m4.00a0b-20_dx0.25_u 5:32 160 04:00 R 00:51 11940 None [bro|SKY]
20689 aschwing normal m0.50a20_dx0.25_umb3 6:24 144 04:00 R 00:48 11940 None [BRO|sky]
20690 aschwing normal m0.50a-20_dx0.25_umb 5:32 160 04:00 R 00:35 11940 None [bro|SKY]
20691 aschwing normal m4.00a-20_dx0.25_umb 6:24 144 04:00 R 00:34 11940 None [BRO|sky]
20692 aschwing normal m0.50a20_dx0.25_umb3 5:32 160 04:00 R 00:25 11940 None [bro|SKY]
20693 aschwing normal m0.50a-20_dx0.25_umb 6:24 144 04:00 R 00:23 11940 None [BRO|sky]
20694 aschwing normal m3.00a0b-20_dx0.25_u 5:32 160 04:00 R 00:19 11940 None [bro|SKY]
20695 aschwing normal m3.00a20_dx0.25_umb1 6:24 144 04:00 R 00:08 11940 None [BRO|sky]
Sorry - a couple other things I forgot to mention. That job list in the last comment is from this morning. Only aschwing is using hipri at the moment. We want to verify that only 768 cores worth of jobs per user will run in hipri before using hipri more widely. I'd like to get that fixed ASAP. Fairshare/fairtree doesn't do exactly what we want. We want each user running on the system to have a very high priority and get a certain number of cores running before dropping to a lower priority. Fairshare alone won't do that. That's why we are trying to get these high and low priority partitions working with the enforced limits.
I've read through the Partition Priority info in the field notes presentation. That all makes sense, and the different tiers are pretty much what we are after here. As described in the doc, setting only the Priority value applies that value to both the tier and job factor parameters.

PartitionName=normal AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=YES QoS=N/A DefaultTime=04:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=08:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=r1i[0-2]n[0-35] PriorityJobFactor=10000 PriorityTier=10000 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=2880 TotalNodes=108 SelectTypeParameters=NONE DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=hipri AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=hipri DefaultTime=04:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=08:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=r1i[0-2]n[0-35] PriorityJobFactor=50000 PriorityTier=50000 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=2880 TotalNodes=108 SelectTypeParameters=NONE DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

With the way we have our weights set:

PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityWeightFairshare=5000
PriorityWeightAge=1000
PriorityWeightPartition=10000
PriorityWeightJobSize=2000
PriorityWeightQOS=0
PriorityMaxAge=1-0
PriorityFavorSmall=YES

this should achieve what we are after: hipri jobs run first. But the QOS limits are not being honored, which is what I'd like to fix.

It's been my observation that the job that will run next has always been the one with Reason=Resources. This usually (always?) is the pending job with the highest priority, which makes sense. If you look at the latest job listing I sent, one of ema's jobs is listed with Reason=Resources, but that job keeps getting starved.
Hi Darby,

Jess reached out to me and mentioned that you would like more frequent updates in this ticket. I have read over the ticket and believe that Alejandro has been very responsive to your requests. Alejandro is looking at this ticket, but it does take time to respond since the changes you are asking for are more site-specific and require some testing, so we ask for your patience while he works this issue.

I also wanted to point out that we actively try to meet or exceed our service level agreements. This issue is a Severity 4 issue (Minor Issue), so it is entitled to the following:

● Initial Response (during normal work hours): As available
● Status Updates: As available
● Work Schedule: As available

Alejandro should be following up with you shortly with an update.

Best regards,
Jason
Director of Support
Hi, jobs submitted to both the regular and highprio partitions will be ordered from highest PriorityTier to lowest, so the scheduler will first try to schedule the job through highprio and, if that is not possible (for instance because a limit has been reached), will try the regular partition. Example:

test@ibiza:~/t$ scontrol show part highprio | egrep "Priority|QoS"
   AllocNodes=ALL Default=NO QoS=highprio
   PriorityJobFactor=1000 PriorityTier=1000 RootOnly=NO ReqResv=NO OverSubscribe=NO
test@ibiza:~/t$ sacctmgr show qos format=name,maxtrespu
      Name MaxTRESPU
---------- -------------
    normal
  highprio cpu=2
test@ibiza:~/t$ sbatch -p regular,highprio -c2 --wrap "sleep 9999"
Submitted batch job 20034
test@ibiza:~/t$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20034 highprio wrap test R 0:01 1 compute1
test@ibiza:~/t$ sbatch -p regular,highprio -c2 --wrap "sleep 9999"
Submitted batch job 20035
test@ibiza:~/t$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20035 regular wrap test R 0:00 1 compute1
20034 highprio wrap test R 0:03 1 compute1
test@ibiza:~/t$ sbatch -p regular,highprio -c2 --wrap "sleep 9999"
Submitted batch job 20036
test@ibiza:~/t$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20036 regular wrap test R 0:01 1 compute1
20035 regular wrap test R 0:06 1 compute1
20034 highprio wrap test R 0:09 1 compute1
test@ibiza:~/t$

Note that job 20034 is scheduled in the highprio partition, so the 2-CPU max limit is reached, and the subsequent jobs 20035 and 20036, also submitted to both partitions, are scheduled but run under the regular partition because highprio has already reached the configured limit.

I'd also encourage you to set AccountingStorageEnforce to 'safe,qos' instead of your current 'limits'.

(In reply to NASA JSC Aerolab from comment #23)
> Sorry, I didn't mean 768 jobs, I meant 768 cores worth of jobs. This is what we want.

Ok, no problem.

> I will look into using PriorityTier and PriorityJobFactor instead of just Priority. But does this explain why these jobs are now starving all other jobs in the queue? I was expecting this to work such that jobs that were submitted to both normal and hipri will run ahead of other jobs that are just in normal until 768 cores for that user are running. Then a mix of users' jobs will run in normal, according to the multifactor weights we have set up. But that is not happening. User aschwing is getting all 2880 cores on this cluster.

aschwing's jobs may sum to all 2880 cores, but the subset of jobs running under the hipri partition shouldn't sum to more than 768 cores, which is the configured limit. Grepping your list of jobs for hipri, I see this:

alex@ibiza:~/t$ cat t | grep hip
20685 aschwing hipri m3.00a0b-20_dx0.25_u 6:24 144 04:00 R 00:57 3940 None [BRO|sky]
20673 aschwing hipri m3.00a-20_dx0.25_umb 5:32 160 04:00 R 01:36 3941 None [bro|SKY]
20681 aschwing hipri m4.00a20_dx0.25_umb1 6:24 144 04:00 R 01:05 3941 None [BRO|sky]
20687 aschwing hipri m3.00a-20_dx0.25_umb 6:24 144 04:00 R 00:07 3973 None [BRO|sky]
20682 aschwing hipri m4.00a-20_dx0.25_umb 6:24 144 04:00 R 01:05 11940 None [BRO|sky]
alex@ibiza:~/t$

144*4 + 160 = 736 < MaxTRESPerUser=cpu=768. The rest of aschwing's running jobs are in the normal partition, which isn't CPU-constrained as far as I know. So this all seems coherent to me.

> Another aspect of this is confusing to me. See below, but the list is sorted by job priority.
> All the running jobs at the bottom of the list have a very high priority, which should correspond to getting into the hipri partition. But most of those are listed as normal. The jobs at the top are listed as hipri but have a low priority. Why is that? And why are those jobs at the top of the list (lowest priority in the queue) running before the jobs in the 4000's from ema and lhalstro?

Can you please attach the output of:

$ squeue -O jobid,partition,prioritylong,username,state,starttime,schednodes,nodelist,reason --sort=S

and:

$ sprio -l

I'm not sure which utility you're using to report the jobs.
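For reference, the enforcement change suggested above would look like this in slurm.conf (as I recall, 'safe' also implies 'limits'; slurmctld needs to pick up the new config):

AccountingStorageEnforce=safe,qos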
I've changed AccountingStorageEnforce=safe,qos and restarted slurmctld.

I'll attach these files:

[root@europa ~]# squeue -O jobid,partition,prioritylong,username,state,starttime,schednodes,nodelist,reason --sort=S > squeue.txt
[root@europa ~]# sprio -l > sprio.txt

I'm not sure how useful that is right now, though, since we removed the hipri partition from a bunch of aschwing's jobs. We had to do this because everyone was getting starved. The output I sent earlier is from a script we wrote that combines data from several commands into a qstat-like output, but it's just getting info from squeue and scontrol.

I agree that, strictly speaking, the limits are being enforced properly and each user is getting only 768 cores or less of jobs running in hipri. But something else is happening to the scheduling that is causing jobs submitted to both partitions to be favored. Before we created the hipri partition, ema's jobs were getting most of the cluster (~75%) and the other two users were getting the remaining cores. After creating the hipri partition and having aschwing use it for all his jobs (i.e. nobody else using hipri), his jobs starved all others in the queue. That's the part that doesn't make sense to me. Once he reached the 768 CPU limit, his jobs should compete in normal with all the others and there should be a mix of jobs from all users. But instead, even his normal jobs ran ahead of all other jobs in the queue.

The "Prio" column from comment 23 is just the Priority field that "scontrol show job" displays. So I don't understand why aschwing's lower priority jobs are starting ahead of higher priority jobs from other users. Can you explain that? The output in comment 23 is from when all of aschwing's jobs were using both normal and hipri.
Created attachment 7404 [details] sprio output
Created attachment 7405 [details] squeue output
The backfill scheduler builds an ordered queue of (job, partition) pairs sorted as follows:

1. Job can preempt
2. Job with an advanced reservation
3. Job partition PriorityTier
4. Job priority (sum of multifactor plugin terms)
5. Job submission time
6. JobId
7. ArrayTaskId

For instance, user 'test2' submits a job '20083' only to partition 'normal', so this job will have only one entry in the list built by the scheduler:

(20083, test2, normal)

User 'test' submits job '20086' to the 'highprio' and 'normal' partitions, so it will have two entries in the list:

(20086, test, highprio)
(20086, test, normal)

When the scheduler sorts the list, it will place the highprio entry first, because it has a higher PriorityTier than the rest (even if it has a lower job priority):

(20086, test, highprio)

Then the scheduler needs to decide which entry goes next: (20083, test2, normal) or (20086, test, normal). Since both have the same PriorityTier, the scheduler decides based on the job priority (from multifactor). sprio can be used to show how each of those factors contributes to the job priority (point 4 above):

alex@ibiza:~/t$ sprio
JOBID PARTITION PRIORITY AGE FAIRSHARE JOBSIZE PARTITION
20083 normal 4498 47 1076 1375 2000
20086 highprio 11513 47 92 1375 10000
20086 normal 3513 47 92 1375 2000
alex@ibiza:~/t$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20083 normal wrap test2 PD 0:00 1 (Resources)
20086 highprio, wrap test PD 0:00 1 (Priority)
20076 normal wrap test R 1:12:15 1 compute1
20074 normal wrap test R 1:12:47 1 compute1
20073 highprio wrap test R 1:15:04 1 compute1
20079 normal wrap test2 R 1:11:59 1 compute2
20077 normal wrap test2 R 1:12:10 1 compute2
20075 normal wrap test2 R 1:12:21 1 compute1
20081 normal wrap test2 R 1:08:46 1 compute2
20082 normal wrap test2 R 1:02:44 1 compute2

Since (20083, test2, normal) has a higher job priority than (20086, test, normal), the former will be next in the sorted scheduler queue, which will finally be:

(20086, test, highprio)
(20083, test2, normal)
(20086, test, normal)

Now, if I enable DebugFlags=Backfill (scontrol setdebugflags +backfill), I can see backfill attempt to schedule the jobs in this order:

slurmctld: backfill: beginning
slurmctld: backfill test for JobID=20086 Prio=11518 Partition=highprio
slurmctld: backfill test for JobID=20083 Prio=4494 Partition=normal
slurmctld: backfill test for JobID=20086 Prio=3518 Partition=normal

Currently there are no resources available, so none of them can start.
If I scancel the running job 20074:

alex@ibiza:~/t$ scancel 20074
alex@ibiza:~/t$

let's see what backfill does now:

slurmctld: backfill: beginning
slurmctld: debug: backfill: 3 jobs to backfill
slurmctld: backfill test for JobID=20086 Prio=11522 Partition=highprio
slurmctld: debug2: job 20086 being held, if allowed the job request will exceed QOS highprio max tres(cpu) per user limit 2 with already used 2 + requested 2
slurmctld: backfill: adding reservation for job 20086 blocked by acct_policy_job_runnable_post_select

It first tries (20086, test, highprio), but it can't start it because I have MaxTRESPerUser=cpu=2 configured in my highprio QOS:

      Name MaxTRESPU
---------- -------------
    normal
  highprio cpu=2

and user 'test' already has job 20073 running under highprio consuming 2 CPUs. So backfill continues to the next entry in the sorted list:

slurmctld: backfill test for JobID=20083 Prio=4489 Partition=normal
slurmctld: Job 20083 to start at 2018-07-26T12:54:47, end at 2019-07-26T12:54:00 on compute1

So the second entry in the sorted list, (20083, test2, normal), starts, since it has a higher priority than the (20086, test, normal) entry it competes with.

alex@ibiza:~/t$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
20086 highprio, wrap test PD 0:00 1 (Resources)
20076 normal wrap test R 1:25:22 1 compute1
20073 highprio wrap test R 1:28:11 1 compute1
20079 normal wrap test2 R 1:25:06 1 compute2
20077 normal wrap test2 R 1:25:17 1 compute2
20075 normal wrap test2 R 1:25:28 1 compute1
20081 normal wrap test2 R 1:21:53 1 compute2
20082 normal wrap test2 R 1:15:51 1 compute2
20083 normal wrap test2 R 6:23 1 compute1
alex@ibiza:~/t$ sprio
JOBID PARTITION PRIORITY AGE FAIRSHARE JOBSIZE PARTITION
20086 highprio 11526 57 94 1375 10000
20086 normal 3526 57 94 1375 2000
alex@ibiza:~/t$

As you can see, even though user 'test' submitted job 20086 to both the highprio and normal partitions, since the 'highprio' entry could not be scheduled due to limits, the 'normal' entry competed with job 20083 from user 'test2' (which was submitted only to the normal partition), and job 20086 has _not_ starved job 20083. Since both entries have the same PriorityTier, the scheduler checked the next stage in the precedence order, the job priority, found that job 20083 had a higher job priority (4498) than 20086 in normal (3513), and started job 20083 first.

Does it make sense?
While I appreciate the detailed explanation of what's going on, I'm having a hard time absorbing all that and understanding why it translates into the behavior we are seeing. I'm about to start logging the following information on our system every 10 minutes:

~dvicker/bin/q -pn > q.txt.$date.$time
/software/x86_64/bin/nodeinfo.pl > nodeinfo.out.$date.$time
sprio -l > sprio.out.$date.$time
scontrol -a show job > scontrol_show_jobs.out.$date.$time
squeue -a > squeue.out.$date.$time
sinfo -a -N -o "%.20n %.15C %.10t %.10e %.15P %.15f" > sinfo.out.$date.$time
sdiag > sdiag.out.$date.$time

I've also done this: "scontrol setdebugflags +backfill"

I intend to demonstrate how, when a single user starts using hipri for all their jobs, it starves all other jobs in the system. Please let me know if there are other commands you'd like me to log.

You understand what we are trying to accomplish, right? Can you please help me determine the right scheduler configuration to achieve this?
Here is the baseline. We removed hipri from everyone's batch scripts and let the queues run long enough that Alan's (aschwing) jobs are queued in normal. There are still some of lhalstro's jobs running in hipri, but those will resubmit to normal only. We have now added normal and hipri back to all of Alan's batch scripts. I'm expecting all of Alan's jobs to start; what we want to happen is that a few of them start (<= 768 cores) and then a mix of jobs runs after that.

[root@europa slurm_data]# date
Thu Jul 26 11:56:43 CDT 2018
[root@europa slurm_data]# q -p
Job ID Username Queue Jobname N:ppn Proc Wall S Elap Prio Reason Features
-------- -------- ------- -------------------- ----- ---- ----- - ----- ------ ---------- --------------
21203 aschwing normal m0.50a0b-20_dx0.25_u 5:28 144 04:00 Q 00:10 3947 Priority [bro|sky]
21202 aschwing normal m0.50a0b-20_dx0.25_u 5:28 144 04:00 Q 00:22 3955 Priority [bro|sky]
21201 aschwing normal m3.00a20_dx0.25_umb3 6:24 144 04:00 Q 00:25 3958 Priority [bro|sky]
21181 aschwing normal m0.50a-20_dx0.25_umb 5:32 160 04:00 C 01:57 4019 None [bro|SKY]
21191 aschwing normal m0.50a20_dx0.25_umb3 6:24 144 04:00 Q 02:01 4024 Priority [bro|sky]
21192 aschwing normal m4.00a20_dx0.25_umb1 6:24 144 04:00 Q 02:00 4024 Priority [bro|sky]
21188 aschwing normal m3.00a0b-20_dx0.25_u 6:24 144 04:00 R 00:10 4032 None [BRO|sky]
21190 aschwing normal m3.00a-20_dx0.25_umb 6:24 144 04:00 Q 02:11 4032 Resources [bro|sky]
21173 aschwing normal m3.00a-20_dx0.25_umb 6:24 144 04:00 C 02:03 4083 None [BRO|sky]
21182 aschwing normal m0.50a0b-20_dx0.25_u 5:32 160 04:00 R 00:21 4083 None [bro|SKY]
21197 stuart normal r04 8:24 192 08:00 R 00:25 4109 None [sky|BRO]
21198 stuart normal r06 8:24 192 08:00 R 00:22 4112 None [sky|BRO]
21199 stuart normal r24 6:32 192 08:00 R 00:22 4112 None [SKY|bro]
21183 lhalstro normal m0.40a0.0_fso3_dcfpe 12:32 384 08:00 R 03:47 11857 None [SKY]
21187 lhalstro hipri m0.80a0.0_fso3_dcfpe 16:24 384 08:00 R 02:34 11857 None [BRO|sky]
21189 lhalstro hipri m0.65a0.0_fso3_dcfpe 16:24 384 08:00 R 02:14 11857 None [sky|BRO]
21195 aschwing hipri m4.00a10_dx0.18_umb3 6:24 144 05:00 R 01:27 11940 None [BRO|sky]
21196 aschwing hipri m4.00a7.11b-7.05_dx0 5:32 160 05:00 R 01:20 11940 None [bro|SKY]
21200 aschwing hipri m3.00a20_dx0.25_umb3 6:24 144 04:00 R 01:03 11940 None [BRO|sky]
21204 aschwing hipri m0.50a-20_dx0.25_umb 5:32 160 04:00 R 00:02 11940 None [bro|SKY]
21205 aschwing hipri m3.00a-20_dx0.25_umb 6:24 144 04:00 R 00:02 11940 None [BRO|sky]

Stats: total (108/2880) || bro ( 72/1728) | sky ( 36/1152)
S Node CPU Job || S Node CPU Job | S Node CPU Job
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ----
C 11 304 2 || C 6 144 1 | C 5 160 1
Q 60 1728 6 || Q 36 864 6 | Q 24 864 6
R 105 2784 13 || R 72 1728 8 | R 33 1056 5
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ----
C 10% 10% || C 8% 0% | C 13% 0%
Q 55% 60% || Q 50% 2% | Q 66% 2%
R 97% 96% || R 100% 4% | R 91% 2%
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ----
F 3 96 || F 0 0 | F 3 96

User Stats:
           ------------ cores ------------ ------------ jobs ------------
User       # running # pending total # running # pending total
aschwing   1056 864 2224 7 6 15
lhalstro   1152 0 1152 3 0 3
stuart     576 0 576 3 0 3
---------- ---------- ---------- ---------- ---------- ---------- ----------
total      2784 864 3952
This isn't the best example of the problem (all jobs are running), but it does show that a couple of Alan's jobs that are running in normal have a Priority as if they were running in hipri. This seems like the root of the problem to me. Let me know if you want me to upload the slurmctld log file or the other files I mentioned above.

[dvicker@europa run]% date
Thu Jul 26 14:45:05 CDT 2018
[dvicker@europa run]% qp
Job ID Username Queue Jobname N:ppn Proc Wall S Elap Prio Reason Features
-------- -------- ------- -------------------- ----- ---- ----- - ----- ------ ---------- --------------
21210 lhalstro normal m0.80a0.0_fso3_dcfpe 12:32 384 08:00 R 00:45 3884 None [bro|SKY]
21213 lhalstro normal m0.65a0.0_fso3_dcfpe 12:32 384 08:00 R 00:04 3896 None [SKY|bro]
21212 aschwing normal m0.50a0b-20_dx0.25_u 6:24 144 04:00 R 01:03 3940 None [BRO|sky]
21201 aschwing normal m3.00a20_dx0.25_umb3 6:24 144 04:00 R 01:26 4010 None [BRO|sky]
21203 aschwing normal m0.50a0b-20_dx0.25_u 5:32 160 04:00 R 01:07 4013 None [bro|SKY]
21202 aschwing normal m0.50a0b-20_dx0.25_u 6:24 144 04:00 R 01:11 4018 None [BRO|sky]
21197 stuart normal r04 8:24 192 08:00 R 03:08 4109 None [sky|BRO]
21198 stuart normal r06 8:24 192 08:00 R 03:05 4112 None [sky|BRO]
21199 stuart normal r24 6:32 192 08:00 C 03:00 4112 None [SKY|bro]
21208 aschwing hipri m3.00a20_dx0.25_umb3 6:24 144 04:00 R 01:41 11940 None [BRO|sky]
21211 aschwing normal m3.00a0b-20_dx0.25_u 6:24 144 04:00 R 01:13 11940 None [BRO|sky]
21214 aschwing normal m3.00a-20_dx0.25_umb 6:24 144 04:00 R 01:00 11940 None [BRO|sky]
21215 aschwing hipri m4.00a10_dx0.18_umb3 6:24 144 05:00 R 00:49 11940 None [BRO|sky]
21216 aschwing hipri m0.50a-20_dx0.25_umb 6:24 144 04:00 R 00:45 11940 None [BRO|sky]
21217 aschwing hipri m3.00a-20_dx0.25_umb 6:24 144 04:00 R 00:45 11940 None [BRO|sky]
21218 aschwing hipri m0.50a20_dx0.25_umb3 5:32 160 04:00 R 00:40 11940 None [bro|SKY]

Stats: total (108/2880) || bro ( 72/1728) | sky ( 36/1152)
S Node CPU Job || S Node CPU Job | S Node CPU Job
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ----
C 6 192 1 || C 0 0 0 | C 6 192 1
R 104 2768 15 || R 70 1680 11 | R 34 1088 4
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ----
C 5% 6% || C 0% 0% | C 16% 0%
R 96% 96% || R 97% 4% | R 94% 2%
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ----
F 4 112 || F 2 48 | F 2 64

User Stats:
           ------------ cores ------------ ------------ jobs ------------
User       # running # pending total # running # pending total
aschwing   1616 0 1616 11 0 11
lhalstro   768 0 768 2 0 2
stuart     384 0 576 2 0 3
---------- ---------- ---------- ---------- ---------- ---------- ----------
total      2768 0 2960 15 0 16

[dvicker@europa run]%
Current state below.

[dvicker@europa run]% date
Thu Jul 26 16:23:45 CDT 2018
[dvicker@europa run]% qp
Job ID Username Queue Jobname N:ppn Proc Wall S Elap Prio Reason Features
-------- -------- ------- -------------------- ----- ---- ----- - ----- ------ ---------- --------------
21210 lhalstro normal m0.80a0.0_fso3_dcfpe 12:32 384 08:00 R 02:24 3884 None [bro|SKY]
21213 lhalstro normal m0.65a0.0_fso3_dcfpe 12:32 384 08:00 R 01:43 3896 None [SKY|bro]
21247 aschwing normal,h m0.50a-10_dx0.25_umb 6:24 144 05:00 Q 00:13 3948 Priority [bro|sky]
21245 aschwing normal,h m3.00a-20_dx0.25_umb 6:24 144 04:00 Q 00:17 3951 Priority [bro|sky]
21243 aschwing normal,h m1.20a0b-10_dx0.25_u 6:24 144 05:00 Q 00:20 3954 Priority [bro|sky]
21241 aschwing normal,h m0.50a20_dx0.25_umb3 6:24 144 04:00 Q 00:22 3955 Priority [bro|sky]
21224 aschwing hipri m0.50a0b-10_dx0.25_u 6:24 144 05:00 R 00:50 3962 None [BRO|sky]
21235 aschwing normal,h m4.00a10_dx0.18_umb3 6:24 144 05:00 Q 00:50 3974 Priority [bro|sky]
21225 aschwing hipri m0.50a10_dx0.25_umb0 5:32 160 05:00 R 00:22 3980 None [bro|SKY]
21226 aschwing hipri m1.10a-10_dx0.25_umb 6:24 144 05:00 R 00:20 3983 None [BRO|sky]
21227 aschwing hipri m1.10a0b-10_dx0.25_u 6:24 144 05:00 R 00:17 3983 None [BRO|sky]
21228 aschwing hipri m1.10a10_dx0.25_umb0 6:24 144 05:00 R 00:13 3986 None [BRO|sky]
21232 aschwing normal,h m3.00a20_dx0.25_umb3 6:24 144 04:00 Q 01:10 3988 Priority [bro|sky]
21229 aschwing normal,h m1.10a-10_dx0.25_umb 6:24 144 05:00 Q 01:23 3997 Priority [bro|sky]
21230 aschwing normal,h m1.10a0b-10_dx0.25_u 6:24 144 05:00 Q 01:23 3997 Priority [bro|sky]
21246 stuart normal r07 8:24 192 08:00 Q 00:17 4027 Priority [bro]
21244 stuart normal r06 6:32 192 08:00 Q 00:19 4029 Priority [sky|bro]
21242 stuart normal r04r 8:24 192 00:05 Q 00:21 4030 Resources [bro]
21219 aschwing normal m3.00a20_dx0.25_umb3 6:24 144 04:00 R 01:24 11940 None [BRO|sky]
21231 aschwing normal m3.00a0b-20_dx0.25_u 6:24 144 04:00 R 01:11 11940 None [BRO|sky]
21234 aschwing normal m3.00a-20_dx0.25_umb 6:24 144 04:00 R 00:59 11940 None [BRO|sky]
21236 aschwing normal m0.50a0b-20_dx0.25_u 5:32 160 04:00 R 00:45 11940 None [bro|SKY]
21237 aschwing normal m1.20a-10_dx0.25_umb 6:24 144 05:00 R 00:40 11940 None [BRO|sky]
21238 aschwing normal m0.50a-10_dx0.25_umb 6:24 144 05:00 R 00:39 11940 None [BRO|sky]
21239 aschwing normal m0.50a10_dx0.25_umb0 6:24 144 05:00 R 00:38 11940 None [BRO|sky]
21240 aschwing normal m0.50a0b-10_dx0.25_u 6:24 144 05:00 R 00:37 11940 None [BRO|sky]
21248 aschwing normal m1.10a10_dx0.25_umb0 6:24 144 05:00 R 00:05 11940 None [BRO|sky]

Stats: total (108/2880) || bro ( 72/1728) | sky ( 36/1152)
S Node CPU Job || S Node CPU Job | S Node CPU Job
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ----
Q 110 3072 11 || Q 72 1728 11 | Q 38 1344 9
R 106 2816 16 || R 72 1728 12 | R 34 1088 4
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ----
Q 101% 106% || Q 100% 4% | Q 105% 3%
R 98% 97% || R 100% 4% | R 94% 2%
- ---- ----- ---- || - ---- ----- ---- | - ---- ----- ----
F 2 64 || F 0 0 | F 2 64

User Stats:
           ------------ cores ------------ ------------ jobs ------------
User       # running # pending total # running # pending total
aschwing   2048 1152 3200 14 8 22
lhalstro   768 0 768 2 0 2
stuart     0 576 576 0 3 3
---------- ---------- ---------- ---------- ---------- ---------- ----------
total      2816 1728 4544 16 11 27

[dvicker@europa run]%

Again, I think the problem is that Alan's normal jobs all have a high priority, but his hipri jobs have a low priority.
From sprio:

[dvicker@europa run]% sprio -l
JOBID PARTITION USER PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS NICE TRES
21229 hipri aschwing 11997 57 0 1941 10000 0 0
21229 normal aschwing 3997 57 0 1941 2000 0 0
21230 hipri aschwing 11997 57 0 1941 10000 0 0
21230 normal aschwing 3997 57 0 1941 2000 0 0
21232 hipri aschwing 11988 48 0 1941 10000 0 0
21232 normal aschwing 3988 48 0 1941 2000 0 0
21235 hipri aschwing 11974 34 0 1941 10000 0 0
21235 normal aschwing 3974 34 0 1941 2000 0 0
21241 hipri aschwing 11955 15 0 1941 10000 0 0
21241 normal aschwing 3955 15 0 1941 2000 0 0
21242 normal stuart 4030 14 93 1924 2000 0 0
21243 hipri aschwing 11954 13 0 1941 10000 0 0
21243 normal aschwing 3954 13 0 1941 2000 0 0
21244 normal stuart 4029 13 93 1924 2000 0 0
21245 hipri aschwing 11951 11 0 1941 10000 0 0
21245 normal aschwing 3951 11 0 1941 2000 0 0
21246 normal stuart 4027 11 93 1924 2000 0 0
21247 hipri aschwing 11948 8 0 1941 10000 0 0
21247 normal aschwing 3948 8 0 1941 2000 0 0
[dvicker@europa run]%

You can see that all of Alan's jobs are listed twice, once for each of the partitions. The priorities listed there look good: hipri has the high priority and normal has the low priority. But this is reversed in the actual running jobs. For example:

[root@europa slurm_data]# scontrol show job 21248 | grep -e Priority -e Part
   Priority=11940 Nice=0 Account=aerolab QOS=normal
   Partition=normal AllocNode:Sid=r1i0n30:24340
[root@europa slurm_data]#

This job is running in the normal queue (and QOS=normal), but it got the hipri Priority value. This sure seems like a bug to me; please help me understand if I'm wrong. I'm going to upload the diagnostic files from today too, including the slurmctld log with "scontrol setdebugflags +backfill" enabled.
Created attachment 7432 [details] Diagnostic files
I think I see what the problem is. I believe this is not a scheduling problem but a limitation in how squeue and scontrol show job display a job's priority, which I understand might be causing you some confusion.

A job's record has these two members:

   uint32_t priority;         /* relative priority of the job,
                               * zero == held (don't initiate) */
   uint32_t *priority_array;  /* partition based priority */

While the sprio command disaggregates a job's priority, showing each of the values stored in priority_array for each partition, squeue and scontrol show job display Priority based on the current value of the priority struct member and do not disaggregate by partition. Since that value fluctuates as the scheduler tries to schedule the job in one partition or the other, the priority you see depends on the moment you query squeue or scontrol: it will be the value for one or the other partition.

I'm going to dig into this further and come back to you.
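As a concrete illustration of that difference, using job 21229 from your sprio output (the commands below are just an example of how to compare the two views for one job):

   # sprio prints one row per partition for a multi-partition job:
   sprio -l -j 21229

   # squeue and scontrol print a single Priority value, taken from whichever
   # partition the scheduler considered most recently:
   squeue -j 21229 -o "%.8i %.10P %.10Q"
   scontrol show job 21229 | grep Priority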
Any updates on this today? We've continued to log info over the weekend if you want more data. I'm not convinced this is just a reporting issue, since the entire cluster will drain in preference for one person's jobs if they are the only person using hipri. But I'll be eager to hear what you find out.
Greetings, Alejandro is out of the office this week, so I have asked Felip to look over your latest update and respond. Best regards, Jason
Hi,

This is a long thread, so I am taking some time to get up to date with everything.

What Alex wrote in comment 37 is true. When you have a job submitted to multiple partitions and query for the information of that job, Slurm internally provides the information for the partition currently being considered for scheduling, so the displayed priority may not match the displayed partition. In Slurm 18.08 this is addressed by creating an array of priorities that matches the list of partitions. To summarize, the priorities you see may be wrong and may not match the partition that is being shown.

I am still reading through everything. If I am not misunderstanding, you are still after your initial request of emulating Maui's soft limits, i.e. you want: a user's first jobs, up to 768 cores in use, get high priority, and any further jobs get low priority.

I am still analyzing your comments 33, 34, 35 and 36, and will come back as soon as I have relevant feedback.
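For reference, here is a minimal sketch of how that goal could be approximated with a QOS instead of partition priorities. This is only an illustrative example, not your current setup; the QOS names, priority values and weight are assumptions. Note the semantics differ slightly from Maui's soft limits: jobs over the per-user cap pend on the QOS limit rather than being demoted, so the remaining work would be submitted under the lower-priority QOS.

   # High-priority QOS capped at 768 cores per user:
   sacctmgr add qos hipri
   sacctmgr modify qos hipri set Priority=10000 MaxTRESPerUser=cpu=768
   sacctmgr modify qos normal set Priority=0

   # slurm.conf would also need a non-zero QOS weight for this to take effect:
   #   PriorityWeightQOS=10000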
Thanks for the update, Felip. A couple of comments:

> To summarize, the priorities you see may be wrong and may not match the
> partition that is being shown.

I still think this is deeper than just a reporting issue, because of the following:

- several jobs are submitted by various users using only normal
- many jobs are submitted by a single user using both normal and hipri
- all of the normal+hipri jobs run ahead of the jobs using only normal

This shouldn't happen - once 768 cores' worth of that user's hipri jobs are running, scheduling should return to a mix of normal jobs from various users (as it does when everyone uses only normal). That isn't happening.

> If I am not misunderstanding, you are still after your initial request of
> emulating Maui's soft limits, i.e. you want: a user's first jobs, up to 768
> cores in use, get high priority, and any further jobs get low priority.

That is correct.
> I still think this is deeper than just a reporting issue

I see what you're saying. I will try to reproduce it. By the way, you are on 17.11.5, right?
Thanks. That is correct, we are on 17.11.5.
I've been able to reproduce this and can confirm it isn't just a display issue but a scheduling issue.
Thanks for the confirmation. Do you have an estimate of when you might have a fix? Part of the reason I ask is that the person who has been testing this on our end is leaving soon (their last day is Thursday). It would be really nice to test a fix before that happens.
Created attachment 7535 [details] patch not reviewed

In my local tests, this patch fixes the scheduling stall issue, although it is still pending review. I have asked Moe to review it, but we're a bit overloaded these days with the 18.08 release candidate. Feel free to try it ahead of review or wait for Moe's opinion on it.
Excellent - thanks. I'll give this a try and let you know how it works for us.
I upgraded to 17.11.8 with the patch you supplied yesterday morning. Should that patch have fixed the priority values displayed by "scontrol show job"? If so, we are still not seeing that. See below for the output. Note that any job running in hipri should have a priority value at or above 10000, while the jobs running in the normal queue should be ~4000.

  Job ID Username Queue    Jobname               N:ppn Proc  Wall S  Elap   Prio Reason     Features
-------- -------- -------- -------------------- ------ ---- ----- - ----- ------ ---------- --------------
   23616 lhalstro normal   m0.80a0.0_fso3_dcfpe  12:32  384 08:00 R 03:05   3897 None       [bro|SKY]
   23620 lhalstro normal   m0.40a0.0_fso3_dcfpe  12:32  384 08:00 R 02:37   3897 None       [SKY]
   23635 lhalstro normal,h m0.40a0.0_dcfpeg5rdx  16:24  384 08:00 Q 00:35   3922 Resources  [bro]
   23606 lhalstro normal   m0.40a0.0_fso3_dcfpe  12:32  384 08:00 R 03:17   3948 None       [SKY]
   23627 aschwing normal   m3.00a3.54b-3.53_AMC   6:24  144 05:00 R 01:44   3957 None       [BRO|sky]
   23628 aschwing normal   m5.00a3.54b-3.53_AMC   6:24  144 05:00 R 01:39   3957 None       [BRO|sky]
   23630 aschwing normal   m4.00a3.54b-3.53_AMC   6:24  144 05:00 R 01:16   3957 None       [BRO|sky]
   23631 aschwing hipri    m2.00a3.54b-3.53_AMC   6:24  144 05:00 R 00:54   3957 None       [BRO|sky]
   23632 aschwing hipri    m0.80a3.54b-3.53_AMC   6:24  144 05:00 R 00:50   3957 None       [BRO|sky]
   23633 aschwing hipri    m0.90a3.54b-3.53_AMC   6:24  144 05:00 R 00:45   3957 None       [BRO|sky]
   23634 aschwing normal   m1.20a3.54b-3.53_AMC   5:32  160 05:00 R 00:36   3957 None       [bro|SKY]
   23636 aschwing hipri    m0.50a3.54b-3.53_AMC   6:24  144 05:00 R 00:19   3957 None       [BRO|sky]
   23637 aschwing hipri    m1.60a3.54b-3.53_AMC   5:32  160 05:00 R 00:19   3957 None       [bro|SKY]
   23638 aschwing normal   m1.40a3.54b-3.53_AMC   6:24  144 05:00 R 00:05   3957 None       [BRO|sky]
   23639 aschwing normal   m1.10a3.54b-3.53_AMC   6:24  144 05:00 R 00:03   3957 None       [BRO|sky]
   23622 lhalstro normal   m0.40a0.0_dcfpeg5rdx  16:24  384 08:00 R 00:35   3966 None       [BRO]
   23629 lhalstro hipri    m0.40a0.0_fso3_dcfpe  12:32  384 08:00 R 01:18  11897 None       [SKY]
   23600 lhalstro hipri    m0.40a0.0_fso3_dcfpe  12:32  384 08:00 R 04:32  11921 None       [SKY]

I'm going to upload another batch of diagnostic data from yesterday as well, including our slurmctld log.
Created attachment 7552 [details] Diagnostic files
No, the patch only fixes the scheduler portion, so that the same job submitted to multiple partitions is assigned the correct priority value in the job queue built for scheduling purposes. You should no longer see jobs submitted to both hipri and normal - but only able to run in normal - stalling other users' normal jobs as they did before.

The display side is unchanged: scontrol show job and squeue still show the priority for a job from the partition currently being considered; it is not disaggregated as in sprio. That would require another patch, but at least you shouldn't have the stalling problem you reported before.

Can you verify that the stalling problem isn't happening anymore?
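If it helps, here is one way a controlled re-test of the earlier stalling scenario could look (job.sh is just a placeholder batch script, and the sizes simply mirror the jobs in your earlier output):

   # user A fills hipri and keeps submitting to both partitions:
   sbatch -p hipri,normal --ntasks=144 -t 05:00:00 job.sh
   # another user submits normal-only work:
   sbatch -p normal --ntasks=192 -t 08:00:00 job.sh
   # the normal-only jobs should now start as resources free up; the
   # per-partition priorities can be inspected with:
   sprio -l -u aschwing,stuart
   squeue -o "%.8i %.9P %.8u %.2t %.10Q %r" --sort=-p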
Note also that the patch only affects newly submitted jobs, not jobs submitted before it was applied.
Hi,

The proposed patch was checked in to 17.11 (with some subtle modifications) in the following three commits:

(17.11.9)
https://github.com/SchedMD/slurm/commit/d2a1a96c54a6d556f2d91c1f24845a8f4089b41f

(17.11.9-2, due to an accidental bad cast introduced during review)
https://github.com/SchedMD/slurm/commit/21d2ab6ed1694bf7b12e824ce41b66bb143e22b3

(17.11.10)
https://github.com/SchedMD/slurm/commit/67a82c369a7530ce7838e6294973af0082d8905b

The patch I attached and you applied has the correct cast, so it is fine for you to continue with it.

I'd like to check in with you and confirm you're no longer experiencing the stalling problems you were reporting before. I'm aware that the client commands like squeue and scontrol show job could potentially change the way they list job info so that priorities are disaggregated by partition as sprio does, and we can talk further about addressing that later. But first I want to confirm that, since you applied the patch, jobs submitted to hipri,normal that can't run in hipri and can only run in normal are no longer stalling older jobs that were submitted only to normal. I'd highly appreciate any new feedback on this question.

Thanks!
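Should you later prefer the upstream commits over the attached patch, something along these lines should work against a local source tree (assuming a git clone of https://github.com/SchedMD/slurm and setting aside any merge conflicts on older tags):

   # check whether a given checkout already contains the main fix:
   git merge-base --is-ancestor d2a1a96c54a6d556f2d91c1f24845a8f4089b41f HEAD && echo present
   # or apply all three commits to an older 17.11 tree:
   git cherry-pick d2a1a96c54a6d556f2d91c1f24845a8f4089b41f \
                   21d2ab6ed1694bf7b12e824ce41b66bb143e22b3 \
                   67a82c369a7530ce7838e6294973af0082d8905b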
Hi, any updates on this? Thank you.
We've had the patch and our normal/hipri partitions in place in our environment for the last two weeks. Based on the day-to-day usage, things appear to be behaving as intended, and we have not seen 'hipri' jobs starving 'normal' jobs like we did in the past. We also performed a number of controlled tests that would previously have manifested the issue, and did not see any stalling after applying the patch. Our cluster has only been under a relatively light load, so there is the possibility of issues showing up later when there is increased contention, but based on our test cases we feel that is unlikely.

Thanks for the help. I'm glad that we could find and help diagnose this bug.
(In reply to NASA JSC Aerolab from comment #65)
> We've had the patch and our normal/hipri partitions in place in our
> environment for the last two weeks. Based on the day-to-day usage, things
> appear to be behaving as intended, and we have not seen 'hipri' jobs
> starving 'normal' jobs like we did in the past. We also performed a number
> of controlled tests that would previously have manifested the issue, and
> did not see any stalling after applying the patch. Our cluster has only
> been under a relatively light load, so there is the possibility of issues
> showing up later when there is increased contention, but based on our test
> cases we feel that is unlikely.
>
> Thanks for the help. I'm glad that we could find and help diagnose this bug.

Great, thanks for your feedback.

I've opened a separate sev-5 enhancement request[1] to track modifying the rest of the user commands so they can display job priority information disaggregated by partition, as sprio currently does.

I'm going to go ahead and close this bug as fixed. Please reopen if you encounter further issues.

Thank you.

[1] https://bugs.schedmd.com/show_bug.cgi?id=5614