Ticket 9726

Summary: Problems scheduling in a particular partition
Product: Slurm Reporter: Lloyd Brown <lloyd_brown>
Component: Scheduling    Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: lloyd_brown, ryan_cox
Version: 20.02.4   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=9795
Site: BYU - Brigham Young University Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed: 20.11pre1
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: Collection of configs, log excerpts, and output of various diagnostic commands (eg. sdiag, sinfo, etc.)
perf.data (compressed to save space)
associated perf.data.tar.bz2
sha256 sums of perf.data (uncompressed) and perf.data.tar.bz2
Excerpt of logs showing (slow?) backfill behavior
squeue m7 partition
sdiag.1599692151 before changing bf params
sdiag.1599692451 after changing bf params
sdiag.1599692751
sdiag.1599693051
sinfo.1599692301
output of "sprio -p m7"
output of "sacct -r m7 -p -o all"
output of "squeue -p m7"
slurmctld.log excerpt 2020-09-10 08:00-08:56
slurmctld.log excerpt 2020-09-09 17:00-19:01
veryveryverboselogs.tar.gz
output of "scontrol show res"

Description Lloyd Brown 2020-09-02 15:37:41 MDT
Created attachment 15713 [details]
Collection of configs, log excerpts, and output of various diagnostic commands (eg. sdiag, sinfo, etc.)

We're trying to figure out an issue with slurm and scheduling, where one of our partitions (m9) isn't getting new jobs to start.  This has been occurring for a day or two, despite a large number of eligible jobs available, and a large number of nodes available.  I'm including a number of details (logs, configs, etc.) here that we think are probably relevant.  In particular, the slurmctld.log should include at least 2 full backfill cycles, while I had the "backfill" debugflags set (eg. "scontrol setdebugflags +backfill").

We suspect that the problem relates to backfill, since we've historically relied on backfill quite heavily.  It's very possible that we've got something poorly tuned in relation to backfill.  So, while we definitely want to figure out the near-term problem with the m9 partition, if there are any more general backfill tuning recommendations you have, we'd love to hear about it.

We've had the vague impression in the past that backfill might not be operating optimally, but haven't had anything concrete before this.
Comment 1 Marcin Stolarek 2020-09-03 02:49:30 MDT
From your description and the attached diagnostic output, it looks like backfill is not reaching all jobs in the queue. This also applies to jobs in the m9 partition.

Based on sdiag and config, my first suggestion would be to increase bf_interval - let's start with doubling it to bf_interval=600.
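In slurm.conf, that change would look something like the fragment below (illustrative only - append bf_interval=600 to the existing SchedulerParameters list rather than replacing it; bf_continue and bf_max_job_user_part are options already in use at this site, mentioned later in the ticket - then run `scontrol reconfigure`):

```
SchedulerParameters=bf_continue,bf_max_job_user_part=20,bf_interval=600
```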

Another thing that would be helpful is perf data for slurmctld; please execute the following commands:
perf record -s --call-graph dwarf -p `pidof slurmctld` sleep 600
perf archive perf.data
then send us both perf.data and perf.data.tar.bz2. The .tar.bz2 is not a compressed archive of perf.data; it contains code annotations. This should help us understand your backfill and select plugin profiles and potentially optimize the code.

cheers,
Marcin
Comment 2 Marcin Stolarek 2020-09-03 04:01:15 MDT
Additionally, I see that 1192 of the 1336 jobs in the m9 partition were submitted by UserId=dcarls, which is quite uncommon - other users in that partition have no more than 32 jobs, and 10 is quite common. If you want to make the start of those jobs more efficient you can set bf_max_job_user_part=20, but it's a little bit of a policy question.

It's also important to note that those jobs are mainly pending because of dependency.
    720 Reason=Dependency
    442 Reason=Priority
      1 Reason=Resources


cheers,
Marcin
Comment 3 Lloyd Brown 2020-09-03 09:08:22 MDT
Created attachment 15726 [details]
perf.data (compressed to save space)
Comment 4 Lloyd Brown 2020-09-03 09:10:44 MDT
Created attachment 15727 [details]
associated perf.data.tar.bz2
Comment 5 Lloyd Brown 2020-09-03 09:11:32 MDT
Created attachment 15728 [details]
sha256 sums of perf.data (uncompressed) and perf.data.tar.bz2
Comment 6 Lloyd Brown 2020-09-03 09:17:17 MDT
Marcin,

We'll certainly look at the two tunings you've suggested (bf_interval, bf_max_job_user_part).  But as a quick sanity check, how many jobs should we reasonably expect to be started or considered per backfill iteration?  Anecdotally, we're not seeing the "Total backfilled jobs (since last slurm start)" statistic from sdiag increase very quickly; something on the order of 10-20 per 5-minute scheduling cycle.  I'm not sure if there's a better statistic to show how many jobs are being considered during each backfill cycle.

I've uploaded the perf.data and perf.data.tar.bz2, as requested.  I also included a sha256sum file, so you can double-check that it has not been corrupted; note that I compressed the perf.data to save space during the upload, but the sha256 file is for the uncompressed version.
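For reference, the checksum workflow here can be verified with `sha256sum -c` after decompressing; a minimal sketch with a stand-in file (the file names below are illustrative, not the actual attachment names):

```shell
# Create a stand-in file, record its checksum, then verify it -
# the same workflow applies to the uncompressed perf.data.
printf 'stand-in for perf.data\n' > perf.data.demo
sha256sum perf.data.demo > perf.data.demo.sha256
sha256sum -c perf.data.demo.sha256   # prints "perf.data.demo: OK" on success
```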
Comment 7 Marcin Stolarek 2020-09-03 13:10:15 MDT
You can look at:
> Last depth cycle: 772
> Last depth cycle (try sched): 87

The first value is the number of jobs that were checked by backfill during the cycle. The second is the number of jobs for which backfill actually attempted scheduling - after checking whether the job is blocked by any limits (QOS, user, association, backfill params) and whether it's eligible (--begin-time, job dependencies). The scheduling attempt is the time-consuming part of backfill.
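For quick monitoring, these two counters can be pulled out of sdiag output with awk; a sketch over an inline sample (the numbers are just the ones quoted above):

```shell
# Sample of the relevant sdiag lines (values from the quote above).
sdiag_sample='Last depth cycle: 772
Last depth cycle (try sched): 87'

# Field-split on ": "; the literal "cycle:" match excludes the "(try sched)" line.
checked=$(printf '%s\n' "$sdiag_sample" | awk -F': ' '/Last depth cycle:/ {print $2}')
tried=$(printf '%s\n' "$sdiag_sample" | awk -F': ' '/try sched/ {print $2}')
echo "checked=$checked tried=$tried"   # -> checked=772 tried=87
```

In production the sample would be replaced by `sdiag` itself, e.g. `sdiag | awk -F': ' '/try sched/ {print $2}'`.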

Looking at the logs, there is no line like `backfill test for JobId=XXX Prio=XXX Partition=m9`, which means that no jobs in the m9 partition were evaluated within the 15 minutes covered by the attached log. Backfill attempts jobs in priority order, so if none of the top 772 jobs in the whole queue were in m9, they won't be checked.

I think that you should increase bf_interval at least for the short term. I'll check perf results to see if we're hitting a performance bottleneck that can be eliminated.

cheers,
Marcin
Comment 8 Marcin Stolarek 2020-09-04 02:40:11 MDT
I'm not sure if the slot when the perf data was collected is fully representative. If it is, then ~40% of the time is spent preparing a reply for the REQUEST_JOB_INFO RPC, which is most probably a result of frequent execution of commands like `squeue` or `scontrol show job`. This may severely impact scheduler performance, since it requires acquisition of the job lock.

Based on the sdiag you shared before, the rate is pretty high, ~40 RPCs/min. Is it possible that you have such a command running somewhere under the root user (health check, prolog/epilog scripts)?

I think that you should also check the activity of the other top users (in terms of number of RPCs from sdiag): bmt43, erh24, brettss, benfogg. I'd start by looking at the login/submit host; however, you can also enable the Protocol debug flag[1], which should generate info messages like:
>slurmctld_req: received opcode REQUEST_JOB_INFO from X.X.X.X:YYY uid Z
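Once those Protocol messages are in the log, a quick tally of where REQUEST_JOB_INFO comes from can narrow down the culprit host; a sketch over an inline sample (the addresses and uids below are invented for the example):

```shell
# Illustrative slurmctld log lines in the format quoted above.
log_sample='slurmctld_req: received opcode REQUEST_JOB_INFO from 10.0.0.5:33412 uid 0
slurmctld_req: received opcode REQUEST_JOB_INFO from 10.0.0.5:33560 uid 0
slurmctld_req: received opcode REQUEST_JOB_INFO from 10.0.0.9:40122 uid 1042'

# Count REQUEST_JOB_INFO per source IP (strip the port after the colon).
printf '%s\n' "$log_sample" \
  | awk '/REQUEST_JOB_INFO/ { for (i = 1; i < NF; i++) if ($i == "from") { split($(i+1), a, ":"); print a[1] } }' \
  | sort | uniq -c | sort -rn
```

Against a real slurmctld.log, replace the `printf` with `grep REQUEST_JOB_INFO /var/log/slurmctld.log` (path is an assumption; use your actual log location).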

cheers,
Marcin
Comment 9 Marcin Stolarek 2020-09-04 03:15:47 MDT
It's also possible that you're hitting Bug 8978, which was fixed in 20.02.3. That bug was specifically related to low performance when preparing the reply to the REQUEST_JOB_INFO RPC for jobs started with multiple arguments to the final application - like:
>Command=/lustre/scratch/usr/dcarls0n/compute_blocks/n-hexatriacontane_param/eps0.133_sig4.040_n16_atom3/T_573.54K/NPT/workspace60/start_NPT.sbatch n-hexatriacontane 800 573.54 0.005784 1046554.3851834491 740000 4 PotoffDJC UA pascal
Your case is not as critical as the related bug, but considering the number of arguments and the number of jobs using them (~2k jobs with more than 10 arguments to the command), I think the performance improvement for this RPC will be visible for you.

Upgrading to 20.02.3 is just a minor release change and should be straightforward.

cheers,
Marcin
Comment 11 Lloyd Brown 2020-09-04 10:34:09 MDT
Marcin,

We've just upgraded the slurmdbd/slurmctld servers to 20.02.4, and it appears to be functional.  I don't know if performance is any better yet, but it's not broken, so that's something.  Hopefully it'll be clearer over the next couple of hours whether the problems are improving or not.

As far as the frequent RPCs go, I'll see what we can figure out there.  We'd already been considering implementing a cached request mechanism, at least for our login/submission nodes.  That should reduce the number of RPCs from users fairly significantly, since it won't run the actual command more than once every 10 seconds.
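A minimal sketch of such a cache wrapper (hypothetical, not the site's actual implementation - `cached_run` is an invented name): run the wrapped command at most once per TTL seconds and serve the cached output otherwise.

```shell
# cached_run TTL CMD ARGS... - replay cached output if it is younger than TTL seconds.
cached_run() {
  ttl=$1; shift
  # Derive a cache file name from the command line (non-alphanumerics -> '_').
  cache="/tmp/rpc_cache_$(printf '%s' "$*" | tr -c 'A-Za-z0-9' '_')"
  if [ -f "$cache" ] && [ $(( $(date +%s) - $(stat -c %Y "$cache") )) -lt "$ttl" ]; then
    cat "$cache"          # cache hit: no RPC to slurmctld
  else
    "$@" | tee "$cache"   # cache miss: run the real command once, keep a copy
  fi
}

# Demo with a harmless stand-in; in practice this would wrap squeue/sinfo.
rm -f /tmp/rpc_cache_echo_fake_squeue_output   # start clean for the demo
cached_run 10 echo "fake squeue output"        # runs the command
cached_run 10 echo "fake squeue output"        # served from cache within 10s
```

Note `stat -c %Y` is GNU coreutils, which matches the Linux platform in this ticket.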

I'll try to report back again by the end of the day.

Lloyd
Comment 12 Lloyd Brown 2020-09-04 13:40:36 MDT
Marcin,

One clarification question.  When I see a line like "Last depth cycle: 4745" in the output of sdiag, is that strictly speaking the number of jobs being evaluated during the backfill cycle, or is it the job/partition combinations being evaluated?

We understand that when jobs exist in multiple partitions, then that job will be separately evaluated in each of those partitions, and that often this is a cause of the scheduling process slowing down.

If it's true that it's the job/partition combinations being counted, that would go a long way in explaining how we can have ~4700 jobs evaluated according to sdiag, with eligible m9 jobs starting around ~2700 in the priority order, and still have them not get started.  For example, if 2000 jobs ahead of that eligible m9 job had 3 partitions each (which is not uncommon here), then that could easily push the m9 job in question off the end of the consideration list.
Comment 13 Marcin Stolarek 2020-09-06 23:58:41 MDT
>job/partition combinations being evaluated?
It's a job/partition combination. Since a job can have different priorities in different partitions, every combination is evaluated separately. The job queue for backfill is sorted by priority, so it may happen that a job is evaluated in one partition and not in another.

>[...]job had 3 partitions each (which is not uncommon here[...]
I think that we better understand the issue now, and maybe instead of removing/increasing bf_max_job_user_part=20 you should switch to bf_max_job_user=150? Looking at the scontrol show job results, you had 33 users with pending jobs, which gives 4700/33=142. This may result in lower-priority jobs starting before high-priority jobs (the same applies to bf_max_job_user_part), but it will limit the impact of users submitting to multiple partitions on the scheduling of other users' jobs.

>As far as the frequent RPCs[...]
Based on the perf results you shared, I have a patch that can potentially speed up processing of REQUEST_JOB_INFO further. I've moved it to our review queue. Let me know if you'd like to test it before the QA.

cheers,
Marcin

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_bf_max_job_part=#
[2] https://slurm.schedmd.com/slurm.conf.html#OPT_bf_max_job_user_part=#
Comment 14 Lloyd Brown 2020-09-08 16:16:03 MDT
Marcin,

I'm not entirely sure I understand.  If we already have bf_max_job_user_part=20, wouldn't that be effective in limiting the number of jobs per user to 20 * number-of-partitions-requested (often 3-5) per backfill cycle?  I'm not entirely sure how removing that and adding bf_max_job_user=150 would help, unless bf_max_job_user_part is somehow not functioning correctly.

And even aside from backfill, we're seeing jobs that are eligible to use m9 nodes, and are being evaluated (the "LastSchedEval" in "scontrol show job" is being updated), but still are pending.  If we're correct, and that's only updated by the main scheduling loop, and not the backfill scheduler, then it's not clear why this would be happening.

Here's an example:

root@sched1:~# scontrol show job 38254202
JobId=38254202 JobName=kfT1150n20r.sh
   UserId=bbattraw(23467) GroupId=bbattraw(24118) MCS_label=N/A
   Priority=100889 Nice=0 Account=stelladn QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-23:56:00 TimeMin=N/A
   SubmitTime=2020-09-07T20:29:54 EligibleTime=2020-09-07T20:29:54
   AccrueTime=2020-09-07T20:30:13
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-08T16:09:34
   Partition=m9 AllocNode:Sid=m8g-1-8:136017
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=28 NumTasks=28 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=28,mem=126000M,node=1,billing=28
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4500M MinTmpDiskNode=0
   Features=rhel7&ib DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/lustre/scratch/grp/fslg_nickerson/MSR/Data/kfT1150n20/kfT1150n20r.sh
   WorkDir=/lustre/scratch/grp/fslg_nickerson/MSR/Data/kfT1150n20
   StdErr=/lustre/scratch/grp/fslg_nickerson/MSR/Data/kfT1150n20/slurm-38254202.out
   StdIn=/dev/null
   StdOut=/lustre/scratch/grp/fslg_nickerson/MSR/Data/kfT1150n20/slurm-38254202.out
   Switches=1@00:20:00
   Power=
   MailUser=(null) MailType=NONE

root@sched1:~#

In this case, there are about 128 available m9 nodes, all of which have 28 processor cores, 128GB of RAM, and the "rhel7" and "ib" node features.  Since this particular job is being evaluated (recent LastSchedEval), and still isn't starting, we're seriously confused.
Comment 15 Marcin Stolarek 2020-09-09 08:00:09 MDT
>removing that and adding bf_max_job_user=150 would help
The reasoning behind that, in a few points, was:
- You have ~9k jobs, some of them in 3-5 partitions, so roughly 9k*4 = 36k job/partition pairs.
- In the current configuration and load, backfill looks able to handle ~4k job/partition pairs per cycle.
- If a specific user is submitting jobs only to lower-priority partitions, we may never reach those jobs.
- At the same time, backfill evaluates up to 20 jobs per user per partition in priority order, potentially starving users who submitted to an idle partition (because their jobs are low priority).
- If we change the "per user per partition" limit to just "per user", we give a fairer backfill share to users, especially those submitting to only one partition (like the 1255 jobs pending in m9 from dcarlxxx).

It's not perfect, since any limit like this may let a lower-priority job start while starving a higher-priority one, but that's the thinking behind it.
---

Are you still experiencing the original issue after the upgrade? I believe that the performance degradation came from the bug mentioned in comment 9.
Could you please collect current state info as you did when opening the case, and gather new perf results? They should differ substantially.


cheers,
Marcin
Comment 16 Lloyd Brown 2020-09-09 08:24:23 MDT
Marcin,

I don't mean to be rude, but I'm not sure you've fully understood the details from my last post.

We are seeing jobs that:
- Are eligible to start (not held, not waiting on dependency or limits, etc)
- Have sufficient resources available to start
- Have their "LastSchedEval" updated (which I *think* means they're being evaluated by the main scheduling loop, and not just the backfill)
- Are still not starting

Given this, I don't think this is a problem with only the backfill scheduling cycle.  While we thought so in the beginning, this seems like it might be a deeper problem.

While we probably should do some work to optimize the backfill in our situation, we no longer think that's the primary problem here.
Comment 17 Marcin Stolarek 2020-09-09 08:46:51 MDT
Lloyd, 

>While we thought so in the beginning, this seems like it might be a deeper problem.

Deeper, or just different from what we had at the beginning. (Both sdiag and perf showed that REQUEST_JOB_INFO was processed slowly, and it was slow in the specific place that got fixed.) That's why I'd like to look at logs, diagnostic command output, and perf for the current state.

Do you see jobs waiting with the "(Resources)" reason that you believe should start? That would suggest an issue in the select plugin. Are you able to collect logs with the SelectType debug flag enabled where such a job evaluation is logged?
A higher debug level may be helpful, but I understand that, depending on your log processing infrastructure, some levels may not be possible/production safe.

cheers,
Marcin
Comment 18 Lloyd Brown 2020-09-09 10:12:27 MDT
Created attachment 15808 [details]
Excerpt of logs showing (slow?) backfill behavior
Comment 19 Lloyd Brown 2020-09-09 10:12:44 MDT
Marcin,

I will try to get the logs/output that you're asking for.  It may take a bit of time, since we've got some external mitigations in place, that will confuse things pretty significantly.  We've turned those off for now, but it may be hours (or possibly days) before the problem reasserts itself, and I can get logs that exhibit the behavior in question.

Going back briefly to the hypothesis of backfill operating slowly, what's a reasonable rate of jobs/second for backfill processing?  I'm attaching a log excerpt (backfill_log_excerpt_8Sept2020) for example, from a test when most of the partitions were disabled (therefore backfill should've run quite quickly).  If I'm interpreting it right, it looks like it took ~13 seconds (21:00:30.774 to 21:00:43.439) to progress from 480 to 482 jobs considered.  Am I misunderstanding the log entries?  What are those numbers in the "yielding locks after testing" lines referring to?
Comment 20 Lloyd Brown 2020-09-09 10:23:27 MDT
To clarify, when I said "most of the partitions were disabled", I meant they were in a "down" state, eg. using syntax like this:

scontrol update partitionname=m7 state=down
Comment 23 Nate Rini 2020-09-09 16:30:06 MDT
(In reply to Lloyd Brown from comment #19)
> Marcin,
Marcin is out currently and I'll be covering this ticket until his return.
 
> I will try to get the logs/output that you're asking for.  It may take a bit
> of time, since we've got some external mitigations in place, that will
> confuse things pretty significantly.  We've turned those off for now, but it
> may be hours (or possibly days) before the problem reasserts itself, and I
> can get logs that exhibit the behavior in question.
 
> Going back briefly to the hypothesis of backfill operating slowly, what's a
> reasonable rate of jobs/second for backfill processing?
This depends on the complexity of the jobs and how backfill is configured to run.

>        Total backfilled jobs (since last slurm start): 160
This looks pretty low given the overall number of jobs running.

>        Last cycle: 276354510 -> 276s
>        Max cycle:  294713580 -> 294s
>        Mean cycle: 270291398 -> 270s
Backfill is currently running for a very long time, and bf_interval is also very long. Considering bf_continue is set, the backfiller's view is likely getting out of date before it can do anything useful.

> I'm attaching a log
> excerpt (backfill_log_excerpt_8Sept2020) for example, from a test when most
> of the partitions were disabled (therefore backfill should've run quite
> quickly).  If I'm interpreting it right, it looks like it took ~13 seconds
> (21:00:30.774 to 21:00:43.439) to progress from 480 to 482 jobs considered. 
Note the "yielding locks" part, where backfill is releasing its locks. Backfill is actually only running for 2s at a time here. I think tweaking the backfiller settings might be useful here.

Please change the following in SchedulerParameters and enable all the partitions:
> bf_interval=2
Don't give up the locks for so long between backfill runs.
> bf_window=24192
Based on partitions.conf, the max wall time for any job is 16 days. Let's increase this to just a little past the 16-day limit.

> assoc_limit_continue
This is deprecated: please remove it. Should it be assoc_limit_stop instead?

> max_rpc_cnt=200
Let's increase this to have fewer lock interruptions.

>defer
Let's add this to reduce lock contention on job submission, though new jobs may no longer start instantly via the quick scheduler.

> batch_sched_delay=10
Increase this from the default (3s) to pile up more batch jobs at once for scheduling.

> bf_job_part_count_reserve=10
Start encouraging bf to reserve resources for job starts to avoid starvation. This may be the issue seen and worked around in comment #20.

> bf_max_job_test=500
Let's reduce this value, since the further we go down the queue, the less likely jobs are to be able to start.

> bf_max_time=600
Let's make sure the backfiller no longer runs so long that it ends up working with potentially out-of-date information about nodes.

> bf_min_prio_reserve=??
I suggest setting this option too. I don't have enough information to suggest a value, though; please take a look at your sprio output to determine one.

> bf_yield_interval=2500000
Slightly increase the time that bf will hold locks.

> bf_yield_sleep=100000
Let's decrease the time that the backfiller will sleep during yields.
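Taken together, the proposed changes would amount to a slurm.conf SchedulerParameters line along these lines (an illustrative consolidation of the suggestions above, not a validated configuration - bf_min_prio_reserve is omitted because its value is still undetermined, and existing options such as bf_continue and bf_max_job_user_part should be kept):

```
SchedulerParameters=defer,batch_sched_delay=10,bf_continue,bf_interval=2,bf_window=24192,bf_job_part_count_reserve=10,bf_max_job_test=500,bf_max_time=600,bf_yield_interval=2500000,bf_yield_sleep=100000,max_rpc_cnt=200
```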

Is it possible to get sdiag output once every 5 minutes 3 times after the changes are applied?
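The collection loop for that can be scripted; a sketch where `snapshot_series` is an invented helper, demonstrated with a stand-in command and zero interval (in production it would be `snapshot_series sdiag 3 300`):

```shell
# Take COUNT timestamped snapshots of a command's output, INTERVAL seconds
# apart, mirroring the sdiag.<epoch> attachment names used in this ticket.
snapshot_series() {
  cmd=$1; count=$2; interval=$3
  i=1
  while [ "$i" -le "$count" ]; do
    $cmd > "snap.$(date +%s).$i"
    if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
}

# Demo with a harmless stand-in command and no delay between samples.
snapshot_series "echo demo" 3 0
ls snap.*
```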
Comment 24 Lloyd Brown 2020-09-09 16:39:38 MDT
Nate,

We are looking at these suggestions now, and will try to get them implemented shortly.

Do you have any insight into the question about jobs that are (we think) being evaluated by the main (non-backfill) scheduling loop and still aren't starting?  Am I misinterpreting the updates to the jobs' LastSchedEval?  It's not clear from the description whether that field is tied to both the main and backfill schedulers, or just the main one (which is my assumption).
Comment 25 Nate Rini 2020-09-09 16:50:05 MDT
(In reply to Lloyd Brown from comment #24)
> Am I misinterpreting the updates to the jobs' LastSchedEval?

I assume you mean this question:

(In reply to Lloyd Brown from comment #14)
> nodes, and are being evaluated (the "LastSchedEval" in "scontrol show job"
> is being updated), but still are pending.  If we're correct, and that's only

LastSchedEval is updated every time the scheduler looks at the job. This could mean the job was immediately found to be ineligible to run, or that the scheduler attempted to reserve resources and run it. It's the equivalent of access time on a filesystem... not as helpful as one might hope.

> It's not clear from the description if that's in connection to either the
> main or backfill scheduler, or just the main one (which is my assumption).
Slurm uses a single select plugin to actually evaluate the jobs, so it's not recorded there. I believe there was an RFE ticket to actually record which scheduler touched it last (backfill, main, or fast), but I would have to look for it.

> Do you have any insight into the question about jobs that are (we think)
> being evaluated by the main (non-backfill) scheduling loop, and still aren't
> starting?
Usually the only way to know for sure is to activate the Backfill and SelectType debug flags and raise the slurmctld logging to debug2+. That amount of logging can really slow down a system and should only be done for short intervals.
Comment 27 Lloyd Brown 2020-09-09 16:58:37 MDT
So, it could be touched by either of the scheduling loops.  Good to know.

The example I included was a job that was having its LastSchedEval updated but wasn't starting, and had the reason "Priority", obviously meaning there was a higher-priority job that hadn't started yet.  I probably should've used an example where it wasn't started because of "Resources" instead.

The pattern at the time was to have a large number of jobs that couldn't start (eg. for "Dependency"), then one that had "Resources", then a large number that had "Priority" as their reason.  That doesn't seem to be the case at the moment, though, so I can't find you a good example right now.
Comment 28 Nate Rini 2020-09-09 17:02:08 MDT
(In reply to Lloyd Brown from comment #27)
> The example I included was a job that was having it's LastSchedEval updated,
> but wasn't starting, and had the reason "Priority", obviously meaning there
> was a higher-priority job that wasn't started yet.  I probably should've
> used an example where it wasn't started because of "Resources" instead.
Only the last change to the reason is shared with the user (via scontrol show job). We need to enable higher debug logging to know exactly why.
 
> The pattern at the time was to have a large number of jobs that couldn't
> start (eg. for "Dependency"), then one that had "Resources", then a large
> number that had "Priority" as their reason.  That doesn't seem to be the
> case at the moment, though, so I can't find you a good example right now.
Slurm orders strictly by priority but may not change the reason if another, more specific reason is the cause. Do you have sprio output for these jobs from around that time?
Comment 29 Ryan Cox 2020-09-09 17:14:35 MDT
(In reply to Nate Rini from comment #23)

Nate,

Thanks for all of these.  These were just applied, with the following exceptions as described below:

> Please change the following in SchedulerParameters and enable all the
> partitions:
> > bf_window=24192

That may be in partitions.conf, but that is an anomaly we'll ignore in the name of performance :).  They're way off the reservation, so to speak, in going above 7 days so we'll ignore them for now :)

> 
> > assoc_limit_continue
> This is deprecated: please remove it. Should it be assoc_limit_stop instead?

There was a reason we had this enabled and it fixed actual problems for us. Removed.


> > max_rpc_cnt=200
> Lets increase this to have fewer lock interruptions.

Done, though we had RPC timeouts so we had to enable this

> > bf_max_job_test=500
> Lets reduce this value as we go down the queue, less likely jobs will be
> able to start.

We have bf_max_job_user_part=20 set currently.  We would rather not enable bf_max_job_test because people often submit to multiple partitions, and some people submit thousands of jobs.  bf_max_job_user_part has been good at being a sane limit so far, as far as we can tell.

> > bf_min_prio_reserve=??
> I suggest setting this option too. I don't have enough information to
> suggest the value though, but please take a look at your sprio output to
> determine this.

I've looked at this before and I don't know what a sane value would be either, or if there is one to be had. Conditions in the queue change so often over time that it would be hard to pick something useful.

> Is it possible to get sdiag output once every 5 minutes 3 times after the
> changes are applied?

In progress...


I'm not seeing a huge change yet but we'll see.  Unfortunately Lloyd and I don't have more time today to work on this but we'll pick it up in the morning.

I'll attach the sdiag outputs.  One partition (m7) still has tons of idle nodes and work that can fill it, but it is not being filled.  There are a lot of jobs with limits and dependencies ahead in priority of some that can run, so maybe that is the problem.
Comment 30 Ryan Cox 2020-09-09 17:17:32 MDT
Created attachment 15825 [details]
squeue m7 partition

Note that the PartitionDown reason is not reflective of the current state.  No partitions are down, though some had been in the past.  It appears Slurm just hasn't re-evaluated these jobs recently enough.
Comment 31 Ryan Cox 2020-09-09 17:18:11 MDT
Created attachment 15826 [details]
sdiag.1599692151 before changing bf params
Comment 32 Ryan Cox 2020-09-09 17:18:40 MDT
Created attachment 15827 [details]
sdiag.1599692451 after changing bf params
Comment 33 Ryan Cox 2020-09-09 17:19:01 MDT
Created attachment 15828 [details]
sdiag.1599692751
Comment 34 Ryan Cox 2020-09-09 17:19:17 MDT
Created attachment 15829 [details]
sdiag.1599693051
Comment 35 Ryan Cox 2020-09-09 17:19:35 MDT
Created attachment 15830 [details]
sinfo.1599692301
Comment 36 Nate Rini 2020-09-09 18:21:50 MDT
(In reply to Ryan Cox from comment #29)
> > > max_rpc_cnt=200
> > Lets increase this to have fewer lock interruptions.
> 
> Done, though we had RPC timeouts so we had to enable this
We may need to tweak any of these settings based on how it performs.
 
> > > bf_max_job_test=500
> > Lets reduce this value as we go down the queue, less likely jobs will be
> > able to start.
> 
> We have bf_max_job_user_part=20 set currently.  We would rather not enable
> bf_max_job_test because people often submit to multiple partitions, and some
> people submit thousands of jobs.  bf_max_job_user_part has been good at
> being a sane limit so far, as far as we can tell.
How many users are active at any one time?
Comment 37 Nate Rini 2020-09-09 18:23:50 MDT
(In reply to Ryan Cox from comment #30)
> Created attachment 15825 [details]
> squeue m7 partition

Is it possible to get this too:
> sprio -p m7
Comment 38 Nate Rini 2020-09-09 18:37:52 MDT
(In reply to Ryan Cox from comment #32)
> Created attachment 15827 [details]
> sdiag.1599692451 after changing bf params

Looks like `sdiag -r` or slurmctld was hard reset for the changes?

> Main Last cycle:   629730 -> 597103 -> 606311 -> 603422
> Main Mean cycle:   638300 -> 590021 -> 595427 -> 594870
Looks like the main scheduler cycle time reduced slightly.

> Backfill Last cycle: 276872699 -> 60090 -> 587406089 -> 60090
One of the runs lasted ~2x the time before. I wonder if the 2 short ones were due to yield ending the backfill early.

> Backfill Last depth cycle: 484 -> 7204 -> 503 -> 7209
> Last depth cycle (try sched): 119 -> 401 -> 135 ->  406
Looks like backfill is looking at a larger number of jobs and trying to run even more of them now.

Can we also get this output:
> sacct -r m7 -p -o all
Comment 39 Nate Rini 2020-09-09 18:39:27 MDT
(In reply to Nate Rini from comment #38)
> > Backfill Last cycle: 276872699 -> 60090 -> 587406089 -> 60090
> One of the runs lasted ~2x the time before. I wonder if the 2 short ones
> were due to yield ending the backfill early.

Please also attach the slurmctld log covering this period.
Comment 40 Lloyd Brown 2020-09-10 09:12:43 MDT
(In reply to Nate Rini from comment #36)
> How many users are there active at any one time?

If you mean the number of users with actively-running jobs, that's 44 at the moment, and I suspect that's pretty typical.





(In reply to Nate Rini from comment #37)
> Is it possible to get this too:
> > sprio -p m7

(In reply to Nate Rini from comment #38)
Can we also get this output:
> sacct -r m7 -p -o all


I've gathered both of these this morning, along with another squeue for m7, so they're run at about the same time, and will attach them shortly.



(In reply to Nate Rini from comment #39)
> Please also attach the slurmctld during this period.

Assuming you mean the slurmctld.log, I've extracted two sections of that, to cover approximately the time that Ryan was getting you data last night (2020-09-09T17:00 to 2020-09-09T19:01) as well as the time this morning when I was gathering stats (2020-09-10T08:00 to 2020-09-10T08:56).  Both logs will be attached shortly.
Comment 41 Lloyd Brown 2020-09-10 09:14:26 MDT
Created attachment 15837 [details]
output of "sprio -p m7"
Comment 42 Lloyd Brown 2020-09-10 09:14:54 MDT
Created attachment 15838 [details]
output of "sacct -r m7 -p -o all"
Comment 43 Lloyd Brown 2020-09-10 09:15:23 MDT
Created attachment 15839 [details]
output of "squeue -p m7"
Comment 44 Lloyd Brown 2020-09-10 09:16:03 MDT
Created attachment 15840 [details]
slurmctld.log excerpt 2020-09-10 08:00-08:56
Comment 45 Lloyd Brown 2020-09-10 09:16:37 MDT
Created attachment 15841 [details]
slurmctld.log excerpt 2020-09-09 17:00-19:01
Comment 46 Marcin Stolarek 2020-09-10 10:35:24 MDT
I understand that you're in a difficult situation from an operations perspective, however, I'd like to clearly formulate the issue. I'll try to ask a few leading questions - I don't expect you to answer all of them.

I see that your focus is on the m7 partition now, which came into play in comment 20. The initial report was about jobs in the m9 partition. Do you see a similar problem in multiple partitions?

Did you notice a job waiting with the "(Resources)" reason for a long time while you believed there were enough resources in the partition? Such a job should be started by the main scheduler, and if it's not, it's very likely an issue with the SelectType plugin. Did you switch to cons_tres recently? Do you have an example of such a jobid?

Do you see jobs waiting for a long time with the "(Priority)" reason that you think should be started by the backfill scheduler - quite high priority, and small in terms of the time*resources product compared to higher-priority jobs? An example would be helpful.

I'll check the logs to see if I'm seeing anything odd or indicating low performance.

cheers,
Marcin

--
Version changed per comment 11
Comment 47 Lloyd Brown 2020-09-10 11:04:21 MDT
(In reply to Marcin Stolarek from comment #46)
> I understand that you're in a difficult situation from an operations
> perspective, however, I'd like to clearly formulate the issue. I'll try to
> ask a few leading questions - I don't expect you to answer all of them.
> 
> I see that your focus is on m7 partition now, which came into play in
> comment 20. The initial report was about jobs in m9 partition. Do you see a
> similar problem in multiple partitions? 


Yes.  After our mitigation efforts (mentioned in comment #19), the problem never reappeared on the m9 partition, but it did on the m7 partition.  There are some slight differences, which will come out in the discussion below.


> Did you notice a job waiting with "Resources" reason for a long time while
> you believe that there were enough resources in the partition? 

Anecdotally, I did see one of those with the m9 partition, as well as a series of subsequent jobs that said "Priority".  I used one of those "Priority" jobs as an example in comment #14, though I now recognize I should've used the "Resources" example.  Because the problem is no longer occurring there, I don't have an example to give you.

With the "m7" partition, though, we are not seeing exactly the same symptoms.  Instead, we see jobs that we *think* should be able to start, but are not, with the "PartitionDown" reason.  This is likely a side-effect of the now-discontinued mitigation strategy, which temporarily disabled the partition and re-enabled it a few minutes later.  We're confused, though, about why these jobs still have the "PartitionDown" reason when their "LastSchedEval" is being updated, indicating that they're being evaluated.  Here's an example of one of these:

root@sched1:/tmp# scontrol show job -dd 38162604
JobId=38162604 JobName=OutputHybrid/Hybrid_nGrains_200_nObs_11_rep_1
   UserId=snow25(22941) GroupId=snow25(23404) MCS_label=N/A
   Priority=100677 Nice=0 Account=okj2 QOS=normal
   JobState=PENDING Reason=PartitionDown Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-23:40:00 TimeMin=N/A
   SubmitTime=2020-08-31T15:30:16 EligibleTime=2020-08-31T15:30:16
   AccrueTime=Unknown
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-10T10:53:45
   Partition=m8n,m8,m7,m9 AllocNode:Sid=login02:376196
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=16000M,node=1,billing=3
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=16000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/lustre/scratch/grp/fslg_localize/Final_3DOF/runHybrid3DOF.sh
   WorkDir=/lustre/scratch/grp/fslg_localize/Final_3DOF
   StdErr=/lustre/scratch/grp/fslg_localize/Final_3DOF/OutputHybrid/Hybrid_nGrains_200_nObs_11_rep_1.out
   StdIn=/dev/null
   StdOut=/lustre/scratch/grp/fslg_localize/Final_3DOF/OutputHybrid/Hybrid_nGrains_200_nObs_11_rep_1.out
   Power=
   MailUser=(null) MailType=NONE





> Did you switch to cons_tres recently? 

We switched from "cons_res" to "cons_tres" when we upgraded to v20.02 on April 22.





> Do you see jobs waiting for long time with "(Priority)" reason that you
> think should get started by backfill scheduler - quite high priority, small
> in terms of time*resources product compared to higher priority jobs? An
> example will be helpful.
 

I don't currently have a good example of this.  At the moment, there are only 2 m7 jobs that are showing "Priority".

One of these is requesting a large number of partitions, with a long walltime (14 days); that long walltime is only allowed in some of the partitions listed (not m7), which have few resources and are currently busy, so it's not surprising that the job isn't starting.

The other job looks like it could theoretically start, but when the queue is sorted by priority, there are a large number of jobs ahead of it, with the following reasons (listed with their job counts):

   1 Priority
 100 AssocGrpMemRunMinute
1537 Dependency
9987 PartitionDown


So I'm not surprised that this job isn't starting either, since there are many higher-priority jobs that should also be eligible to start.




In short, I don't have any good examples right now, though we are very confused about why so many jobs are still showing "PartitionDown" when all the partitions are up and we *think* (based on LastSchedEval) that the jobs are being evaluated.
Comment 48 Marcin Stolarek 2020-09-10 11:53:44 MDT
Looking at slurmctld logs the top error messages are related to:
>error: unpack_header: protocol_version 8192 not supported
8192 corresponds to Slurm 17.11. Could you please check the following hosts for the older version of Slurm:
128.187.49.24
128.187.49.25
192.168.219.189
192.168.221.109

Did you share the complete slurmctld logs? Did you enable the backfill/SelectType debug flags? I don't see the characteristic messages appropriate for your SlurmctldDebug level and those flags.

I see you're using:
>SlurmctldDebug=3
We generally recommend using the string form, which is less confusing; "info" is the equivalent of 3.
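For illustration, a minimal slurm.conf fragment with the recommended string form (assuming no other settings change):

```
# slurm.conf - string form of the same log level
SlurmctldDebug=info    # equivalent to the numeric value 3
```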

cheers,
Marcin
Comment 49 Marcin Stolarek 2020-09-10 12:20:47 MDT
If by 'excerpt' you mean that the attachments are just lines matching some pattern, can you run grep "38162604" on the slurmctld logs, or just share the full log?

cheers,
Marcin
Comment 50 Lloyd Brown 2020-09-10 12:39:59 MDT
(In reply to Marcin Stolarek from comment #48)
> Looking at slurmctld logs the top error messages are related to:
> >error: unpack_header: protocol_version 8192 not supported
> 8192 corresponds to Slurm 17.11. Could you please check the following hosts
> for the older version of Slurm:
> 128.187.49.24
> 128.187.49.25
> 192.168.219.189
> 192.168.221.109


Yes.  I am aware of those 4 hosts.  3 of them could easily be shut down, and I suspect the 4th could be upgraded reasonably easily, though there are a few outstanding questions.  None of the 4 are being used as submission or compute nodes, and none are listed in slurm.conf (or nodes.conf, which is included in slurm.conf).

Is the presence of these explicitly causing problems, beyond noise in the logs?


> 
> Did you share complete slurmctld logs?

If you're referring to the most recent set of logs I uploaded, no.  The only request I had was the following (from comment #39):
> Please also attach the slurmctld during this period.

I was guessing that he meant the slurmctld.log in the first place.  

I simply manually picked a start and end date/time, used "grep -n" to find the line numbers for those timestamps, and then used awk to extract everything between those line numbers.
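As a sketch, the time-range extraction described above can look like the following. The timestamps are hypothetical, and a stand-in log file is generated inline so the example is self-contained; in practice `$log` would point at the real slurmctld.log.

```shell
# Stand-in for the real slurmctld.log (hypothetical timestamps).
log=slurmctld.log
printf '%s\n' \
  '[2020-09-10T07:59:59] before window' \
  '[2020-09-10T08:00:01] in window' \
  '[2020-09-10T08:56:00] end of window' \
  '[2020-09-10T09:00:00] after window' > "$log"

# Find the line numbers bracketing the window, then extract that range.
start=$(grep -n '2020-09-10T08:00' "$log" | head -n 1 | cut -d: -f1)
end=$(grep -n '2020-09-10T08:56' "$log" | tail -n 1 | cut -d: -f1)
awk -v s="$start" -v e="$end" 'NR >= s && NR <= e' "$log" > excerpt.log
```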


> Did you enable backfill/SelectType
> debug flags? I don't see characteristic messages appropriate for your
> SlurmctlDebug and those flags.


Again, if you're referring to the most recent logs, then no, those were not enabled.  You led us to believe that SelectType, at least, would have a significant operational impact, and shouldn't be enabled for more than a few minutes at a time.

E.g., in comment #25:
> That amount of logging can really slow down a system and should only be done for short intervals.

We certainly can try to capture those, but we'd need a clearer understanding of for how long, and what other data you need captured at the same time, so we don't have to do this more than once.
 
> I see you're using:
> >SlurmctldDebug=3
> We generally recommend using strings which is less confusing, "info" is the
> equivalent of 3.

So noted.  But unless there's a functional difference, I'm not going to worry about that for now.



(In reply to Marcin Stolarek from comment #49)
> If by 'excerpt' you mean that the attachments are just lines matching some
> pattern, can you run grep "38162604" on slurmctld logs or just share full
> log?

As I explained, the extract was done based on a time range, not based on a pattern match.  

Having said that, the only lines that match are as follows, which may not be terribly useful to you:

root@sched1:/tmp# grep 38162604 /var/log/slurm/slurmctld.log{,.1}
/var/log/slurm/slurmctld.log:[2020-09-10T07:30:44.079] Recovered JobId=38162604 Assoc=3234
/var/log/slurm/slurmctld.log.1:[2020-09-09T06:30:45.350] Recovered JobId=38162604 Assoc=3234
/var/log/slurm/slurmctld.log.1:[2020-09-09T07:22:48.957] Recovered JobId=38162604 Assoc=3234
/var/log/slurm/slurmctld.log.1:[2020-09-09T11:56:44.377] Recovered JobId=38162604 Assoc=3234
/var/log/slurm/slurmctld.log.1:[2020-09-09T16:56:27.857] Recovered JobId=38162604 Assoc=3234
/var/log/slurm/slurmctld.log.1:[2020-09-09T21:36:44.181] Recovered JobId=38162604 Assoc=3234
root@sched1:/tmp#
Comment 51 Marcin Stolarek 2020-09-10 13:19:05 MDT
>Is the presence of these explicitly causing problems, beyond noise in the logs?
It's mostly just noise in the logs. However, keeping them clean, especially when it's easy to do, is a step toward a healthy environment.

>You led us to believe that the SelectType at least, would be a significant operational impact, and shouldn't be done for more than a few minutes at a time.

Those will increase log volume significantly, which means more disk space and IOPS may be required. We need at least one full scheduler cycle captured (2 * bf_interval = 10 minutes should be enough). You can enable/disable the flags without a slurmctld restart using scontrol setdebugflags[1]. I'd say be cautious and monitor the disk/array where the logs are stored. If that works fine, you can try gathering logs at a higher debug level; again, no need to restart slurmctld, just `scontrol setdebug verbose` or even `scontrol setdebug debug` while the flags are enabled. If you notice that your backend is overloaded, just switch back with `scontrol setdebug info; scontrol setdebugflags -SelectType; scontrol setdebugflags -Backfill`.
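Collected in one place, the sequence described above would be (same commands as in the text, shown against a live slurmctld):

```shell
# Enable the extra debug data (no slurmctld restart required):
scontrol setdebugflags +Backfill
scontrol setdebugflags +SelectType
scontrol setdebug debug      # or "verbose" for a lower log volume

# ...capture at least 2 * bf_interval (about 10 minutes)...

# Revert if the log backend gets overloaded:
scontrol setdebug info
scontrol setdebugflags -Backfill
scontrol setdebugflags -SelectType
```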

>what other data you need captured at the same time
I understand that our focus is now on job 38162604 - functionality, not performance - so if you see a series of log messages referring to this JobId, that may help us understand what the backfill and/or select plugins did, and we can go from there.


cheers,
Marcin
[1]https://slurm.schedmd.com/scontrol.html#OPT_setdebugflags
Comment 52 Ryan Cox 2020-09-10 14:27:34 MDT
Created attachment 15846 [details]
veryveryverboselogs.tar.gz

Logs with: DEBUG, +backfill +selecttype
Additionally, some other outputs (squeue, sdiag, etc) during that time. slurmctld.log is 4.8G when extracted
Comment 53 Marcin Stolarek 2020-09-11 06:18:18 MDT
Looking at the scontrol show job results (and slurmctld logs), we see that JobId=38162604 is evaluated by the backfill scheduler, which creates a reservation for it on m8-17-10 starting at 2020-09-12T06:00:00. Other jobs are getting StartTime values up to 2020-09-19T21:17:16, which looks close to the end of your bf_window. Looking at the job spec and the nodes in the partition, the job's evaluation by the select plugin looks correct to me.

It looks like we didn't capture a whole backfill cycle, because a slurmctld restart was issued during its execution. The restart cleared the debug flags and reset the log level too.

There are two things I'd like to ask you to check:
1) Am I guessing correctly that short_floating is a floating reservation? Could you please verify whether removing it (or just changing it to non-floating) improves backfill performance?

2) I see that you have:
>PreemptType=preempt/qos
>PreemptMode=requeue
but the mechanism is not really in use - only one job (in scontrol show job) was requeued. Enabling preemption may have a negative impact on the time each "try sched" takes; can you disable it and check whether this significantly changes the number of jobs attempted by backfill per cycle?

cheers,
Marcin
Comment 55 Ryan Cox 2020-09-11 13:55:27 MDT
Fortunately/unfortunately, everything started scheduling last night.  We're working to see if we can figure out what changed in the queue - potentially some job started running, which then allowed other jobs to run?
Comment 56 Marcin Stolarek 2020-09-14 03:52:42 MDT
Ryan,

The logs you shared suggest that you've hit a bug in backfill scheduling. The issue is that when backfill was evaluating a job for a "later start", the time was shifted only by a few seconds for the next evaluation. I'm trying to figure out the conditions under which this can happen.

For further optimization, you may also consider using "bf_one_resv_per_job"[1].

Bearing in mind that the issue has "disappeared" - are you OK with lowering the ticket severity to 3?

cheers,
Marcin

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_bf_one_resv_per_job
Comment 57 Lloyd Brown 2020-09-14 12:55:54 MDT
Marcin,

I agree that the situation probably warrants a downgrade in priority at this point.  I've adjusted it to a level 3, or "Medium Impact"

As far as I can determine, the only portion of our system that is not as full as it could reasonably be at the moment is the m7 partition.  And looking at the list of pending jobs that list the m7 partition (squeue -t pending -p m7), everything is blocked for other reasons (eg. "Dependency", "AssocGrpMemRunMinutes").

We would, however, be very interested in what you figure out WRT the suspected backfill bug you referenced.
Comment 58 Marcin Stolarek 2020-09-14 12:58:10 MDT
Could you please share the result of `scontrol show res`? Did anything change recently (or did any advanced reservation end)?

cheers,
Marcin
Comment 59 Lloyd Brown 2020-09-14 14:25:37 MDT
Created attachment 15891 [details]
output of "scontrol show res"
Comment 60 Lloyd Brown 2020-09-14 14:27:23 MDT
Marcin,

I've attached the output you requested.  We are not aware of any major reservations that would've ended.

As you'll probably notice, we do have an appreciable number of nodes that are failing health checks and are getting a series of corresponding reservations.  Those are generally short-lived, and overlap with others as long as the condition lasts.  So some of those would've ended, but that's not anything new or unique.
Comment 61 Marcin Stolarek 2020-09-14 15:17:27 MDT
Thanks - I don't think there is anything wrong with your healthcheck reservations. As I mentioned in comment 53, I suspect incorrect handling of floating reservations; the fact that the short_floating reservation contains m8 and m7 nodes makes it probable.

I'll keep you posted.

cheers,
Marcin
Comment 63 Marcin Stolarek 2020-09-21 08:26:47 MDT
Lloyd,

Looking at the logs carefully, I think we've hit a specific set of circumstances where multiple advanced reservations (created by your healthcheck in this case) can cause low backfill performance. Normally, when backfill cannot start a job, it tries to reserve some time for the job in the future (to prevent backfill from starting a lower-priority job that would result in starvation of the higher-priority one). When setting a new potential start time, backfill will (in certain circumstances) set it to the end time of an advanced reservation; in your case this produced multiple candidate start times just one second later, on nodes that were not able to handle the job. This resulted in backfill try-later attempts failing to allocate the job, and finally reduced backfill efficiency.
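As a toy illustration (not Slurm's actual code), the cost of advancing a job's candidate start time one second at a time, versus rounding to a coarser step such as bf_resolution, can be sketched like this; the window length and step values are hypothetical:

```python
# Toy model of backfill retrying a job's candidate start time.
# If each failed attempt advances the candidate by only ~1 second
# (e.g. to the end of yet another short per-node reservation), the
# number of expensive selection calls explodes compared to stepping
# by a coarser resolution.

def attempts_until(window_secs, step_secs):
    """Number of selection attempts needed to scan a window with a given step."""
    attempts = 0
    t = 0
    while t < window_secs:
        t += step_secs
        attempts += 1
    return attempts

window = 3600                                # hypothetical 1-hour stretch of the bf window
per_second = attempts_until(window, 1)       # reservation ends 1 second apart
per_resolution = attempts_until(window, 60)  # stepping by a 60-second resolution

print(per_second, per_resolution)  # 3600 vs 60 attempts
```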

I have a patch in mind that should let us avoid that; it's something we'll treat as an improvement (if it really can be done given the current state of the code) and address in Slurm 20.11.

Did the issue reoccur in the past week? As a workaround, can you consider updating the reservation - adding a node to the existing HC reservation instead of creating a separate reservation per node? Another alternative would be to set the node state to drain instead of reserving it.

I'll keep you posted on potential patch/improvement.

cheers,
Marcin
Comment 70 Marcin Stolarek 2020-10-13 01:52:26 MDT
Lloyd,

Based on the logs you shared with us, we introduced an improvement to the backfill code that avoids calling the computationally intensive part of resource selection more frequently than the defined bf_resolution[1].

As an improvement it was added to the master branch and will be part of the upcoming major release, Slurm 20.11.

Should you have any questions please reopen.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/5c4249e0c22569f6baec2e5ef8af2f95f1bea8e5
Comment 71 Ryan Cox 2020-10-13 09:33:34 MDT
Thank you.