Ticket 6448

Summary: how to optimize configuration for large number of small jobs
Product: Slurm    Reporter: Randy Smith <rsmith>
Component: Scheduling    Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN    QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 17.11.10   
Hardware: Linux   
OS: Linux   
Site: TGen
Attachments: current sdiag output
PendingJobs.txt

Description Randy Smith 2019-02-01 11:32:19 MST
Created attachment 9070 [details]
current sdiag output

Last week a user complained that jobs were taking 10 minutes to schedule when there were tens of idle cores.  This user was submitting on the order of 5,000 single-core jobs at a time, while the cluster had about 2,600 functioning cores.  During this same period, at least one other user was submitting several jobs, each requesting 40 cores exclusively.  Given this information, how can I determine whether we are optimally configured to support this type of workload?
Comment 2 Ben Roberts 2019-02-04 13:40:30 MST
Hi Randy,

There are a number of factors that could contribute to jobs being slow to start.  One scheduler parameter you can set is 'defer'.  Setting this prevents slurmctld from starting a scheduling cycle immediately when a job is submitted; instead it waits for a later pass, when scheduling multiple jobs in one cycle may be possible.  This option can improve system responsiveness when a large number of jobs are submitted at the same time, but it delays the start time of individual jobs.  Is this a typical situation, where you have users submitting thousands of jobs at a time?  If so, the user should have seen many jobs starting after the defer time, not just one job at a time.  This assumes, of course, that there wasn't a higher-priority job scheduled to start on the free nodes sooner than the walltime of the new jobs.  Did the user submitting the 40-core jobs experience similar delays?  Would you expect jobs from that user to have higher priority than those from the user submitting thousands of jobs?

We do have a section of our documentation that talks about configuring the cluster for high throughput that may be helpful:
https://slurm.schedmd.com/high_throughput.html
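For reference, a minimal slurm.conf sketch of how 'defer' might be enabled.  This is illustrative only; merge it with whatever SchedulerParameters values you already have on one line:

```
# slurm.conf (sketch): skip the per-submission scheduling cycle and
# batch scheduling decisions into the periodic pass instead.
SchedulerType=sched/backfill
SchedulerParameters=defer
```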

It is difficult to say with certainty what might have caused a delay for these jobs.  If you are able to send a copy of your slurm.conf, I can look it over and probably get a better idea of what may have caused it.  If this type of thing happens frequently, it would also be helpful to collect some information about your cluster while it's happening.  If it does happen again, I'd like to see:
sinfo
squeue
sprio
scontrol show nodes
scontrol show job <job id>  (for a job you think should run, and for several jobs that the sprio output shows with higher priority)
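As a convenience, the commands above could be wrapped in a small script that saves everything into one directory for attaching to the ticket.  This is an illustrative sketch, not part of Slurm itself; the script name and the job-ID arguments are placeholders you would supply:

```shell
#!/bin/sh
# collect_slurm_diag.sh -- sketch: snapshot scheduler state for a ticket.
# Usage: ./collect_slurm_diag.sh <jobid> [<jobid>...]

if command -v sinfo >/dev/null 2>&1; then
    out="slurm-diag-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$out"

    sinfo               > "$out/sinfo.txt"
    squeue              > "$out/squeue.txt"
    sprio               > "$out/sprio.txt"
    scontrol show nodes > "$out/nodes.txt"

    # Details for the jobs of interest: one you think should run, plus
    # several that sprio shows with higher priority.
    for jobid in "$@"; do
        scontrol show job "$jobid" >> "$out/jobs.txt"
    done

    echo "Collected output in $out/"
else
    echo "Slurm client tools not found; run this on a cluster login node."
fi
```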

Thanks,
Ben
Comment 3 Ben Roberts 2019-02-04 14:21:53 MST
Hi Randy,

One of my colleagues pointed out that we have a recent copy of your slurm.conf.  I went over it and don't see any parameters set that would account for delays in scheduling jobs when there are plenty of resources free.  It would be good to see if you can reproduce this behavior (maybe during off hours) by submitting thousands of jobs at once.  If it is reproducible that way, then some of the parameters mentioned in the high-throughput documentation should help.  If you don't see it by submitting a lot of jobs, but do come across it again, it would be good to collect the output I mentioned.

Thanks,
Ben
Comment 4 Randy Smith 2019-02-04 14:25:29 MST
Thanks, Ben.
I'll investigate the high-throughput suggestions and run thousands of test jobs to see if I can reproduce the problem.  Moab had a way to replay a workload; does Slurm have a mechanism for doing something similar?
-r

Comment 5 Ben Roberts 2019-02-04 14:52:08 MST
Slurm does not have a mechanism to replay a workload.  You would have to try to reproduce it with new jobs on your system.

Thanks,
Ben
Comment 6 Randy Smith 2019-02-06 12:38:33 MST
Created attachment 9097 [details]
PendingJobs.txt

At this point I have about 1,051 pending jobs.  Of those, 35 have TRES=cpu=1,mem=4000M,node=1.
There are 36 idle 28-core nodes.  Why are the 1-CPU jobs pending?  I've attached output from sinfo, squeue, and sprio.

dback-c1-n[03-08],dback-c2-n[01-05,07-08     36 defq*             idle   28   2:14:1 112000        0      1 28core,Haswell,FC430 none                 (null)

Randy Smith
Sr. HPC Engineer
Translational Genomics Research Institute

445 N 5th Street, Phoenix, AZ 85004
(602) 343-8547
rsmith@tgen.org
Comment 7 Ben Roberts 2019-02-07 10:45:53 MST
Hi Randy,

Thanks for collecting the sinfo, squeue, and sprio output.  Along with that output I was hoping to see the details of some of the idle nodes that should be able to start the jobs (scontrol show nodes <node name>), plus the details of an example job or two (scontrol show job <job id>).  Often the reason a job reports for not starting helps explain what's happening.  From the output you sent, it looks like 2058178 may be one of the small jobs you expect to be able to start.  It shows '(Resources)' as the reason it is pending.  The job details for that job would probably provide some additional insight.
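As a sketch, the state and pending reason for that job could be checked like this (job ID 2058178 is the example from the attached squeue output; substitute your own):

```shell
#!/bin/sh
# Sketch: inspect why a specific pending job has not started.
if command -v squeue >/dev/null 2>&1; then
    # %t = job state, %r = pending reason (e.g. Resources, Priority)
    squeue -j 2058178 -o "%.10i %.9P %.2t %.12r"
    scontrol show job 2058178
else
    echo "Slurm client tools not found; run this on a cluster login node."
fi
```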

Was this a scenario where you submitted a bunch of jobs to reproduce the behavior, or did it come up again on its own?  I assume these jobs do eventually run.  How long does it take for them to start?

Thanks,
Ben
Comment 8 Randy Smith 2019-02-07 11:20:48 MST
Thanks, Ben;

The next time I encounter this situation I'll be sure to include 'scontrol show job <job id>' and 'scontrol show nodes <node name>' in the upload.
The 'scontrol show job' I did examine listed priority as the reason for the pend.
The scenario was the result of the user community submitting a normal workload; the jobs do eventually start, after about 10 minutes on average.

Will scontrol show that a job is pending because it is backfilling?  If not, is there a way to monitor which nodes are being used for backfill, or a way to monitor the scheduling algorithm (excluding strace)?

Randy Smith


Comment 9 Ben Roberts 2019-02-07 13:39:40 MST
Hi Randy,

The scontrol output for a job will sometimes give more information when the job is pending for 'Resources' rather than 'Priority', but the cause is frequently difficult to track down.  Probably the best way to get some insight into what is happening with backfill is to enable debug logging for the backfill plugin.  Along with the rest of the information you're gathering, it would be useful to see some of these debug logs the next time this happens.  When you've got jobs that look like they should be able to run, you can enable the extra logging with:

scontrol setdebugflags +Backfill
scontrol setdebugflags +BackfillMap

Then let it run for a few minutes before turning the logging back off, like this:
scontrol setdebugflags -BackfillMap
scontrol setdebugflags -Backfill

The additional logs go to the slurmctld.log file.  
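The enable/wait/disable steps above could be wrapped in a small script.  This is an illustrative sketch; the 5-minute window is an arbitrary choice, not a Slurm default:

```shell
#!/bin/sh
# Sketch: enable backfill debug logging for a window, then disable it.
# WINDOW (seconds) is an arbitrary choice; adjust as needed.
WINDOW=${WINDOW:-300}

if command -v scontrol >/dev/null 2>&1; then
    scontrol setdebugflags +Backfill
    scontrol setdebugflags +BackfillMap

    sleep "$WINDOW"    # let several backfill cycles run

    scontrol setdebugflags -BackfillMap
    scontrol setdebugflags -Backfill
    echo "Done; check slurmctld.log for the Backfill debug lines."
else
    echo "scontrol not found; run this on a cluster node."
fi
```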

Thanks,
Ben
Comment 10 Ben Roberts 2019-02-28 08:22:12 MST
Hi Randy,

I wanted to follow up and see if this has come up as an issue again and whether you've been able to collect some debug logs related to backfill.  Let me know if you still need help with this ticket.

Thanks,
Ben
Comment 11 Ben Roberts 2019-03-20 10:11:15 MDT
Hi Randy,

It's been a while since I've heard from you on this ticket, so I assume the issue with a large number of small jobs hasn't come up again.  I looked at your settings again to see if there is anything I can suggest with the information I have.  I do think there are a couple of backfill-related SchedulerParameters we can change that might help in future scenarios like the one you described.  One parameter to consider is bf_continue.  A backfill scheduling cycle runs for at most the time specified by bf_max_time, which defaults to 30 seconds.  If it doesn't get through all the jobs in that time, the next backfill pass starts over at the beginning of the job list; if the first jobs in the list can't start, the scheduler keeps looping over them without starting anything, which could be what you were experiencing.  With bf_continue set, each pass continues where the previous one left off in the list of eligible jobs rather than starting over.

Depending on the maximum walltime of your jobs, you may also want to consider increasing the bf_window value.  This controls how far into the future the backfill scheduler looks when considering jobs to schedule.  If Slurm is trying to schedule jobs longer than bf_window, it won't know whether resources will be free that far out, and those jobs may not be scheduled.  The default window is 1440 minutes (one day).  If you do increase bf_window, it's also recommended to increase the bf_resolution parameter, which controls the time resolution of the data the scheduler maintains about when jobs begin and end, so it doesn't have to do as much bookkeeping each time it evaluates eligible jobs.
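Putting the parameters above together, a hedged slurm.conf sketch might look like this.  The numeric values are illustrative, not recommendations; see the slurm.conf man page for each parameter:

```
# slurm.conf (sketch): backfill tuning discussed above.
# bf_continue   - resume the job list where the last pass left off
# bf_window     - look-ahead window in minutes (default 1440 = one day)
# bf_resolution - time resolution in seconds for the backfill map
SchedulerParameters=bf_continue,bf_window=2880,bf_resolution=600
```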

These are educated guesses based on the information I have.  I think the parameter most likely to help is bf_continue.

Let me know if you have questions about this.

Thanks,
Ben
Comment 12 Ben Roberts 2019-04-11 08:42:50 MDT
Hi Randy,

In your last update on this ticket you were planning to collect more output when the issue came up again, and I've since sent some suggestions that I think might help.  Given that, I'll close the ticket for now; if the issue comes up again, you can update the ticket to reopen it.

Thanks,
Ben
Comment 13 Randy Smith 2019-04-11 09:13:01 MDT
Thanks, Ben;
Load on the system is low at the moment and has been for the last month or
so.  Please do close the issue and I will reopen when I can gather more
data.

Randy
