Ticket 6688

Summary: Slurm job submission interval
Product: Slurm Reporter: NASA JSC Aerolab <JSC-DL-AEROLAB-ADMIN>
Component: Configuration Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 17.11.2   
Hardware: Linux   
OS: Linux   
Site: Johnson Space Center
Version Fixed: 17.11.2
Attachments: slurm config file.

Description NASA JSC Aerolab 2019-03-13 10:09:26 MDT
We have a project that would like to run over 100000 jobs. Currently they submit jobs every 0.1 sec and the Slurm controller is OK, but as soon as they try submitting jobs every 0.05 sec the controller comes to a halt. What would be an ideal job submission interval? What is the best way to submit 100000 jobs without impacting the controller?
Comment 3 Felip Moll 2019-03-14 06:44:01 MDT
(In reply to NASA JSC Aerolab from comment #0)
> We have a project that would like to run over 100000 jobs. Currently they
> submit jobs every 0.1 sec and the Slurm controller is OK, but as soon as they
> try submitting jobs every 0.05 sec the controller comes to a halt. What would
> be an ideal job submission interval? What is the best way to submit 100000
> jobs without impacting the controller?

Hi,

There are many parameters we can tune. I'd like to start by analyzing what you
currently have set in slurm.conf, and from that I will see what I can
propose for you.

Also, are these single bursts of 100,000 jobs, or is the stream continuous?
How many jobs are usually in the queue? How many jobs do you allow to be
started at a time? Is throughput important, or do you just want to be able to
submit all these jobs and let the scheduler work through them more "slowly"?

There's usually a compromise between responsiveness to RPCs (commands,
communications, and so on) and scheduling work.

Please send your slurm.conf and, if possible, describe the workflow in a bit
more detail.
Comment 4 NASA JSC Aerolab 2019-03-14 10:16:48 MDT
Created attachment 9572 [details]
slurm config file.
Comment 5 NASA JSC Aerolab 2019-03-14 10:20:22 MDT
Jobs are submitted continuously, but ideally we would like to submit 100000 jobs and let Slurm take care of them. We don't have many limits in place at the moment, as you will see in our slurm.conf file.

Workflow: File manipulation -> computation which generates files -> file manipulation which takes place in parallel.
Comment 6 Felip Moll 2019-03-15 06:05:19 MDT
(In reply to NASA JSC Aerolab from comment #5)
> Jobs are submitted continuously, but ideally we would like to submit 100000
> jobs and let slurm take care of it. We don't have many limits in place at
> the moment as you will see in our slurm.conf file.
> 
> Workflow: File manipulation -> computation which generate files -> file
> manipulation which takes place in parallel.

Well, we will try to set some initial parameters and see how it works.
Tuning the scheduler is normally a process that requires many iterations until we get the right settings.

We will start with the backfill parameters, and continue with the builtin scheduler ones.

I see you have just 41 nodes with 56 threads each. This is a small system, so
it won't be possible to run 100000 jobs at a time. That means we don't need
to try to schedule a huge number of jobs per cycle, because they won't fit in
the nodes anyway.

Let's say each job uses a single thread. There are 56*41 = 2296 CPUs available, which means that at most 2296 jobs can be running in the system. Strictly speaking that only holds if you didn't have "Oversubscribe=YES" on the partition, but we will assume this value for now. I would then set the backfill queue depth to 500; with this setting a large percentage of the runnable jobs is tested at each iteration, as a compromise with responsiveness. If responsiveness is still not good, we can decrease this value.

bf_max_job_test=500

We can limit the amount of time the backfill scheduler can spend in one cycle, even if the maximum job count has not been reached. This is a control for improving responsiveness. For now we are not changing the default value (the same as bf_interval), but we can take this one into account for the future.

bf_max_time=

Since we are potentially cutting the scheduler off in the middle of a cycle, let's also add the bf_continue parameter so it resumes where it left off.

bf_continue

The number of seconds between backfill iterations is controlled by bf_interval. The default is 30 seconds, so we can leave it as is:

bf_interval=

We can also adjust the time resolution the backfill scheduler uses when planning into the future. Higher values give better responsiveness, but scheduling becomes a bit less precise, which I guess is not so important here. This interacts with bf_window and can be read as: look bf_window minutes into the future with a resolution of bf_resolution seconds.
Your bf_window is not large, as you will see below, but in any case let's set 5 minutes just in case. If everything works fine you can decrease this parameter to achieve more precise scheduling times.

bf_resolution=300

Then we can instruct the scheduler to only look a limited number of minutes into the future. This value has to be at least as long as the highest allowed time limit. I see your partition has DefaultTime=04:00:00 and MaxTime=8:00:00, so let's put 8 hours here. If everything works fine, you can increase this parameter. Keep bf_resolution in mind too:

bf_window=480

We then have bf_yield_interval and bf_yield_sleep: how often the backfill scheduler releases its locks so that RPCs and other operations can be processed, and for how long it yields those locks, respectively. The defaults are 2 seconds and 0.5 seconds (both parameters are specified in microseconds). If you keep seeing responsiveness issues, you can make the scheduler "stop" more frequently by decreasing bf_yield_interval, or let it sleep for longer with bf_yield_sleep. We will leave them at their default values, but if you continue seeing responsiveness problems, use these parameters.

bf_yield_interval=
bf_yield_sleep=
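
For reference, a purely illustrative (hypothetical, not a recommendation for your site) tightening of the yields would look like this, remembering that both values are given in microseconds:

# hypothetical values, shown only to illustrate the microsecond units:
# release locks every 1 second and keep them released for 1 second
bf_yield_interval=1000000
bf_yield_sleep=1000000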

In addition to the previous two, we can also change max_rpc_cnt. This instructs backfill to defer work when there are too many RPCs pending, and the time the scheduler sleeps also depends on this parameter: it keeps sleeping at least until the number of pending RPCs drops to roughly a tenth of max_rpc_cnt. I will put a value of 100 here, since there can be at most 256 pending operations being served at a time (equivalent to MAX_SERVER_THREADS). This parameter should help responsiveness:

max_rpc_cnt=100

Now for the builtin scheduler:

default_queue_depth is the equivalent of bf_max_job_test, but for the builtin scheduler. Let's keep the default of 100 here, because this controls how many jobs are considered for scheduling at submit time or when a job finishes:

default_queue_depth=

If submitting thousands of jobs still blocks slurmctld, we can evaluate the more drastic 'defer' parameter later.

Finally, one more parameter that can help with responsiveness. On every job submission the builtin scheduler starts a scheduling attempt, by default no more often than once per second. Since you are submitting many jobs per second, the scheduler would otherwise be running all the time. Let's keep it from scheduling at such short intervals by raising the minimum interval from 1 second to 2 seconds (the value is given in microseconds):

sched_min_interval=2000000

Other parameters to consider if, after a couple of iterations, we still don't achieve the desired performance:
max_sched_time=#
sched_interval


To summarize, let's set this line in slurm.conf, restart slurmctld, and then run a reconfigure:

SchedulerParameters=bf_max_job_test=500,bf_resolution=300,bf_window=480,max_rpc_cnt=100,bf_continue,sched_min_interval=2000000
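
As a rough sketch, assuming slurmctld is managed by systemd (adjust to your init system), applying the change on the controller node would look like:

# after editing SchedulerParameters in slurm.conf on the controller:
systemctl restart slurmctld
# then make the other daemons re-read the configuration:
scontrol reconfigure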


These parameters are for the future:
bf_interval=
bf_max_time=
bf_yield_interval=
bf_yield_sleep=
default_queue_depth=

Tell me what improvements you see and attach some sdiag outputs captured on a periodic basis (every 10 seconds for 5 minutes while submitting the 100000 jobs).
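
A minimal sketch of one way to capture those samples, assuming bash and that sdiag is in your PATH (the file names are just an example):

# 30 samples, one every 10 seconds = 5 minutes
for i in $(seq 1 30); do
    sdiag > sdiag.$(date +%H%M%S).out
    sleep 10
done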
Comment 7 NASA JSC Aerolab 2019-03-15 15:47:12 MDT
I'm reviewing your recommendations and will inform you after we implement them and run some jobs. 

According to the following documentation, job arrays can be used to submit
millions of jobs in milliseconds. Do you think that is possible with the compute resources we have?

https://slurm.schedmd.com/job_array.html
Comment 8 Felip Moll 2019-03-18 03:04:42 MDT
(In reply to NASA JSC Aerolab from comment #7)
> I'm reviewing your recommendations and will inform you after we implement
> them and run some jobs.
> 
> According to the following documentation, job arrays can be used to submit
> millions of jobs in milliseconds. Do you think that is possible with the
> compute resources we have?
> 
> https://slurm.schedmd.com/job_array.html

Yes, indeed.

It is good practice and we recommend it. If you are able to use job arrays, your problem may become much simpler.

I didn't know whether these jobs came from several users or just one, so I assumed you only wanted scheduler tuning.

In any case, you can implement the proposed parameters and then use job arrays.
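
As a sketch of what that could look like (the script, executable, file names and array range are hypothetical, adapt them to your workflow), a single array submission replaces thousands of individual sbatch calls:

#!/bin/bash
# hypothetical job name; one array task per input case.
# The array range is bounded by MaxArraySize.
#SBATCH --job-name=aero_sweep
#SBATCH --array=1-1000
#SBATCH --ntasks=1
#SBATCH --time=04:00:00

# each array task picks its own input file based on its index
./compute case_${SLURM_ARRAY_TASK_ID}.in

Submitting it once with 'sbatch array_job.sh' creates all the tasks in a single call, and squeue shows the pending tasks collapsed into one line.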
Comment 9 NASA JSC Aerolab 2019-03-18 10:47:40 MDT
Only one user was submitting all the jobs.
Currently MaxArraySize is set to 2500. In the past a single user tried to submit around 100K jobs, but it brought the machine to a halt due to lack of memory. How much RAM does Slurm recommend if we increase MaxArraySize to 10000 and MaxJobCount to 100000?

Should a user submit 10 different job arrays to submit 100K jobs?
Should there be a delay of a second between each submission?
Comment 10 Felip Moll 2019-03-18 11:29:10 MDT
(In reply to NASA JSC Aerolab from comment #9)
> Only one user was submitting all the jobs.
> Currently MaxArraySize is set to 2500. In the past a single user tried to
> submit around 100K jobs, but it brought the machine to a halt due to lack of
> memory. How much RAM does Slurm recommend if we increase MaxArraySize to
> 10000 and MaxJobCount to 100000?
> 
> Should a user submit 10 different job arrays to submit 100K jobs?
> Should there be a delay of a second between each submission?

When a job array is submitted, just one "object" is created in memory holding the metadata of the array until tasks are modified or started, so there will be no RAM issue at submission time. Nevertheless, performance can still be impacted by scheduling pressure.

One key here is to limit the number of jobs to be scheduled at a time.

I cannot give a single number for RAM; it will depend on how many jobs you start or modify at a given time and on the scheduler pressure. I think that with the parameters provided plus something like what you propose:

MaxArraySize=10000
MaxJobCount=100000

and maybe a MaxSubmitJobs limit if users are misbehaving, it should be good enough. You can even increase both values, but I'd recommend going step by step and monitoring the system resources.
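
A sketch of what that could look like: the first two lines go in slurm.conf (changing MaxJobCount typically requires restarting slurmctld), while the per-user submit limit lives in the accounting database (the user name here is hypothetical):

# slurm.conf
MaxArraySize=10000
MaxJobCount=100000

# optional per-association limit, only if a user misbehaves (hypothetical user name):
sacctmgr modify user name=someuser set MaxSubmitJobs=50000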

This is an example on my laptop submitting a job array of almost 30000 tasks; there is almost no memory consumption, since only a few jobs are scheduled/started at a time:

[lipi@llagosti 18.08]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7858        3508         734         787        3614        3290
Swap:          1535           2        1533

[lipi@llagosti 18.08]$ sbatch -N1 -a 1-29999 --wrap='sleep 1'
Submitted batch job 6561

[lipi@llagosti 18.08]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7858        3548         685         794        3623        3242
Swap:          1535           2        1533

[lipi@llagosti 18.08]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            6544_1     debug     wrap     lipi CG       0:02      1 gamba1
            6544_2     debug     wrap     lipi CG       0:03      1 gamba1
            6544_3     debug     wrap     lipi CG       0:04      1 gamba1
            6544_4     debug     wrap     lipi CG       0:04      1 gamba1
   6561_[42-29999]     debug     wrap     lipi PD       0:00      1 (Resources)
           6561_41     debug     wrap     lipi  R       0:00      1 gamba2
           6561_39     debug     wrap     lipi  R       0:01      1 gamba3
           6561_40     debug     wrap     lipi  R       0:01      1 gamba3
           6561_29     debug     wrap     lipi  R       0:02      1 gamba3
           6561_30     debug     wrap     lipi  R       0:02      1 gamba3
           6561_31     debug     wrap     lipi  R       0:02      1 gamba2
           6561_32     debug     wrap     lipi  R       0:02      1 gamba2
           6561_33     debug     wrap     lipi  R       0:02      1 gamba2
           6561_34     debug     wrap     lipi  R       0:02      1 gamba4
           6561_35     debug     wrap     lipi  R       0:02      1 gamba4
           6561_36     debug     wrap     lipi  R       0:02      1 gamba4
           6561_37     debug     wrap     lipi  R       0:02      1 gamba4
Comment 11 Felip Moll 2019-03-18 11:36:52 MDT
That's a synthetic test launching ~90,000 jobs within 3 arrays and starting a few of them: almost no RAM consumption, and all of them submitted at practically the same time.

Submitting a fourth 30000-task array would fail, because my MaxJobCount is 100000 and in total it would require ~120,000 job records.


[lipi@llagosti 18.08]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7858        3654        2736         593        1467        3353
Swap:          1535           3        1532

[lipi@llagosti 18.08]$ cat /tmp/test.sh 
#!/bin/bash
sbatch -N1 -a 1-29999 --wrap='sleep 1' && sbatch -N1 -a 1-29999 --wrap='sleep 1' && sbatch -N1 -a 1-29999 --wrap='sleep 1'


[lipi@llagosti 18.08]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)


[lipi@llagosti 18.08]$ time /tmp/test.sh 
Submitted batch job 9093
Submitted batch job 9094
Submitted batch job 9095

real	0m0,089s
user	0m0,032s
sys	0m0,028s


[lipi@llagosti 18.08]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            9093_7     debug     wrap     lipi CG       0:01      1 gamba3
            9093_9     debug     wrap     lipi CG       0:01      1 gamba4
           9093_12     debug     wrap     lipi CG       0:01      1 gamba4
    9094_[1-29999]     debug     wrap     lipi PD       0:00      1 (Priority)
    9095_[1-29999]     debug     wrap     lipi PD       0:00      1 (Priority)
   9093_[15-29999]     debug     wrap     lipi PD       0:00      1 (Resources)
           9093_13     debug     wrap     lipi  R       0:00      1 gamba2
           9093_14     debug     wrap     lipi  R       0:00      1 gamba3

[lipi@llagosti 18.08]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7858        3701        2689         583        1466        3316
Swap:          1535           3        1532

[lipi@llagosti 18.08]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    9094_[1-29999]     debug     wrap     lipi PD       0:00      1 (Priority)
    9095_[1-29999]     debug     wrap     lipi PD       0:00      1 (Priority)
   9093_[51-29999]     debug     wrap     lipi PD       0:00      1 (Resources)
           9093_49     debug     wrap     lipi  R       0:00      1 gamba2
           9093_50     debug     wrap     lipi  R       0:00      1 gamba3
           9093_39     debug     wrap     lipi  R       0:01      1 gamba2
           9093_40     debug     wrap     lipi  R       0:01      1 gamba2
           9093_41     debug     wrap     lipi  R       0:01      1 gamba2
           9093_42     debug     wrap     lipi  R       0:01      1 gamba3
           9093_43     debug     wrap     lipi  R       0:01      1 gamba3
           9093_44     debug     wrap     lipi  R       0:01      1 gamba3
           9093_45     debug     wrap     lipi  R       0:01      1 gamba4
           9093_46     debug     wrap     lipi  R       0:01      1 gamba4
           9093_47     debug     wrap     lipi  R       0:01      1 gamba4
           9093_48     debug     wrap     lipi  R       0:01      1 gamba4

[lipi@llagosti 18.08]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7858        3682        2706         585        1469        3333
Swap:          1535           3        1532
Comment 12 Felip Moll 2019-03-25 07:53:51 MDT
Hi, have you applied the recommendations or tried with arrays? Is the scheduler performing better?
Comment 13 NASA JSC Aerolab 2019-03-25 10:49:47 MDT
The only change I have made is to the array size, as we discussed, and I think we are good for right now. Thank you for your assistance.
Comment 14 Felip Moll 2019-03-26 01:46:23 MDT
Ok, I am marking the issue as Infogiven then. Please reopen if you still see issues even after implementing our recommendations.

Regards