We have a project that would like to run over 100000 jobs. Currently they submit a job every 0.1 sec and the Slurm controller is OK, but as soon as they try submitting a job every 0.05 sec the controller comes to a halt. What would be an ideal job submission rate limit? What is the best way to submit 100000 jobs without impacting the controller?
(In reply to NASA JSC Aerolab from comment #0)
> We have a project that would like to run over 100000 jobs. Currently they
> submit a job every 0.1 sec and the Slurm controller is OK, but as soon as
> they try submitting a job every 0.05 sec the controller comes to a halt.
> What would be an ideal job submission rate limit? What is the best way to
> submit 100000 jobs without impacting the controller?

Hi,

There are many parameters we can tune. I'd like to start by analyzing what you currently have set in slurm.conf, and from that I will see whether there is a sensible proposal for you.

Also, are these single bursts of 100,000 jobs, or is submission continuous? How many jobs are usually in the queue? How many jobs do you allow to start at a time? Is throughput important, or do you just want to be able to submit all these jobs and let the scheduler do its work more "slowly"? There is usually a compromise between responsiveness to RPCs (commands, communications, and so on) and scheduling work.

Please send your slurm.conf and, if possible, describe the workflow in a little more detail.
Created attachment 9572 [details]
slurm config file
Jobs are submitted continuously, but ideally we would like to submit 100000 jobs and let Slurm take care of them. We don't have many limits in place at the moment, as you will see in our slurm.conf file.

Workflow: file manipulation -> computation which generates files -> file manipulation, which takes place in parallel.
(In reply to NASA JSC Aerolab from comment #5)
> Jobs are submitted continuously, but ideally we would like to submit 100000
> jobs and let Slurm take care of them. We don't have many limits in place at
> the moment, as you will see in our slurm.conf file.
>
> Workflow: file manipulation -> computation which generates files -> file
> manipulation, which takes place in parallel.

Well, we will try to set some initial parameters and see how it works. Tuning the scheduler is normally a process that requires many iterations until we get the correct settings. We will start with the backfill parameters and continue with the builtin scheduler ones.

I see you have just 41 nodes with 56 threads each. This is a small system, so it won't be possible to run 100000 jobs at a time. That means we do not need to try to schedule a lot of jobs, because they won't fit in the nodes anyway. Let's say each job can use a single thread. There are 56*41=2296 CPUs available, which means that at most 2296 jobs will be running in the system. Strictly speaking that would only hold if you didn't have "OverSubscribe=YES" in the partition, but we will assume this value for now.

I would then set the maximum queue depth to 500. With this setting you will see a large percentage of the jobs being tested for scheduling at each iteration, with a reasonable compromise on responsiveness. If responsiveness is still not good, we can decrease this value:

bf_max_job_test=500

We can limit the amount of time the backfill scheduler may spend in one cycle even if the maximum job count has not been reached. This is a control for improving responsiveness. For now we are not changing the default value (same as bf_interval), but we can take this one into account for the future:

bf_max_time=

Since we are potentially cutting the scheduling off in the middle of a cycle, let's also add the bf_continue parameter:

bf_continue

To limit the number of seconds between backfill iterations we can set bf_interval. The default is 30 seconds, so we can leave it as is:

bf_interval=

We can also limit the scheduler's time resolution when looking at and planning for the future. Higher values give better responsiveness, but scheduling can be a bit less precise, which I guess is not so important here. This works together with bf_window, and we could read the pair as: look bf_window minutes into the future with bf_resolution seconds of resolution. Your bf_window is not big, as you will see below, but in any case let's set 5 minutes just in case. If everything works fine you can decrease this parameter to achieve more precise scheduling times:

bf_resolution=300

Then we can instruct the scheduler to only look a limited number of minutes into the future. This value has to be at least as long as the highest allowed time limit. I see your partition has DefaultTime=04:00:00 and MaxTime=8:00:00, so let's put 8 hours here. If everything works fine, you can increase this parameter. Keep bf_resolution in mind too:

bf_window=480

We then have bf_yield_interval and bf_yield_sleep. These mean, respectively: every X seconds release the locks so that RPCs and other operations can be processed, and yield those locks for Y seconds. The defaults are 2 seconds and 0.5 seconds. If you keep seeing responsiveness issues, you can make the scheduler "stop" more frequently by decreasing bf_yield_interval, or let it sleep longer with bf_yield_sleep. We will leave them at their default values, but if you continue seeing responsiveness problems, use these parameters.
bf_yield_interval=
bf_yield_sleep=

In addition to the previous two, we can also change max_rpc_cnt. This one instructs backfill to defer work when there are too many pending RPCs. The time the scheduler sleeps also depends on this parameter: it will keep sleeping at least until the number of pending RPCs has dropped to a tenth of max_rpc_cnt. I will put a value of 100 here, since there can be at most 256 pending operations being served at a time (equivalent to MAX_SERVER_THREADS). This parameter should help responsiveness:

max_rpc_cnt=100

Now for the builtin scheduler. The equivalent of bf_max_job_test for the builtin scheduler is default_queue_depth; let's keep the default of 100 here, because this affects the number of jobs tried at submit time or when a job finishes:

default_queue_depth=

If submitting thousands of jobs still blocks slurmctld, we can evaluate the more drastic 'defer' parameter later.

Finally, one more parameter that can help with responsiveness. On every job submission the builtin scheduler will start trying to schedule, by default at most every 1 second. Since you are submitting thousands of jobs per second, the scheduler will be running all the time. Let's make it not attempt scheduling at such short intervals by increasing this from 1 second to 2 seconds:

sched_min_interval=2000000

Other parameters to consider if after a couple of iterations we still don't achieve the desired performance:

max_sched_time=#
sched_interval=

To summarize, let's set this line in slurm.conf, restart slurmctld, and then run a reconfigure:

SchedulerParameters=bf_max_job_test=500,bf_resolution=300,bf_window=480,max_rpc_cnt=100,bf_continue,sched_min_interval=2000000

These parameters are for the future:

bf_interval=
bf_max_time=
bf_yield_interval=
bf_yield_sleep=
default_queue_depth=

Tell me what improvements you see, and attach some sdiag outputs captured periodically (every 10 seconds for 5 minutes while submitting the 100000 jobs).
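In case it helps when gathering that data, here is a minimal sketch of a capture loop; the output directory is just a placeholder and can be anything convenient:

# Capture sdiag every 10 seconds for 5 minutes (30 samples) while the
# 100000 jobs are being submitted; /tmp/sdiag-logs is a placeholder path.
mkdir -p /tmp/sdiag-logs
for i in $(seq 1 30); do
    sdiag > /tmp/sdiag-logs/sdiag-$(date +%Y%m%d-%H%M%S).out
    sleep 10
done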
I'm reviewing your recommendations and will inform you after we implement them and run some jobs.

According to the following documentation, job arrays can be used to submit millions of jobs in milliseconds. Do you think that's possible with the compute resources we have?

https://slurm.schedmd.com/job_array.html
(In reply to NASA JSC Aerolab from comment #7)
> I'm reviewing your recommendations and will inform you after we implement
> them and run some jobs.
>
> According to the following documentation, job arrays can be used to submit
> millions of jobs in milliseconds. Do you think that's possible with the
> compute resources we have?
>
> https://slurm.schedmd.com/job_array.html

Yes, indeed. It is good practice and we recommend it. If you are able to use job arrays, your problem may become much less complicated. I didn't know whether these jobs came from several users or just one, so I assumed you wanted scheduler tuning only. In any case, you can implement the proposed parameters and then use job arrays.
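As a side illustration (the array size and throttle value below are placeholders, not derived from your workload), a job array submission can also cap how many of its tasks run at once using the '%' separator, which further reduces scheduling pressure:

# Submit 2000 tasks as a single array; the "%200" throttle (placeholder value)
# allows at most 200 array tasks to run at the same time.
sbatch --array=1-2000%200 --wrap='echo "task $SLURM_ARRAY_TASK_ID"'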
Only one user was submitting all the jobs. Currently MaxArraySize is set to 2500. In the past a single user tried to submit around 100K jobs, but it brought the machine to a halt due to lack of memory. How much RAM does Slurm recommend if we increase MaxArraySize to 10000 and set MaxJobCount=100000?

Should a user submit 10 different job arrays to submit 100K jobs? Should there be a delay of a second between each submission?
(In reply to NASA JSC Aerolab from comment #9)
> Only one user was submitting all the jobs. Currently MaxArraySize is set to
> 2500. In the past a single user tried to submit around 100K jobs, but it
> brought the machine to a halt due to lack of memory. How much RAM does Slurm
> recommend if we increase MaxArraySize to 10000 and set MaxJobCount=100000?
>
> Should a user submit 10 different job arrays to submit 100K jobs? Should
> there be a delay of a second between each submission?

When a job array is submitted, only one "object" is created in memory with the metadata of the array until it is modified or initiated, so there will be no RAM issue at submit time. Nevertheless, performance can be impacted by scheduling, and one key here is to limit the number of jobs to be scheduled at a time. I cannot give a number for RAM; it will depend on how many jobs you start or modify at a given time and on the scheduler pressure.

I think that with the provided parameters, plus something like you propose:

MaxArraySize=10000
MaxJobCount=100000

and maybe a MaxSubmitJobs limit if users are misbehaving, it should be good enough. You can even increase both values, but I'd recommend going step by step and monitoring the system resources.

This is an example on my laptop submitting a job array of 30000 tasks; there is almost no memory consumption since only a few jobs are scheduled/initiated at a time:

[lipi@llagosti 18.08]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7858        3508         734         787        3614        3290
Swap:          1535           2        1533
[lipi@llagosti 18.08]$ sbatch -N1 -a 1-29999 --wrap='sleep 1'
Submitted batch job 6561
[lipi@llagosti 18.08]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7858        3548         685         794        3623        3242
Swap:          1535           2        1533
[lipi@llagosti 18.08]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            6544_1     debug     wrap     lipi CG       0:02      1 gamba1
            6544_2     debug     wrap     lipi CG       0:03      1 gamba1
            6544_3     debug     wrap     lipi CG       0:04      1 gamba1
            6544_4     debug     wrap     lipi CG       0:04      1 gamba1
   6561_[42-29999]     debug     wrap     lipi PD       0:00      1 (Resources)
           6561_41     debug     wrap     lipi  R       0:00      1 gamba2
           6561_39     debug     wrap     lipi  R       0:01      1 gamba3
           6561_40     debug     wrap     lipi  R       0:01      1 gamba3
           6561_29     debug     wrap     lipi  R       0:02      1 gamba3
           6561_30     debug     wrap     lipi  R       0:02      1 gamba3
           6561_31     debug     wrap     lipi  R       0:02      1 gamba2
           6561_32     debug     wrap     lipi  R       0:02      1 gamba2
           6561_33     debug     wrap     lipi  R       0:02      1 gamba2
           6561_34     debug     wrap     lipi  R       0:02      1 gamba4
           6561_35     debug     wrap     lipi  R       0:02      1 gamba4
           6561_36     debug     wrap     lipi  R       0:02      1 gamba4
           6561_37     debug     wrap     lipi  R       0:02      1 gamba4
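For reference only, a sketch of how those limits could be expressed; the user name below is a placeholder, and MaxSubmitJobs is a per-association limit set through sacctmgr rather than in slurm.conf:

# slurm.conf (restart slurmctld after changing these):
MaxArraySize=10000
MaxJobCount=100000

# Optional per-user submit limit in the accounting database; "someuser" is a
# placeholder user name.
sacctmgr modify user someuser set MaxSubmitJobs=100000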
That's a synthetic test launching 90,000 jobs within 3 arrays and starting a few of them: almost no RAM consumption, and all launched at the same time. Submitting a fourth 30000-task array would fail because my MaxJobCount is 100000 and in total it would have required 120,000 jobs.

[lipi@llagosti 18.08]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7858        3654        2736         593        1467        3353
Swap:          1535           3        1532
[lipi@llagosti 18.08]$ cat /tmp/test.sh
#!/bin/bash
sbatch -N1 -a 1-29999 --wrap='sleep 1' &&
sbatch -N1 -a 1-29999 --wrap='sleep 1' &&
sbatch -N1 -a 1-29999 --wrap='sleep 1'
[lipi@llagosti 18.08]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[lipi@llagosti 18.08]$ time /tmp/test.sh
Submitted batch job 9093
Submitted batch job 9094
Submitted batch job 9095

real    0m0,089s
user    0m0,032s
sys     0m0,028s
[lipi@llagosti 18.08]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            9093_7     debug     wrap     lipi CG       0:01      1 gamba3
            9093_9     debug     wrap     lipi CG       0:01      1 gamba4
           9093_12     debug     wrap     lipi CG       0:01      1 gamba4
    9094_[1-29999]     debug     wrap     lipi PD       0:00      1 (Priority)
    9095_[1-29999]     debug     wrap     lipi PD       0:00      1 (Priority)
   9093_[15-29999]     debug     wrap     lipi PD       0:00      1 (Resources)
           9093_13     debug     wrap     lipi  R       0:00      1 gamba2
           9093_14     debug     wrap     lipi  R       0:00      1 gamba3
[lipi@llagosti 18.08]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7858        3701        2689         583        1466        3316
Swap:          1535           3        1532
[lipi@llagosti 18.08]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    9094_[1-29999]     debug     wrap     lipi PD       0:00      1 (Priority)
    9095_[1-29999]     debug     wrap     lipi PD       0:00      1 (Priority)
   9093_[51-29999]     debug     wrap     lipi PD       0:00      1 (Resources)
           9093_49     debug     wrap     lipi  R       0:00      1 gamba2
           9093_50     debug     wrap     lipi  R       0:00      1 gamba3
           9093_39     debug     wrap     lipi  R       0:01      1 gamba2
           9093_40     debug     wrap     lipi  R       0:01      1 gamba2
           9093_41     debug     wrap     lipi  R       0:01      1 gamba2
           9093_42     debug     wrap     lipi  R       0:01      1 gamba3
           9093_43     debug     wrap     lipi  R       0:01      1 gamba3
           9093_44     debug     wrap     lipi  R       0:01      1 gamba3
           9093_45     debug     wrap     lipi  R       0:01      1 gamba4
           9093_46     debug     wrap     lipi  R       0:01      1 gamba4
           9093_47     debug     wrap     lipi  R       0:01      1 gamba4
           9093_48     debug     wrap     lipi  R       0:01      1 gamba4
[lipi@llagosti 18.08]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7858        3682        2706         585        1469        3333
Swap:          1535           3        1532
Hi, have you applied the recommendations or tried with arrays? Is the scheduler performing better?
The only change I have made is to the array size, as we discussed, and I think we are good for right now. Thank you for your assistance.
Ok, I am marking the issue as Infogiven then. Please reopen if you still see issues even after implementing our recommendations. Regards