Ticket 10256

Summary: slow preemption/requeue
Product: Slurm Reporter: Sophie Créno <sophie.creno>
Component: Scheduling    Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: cashley, marshall
Version: 20.02.5   
Hardware: Linux   
OS: Linux   
Site: Institut Pasteur
Attachments: configuration with preempt/partition_prio and partition QoS
first sbatch 691433 launched at 17:55, start of preemption (for job 696738) at 17:58:33
corresponding sdiag done after 750 tasks out of 1000 have started
slurm.conf of our end-of-life cluster
tgz containing slurm commands output + slurmctld.log + users' command lines

Description Sophie Créno 2020-11-19 11:40:10 MST
Created attachment 16744 [details]
configuration with preempt/partition_prio and partition QoS

Hello,

  I'm currently adjusting the configuration of our new cluster.
Please find in attachment our current slurm.conf.

  All the nodes belonging to private partitions are also put in 
another one called dedicated. When these nodes are not used by
their owners, other people can run opportunistic jobs on them
using the dedicated partition with the fast (2-hour) QoS. But
when the owner of the nodes launches a job, the nodes must be
preempted if necessary and the opportunistic jobs requeued.

  That's why I have set
* globally:
** JobRequeue=1
** PreemptMode=REQUEUE
** PreemptType=preempt/partition_prio
** PreemptExemptTime=00:00:02

* and for partitions:
** PreemptMode=off PriorityTier=10000 PriorityJobFactor=10000
for the private ones 
** while it is only PreemptMode=requeue for the dedicated partition.

* and the following priority parameters
PriorityFlags=FAIR_TREE
PriorityType=priority/multifactor
PriorityWeightAge=200
PriorityWeightFairshare=700
PriorityWeightPartition=1000

so that jobs in private partitions are considered first for scheduling
and the preemption of private nodes can be effective.
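
For reference, the setup described above amounts to roughly the following slurm.conf fragment (partition names and node lists are placeholders, not our real configuration):

```
# Global preemption settings
JobRequeue=1
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PreemptExemptTime=00:00:02

# Priority settings
PriorityType=priority/multifactor
PriorityFlags=FAIR_TREE
PriorityWeightAge=200
PriorityWeightFairshare=700
PriorityWeightPartition=1000

# Owner partitions: never preempted, highest partition priority
PartitionName=private_rg1 Nodes=... PreemptMode=off PriorityTier=10000 PriorityJobFactor=10000

# Shared overlay partition: opportunistic jobs here can be requeued
PartitionName=dedicated Nodes=... PreemptMode=requeue
```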

It works, but the preemption and the start of the owner's jobs are very
slow, so I guess that something is wrong, at least in my SchedulerParameters:

SchedulerParameters=kill_invalid_depend,nohold_on_prolog_fail,pack_serial_at_end,enable_user_top,permit_job_expansion,partition_job_depth=5000,default_queue_depth=5000,bf_continue,bf_max_job_part=5000,bf_max_job_user=5000,bf_max_job_test=5000,bf_interval=30,bf_max_time=600,bf_yield_interval=10000,bf_resolution=600,bf_min_age_reserve=0,preempt_youngest_first,sched_min_interval=300000,batch_sched_delay=10,sched_max_job_start=5000
 
  When an owner launches a job array, I want almost all of the required
nodes preempted at once so that as many job array tasks as possible
can start immediately. 
  We have that kind of behavior on our old cluster but the configuration
is a bit different because we didn't use partition QoS at that time.

  Could you help me to obtain the sought-after behavior?

  Thanks in advance,
Comment 2 Colby Ashley 2020-11-20 15:50:43 MST
Hey Sophie,

>It works but the preemption and the start of the owner's jobs are very
>slow so I guess that something is wrong at least in my SchedulerParameters

When you say very slow, about how long is that?
The changes to preemption look correct. I noticed a few of your backfill intervals are changed from the defaults. It could be a backfill issue, but I would need a better idea of how slow it is actually going.

One more thing, a colleague brought to my attention that preemption with QOS might be slow in 20.02. I will look into that and let you know.

~Colby
Comment 3 Sophie Créno 2020-11-23 10:07:07 MST
Hi Colby,

  For example, 5 minutes after the owner of the partition has submitted
his job array, only 175 tasks out of 1000 have been scheduled and have
started. The rest is still pending.
  Do you need something more accurate?

  Thanks,
Comment 7 Colby Ashley 2020-11-23 12:24:10 MST
Hey Sophie,

>For example, 5 minutes after the owner of the partition has submitted
>his job array, only 175 tasks out of 1000 have been scheduled and have
>started. The rest is still pending.
>Do you need something more accurate?

That helps a lot, thank you. The slowness with QoS was coming from setting PreemptType=preempt/qos. You have set it to preempt/partition_prio, so you are not hitting that bug, which is nice. Would it be possible to get the slurmctld log from the time of the job submission up until now? Also, if possible, could you change the debug flags in your slurm.conf to include Backfill to gather more detailed logging for me, along with the output of sdiag?

~Colby
Comment 8 Sophie Créno 2020-11-24 10:58:58 MST
Created attachment 16804 [details]
first sbatch 691433 launched at 17:55, start of preemption (for job 696738) at 17:58:33

Hi Colby,

  Here is the slurmctld log for the time window containing the start of
the opportunistic job array 691433 and then the one of the owner 696738
(at 17:58:33) until all its tasks are running.
Comment 9 Sophie Créno 2020-11-24 11:00:47 MST
Created attachment 16805 [details]
corresponding sdiag done after 750 tasks out of 1000 have started
Comment 10 Sophie Créno 2020-11-30 04:47:35 MST
Hello,

  I am raising the Importance to "medium impact" because we are supposed
to open this cluster to the whole Institute next week, and we won't
be able to cope with the usual thousands of opportunistic jobs given
the slowness of the current preemption.

Thanks for your help,
Comment 11 Marshall Garey 2020-11-30 16:22:17 MST
(In reply to Sophie Créno from comment #0)
>   When an owner launches a job array, I want almost all of the required
> nodes preempted at once so that as many job array tasks as possible
> can start immediately. 
>   We have that kind of behavior on our old cluster but the configuration
> is a bit different because we didn't use partition QoS at that time.

Can you upload slurm.conf from your old cluster? Did you use preemption at all on your older cluster?

I don't think partition QoS has anything to do with this. I suspect that this is just a side effect of how preemption with job arrays works. Only a limited number of tasks (determined by bf_max_job_array_resv) in the job array will preempt per scheduling cycle. If the array job that you want to be preempting other jobs isn't the highest priority job and is therefore only preempting during the backfill cycle, then it could take quite a lot of backfill cycles for the array job to preempt enough resources for all its tasks to start.

I think this is probably what is happening.
Comment 12 Sophie Créno 2020-12-01 10:28:23 MST
Created attachment 16881 [details]
slurm.conf of our end-of-life cluster

Hi Marshall,

  Attached is our end-of-life cluster's slurm.conf. Yes, we are used to
using preemption. At the very beginning it was with QoS, but we have used

PreemptType=preempt/partition_prio

for several years now. An output of sprio on that cluster looks like:

JOBID PARTITION     USER   PRIORITY     AGE  FAIRSHARE  JOBSIZE  PARTITION     QOS   NICE
29232435 common    user1      51971    1000      50957       15          0       0      0                     
30526994 ebmc      user2      50560     194      40348       18          0   10000      0                     
30526994 ebmc      user2      50551     185      40348       18          0   10000      0
30929166 common    user3       8307      22       8174       12          0     100      0
31005524 common    user4      22934       9      21913       12          0    1000      0

because the QoSes on that cluster have different priorities.

Job 29232435 has been submitted on the shared nodes with a QoS of
priority 0 because there is no time limit.
Job 30526994 (listed twice, once per partition) has been launched
with a QoS that is used for a private partition. As a consequence,
its priority is 10000, to allow users to start quickly on their own
nodes.
Job 30929166 has been submitted on shared nodes but with a QoS
limited to 24 hours that has 100 as priority.
Job 31005524 has been submitted on shared nodes but with a QoS
limited to 2 hours that has 1000 as priority.
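
As a sanity check on the sprio output above: the PRIORITY column is simply the sum of the per-factor columns (sprio rounds each factor, so sums can differ by one). A quick script with the values copied from the table:

```python
# Verify that sprio's PRIORITY column equals the sum of the factor columns
# (AGE + FAIRSHARE + JOBSIZE + PARTITION + QOS + NICE), within rounding.
rows = {
    29232435: (51971, [1000, 50957, 15, 0, 0, 0]),
    30526994: (50560, [194, 40348, 18, 0, 10000, 0]),
    30929166: (8307,  [22, 8174, 12, 0, 100, 0]),
    31005524: (22934, [9, 21913, 12, 0, 1000, 0]),
}
for jobid, (priority, factors) in rows.items():
    assert abs(priority - sum(factors)) <= 1, jobid
print("all rows consistent to within rounding")
```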

  Hope it helps. Thanks,
Comment 13 Marshall Garey 2020-12-01 18:20:34 MST
Colby is on vacation for a couple more days, so I'm going to keep responding.

========================================================================
Here's what's happening when I try to replicate this:
========================================================================

The main scheduler is preempting resources for 1 job array task at a time, as expected. The backfill scheduler will preempt up to bf_max_job_array_resv jobs at a time, as expected. But the main scheduler is running really often - it is queued every time a job completes, which is happening often (about once per second) because of the preemptions. Then the high priority job is starting and the next task in the job array preempts another one.

So, it seems like it's preempting fairly fast. It's probably faster for me since the schedulers don't have to run through hundreds or thousands of jobs and slurmctld doesn't have to process a bunch of other RPCs in between.



========================================================================
Analyzing your numbers:
========================================================================

>For example, 5 minutes after the owner of the partition has submitted
>his job array, only 175 tasks out of 1000 have been scheduled and have
>started. The rest is still pending.

175 tasks / 5 minutes = 35 tasks per minute started. If the backfill scheduler preempted resources for 20 jobs once per minute, and then the main scheduler ran 15 times per minute to preempt resources for 1 job each time, that would be preempting resources for 35 jobs in the array per minute.

Also, are there points where the owner's jobs fill up the partition they're submitted to, so that there is nothing left to preempt?
Actual sdiag data:

sdiag output at Tue Nov 24 18:18:42 2020 (1606238322)
Data since      Tue Nov 24 01:00:00 2020 (1606176000)

Main schedule statistics (microseconds):
    Mean depth cycle:  71
    Cycles per minute: 6

Backfilling stats
    Total cycles: 1976
    Depth Mean: 34

1976 backfill cycles in 17 hours, 18 minutes, and 42 seconds. Truncating the 42 seconds, that's 1976 backfill cycles in 1038 minutes, or 1.9 backfill cycles per minute. So the backfill scheduler is actually running almost twice per minute and the main+quick scheduler about 6 times per minute. Rounding up to twice per minute on the backfill scheduler, that means the schedulers can preempt 46 jobs per minute per array job, only if there are enough resources for those jobs to run in the partition.
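
The arithmetic above can be reproduced directly (the 20 is the default bf_max_job_array_resv, and the 6 main-scheduler cycles per minute come from the sdiag output):

```python
# Sanity-check the scheduling-rate estimate from the sdiag output above.
from datetime import datetime

start = datetime(2020, 11, 24, 1, 0, 0)     # "Data since"
end = datetime(2020, 11, 24, 18, 18, 42)    # sdiag timestamp
minutes = (end - start).total_seconds() / 60          # ~1038.7 minutes

backfill_cycles = 1976
bf_per_min = backfill_cycles / minutes                # ~1.9 cycles/minute

# Upper bound on array tasks preempted-for per minute:
#   backfill preempts up to bf_max_job_array_resv (default 20) tasks/cycle,
#   the main scheduler preempts 1 task per cycle, ~6 cycles/minute.
best_rate = round(bf_per_min) * 20 + 6 * 1            # 2*20 + 6 = 46

print(f"{bf_per_min:.1f} backfill cycles/min, best case {best_rate} tasks/min")
```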


Summary:
You're saying 35 tasks per minute are started. Looking at sdiag, I think the best you would see with your current settings is about 46 tasks per minute per owner partition, assuming the backfill and main schedulers preempt at their maximum rates every cycle. Either way, I suspect you want it faster.



========================================================================
So now what?
========================================================================

This all seems pretty normal to me. Was it noticeably different on your older cluster?

>  When an owner launches a job array, I want almost all of the required
> nodes preempted at once so that as many job array tasks as possible
> can start immediately. 

I'm not sure this is possible with job arrays. Is this what you had in your old cluster? The configuration doesn't seem very different.


I can have you run a test for me to get detailed logging and see exactly what is happening, but I think I already have a good idea. Let me know if you want to run a test with detailed logging.


Suggestions:

- Increase bf_max_job_array_resv. Don't increase it by a lot at first - I suggest starting with 30 (the default is 20). You can increase this more as needed.
  - It could adversely affect backfill in other ways - the backfill scheduler might not be able to "backfill" as many jobs as it otherwise might have, but it sounds like this would fit your workflow better.
- Rather than having owner partitions own specific nodes, I suggest using a "floating" partition. You can read more about that here: https://slurm.schedmd.com/qos.html (search for the word "floating"). Just set GrpTRES=nodes=<# of allowed nodes> on the QOS of each owner partition. The owners will be able to run on many different nodes while still being limited to the number of nodes they purchased. This could reduce the number of preemptions and increase utilization of the cluster.
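
For reference, the GrpTRES cap from the second suggestion would be set through sacctmgr; the QOS name and node count here are placeholders:

```
# Create the owner QOS and cap it at the purchased node count
sacctmgr add qos rg1_floating
sacctmgr modify qos rg1_floating set GrpTRES=nodes=10

# Then reference it from the owner partition in slurm.conf:
# PartitionName=private_rg1 Qos=rg1_floating ...
```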


What do you think? Do you have any questions?
Comment 14 Sophie Créno 2020-12-02 10:58:10 MST
Hi Marshall, and thanks for handling my request while Colby is away,

> This all seems pretty normal to me. Was it noticeably different
> on your older cluster?

  If I look more closely, the first tasks (of the second job array,
the one that requires preemption) start earlier, but the last ones
start at approximately the same time.
  Even if our end-of-life cluster is much more heavily used at the
moment than the new one, the first 175 tasks, for example, take 2'30"
to be allocated on the old one vs. almost 5' on the new one. But the
last ones are allocated 15' after the submission of the second job
array in both cases. These figures were obtained after setting
bf_max_job_array_resv=30 and running scontrol reconfigure.


> I can have you run a test for me to get detailed logging and
> see exactly what is happening, but I think I already have a
> good idea. Let me know if you want to run a test with detailed
> logging.

  I must admit that I would be interested in knowing if the difference
in the start time of the first tasks is due to circumstances or if
there is something in the configuration that is responsible for that.
But since it's not true for later tasks, maybe it isn't worth it. What
do you think?


  Regarding the "floating partitions", the idea is appealing indeed!
But we are waiting for another type of node with 2 TB of RAM. As
a consequence, nodes won't be interchangeable anymore. So we could
create partitions for standard nodes, others for BigMem nodes, and
limit the number of nodes with the partition QoS GrpNodes for each
private floating partition. But for research units with both types
of nodes, we would need to keep static private partitions, right?

  Thanks a lot for your time and suggestions,
Comment 15 Marshall Garey 2020-12-02 15:59:04 MST
(In reply to Sophie Créno from comment #14)
>   I must admit that I would be interested in knowing if the difference
> in the start time of the first tasks is due to circumstances or if
> there is something in the configuration that is responsible for that.
> But since it's not true for later tasks, maybe it isn't worth it. What
> do you think?

I'm inclined to think that it's circumstantial, but let's run a test to see what we learn.


1. Start the opportunistic job. Wait for it to start running.
2. Run the following to get more verbose logging:

scontrol setdebug debug
scontrol setdebugflags +backfill

3. Submit the owner job.
4. Wait for about 30 minutes, or for the owner job to completely start running (whichever is less time). Run these commands every 2 minutes:

sprio
sinfo
squeue -a
sdiag

5. Reset logging back to normal:

scontrol setdebug info
scontrol setdebugflags -backfill

The increased logging may adversely affect performance, but shouldn't be too bad (may not even be noticeable at all) unless the filesystem or storage that the slurmctld log file is on is slow.

6. Then can you upload the following (in a compressed folder so it's just a single upload):

- The job submission of the opportunistic job and the owner job (command line arguments and script)
- The slurmctld log file during this time
- Output of all the commands (sprio, sinfo, squeue -a, sdiag)


I'm hoping to see the status of the nodes, the jobs that are available to be preempted, the owner jobs that will be preempting, the priority of all jobs, and if the owner jobs ever preempt enough resources to fill up the partition they're submitted to.



>   Regarding the "floating partitions", the idea is appealing indeed!
> But we are waiting for another type of nodes with 2 To of RAM. As
> a consequence, nodes won't be interchangeable anymore. So we could
> create partitions for standard nodes, others for BigMem nodes and
> limit the number of nodes with partition QoS GrpNodes for each
> private floating partition. But for research units with both types
> of nodes, we would need to keep static private partitions, right?

Here's an idea that might work to avoid static private partitions. For research groups with both types of nodes, you could create a private floating partition for each type of node.

As an example:

# Use NodeSet syntax with features bigmem and normal
NodeName=node[1-100] RealMemory=50000 Feature=normal
NodeName=node[101-105] RealMemory=2000000 Feature=bigmem
NodeSet=BigMemNodes Feature=bigmem
NodeSet=NormalNodes Feature=normal

# Separate floating partitions per node type (partition names must be unique)
PartitionName=ResearchGroup1BigMem Nodes=BigMemNodes Qos=rg1_bigmem
PartitionName=ResearchGroup1Normal Nodes=NormalNodes Qos=rg1_normal

And set GrpNodes in rg1_bigmem and rg1_normal to however many nodes of each type that group owns.

(I use NodeSet syntax which was introduced in Slurm 20.02. You can read more about it in the SC20 BoF presentation by Tim Wickberg here: https://slurm.schedmd.com/publications.html)


Disadvantages: You'd have more partitions and more QOS's to manage, but this isn't a technical limitation. The real technical limitation is that jobs can't span multiple partitions, so a single job couldn't use both types of nodes. If you don't want a single job to use both types of nodes, then I think this idea would work. But if you want a single job to be able to use both types of nodes, then you'd probably need the static private partitions.
Comment 16 Sophie Créno 2020-12-03 12:44:04 MST
Created attachment 16954 [details]
tgz containing slurm commands output + slurmctld.log + users' command lines

Hi Marshall,

  Here is what you asked for (for the new cluster). 

  Sorry, I was a bit elliptical. Indeed, I would prefer to keep only
1 private partition per research unit. If it wasn't the case, yes, that's
exactly what I had imagined when I read the page about floating partitions.
It would be interesting to avoid too many requeued jobs indeed. But still,
you would keep preemption at the partition level to allow users with small
and short jobs to fill the gaps in private partitions right?

  If I wanted to do that type of change in the configuration, would I have
to put all nodes in a maintenance reservation, change the configuration, do
a scontrol reconfigure and delete the reservation? I'm used to adding nodes
or to migrating some nodes from a partition to another but these are rather
minor changes. Here it's different since most of the nodes are affected.

  Thanks a lot for your suggestions,
Comment 17 Marshall Garey 2020-12-04 08:58:04 MST
(In reply to Sophie Créno from comment #16)
> Created attachment 16954 [details]
> tgz containing slurm commands output + slurmctld.log + users' command lines
> 
> Hi Marshall,
> 
>   Here is what you asked for (for the new cluster). 

I'll look through it and let you know what I find.


>   Sorry, I was a bit elliptical. Indeed, I would prefer to keep only
> 1 private partition per research unit. If it wasn't the case, yes, that's
> exactly what I had imagined when I read the page about floating partitions.
> It would be interesting to avoid too many requeued jobs indeed. But still,
> you would keep preemption at the partition level to allow users with small
> and short jobs to fill the gaps in private partitions right?

Yes, you would definitely need to specify PreemptMode=off for those partitions that you don't want to be preempted. It looks like I accidentally left that off.

You could specify PreemptMode=requeue globally and then only specify PreemptMode=off for partitions where you want to disable preemption; or you could not specify PreemptMode globally and specify it for every single partition. It doesn't matter - either way is the same.


>   If I wanted to do that type of change in the configuration, would I have
> to put all nodes in a maintenance reservation, change the configuration, do
> a scontrol reconfigure and delete the reservation? I'm used to adding nodes
> or to migrating some nodes from a partition to another but these are rather
> minor changes. Here it's different since most of the nodes are affected.

You shouldn't have to worry. You should be able to make the changes and run scontrol reconfigure without draining nodes or having a system wide maintenance.

I just ran an experiment. I submitted a job to a partition with 10 nodes, then changed the partition definition to have 7 nodes and ran scontrol reconfigure. The job continued to run and completed normally. I wasn't sure what would happen, but that seems good.

These changes aren't even removing nodes from partitions, they're adding nodes and setting a GrpTRES limit on the QOS, so my worry about removing nodes from partitions isn't even applicable here.

Still, I recommend testing these changes on a test system first.

If you want to be extra safe, you could always set a maintenance reservation like you propose, but that shouldn't be necessary.






One more thing I want to make you aware of: GrpTRES isn't limited to Nodes only, it's available for any TRES, including CPUs. GrpTRES=CPU would allow users to spread their jobs across more nodes but still be restricted to a certain number of CPUs, so this could potentially be even more beneficial than GrpTRES=Nodes. But it all depends on your site policies and what they're comfortable with.
Comment 18 Sophie Créno 2020-12-04 13:00:59 MST
Hi Marshall,

> You could specify PreemptMode=requeue globally and then only specify
> PreemptMode=off for partitions where you want to disable preemption;
> or you could not specify PreemptMode globally and specify it for
> every single partition. It doesn't matter - either way is the same.

  Yes and it's more or less the same for us since we only have 
1 partition (dedicated) that allows preemption ;)


> I just ran an experiment. I submitted a job to a partition with 10 nodes,
> then changed the partition definition to have 7 nodes and ran scontrol 
> reconfigure. The job continued to run and completed normally. I wasn't
> sure what would happen, but that seems good.

  Yes, I have seen the same thing several times while migrating nodes
from one partition to another. As long as the job wasn't submitted in
a preemptable partition, it doesn't seem to matter. The job continues
to run as if nothing had happened.


> Still, I recommend testing these changes on a test system first.

  Yes sure. I'll do that during the shutdown that is taking place
this weekend for other reasons.


> If you want to be extra safe, you could always set a maintenance
> reservation like you propose, but that shouldn't be necessary.

  I think I'll do it in production conditions, just to be reassured.


> One more thing I want to make you aware of: GrpTRES isn't limited
> to Nodes only, it's available for any TRES, including CPUs.

  Yes, I know. We considered that way of doing things at
the beginning.


> GrpTRES=CPU would allow users to spread their jobs across more
> nodes but still be restricted to a certain number of CPUs,
> so this could potentially be even more beneficial than
> GrpTRES=Nodes. But it all depends on your site policies and
> what they're comfortable with.

  Unfortunately, it doesn't suit some of our use cases, such as
metagenomics, where a single process can require the whole memory
of a node. So, in some cases, even if the unit doesn't use all of
its resources/"virtual nodes", the job can remain PENDING because
there is no node left with enough free RAM. At the beginning we had
the same problem with cores, due to some programs being only
multithreaded, but now, with 96 cores per node, this problem should
be quite rare.

  Thanks again for your time and suggestions,
Comment 19 Colby Ashley 2020-12-07 15:15:19 MST
Hey Sophie,

It looks like Marshall gave you a bunch of ideas on how to go about solving this issue. Is there anything he did not answer or that you have questions about?

~Colby
Comment 20 Sophie Créno 2020-12-11 06:35:10 MST
Hi Colby,

  Given that Marshall said

> I'll look through it and let you know what I find.

  I'm waiting for his conclusions while we are debugging an issue on
one of our storage bays that appeared last weekend and has impacted 
the new cluster since then :(
Comment 21 Colby Ashley 2020-12-14 15:24:37 MST
Hey Sophie,

Marshall and I did some testing before the weekend and I have been diving into the logs you sent us.

First from the logs I would suggest setting bf_max_job_part to something much lower than 5000. Try setting it to 20 but feel free to tune this value. The backfiller was still trying to backfill more of the jobs from the dedicated partition rather than the one with higher priority.
From the slurm.conf docs:
>bf_max_job_part=#
>The maximum number of jobs per partition to attempt starting with the backfill
>scheduler. This can be especially helpful for systems with large numbers of
>partitions and jobs. This option applies only to SchedulerType=sched/backfill.
>Also see the partition_job_depth and bf_max_job_test options. Set bf_max_job_test
>to a value much higher than bf_max_job_part. Default: 0 (no limit), Min: 0, 
>Max: bf_max_job_test.
From our testing we saw that once the backfiller put a job on a node, it wouldn't try to add another job to that node even if it had idle resources. The debug flag BackfillMap is what we used to see this occur. With the test you ran, each job in the array only takes up 1 core to sleep. From your slurm.conf, you have maestro 1000-1095 set up with 2 sockets and 48 cores per socket, so you should be able to fit 96 of these jobs per node. To quote what Marshall previously said about GrpTRES=CPU:
>One more thing I want to make you aware of: GrpTRES isn't limited to Nodes
>only, it's available for any TRES, including CPUs. GrpTRES=CPU would allow
>users to spread there jobs across more nodes but still be restricted to a
>certain number of CPUs, so this could be potentially even more beneficial than
>GrpTRES=Nodes. But it all depends on your site policies and what they're
>comfortable with
This would also allow the backfiller to put these single core jobs onto those 96 core nodes.

The last thing I want to bring up: the scheduler runs roughly every 2 seconds, which is reflected in the logs you sent us. Each time it runs, it preempts 1 job. So if we are trying to get a job array of 1000 tasks running, it would take about an hour at minimum, with preemption alone, to get the jobs onto their own nodes. This does not include wait time behind other same-priority or higher-priority jobs.
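
As a rough sanity check of that estimate (one preemption per ~2-second scheduler run is an assumption from the logs; cycles that preempt nothing, RPC load, and requeue latency push the real figure up toward the hour quoted):

```python
# Lower bound on the time to preempt resources for a 1000-task job array,
# assuming exactly one preemption per main-scheduler run.
tasks = 1000
seconds_per_run = 2                 # approximate scheduler interval

floor_minutes = tasks * seconds_per_run / 60
print(f"at least {floor_minutes:.0f} minutes of preemption alone")
```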

In short, try tuning bf_max_job_part and see if you can add GrpTRES=CPU to your system.

~Colby
Comment 22 Colby Ashley 2021-01-14 15:39:45 MST
Hey Sophie,

Have you had the chance to try some of our tuning suggestions? Are you still experiencing slow preemption?

~Colby
Comment 23 Colby Ashley 2021-01-27 15:53:12 MST
Closing this out, if you still need help or have questions let us know.