Ticket 9365 - unsteady backfilling
Summary: unsteady backfilling
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.02.3
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Duplicates: 10271
Depends on:
Blocks:
 
Reported: 2020-07-09 04:15 MDT by Brigitte May
Modified: 2020-12-16 09:54 MST

See Also:
Site: KIT
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
sdiag_20200709 (22.55 KB, text/plain)
2020-07-09 04:58 MDT, Brigitte May
Details
slurm.conf (283.19 KB, text/plain)
2020-07-09 05:09 MDT, Brigitte May
Details
sdiag after reset with sdiag -r (5.82 KB, text/plain)
2020-07-10 03:01 MDT, Brigitte May
Details
slurmctld.log since [2020-07-10T09:52:22.021] (5.14 MB, text/plain)
2020-07-10 03:32 MDT, Brigitte May
Details
squeue --start (127.32 KB, text/plain)
2020-07-10 07:55 MDT, Brigitte May
Details
backfill testing in comparison with and without bf_running_job_reserve (125.74 KB, text/plain)
2020-07-30 04:17 MDT, Brigitte May
Details
sdiag (20.45 KB, text/plain)
2020-07-30 04:17 MDT, Brigitte May
Details
squeue --start (110.24 KB, text/plain)
2020-07-30 04:18 MDT, Brigitte May
Details
perf record -s --call-graph dwarf -p 31126 sleep 600;perf archive perf.data (6.84 MB, application/x-bzip)
2020-08-05 07:00 MDT, karl-heinz.schmidmeier
Details
perf record -s --call-graph dwarf -p 31126 sleep 600;bzip2 perf.data (6.84 MB, application/x-bzip)
2020-08-05 08:23 MDT, karl-heinz.schmidmeier
Details
perf.data.bz2 (6.67 MB, application/octet-stream)
2020-08-06 00:21 MDT, karl-heinz.schmidmeier
Details

Description Brigitte May 2020-07-09 04:15:40 MDT
Hi,
 in the output of the "sdiag" command, the Depth Mean is always less than the Last queue length.
03.07.2020:
        Depth Mean (try depth): 375
        Last queue length: 999
09.07.2020:
        Depth Mean: 797
        Depth Mean (try depth): 626
        Last queue length: 1099

We started with
[2020-06-18T14:45:12.141] backfill: completed testing 8(8) jobs
After we set SchedulerParameters=bf_continue,bf_max_time=60, the number of jobs tested per backfill cycle increased to:
[2020-07-09T11:35:14.797] backfill: completed testing 375(1) jobs, usec=844880
[2020-07-09T11:36:44.162] backfill: completed testing 403(3) jobs, usec=1534323

Why isn't it possible to test all jobs with the backfill algorithm?


Thanks!
-- 
Brigitte May
Comment 1 Dominik Bartkiewicz 2020-07-09 04:46:41 MDT
Hi

Can you send me full sdiag output and slurm.conf?

Dominik
Comment 2 Brigitte May 2020-07-09 04:58:00 MDT
Created attachment 14961 [details]
sdiag_20200709
Comment 3 Brigitte May 2020-07-09 05:09:44 MDT
Created attachment 14962 [details]
slurm.conf
Comment 4 Dominik Bartkiewicz 2020-07-09 05:24:56 MDT
Hi

Could you try to use these SchedulerParameters?

SchedulerParameters = bf_max_job_test=400,bf_max_time=120,bf_continue,bf_resolution=120,bf_window=8640

After applying this change, please send me the next sdiag output (covering ~5 backfill cycles) so we can make the next tuning iteration.

Dominik
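
For reference, an annotated sketch of the suggested slurm.conf fragment; the per-option notes are paraphrased from the slurm.conf documentation and should be verified against your Slurm version:

```ini
# Backfill-related SchedulerParameters used in this ticket (notes paraphrased
# from the slurm.conf man page; verify against your Slurm version):
#   bf_max_job_test=400  - max number of jobs to attempt to schedule per backfill cycle
#   bf_max_time=120      - max time in seconds one backfill cycle may spend
#   bf_continue          - continue the cycle after yielding locks instead of restarting
#   bf_resolution=120    - time resolution in seconds of the backfill reservation map
#   bf_window=8640       - minutes into the future to consider (8640 min = 6 days)
SchedulerParameters=bf_max_job_test=400,bf_max_time=120,bf_continue,bf_resolution=120,bf_window=8640
```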
Comment 5 Brigitte May 2020-07-10 02:16:05 MDT
Hi,

  Fr Jul 10-10:06:34 (34/10233) -  ACTIVE
/etc/slurm# scontrol reconfigure

Fr Jul 10-10:07:08 (36/10235) -  ACTIVE
/etc/slurm# scontrol show config | grep SchedulerParameters
SchedulerParameters     = bf_max_job_test=400,bf_max_time=120,bf_continue,bf_resolution=120,bf_window=8640

Fr Jul 10-10:08:21 (15/665)
/var/log/slurm# tail -5000 slurmctld.log | grep  "backfill: completed testing"
[2020-07-10T10:07:01.696] backfill: completed testing 182(182) jobs, usec=12101456
[2020-07-10T10:08:06.035] backfill: completed testing 436(251) jobs, usec=188765
[2020-07-10T10:08:55.183] backfill: completed testing 437(251) jobs, usec=207134
[2020-07-10T10:09:44.397] backfill: completed testing 437(251) jobs, usec=205988
[2020-07-10T10:10:37.730] backfill: completed testing 437(251) jobs, usec=205493

Fr Jul 10-10:11:24 (16/666)
/var/log/slurm# tail -5000 slurmctld.log | grep  "backfill: completed testing"
[2020-07-10T10:08:55.183] backfill: completed testing 437(251) jobs, usec=207134
[2020-07-10T10:09:44.397] backfill: completed testing 437(251) jobs, usec=205988
[2020-07-10T10:10:37.730] backfill: completed testing 437(251) jobs, usec=205493
[2020-07-10T10:11:35.253] backfill: completed testing 439(246) jobs, usec=185437
[2020-07-10T10:12:47.049] backfill: completed testing 438(246) jobs, usec=185180

Brigitte
Comment 6 Dominik Bartkiewicz 2020-07-10 02:26:54 MDT
Hi

Could you reset the backfill statistics with "sdiag -r" and send me the sdiag output grabbed after ~15 minutes?

Dominik
Comment 7 Dominik Bartkiewicz 2020-07-10 02:40:23 MDT
Hi

I forgot, could you also send me slurmctld.log?

Dominik
Comment 8 Brigitte May 2020-07-10 03:01:38 MDT
Created attachment 14977 [details]
sdiag after reset with sdiag -r
Comment 9 Brigitte May 2020-07-10 03:32:31 MDT
Created attachment 14978 [details]
slurmctld.log since [2020-07-10T09:52:22.021]
Comment 10 Dominik Bartkiewicz 2020-07-10 05:20:28 MDT
A log entry like this means that all eligible jobs from the queue were processed:
"backfill: reached end of job queue"

Could you send me the output from "squeue --start"?

In the log, I noticed that uc1n679 seems to be misconfigured.

-----

I would also like to point out that we take our severity levels very seriously
and ask that you set the severity accordingly: severity 1 and severity 2
tickets disrupt work we are currently engaged in and are also tied to our
service level agreements. The severity should reflect the impact on the system
only. In this case, it seems like you are asking for configuration assistance,
which is best suited to a severity 3 or 4.

Below is a link to the support site which describes ticket severity.

https://www.schedmd.com/support.php

SEVERITY LEVELS
Severity 1 — Major Impact

A Severity 1 issue occurs when there is a continued system outage that affects
a large set of end users. The system is down and non-functional due to Slurm
problem(s) and no procedural workaround exists.
Severity 2 — High Impact

A Severity 2 issue is a high-impact problem that is causing sporadic outages or
is consistently encountered by end users with adverse impact to end user
interaction with the system.
Severity 3 — Medium Impact

A Severity 3 issue is a medium-to-low impact problem that includes partial
non-critical loss of system access or which impairs some operations on the
system but allows the end user to continue to function on the system with
workarounds.
Severity 4 — Minor Issues

A Severity 4 issue is a minor issue with limited or no loss in functionality
within the customer environment. Severity 4 issues may also be used for
recommendations for future product enhancements or modifications.


Dominik
Comment 11 Brigitte May 2020-07-10 07:55:52 MDT
Created attachment 14980 [details]
squeue --start
Comment 12 Brigitte May 2020-07-10 07:57:04 MDT
Hi,

  I changed the importance to 3 - medium impact.

Thanks,

Brigitte
Comment 13 Dominik Bartkiewicz 2020-07-10 10:13:51 MDT
Hi

It seems that backfill is processing all eligible jobs.

The difference between the number of processed jobs and the number of jobs in the queue is caused by the many jobs that are held due to AssocGrpJobsLimit, PartitionConfig, and Dependency.

As additional tuning, you can try adding "bf_running_job_reserve" to SchedulerParameters and slightly increasing "bf_max_time".

Dominik
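
To illustrate comment 13: jobs held by limits or dependencies are not eligible, so backfill skips them, and the tested-job count can stay well below the queue length. A minimal sketch with made-up sample output in the shape of `squeue -t PD -o "%i %r"` (job id plus pending reason; the reason codes are the ones named above):

```python
from collections import Counter

# Hypothetical sample of `squeue -t PD -o "%i %r"` output; not real cluster data.
SQUEUE_OUTPUT = """\
JOBID REASON
1001 Priority
1002 AssocGrpJobsLimit
1003 Dependency
1004 Resources
1005 PartitionConfig
1006 AssocGrpJobsLimit
"""

# Jobs pending for these reasons are not eligible, so backfill never tests them.
HELD_REASONS = {"AssocGrpJobsLimit", "PartitionConfig", "Dependency"}

def count_reasons(squeue_text: str) -> Counter:
    """Tally pending reasons, skipping the header line."""
    lines = squeue_text.strip().splitlines()[1:]
    return Counter(line.split(None, 1)[1] for line in lines)

reasons = count_reasons(SQUEUE_OUTPUT)
held = sum(n for r, n in reasons.items() if r in HELD_REASONS)
print(reasons)  # per-reason counts
print(held)     # number of queued jobs backfill will skip
```

Run against real `squeue` output, the `held` total is roughly the gap between "Last queue length" and "Depth Mean (try depth)" in sdiag.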
Comment 14 Dominik Bartkiewicz 2020-07-29 08:53:20 MDT
Hi

Any news on this issue?

Dominik
Comment 15 Brigitte May 2020-07-30 02:58:55 MDT
Hi,

  Last week I changed the SchedulerParameters to

SchedulerParameters=bf_max_job_test=400,bf_max_time=150,bf_continue,bf_resolution=120,bf_window=8640,bf_running_job_reserve

for one day, but the backfilling was not good.
Then I set
SchedulerParameters=bf_max_job_test=400,bf_max_time=90,bf_continue,bf_resolution=120,bf_window=8640

At the moment, nodes are often idle and many jobs are waiting.
Do Jul 30-10:43:12 (11/10892) -  ACTIVE
root@uc2n999:/etc/slurm# sinfo -t idle
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
dev_single        up      30:00      0    n/a
single            up 3-00:00:00      0    n/a
dev_multiple      up      30:00      8   idle uc2n[001-008]
multiple          up 3-00:00:00      0    n/a
fat               up 3-00:00:00      0    n/a
dev_multiple_e    up      30:00      7   idle uc1n[601-607]
multiple_e        up 3-00:00:00      3   idle uc1n[744,866,900]
jupyter_uc1e      up 3-00:00:00      2   idle uc1n[611-612]
dev_special       up      30:00      2   idle uc1n[931-932]
special           up 3-00:00:00      0    n/a
gpu_4             up 2-00:00:00      0    n/a
gpu_8             up 2-00:00:00      5   idle uc2n[508,510,515-516,518]
slurm             up   infinite      0    n/a
tsmserver         up   infinite      0    n/a
login             up   infinite      0    n/a
headnode          up   infinite      0    n/a

Now I will go back to the previous parameters:
SchedulerParameters=bf_max_job_test=400,bf_max_time=120,bf_continue,bf_resolution=120,bf_window=8640

because the job scheduling behaviour since last week has been bad.

Brigitte
Comment 16 Dominik Bartkiewicz 2020-07-30 03:20:23 MDT
(In reply to Brigitte May from comment #15)
> Hi,
> 
>   last week I changed the SchedulerParameters 
> 
> SchedulerParameters=bf_max_job_test=400,bf_max_time=150,bf_continue,
> bf_resolution=120,bf_window=8640,bf_running_job_reserve
> 
>  for one day, but the backfilling was not good.
Hi

Can you describe what 'not good' means?

> Then I set 
> SchedulerParameters=bf_max_job_test=400,bf_max_time=90,bf_continue,
> bf_resolution=120,bf_window=8640
> 
> At the moment often nodes are idle and many  jobs are waiting .
> [...]
> 
> Now I will go back to the Parameters before:
> SchedulerParameters=bf_max_job_test=400,bf_max_time=120,bf_continue,
> bf_resolution=120,bf_window=8640
> 
> because the behaviour of the job scheduling since last week is bad.
Can you send me outputs from  'squeue --start' and sdiag?

As I mentioned in comment 13, I think increasing bf_max_time is the direction you should follow.

Dominik
Comment 17 Brigitte May 2020-07-30 04:14:21 MDT
Hi,

  In comparison to 22.07.2020,
[2020-07-22T03:41:45.228] backfill: completed testing 619(2) jobs
the number of jobs tested by backfill was smaller (see attachment 2020_07_23).

At the moment we have tickets about long latency in the multiple class and in interactive mode.

See the attached outputs for more information.

Thanks!

Brigitte
Comment 18 Brigitte May 2020-07-30 04:17:13 MDT
Created attachment 15238 [details]
backfill testing in comparison with and without bf_running_job_reserve
Comment 19 Brigitte May 2020-07-30 04:17:49 MDT
Created attachment 15239 [details]
sdiag
Comment 20 Brigitte May 2020-07-30 04:18:46 MDT
Created attachment 15240 [details]
squeue --start
Comment 21 Dominik Bartkiewicz 2020-07-30 05:03:46 MDT
Hi

When exactly did you enable bf_running_job_reserve?
I don't see that enabling this parameter made backfill work less efficiently.

From sdiag I noticed that someone (I suspect some script) sends over 200 RPCs/sec.
This number of RPCs can kill scheduling performance.
Do you know who is generating all these requests, and why?

sdiag_20200730:

...
	REQUEST_JOB_INFO_SINGLE                 ( 2021) count:5958111 ave_time:476721 total_time:2840358163735
	REQUEST_JOB_INFO                        ( 2003) count:1801077 ave_time:502850 total_time:905671648182
...
	om0394          (  223703) count:16392346 ave_time:175409 total_time:2875379660294
	root            (       0) count:3300217 ave_time:309919 total_time:1022802299381
	bf4607          (  218591) count:1931293 ave_time:178975 total_time:345654240236
	hu_mathlout     (  927457) count:732438 ave_time:332149 total_time:243279257036
...

Dominik
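
As a rough sanity check on the rate quoted above: sdiag counters accumulate since the last reset, so dividing the counts by the measurement window gives an average rate. A small sketch using the two counts from the excerpt; the 10-hour window is a hypothetical assumption for illustration (use your real time since the last `sdiag -r`):

```python
# RPC counts copied from the sdiag_20200730 excerpt in this ticket.
rpc_counts = {
    "REQUEST_JOB_INFO_SINGLE": 5_958_111,
    "REQUEST_JOB_INFO": 1_801_077,
}

def rpc_rate(counts: dict, window_seconds: float) -> float:
    """Average RPCs per second over the measurement window."""
    return sum(counts.values()) / window_seconds

# Hypothetical: assume ~10 hours since the counters were last reset.
rate = rpc_rate(rpc_counts, window_seconds=10 * 3600)
print(f"{rate:.0f} RPCs/sec")  # ~216 under this window assumption
```

With these two RPC types alone accounting for over 200 requests/sec under that assumption, the "kill scheduling performance" concern is plausible, since each of these RPCs competes with the scheduler for the controller's internal locks.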
Comment 22 Brigitte May 2020-07-30 08:26:27 MDT
Hi,

  I enabled bf_running_job_reserve from 2020-07-23 15:21:19 until 2020-07-24 11:16:31 (both via scontrol reconfigure).

Brigitte
Comment 23 Brigitte May 2020-07-30 08:48:59 MDT
Hi,

  I don't know who is generating all these requests, or why.
Do you mean especially the accounts om0394 and so on?

Brigitte
Comment 24 Dominik Bartkiewicz 2020-07-31 04:20:31 MDT
Hi

Yes, I think the amount of requests coming from users is one of the root causes of this issue. Maybe you can ask whether they use any script/tool that generates this amount of RPCs.

Dominik
Comment 25 Dominik Bartkiewicz 2020-08-03 05:55:17 MDT
Hi

Can you grab ~10 minutes of perf data (both perf.data.tar.bz2 and perf.data)?
Maybe this will show us some bottleneck.

e.g.:
perf record -s --call-graph dwarf -p `pidof slurmctld`
perf archive perf.data

Then send both perf.data.tar.bz2 and perf.data.

Dominik
Comment 26 Brigitte May 2020-08-03 09:08:39 MDT
Hi,

  My colleagues will send you the data. I'm on holiday until 24 August.

Kind regards
Brigitte
Comment 27 karl-heinz.schmidmeier 2020-08-05 07:00:10 MDT
Created attachment 15320 [details]
perf record -s --call-graph dwarf -p 31126 sleep 600;perf archive perf.data

Hi,

I send you the file perf.data.tar.bz2 which you requested.

Best regards
Karl-Heinz
Comment 28 Brigitte May 2020-08-05 07:00:25 MDT
Dear Sir or Madam,

   I will be reachable again from 25.08.2020. E-mails are not automatically forwarded during my absence.

Kind regards
Brigitte May
Comment 29 Dominik Bartkiewicz 2020-08-05 07:20:56 MDT
Hi

Thanks.
I know this is confusing, but perf.data.tar.bz2 and perf.data contain
different data, and I need them both.

Dominik
Comment 30 karl-heinz.schmidmeier 2020-08-05 08:23:04 MDT
Created attachment 15321 [details]
perf record -s --call-graph dwarf -p 31126 sleep 600;bzip2 perf.data

Hi

I have compressed the perf.data file (bzip2 perf.data). The original was too large. This is my second attempt.  

Best regards 
Karl-Heinz
Comment 31 Dominik Bartkiewicz 2020-08-05 08:32:02 MDT
Hi

I think you accidentally sent me perf.data.tar.bz2 one more time.

Dominik
Comment 32 karl-heinz.schmidmeier 2020-08-05 09:34:22 MDT
Hi

I sent you 2 files.

The first file was created by the command "perf record -s --call-graph
dwarf -p 31126 sleep 600". The output was perf.data, and then I created
perf.data.tar.bz2 with the command "perf archive perf.data".

The original perf.data file took too long to send. That's why I
compressed it with the command "bzip2 perf.data".

Was this correct, or do you need the uncompressed perf.data file?

Best regards



Comment 33 Dominik Bartkiewicz 2020-08-05 09:48:40 MDT
Hi
 
attachment 15321 [details] and attachment 15320 [details] are the same.
I don't see perf.data.bz2.

Dominik
Comment 34 karl-heinz.schmidmeier 2020-08-06 00:21:48 MDT
Created attachment 15326 [details]
perf.data.bz2

Hi

Sorry, but something went wrong yesterday.

Best regards

Karl-Heinz

Comment 41 Brigitte May 2020-08-28 05:49:08 MDT
Hi,

  Unfortunately, I'm not authorized to access bug #9592, which you sent me in the mail from 17.08.2020. Could you change this?

Thanks!
Brigitte
Comment 42 Dominik Bartkiewicz 2020-08-28 09:54:01 MDT
Hi

Sorry for the spam; that was only an automatic mail from Bugzilla.
Bug 9592 is internal (readable only by SchedMD staff). It was opened to track a potential performance issue in part_data_build_row_bitmaps().
That bug was a false positive and is closed now.

Dominik
Comment 51 Dominik Bartkiewicz 2020-10-12 05:21:53 MDT
Hi

These commits address one of the hot spots shown in perf:
https://github.com/SchedMD/slurm/compare/cd1f0094dee...3f196e097641

These commits will be included in 20.11.

After these changes, the configuration "PreemptType=preempt/qos + SelectType=select/cons_tres"
should be significantly faster on systems with a big number of running jobs.

Can we close this ticket now?
As I wrote in comment 13, backfill on your system works correctly and processes all eligible jobs, even under the huge load generated by user RPCs.

Dominik
Comment 52 Brigitte May 2020-10-14 09:00:14 MDT
Hi,

  thank you very much!!! You can close the ticket.

Kind regards,
Brigitte May
Comment 54 Ben Roberts 2020-12-16 09:54:50 MST
*** Ticket 10271 has been marked as a duplicate of this ticket. ***