Ticket 7442 - backfilling stats
Summary: backfilling stats
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 18.08.8
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-07-19 15:05 MDT by Kilian Cavalotti
Modified: 2020-01-09 08:27 MST

See Also:
Site: Stanford
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name: Sherlock
CLE Version:
Version Fixed: 20.02.0pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
backfill logs (4.28 MB, application/x-bzip)
2019-07-22 09:47 MDT, Kilian Cavalotti
backfill progress rate (10.42 KB, image/png)
2019-07-22 09:50 MDT, Kilian Cavalotti
_will_run_test patch (3.04 KB, patch)
2019-07-23 12:04 MDT, Dominik Bartkiewicz
slurmctld logs (70.60 KB, application/x-bzip)
2019-08-07 11:36 MDT, Kilian Cavalotti
Extra logging patch (672 bytes, patch)
2019-08-19 08:20 MDT, Dominik Bartkiewicz
slurmctld w/ extra logging (122.72 KB, application/x-bzip)
2019-08-19 10:22 MDT, Kilian Cavalotti
bf improvement (1.78 KB, patch)
2019-08-28 15:42 MDT, Dominik Bartkiewicz
backfill progress rate w/ patch (10.48 KB, image/png)
2019-09-27 09:57 MDT, Kilian Cavalotti

Description Kilian Cavalotti 2019-07-19 15:05:14 MDT
Hi!

We're still exploring possible ways to limit the number of jobs pending with a reason of "(None)" on Sherlock (cf. bug #7361), and looking at the backfilling stats in sdiag, there are a few things that are not very clear to me:

We have this in SchedulerParameters:

SchedulerParameters=\
default_queue_depth=10000,\
partition_job_depth=100,\
max_sched_time=5,\
bf_continue,\
bf_busy_nodes,\
bf_interval=60,\
bf_max_time=900,\
bf_yield_interval=1000000,\
bf_yield_sleep=300000,\
bf_window=10200,\
bf_resolution=1800,\
bf_max_job_test=30000,\
bf_max_job_array_resv=10,\
bf_job_part_count_reserve=100,\
bf_min_age_reserve=3600,\
bf_min_prio_reserve=20000,\
kill_invalid_depend,\
nohold_on_prolog_fail,\
pack_serial_at_end,\
preempt_youngest_first,\
batch_sched_delay=5,\
jobs_per_user_per_hour=10000,\
max_rpc_cnt=64


and here's the output of sdiag:

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
        Total backfilled jobs (since last slurm start): 28161
        Total backfilled jobs (since last stats cycle start): 12485
        Total backfilled heterogeneous job components: 0
        Total cycles: 79
        Last cycle when: Fri Jul 19 13:47:06 2019 (1563569226)
        Last cycle: 741337384
        Max cycle:  750720216
        Mean cycle: 684563388
        Last depth cycle: 2292
        Last depth cycle (try sched): 2107
        Depth Mean: 4418
        Depth Mean (try depth): 3487
        Last queue length: 12896
        Queue length mean: 10171


I'm particularly curious about the "depth" values. Given bf_max_job_test=30000, I would have expected the depth values to be more than ~3,000. We typically have 20,000+ jobs in queue, so there's plenty of jobs to test.

What am I missing here?

Thanks!
--
Kilian
Comment 1 Dominik Bartkiewicz 2019-07-22 05:14:49 MDT
Hi

bf_max_job_test is the maximum number of jobs tested.
A backfill cycle can break earlier if bf_max_time (yield time is included) is hit, or if, after yielding locks, the job, partition, or config state changes:

https://github.com/SchedMD/slurm/blob/0a6cd4121e403994695481a0c685bea59c4ef2ad/src/plugins/sched/backfill/backfill.c#L1518
https://github.com/SchedMD/slurm/blob/0a6cd4121e403994695481a0c685bea59c4ef2ad/src/plugins/sched/backfill/backfill.c#L1059

If you enable the Backfill debug flag, you should see this message:
"backfill: system state changed, ..."

Did I answer your question?
If you have additional questions about this, please let me know.

Dominik
Comment 2 Kilian Cavalotti 2019-07-22 09:25:05 MDT
(In reply to Dominik Bartkiewicz from comment #1)
> Hi
> 
> bf_max_job_test is the maximum number of tested job.
> BF cycle can be break earlier if bf_max_time (yield time is included) is hit
> or after yielding lock jobs, parts or config state changes:
> 
> https://github.com/SchedMD/slurm/blob/
> 0a6cd4121e403994695481a0c685bea59c4ef2ad/src/plugins/sched/backfill/backfill.
> c#L1518
> https://github.com/SchedMD/slurm/blob/
> 0a6cd4121e403994695481a0c685bea59c4ef2ad/src/plugins/sched/backfill/backfill.
> c#L1059
> 
> If you enable debug flag backfill you should see this message:
> "backfill: system state changed, ..."
> 
> Did I answer your question?
> If you do have additional questions about this please let me know.

Thanks for the additional information, but things are still not very clear to me.

If I understand correctly, with:
* 10,000 pending jobs in queue
* bf_max_job_test=30000
* bf_max_time=900
* sdiag's backfilling stats reporting a max cycle of 750720216µs
all of the pending jobs should get tested by the backfill scheduler each pass, right?

So why does it only test ~2,000?

        Last depth cycle: 2292
        Last depth cycle (try sched): 2107

As far as I understand it, our settings should allow for sufficient time for a full backfill cycle to test all of the ~10,000 pending jobs, not just 2,000.
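As a quick sanity check of that reasoning (sdiag reports cycle times in microseconds), the figures quoted above can be compared directly:

```python
# Sanity check on the numbers above: sdiag cycle times are in microseconds,
# so even the max cycle observed fits comfortably within the bf_max_time budget.

bf_max_time_sec = 900          # from SchedulerParameters: bf_max_time=900
max_cycle_usec = 750_720_216   # "Max cycle" from sdiag

max_cycle_sec = max_cycle_usec / 1_000_000
print(f"max cycle: {max_cycle_sec:.1f} s of {bf_max_time_sec} s budget")
# ~750.7 s < 900 s: the cycle is not simply running out of time
```

So hitting bf_max_time alone does not explain testing only ~2,000 of ~10,000 pending jobs; one of the other early-exit conditions must be triggering.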

There's no config or partition change happening every cycle, so I don't think that's why the backfill scheduler doesn't test all of the jobs.

I'll try to enable +Backfill debug and see if it shows any hint.

Thanks!
-- 
Kilian
Comment 3 Kilian Cavalotti 2019-07-22 09:47:58 MDT
Created attachment 10978 [details]
backfill logs

So, I enabled debug3 and +Backfill, and here is the resulting slurmctld log file for that full backfill cycle. It ended up testing only 1,036 jobs in that cycle, about 10% of the total pending jobs.

backfill: completed testing 1036(1) jobs, usec=758600
Comment 4 Kilian Cavalotti 2019-07-22 09:50:57 MDT
Created attachment 10980 [details]
backfill progress rate

I also noted that the cycle starts by testing jobs at a rapid pace, before starting to slow down and plateau after a few minutes.

I graphed the "backfill: yielding locks after testing XXX jobs" messages vs time, and it looks like the backfill scheduler mostly stalls at some point and doesn't progress anymore (see attached graph).
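The progress points behind a graph like this can be extracted with a short script (a sketch only; the timestamp format is an assumption and may need adjusting to your slurmctld log configuration):

```python
# Sketch: extract (elapsed_seconds, jobs_tested) points from slurmctld logs
# to plot backfill progress, based on the "yielding locks" messages.
# The assumed line shape (timestamp prefix included) is, e.g.:
# [2019-07-22T09:10:00.000] backfill: yielding locks after testing 500(1) jobs

import re
from datetime import datetime

LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] backfill: yielding locks after testing "
    r"(?P<jobs>\d+)"
)

def progress_points(lines):
    """Yield (seconds_since_first_match, jobs_tested) tuples."""
    t0 = None
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts")[:23], "%Y-%m-%dT%H:%M:%S.%f")
        if t0 is None:
            t0 = ts
        yield (ts - t0).total_seconds(), int(m.group("jobs"))
```

Plotting these tuples (e.g. with gnuplot or matplotlib) reproduces the progress-rate graph attached here.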

Is that expected?
Comment 5 Dominik Bartkiewicz 2019-07-23 10:45:02 MDT
Hi

It is normal that backfill needs to do more computation when it has more reservations
(more processed jobs), but I didn't expect such a drop in performance (great plot).

Increasing bf_resolution can help in such a situation, but
the current value is already really big and I don't think that increasing it will help much.

You can test the patch from bug 6914; it should help slightly.
You may also be interested in Doug's work in bug 6023.

Dominik
Comment 6 Kilian Cavalotti 2019-07-23 11:19:14 MDT
Hi Dominik, 

(In reply to Dominik Bartkiewicz from comment #5)
> It is normal that bf need to make more computation when it has more
> reservations
> (more processed jobs), but I didn't expect such drop in performance (great
> plot).

Yes, at this point, it looks like the backfill scheduler never reaches the end of the queue, regardless of the time or depth limits.

> Increasing bf_resolution can help in such a situation but
> the current value is really big and I don't think that increasing it will
> help much.
> 
> You can test patch from bug 6914 it should slightly help.
> Also you may be interested in Doug work in bug 6023.

Thanks for the suggestions. I don't have access to either of those bugs, though. :)

Cheers,
--
Kilian
Comment 7 Dominik Bartkiewicz 2019-07-23 12:04:48 MDT
Created attachment 10991 [details]
_will_run_test patch

This is the patch from bug 6914; it slightly changes _will_run_test() behavior. Synthetic tests don't show a huge performance increase, but I am curious whether it can help on a real workload.
If you are interested in testing this patch, let me know.

Bug 6023 contains Doug's improvements to backfill; it looks really promising and we expect to include it in 20.02.
Comment 8 Dominik Bartkiewicz 2019-07-31 23:57:46 MDT
Hi

Any news?

Dominik
Comment 9 Kilian Cavalotti 2019-08-02 11:27:06 MDT
Hi Dominik, 

I gave the patch from bug 6914 that you attached a try, and it made slurmctld segfault right at startup.

The backtrace from the dumped core file is this:

#0  env_array_merge (dest_array=dest_array@entry=0x7f0619cbcde0, src_array=0x210000000000) at env.c:1741
#1  0x0000000000461bc0 in _build_env (job_ptr=0x8229a50, is_epilog=<optimized out>) at job_scheduler.c:3956
#2  0x000000000046ac05 in _run_prolog (arg=0x8229a50) at job_scheduler.c:4524
#3  0x00007f061e6fddd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f061e426ead in clone () from /lib64/libc.so.6


So we reverted to the stock 18.08.8 slurmctld right away, and it started just fine.

And I still don't have access to bug 6023.


But of course, we'd prefer not having to deal with segfaults in production, so any tested and validated fix you could come up with to address the backfill scheduler slowdown would be much appreciated.

Thanks!
-- 
Kilian
Comment 10 Dominik Bartkiewicz 2019-08-05 04:35:05 MDT
(In reply to Kilian Cavalotti from comment #9)
> Hi Dominik, 
> 
> I gave a try at the patch from 6914 you attached, and that made slurmctld
> segfault right when starting.
> 
> The backtrace from the dumped core file is this:
> 
> #0  env_array_merge (dest_array=dest_array@entry=0x7f0619cbcde0,
> src_array=0x210000000000) at env.c:1741
> #1  0x0000000000461bc0 in _build_env (job_ptr=0x8229a50,
> is_epilog=<optimized out>) at job_scheduler.c:3956
> #2  0x000000000046ac05 in _run_prolog (arg=0x8229a50) at job_scheduler.c:4524
> #3  0x00007f061e6fddd5 in start_thread () from /lib64/libpthread.so.0
> #4  0x00007f061e426ead in clone () from /lib64/libc.so.6
> 
> 
> So we reverted to the stock 18.08.8 slurmctld right away, and is started
> just fine.
> 
> And I still don't have access to bug 6023.
> 
> 
> But of course, we'd prefer not having to deal with segfaults in production,
> so any tested and validated fix you could come up with to address the
> backfill scheduler slowdown would be much appreciated.
> 
> Thanks!
> -- 
> Kilian

Hi

Sorry, this patch shouldn't cause any segfault, at least not in this place.
Could you send me the output from gdb:
t a a bt
f 1
p *job_ptr


Dominik
Comment 11 Kilian Cavalotti 2019-08-05 09:28:58 MDT
(In reply to Dominik Bartkiewicz from comment #10)
> Sorry, This patch shouldn't procure any segfault at least not in this place.
> Could you send me the output from gdb:
> t a a bt
> f 1
> p *job_ptr

Sure, here it is:

(gdb) bt
#0  env_array_merge (dest_array=dest_array@entry=0x7f0619cbcde0, src_array=0x210000000000) at env.c:1741
#1  0x0000000000461bc0 in _build_env (job_ptr=0x8229a50, is_epilog=<optimized out>) at job_scheduler.c:3956
#2  0x000000000046ac05 in _run_prolog (arg=0x8229a50) at job_scheduler.c:4524
#3  0x00007f061e6fddd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f061e426ead in clone () from /lib64/libc.so.6
(gdb) t a a bt

Thread 7 (Thread 0x7f061abec700 (LWP 39068)):
#0  0x00007f061e7010b4 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1  0x000000000046d7bf in lock_slurmctld (lock_levels=...) at locks.c:124
#2  0x00007f061a8e069b in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:287
#3  0x00007f061e6fddd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f061e426ead in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f061f0f6740 (LWP 39065)):
#0  0x00007f061e41c20d in poll () from /lib64/libc.so.6
#1  0x00007f061ebe83b6 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fff37f65100) at /usr/include/bits/poll2.h:46
#2  _conn_readable (persist_conn=persist_conn@entry=0x228f4a0) at slurm_persist_conn.c:138
#3  0x00007f061ebe991f in slurm_persist_recv_msg (persist_conn=0x228f4a0) at slurm_persist_conn.c:888
#4  0x00007f061a8e6aed in send_recv_slurmdbd_msg (rpc_version=rpc_version@entry=8448, req=req@entry=0x7fff37f65240, resp=resp@entry=0x7fff37f651e0) at slurmdbd_agent.c:959
#5  0x00007f061a8e6c17 in send_slurmdbd_recv_rc_msg (rpc_version=rpc_version@entry=8448, req=req@entry=0x7fff37f65240, resp_code=resp_code@entry=0x7fff37f6523c) at slurmdbd_agent.c:994
#6  0x00007f061a8e3ee5 in clusteracct_storage_p_register_ctld (db_conn=<optimized out>, port=<optimized out>) at accounting_storage_slurmdbd.c:2528
#7  0x00007f061ebd99a4 in clusteracct_storage_g_register_ctld (db_conn=0x1, port=<optimized out>) at slurm_accounting_storage.c:844
#8  0x000000000042bdfa in main (argc=<optimized out>, argv=<optimized out>) at controller.c:707

Thread 5 (Thread 0x7f061aced700 (LWP 39066)):
#0  0x00007f061e7012ce in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x000000000046d7fc in lock_slurmctld (lock_levels=...) at locks.c:126
#2  0x00000000004203b9 in _agent_retry (mail_too=false, min_wait=999) at agent.c:1526
#3  _agent_init (arg=<optimized out>) at agent.c:1396
#4  0x00007f061e6fddd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f061e426ead in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f061a6d7700 (LWP 39071)):
#0  0x00007f061e701d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f061a8e5996 in _agent (x=<optimized out>) at slurmdbd_agent.c:624
#2  0x00007f061e6fddd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f061e426ead in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f0619fc0700 (LWP 39081)):
#0  0x00007f061e701d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0619fc4f98 in _my_sleep (usec=60000000) at backfill.c:603
#2  0x00007f0619fcbe02 in backfill_agent (args=<optimized out>) at backfill.c:968
#3  0x00007f061e6fddd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f061e426ead in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f061979c700 (LWP 39132)):
#0  0x00007f061e701d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000043a4bc in _heartbeat_thread (no_data=<optimized out>) at heartbeat.c:143
#2  0x00007f061e6fddd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f061e426ead in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f0619cbd700 (LWP 39088)):
#0  env_array_merge (dest_array=dest_array@entry=0x7f0619cbcde0, src_array=0x210000000000) at env.c:1741
#1  0x0000000000461bc0 in _build_env (job_ptr=0x8229a50, is_epilog=<optimized out>) at job_scheduler.c:3956
#2  0x000000000046ac05 in _run_prolog (arg=0x8229a50) at job_scheduler.c:4524
#3  0x00007f061e6fddd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f061e426ead in clone () from /lib64/libc.so.6
(gdb)
(gdb) f 1
#1  0x0000000000461bc0 in _build_env (job_ptr=0x8229a50, is_epilog=<optimized out>) at job_scheduler.c:3956
3956    job_scheduler.c: No such file or directory.
(gdb) p *job_ptr
$1 = {magic = 4038539564, account = 0x8229fa0 "agoldpie", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x8229f70 "sh-101-52", alloc_resp_port = 0, alloc_sid = 148469, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 21892,
  assoc_ptr = 0x23d8d70, batch_features = 0x0, batch_flag = 1, batch_host = 0x7f3b600 "sh-101-54", billable_tres = 1, bit_flags = 504365056, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0,
  cpu_cnt = 1, cpus_per_tres = 0x0, cr_enabled = 0, db_index = 885998136, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x8229850, direct_set_prio = 0, end_time = 1564766489, end_time_exp = 1564766489, epilog_running = false, exit_code = 0,
  fed_details = 0x0, front_end_ptr = 0x0, gids = 0x0, gres_list = 0x0, gres_alloc = 0x8229fd0 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x8229ff0 "", gres_used = 0x0, group_id = 264504, job_id = 47382336, job_next = 0x0, job_array_next_j = 0x0,
  job_array_next_t = 0x0, job_resrcs = 0x822a070, job_state = 16385, kill_on_node_fail = 1, last_sched_eval = 1564766189, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x8229a20}, mail_type = 0, mail_user = 0x0, mem_per_tres = 0x0,
  mcs_label = 0x0, name = 0x8229f40 "cron-edf-scorer", network = 0x0, next_step_id = 0, ngids = 0, nodes = 0x8229ef0 "sh-101-54", node_addr = 0x8226410, node_bitmap = 0x7f3b520, node_bitmap_cg = 0x0, node_cnt = 1, node_cnt_wag = 0, nodes_completing = 0x0,
  origin_cluster = 0x0, other_port = 0, pack_details = 0x0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x8229f20 "normal", part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x300eb00, power_flags = 0 '\000',
  pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 58967, priority_array = 0x0, prio_factors = 0x8229e60, profile = 0, qos_id = 3, qos_ptr = 0x2291310, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0,
  resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x822a040, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8448, start_time = 1564766189, state_desc = 0x0, state_reason = 0,
  state_reason_prev = 0, step_list = 0x81fa290, suspend_time = 0, system_comment = 0x0, time_last_active = 1564766294, time_limit = 5, time_min = 0, tot_sus_time = 0, total_cpus = 1, total_nodes = 1, tres_bind = 0x0, tres_freq = 0x0, tres_per_job = 0x0,
  tres_per_node = 0x0, tres_per_socket = 0x0, tres_per_task = 0x0, tres_req_cnt = 0x822a900, tres_req_str = 0x822a5d0 "1=1,2=6400,4=1,5=1", tres_fmt_req_str = 0x822a600 "cpu=1,mem=6400M,node=1,billing=1", tres_alloc_cnt = 0x822a640,
  tres_alloc_str = 0x822a540 "1=1,2=6400,3=18446744073709551614,4=1,5=1", tres_fmt_alloc_str = 0x822a590 "cpu=1,mem=6400M,node=1,billing=1", user_id = 362704, user_name = 0x0, wait_all_nodes = 0, warn_flags = 1, warn_signal = 10, warn_time = 60, wckey = 0x0,
  req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}


I'll add that both primary and secondary controllers running the patched slurmctld segfaulted in that same place, and that reverting the patch made them both start and run without encountering any problem.

Thanks!
-- 
Kilian
Comment 12 Dominik Bartkiewicz 2019-08-07 11:11:08 MDT
Hi

I am still trying to understand this segfault.
Could you send me the slurmctld.log from daemon restart to crash?

Dominik
Comment 13 Kilian Cavalotti 2019-08-07 11:36:33 MDT
Created attachment 11142 [details]
slurmctld logs

(In reply to Dominik Bartkiewicz from comment #12)
> Hi
> 
> I still try to understand this segfault.
> Could you send me slurmctld.log from demon restart to crash?

Sure, here it is.

Cheers,
-- 
Kilian
Comment 14 Dominik Bartkiewicz 2019-08-09 04:10:20 MDT
Hi


Did you recompile all SPANK plugins after applying this patch?
I have realized that this is required.

Dominik
Comment 15 Kilian Cavalotti 2019-08-09 09:35:04 MDT
Hi Dominik, 

(In reply to Dominik Bartkiewicz from comment #14)
> Did you recompile all spank after applying this patch?
> I have realized that this is required.

No, I did not recompile the SPANK plugins after applying this patch, mostly because we've been operating under the premise that SPANK plugins only need to be recompiled when upgrading to a new major version, so I didn't expect this patch to require it.

Given the outcome of my previous try, and your comment about this patch not showing a huge performance increase in synthetic tests, I'm sure you'll understand that we're not very keen on the idea of trying it again. :)

Cheers,
--
Kilian
Comment 16 Dominik Bartkiewicz 2019-08-09 10:24:31 MDT
Hi

Sorry about that, this is entirely my fault.
I should have mentioned it, but I didn't anticipate this.
This patch changes the job_record struct, which generally can only be done between major versions.

Yes, it doesn't show a huge performance increase in synthetic tests.
However, I believe that on a large system with a lot of varied jobs, this patch can give a significant improvement.

Dominik
Comment 17 Kilian Cavalotti 2019-08-09 19:06:53 MDT
Ok, thanks for the clarification.
And just to be clear, because it changes the job record struct, it means
that all the Slurm daemons have to be restarted, right? What about the jobs
in queue, will they be safe while running with the patch, and without it if
we revert back?

Cheers,
Comment 18 Dominik Bartkiewicz 2019-08-12 02:20:51 MDT
Hi

The patch adds a field to job_record, but that field is never packed into the state files or any RPCs.

A slurmctld with this patch makes/loads the same state files as one without it. The RPC layer is also unchanged (no need to restart or recompile other components).

This patch will be in an official release no earlier than Slurm 20.02.

Dominik
Comment 19 Dominik Bartkiewicz 2019-08-19 08:20:02 MDT
Created attachment 11275 [details]
Extra logging patch

This patch adds extra debug logging. Maybe these logs will show what causes the slow backfilling in the later part of the bf cycle.
Comment 20 Kilian Cavalotti 2019-08-19 10:22:52 MDT
Created attachment 11280 [details]
slurmctld w/ extra logging

Hi Dominik, 

(In reply to Dominik Bartkiewicz from comment #19)
> Created attachment 11275 [details]
> Extra logging patch
> 
> This patch contains extra debug. Maybe this logs will show what causes slow
> backfilling in the later part of bf cycle.

Thanks for this. I applied the patch, and here's the logs that were generated.

Cheers,
--
Kilian
Comment 23 Kilian Cavalotti 2019-08-23 10:43:56 MDT
Hi Dominik, 

> Thanks for this. I applied the patch, and here's the logs that were
> generated.

Did the additional logs bring any clues as to why the backfill scheduler isn't progressing as expected during each cycle?

Cheers,
-- 
Kilian
Comment 24 Dominik Bartkiewicz 2019-08-23 13:33:35 MDT
Hi

Yes, I think I have found a way to optimize the bf code.
However, I need some time to test it. As you know, this code is complicated and needs extra care.
I don't want to introduce a new bug.

Dominik
Comment 25 Kilian Cavalotti 2019-08-23 14:32:41 MDT
(In reply to Dominik Bartkiewicz from comment #24)
> Yes, I think I have found a way to optimize bf code.

Excellent news!

> However, I need some time to test it. As you know, this code is complicated
> and needs extra care.
> I don't want to introduce a new bug.

Absolutely, and no worries at all! I will remove the logging patch now, because it's quite verbose :). Let me know if you need additional logs.

Cheers,
--
Kilian
Comment 33 Dominik Bartkiewicz 2019-08-29 09:44:08 MDT
Hi


The patch from attachment 11398 [details] should prevent needless _try_sched() calls.
I observe a significant increase in bf performance in my synthetic tests.
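As a rough illustration only (this is not the real patch, which is C code inside backfill.c; all names below are hypothetical), the idea of avoiding needless scheduling tests can be sketched as skipping jobs whose resource request is identical to one that already failed this cycle:

```python
# Rough illustration (NOT the actual Slurm patch) of avoiding needless
# scheduling tests: if a job with the same "shape" (partition plus resource
# request) already failed to start in this cycle, later identical jobs can
# be skipped instead of re-running the expensive placement test.

def backfill_pass(jobs, try_sched):
    """jobs: iterable of dicts with 'partition' and 'tres' keys.
    try_sched(job) -> True if the job can start now (the expensive call).
    Returns (started, tried, skipped) counts for the pass."""
    failed_shapes = set()
    started = tried = skipped = 0
    for job in jobs:
        shape = (job["partition"], job["tres"])
        if shape in failed_shapes:
            skipped += 1          # an identical request already failed
            continue
        tried += 1
        if try_sched(job):
            started += 1
        else:
            failed_shapes.add(shape)
    return started, tried, skipped
```

On queues dominated by many identical serial jobs, most placement tests collapse into set lookups, which is consistent with the much larger cycle depths reported later in this ticket.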

Dominik
Comment 34 Kilian Cavalotti 2019-08-29 10:03:09 MDT
Hi Dominik. 

Thanks! I'm gonna try your patch and let you know how it goes.

Cheers,
-- 
Kilian
Comment 35 Kilian Cavalotti 2019-08-29 18:55:11 MDT
Hi Dominik,

At first glance, we seem to be seeing a higher rate of backfilled jobs, and it looks like the backfill scheduler can reach the bottom of the queue now:

Backfilling stats
        Total backfilled jobs (since last slurm start): 43618
        Total backfilled jobs (since last stats cycle start): 5534
        Total backfilled heterogeneous job components: 0
        Total cycles: 17
        Last cycle when: Thu Aug 29 17:49:36 2019 (1567126176)
        Last cycle: 79295833
        Max cycle:  125821032
        Mean cycle: 68739640
        Last depth cycle: 16007
        Last depth cycle (try sched): 3244
        Depth Mean: 16178
        Depth Mean (try depth): 2872
        Last queue length: 15951
        Queue length mean: 15514

Cheers,
-- 
Kilian
Comment 37 Dominik Bartkiewicz 2019-09-05 08:56:22 MDT
Hi


We committed this patch to 20.02 as 9da7d95c61077.
Please let me know if you have any remarks or doubts.

Dominik
Comment 38 Kilian Cavalotti 2019-09-05 09:05:28 MDT
Hi Dominik, 

(In reply to Dominik Bartkiewicz from comment #37)
> We commit this patch to 20.02 as 9da7d95c61077.
> Please let me know if you have any remarks or doubts.

Thanks. Is there any specific reason not to include it in the 19.05 branch?

It looks like a minor code modification that fixes an existing issue in 19.05 and could benefit everyone today, without modifying existing behavior, couldn't it?

Thanks,
-- 
Kilian
Comment 39 Dominik Bartkiewicz 2019-09-06 02:52:59 MDT
Hi

This patch modifies code that has been in backfill for the last 10 years.
We try not to commit enhancement or optimization changes to
released versions.

Dominik
Comment 40 Dominik Bartkiewicz 2019-09-27 07:51:52 MDT
Hi

Do you have any feedback about this patch?

I feel that the lower performance of bf is mostly related to the performance of the select plugin, as in bug 7361.

Is there anything else I can answer here?

Dominik
Comment 41 Kilian Cavalotti 2019-09-27 09:57:39 MDT
Created attachment 11727 [details]
backfill progress rate w/ patch

Hi Dominik, 

> Do you have any feedback about this patch?

I just did another graph of the "backfill: yielding locks after testing XXX jobs" messages vs time, and it looks like the backfill scheduler doesn't stall anymore with that patch, so it's definitely better now.

> I feel that the lower performance of bf is mostly related to the performance
> of select plugin like in bug 7361.

Ok, I'll follow up in #7361, then. You're right they're very likely related.

> Is there anything else I can answer here?

Well, yes, about inclusion in 19.05. 

I understand and agree that minor versions shouldn't introduce new features or change existing behaviors. But that stalling of the bf scheduler looks like a bug to me. This patch is a small and unintrusive fix for that bug, and it has demonstrated its effectiveness and stability, as we've been running it in production for the last month or so.

So couldn't this patch be considered a bug fix and be included in the 19.05.x branch?

Thanks!
--
Kilian
Comment 43 Dominik Bartkiewicz 2019-10-01 04:42:20 MDT
Hi

I am afraid we can't do this.
After internal discussion,
we decided not to include this change in 19.05.

Dominik
Comment 44 Kilian Cavalotti 2019-10-01 09:03:40 MDT
(In reply to Dominik Bartkiewicz from comment #43)
> Hi
> 
> I am afraid that we can't do this.
> After internal discussion,
> we decided that we don't include this change in 19.05. 

Argh, that's disappointing. :(

Cheers,
--
Kilian
Comment 45 Dominik Bartkiewicz 2019-10-31 08:01:14 MDT
Hi

Do you need anything else regarding this ticket or shall we proceed to close
this issue out?

Dominik
Comment 46 Kilian Cavalotti 2019-11-01 16:54:46 MDT
Good to close, thanks!

Cheers,
-- 
Kilian
Comment 48 Christopher Coffey 2020-01-09 08:27:08 MST
Good to see that there is a fix coming for the slow backfill. We've been working around this by limiting the number of jobs inspected per association, and the number of jobs to test in general. I may look at patching our Slurm 19.05.5 with this patch.

I agree with Kilian that this should be patched in 19.05, however.

Best,
Chris