Ticket 8597 - slurmctld memory leak
Summary: slurmctld memory leak
Status: RESOLVED CANNOTREPRODUCE
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 19.05.5
Hardware: Linux
OS: Linux
Severity: 3 - Medium Impact
Assignee: Gavin D. Howard
QA Contact:
URL:
Duplicates: 8887
Depends on:
Blocks:
 
Reported: 2020-02-28 05:10 MST by CSC sysadmins
Modified: 2020-08-12 04:20 MDT
CC List: 4 users

See Also:
Site: CSC - IT Center for Science
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (7.23 KB, text/plain)
2020-02-28 05:10 MST, CSC sysadmins
Details
Account modification script (12.02 KB, application/x-shellscript)
2020-03-11 03:45 MDT, CSC sysadmins
Details
sdiag (19.58 KB, text/plain)
2020-03-11 03:45 MDT, CSC sysadmins
Details
slurmctld log (215.67 KB, text/x-log)
2020-03-11 03:47 MDT, CSC sysadmins
Details
sacctmgr cmds which caused 1GB leak (10.17 KB, application/x-shellscript)
2020-03-16 10:01 MDT, CSC sysadmins
Details
Requested association dump (153.65 KB, application/gzip)
2020-03-20 03:06 MDT, CSC sysadmins
Details
assoc_mgr dump (599.96 KB, application/x-xz)
2020-03-20 11:42 MDT, CSC sysadmins
Details
perf report screenshot (187.20 KB, image/png)
2020-03-31 05:39 MDT, CSC sysadmins
Details
slurmctld log during perf report (384.80 KB, text/x-log)
2020-04-01 02:58 MDT, CSC sysadmins
Details
sacctmgr account modification script which was used during perf report (20.35 KB, text/plain)
2020-04-01 03:04 MDT, CSC sysadmins
Details

Description CSC sysadmins 2020-02-28 05:10:25 MST
Created attachment 13215 [details]
slurm.conf

I'm creating a separate issue for this memory leak:

https://bugs.schedmd.com/show_bug.cgi?id=7928#c49

The leak has been present for a long time (since 19.05.4). I recently updated our Slurm to the latest 19.05 branch, which had a couple of memory-leak fixes, but the leak is still present. We do not know what triggers it. We had a script that monitored slurmctld memory usage, and the leak once happened while a user-accounting script was modifying/adding/removing accounts, but we could not reproduce the problem on the test system, which is not running a heavy production load.

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                        
4680 slurm     20   0   13,7g  12,6g   4116 S   4,7 20,1   7177:39 slurmctld    

(gdb) thread apply all bt

Thread 15 (Thread 0x7f66fd086700 (LWP 4681)):
#0  0x00007f66fc664de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000421414 in _agent_init (arg=<optimized out>) at agent.c:1394
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7f66f8a3b700 (LWP 4682)):
#0  0x00007f66fc664de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f66f8a40b76 in _set_db_inx_thread (no_data=<optimized out>)
    at accounting_storage_slurmdbd.c:451
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f66f893a700 (LWP 4684)):
#0  0x00007f66fc664de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f66f8a45a96 in _agent (x=<optimized out>) at slurmdbd_agent.c:626
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f66f8226700 (LWP 4698)):
#0  0x00007f66fc662d62 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007f66fcb2ca28 in list_free_aux (x=x@entry=0x6bb8680, 
    pfreelist=pfreelist@entry=0x7f66fce6d2d8 <list_free_lists>) at list.c:987
#2  0x00007f66fcb2cce6 in list_free (l=0x6bb8680) at list.c:900
#3  list_destroy (l=0x6bb8680) at list.c:230
#4  0x00007f66fb229026 in cr_destroy_node_data (node_usage=node_usage@entry=0x7f634f816e80, 
    node_data=node_data@entry=0x0) at select_cons_tres.c:2684
#5  0x00007f66fb225e1f in will_run_test (job_ptr=job_ptr@entry=0x7f634f7084d0, 
    node_bitmap=node_bitmap@entry=0x7f634f1ba000, min_nodes=min_nodes@entry=1, 
    max_nodes=max_nodes@entry=1, req_nodes=req_nodes@entry=1, job_node_req=job_node_req@entry=1, 
    preemptee_candidates=preemptee_candidates@entry=0x0, 
    preemptee_job_list=preemptee_job_list@entry=0x7f66f8225a48, exc_core_bitmap=0x0)
    at job_test.c:6921
#6  0x00007f66fb2266ef in select_p_job_test (job_ptr=<optimized out>, node_bitmap=0x7f634f1ba000, 
    min_nodes=1, max_nodes=1, req_nodes=1, mode=<optimized out>, preemptee_candidates=0x0, 
    preemptee_job_list=0x7f66f8225a48, exc_core_bitmap=0x0) at select_cons_tres.c:1244
#7  0x00007f66fcb37f37 in select_g_job_test (job_ptr=job_ptr@entry=0x7f634f7084d0, 
    bitmap=bitmap@entry=0x7f634f1ba000, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=1, 
    req_nodes=req_nodes@entry=1, mode=mode@entry=2, 
    preemptee_candidates=preemptee_candidates@entry=0x0, 
    preemptee_job_list=preemptee_job_list@entry=0x7f66f8225a48, 
    exc_core_bitmap=exc_core_bitmap@entry=0x0) at node_select.c:539
#8  0x00007f66f822cfbd in _try_sched (job_ptr=0x7f634f7084d0, 
    avail_bitmap=avail_bitmap@entry=0x7f66f8225c60, min_nodes=1, max_nodes=1, req_nodes=1, 
    exc_core_bitmap=0x0) at backfill.c:586
#9  0x00007f66f82301d8 in _attempt_backfill () at backfill.c:2226
#10 0x00007f66f823225b in backfill_agent (args=<optimized out>) at backfill.c:1015
#11 0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f66f8125700 (LWP 4703)):
#0  0x00007f66fc664de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000433120 in _agent_thread (arg=<optimized out>) at fed_mgr.c:2225
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f66f39dd700 (LWP 4704)):
#0  0x00007f66fc664de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000437129 in _fed_job_update_thread (arg=<optimized out>) at fed_mgr.c:2183
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f66f38dc700 (LWP 4705)):
#0  0x00007f66fc664de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f66f34d5a3a in _decay_thread (no_data=<optimized out>) at priority_multifactor.c:1405
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f66f37db700 (LWP 4706)):
#0  0x00007f66fc37ec2d in poll () from /lib64/libc.so.6
#1  0x000000000042ab6e in poll (__timeout=-1, __nfds=1, __fds=<optimized out>)
    at /usr/include/bits/poll2.h:46
#2  _slurmctld_rpc_mgr (no_data=<optimized out>) at controller.c:1183
#3  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f66f30cb700 (LWP 4707)):
#0  0x00007f66fc6683c1 in sigwait () from /lib64/libpthread.so.0
#1  0x000000000042f6b6 in _slurmctld_signal_hand (no_data=<optimized out>) at controller.c:1059
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f66f2fca700 (LWP 4708)):
#0  0x00007f66fc664a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004ae4fb in slurmctld_state_save (no_data=<optimized out>) at state_save.c:176
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f66f2dc8700 (LWP 4710)):
#0  0x00007f66fc664a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000428c55 in _purge_files_thread (no_data=<optimized out>) at controller.c:3378
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f66f1ca3700 (LWP 5794)):
#0  0x00007f66fc664184 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1  0x00000000004701e8 in lock_slurmctld (lock_levels=...) at locks.c:123
#2  0x000000000049858c in _slurm_rpc_dump_jobs_user (msg=0x7f66f1ca2e50) at proc_req.c:1903
#3  slurmctld_req (msg=msg@entry=0x7f66f1ca2e50, arg=arg@entry=0x7f66e65b4cd0) at proc_req.c:343
#4  0x000000000042a838 in _service_connection (arg=0x7f66e65b4cd0) at controller.c:1282
#5  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f6676ded700 (LWP 5795)):
#0  0x00007f66fc664184 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1  0x00000000004701e8 in lock_slurmctld (lock_levels=...) at locks.c:123
#2  0x000000000049858c in _slurm_rpc_dump_jobs_user (msg=0x7f6676dece50) at proc_req.c:1903
#3  slurmctld_req (msg=msg@entry=0x7f6676dece50, arg=arg@entry=0x7f66e65a3280) at proc_req.c:343
#4  0x000000000042a838 in _service_connection (arg=0x7f66e65a3280) at controller.c:1282
#5  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f66f169d700 (LWP 5796)):
#0  0x00007f66fc664184 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1  0x00000000004701e8 in lock_slurmctld (lock_levels=...) at locks.c:123
#2  0x000000000049858c in _slurm_rpc_dump_jobs_user (msg=0x7f66f169ce50) at proc_req.c:1903
#3  slurmctld_req (msg=msg@entry=0x7f66f169ce50, arg=arg@entry=0x7f66e655f450) at proc_req.c:343
#4  0x000000000042a838 in _service_connection (arg=0x7f66e655f450) at controller.c:1282
#5  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f66fd087740 (LWP 4680)):
#0  0x00007f66fc35084d in nanosleep () from /lib64/libc.so.6
#1  0x00007f66fc381124 in usleep () from /lib64/libc.so.6
#2  0x000000000042e03f in _slurmctld_background (no_data=0x0) at controller.c:1963
#3  main (argc=<optimized out>, argv=<optimized out>) at controller.c:762
(gdb)
Comment 2 Gavin D. Howard 2020-02-28 10:28:19 MST
Thank you for the report. I will get on it.

There are a few things I need from you. First, send me the output of the following command:

> sdiag

Second, can you send me the user accounting script that triggered the leak?

Third, if possible, send the script that you use to monitor slurmctld's memory usage.

Fourth, please send me the logs from both slurmctld and slurmdbd.

Finally, I would appreciate it if you could send me your database.
Comment 5 CSC sysadmins 2020-03-11 01:06:08 MDT
Hi,

Sorry for the late response. I'll send the requested logs and DB dump later this week.

The monitoring script is really simple:

while true; do echo -n "$(date +%T) :"; ps --noheaders -C slurmctld -o rss; sleep 10;done
Comment 6 CSC sysadmins 2020-03-11 03:45:10 MDT
Created attachment 13337 [details]
Account modification script
Comment 7 CSC sysadmins 2020-03-11 03:45:34 MDT
Created attachment 13338 [details]
sdiag
Comment 8 CSC sysadmins 2020-03-11 03:47:48 MDT
Created attachment 13339 [details]
slurmctld log

TIME: slurmctld RSS
10:47:52 :1480460
10:48:12 :1488652
10:48:42 :1580048
10:48:52 :1623364
10:49:02 :1652404
10:49:12 :1691236
10:49:33 :1731304
10:49:43 :1734548
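Logs in this format can be reduced to a net-growth figure with a short helper (a sketch, assuming the `HH:MM:SS :RSS` layout above, RSS in KiB):

```shell
# Print net RSS growth (KiB) for a monitor log whose lines look like
# "10:47:52 :1480460" (wall-clock time, then RSS in KiB).
rss_growth() {
    awk -F' :' 'NR == 1 { first = $2 } { last = $2 } END { print last - first }' "$1"
}
```

For the eight samples above this prints 254088, i.e. roughly 250 MiB of growth in about two minutes.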
Comment 12 Gavin D. Howard 2020-03-12 15:55:01 MDT
I am digging into everything you sent me. I have a few clues, but I may need a better reproducer.

You said you can't use Valgrind because of performance, which makes sense because it generally has a 10x slowdown. However, as far as I know, AddressSanitizer is much faster, only a 2x slowdown. Can you compile slurmctld with AddressSanitizer and run it?

You can do so by passing the `-fsanitize=address` flag to gcc (if it's new enough) or clang.

In fact, if that's still too slow, you can use `-fsanitize=leak` which does less, but should still do what we need.

If you can do that, please send me the output.
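A build along those lines might look like the following sketch (the install prefix and log path are illustrative, not from this ticket; adapt to your RPM packaging):

```shell
# Rebuild Slurm with AddressSanitizer (which includes LeakSanitizer).
# -fno-omit-frame-pointer keeps the stack traces in ASan reports readable.
CFLAGS="-g -fsanitize=address -fno-omit-frame-pointer" \
LDFLAGS="-fsanitize=address" \
./configure --prefix=/opt/slurm-asan
make -j"$(nproc)" && make install

# Leak reports are emitted at process exit; log_path writes them to
# /var/log/slurm/asan.<pid> instead of stderr.
ASAN_OPTIONS="detect_leaks=1:log_path=/var/log/slurm/asan" \
    /opt/slurm-asan/sbin/slurmctld -D
```

One caveat: LeakSanitizer only prints its report when the process exits, so a long-running slurmctld will show an empty log until it is shut down cleanly.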
Comment 13 CSC sysadmins 2020-03-13 08:48:10 MDT
Hi,

I rebuilt the Slurm RPMs against libasan, and the debug overhead seems tolerable. The debug build was running for ~5 h today, but it did not catch any leak. Will it write leaks to a log file, or should I run it from the command line?

-Tommi
Comment 14 Gavin D. Howard 2020-03-13 09:28:43 MDT
ASAN outputs to stderr, so you will need to redirect that to a file somehow.
Comment 15 CSC sysadmins 2020-03-16 10:01:39 MDT
Created attachment 13381 [details]
sacctmgr cmds which caused 1GB leak

Unfortunately the slurmctld build with el7 gcc/4.8.5 (libslurm.so is linked against libasan.so) could not catch the leak. With `slurmctld -D 2> /var/log/slurm/leak.log`, memory usage grew during the day from 2 GB to 5 GB, but leak.log is empty.

The logs are really ugly during the account modifications; munge timeouts etc., which are not nice for users.

slurmctld RSS from the timeframe when memory usage grew by about 1 GB:
TIME :    RSS
16:38:04 :4139020 
16:38:14 :3942376
16:38:24 :3970560
16:38:35 :3949440
16:38:45 :3953616
16:38:55 :3956136
16:39:05 :4105188
16:39:15 :4006048
16:39:25 :4056192
16:39:35 :4110108
16:39:45 :4229988
16:39:55 :4333420
16:40:05 :4423388
16:40:15 :4763728
16:40:25 :4809464
16:40:35 :4698692
16:40:45 :4741236
16:40:55 :4769036
16:41:05 :4781760
16:41:15 :4915576
16:41:25 :5003136

slurmctld log
https://kannu.csc.fi/s/CXsDJtsqfMmozfe
Comment 17 Gavin D. Howard 2020-03-16 14:40:48 MDT
Hmm...the fact that no leaks are detected makes me think that no leaks happened. In other words, I think that slurmctld may still be correctly using all of the memory it has.

However, this means that slurmctld's memory use is still growing, and we should figure out why.

I will see if I can reproduce the issue, and then reproduce it under Valgrind's Massif to see where the memory is being used.

I am also aware of the possible link between this issue and 7928, so I will investigate that.
Comment 18 Gavin D. Howard 2020-03-19 16:10:09 MDT
Please send me the output of the following command:

> sacctmgr show assoc
Comment 19 CSC sysadmins 2020-03-20 03:06:27 MDT
Created attachment 13437 [details]
Requested association dump
Comment 20 Gavin D. Howard 2020-03-20 10:58:58 MDT
That is a lot of data, and it might point to our problem.

slurmctld holds association data as slurmdbd pushes it up.

Please send me the output of the following command:

> scontrol show assoc_mgr

Also, one thing you can do is go to your state save location and look for the following files:

> assoc_mgr_state
> assoc_mgr_state.old
> assoc_usage
> assoc_usage.old

If any of those files are growing in conjunction with the memory usage of slurmctld, the association data that slurmctld holds could be the issue.
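One way to watch those files over time is a helper like the sketch below; the default directory here is an assumption — substitute your `StateSaveLocation` from slurm.conf:

```shell
# Print "file:bytes" for each assoc state file present in the state-save dir.
assoc_state_sizes() {
    dir=${1:-/var/spool/slurm/ctld}
    for f in assoc_mgr_state assoc_mgr_state.old assoc_usage assoc_usage.old; do
        [ -e "$dir/$f" ] || continue
        printf '%s:%d\n' "$f" "$(wc -c < "$dir/$f")"
    done
}
# e.g. log it alongside the RSS monitor:
# assoc_state_sizes >> /var/log/slurm/assoc_sizes.log
```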
Comment 21 CSC sysadmins 2020-03-20 11:42:37 MDT
Created attachment 13448 [details]
assoc_mgr dump
Comment 22 Gavin D. Howard 2020-03-20 13:18:53 MDT
That is pretty big, which makes me suspect that we are on the right track.

Is slurmctld still increasing its memory usage? Do you have to restart it every so often? Basically, is its memory usage a problem for your cluster? I can try to work with you to find ways to reduce its usage.
Comment 23 CSC sysadmins 2020-03-23 02:43:55 MDT
(In reply to Gavin D. Howard from comment #22)
> That is pretty big, which makes me suspect that we are on the right track.
> 
> Is slurmctld still increasing its memory usage? Do you have to restart it
> every so often? Basically, is its memory usage a problem for your cluster? I
> can try to work with you to find ways to reduce its usage.

The main issues are growing memory usage (sometimes the OOM killer hits, which is not a nice situation to handle) and overall slowness. After making anything bigger than 20-30 changes with sacctmgr, slurmctld is more or less crawling for the next ~30 minutes.

What kind of change would you propose for reducing the memory usage?

-Tommi
Comment 24 Gavin D. Howard 2020-03-23 10:11:50 MDT
I don't know yet; I am working through some ideas.

Can you tell me what the max memory usage is?
Comment 25 CSC sysadmins 2020-03-24 01:24:44 MDT
(In reply to Gavin D. Howard from comment #24)
> I don't know yet; I am working through some ideas.
> 
> Can you tell me what the max memory usage is?


I don't have the exact number at hand, but it's around 30+ GB when the kernel OOM killer has to kill it.

-Tommi
Comment 26 Gavin D. Howard 2020-03-24 09:51:15 MDT
Tommi,

Okay, I have a few more questions:

1. Does slurmctld have a median memory usage, i.e., some point at which it stays, or does it grow without bound?
2. How much memory is on the slurmctld node?
3. How much memory is on the slurmdbd node? Is it the same node as the slurmctld?
4. What is the underlying storage for slurmdbd? What is the disk speed and type? Is it an NFS?
5. Would it be possible for you to reduce the complexity of your associations and/or remove the ones that will never be used again?
Comment 27 CSC sysadmins 2020-03-25 03:19:59 MDT
> 1. Does slurmctld have a median memory usage, i.e., some point at which it
> stays, or does it grow without bound?

Memory usage grows in steps. It has now been running for 24 h since the last restart and RSS is under 1 GB. But account modifications may trigger a large memory-usage increase which is not reclaimed later.

> 2. How much memory is on the slurmctld node?

64GB

> 3. How much memory is on the slurmdbd node? Is it the same node as the
> slurmctld?

Same node; the system does not look to be memory bound.

KiB Mem : 65808088 total, 34770432 free, 23624124 used,  7413532 buff/cache
KiB Swap:  1048572 total,  1046524 free,     2048 used. 41431104 avail Mem 


> 4. What is the underlying storage for slurmdbd? What is the disk speed and
> type? Is it an NFS?

SSD mirror.


> 5. Would it be possible for you to reduce the complexity of your
> associations and/or remove the ones that will never be used again?

Not an option at this point. I think the real question is why altering accounts makes slurmctld (not slurmdbd) crawl at 100% CPU usage for tens of minutes.
Comment 28 Gavin D. Howard 2020-03-25 10:59:42 MDT
Understood.

I think that at this point, I am going to import your database and run slurmctld under a profiler. I'll see what I can find.
Comment 29 Gavin D. Howard 2020-03-25 12:46:35 MDT
I have imported your database and have run lots of assoc updates with a modified version of your sacctmgr_cmds.sh script, but I cannot reproduce the problem.

Would it be possible for you to run perf on slurmctld? The way I would like it is this:

> perf record -g slurmctld

Run it for a while, making assoc changes, until it slows to a crawl; then exit slurmctld and send me the perf.data file.

With that data, I should be able to easily see where the problem is.
Comment 30 CSC sysadmins 2020-03-30 08:38:08 MDT
Hi,

Here is perf data from a ~10 min timeframe. slurmctld memory usage increased by over 1 GB during that time (150 associations were created).

https://kannu.csc.fi/s/x2nDYDPQYSTM4m8
Comment 31 Gavin D. Howard 2020-03-30 14:57:40 MDT
I was under the impression that data in a perf.data file was machine agnostic, but I was wrong. I screwed up. I apologize.

In order for me to see what functions are taking the most time, the perf.data file has to be opened on the same machine.

I will need you to open the perf.data on the machine where slurmctld is and in the original directory where `perf record` was run.

BE AWARE: opening this file required 46 GB. There is a lot of data there (which is good).

When opening the file, use the following command:

> perf report -g "graph,0.5,caller"

Expand the data points that are the most important, probably the first third, and make sure to recursively expand them.

Then, please take screenshots and send them to me.

In the meantime, I will continue to attempt to replicate the problem here, where I could do all of this work myself.
Comment 32 CSC sysadmins 2020-03-31 05:39:40 MDT
Created attachment 13539 [details]
perf report screenshot

Hi,

I'm not sure if these are very useful; most of the time seems to be spent on thread-lock handling?
Comment 33 Gavin D. Howard 2020-03-31 10:08:36 MDT
Actually, it does tell me something.

We have a saying at SchedMD: "Slurm is threaded, but not concurrent." This means that Slurm *uses* multiple threads, but it doesn't handle actual concurrency well. You could be running into a case where that is even more true, where locking dominates.

There is one function in the perf report that might tell us something: `slurmdb_get_acct_hierarchical_rec_list()`. It seems most related to your problem, so I am going to follow that line.

This points to 

Do you have a script/program that does association management? If so, can you send it to me? I might be able to analyze it and figure out where the issue is.
Comment 34 CSC sysadmins 2020-04-01 02:58:50 MDT
Created attachment 13556 [details]
slurmctld log during perf report
Comment 35 CSC sysadmins 2020-04-01 03:04:32 MDT
Created attachment 13557 [details]
sacctmgr account modification script which was used during perf report

The situation is bad; we can make only ~10 modifications/minute. I'd guess that reproducing this leak/slowdown requires a bunch of RPC pressure from the compute/login nodes.
Comment 36 CSC sysadmins 2020-04-01 07:36:44 MDT
A colleague increased bf_yield_sleep to its maximum value, 10000000; at first sight it helped the situation a bit.
Comment 37 Gavin D. Howard 2020-04-01 17:44:54 MDT
Your last comment and sdiag output are the clues: I think the backfill scheduler is dominating the execution time and memory usage of slurmctld.

There are a few options that you could change.

First is `bf_max_job_test`. This is the number of jobs that are considered for backfilling. You have this set at 2000, which is huge. This makes the scheduler take more time and use more memory. I would reduce this to at least 500, if not back to the default of 100.

Second, if you do reduce `bf_max_job_test`, I would also reduce `bf_max_job_user` to be far below what you have set.

Third, your `bf_window` is large, at 20,160. This also makes the scheduler take more time and use more memory. I would slash it to no more than half of its current value, if not lower, even putting it to its default of 1440. This is the number one value that should be reduced.

Fourth, to make sure that slurmctld always remains responsive, you can add the `max_rpc_count` option. This is the number of active threads in slurmctld. I would set it to something close to 32 and see if responsiveness improves.
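Taken together, the suggestions would look something like the slurm.conf fragment below. The `bf_max_job_user` value is an illustrative placeholder, since no exact number was given, and note that slurm.conf spells the last option `max_rpc_cnt`:

```
SchedulerParameters=bf_max_job_test=500,bf_window=1440,bf_max_job_user=20,max_rpc_cnt=32
```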
Comment 38 CSC sysadmins 2020-04-02 06:22:11 MDT
> First is `bf_max_job_test`. This is the number of jobs that are considered
> for backfilling. You have this set at 2000, which is huge. This makes the
> scheduler take more time and use more memory. I would reduce this to at
> least 500, if not back to the default of 100.

We increased that value because jobs may be in the queue with a waiting reason such as QOSGrpCpuLimit; if bf_max_job_test is small, the cluster may drain of jobs. Is there an option to make the backfiller skip those jobs which are limited by QOS?

 
> Second, if you do reduce `bf_max_job_test`, I would also reduce
> `bf_max_job_user` to be far below what you have set.

What I see is that users have array jobs whose tasks are short (e.g. ~30 s). The backfiller picks, say, 50 short jobs per user, which are perfectly suitable backfill jobs, but overall utilization is still poor, because those short-lived jobs could be run in one go if per-user jobs were not limited. As it is, the backfiller picks a couple of tens of them per scheduling round, and it's slow.

> Third, your `bf_window` is large, at 20,160. This also makes the scheduler
> take more time and use more memory. I would slash it to no more than half of
> its current value, if not lower, even putting it to its default of 1440.
> This is the number one value that should be reduced.

The man page says: "A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation." Our longrun queue is 14 days, so which advice is correct?

> Fourth, to make sure that slurmctld always remains responsive, you can add
> the `max_rpc_count` option. This is the number of active threads in
> slurmctld. I would set it to something close to 32 and see if responsiveness
> improves.

We'll try this at some point. 

<rant begins>
We are quite frustrated at having to set limits all over the place just so that Slurm can focus on its internal thread battle. </rant ends>
Comment 39 Gavin D. Howard 2020-04-02 11:20:21 MDT
(In reply to Tommi Tervo from comment #38)
> > First is `bf_max_job_test`. This is the number of jobs that are considered
> > for backfilling. You have this set at 2000, which is huge. This makes the
> > scheduler take more time and use more memory. I would reduce this to at
> > least 500, if not back to the default of 100.
> 
> We increased that value because jobs may be in the queue with a waiting
> reason such as QOSGrpCpuLimit; if bf_max_job_test is small, the cluster may
> drain of jobs. Is there an option to make the backfiller skip those jobs
> which are limited by QOS?

As far as I can tell from the code, the backfill scheduler already does this, but they still have to be in the queue.
  
> > Second, if you do reduce `bf_max_job_test`, I would also reduce
> > `bf_max_job_user` to be far below what you have set.
> 
> What I see is that users have array jobs whose tasks are short (e.g. ~30 s).
> The backfiller picks, say, 50 short jobs per user, which are perfectly
> suitable backfill jobs, but overall utilization is still poor, because those
> short-lived jobs could be run in one go if per-user jobs were not limited.
> As it is, the backfiller picks a couple of tens of them per scheduling
> round, and it's slow.

Okay. Makes sense.

> > Third, your `bf_window` is large, at 20,160. This also makes the scheduler
> > take more time and use more memory. I would slash it to no more than half of
> > its current value, if not lower, even putting it to its default of 1440.
> > This is the number one value that should be reduced.
> 
> The man page says: "A value at least as long as the highest allowed time
> limit is generally advisable to prevent job starvation." Our longrun queue
> is 14 days, so which advice is correct?

Unfortunately, the man page should win here. I apologize; I did not know your biggest job length.

> > Fourth, to make sure that slurmctld always remains responsive, you can add
> > the `max_rpc_count` option. This is the number of active threads in
> > slurmctld. I would set it to something close to 32 and see if responsiveness
> > improves.
> 
> We'll try this at some point. 
> 
> <rant begins>
> We are quite frustrated at having to set limits all over the place just so
> that Slurm can focus on its internal thread battle. </rant ends>

Yes, we are aware of it, and there is a long-term plan in the works to fix the lack of concurrency in Slurm. However, some configuration will always be needed because of how different clusters are.
Comment 41 Gavin D. Howard 2020-04-09 16:41:29 MDT
Just so you know, I will keep this bug open for now to track any optimizations and changes we decide to do. However, feel free to let me know if you need anything else.
Comment 43 Gavin D. Howard 2020-04-13 14:07:16 MDT
We have decided to attempt some optimizations, but I will need your help to get profile data that I can use.

The good news is that I have since learned how to make perf.data files transferable between machines.

If you still have the perf.data file, please run `perf archive` in the directory with that file. If you don't have it, or `perf archive` fails, please run slurmctld under perf record again while the problem occurs. Then please send me the resulting perf.data and perf.data.tar.bz2 files.
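For reference, the round trip looks roughly like this (standard `perf` usage; paths are illustrative):

```shell
# On the slurmctld host, in the directory containing perf.data:
perf archive perf.data
# -> produces perf.data.tar.bz2 with the build-ids/symbols perf.data references

# On the analysis machine, after copying both files over:
mkdir -p ~/.debug
tar xf perf.data.tar.bz2 -C ~/.debug   # perf resolves symbols from ~/.debug
perf report -i perf.data -g "graph,0.5,caller"
```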
Comment 46 CSC sysadmins 2020-04-15 02:14:42 MDT
Hi,

Here is a link to the archived perf data:

https://kannu.csc.fi/s/MJkpYCQdzrSFDkc
Comment 51 Gavin D. Howard 2020-04-28 09:23:12 MDT
*** Ticket 8887 has been marked as a duplicate of this ticket. ***
Comment 53 Gavin D. Howard 2020-05-05 15:50:22 MDT
I want to give you an update.

We have found one or more memory leaks that may be causing your problems. However, they are messy, and the fix may take time.
Comment 54 CSC sysadmins 2020-05-06 00:17:38 MDT
(In reply to Gavin D. Howard from comment #53)
> I want to give you an update.
> 
> We have found one or more memory leaks that may be causing your problems.
> However, they are messy, and the fix may take time.


Thanks for the update.

-Tommi
Comment 55 Gavin D. Howard 2020-05-18 14:11:15 MDT
The investigation into the memory leak didn't go anywhere.

However, I may have a solution that I forgot: in 19.05, Slurm caches lists. In 20.02, it does not. This list caching may be the cause of the growing memory usage, so one possible fix may be to upgrade to 20.02.
Comment 56 Gavin D. Howard 2020-06-02 09:34:45 MDT
By the way, if you want to test whether list caching is at fault, then *in a testing environment*, you can build Slurm by passing the `--enable-memory-leak-debug` option to `configure`. This will disable list caching.

The reason I suggest doing it in a test environment is because it slows Slurm down by a lot; with this option, Slurm carefully frees all memory that it uses.

If you do not reproduce the problem with this option, chances are the list caching of 19.05 is the problem. I have not managed to reproduce the problem with this option myself, which is a good sign.
Comment 57 Gavin D. Howard 2020-06-10 12:51:19 MDT
Tommi,

Can you give me an update on whether you have, or even can, test whether the list caching is the problem?
Comment 58 CSC sysadmins 2020-06-11 02:41:39 MDT
(In reply to Gavin D. Howard from comment #57)
> Tommi,
> 
> Can you give me an update on whether you have, or even can, test whether the
> list caching is the problem?

Hi,

Unfortunately I don't have enough time to debug this now. We managed to mitigate the issue by doing the automated sacctmgr changes in small sets and by disabling automatic node onlining from NHC (which reduces the number of RPCs). We will upgrade Slurm to 20.x in autumn at the earliest, so this ticket can be put on hold (or maybe closed; I'll reopen it if the problem still exists on 20.x).
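For what it's worth, the batching part of that mitigation can be sketched as below (a hypothetical helper, assuming a file with one `sacctmgr` command per line; batch size and pause between batches are tunables):

```shell
# Apply account-management commands in small batches, pausing between
# batches so slurmctld can drain its RPC backlog.
batch_apply() {
    file=$1 batch=${2:-20} pause=${3:-60} n=0
    while IFS= read -r cmd; do
        [ -n "$cmd" ] || continue
        eval "$cmd"                  # e.g. "sacctmgr -i modify account ..."
        n=$((n + 1))
        if [ $((n % batch)) -eq 0 ]; then
            sleep "$pause"
        fi
    done < "$file"
}
# batch_apply sacctmgr_cmds.txt 20 60
```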

-Tommi
Comment 59 Gavin D. Howard 2020-06-11 10:25:47 MDT
Okay. Thank you for letting me know.

Closing for now.