Created attachment 13215 [details] slurm.conf

I'm creating a separate issue about this memory leak (split off from https://bugs.schedmd.com/show_bug.cgi?id=7928#c49). The leak has been present for a long time (since 19.05.4), and I recently updated our Slurm to the latest 19.05 branch, which included a couple of memory leak fixes, but the leak is still present. We do not know what triggers it. We had a script monitoring slurmctld memory usage, and the leak once happened while a user accounting script was modifying/adding/removing accounts, but we could not reproduce the problem on the test system, which does not run a heavy production load.

  PID USER      PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
 4680 slurm     20   0   13,7g  12,6g  4116 S   4,7 20,1  7177:39 slurmctld

(gdb) thread apply all bt

Thread 15 (Thread 0x7f66fd086700 (LWP 4681)):
#0  0x00007f66fc664de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000421414 in _agent_init (arg=<optimized out>) at agent.c:1394
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7f66f8a3b700 (LWP 4682)):
#0  0x00007f66fc664de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f66f8a40b76 in _set_db_inx_thread (no_data=<optimized out>) at accounting_storage_slurmdbd.c:451
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f66f893a700 (LWP 4684)):
#0  0x00007f66fc664de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f66f8a45a96 in _agent (x=<optimized out>) at slurmdbd_agent.c:626
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f66f8226700 (LWP 4698)):
#0  0x00007f66fc662d62 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007f66fcb2ca28 in list_free_aux (x=x@entry=0x6bb8680, pfreelist=pfreelist@entry=0x7f66fce6d2d8 <list_free_lists>) at list.c:987
#2  0x00007f66fcb2cce6 in list_free (l=0x6bb8680) at list.c:900
#3  list_destroy (l=0x6bb8680) at list.c:230
#4  0x00007f66fb229026 in cr_destroy_node_data (node_usage=node_usage@entry=0x7f634f816e80, node_data=node_data@entry=0x0) at select_cons_tres.c:2684
#5  0x00007f66fb225e1f in will_run_test (job_ptr=job_ptr@entry=0x7f634f7084d0, node_bitmap=node_bitmap@entry=0x7f634f1ba000, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=1, req_nodes=req_nodes@entry=1, job_node_req=job_node_req@entry=1, preemptee_candidates=preemptee_candidates@entry=0x0, preemptee_job_list=preemptee_job_list@entry=0x7f66f8225a48, exc_core_bitmap=0x0) at job_test.c:6921
#6  0x00007f66fb2266ef in select_p_job_test (job_ptr=<optimized out>, node_bitmap=0x7f634f1ba000, min_nodes=1, max_nodes=1, req_nodes=1, mode=<optimized out>, preemptee_candidates=0x0, preemptee_job_list=0x7f66f8225a48, exc_core_bitmap=0x0) at select_cons_tres.c:1244
#7  0x00007f66fcb37f37 in select_g_job_test (job_ptr=job_ptr@entry=0x7f634f7084d0, bitmap=bitmap@entry=0x7f634f1ba000, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=1, req_nodes=req_nodes@entry=1, mode=mode@entry=2, preemptee_candidates=preemptee_candidates@entry=0x0, preemptee_job_list=preemptee_job_list@entry=0x7f66f8225a48, exc_core_bitmap=exc_core_bitmap@entry=0x0) at node_select.c:539
#8  0x00007f66f822cfbd in _try_sched (job_ptr=0x7f634f7084d0, avail_bitmap=avail_bitmap@entry=0x7f66f8225c60, min_nodes=1, max_nodes=1, req_nodes=1, exc_core_bitmap=0x0) at backfill.c:586
#9  0x00007f66f82301d8 in _attempt_backfill () at backfill.c:2226
#10 0x00007f66f823225b in backfill_agent (args=<optimized out>) at backfill.c:1015
#11 0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f66f8125700 (LWP 4703)):
#0  0x00007f66fc664de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000433120 in _agent_thread (arg=<optimized out>) at fed_mgr.c:2225
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f66f39dd700 (LWP 4704)):
#0  0x00007f66fc664de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000437129 in _fed_job_update_thread (arg=<optimized out>) at fed_mgr.c:2183
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f66f38dc700 (LWP 4705)):
#0  0x00007f66fc664de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f66f34d5a3a in _decay_thread (no_data=<optimized out>) at priority_multifactor.c:1405
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f66f37db700 (LWP 4706)):
#0  0x00007f66fc37ec2d in poll () from /lib64/libc.so.6
#1  0x000000000042ab6e in poll (__timeout=-1, __nfds=1, __fds=<optimized out>) at /usr/include/bits/poll2.h:46
#2  _slurmctld_rpc_mgr (no_data=<optimized out>) at controller.c:1183
#3  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f66f30cb700 (LWP 4707)):
#0  0x00007f66fc6683c1 in sigwait () from /lib64/libpthread.so.0
#1  0x000000000042f6b6 in _slurmctld_signal_hand (no_data=<optimized out>) at controller.c:1059
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f66f2fca700 (LWP 4708)):
#0  0x00007f66fc664a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004ae4fb in slurmctld_state_save (no_data=<optimized out>) at state_save.c:176
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f66f2dc8700 (LWP 4710)):
#0  0x00007f66fc664a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000428c55 in _purge_files_thread (no_data=<optimized out>) at controller.c:3378
#2  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f66f1ca3700 (LWP 5794)):
#0  0x00007f66fc664184 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1  0x00000000004701e8 in lock_slurmctld (lock_levels=...) at locks.c:123
#2  0x000000000049858c in _slurm_rpc_dump_jobs_user (msg=0x7f66f1ca2e50) at proc_req.c:1903
#3  slurmctld_req (msg=msg@entry=0x7f66f1ca2e50, arg=arg@entry=0x7f66e65b4cd0) at proc_req.c:343
#4  0x000000000042a838 in _service_connection (arg=0x7f66e65b4cd0) at controller.c:1282
#5  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f6676ded700 (LWP 5795)):
#0  0x00007f66fc664184 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1  0x00000000004701e8 in lock_slurmctld (lock_levels=...) at locks.c:123
#2  0x000000000049858c in _slurm_rpc_dump_jobs_user (msg=0x7f6676dece50) at proc_req.c:1903
#3  slurmctld_req (msg=msg@entry=0x7f6676dece50, arg=arg@entry=0x7f66e65a3280) at proc_req.c:343
#4  0x000000000042a838 in _service_connection (arg=0x7f66e65a3280) at controller.c:1282
#5  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f66f169d700 (LWP 5796)):
#0  0x00007f66fc664184 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1  0x00000000004701e8 in lock_slurmctld (lock_levels=...) at locks.c:123
#2  0x000000000049858c in _slurm_rpc_dump_jobs_user (msg=0x7f66f169ce50) at proc_req.c:1903
#3  slurmctld_req (msg=msg@entry=0x7f66f169ce50, arg=arg@entry=0x7f66e655f450) at proc_req.c:343
#4  0x000000000042a838 in _service_connection (arg=0x7f66e655f450) at controller.c:1282
#5  0x00007f66fc660ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f66fc3898cd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f66fd087740 (LWP 4680)):
#0  0x00007f66fc35084d in nanosleep () from /lib64/libc.so.6
#1  0x00007f66fc381124 in usleep () from /lib64/libc.so.6
#2  0x000000000042e03f in _slurmctld_background (no_data=0x0) at controller.c:1963
#3  main (argc=<optimized out>, argv=<optimized out>) at controller.c:762
(gdb)
Thank you for the report. I will get on it. There are a couple of things I want from you.

First, send me the output of the following command:

> sdiag

Second, can you send me the user accounting script that triggered the leak?

Third, if possible, send the script that you use to monitor slurmctld's memory usage.

Fourth, please send me the logs from both slurmctld and slurmdbd.

Finally, I would appreciate it if you could send me your database.
Hi,

Sorry for the late response. I'll send the requested logs and DB dump later this week. The monitoring script is really simple:

while true; do echo -n "$(date +%T) :"; ps --noheaders -C slurmctld -o rss; sleep 10; done
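Written out over multiple lines for readability (redirecting to a log file is only an illustration; in practice we watched the output directly):

while true; do
    echo -n "$(date +%T) :"             # wall-clock timestamp
    ps --noheaders -C slurmctld -o rss  # slurmctld resident set size, in KiB
    sleep 10
done >> /tmp/slurmctld-rss.log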
Created attachment 13337 [details] Account modification script
Created attachment 13338 [details] sdiag
Created attachment 13339 [details] slurmctld log

TIME: slurmctld RSS
10:47:52 :1480460
10:48:12 :1488652
10:48:42 :1580048
10:48:52 :1623364
10:49:02 :1652404
10:49:12 :1691236
10:49:33 :1731304
10:49:43 :1734548
I am digging into everything you sent me. I have a few clues, but I may need a better reproducer. You said you can't use Valgrind because of performance, which makes sense because it generally has a 10x slowdown. However, as far as I know, AddressSanitizer is much faster, only a 2x slowdown. Can you compile slurmctld with AddressSanitizer and run it? You can do so by passing the `-fsanitize=address` flag to gcc (if it's new enough) or clang. In fact, if that's still too slow, you can use `-fsanitize=leak` which does less, but should still do what we need. If you can do that, please send me the output.
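Assuming a standard autoconf build of Slurm, something along these lines should produce an ASAN-instrumented slurmctld (the exact flags are a sketch, not a verified recipe):

> ./configure CFLAGS="-g -O1 -fsanitize=address" LDFLAGS="-fsanitize=address"
> make && make install

Substituting `-fsanitize=leak` in both variables gives the lighter, leak-checking-only build.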
Hi,

I rebuilt the Slurm RPMs against libasan, and it seems the debug overhead is tolerable. The debug build has been running for ~5 h today, but it did not catch any leak. Will it write leaks to a log file, or should I run it from the command line?

-Tommi
ASAN outputs to stderr, so you will need to redirect that to a file somehow.
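For example, either of the following should work when running slurmctld in the foreground (the log paths are only placeholders); note that the leak report is normally printed only when the process exits:

> slurmctld -D 2> /var/log/slurm/asan.log
> ASAN_OPTIONS="log_path=/var/log/slurm/asan" slurmctld -D

The second form uses ASAN's log_path runtime option to have it write per-process log files itself.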
Created attachment 13381 [details] sacctmgr cmds which caused a 1 GB leak

Unfortunately, slurmctld built with the EL7 gcc 4.8.5 (libslurm.so is linked against libasan.so) could not catch the leak. I ran "slurmctld -D 2 > /var/log/slurm/leak.log"; memory usage grew during the day from 2 GB to 5 GB, but leak.log is empty. The logs are really ugly during the account modifications: munge timeouts etc., which are not nice for users.

slurmctld RSS and logs from the timeframe when memory usage grew by about 1 GB:

TIME : RSS
16:38:04 :4139020
16:38:14 :3942376
16:38:24 :3970560
16:38:35 :3949440
16:38:45 :3953616
16:38:55 :3956136
16:39:05 :4105188
16:39:15 :4006048
16:39:25 :4056192
16:39:35 :4110108
16:39:45 :4229988
16:39:55 :4333420
16:40:05 :4423388
16:40:15 :4763728
16:40:25 :4809464
16:40:35 :4698692
16:40:45 :4741236
16:40:55 :4769036
16:41:05 :4781760
16:41:15 :4915576
16:41:25 :5003136

slurmctld log: https://kannu.csc.fi/s/CXsDJtsqfMmozfe
Hmm...the fact that no leaks are detected makes me think that no leaks happened. In other words, I think that slurmctld may still be correctly using all of the memory it has. However, this means that slurmctld's memory use is still growing, and we should figure out why. I will see if I can reproduce the issue, and then reproduce it under Valgrind's Massif to see where the memory is being used. I am also aware of the possible link between this issue and 7928, so I will investigate that.
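For reference, the Massif run I have in mind for my reproduction attempt (not something I would suggest on your production controller) is roughly:

> valgrind --tool=massif slurmctld -D
> ms_print massif.out.<pid>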
Please send me the output of the following command:

> sacctmgr show assoc
Created attachment 13437 [details] Requested association dump
That is a lot of data, and it might point to our problem. slurmctld holds association data as slurmdbd pushes it up.

Please send me the output of the following command:

> scontrol show assoc_mgr

Also, one thing you can do is go to your state save location and look for the following files:

> assoc_mgr_state
> assoc_mgr_state.old
> assoc_usage
> assoc_usage.old

If any of those files are growing in conjunction with the memory usage of slurmctld, the association data that slurmctld holds could be the issue.
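If it is easier, a loop like this can track those file sizes alongside slurmctld's RSS (with /var/spool/slurmctld standing in for your actual StateSaveLocation):

> while true; do date +%T; ps --no-headers -C slurmctld -o rss; ls -l /var/spool/slurmctld/assoc_mgr_state* /var/spool/slurmctld/assoc_usage*; sleep 60; done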
Created attachment 13448 [details] assoc_mgr dump
That is pretty big, which makes me suspect that we are on the right track. Is slurmctld still increasing its memory usage? Do you have to restart it every so often? Basically, is its memory usage a problem for your cluster? I can try to work with you to find ways to reduce its usage.
(In reply to Gavin D. Howard from comment #22)
> That is pretty big, which makes me suspect that we are on the right track.
>
> Is slurmctld still increasing its memory usage? Do you have to restart it
> every so often? Basically, is its memory usage a problem for your cluster?
> I can try to work with you to find ways to reduce its usage.

The main issues are the growing memory usage (sometimes the OOM killer hits, which is not a nice situation to handle) and overall slowness. After making anything bigger than 20-30 changes with sacctmgr, slurmctld is more or less crawling for the next ~30 minutes. What kind of change would you propose for reducing the memory usage?

-Tommi
I don't know yet; I am working through some ideas. Can you tell me what the max memory usage is?
(In reply to Gavin D. Howard from comment #24)
> I don't know yet; I am working through some ideas.
>
> Can you tell me what the max memory usage is?

I don't have an exact number at hand, but it's around 30+ GB when the kernel OOM killer has to kill it.

-Tommi
Tommi,

Okay, I have a few more questions:

1. Does slurmctld have a median memory usage, i.e., some point at which it stays, or does it grow without bound?
2. How much memory is on the slurmctld node?
3. How much memory is on the slurmdbd node? Is it the same node as the slurmctld?
4. What is the underlying storage for slurmdbd? What is the disk speed and type? Is it an NFS?
5. Would it be possible for you to reduce the complexity of your associations and/or remove the ones that will never be used again?
> 1. Does slurmctld have a median memory usage, i.e., some point at which it
> stays, or does it grow without bound?

Memory usage grows in steps. It has now been running for 24 h since the last restart and RSS is under 1 GB. But account modifications may trigger a large memory usage increase which is not reclaimed later.

> 2. How much memory is on the slurmctld node?

64 GB

> 3. How much memory is on the slurmdbd node? Is it the same node as the
> slurmctld?

Same node, and the system does not look to be memory-bound:

KiB Mem : 65808088 total, 34770432 free, 23624124 used, 7413532 buff/cache
KiB Swap: 1048572 total, 1046524 free, 2048 used. 41431104 avail Mem

> 4. What is the underlying storage for slurmdbd? What is the disk speed and
> type? Is it an NFS?

SSD mirror.

> 5. Would it be possible for you to reduce the complexity of your
> associations and/or remove the ones that will never be used again?

Not an option at this point. I think the real question is why altering accounts makes slurmctld (not slurmdbd) crawl at 100% CPU usage for tens of minutes.
Understood. I think that at this point, I am going to import your database and run slurmctld under a profiler. I'll see what I can find.
I have imported your database and have run lots of assoc updates with a modified version of your sacctmgr_cmds.sh script, but I cannot reproduce the problem.

Would it be possible for you to run perf on slurmctld? The way I would like it is this:

> perf record -g slurmctld

Run it for a while, making assoc changes, until it slows down to a crawl, then exit out of slurmctld and send me the perf.data file. With that data, I should be able to easily see where the problem is.
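If starting slurmctld directly under perf is awkward, attaching to the already-running daemon for a window while you make the association changes should work just as well (600 seconds is only an example duration):

> perf record -g -p $(pidof slurmctld) -- sleep 600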
Hi,

Here is perf data from a ~10-minute timeframe. slurmctld memory usage increased by over 1 GB during that time (150 associations were created):

https://kannu.csc.fi/s/x2nDYDPQYSTM4m8
I was under the impression that data in a perf.data file was machine-agnostic, but I was wrong. I screwed up. I apologize.

In order for me to see what functions are taking the most time, the perf.data file has to be opened on the same machine. I will need you to open the perf.data on the machine where slurmctld is, in the original directory where `perf record` was run. BE AWARE: opening this file required 46 GB. There is a lot of data there (which is good).

When opening the file, use the following command:

> perf report -g "graph,0.5,caller"

Expand the data points that are the most important, probably the first third, and make sure to recursively expand them. Then, please take screenshots and send them to me.

In the meantime, I will continue to attempt to replicate the problem here, where I could do all of this work myself.
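If the screenshots turn out to be awkward, a plain-text dump of the same report (run on the same machine, in the same directory) is a workable alternative:

> perf report -g "graph,0.5,caller" --stdio > perf-report.txt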
Created attachment 13539 [details] perf report screenshot

Hi, I'm not sure if these are very useful; most of the time seems to be spent on thread lock handling?
Actually, it does tell me something. We have a saying at SchedMD: "Slurm is threaded, but not concurrent." This means that Slurm *uses* multiple threads, but it doesn't handle actual concurrency well. You could be running into a case where that is even more true, where locking dominates.

There is one function in the perf report that might tell us something: `slurmdb_get_acct_hierarchical_rec_list()`. It seems most related to your problem, so I am going to follow that line. It points toward your association management.

Do you have a script/program that does association management? If so, can you send it to me? I might be able to analyze it and figure out where the issue is.
Created attachment 13556 [details] slurmctld log during perf report
Created attachment 13557 [details] sacctmgr account modification script which was used during perf report

The situation is bad; we can make only ~10 modifications per minute. I'd guess that reproducing this leak/slowdown requires a bunch of RPC pressure from the compute/login nodes.
A colleague increased bf_yield_sleep to its maximum value of 10000000; at first sight it has helped the situation a bit.
Your last comment and the sdiag output are the clues: I think the backfill scheduler is dominating the execution time and memory usage of slurmctld. There are a few options that you could change; a sketch of the resulting configuration follows below.

First is `bf_max_job_test`. This is the number of jobs that are considered for backfilling. You have this set at 2000, which is huge. This makes the scheduler take more time and use more memory. I would reduce this to 500 at most, if not back to the default of 100.

Second, if you do reduce `bf_max_job_test`, I would also reduce `bf_max_job_user` to be far below what you have set.

Third, your `bf_window` is large, at 20,160. This also makes the scheduler take more time and use more memory. I would slash it to no more than half of its current value, if not lower, even putting it to its default of 1440. This is the number one value that should be reduced.

Fourth, to make sure that slurmctld always remains responsive, you can add the `max_rpc_cnt` option. When the number of active RPC threads in slurmctld reaches this value, new scheduling work is deferred so the daemon stays responsive. I would set it to something close to 32 and see if responsiveness improves.
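To illustrate the direction of these changes only (the numbers are starting points to tune, not verified values, and any other SchedulerParameters you already have must be kept on the same line), the slurm.conf entry would look something like:

> SchedulerParameters=...,bf_max_job_test=500,bf_max_job_user=20,bf_window=1440,max_rpc_cnt=32

followed by an `scontrol reconfigure` to apply it.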
> First is `bf_max_job_test`. This is the number of jobs that are considered
> for backfilling. You have this set at 2000, which is huge. This makes the
> scheduler take more time and use more memory. I would reduce this to 500 at
> most, if not back to the default of 100.

We have increased that value because jobs may be in the queue whose waiting reason is e.g. QOSGrpCpuLimit; if bf_max_job_test is small, the cluster may drain of jobs. Is there an option to make the backfill scheduler skip jobs which are limited by QOS?

> Second, if you do reduce `bf_max_job_test`, I would also reduce
> `bf_max_job_user` to be far below what you have set.

What I see is that users have array jobs whose individual tasks are short (e.g. ~30 s). The backfill scheduler picks, let's say, 50 short jobs per user which are perfectly suitable backfill jobs, but overall utilization is still poor, because those short-lived jobs could be run in one go if per-user jobs were not limited. Now the backfill scheduler picks a couple of tens of them per scheduling round, and it's slow.

> Third, your `bf_window` is large, at 20,160. This also makes the scheduler
> take more time and use more memory. I would slash it to no more than half of
> its current value, if not lower, even putting it to its default of 1440.
> This is the number one value that should be reduced.

The man page says: "A value at least as long as the highest allowed time limit is generally advisable to prevent job starvation," and our longrun queue is 14 days. So which advice is correct?

> Fourth, to make sure that slurmctld always remains responsive, you can add
> the `max_rpc_cnt` option. When the number of active RPC threads in slurmctld
> reaches this value, new scheduling work is deferred so the daemon stays
> responsive. I would set it to something close to 32 and see if responsiveness
> improves.

We'll try this at some point.

<rant begins>
We are quite frustrated to have to set limits all over the place just so that Slurm can focus on its internal thread battle.
</rant ends>
(In reply to Tommi Tervo from comment #38)
> > First is `bf_max_job_test`. This is the number of jobs that are considered
> > for backfilling. You have this set at 2000, which is huge. This makes the
> > scheduler take more time and use more memory. I would reduce this to 500
> > at most, if not back to the default of 100.
>
> We have increased that value because jobs may be in the queue whose waiting
> reason is e.g. QOSGrpCpuLimit; if bf_max_job_test is small, the cluster may
> drain of jobs. Is there an option to make the backfill scheduler skip jobs
> which are limited by QOS?

As far as I can tell from the code, the backfill scheduler already does this, but the jobs still have to be in the queue.

> > Second, if you do reduce `bf_max_job_test`, I would also reduce
> > `bf_max_job_user` to be far below what you have set.
>
> What I see is that users have array jobs whose individual tasks are short
> (e.g. ~30 s). The backfill scheduler picks, let's say, 50 short jobs per
> user which are perfectly suitable backfill jobs, but overall utilization is
> still poor, because those short-lived jobs could be run in one go if
> per-user jobs were not limited. Now the backfill scheduler picks a couple
> of tens of them per scheduling round, and it's slow.

Okay. Makes sense.

> > Third, your `bf_window` is large, at 20,160. This also makes the scheduler
> > take more time and use more memory. I would slash it to no more than half
> > of its current value, if not lower, even putting it to its default of
> > 1440. This is the number one value that should be reduced.
>
> The man page says: "A value at least as long as the highest allowed time
> limit is generally advisable to prevent job starvation," and our longrun
> queue is 14 days. So which advice is correct?

Unfortunately, the man page should win here. I apologize; I did not know your biggest job length.

> > Fourth, to make sure that slurmctld always remains responsive, you can add
> > the `max_rpc_cnt` option. When the number of active RPC threads in
> > slurmctld reaches this value, new scheduling work is deferred so the
> > daemon stays responsive. I would set it to something close to 32 and see
> > if responsiveness improves.
>
> We'll try this at some point.
>
> <rant begins>
> We are quite frustrated to have to set limits all over the place just so
> that Slurm can focus on its internal thread battle.
> </rant ends>

Yes, we are aware of it, and there is a long-term plan in the works to fix the lack of concurrency in Slurm. However, some configuration will always be needed because of how different clusters are.
Just so you know, I will keep this bug open for now to track any optimizations and changes we decide to do. However, feel free to let me know if you need anything else.
We have decided to attempt some optimizations, but I will need your help to get profile data that I can use. The good news is that I have since learned how to make perf.data files transferable between machines. If you still have the perf.data file, please run `perf archive` in the directory with that file. If you don't have it, or `perf archive` fails, please run slurmctld under perf record again while the problem occurs. Then please send me the resulting perf.data and perf.data.tar.bz2 files.
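For reference, the sequence is roughly the following; the tar step is the one `perf archive` itself prints for the receiving side:

> perf archive perf.data
> # then, on the machine doing the analysis:
> tar xvf perf.data.tar.bz2 -C ~/.debug
> perf report -i perf.data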
Hi,

Here is a link to the archived perf data:

https://kannu.csc.fi/s/MJkpYCQdzrSFDkc
*** Ticket 8887 has been marked as a duplicate of this ticket. ***
I want to give you an update. We have found one or more memory leaks that may be causing your problems. However, they are messy, and the fix may take time.
(In reply to Gavin D. Howard from comment #53) > I want to give you an update. > > We have found one or more memory leaks that may be causing your problems. > However, they are messy, and the fix may take time. Thanks for the update. -Tommi
The investigation into the memory leak didn't go anywhere. However, there is one possibility I forgot to mention: in 19.05, Slurm caches lists; in 20.02, it does not. This list caching may be the cause of the growing memory usage, so one possible fix may be to upgrade to 20.02.
By the way, if you want to test whether list caching is at fault, then *in a testing environment*, you can build Slurm by passing the `--enable-memory-leak-debug` option to `configure`. This will disable list caching. The reason I suggest doing it in a test environment is that it slows Slurm down by a lot; with this option, Slurm carefully frees all memory that it uses. If you do not reproduce the problem with this option, chances are the list caching of 19.05 is the problem. I have not managed to reproduce the problem with this option myself, which is a good sign.
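A minimal sketch of such a test build (the install prefix is only an example):

> ./configure --prefix=/opt/slurm-test --enable-memory-leak-debug
> make && make install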
Tommi,

Can you give me an update on whether you have tested, or will be able to test, whether the list caching is the problem?
(In reply to Gavin D. Howard from comment #57)
> Tommi,
>
> Can you give me an update on whether you have tested, or will be able to
> test, whether the list caching is the problem?

Hi,

Unfortunately I don't have enough time to debug this now. We managed to mitigate the issue by making the automated sacctmgr changes in small sets and by disabling automatic node onlining from NHC (which reduces the amount of RPCs). We will upgrade Slurm to 20.x in autumn at the earliest, so this bug can be put on hold (or maybe closed; I'll reopen it if the problem still exists on 20.x).

-Tommi
Okay. Thank you for letting me know. Closing for now.