| Summary: | slurmdbd memory leak | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Regine Gaudin <regine.gaudin> |
| Component: | slurmdbd | Assignee: | Albert Gil <albert.gil> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 20.11.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CEA | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
cat /etc/slurm_prometheus_acct.sh
#!/bin/bash
for state in cancelled completed deadline failed node_fail out_of_memory requeued suspended timeout
do
# sacct -S "$(date +%m/01)" -E "now" -D -s $state -X -o CpuTimeRAW -n
sacct -S 2021-04-04 -E "now" -D -s $state -X -o CpuTimeRAW -n
done
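The slurmdbd startup log in this ticket warns that innodb_buffer_pool_size, innodb_log_file_size and innodb_lock_wait_timeout are not at recommended values. A minimal MariaDB config sketch along the lines of the Slurm accounting documentation (the file path and the exact sizes are assumptions to be tuned to the host):

```ini
# Hypothetical /etc/my.cnf.d/innodb.cnf fragment addressing the three
# settings slurmdbd warns about. The buffer pool in particular should be
# sized generously on a dedicated database host; these values are
# illustrative, not prescriptive.
[mysqld]
innodb_buffer_pool_size = 4096M
innodb_log_file_size = 64M
innodb_lock_wait_timeout = 900
```

Changing innodb_log_file_size on an existing MariaDB installation may require a clean shutdown before the new log files can be created.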
Hi Regine,

Let me split out the different issues that I see from your comments:

A) It seems that your SQL configuration is not the recommended one:

slurmdbd: error: Database settings not recommended values: innodb_buffer_pool_size innodb_log_file_size innodb_lock_wait_timeout

Please try to follow these instructions to get a better SQL setup for Slurm: https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build Or did you perhaps post only the logs of the debug/test VM?

B) From your comments it doesn't look like a memory leak to me: I see the same slurmdbd PID with RAM moving up *and down*, but never above 4G. That's quite the expected behavior, right? How much RAM does the host have? Note that Linux could be OOM-killing slurmdbd because other processes on that machine are using more RAM than expected. Is that possible?

C) I'm not sure why you need to run that sacct command every 3 hours. I wonder whether sreport is what you really want to use. Can you explain a bit what information you need for the billing and why you think sreport is not the right tool?

At this point I would like to focus first on C); is that ok for you too? A) is probably an easy improvement that won't solve the main issue by itself, but it will certainly help.

Regards,
Albert

PS: I've also noticed that in /etc/slurm_prometheus_acct.sh you loop over several states; jfyi, the -s/--state option accepts a comma-separated list of states too.

B) On the real cluster
dmesg -T |grep slurmdbd
[Sat May 22 16:01:39 2021] [ 4108] 350 4108 3477282 2972303 5967 0 0 slurmdbd
[Sat May 22 16:01:39 2021] Out of memory: Kill process 4108 (slurmdbd) score 785 or sacrifice child
[Sat May 22 16:01:39 2021] Killed process 4108 (slurmdbd) total-vm:13909128kB, anon-rss:11889212kB, file-rss:0kB, shmem-rss:0kB
[Mon May 24 01:01:41 2021] [18157] 350 18157 2973939 2814329 5561 0 0 slurmdbd
[Mon May 24 01:01:41 2021] Out of memory: Kill process 18157 (slurmdbd) score 743 or sacrifice child
[Mon May 24 01:01:41 2021] Killed process 18157 (slurmdbd) total-vm:11895756kB, anon-rss:11257316kB, file-rss:0kB, shmem-rss:0kB
[Mon May 24 01:01:41 2021] Out of memory: Kill process 18162 (slurmdbd) score 743 or sacrifice child
[Tue May 25 10:01:50 2021] [10305] 350 10305 2868848 2700087 5342 0 0 slurmdbd
[Tue May 25 10:01:50 2021] Out of memory: Kill process 10305 (slurmdbd) score 713 or sacrifice child
[Tue May 25 10:01:50 2021] Killed process 10305 (slurmdbd) total-vm:11475392kB, anon-rss:10800348kB, file-rss:0kB, shmem-rss:0kB
[Wed May 26 08:49:38 2021] [11381] 350 11381 2534113 2418584 4800 0 0 slurmdbd
[Wed May 26 08:49:38 2021] Out of memory: Kill process 11381 (slurmdbd) score 638 or sacrifice child
[Wed May 26 08:49:38 2021] Killed process 11381 (slurmdbd) total-vm:10136452kB, anon-rss:9674336kB, file-rss:0kB, shmem-rss:0kB
[Wed May 26 17:01:41 2021] [28973] 350 28973 2494657 2251135 4476 0 0 slurmdbd
[Wed May 26 17:01:41 2021] Out of memory: Kill process 28973 (slurmdbd) score 594 or sacrifice child
[Wed May 26 17:01:41 2021] Killed process 28973 (slurmdbd) total-vm:9978628kB, anon-rss:9004540kB, file-rss:0kB, shmem-rss:0kB
[Thu May 27 01:01:50 2021] [ 6365] 350 6365 2438250 2231957 4486 0 0 slurmdbd
[Thu May 27 01:01:50 2021] Out of memory: Kill process 6365 (slurmdbd) score 589 or sacrifice child
[Thu May 27 01:01:50 2021] Killed process 6365 (slurmdbd) total-vm:9753000kB, anon-rss:8927828kB, file-rss:0kB, shmem-rss:0kB
[Thu May 27 01:01:50 2021] Out of memory: Kill process 6368 (slurmdbd) score 589 or sacrifice child
[Thu May 27 13:01:53 2021] [11397] 350 11397 2354286 2223149 4422 0 0 slurmdbd
[Thu May 27 13:01:53 2021] Out of memory: Kill process 11397 (slurmdbd) score 587 or sacrifice child
[Thu May 27 13:01:53 2021] Killed process 11397 (slurmdbd) total-vm:9417144kB, anon-rss:8892596kB, file-rss:0kB, shmem-rss:0kB
[Thu May 27 22:01:55 2021] [22925] 350 22925 2308821 2191822 4349 0 0 slurmdbd
[Thu May 27 22:01:55 2021] Out of memory: Kill process 22925 (slurmdbd) score 578 or sacrifice child
[Thu May 27 22:01:55 2021] Killed process 22925 (slurmdbd) total-vm:9235284kB, anon-rss:8767288kB, file-rss:0kB, shmem-rss:0kB
[Fri May 28 11:02:14 2021] [11130] 350 11130 2325798 2163652 4308 0 0 slurmdbd
[Fri May 28 11:02:14 2021] Out of memory: Kill process 11130 (slurmdbd) score 571 or sacrifice child
[Fri May 28 11:02:14 2021] Killed process 11130 (slurmdbd) total-vm:9303192kB, anon-rss:8654608kB, file-rss:0kB, shmem-rss:0kB
[Fri May 28 18:01:53 2021] [29395] 350 29395 2298359 2104128 4222 0 0 slurmdbd
[Fri May 28 18:01:53 2021] Out of memory: Kill process 29395 (slurmdbd) score 555 or sacrifice child
[Fri May 28 18:01:53 2021] Killed process 29395 (slurmdbd) total-vm:9193436kB, anon-rss:8416512kB, file-rss:0kB, shmem-rss:0kB
[Fri May 28 18:01:53 2021] [29457] 350 29395 2298359 2104218 4222 0 0 slurmdbd
[Fri May 28 18:01:53 2021] Out of memory: Kill process 29457 (slurmdbd) score 555 or sacrifice child
[Fri May 28 18:01:53 2021] Killed process 29457 (slurmdbd) total-vm:9193436kB, anon-rss:8416872kB, file-rss:0kB, shmem-rss:0kB
[Sat May 29 05:02:03 2021] [19014] 350 19014 2183676 2008297 3990 0 0 slurmdbd
[Sat May 29 05:02:03 2021] Out of memory: Kill process 19014 (slurmdbd) score 530 or sacrifice child
[Sat May 29 05:02:03 2021] Killed process 19014 (slurmdbd) total-vm:8734704kB, anon-rss:8033188kB, file-rss:0kB, shmem-rss:0kB
[Sat May 29 17:01:56 2021] [12250] 350 12250 2100822 1943934 3858 0 0 slurmdbd
[Sat May 29 17:01:56 2021] Out of memory: Kill process 12250 (slurmdbd) score 513 or sacrifice child
[Sat May 29 17:01:56 2021] Killed process 12250 (slurmdbd) total-vm:8403288kB, anon-rss:7775736kB, file-rss:0kB, shmem-rss:0kB
[Sat May 29 17:01:56 2021] [12307] 350 12250 2100822 1943981 3858 0 0 slurmdbd
[Sat May 29 17:01:56 2021] Out of memory: Kill process 12307 (slurmdbd) score 513 or sacrifice child
[Sat May 29 17:01:56 2021] Killed process 12307 (slurmdbd) total-vm:8403288kB, anon-rss:7775900kB, file-rss:24kB, shmem-rss:0kB
[Sun May 30 05:01:59 2021] [27248] 350 27248 2009756 1899852 3769 0 0 slurmdbd
[Sun May 30 05:01:59 2021] Out of memory: Kill process 27248 (slurmdbd) score 501 or sacrifice child
[Sun May 30 05:01:59 2021] Killed process 27248 (slurmdbd) total-vm:8039024kB, anon-rss:7599408kB, file-rss:0kB, shmem-rss:0kB
[Sun May 30 15:02:04 2021] [15281] 350 15281 2022981 1838734 3671 0 0 slurmdbd
[Sun May 30 15:02:04 2021] Out of memory: Kill process 15281 (slurmdbd) score 485 or sacrifice child
[Sun May 30 15:02:04 2021] Killed process 15281 (slurmdbd) total-vm:8091924kB, anon-rss:7354936kB, file-rss:0kB, shmem-rss:0kB
[Mon May 31 01:01:56 2021] [29982] 350 29982 1893568 1769621 3543 0 0 slurmdbd
[Mon May 31 01:01:56 2021] Out of memory: Kill process 29982 (slurmdbd) score 467 or sacrifice child
[Mon May 31 01:01:56 2021] Killed process 29982 (slurmdbd) total-vm:7574272kB, anon-rss:7078484kB, file-rss:0kB, shmem-rss:0kB
[Mon May 31 08:02:01 2021] [30176] 350 30176 2010320 1727238 3520 0 0 slurmdbd
[Mon May 31 08:02:01 2021] Out of memory: Kill process 30176 (slurmdbd) score 456 or sacrifice child
[Mon May 31 08:02:01 2021] Killed process 30176 (slurmdbd) total-vm:8041280kB, anon-rss:6908952kB, file-rss:0kB, shmem-rss:0kB
[Mon May 31 17:02:02 2021] [31656] 350 31656 1664753 1511721 3035 0 0 slurmdbd
[Mon May 31 17:02:02 2021] Out of memory: Kill process 31656 (slurmdbd) score 399 or sacrifice child
[Mon May 31 17:02:02 2021] Killed process 31656 (slurmdbd) total-vm:6659012kB, anon-rss:6046884kB, file-rss:0kB, shmem-rss:0kB
[Thu Jun 3 09:49:30 2021] [ 875] 350 875 1604336 1157221 2550 0 0 slurmdbd
[root@db1 ~] # free -gh
total used free shared buff/cache available
Mem: 14G 8.7G 161M 838M 5.6G 4.6G
Swap: 0B 0B 0B
C) I have asked the sysadmin: this is not for billing but for survey/monitoring purposes. The results are read by Prometheus, which allows us to detect, for instance, a burst of node_fail states, so we need a minimum frequency in order to be reactive when problems occur.
The exact script, run hourly, is:
cat /etc/cron.hourly/slurm_prometheus_acct.sh
#!/bin/bash
TMPFILE=$(mktemp)
trap "rm ${TMPFILE}" EXIT
for state in cancelled completed deadline failed node_fail out_of_memory requeued suspended timeout
do
value=$(sacct -S "$(date +%m/01)" -E "now" -D -s $state -X -o CpuTimeRAW -n | paste -s -d+ | bc -l 2>/dev/null)
[ -z "$value" ] && value=0
echo slurm_job_cpu_seconds_total{state=\"${state}\"} $value >> ${TMPFILE}
done
test -d /etc/node_exporter || exit 1
cp -f ${TMPFILE} /etc/node_exporter/slurm_prometheus_acct.prom
chmod 0644 /etc/node_exporter/slurm_prometheus_acct.prom
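Following Albert's earlier note that -s/--state accepts a comma-separated list, the nine sacct invocations above could in principle collapse into a single call. A sketch of the aggregation step, with a hypothetical sample_output variable standing in for real sacct output (the commented-out sacct line shows the assumed invocation):

```shell
#!/bin/bash
# Sketch only: one sacct call instead of nine, since -s/--state takes a
# comma-separated state list. Adding State to -o (with -P for parseable
# pipe-delimited output) lets a single awk pass sum CpuTimeRAW per state.
# The real invocation would be something like:
#   sacct -S "$(date +%m/01)" -E now -D -X -n -P \
#     -s cancelled,completed,deadline,failed,node_fail,out_of_memory,requeued,suspended,timeout \
#     -o State,CpuTimeRAW
sample_output='COMPLETED|3600
FAILED|120
COMPLETED|1800
TIMEOUT|60'

# Sum CPU seconds per state and emit Prometheus-style lines.
printf '%s\n' "$sample_output" |
  awk -F'|' '{sum[tolower($1)] += $2}
             END {for (s in sum)
                    printf "slurm_job_cpu_seconds_total{state=\"%s\"} %d\n", s, sum[s]}'
```

This would also replace the paste/bc pipeline, and one query per hour instead of nine should reduce the load each cron run puts on slurmdbd.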
Hi Regine,

> B) On the real cluster
> dmesg -T |grep slurmdbd
> [Sat May 22 16:01:39 2021] [ 4108] 350 4108 3477282 2972303 5967 0 0 slurmdbd
> [Sat May 22 16:01:39 2021] Out of memory: Kill process 4108 (slurmdbd) score 785 or sacrifice child
> [Sat May 22 16:01:39 2021] Killed process 4108 (slurmdbd) total-vm:13909128kB, anon-rss:11889212kB, file-rss:0kB, shmem-rss:0kB

Yes, the kernel is killing slurmdbd, and that is really bad. But the total-vm used by slurmdbd when it is killed shows quite normal values: sometimes it is killed with only 6G in use, and the highest value is 13G. Those values are expected depending on the size of the queries being run, so it doesn't look like a leak to me.

> [root@db1 ~] # free -gh
>               total        used        free      shared  buff/cache   available
> Mem:            14G        8.7G        161M        838M        5.6G        4.6G
> Swap:            0B          0B          0B

OK, I see that your machine has 14G, so it is normal for the kernel to OOM-kill slurmdbd. I think you need to increase, probably double, the memory of the machine where slurmdbd runs. 14G seems not to be enough for the daily work of your system.

> C) I have asked the sysadmin: this is not for billing but for survey/monitoring
> purposes. The results are read by Prometheus, and it allows us to detect, for
> instance, a burst of node_fail. So we need a minimum frequency to be reactive
> in case of problems.
>
> The exact script, run hourly, is:
> cat /etc/cron.hourly/slurm_prometheus_acct.sh
> #!/bin/bash
>
> TMPFILE=$(mktemp)
> trap "rm ${TMPFILE}" EXIT
>
> for state in cancelled completed deadline failed node_fail out_of_memory requeued suspended timeout
> do
>     value=$(sacct -S "$(date +%m/01)" -E "now" -D -s $state -X -o CpuTimeRAW -n | paste -s -d+ | bc -l 2>/dev/null)
>     [ -z $value ] && value=0
>     echo slurm_job_cpu_seconds_total{state=\"${state}\"} $value >> ${TMPFILE}
> done
> test -d /etc/node_exporter || exit 1
> cp -f ${TMPFILE} /etc/node_exporter/slurm_prometheus_acct.prom
> chmod 0644 /etc/node_exporter/slurm_prometheus_acct.prom

I see. It seems that what you want is the Job Completion plugin with Prometheus. Just in case you are not aware of it, we support very similar monitoring with Elasticsearch: https://slurm.schedmd.com/elasticsearch.html

Regards,
Albert

Hi,
We'll double the memory of the slurmdbd machine. It'll take a while.

Hi Regine,

> We'll double the memory of the slurmdbd machine.

Ok.

> It'll take a while.

If this is ok for you, I'm closing this ticket as "infogiven" for now, but please feel free to reopen it if increasing the memory does not solve the issue or if you need further support.

Regards,
Albert
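Albert's pointer to the Elasticsearch job completion plugin would, if adopted, amount to a slurm.conf fragment roughly like the following. This is a sketch: the endpoint URL is an assumption, and depending on the Slurm version JobCompLoc may need to include an index path such as /slurm/jobcomp.

```
# Hypothetical slurm.conf fragment enabling the Elasticsearch job
# completion plugin; the server address below is a placeholder.
JobCompType=jobcomp/elasticsearch
JobCompLoc=http://localhost:9200
```

With this in place, completed-job records are pushed to Elasticsearch as they finish, so the monitoring side no longer needs to poll slurmdbd with heavy sacct queries.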
Hi,

For compute-hours billing reasons we need to run the following accounting every 3 hours daily, under root:

#!/bin/bash
for state in cancelled completed deadline failed node_fail out_of_memory requeued suspended timeout
do
sacct -S "$(date +%m/01)" -E "now" -D -s $state -X -o CpuTimeRAW -n
done

slurmdbd is regularly killed by the OOM killer, with the highest frequency near the end of the month. I hope I have reproduced the memory leak (but I am not sure) on a VM importing the cluster database (I cannot give you the database; please note that the LDAP users are not known on the VM). I have tried to compile Slurm with memory debugging and to run valgrind, but it leads to a failure of the "sacct -s completed" part:

# for i in `seq 1 4`;do date; ps aux | awk '{$5=int(100 * $5/1024/1024)/100"GB";}{ print;}' |grep slurmdbd|grep -v grep; /etc/slurm_prometheus_acct.sh > /dev/null ; echo "sacct done"; date; ps aux | awk '{$5=int(100 * $5/1024/1024)/100"GB";}{ print;}' |grep slurmdbd|grep -v grep; done
Tue Jun 1 15:00:50 CEST 2021
slurm 14972 6.3 0.0 0.34GB 10816 ? Ssl 14:49 0:44 /usr/sbin/slurmdbd -D
sacct done
Tue Jun 1 15:01:40 CEST 2021
slurm 14972 7.1 1.6 2.72GB 2529268 ? Ssl 14:49 0:53 /usr/sbin/slurmdbd -D
Tue Jun 1 15:01:40 CEST 2021
slurm 14972 7.1 1.6 2.72GB 2529268 ? Ssl 14:49 0:53 /usr/sbin/slurmdbd -D
sacct done
Tue Jun 1 15:02:37 CEST 2021
slurm 14972 8.3 0.5 1.22GB 920172 ? Ssl 14:49 1:06 /usr/sbin/slurmdbd -D
Tue Jun 1 15:02:38 CEST 2021
slurm 14972 8.3 0.5 1.22GB 920172 ? Ssl 14:49 1:06 /usr/sbin/slurmdbd -D
sacct done
Tue Jun 1 15:03:32 CEST 2021
slurm 14972 8.8 2.2 3.59GB 3434812 ? Ssl 14:49 1:16 /usr/sbin/slurmdbd -D
Tue Jun 1 15:03:32 CEST 2021
slurm 14972 8.8 2.2 3.59GB 3434812 ? Ssl 14:49 1:16 /usr/sbin/slurmdbd -D
sacct done
Tue Jun 1 15:04:30 CEST 2021
slurm 14972 9.7 1.6 2.72GB 2529264 ? Ssl 14:49 1:29 /usr/sbin/slurmdbd -D

[root@vm0 x86_64] # for i in `seq 1 4`;do date; ps aux | awk '{$5=int(100 * $5/1024/1024)/100"GB";}{ print;}' |grep slurmdbd|grep -v grep; /etc/slurm_prometheus_acct.sh > /dev/null ; echo "sacct done"; date; ps aux | awk '{$5=int(100 * $5/1024/1024)/100"GB";}{ print;}' |grep slurmdbd|grep -v grep; done
Tue Jun 1 15:05:17 CEST 2021
slurm 14972 9.3 1.6 2.72GB 2529264 ? Ssl 14:49 1:29 /usr/sbin/slurmdbd -D
sacct done
Tue Jun 1 15:06:12 CEST 2021
slurm 14972 9.7 2.0 3.28GB 3110744 ? Ssl 14:49 1:39 /usr/sbin/slurmdbd -D
Tue Jun 1 15:06:12 CEST 2021
slurm 14972 9.7 2.0 3.28GB 3110744 ? Ssl 14:49 1:39 /usr/sbin/slurmdbd -D
sacct done
Tue Jun 1 15:07:07 CEST 2021
slurm 14972 10.2 0.3 0.9GB 596104 ? Ssl 14:49 1:50 /usr/sbin/slurmdbd -D
Tue Jun 1 15:07:07 CEST 2021
slurm 14972 10.2 0.3 0.9GB 596104 ? Ssl 14:49 1:50 /usr/sbin/slurmdbd -D
sacct done
Tue Jun 1 15:08:05 CEST 2021
slurm 14972 10.8 2.0 3.28GB 3110744 ? Ssl 14:49 2:02 /usr/sbin/slurmdbd -D
Tue Jun 1 15:08:05 CEST 2021
slurm 14972 10.8 2.0 3.28GB 3110744 ? Ssl 14:49 2:02 /usr/sbin/slurmdbd -D
sacct done
Tue Jun 1 15:08:59 CEST 2021
slurm 14972 11.2 0.3 0.9GB 596104 ? Ssl 14:49 2:12 /usr/sbin/slurmdbd -D

valgrind --tool=memcheck --error-limit=no --leak-check=full --show-reachable=yes --max-stackframe=16777216 --num-callers=20 --child-silent-after-fork=yes --track-origins=yes slurmdbd -Dv > valgrind-slurmdbd-20 2>&1

sacct -S 2021-04-04 -E "now" -D -s completed -X -o CpuTimeRAW -n
sacct: error: Getting response to message type: DBD_GET_JOBS_COND
sacct: error: DBD_GET_JOBS_COND failure: Unspecified error

valgrind then became unkillable and needed kill -9:

valgrind --tool=memcheck --error-limit=no --leak-check=full --show-reachable=yes --max-stackframe=16777216 --num-callers=20 --child-silent-after-fork=yes --track-origins=yes slurmdbd -Dv > valgrind-slurmdbd-20 2>&1
^C^C^C^C^C^C
!ps
ps aux | awk '{$5=int(100 * $5/1024/1024)/100"GB";}{ print;}' |grep slurmdbd|grep -v grep
slurm 22035 98.9 3.4 5.24GB 5278208 pts/0 Sl+ 15:48 18:57 valgrind --tool=memcheck --error-limit=no --leak-check=full --show-reachable=yes --max-stackframe=16777216 --num-callers=20 --child-silent-after-fork=yes --track-origins=yes slurmdbd -Dv
gaudinr 23011 0.1 0.0 0.15GB 5436 pts/3 S+ 16:07 0:00 vim memo_bug_slurmdbd
kill -9 22035

cat valgrind-slurmdbd-20
==22035== Memcheck, a memory error detector
==22035== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==22035== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==22035== Command: slurmdbd -Dv
==22035==
slurmdbd: pidfile not locked, assuming no running daemon
slurmdbd: accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 5.5.64-MariaDB
slurmdbd: error: Database settings not recommended values: innodb_buffer_pool_size innodb_log_file_size innodb_lock_wait_timeout
slurmdbd: accounting_storage/as_mysql: init: Accounting storage MYSQL plugin loaded
slurmdbd: slurmdbd version 20.11.5 started
slurmdbd: Terminate signal (SIGINT or SIGTERM) received

[root@vm0 ~] # mysql
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 184
Server version: 5.5.64-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> select count(*) from CSC_job_table;
ERROR 1146 (42S02): Table 'slurm_acct_db.CSC_job_table' doesn't exist
MariaDB [slurm_acct_db]> select count(*) from my_cluster_job_table;
+----------+
| count(*) |
+----------+
|  3002175 |
+----------+
1 row in set (0.59 sec)

MariaDB [slurm_acct_db]> select count(*) from my_cluster_step_table;
+----------+
| count(*) |
+----------+
|  9243219 |
+----------+
1 row in set (1.27 sec)

MariaDB [slurm_acct_db]>