Ticket 11735

Summary: slurmdbd memory leak
Product: Slurm Reporter: Regine Gaudin <regine.gaudin>
Component: slurmdbd    Assignee: Albert Gil <albert.gil>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 20.11.5   
Hardware: Linux   
OS: Linux   
Site: CEA Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Regine Gaudin 2021-06-01 08:27:30 MDT
Hi

For compute-hour billing reasons we need to run the following accounting script under root every 3 hours:
#!/bin/bash
for state in cancelled completed deadline failed node_fail out_of_memory requeued suspended timeout
do
   sacct -S "$(date +%m/01)" -E "now" -D -s $state -X -o CpuTimeRAW -n 
done


slurmdbd is regularly killed by the OOM killer, most frequently near the end of the month.

I believe I have reproduced the memory leak (though I am not sure) on a VM with the cluster database imported (I cannot give you the database; please note that the LDAP users are not known on the VM).

I tried to compile Slurm with memory debugging and run it under valgrind, but that leads to a failure in the "sacct -s completed" part:

#  for i in `seq 1 4`;do  date; ps aux | awk '{$5=int(100 * $5/1024/1024)/100"GB";}{ print;}' |grep slurmdbd|grep -v grep; /etc/slurm_prometheus_acct.sh > /dev/null ; echo "sacct done"; date; ps aux | awk '{$5=int(100 * $5/1024/1024)/100"GB";}{ print;}' |grep slurmdbd|grep -v grep;  done
Tue Jun  1 15:00:50 CEST 2021
slurm 14972 6.3 0.0 0.34GB 10816 ? Ssl 14:49 0:44 /usr/sbin/slurmdbd -D
sacct done
Tue Jun  1 15:01:40 CEST 2021
slurm 14972 7.1 1.6 2.72GB 2529268 ? Ssl 14:49 0:53 /usr/sbin/slurmdbd -D
Tue Jun  1 15:01:40 CEST 2021
slurm 14972 7.1 1.6 2.72GB 2529268 ? Ssl 14:49 0:53 /usr/sbin/slurmdbd -D
sacct done
Tue Jun  1 15:02:37 CEST 2021
slurm 14972 8.3 0.5 1.22GB 920172 ? Ssl 14:49 1:06 /usr/sbin/slurmdbd -D
Tue Jun  1 15:02:38 CEST 2021
slurm 14972 8.3 0.5 1.22GB 920172 ? Ssl 14:49 1:06 /usr/sbin/slurmdbd -D
sacct done
Tue Jun  1 15:03:32 CEST 2021
slurm 14972 8.8 2.2 3.59GB 3434812 ? Ssl 14:49 1:16 /usr/sbin/slurmdbd -D
Tue Jun  1 15:03:32 CEST 2021
slurm 14972 8.8 2.2 3.59GB 3434812 ? Ssl 14:49 1:16 /usr/sbin/slurmdbd -D
sacct done
Tue Jun  1 15:04:30 CEST 2021
slurm 14972 9.7 1.6 2.72GB 2529264 ? Ssl 14:49 1:29 /usr/sbin/slurmdbd -D

[root@vm0 x86_64] #  for i in `seq 1 4`;do  date; ps aux | awk '{$5=int(100 * $5/1024/1024)/100"GB";}{ print;}' |grep slurmdbd|grep -v grep; /etc/slurm_prometheus_acct.sh > /dev/null ; echo "sacct done"; date; ps aux | awk '{$5=int(100 * $5/1024/1024)/100"GB";}{ print;}' |grep slurmdbd|grep -v grep;  done
Tue Jun  1 15:05:17 CEST 2021
slurm 14972 9.3 1.6 2.72GB 2529264 ? Ssl 14:49 1:29 /usr/sbin/slurmdbd -D
sacct done
Tue Jun  1 15:06:12 CEST 2021
slurm 14972 9.7 2.0 3.28GB 3110744 ? Ssl 14:49 1:39 /usr/sbin/slurmdbd -D
Tue Jun  1 15:06:12 CEST 2021
slurm 14972 9.7 2.0 3.28GB 3110744 ? Ssl 14:49 1:39 /usr/sbin/slurmdbd -D
sacct done
Tue Jun  1 15:07:07 CEST 2021
slurm 14972 10.2 0.3 0.9GB 596104 ? Ssl 14:49 1:50 /usr/sbin/slurmdbd -D
Tue Jun  1 15:07:07 CEST 2021
slurm 14972 10.2 0.3 0.9GB 596104 ? Ssl 14:49 1:50 /usr/sbin/slurmdbd -D
sacct done
Tue Jun  1 15:08:05 CEST 2021
slurm 14972 10.8 2.0 3.28GB 3110744 ? Ssl 14:49 2:02 /usr/sbin/slurmdbd -D
Tue Jun  1 15:08:05 CEST 2021
slurm 14972 10.8 2.0 3.28GB 3110744 ? Ssl 14:49 2:02 /usr/sbin/slurmdbd -D
sacct done
Tue Jun  1 15:08:59 CEST 2021
slurm 14972 11.2 0.3 0.9GB 596104 ? Ssl 14:49 2:12 /usr/sbin/slurmdbd -D

valgrind --tool=memcheck  --error-limit=no --leak-check=full --show-reachable=yes --max-stackframe=16777216 --num-callers=20 --child-silent-after-fork=yes --track-origins=yes slurmdbd -Dv > valgrind-slurmdbd-20 2>&1

sacct -S 2021-04-04 -E "now" -D -s completed  -X -o CpuTimeRAW -n
sacct: error: Getting response to message type: DBD_GET_JOBS_COND
sacct: error: DBD_GET_JOBS_COND failure: Unspecified error

valgrind became unkillable and needed kill -9:
valgrind --tool=memcheck  --error-limit=no --leak-check=full --show-reachable=yes --max-stackframe=16777216 --num-callers=20 --child-silent-after-fork=yes --track-origins=yes slurmdbd -Dv > valgrind-slurmdbd-20 2>&1
^C^C^C^C^C^C



!ps
ps aux | awk '{$5=int(100 * $5/1024/1024)/100"GB";}{ print;}' |grep slurmdbd|grep -v grep
slurm 22035 98.9 3.4 5.24GB 5278208 pts/0 Sl+ 15:48 18:57 valgrind --tool=memcheck --error-limit=no --leak-check=full --show-reachable=yes --max-stackframe=16777216 --num-callers=20 --child-silent-after-fork=yes --track-origins=yes slurmdbd -Dv
gaudinr 23011 0.1 0.0 0.15GB 5436 pts/3 S+ 16:07 0:00 vim memo_bug_slurmdbd

kill -9 22035

 cat valgrind-slurmdbd-20
==22035== Memcheck, a memory error detector
==22035== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==22035== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==22035== Command: slurmdbd -Dv
==22035== 
slurmdbd: pidfile not locked, assuming no running daemon
slurmdbd: accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 5.5.64-MariaDB
slurmdbd: error: Database settings not recommended values: innodb_buffer_pool_size innodb_log_file_size innodb_lock_wait_timeout
slurmdbd: accounting_storage/as_mysql: init: Accounting storage MYSQL plugin loaded
slurmdbd: slurmdbd version 20.11.5 started
slurmdbd: Terminate signal (SIGINT or SIGTERM) received




[root@vm0 ~] # mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 184
Server version: 5.5.64-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [slurm_acct_db]> select count(*) from CSC_job_table;
ERROR 1146 (42S02): Table 'slurm_acct_db.CSC_job_table' doesn't exist
MariaDB [slurm_acct_db]> select count(*) from my_cluster_job_table;
+----------+
| count(*) |
+----------+
|  3002175 |
+----------+
1 row in set (0.59 sec)

MariaDB [slurm_acct_db]> select count(*) from  my_cluster_step_table;
+----------+
| count(*) |
+----------+
|  9243219 |
+----------+
1 row in set (1.27 sec)

MariaDB [slurm_acct_db]>
Comment 1 Regine Gaudin 2021-06-01 08:29:04 MDT
cat  /etc/slurm_prometheus_acct.sh 
#!/bin/bash
for state in cancelled completed deadline failed node_fail out_of_memory requeued suspended timeout
do
#    sacct -S "$(date +%m/01)" -E "now" -D -s $state -X -o CpuTimeRAW -n 
    sacct -S 2021-04-04 -E "now" -D -s $state -X -o CpuTimeRAW -n 
done
Comment 2 Albert Gil 2021-06-02 07:45:01 MDT
Hi Regine,

Let me break down the different issues I see in your comments:

A) It seems that your SQL configuration is not optimal:

slurmdbd: error: Database settings not recommended values: innodb_buffer_pool_size innodb_log_file_size innodb_lock_wait_timeout

Please try to follow these instructions to have a better SQL setup for Slurm:

https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build
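For reference, the settings flagged in the error are set in my.cnf along these lines. The values below are illustrative starting points, not tuned recommendations; in particular, size innodb_buffer_pool_size to a large share of the database host's RAM (see the page above for details).

```
[mysqld]
# Illustrative values only; tune to your host.
innodb_buffer_pool_size=4096M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
```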

Or maybe those logs are only from the debug/test VM?

B) From your comments it doesn't look like a memory leak to me.

I see the same slurmdbd PID with RAM moving up *and down*, but never above 4G.
That's quite the expected behavior, right?
How much RAM does the host have?

Note that Linux could be OOM-killing slurmdbd because other processes on that machine are using more RAM than expected.
Is that possible?

C) I'm not sure why you need to run that sacct command every 3 hours.

I'm wondering if sreport is what you really want to use.
Can you explain a bit what info you need for the billing, and why you think sreport is not the right tool?


At this point I would like to focus first on C); is that OK for you too?
A) is an easy improvement that most probably won't solve the main issue, but it will certainly help.


Regards,
Albert

PS: I've also noticed that in /etc/slurm_prometheus_acct.sh you loop over several states; FYI, the -s/--states option also accepts a comma-separated list of states.
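As an illustration of that PS, the per-state loop could be collapsed into a single sacct call, summing per state with awk. This is an untested sketch, not a drop-in replacement: it assumes pipe-delimited `-P` output and normalizes state strings like "CANCELLED by <uid>" to their first word.

```shell
#!/bin/bash
# Hypothetical single-query variant of the per-state loop:
# one sacct call (one slurmdbd round-trip) instead of nine.
STATES=cancelled,completed,deadline,failed,node_fail,out_of_memory,requeued,suspended,timeout

sacct -S "$(date +%m/01)" -E now -D -s "$STATES" -X -n -P -o State,CpuTimeRAW |
awk -F'|' '
    {
        # Normalize e.g. "CANCELLED by 12345" to "cancelled"
        state = tolower($1); sub(/ .*/, "", state)
        sum[state] += $2
    }
    END {
        for (s in sum)
            printf "slurm_job_cpu_seconds_total{state=\"%s\"} %d\n", s, sum[s]
    }
'
```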
Comment 3 Regine Gaudin 2021-06-03 09:06:42 MDT
B) On the real cluster
dmesg -T |grep slurmdbd
[Sat May 22 16:01:39 2021] [ 4108]   350  4108  3477282  2972303    5967        0             0 slurmdbd
[Sat May 22 16:01:39 2021] Out of memory: Kill process 4108 (slurmdbd) score 785 or sacrifice child
[Sat May 22 16:01:39 2021] Killed process 4108 (slurmdbd) total-vm:13909128kB, anon-rss:11889212kB, file-rss:0kB, shmem-rss:0kB
[Mon May 24 01:01:41 2021] [18157]   350 18157  2973939  2814329    5561        0             0 slurmdbd
[Mon May 24 01:01:41 2021] Out of memory: Kill process 18157 (slurmdbd) score 743 or sacrifice child
[Mon May 24 01:01:41 2021] Killed process 18157 (slurmdbd) total-vm:11895756kB, anon-rss:11257316kB, file-rss:0kB, shmem-rss:0kB
[Mon May 24 01:01:41 2021] Out of memory: Kill process 18162 (slurmdbd) score 743 or sacrifice child
[Tue May 25 10:01:50 2021] [10305]   350 10305  2868848  2700087    5342        0             0 slurmdbd
[Tue May 25 10:01:50 2021] Out of memory: Kill process 10305 (slurmdbd) score 713 or sacrifice child
[Tue May 25 10:01:50 2021] Killed process 10305 (slurmdbd) total-vm:11475392kB, anon-rss:10800348kB, file-rss:0kB, shmem-rss:0kB
[Wed May 26 08:49:38 2021] [11381]   350 11381  2534113  2418584    4800        0             0 slurmdbd
[Wed May 26 08:49:38 2021] Out of memory: Kill process 11381 (slurmdbd) score 638 or sacrifice child
[Wed May 26 08:49:38 2021] Killed process 11381 (slurmdbd) total-vm:10136452kB, anon-rss:9674336kB, file-rss:0kB, shmem-rss:0kB
[Wed May 26 17:01:41 2021] [28973]   350 28973  2494657  2251135    4476        0             0 slurmdbd
[Wed May 26 17:01:41 2021] Out of memory: Kill process 28973 (slurmdbd) score 594 or sacrifice child
[Wed May 26 17:01:41 2021] Killed process 28973 (slurmdbd) total-vm:9978628kB, anon-rss:9004540kB, file-rss:0kB, shmem-rss:0kB
[Thu May 27 01:01:50 2021] [ 6365]   350  6365  2438250  2231957    4486        0             0 slurmdbd
[Thu May 27 01:01:50 2021] Out of memory: Kill process 6365 (slurmdbd) score 589 or sacrifice child
[Thu May 27 01:01:50 2021] Killed process 6365 (slurmdbd) total-vm:9753000kB, anon-rss:8927828kB, file-rss:0kB, shmem-rss:0kB
[Thu May 27 01:01:50 2021] Out of memory: Kill process 6368 (slurmdbd) score 589 or sacrifice child
[Thu May 27 13:01:53 2021] [11397]   350 11397  2354286  2223149    4422        0             0 slurmdbd
[Thu May 27 13:01:53 2021] Out of memory: Kill process 11397 (slurmdbd) score 587 or sacrifice child
[Thu May 27 13:01:53 2021] Killed process 11397 (slurmdbd) total-vm:9417144kB, anon-rss:8892596kB, file-rss:0kB, shmem-rss:0kB
[Thu May 27 22:01:55 2021] [22925]   350 22925  2308821  2191822    4349        0             0 slurmdbd
[Thu May 27 22:01:55 2021] Out of memory: Kill process 22925 (slurmdbd) score 578 or sacrifice child
[Thu May 27 22:01:55 2021] Killed process 22925 (slurmdbd) total-vm:9235284kB, anon-rss:8767288kB, file-rss:0kB, shmem-rss:0kB
[Fri May 28 11:02:14 2021] [11130]   350 11130  2325798  2163652    4308        0             0 slurmdbd
[Fri May 28 11:02:14 2021] Out of memory: Kill process 11130 (slurmdbd) score 571 or sacrifice child
[Fri May 28 11:02:14 2021] Killed process 11130 (slurmdbd) total-vm:9303192kB, anon-rss:8654608kB, file-rss:0kB, shmem-rss:0kB
[Fri May 28 18:01:53 2021] [29395]   350 29395  2298359  2104128    4222        0             0 slurmdbd
[Fri May 28 18:01:53 2021] Out of memory: Kill process 29395 (slurmdbd) score 555 or sacrifice child
[Fri May 28 18:01:53 2021] Killed process 29395 (slurmdbd) total-vm:9193436kB, anon-rss:8416512kB, file-rss:0kB, shmem-rss:0kB
[Fri May 28 18:01:53 2021] [29457]   350 29395  2298359  2104218    4222        0             0 slurmdbd
[Fri May 28 18:01:53 2021] Out of memory: Kill process 29457 (slurmdbd) score 555 or sacrifice child
[Fri May 28 18:01:53 2021] Killed process 29457 (slurmdbd) total-vm:9193436kB, anon-rss:8416872kB, file-rss:0kB, shmem-rss:0kB
[Sat May 29 05:02:03 2021] [19014]   350 19014  2183676  2008297    3990        0             0 slurmdbd
[Sat May 29 05:02:03 2021] Out of memory: Kill process 19014 (slurmdbd) score 530 or sacrifice child
[Sat May 29 05:02:03 2021] Killed process 19014 (slurmdbd) total-vm:8734704kB, anon-rss:8033188kB, file-rss:0kB, shmem-rss:0kB
[Sat May 29 17:01:56 2021] [12250]   350 12250  2100822  1943934    3858        0             0 slurmdbd
[Sat May 29 17:01:56 2021] Out of memory: Kill process 12250 (slurmdbd) score 513 or sacrifice child
[Sat May 29 17:01:56 2021] Killed process 12250 (slurmdbd) total-vm:8403288kB, anon-rss:7775736kB, file-rss:0kB, shmem-rss:0kB
[Sat May 29 17:01:56 2021] [12307]   350 12250  2100822  1943981    3858        0             0 slurmdbd
[Sat May 29 17:01:56 2021] Out of memory: Kill process 12307 (slurmdbd) score 513 or sacrifice child
[Sat May 29 17:01:56 2021] Killed process 12307 (slurmdbd) total-vm:8403288kB, anon-rss:7775900kB, file-rss:24kB, shmem-rss:0kB
[Sun May 30 05:01:59 2021] [27248]   350 27248  2009756  1899852    3769        0             0 slurmdbd
[Sun May 30 05:01:59 2021] Out of memory: Kill process 27248 (slurmdbd) score 501 or sacrifice child
[Sun May 30 05:01:59 2021] Killed process 27248 (slurmdbd) total-vm:8039024kB, anon-rss:7599408kB, file-rss:0kB, shmem-rss:0kB
[Sun May 30 15:02:04 2021] [15281]   350 15281  2022981  1838734    3671        0             0 slurmdbd
[Sun May 30 15:02:04 2021] Out of memory: Kill process 15281 (slurmdbd) score 485 or sacrifice child
[Sun May 30 15:02:04 2021] Killed process 15281 (slurmdbd) total-vm:8091924kB, anon-rss:7354936kB, file-rss:0kB, shmem-rss:0kB
[Mon May 31 01:01:56 2021] [29982]   350 29982  1893568  1769621    3543        0             0 slurmdbd
[Mon May 31 01:01:56 2021] Out of memory: Kill process 29982 (slurmdbd) score 467 or sacrifice child
[Mon May 31 01:01:56 2021] Killed process 29982 (slurmdbd) total-vm:7574272kB, anon-rss:7078484kB, file-rss:0kB, shmem-rss:0kB
[Mon May 31 08:02:01 2021] [30176]   350 30176  2010320  1727238    3520        0             0 slurmdbd
[Mon May 31 08:02:01 2021] Out of memory: Kill process 30176 (slurmdbd) score 456 or sacrifice child
[Mon May 31 08:02:01 2021] Killed process 30176 (slurmdbd) total-vm:8041280kB, anon-rss:6908952kB, file-rss:0kB, shmem-rss:0kB
[Mon May 31 17:02:02 2021] [31656]   350 31656  1664753  1511721    3035        0             0 slurmdbd
[Mon May 31 17:02:02 2021] Out of memory: Kill process 31656 (slurmdbd) score 399 or sacrifice child
[Mon May 31 17:02:02 2021] Killed process 31656 (slurmdbd) total-vm:6659012kB, anon-rss:6046884kB, file-rss:0kB, shmem-rss:0kB
[Thu Jun  3 09:49:30 2021] [  875]   350   875  1604336  1157221    2550        0             0 slurmdbd
[root@db1 ~] # free -gh
              total        used        free      shared  buff/cache   available
Mem:            14G        8.7G        161M        838M        5.6G        4.6G
Swap:            0B          0B          0B

C) I asked the sysadmin: this is not for billing but for monitoring purposes. The results are read by Prometheus, which lets us detect, for instance, bursts of node failures. So we need a minimum frequency in order to react quickly in case of a problem.

The exact script, which runs hourly, is:
cat /etc/cron.hourly/slurm_prometheus_acct.sh 
#!/bin/bash

TMPFILE=$(mktemp)
trap "rm ${TMPFILE}" EXIT

for state in cancelled completed deadline failed node_fail out_of_memory requeued suspended timeout
do
    value=$(sacct -S "$(date +%m/01)" -E "now" -D -s $state -X -o CpuTimeRAW -n | paste -s -d+ | bc -l 2>/dev/null)
    [ -z $value ] && value=0
    echo slurm_job_cpu_seconds_total{state=\"${state}\"} $value >> ${TMPFILE}
done
test -d /etc/node_exporter || exit 1
cp -f ${TMPFILE} /etc/node_exporter/slurm_prometheus_acct.prom
chmod 0644 /etc/node_exporter/slurm_prometheus_acct.prom
Comment 4 Albert Gil 2021-06-03 09:29:42 MDT
Hi Regine,

> B) On the real cluster
> dmesg -T |grep slurmdbd
> [Sat May 22 16:01:39 2021] [ 4108]   350  4108  3477282  2972303    5967    0             0 slurmdbd
> [Sat May 22 16:01:39 2021] Out of memory: Kill process 4108 (slurmdbd) score 785 or sacrifice child
> [Sat May 22 16:01:39 2021] Killed process 4108 (slurmdbd) total-vm:13909128kB, anon-rss:11889212kB, file-rss:0kB, shmem-rss:0kB

Yes, the kernel is killing slurmdbd, and this is really bad.
But the total-vm used by slurmdbd when killed shows quite normal values: sometimes it is killed with only 6G in use, and the highest value is 13G.
Those values are quite expected given the size of the queries being made, so it doesn't look like a leak to me.

> [root@db1 ~] # free -gh
>               total        used        free      shared  buff/cache   available
> Mem:            14G        8.7G        161M        838M        5.6G        4.6G
> Swap:            0B          0B          0B

OK, I see that your machine has 14G, so it is expected that the kernel OOM-kills slurmdbd.
I think you need to increase, probably double, the memory of the machine where slurmdbd runs.
14G does not seem to be enough for the daily workload on your system.

> C)I have asked to the sysadmin, this is not for billing but for
> survey/monitoring purpose. The results are read by prometheus and it allows
> to detect for instance burst of node_failure. So we need a minimum of
> frequency for beeing reactive in case of problem.
> 
> The exact script and frequenci is hourly is  
> cat /etc/cron.hourly/slurm_prometheus_acct.sh 
> #!/bin/bash
> 
> TMPFILE=$(mktemp)
> trap "rm ${TMPFILE}" EXIT
> 
> for state in cancelled completed deadline failed node_fail out_of_memory requeued suspended timeout
> do
>     value=$(sacct -S "$(date +%m/01)" -E "now" -D -s $state -X -o CpuTimeRAW -n | paste -s -d+ | bc -l 2>/dev/null)
>     [ -z $value ] && value=0
>     echo slurm_job_cpu_seconds_total{state=\"${state}\"} $value >> ${TMPFILE}
> done
> test -d /etc/node_exporter || exit 1
> cp -f ${TMPFILE} /etc/node_exporter/slurm_prometheus_acct.prom
> chmod 0644 /etc/node_exporter/slurm_prometheus_acct.prom


I see.
It seems that what you want is the Job Completion plugin feeding Prometheus.
Just in case you are not aware of it, we support very similar monitoring with Elasticsearch:
https://slurm.schedmd.com/elasticsearch.html
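For context, enabling that plugin is a slurm.conf change along these lines (the endpoint is an illustrative placeholder; see the page above for the exact setup):

```
JobCompType=jobcomp/elasticsearch
JobCompLoc=http://localhost:9200
```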


Regards,
Albert
Comment 5 Regine Gaudin 2021-06-08 08:00:25 MDT
Hi
We'll double the memory of the slurmdbd machine. It'll take a while.
Comment 6 Albert Gil 2021-06-08 09:16:56 MDT
Hi Regine,

> We'll double memory of the slurmdbd machine.

Ok.

> It'll take a while.

If this is OK with you, I'm closing this ticket as "infogiven" for now; please feel free to reopen it if increasing the memory does not solve the issue or you need further support on this.

Regards,
Albert