Ticket 5632 - Severe memory leak on slurmdbd
Summary: Severe memory leak on slurmdbd
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd
Version: 17.11.7
Hardware: Linux
Severity: 2 - High Impact
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-08-25 04:04 MDT by CSC sysadmins
Modified: 2020-07-22 17:52 MDT

See Also:
Site: CSC - IT Center for Science
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Valgrind logs (59.70 KB, application/xml)
2018-08-25 04:04 MDT, CSC sysadmins
Details
slurmdbd debug logs (197.37 KB, application/x-bzip)
2018-08-27 03:28 MDT, CSC sysadmins
Details
slurmdbd log (10.47 KB, text/x-log)
2018-08-30 09:22 MDT, CSC sysadmins
Details
slurmdbd.conf (897 bytes, text/plain)
2018-08-30 10:45 MDT, CSC sysadmins
Details
slurm conf (7.12 KB, text/plain)
2018-08-30 10:46 MDT, CSC sysadmins
Details
Valgrind log (97.48 KB, text/x-log)
2018-08-30 11:28 MDT, CSC sysadmins
Details

Description CSC sysadmins 2018-08-25 04:04:36 MDT
Created attachment 7701 [details]
Valgrind logs

slurmdbd suddenly starts to leak memory, and eventually the out-of-memory killer kills it.
It looks like the leak can be triggered by a simple sacct query: sacct -S2018-08-23 -a


Killed process 16387, UID 88, (slurmdbd) total-vm:75617404kB, anon-rss:44130612kB, file-rss:764kB
Killed process 888, UID 88, (slurmdbd) total-vm:71086500kB, anon-rss:46839900kB, file-rss:760kB
Killed process 51621, UID 88, (slurmdbd) total-vm:72528776kB, anon-rss:47543748kB, file-rss:760kB
Killed process 33544, UID 88, (slurmdbd) total-vm:70692256kB, anon-rss:50673504kB, file-rss:764kB
Killed process 13741, UID 88, (slurmdbd) total-vm:71020964kB, anon-rss:48939836kB, file-rss:768kB
Killed process 16196, UID 27, (mysqld) total-vm:31780160kB, anon-rss:12058892kB, file-rss:1228kB
Killed process 34962, UID 88, (slurmdbd) total-vm:70790028kB, anon-rss:59370956kB, file-rss:412kB
Comment 1 Tim Wickberg 2018-08-25 13:17:49 MDT
(In reply to Tommi Tervo from comment #0)
> Created attachment 7701 [details]
> Valgrind logs
> 
> Slurmdbd starts suddenly to leak memory and finally out of memory killer
> kills it.
> Looks like it can be triggered simple sacct query: sacct -S2018-08-23 -a

Can you better describe what's happening at the time it gets killed?

The valgrind results don't look that unusual for stopping slurmdbd while it is processing an RPC. The largest potential leak reported there is 5 MB, and those allocations should be freed once the thread finishes servicing the request.

Logs from slurmdbd around the time the OOM killer stepped in would be helpful. But the OOM report alone doesn't tell the whole story - how much memory is available on the slurmdbd host, and are there other processes running on it that could have led to the high memory pressure and OOM?
Comment 2 CSC sysadmins 2018-08-27 03:21:37 MDT
It's really difficult to get valgrind logs, because once the leak starts, slurmdbd does not respond to Ctrl-C or a kill -15 signal.
I tried to stop it when RSS was around 3 GB and waited almost 30 minutes until I had to kill it with -9. By then the combined valgrind+slurmdbd RSS was around 15 GB.

Normally slurmdbd RSS is something like 200-500 MB, and when it starts to leak, it grows at roughly 100 MB/s. Valgrind crashes when RSS exceeds 32 GB.

Is there some other way to debug memory allocations?

The server has 64 GB of memory, and MariaDB uses around 10 GB of it. slurmdbd RSS is usually over 50 GB when the oom-killer hits it.
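For reference, running slurmdbd in the foreground under valgrind's memcheck tool looks roughly like this (a sketch only; the install prefix and log-file path are illustrative assumptions, not the exact invocation used here):

```shell
# Sketch: run slurmdbd in the foreground (-D) under valgrind memcheck.
# The install prefix and log path below are assumptions for illustration.
valgrind --tool=memcheck --leak-check=full --num-callers=12 \
         --log-file=/tmp/slurmdbd-valgrind.%p.log \
         /opt/slurm_debug/sbin/slurmdbd -D
```

Note that valgrind multiplies both the memory footprint and the slowdown of the traced daemon, which is consistent with the 15 GB combined RSS observed above.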
Comment 3 CSC sysadmins 2018-08-27 03:28:19 MDT
Created attachment 7702 [details]
slurmdbd debug logs
Comment 4 Alejandro Sanchez 2018-08-27 04:10:36 MDT
Tommi - we've seen this a few times in the past and the following approach helped:

Could you please recompile the code with the --enable-memory-leak-debug option? We create lists for each transaction that cache memory so the list can be reused later. This cache should be fairly small, but it might be related. With this option enabled, the cache is disabled: each memory request is satisfied with a separate xmalloc call, and the memory is freed immediately when it is no longer required. This means valgrind can identify exactly where any leak associated with the list functions originates. You would only need to do this on the slurmdbd node.
Comment 5 CSC sysadmins 2018-08-27 06:41:38 MDT
I've recompiled slurmdbd with --enable-memory-leak-debug and it's now running under valgrind. But I don't know what triggers this error, so I may need to wait some hours or days.
Comment 6 CSC sysadmins 2018-08-29 01:59:11 MDT
I haven't managed to reproduce the leak yet, and yesterday evening slurmdbd restarted itself with a strange error, so I needed to restart the debug session.

[2018-08-28T18:00:07.976] Warning: Note very large processing time from hourly_rollup for csc: usec=7512395 began=18:00:00.464
[2018-08-28T18:27:52.257] error: We should have gotten a result: 'No error' ''
[2018-08-28T18:28:31.318] Accounting storage MYSQL plugin loaded
[2018-08-28T18:28:31.318] pidfile not locked, assuming no running daemon
[2018-08-28T18:28:37.061] slurmdbd version 17.11.7 started

The leaking slurmdbd is compiled against el6 mysql-5.x headers/libs; could that produce a buggy binary when running against MariaDB 10.2? My debug build is compiled against MariaDB 10.2 and without libslurmfull.so.
Comment 7 Alejandro Sanchez 2018-08-29 03:45:09 MDT
(In reply to Tommi Tervo from comment #6)
> Haven't managed to reproduce the leak yet and yesterday evening slurmdbd
> restarted itself with strange error so I needed to restart debug session.

Glad to hear dbd memory consumption isn't growing that much now.
 
> [2018-08-28T18:00:07.976] Warning: Note very large processing time from
> hourly_rollup for csc: usec=7512395 began=18:00:00.464
> [2018-08-28T18:27:52.257] error: We should have gotten a result: 'No error'
> ''
> [2018-08-28T18:28:31.318] Accounting storage MYSQL plugin loaded
> [2018-08-28T18:28:31.318] pidfile not locked, assuming no running daemon
> [2018-08-28T18:28:37.061] slurmdbd version 17.11.7 started

Can you attach slurmdbd.log?
 
> Leaking slurmdbd is compiled against el6 mysql-5.x headers/libs, could it
> make buggy binary when using MariaDB 10.2? 

Do you mean that slurmdbd was compiled against mysql-5.x but now it is using MariaDB 10.2 underneath?

> My debug build it compiled
> against MariaDB-10.2 and without libslurmfull.so.

slurmdbd should be (at least dynamically) linked against libslurmfull.so.
Comment 8 CSC sysadmins 2018-08-30 09:22:42 MDT
Created attachment 7731 [details]
slurmdbd log
Comment 9 CSC sysadmins 2018-08-30 09:23:15 MDT
Bad news. I had valgrind running with slurmdbd compiled with --enable-memory-leak-debug, and it started to leak two hours ago. I had a script running in the background that sends a kill signal when RSS is > 1 GB. As you can see from the attached logs, slurmdbd got a "Terminate signal" but refused to stop.
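A minimal sketch of such a watchdog (the function names and thresholds are assumptions, not the actual script used at the site): it polls VmRSS from /proc and sends SIGTERM once the limit is crossed.

```shell
#!/bin/sh
# Hedged sketch of an RSS watchdog; names and defaults are assumptions.

# Print a process's resident set size in kB, as reported by /proc.
rss_kb() {
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status" 2>/dev/null
}

# Poll a PID every $3 seconds (default 5) and send SIGTERM once its RSS
# exceeds $2 kB. Returns 0 if the signal was sent, 1 if the process
# exited on its own first.
rss_watch() {
    pid=$1 limit_kb=$2 interval=${3:-5}
    while kill -0 "$pid" 2>/dev/null; do
        rss=$(rss_kb "$pid")
        if [ "${rss:-0}" -gt "$limit_kb" ]; then
            echo "RSS ${rss} kB > ${limit_kb} kB, sending SIGTERM to $pid"
            kill -TERM "$pid"
            return 0
        fi
        sleep "$interval"
    done
    return 1
}
```

For example, `rss_watch "$(pidof slurmdbd)" $((1024 * 1024))` would signal slurmdbd at 1 GiB; as seen above, slurmdbd may ignore SIGTERM once the leak is already underway, so a SIGKILL escalation may be needed in practice.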


I double-checked that the leak-debug option was enabled, and config.log confirms it.

------------
It was created by slurm configure 17.11, which was
generated by GNU Autoconf 2.69.  Invocation command line was

  $ ./configure --prefix=/opt/slurm_debug --sysconfdir=/etc/slurm/ --localstatedir=/var --sharedstatedir=/var/lib --enable-memory-leak-debug --without-shared-libslurm --with-mysql_config=/usr/bin


configure:23856: checking whether memory leak debugging is enabled
configure:23876: result: yes

-----------
Comment 10 Alejandro Sanchez 2018-08-30 09:55:44 MDT
I'm investigating further on my side, but I'd like to ask you to raise the threshold at which you signal slurmdbd to 4-5 GiB, since some queries (e.g. from sacct/sreport) can result in buffers as large as 3 GiB.
Comment 11 Alejandro Sanchez 2018-08-30 10:20:16 MDT
Can you also attach slurm.conf and slurmdbd.conf? Thanks.
Comment 12 CSC sysadmins 2018-08-30 10:45:51 MDT
Created attachment 7732 [details]
slurmdbd.conf
Comment 13 CSC sysadmins 2018-08-30 10:46:08 MDT
Created attachment 7733 [details]
slurm conf
Comment 14 CSC sysadmins 2018-08-30 11:28:17 MDT
Created attachment 7735 [details]
Valgring log

This log probably does not capture the leak source, but slurmdbd and slurmctld crashed when I stopped the slurmdbd valgrind session (around T19:23-T19:24).

Nothing interesting in slurmctld.log:

[2018-08-30T19:23:38.261] backfill: Started JobID=24031186 in serial on c154
[2018-08-30T19:23:38.424] backfill: Started JobID=24149122_540(24152306) in serial on c851
[2018-08-30T19:30:47.890] pidfile not locked, assuming no running daemon
[2018-08-30T19:30:47.891] slurmctld version 17.11.7 started on cluster csc
[2018-08-30T19:30:47.891] Munge cryptographic signature plugin loaded

Nor in slurmdbd.log:

[2018-08-30T19:21:48.448] slurmdbd version 17.11.7 started
[2018-08-30T19:23:38.834] error: Problem getting jobs for cluster 
[2018-08-30T19:23:38.834] error: Problem getting jobs for cluster 
[2018-08-30T19:23:38.835] error: Problem getting jobs for cluster 
[2018-08-30T19:23:38.835] error: Problem getting jobs for cluster 
[2018-08-30T19:24:33.501] Terminate signal (SIGINT or SIGTERM) received
Comment 15 Alejandro Sanchez 2018-08-31 06:20:58 MDT
(In reply to Tommi Tervo from comment #6)
These two errors are very uncommon:

> [2018-08-28T18:00:07.976] Warning: Note very large processing time from
> hourly_rollup for csc: usec=7512395 began=18:00:00.464
> [2018-08-28T18:27:52.257] error: We should have gotten a result: 'No error'
> ''

and

> [2018-08-30T19:21:48.448] slurmdbd version 17.11.7 started
> [2018-08-30T19:23:38.834] error: Problem getting jobs for cluster 

Do you have the full logs these messages were written to?

I'm wondering if you are running a healthy slurmdbd binary.

* Why are you configuring with --without-shared-libslurm? It shouldn't make a difference, but that, combined with this:

(In reply to Tommi Tervo from comment #6)

> Leaking slurmdbd is compiled against el6 mysql-5.x headers/libs, could it
> make buggy binary when using MariaDB 10.2? My debug build it compiled
> against MariaDB-10.2 and without libslurmfull.so.

makes me think that perhaps your slurmdbd binary isn't built correctly, or isn't linked against your current underlying database server version.

Looking at the code function mysql_db_query_ret(), which is where one of the error() messages is generated:

https://github.com/SchedMD/slurm/blob/slurm-17.11/src/database/mysql_common.c#L892

There's this comment:

                /*
                 * Starting in MariaDB 10.2 many of the api commands started
                 * setting errno erroneously.
                 */
 
So again, I'm wondering whether your slurmdbd is properly built.
Comment 16 CSC sysadmins 2018-08-31 06:39:27 MDT
OK, I'll upgrade Slurm next Monday. The debug build of slurmdbd is compiled against MariaDB, but the RPM version is not.

MYSQL_LIBS='-L/usr/lib64/ -lmariadb -lz -ldl -lm -lpthread -lssl -lcrypto'


Tommi
Comment 17 CSC sysadmins 2018-09-05 01:02:00 MDT
slurmdbd memory usage settled down after the Slurm update plus compiling against the MariaDB headers/libs. I'll close this.
Comment 18 CSC sysadmins 2018-09-10 04:54:20 MDT
Hi,

I need to reopen this one; the slurmdbd leak happens with 17.11.9-2, compiled against mariadb-10.7.

Mem:  66067192k total, 65867952k used,   199240k free,     2900k buffers
Swap: 33553916k total, 26641172k used,  6912744k free,   172060k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
32196 mysql     20   0 27.2g 9.6g 4192 S 25.5 15.2 256:00.32 mysqld
  229 root      20   0     0    0    0 S 17.8  0.0  57:36.22 kswapd0
45058 slurm     20   0 8365m 331m 3396 S 13.4  0.5   3:43.43 slurmctld
  230 root      20   0     0    0    0 S  7.1  0.0  59:43.46 kswapd1
26684 slurm     20   0 60.9g  48g 1476 S  5.0 77.4  18:33.92 slurmdbd
Comment 19 Alejandro Sanchez 2018-09-10 05:11:43 MDT
How many jobs and how many steps do you have in your database?

mysql> use slurm_acct_db;
mysql> select count(*) from CSC_job_table;
mysql> select count(*) from CSC_step_table;

I'm wondering whether you're storing such a large quantity of jobs/steps that it is impacting operations dealing with these tables. Looking at your slurmdbd.conf, you have these Purge* values:

PurgeEventAfter=1month
PurgeJobAfter=24month
PurgeResvAfter=1month
PurgeStepAfter=24month

* Do you need job/step information from the previous 2 years?

I want to highlight this from the slurmdbd.conf man page:

The purge takes place at the start of the each purge interval. For example, if the purge time is 2 months, the purge would happen at the beginning of each month.

Depending on your table sizes, we could consider reducing how long you store jobs/steps and also increasing the frequency of the purge operation to hourly.
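For example, shortening the retention and expressing the purge values in hours makes the purge run every hour in smaller batches. The values below are an illustrative sketch, not a recommendation for this site:

```
# Hypothetical slurmdbd.conf fragment: 12-month job/step retention
# specified in hours (8760 h = 365 days), so purging runs hourly
# instead of in one large monthly batch.
PurgeJobAfter=8760hours
PurgeStepAfter=8760hours
PurgeEventAfter=720hours
PurgeResvAfter=720hours
```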
Comment 23 CSC sysadmins 2018-09-10 05:35:01 MDT
Here are the counter values. The 24-month purge interval is a conservative choice; it's just handy if one needs some old job statistics.


MariaDB [slurm_acct_db]> select count(*) from csc_job_table;
+----------+
| count(*) |
+----------+
| 14129196 |
+----------+
1 row in set (35.11 sec)

MariaDB [slurm_acct_db]> select count(*) from csc_step_table;
+----------+
| count(*) |
+----------+
| 28795824 |
+----------+
1 row in set (4 min 49.56 sec)
Comment 24 Alejandro Sanchez 2018-09-10 05:44:11 MDT
I guess you only have 1 cluster right?

$ sacctmgr show clusters
Comment 25 CSC sysadmins 2018-09-10 05:47:19 MDT
Hi,

We have two clusters, shell is a really small 4 node "cluster" for interactive jobs.

   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
       csc       10.10.0.5         6817  8192         1                                                                                           normal           
     shell       10.10.0.3         6817  8192         1                                       8                     896                           normal
Comment 26 CSC sysadmins 2018-09-10 05:49:15 MDT
Shell counts:

select count(*) from shell_step_table;
+----------+
| count(*) |
+----------+
|   107817 |
+----------+
1 row in set (1.41 sec)

MariaDB [slurm_acct_db]> select count(*) from shell_job_table;
+----------+
| count(*) |
+----------+
|   108600 |
+----------+
1 row in set (0.29 sec)
Comment 32 Alejandro Sanchez 2018-09-10 10:57:52 MDT
Tommi, we'd like to ask for a bit more info:

Output of

$ sacctmgr show config

and

$ sacctmgr show runawayjobs

We're also still curious why you are configuring Slurm with the --without-shared-libslurm option.

Thanks.
Comment 33 CSC sysadmins 2018-09-11 00:13:50 MDT
> We're also still curious to know why you are configuring Slurm with the option? --without-shared-libslurm

Hi,

Only my first memory-debug build was compiled --without-shared-libslurm; our production version uses libslurm (I just build with rpmbuild -ta slurm.tgz).


[root@service01 tmp]# sacctmgr show config
Configuration data as of 2018-09-11T09:00:47
AccountingStorageBackupHost  = (null)
AccountingStorageHost  = slurmdbip
AccountingStorageLoc   = N/A
AccountingStoragePass  = (null)
AccountingStoragePort  = 6819
AccountingStorageType  = accounting_storage/slurmdbd
AccountingStorageUser  = N/A
AuthType               = auth/munge
MessageTimeout         = 99 sec
PluginDir              = /usr/lib64/slurm
PrivateData            = none
SlurmUserId            = slurm(88)
SLURM_CONF             = /etc/slurm/slurm.conf
SLURM_VERSION          = 17.11.9-2
TCPTimeout             = 2 sec
TrackWCKey             = 0

SlurmDBD configuration:
ArchiveDir             = /tmp
ArchiveEvents          = No
ArchiveJobs            = Yes
ArchiveResvs           = No
ArchiveScript          = (null)
ArchiveSteps           = No
ArchiveSuspend         = No
ArchiveTXN             = No
ArchiveUsage           = No
AuthInfo               = (null)
AuthType               = auth/munge
BOOT_TIME              = 2018-09-10T17:11:57
CommitDelay            = No
DbdAddr                = 10.10.0.5
DbdBackupHost          = (null)
DbdHost                = localhost
DbdPort                = 6819
DebugFlags             = (null)
DebugLevel             = verbose
DebugLevelSyslog       = quiet
DefaultQOS             = (null)
LogFile                = /var/log/slurm/slurmdbd.log
MaxQueryTimeRange      = UNLIMITED
MessageTimeout         = 10 secs
PidFile                = /var/run/slurmdbd.pid
PluginDir              = /usr/lib64/slurm
PrivateData            = none
PurgeEventAfter        = 1 months
PurgeJobAfter          = 24 months*
PurgeResvAfter         = 1 months
PurgeStepAfter         = 24 months
PurgeSuspendAfter      = NONE
PurgeTXNAfter          = NONE
PurgeUsageAfter        = NONE
SLURMDBD_CONF          = /etc/slurm/slurmdbd.conf
SLURMDBD_VERSION       = 17.11.9-2
SlurmUser              = slurm(88)
StorageBackupHost      = (null)
StorageHost            = localhost
StorageLoc             = slurm_acct_db
StoragePort            = 1234
StorageType            = accounting_storage/mysql
StorageUser            = slurm
TCPTimeout             = 2 secs
TrackWCKey             = No
TrackSlurmctldDown     = No
[root@service01 tmp]# sacctmgr show runawayjobs
NOTE: Runaway jobs are jobs that don't exist in the controller but are still considered pending, running or suspended in the database
          ID       Name  Partition    Cluster      State           TimeStart             TimeEnd 
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- 
24152305           reps     serial        csc    RUNNING 2018-08-30T19:23:35             Unknown 

Would you like to fix these runaway jobs?
(This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed.
Once corrected, this will trigger the rollup to reroll usage from before the oldest runaway job.)

 (You have 30 seconds to decide)
(N/y): y
[root@service01 tmp]#
Comment 34 Alejandro Sanchez 2018-09-11 04:44:25 MDT
Are you aware of any user or script doing overly aggressive sacct queries, meaning queries with a start time close to the oldest job in your database, or sacct queries that could return very large results?

Could you try setting something moderate for MaxQueryTimeRange? In any case, queries that attempt to return over 3 GB of data will still fail to complete with ESLURM_RESULT_TOO_LARGE.

Has slurmdbd memory usage settled at 60+ GiB, or does it stabilize back to normal values?
Comment 36 CSC sysadmins 2018-09-11 05:25:06 MDT
It's hard to say whether someone is running a large sacct query; we have hundreds of users.

I enforced a "sane" limit so that it should not be possible to over-consume server memory: MaxQueryTimeRange = 90-00:00:00

If slurmdbd starts to leak memory, it'll hit the out-of-memory killer fast.
Comment 38 Alejandro Sanchez 2018-09-18 05:32:40 MDT
Hi Tommi,

Any updates on slurmdbd memory usage after setting MaxQueryTimeRange?

In any case, my impression here is that the slurmdbd memory usage peaking at ~80+ GiB is not due to a memory leak.

While MaxQueryTimeRange and the hard-coded 3 GiB size limit might help mitigate sending large amounts of data back to the client, there seems to be an intrinsic problem we need to figure out how best to address. Basically, overly aggressive queries against the job/step tables (for instance via sacct) can return very large result sets when sites run millions of jobs in a short period of time. This MySQL/MariaDB data is loaded into slurmdbd memory before we check the MaxQueryTimeRange/3 GiB limits, which is why I believe you're seeing the slurmdbd daemon's memory usage grow to such a size.

We need to internally think how best to address this.
Comment 40 CSC sysadmins 2018-09-19 08:13:33 MDT
Hi,

Memory usage has been modest since that query limit was enforced. At the moment slurmdbd memory usage is higher than normal (usually 200-300 MB), but it's not growing fast.
I checked that there are no sacct processes on the cluster.



  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20501 slurm     20   0 8383m 1.1g 4912 S  0.3  1.7 648:34.12 slurmctld
49572 slurm     20   0 6966m 1.3g 3104 S  0.0  2.1   4:21.16 slurmdbd
Comment 42 Alejandro Sanchez 2018-09-21 04:24:46 MDT
Hi Tommi. Has slurmdbd memory usage settled back to normal numbers?
Comment 43 Alejandro Sanchez 2018-10-04 02:56:19 MDT
(In reply to Alejandro Sanchez from comment #42)
> Hi Tommi. Does slurmdbd memory usage settled stable to normal numbers?

Ping :)
Comment 44 CSC sysadmins 2018-10-05 05:52:32 MDT
Hi Alejandro,

Memory utilisation has been modest since setting the limit. Yesterday it was a bit over 2 GB RSS. I think this bug can be closed if you have a plan for how to fetch data from the DB without consuming a lot of memory in upcoming Slurm versions.

-Tommi
Comment 45 Alejandro Sanchez 2018-10-05 06:10:00 MDT
(In reply to Tommi Tervo from comment #44)
> Hi Alejandro,
> 
> Memory utilisation has been modest since setting the limit. It was yesterday
> a bit over 2GB RSS. I think this bug can be closed if you have plan how to
> fetch data from the DB without consuming a lot of memory for upcoming Slurm
> versions.
> 
> -Tommi

All right, I'm going to go ahead and close this ticket, as memory usage has remained stable as reported.

I've opened a separate sev-5 enhancement request, bug 5817, to track the exploration of new ideas to prevent overly aggressive queries from triggering slurmdbd memory usage peaks.

Thanks for your feedback.