Created attachment 7701 [details]
Valgrind logs

Slurmdbd suddenly starts to leak memory and eventually the out-of-memory killer kills it. It looks like it can be triggered by a simple sacct query: sacct -S2018-08-23 -a

Killed process 16387, UID 88, (slurmdbd) total-vm:75617404kB, anon-rss:44130612kB, file-rss:764kB
Killed process 888, UID 88, (slurmdbd) total-vm:71086500kB, anon-rss:46839900kB, file-rss:760kB
Killed process 51621, UID 88, (slurmdbd) total-vm:72528776kB, anon-rss:47543748kB, file-rss:760kB
Killed process 33544, UID 88, (slurmdbd) total-vm:70692256kB, anon-rss:50673504kB, file-rss:764kB
Killed process 13741, UID 88, (slurmdbd) total-vm:71020964kB, anon-rss:48939836kB, file-rss:768kB
Killed process 16196, UID 27, (mysqld) total-vm:31780160kB, anon-rss:12058892kB, file-rss:1228kB
Killed process 34962, UID 88, (slurmdbd) total-vm:70790028kB, anon-rss:59370956kB, file-rss:412kB
(In reply to Tommi Tervo from comment #0)
> Created attachment 7701 [details]
> Valgrind logs
>
> Slurmdbd starts suddenly to leak memory and finally out of memory killer
> kills it.
> Looks like it can be triggered simple sacct query: sacct -S2018-08-23 -a

Can you better describe what's happening at the time it gets killed? The valgrind results don't look that unusual for stopping the slurmdbd while processing an RPC. The largest potential leak reported there is 5MB, and those allocations should be freed once the thread finishes servicing the request.

Logs from slurmdbd around the time the OOM killer stepped in would be nice. But the description indicates it's not using a huge amount of memory at that point - how much is available on the slurmdbd host, and are there other processes running on it which could have led to the high memory pressure and OOM?
It's really difficult to get valgrind logs because once the leak starts, slurmdbd does not respond to Ctrl-C or a kill -15 signal. I tried to stop it when RSS was around 3GB and waited almost 30 minutes, until I had to kill it with -9. By then the valgrind+slurmdbd RSS was around 15GB. Normally slurmdbd RSS is something like 200-500MB, and when it starts to leak it grows by about 100MB/s. Valgrind will crash when RSS goes over 32GB. Is there some other way to debug memory allocations?

The server has 64GB of memory and MariaDB uses around 10GB of it. Usually slurmdbd RSS is over 50GB when the oom-killer hits it.
Created attachment 7702 [details] slurmdbd debug logs
Tommi - we've seen this a few times in the past and the following approach helped: could you please recompile the code with the --enable-memory-leak-debug option?

We create lists for each transaction, which in effect caches the memory for the list so it can be reused later. This should be fairly small, but might be related. With this option enabled, the cache is disabled: each memory request is satisfied with a separate xmalloc request, and when the memory is no longer required it is immediately freed. This means valgrind can identify exactly where any leak associated with the use of the list functions originates.

You would only need to do this for the slurmdbd node.
I've recompiled slurmdbd with --enable-memory-leak-debug and it's now running under valgrind. But I don't know what triggers this error, so I need to wait some hours/days.
Haven't managed to reproduce the leak yet, and yesterday evening slurmdbd restarted itself with a strange error, so I needed to restart the debug session.

[2018-08-28T18:00:07.976] Warning: Note very large processing time from hourly_rollup for csc: usec=7512395 began=18:00:00.464
[2018-08-28T18:27:52.257] error: We should have gotten a result: 'No error' ''
[2018-08-28T18:28:31.318] Accounting storage MYSQL plugin loaded
[2018-08-28T18:28:31.318] pidfile not locked, assuming no running daemon
[2018-08-28T18:28:37.061] slurmdbd version 17.11.7 started

The leaking slurmdbd is compiled against el6 mysql-5.x headers/libs - could that produce a buggy binary when running against MariaDB 10.2? My debug build is compiled against MariaDB-10.2 and without libslurmfull.so.
(In reply to Tommi Tervo from comment #6)
> Haven't managed to reproduce the leak yet and yesterday evening slurmdbd
> restarted itself with strange error so I needed to restart debug session.

Glad to hear dbd memory consumption isn't growing that much now.

> [2018-08-28T18:00:07.976] Warning: Note very large processing time from
> hourly_rollup for csc: usec=7512395 began=18:00:00.464
> [2018-08-28T18:27:52.257] error: We should have gotten a result: 'No error' ''
> [2018-08-28T18:28:31.318] Accounting storage MYSQL plugin loaded
> [2018-08-28T18:28:31.318] pidfile not locked, assuming no running daemon
> [2018-08-28T18:28:37.061] slurmdbd version 17.11.7 started

Can you attach slurmdbd.log?

> Leaking slurmdbd is compiled against el6 mysql-5.x headers/libs, could it
> make buggy binary when using MariaDB 10.2?

Do you mean that slurmdbd was compiled against mysql-5.x but now it is using MariaDB 10.2 underneath?

> My debug build it compiled
> against MariaDB-10.2 and without libslurmfull.so.

slurmdbd should be (at least dynamically) linked against libslurmfull.so.
Created attachment 7731 [details] slurmdbd log
Bad news. I had valgrind running with slurmdbd compiled with --enable-memory-leak-debug, and it started to leak two hours ago. I had a script running in the background which sends a kill signal when RSS is > 1G. As you can see from the attached logs, slurmdbd got a "Terminate signal" but refused to stop. I double-checked that the leak option was enabled, and config.log confirms it.

------------
It was created by slurm configure 17.11, which was
generated by GNU Autoconf 2.69.  Invocation command line was

  $ ./configure --prefix=/opt/slurm_debug --sysconfdir=/etc/slurm/ --localstatedir=/var --sharedstatedir=/var/lib --enable-memory-leak-debug --without-shared-libslurm --with-mysql_config=/usr/bin

configure:23856: checking whether memory leak debugging is enabled
configure:23876: result: yes
-----------
I'm investigating further on my side, but I'd like to ask you to raise the threshold at which you signal slurmdbd to 4-5GiB, since there can be queries (i.e. sacct/sreport) resulting in large buffers of up to 3GiB.
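For reference, a minimal sketch of such an RSS watchdog, in case it helps compare notes. The 4 GiB threshold follows the request above, but the poll interval and the structure of the loop are assumptions, not the actual script in use; the pidfile path matches the attached slurmdbd.conf.

```shell
#!/bin/sh
# Hypothetical RSS watchdog sketch: poll slurmdbd's resident set size and
# send SIGTERM once it crosses a threshold.
THRESHOLD_KB=$((4 * 1024 * 1024))        # 4 GiB expressed in kB
PID=$(cat /var/run/slurmdbd.pid)

rss_kb() {
    # /proc/<pid>/status reports VmRSS in kB on Linux
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

while kill -0 "$PID" 2>/dev/null; do
    rss=$(rss_kb "$PID")
    if [ "${rss:-0}" -gt "$THRESHOLD_KB" ]; then
        echo "slurmdbd RSS ${rss}kB exceeds ${THRESHOLD_KB}kB, sending SIGTERM"
        kill -TERM "$PID"
        break
    fi
    sleep 5
done
```

Escalating to kill -9 after a grace period would match the behaviour described earlier in this ticket, since slurmdbd may ignore SIGTERM while leaking.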
Can you also attach slurm.conf and slurmdbd.conf? Thanks.
Created attachment 7732 [details] slurmdbd.conf
Created attachment 7733 [details] slurm conf
Created attachment 7735 [details]
Valgrind log

This run probably does not capture the leak source, but slurmdbd and slurmctld crashed when I stopped the slurmdbd valgrind session (around T19:23-T19:24).

Nothing interesting in slurmctld.log:

[2018-08-30T19:23:38.261] backfill: Started JobID=24031186 in serial on c154
[2018-08-30T19:23:38.424] backfill: Started JobID=24149122_540(24152306) in serial on c851
[2018-08-30T19:30:47.890] pidfile not locked, assuming no running daemon
[2018-08-30T19:30:47.891] slurmctld version 17.11.7 started on cluster csc
[2018-08-30T19:30:47.891] Munge cryptographic signature plugin loaded

Nor in slurmdbd.log:

[2018-08-30T19:21:48.448] slurmdbd version 17.11.7 started
[2018-08-30T19:23:38.834] error: Problem getting jobs for cluster
[2018-08-30T19:23:38.834] error: Problem getting jobs for cluster
[2018-08-30T19:23:38.835] error: Problem getting jobs for cluster
[2018-08-30T19:23:38.835] error: Problem getting jobs for cluster
[2018-08-30T19:24:33.501] Terminate signal (SIGINT or SIGTERM) received
(In reply to Tommi Tervo from comment #6)

These two errors are very uncommon:

> [2018-08-28T18:00:07.976] Warning: Note very large processing time from
> hourly_rollup for csc: usec=7512395 began=18:00:00.464
> [2018-08-28T18:27:52.257] error: We should have gotten a result: 'No error' ''

and

> [2018-08-30T19:21:48.448] slurmdbd version 17.11.7 started
> [2018-08-30T19:23:38.834] error: Problem getting jobs for cluster

Do you have the logs these messages were written to? I'm wondering if you are running a healthy slurmdbd binary.

* Why are you configuring with --without-shared-libslurm? It shouldn't make a difference, but that plus

(In reply to Tommi Tervo from comment #6)
> Leaking slurmdbd is compiled against el6 mysql-5.x headers/libs, could it
> make buggy binary when using MariaDB 10.2? My debug build it compiled
> against MariaDB-10.2 and without libslurmfull.so.

makes me think that perhaps your slurmdbd binary isn't correct, or isn't linked against your current underlying database server and version.

Looking at the code, the function mysql_db_query_ret(), which is where one of the error() messages is generated:

https://github.com/SchedMD/slurm/blob/slurm-17.11/src/database/mysql_common.c#L892

has this comment:

	/*
	 * Starting in MariaDB 10.2 many of the api commands started
	 * setting errno erroneously.
	 */

so again I'm wondering if your slurmdbd is properly built?
OK, I'll upgrade Slurm next Monday. The debug version of slurmdbd is compiled against MariaDB, but the rpm version is not.

MYSQL_LIBS='-L/usr/lib64/ -lmariadb -lz -ldl -lm -lpthread -lssl -lcrypto'

Tommi
Slurmdbd memory usage settled down after the Slurm update plus compiling against MariaDB headers/libs. I'll close this.
Hi, I need to reopen this one: the slurmdbd leak happens with 17.11.9-2, compiled against mariadb-10.2.

Mem:  66067192k total, 65867952k used,   199240k free,     2900k buffers
Swap: 33553916k total, 26641172k used,  6912744k free,   172060k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
32196 mysql     20   0 27.2g 9.6g 4192 S 25.5 15.2 256:00.32 mysqld
  229 root      20   0     0    0    0 S 17.8  0.0  57:36.22 kswapd0
45058 slurm     20   0 8365m 331m 3396 S 13.4  0.5   3:43.43 slurmctld
  230 root      20   0     0    0    0 S  7.1  0.0  59:43.46 kswapd1
26684 slurm     20   0 60.9g   48g 1476 S  5.0 77.4  18:33.92 slurmdbd
How many jobs and how many steps do you have in your database?

mysql> use slurm_acct_db;
mysql> select count(*) from CSC_job_table;
mysql> select count(*) from CSC_step_table;

I'm wondering if you're storing such a large quantity of jobs/steps that it is impacting some operations dealing with these tables. Looking at your slurmdbd.conf, you have these Purge* values:

PurgeEventAfter=1month
PurgeJobAfter=24month
PurgeResvAfter=1month
PurgeStepAfter=24month

* Do you need job/step information from the previous 2 years?

I want to highlight this from the slurmdbd.conf man page:

	The purge takes place at the start of each purge interval.
	For example, if the purge time is 2 months, the purge would
	happen at the beginning of each month.

Depending on your table sizes, we could consider reducing how long you store jobs/steps, and also increasing the frequency of the purge operation to hourly.
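To illustrate the purge granularity described above, the unit used for a Purge* value also determines how often the purge runs: specifying hours makes it run hourly rather than monthly. The retention values below are examples only, not a recommendation for this site.

```
# slurmdbd.conf fragment - example retention values
PurgeEventAfter=1month
PurgeJobAfter=12month       # example: keep 12 months of jobs instead of 24
PurgeStepAfter=4320hours    # ~6 months; hour units => purge runs hourly
PurgeResvAfter=1month
```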
Here are the counter values. The 24-month purge interval is a conservative choice; it's just handy if one needs some old job statistics.

MariaDB [slurm_acct_db]> select count(*) from csc_job_table;
+----------+
| count(*) |
+----------+
| 14129196 |
+----------+
1 row in set (35.11 sec)

MariaDB [slurm_acct_db]> select count(*) from csc_step_table;
+----------+
| count(*) |
+----------+
| 28795824 |
+----------+
1 row in set (4 min 49.56 sec)
I guess you only have 1 cluster, right?

$ sacctmgr show clusters
Hi,

We have two clusters; shell is a really small 4-node "cluster" for interactive jobs.

   Cluster     ControlHost  ControlPort   RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall      QOS Def QOS
---------- --------------- ------------ ----- ----- ------- ------- --------- ------- ------- --------- ------- -------- -------
       csc       10.10.0.5         6817  8192     1                                                              normal
     shell       10.10.0.3         6817  8192     1       8     896                                              normal
The shell counts:

MariaDB [slurm_acct_db]> select count(*) from shell_step_table;
+----------+
| count(*) |
+----------+
|   107817 |
+----------+
1 row in set (1.41 sec)

MariaDB [slurm_acct_db]> select count(*) from shell_job_table;
+----------+
| count(*) |
+----------+
|   108600 |
+----------+
1 row in set (0.29 sec)
Tommi, we'd like to ask for a bit more info. The output of:

$ sacctmgr show config

and

$ sacctmgr show runawayjobs

We're also still curious to know why you are configuring Slurm with the --without-shared-libslurm option.

Thanks.
> We're also still curious to know why you are configuring Slurm with the
> --without-shared-libslurm option.

Hi,

Only my first memory-debug version was built without libslurm; our production version uses libslurm (I just build with rpmbuild -ta slurm.tgz).

[root@service01 tmp]# sacctmgr show config
Configuration data as of 2018-09-11T09:00:47
AccountingStorageBackupHost = (null)
AccountingStorageHost   = slurmdbip
AccountingStorageLoc    = N/A
AccountingStoragePass   = (null)
AccountingStoragePort   = 6819
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AuthType                = auth/munge
MessageTimeout          = 99 sec
PluginDir               = /usr/lib64/slurm
PrivateData             = none
SlurmUserId             = slurm(88)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 17.11.9-2
TCPTimeout              = 2 sec
TrackWCKey              = 0

SlurmDBD configuration:
ArchiveDir              = /tmp
ArchiveEvents           = No
ArchiveJobs             = Yes
ArchiveResvs            = No
ArchiveScript           = (null)
ArchiveSteps            = No
ArchiveSuspend          = No
ArchiveTXN              = No
ArchiveUsage            = No
AuthInfo                = (null)
AuthType                = auth/munge
BOOT_TIME               = 2018-09-10T17:11:57
CommitDelay             = No
DbdAddr                 = 10.10.0.5
DbdBackupHost           = (null)
DbdHost                 = localhost
DbdPort                 = 6819
DebugFlags              = (null)
DebugLevel              = verbose
DebugLevelSyslog        = quiet
DefaultQOS              = (null)
LogFile                 = /var/log/slurm/slurmdbd.log
MaxQueryTimeRange       = UNLIMITED
MessageTimeout          = 10 secs
PidFile                 = /var/run/slurmdbd.pid
PluginDir               = /usr/lib64/slurm
PrivateData             = none
PurgeEventAfter         = 1 months
PurgeJobAfter           = 24 months*
PurgeResvAfter          = 1 months
PurgeStepAfter          = 24 months
PurgeSuspendAfter       = NONE
PurgeTXNAfter           = NONE
PurgeUsageAfter         = NONE
SLURMDBD_CONF           = /etc/slurm/slurmdbd.conf
SLURMDBD_VERSION        = 17.11.9-2
SlurmUser               = slurm(88)
StorageBackupHost       = (null)
StorageHost             = localhost
StorageLoc              = slurm_acct_db
StoragePort             = 1234
StorageType             = accounting_storage/mysql
StorageUser             = slurm
TCPTimeout              = 2 secs
TrackWCKey              = No
TrackSlurmctldDown      = No

[root@service01 tmp]# sacctmgr show runawayjobs
NOTE: Runaway jobs are jobs that don't exist in the controller but are still considered pending, running or suspended in the database

          ID       Name  Partition    Cluster      State           TimeStart             TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- -------------------
    24152305       reps     serial        csc    RUNNING 2018-08-30T19:23:35             Unknown

Would you like to fix these runaway jobs?
(This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed.
Once corrected, this will trigger the rollup to reroll usage from before the oldest runaway job.)
(You have 30 seconds to decide)
(N/y): y
[root@service01 tmp]#
Are you aware of any user/script doing overly-aggressive sacct queries, meaning queries with a start time close to the oldest job you have in your database, or sacct queries that could report very large results?

Could you try setting something moderate for MaxQueryTimeRange? In any case, queries which attempt to return over 3GB of data will still fail to complete with ESLURM_RESULT_TOO_LARGE.

Has slurmdbd memory usage settled at 60+GiB, or does it stabilize back to normal values?
It's hard to say if someone is running a large sacct query; we have hundreds of users. I enforced a "sane" limit so that it should not be possible to over-consume server memory:

MaxQueryTimeRange = 90-00:00:00

If slurmdbd starts to leak memory, it'll hit the out-of-memory killer fast.
Hi Tommi, any updates on slurmdbd memory usage after setting MaxQueryTimeRange?

In any case, my impression here is that the slurmdbd memory usage peaking up to ~80+GiB is not due to a memory leak. While MaxQueryTimeRange or the hard-coded 3GiB size limit might help mitigate sending large amounts of data back to the client, it seems there's an intrinsic problem we need to work out how best to address.

Basically, the problem seems to be that overly-aggressive queries against the job/step tables (for instance using sacct) can return large result sets when people run millions of jobs in a short period of time. This MySQL/MariaDB data is loaded into slurmdbd memory before we check MaxQueryTimeRange or the 3GiB size limit, and that's why I believe you're seeing the slurmdbd daemon's memory usage grow to such a size. We need to think internally about how best to address this.
Hi,

Memory usage has been modest since that query limit was enforced. At the moment slurmdbd memory usage is higher than normal (usually 200-300MB) but it's not growing fast. I checked that there are no sacct processes on the cluster.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20501 slurm     20   0 8383m 1.1g 4912 S  0.3  1.7 648:34.12 slurmctld
49572 slurm     20   0 6966m 1.3g 3104 S  0.0  2.1   4:21.16 slurmdbd
Hi Tommi. Has slurmdbd memory usage settled back to normal numbers?
(In reply to Alejandro Sanchez from comment #42)
> Hi Tommi. Does slurmdbd memory usage settled stable to normal numbers?

Ping :)
Hi Alejandro,

Memory utilisation has been modest since setting the limit. Yesterday it was a bit over 2GB RSS. I think this bug can be closed if you have a plan for how to fetch data from the DB without consuming a lot of memory in upcoming Slurm versions.

-Tommi
(In reply to Tommi Tervo from comment #44)
> Hi Alejandro,
>
> Memory utilisation has been modest since setting the limit. It was yesterday
> a bit over 2GB RSS. I think this bug can be closed if you have plan how to
> fetch data from the DB without consuming a lot of memory for upcoming Slurm
> versions.
>
> -Tommi

All right, I'm going to go ahead and close this bug, given that memory usage remains stable as reported. I've opened a separate sev-5 enhancement request, bug 5817, to track the exploration of new ideas to prevent overly-aggressive queries from triggering slurmdbd memory usage peaks. Thanks for your feedback.