Ticket 6179

Summary: slurmdbd memory leak
Product: Slurm Reporter: Nevin Cheung <cheung.nevin>
Component: slurmdbd Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: goncalves.erika, lipscomb.slaton
Version: 17.11.11   
Hardware: Linux   
OS: Linux   
Site: Genentech (Roche) Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurmdbd.conf as of 20181206
slurmdbd.log as of 20181206
slurmdbd.1.log.bz2
slurmdbd.log as of 20181207, after a day of DebugFlags=DB_JOB,DB_STEP
quick fix

Description Nevin Cheung 2018-12-05 12:01:16 MST
We are a SchedMD customer, using Bright Cluster Manager 7.3. We just upgraded our cluster from Slurm 17.02.09 to 17.11.11. We are noticing a very large resident memory footprint for slurmdbd since our upgrade -- up to 130GB before we had to restart it. We followed the recommendations in Bug 5632 to set the following in slurmdbd.conf:

MaxQueryTimeRange      = 90-00:00:00

While this helped, it only slowed the memory growth -- currently 45GB used after 48 hours. We are also tracking the progress of enhancement Bug 5817.

Please help in debugging, thank you.
Comment 1 slaton 2018-12-05 18:01:29 MST
Some additional details.

We recently began running 'sacctmgr show runaway' after any case where slurmctld was up but slurmdbd was down, to clean up runaway jobs. At the moment no runaways are found.

However we continue to see messages like this in /var/log/slurmdbd for most hourly rollups.

error: We have more time than is possible (30859200+1598400+0)(32457600) > 30859200 for cluster slurm_cluster(8572) from 2018-12-05T11:00:00 - 2018-12-05T12:00:00 tres 1
error: We have more allocated time than is possible (31074260 > 30859200) for cluster slurm_cluster(8572) from 2018-12-05T11:00:00 - 2018-12-05T12:00:00 tres 5
error: We have more time than is possible (30859200+1598400+0)(32457600) > 30859200 for cluster slurm_cluster(8572) from 2018-12-05T11:00:00 - 2018-12-05T12:00:00 tres 5
Warning: Note very large processing time from hourly_rollup for slurm_cluster: usec=62911321 began=13:00:00.734

As Nevin mentioned we run Bright Cluster Manager and so are always a bit suspicious that any slurmdbd resource issue could be due to aggressive sacct queries by Bright. 
However since configuring MaxQueryTimeRange a few days ago, I don't see any messages in slurmdbd logs that queries are being rejected, except for my own tests.

Additionally, the slurmdbd memory utilization issue appeared more or less immediately following our upgrade to 17.11 a couple of weeks ago, whereas we have been running Bright for about 15 months now.
Comment 2 Dominik Bartkiewicz 2018-12-06 02:04:33 MST
Hi

Could you send me slurmdbd.conf and slurmdbd.log?

Does memory consumption stay at 48GB, or does it keep growing?

Could you add DebugFlags=DB_JOB,DB_STEP to slurmdbd.conf?
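For reference, the requested settings would look roughly like this in slurmdbd.conf (illustrative fragment only; merge with the site's existing configuration):

```ini
# Illustrative slurmdbd.conf fragment -- merge with existing settings.
# Log detailed job/step database activity:
DebugFlags=DB_JOB,DB_STEP
# Optionally raise verbosity so connection sources are logged:
#DebugLevel=debug2
```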

Could you send me the output from this "SHOW FULL PROCESSLIST;" query from your MySQL server?

Could you send me the output from "sacctmgr show reservation" and "scontrol show reservation"?

Dominik
Comment 3 Nevin Cheung 2018-12-06 10:59:38 MST
slurmdbd.conf and slurmdbd.log.bz2 will be attached separately.

Memory consumption now 60GB.

We will add DebugFlags=DB_JOB,DB_STEP to slurmdbd.conf at the next opportunity.

MariaDB [slurm_acct_db]> SHOW FULL PROCESSLIST;
+--------+-------+------------------------------------+---------------+---------+------+-------+-----------------------+----------+
| Id     | User  | Host                               | db            | Command | Time | State | Info                  | Progress |
+--------+-------+------------------------------------+---------------+---------+------+-------+-----------------------+----------+
| 146565 | slurm | nb001.yyyyyyy.com:55493 | slurm_acct_db | Sleep   | 3117 |       | NULL                  |    0.000 |
| 146567 | slurm | nb001.yyyyyyy.com:55501 | slurm_acct_db | Sleep   |    1 |       | NULL                  |    0.000 |
| 178149 | slurm | localhost                          | slurm_acct_db | Query   |    0 | NULL  | SHOW FULL PROCESSLIST |    0.000 |
+--------+-------+------------------------------------+---------------+---------+------+-------+-----------------------+----------+
3 rows in set (0.00 sec)

MariaDB [slurm_acct_db]> Bye
[ghpc1 root@nb001 ~]# sacctmgr show reservation
   Cluster            Name                           TRES           TimeStart             TimeEnd UnusedWall 
---------- --------------- ------------------------------ ------------------- ------------------- ---------- 
[ghpc1 root@nb001 ~]# scontrol show reservation
No reservations in the system
[ghpc1 root@nb001 ~]#
Comment 4 Nevin Cheung 2018-12-06 11:00:34 MST
Created attachment 8549 [details]
slurmdbd.conf as of 20181206
Comment 5 Nevin Cheung 2018-12-06 11:02:05 MST
Created attachment 8550 [details]
slurmdbd.log as of 20181206
Comment 6 Nevin Cheung 2018-12-06 17:25:00 MST
Created attachment 8556 [details]
slurmdbd.1.log.bz2

slurmdbd.log as of 20181206, after a few hours of DebugFlags=DB_JOB,DB_STEP
Comment 7 Nevin Cheung 2018-12-07 17:05:19 MST
Created attachment 8568 [details]
slurmdbd.log as of 20181207, after a day of DebugFlags=DB_JOB,DB_STEP
Comment 8 Dominik Bartkiewicz 2018-12-11 08:08:42 MST
Hi
I noticed that about every minute something queries slurmdbd for jobs with "t1.time_end >= 115200".
MaxQueryTimeRange is not checked for SlurmUser and root.
I suspect that these queries cause the huge memory footprint of slurmdbd.


[2018-12-06T13:21:00.235] 8(as_mysql_jobacct_process.c:505) query
select t1.account, t1.admin_comment, t1.array_max_tasks, t1.array_task_str, t1.cpus_req, t1.derived_ec, t1.derived_es, t1.exit_code, t1.id_array_job, t1.id_array_task, t1.id_assoc, t1.id_block, t1.id_group, t1.id_job, t1.pack_job_id, t1.pack_job_offset, t1.id_qos, t1.id_resv, t3.resv_name, t1.id_user, t1.id_wckey, t1.job_db_inx, t1.job_name, t1.kill_requid, t1.mem_req, t1.node_inx, t1.nodelist, t1.nodes_alloc, t1.partition, t1.priority, t1.state, t1.time_eligible, t1.time_end, t1.time_start, t1.time_submit, t1.time_suspended, t1.timelimit, t1.track_steps, t1.wckey, t1.gres_alloc, t1.gres_req, t1.gres_used, t1.tres_alloc, t1.tres_req, t1.work_dir, t1.mcs_label, t2.acct, t2.lft, t2.user from "slurm_cluster_job_table" as t1 left join "slurm_cluster_assoc_table" as t2 on t1.id_assoc=t2.id_assoc left join "slurm_cluster_resv_table" as t3 on t1.id_resv=t3.id_resv && ((t1.time_start && (t3.time_start < t1.time_start && (t3.time_end >= t1.time_start || t3.time_end = 0))) || ((t3.time_start < t1.time_submit && (t3.time_end >= t1.time_submit || t3.time_end = 0)) || (t3.time_start > t1.time_submit))) where ((t1.state='3' && (t1.time_end && (t1.time_end >= 115200))) || (t1.state='5' && (t1.time_end && (t1.time_end >= 115200))) || (t1.state='6' && (t1.time_end && (t1.time_end >= 115200))) || (t1.state='7' && (t1.time_end && (t1.time_end >= 115200)))) && (state != 524288) group by id_job, time_submit desc
--
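For readers decoding the state filter in the query above, here is a hedged mapping of the numeric states being selected; the values are taken from Slurm's job_states enum in slurm.h and should be verified against your Slurm version:

```python
# Numeric job states filtered in the logged query (state='3','5','6','7'),
# per Slurm's job_states enum -- verify against your version's slurm.h.
JOB_STATES = {
    3: "JOB_COMPLETE",
    5: "JOB_FAILED",
    6: "JOB_TIMEOUT",
    7: "JOB_NODE_FAIL",
}

# The query therefore selects every finished job whose end time is on or
# after the given timestamp -- with a near-zero timestamp, all finished jobs.
print(sorted(JOB_STATES))
```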



Dominik
Comment 9 Dominik Bartkiewicz 2018-12-13 14:28:50 MST
Hi

Have you tracked down what sends these queries?

Dominik
Comment 10 slaton 2018-12-13 17:19:46 MST
Dominik, we've tracked these queries to Bright Cluster Manager. 

After consultation with Bright we confirmed that they are only required for some Bright-based monitoring tools and are not linked to key Bright cluster management or HA Slurm functionality. We have disabled them in our TST environment and are still evaluating the changes before implementing them in PRD.

Will update the ticket once this is completed.
Comment 11 Dominik Bartkiewicz 2018-12-14 08:07:06 MST
Hi

Since we now know what causes this problem, can we lower the severity?

Dominik
Comment 12 slaton 2018-12-14 08:12:10 MST
OK to lower severity.
Comment 13 slaton 2018-12-17 18:49:48 MST
Dominik -

After disabling a number of Bright Cluster Manager queries that our colleagues confirmed were non-critical, we are unfortunately still seeing the same slurmdbd memory behavior. The query you identified also still appears in the logs.

We will engage with Bright support to learn more about this query and whether it can be disabled. 

Regarding the value in the query you pointed out (t1.time_end >= 115200), is that a Unix timestamp? If so, the query is for jobs that ended on or after Friday, January 2, 1970, 8 am -- i.e., every job ever recorded.
Comment 14 Dominik Bartkiewicz 2018-12-18 04:18:25 MST
Hi

Yes, this is a Unix timestamp, and it corresponds to January 2, 1970, 8 am UTC.
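That conversion is easy to check with a couple of lines of Python:

```python
from datetime import datetime, timezone

# The suspicious filter value from the logged query: t1.time_end >= 115200.
ts = 115200
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt.isoformat())  # 1970-01-02T08:00:00+00:00 -- effectively "all jobs ever"
```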

If you increase the debug level to debug2, you will be able to see which host each new connection comes from.

If needed, I can provide a patch to track/stop these queries.

Dominik
Comment 15 Dominik Bartkiewicz 2018-12-18 09:18:23 MST
Created attachment 8690 [details]
quick fix

Patch with extra logging; you can uncomment the second part of it to block such queries.
Comment 16 slaton 2018-12-18 20:46:18 MST
Dominik - 

Alas, it appears that making the Bright metrics changes was insufficient without fully restarting the cmd services.

After doing so, we are no longer seeing the problematic query you identified. Additionally, slurmdbd resident memory appears stable so far, after a few hours.

We will continue to monitor and will confirm after 24 hours whether this is still the case.

For other Slurm + Bright users, these are the metrics we have disabled in BrightCM to (apparently) fix this issue. The vendor has confirmed that this change "will only affect how job metrics are gathered and should have no impact on HA or other critical features."

avgexpfactor
avgjobduration
completedjobs
estimateddelay
failedjobs
queuedjobs
runningjobs
Comment 17 Dominik Bartkiewicz 2018-12-19 07:21:37 MST
Hi

Thanks for this info.

Dominik
Comment 18 Dominik Bartkiewicz 2018-12-20 09:25:36 MST
Hi

Is slurmdbd memory usage still stable?
If so, could we close this ticket as infogiven?

Dominik
Comment 19 slaton 2018-12-20 09:58:56 MST
After more than 24 hours, memory usage by slurmdbd is stable and very low, around 4-5 MB.

Thanks Dominik, you can close the ticket.
Comment 20 Dominik Bartkiewicz 2018-12-20 09:59:59 MST
Glad to hear.
Closing as resolved/infogiven.

Dominik