Ticket 6179

Summary: slurmdbd memory leak
Product: Slurm Reporter: Nevin Cheung <cheung.nevin>
Component: slurmdbd Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: goncalves.erika, lipscomb.slaton
Version: 17.11.11   
Hardware: Linux   
OS: Linux   
Site: Genentech (Roche) Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurmdbd.conf as of 20181206
slurmdbd.log as of 20181206
slurmdbd.1.log.bz2
slurmdbd.log as of 20181207, after a day of DebugFlags=DB_JOB,DB_STEP
quick fix

Description Nevin Cheung 2018-12-05 12:01:16 MST
We are a SchedMD customer, using Bright Cluster Manager 7.3. We just upgraded our cluster from Slurm 17.02.09 to 17.11.11. We are noticing a very large resident memory footprint for slurmdbd since our upgrade -- up to 130GB before we had to restart it. We followed the recommendations in Bug 5632 to set the following in slurmdbd.conf:

MaxQueryTimeRange      = 90-00:00:00

While this helped, it only slowed the memory growth -- currently 45GB used after 48 hours. We are also tracking the progress of enhancement Bug 5817.

Please help in debugging, thank you.
Comment 1 slaton 2018-12-05 18:01:29 MST
Some additional details.

We recently began running 'sacctmgr show runaway' after any case where slurmctld was up but slurmdbd was down, to clean up runaway jobs. At the moment no runaways are found.

However we continue to see messages like this in /var/log/slurmdbd for most hourly rollups.

error: We have more time than is possible (30859200+1598400+0)(32457600) > 30859200 for cluster slurm_cluster(8572) from 2018-12-05T11:00:00 - 2018-12-05T12:00:00 tres 1
error: We have more allocated time than is possible (31074260 > 30859200) for cluster slurm_cluster(8572) from 2018-12-05T11:00:00 - 2018-12-05T12:00:00 tres 5
error: We have more time than is possible (30859200+1598400+0)(32457600) > 30859200 for cluster slurm_cluster(8572) from 2018-12-05T11:00:00 - 2018-12-05T12:00:00 tres 5
Warning: Note very large processing time from hourly_rollup for slurm_cluster: usec=62911321 began=13:00:00.734

As Nevin mentioned we run Bright Cluster Manager and so are always a bit suspicious that any slurmdbd resource issue could be due to aggressive sacct queries by Bright. 
However since configuring MaxQueryTimeRange a few days ago, I don't see any messages in slurmdbd logs that queries are being rejected, except for my own tests.

Additionally, the slurmdbd memory utilization issue appeared more or less immediately following our upgrade to 17.11 a couple of weeks ago, whereas we have been running Bright for about 15 months now.
Comment 2 Dominik Bartkiewicz 2018-12-06 02:04:33 MST
Hi

Could you send me slurmdbd.conf and slurmdbd.log?

Does memory consumption stay at 48GB, or does it keep growing?

Could you add DebugFlags=DB_JOB,DB_STEP to slurmdbd.conf?
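For reference, the requested settings would look roughly like this in slurmdbd.conf (illustrative fragment only; merge with the site's existing configuration):

```ini
# Illustrative slurmdbd.conf fragment -- merge with existing settings.
# Log detailed job/step database activity:
DebugFlags=DB_JOB,DB_STEP
# Optionally raise verbosity so connection sources are logged:
#DebugLevel=debug2
```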

Could you send me the output from this "SHOW FULL PROCESSLIST;" query from your MySQL server?

Could you send me the output from "sacctmgr show reservation" and "scontrol show reservation"?

Dominik
Comment 3 Nevin Cheung 2018-12-06 10:59:38 MST
slurmdbd.conf and slurmdbd.log.bz2 will be attached separately.

Memory consumption now 60GB.

We will add DebugFlags=DB_JOB,DB_STEP to slurmdbd.conf at the next opportunity.

MariaDB [slurm_acct_db]> SHOW FULL PROCESSLIST;
+--------+-------+------------------------------------+---------------+---------+------+-------+-----------------------+----------+
| Id     | User  | Host                               | db            | Command | Time | State | Info                  | Progress |
+--------+-------+------------------------------------+---------------+---------+------+-------+-----------------------+----------+
| 146565 | slurm | nb001.yyyyyyy.com:55493 | slurm_acct_db | Sleep   | 3117 |       | NULL                  |    0.000 |
| 146567 | slurm | nb001.yyyyyyy.com:55501 | slurm_acct_db | Sleep   |    1 |       | NULL                  |    0.000 |
| 178149 | slurm | localhost                          | slurm_acct_db | Query   |    0 | NULL  | SHOW FULL PROCESSLIST |    0.000 |
+--------+-------+------------------------------------+---------------+---------+------+-------+-----------------------+----------+
3 rows in set (0.00 sec)

MariaDB [slurm_acct_db]> Bye
[ghpc1 root@nb001 ~]# sacctmgr show reservation
   Cluster            Name                           TRES           TimeStart             TimeEnd UnusedWall 
---------- --------------- ------------------------------ ------------------- ------------------- ---------- 
[ghpc1 root@nb001 ~]# scontrol show reservation
No reservations in the system
[ghpc1 root@nb001 ~]#
Comment 4 Nevin Cheung 2018-12-06 11:00:34 MST
Created attachment 8549 [details]
slurmdbd.conf as of 20181206
Comment 5 Nevin Cheung 2018-12-06 11:02:05 MST
Created attachment 8550 [details]
slurmdbd.log as of 20181206
Comment 6 Nevin Cheung 2018-12-06 17:25:00 MST
Created attachment 8556 [details]
slurmdbd.1.log.bz2

slurmdbd.log as of 20181206, after a few hours of DebugFlags=DB_JOB,DB_STEP
Comment 7 Nevin Cheung 2018-12-07 17:05:19 MST
Created attachment 8568 [details]
slurmdbd.log as of 20181207, after a day of DebugFlags=DB_JOB,DB_STEP
Comment 8 Dominik Bartkiewicz 2018-12-11 08:08:42 MST
Hi
I noticed that about every minute something queries slurmdbd for jobs with "t1.time_end >= 115200".
MaxQueryTimeRange is not checked for SlurmUser and root.
I suspect that these queries cause the huge memory footprint of slurmdbd.


[2018-12-06T13:21:00.235] 8(as_mysql_jobacct_process.c:505) query
select t1.account, t1.admin_comment, t1.array_max_tasks, t1.array_task_str, t1.cpus_req, t1.derived_ec, t1.derived_es, t1.exit_code, t1.id_array_job, t1.id_array_task, t1.id_assoc, t1.id_block, t1.id_group, t1.id_job, t1.pack_job_id, t1.pack_job_offset, t1.id_qos, t1.id_resv, t3.resv_name, t1.id_user, t1.id_wckey, t1.job_db_inx, t1.job_name, t1.kill_requid, t1.mem_req, t1.node_inx, t1.nodelist, t1.nodes_alloc, t1.partition, t1.priority, t1.state, t1.time_eligible, t1.time_end, t1.time_start, t1.time_submit, t1.time_suspended, t1.timelimit, t1.track_steps, t1.wckey, t1.gres_alloc, t1.gres_req, t1.gres_used, t1.tres_alloc, t1.tres_req, t1.work_dir, t1.mcs_label, t2.acct, t2.lft, t2.user from "slurm_cluster_job_table" as t1 left join "slurm_cluster_assoc_table" as t2 on t1.id_assoc=t2.id_assoc left join "slurm_cluster_resv_table" as t3 on t1.id_resv=t3.id_resv && ((t1.time_start && (t3.time_start < t1.time_start && (t3.time_end >= t1.time_start || t3.time_end = 0))) || ((t3.time_start < t1.time_submit && (t3.time_end >= t1.time_submit || t3.time_end = 0)) || (t3.time_start > t1.time_submit))) where ((t1.state='3' && (t1.time_end && (t1.time_end >= 115200))) || (t1.state='5' && (t1.time_end && (t1.time_end >= 115200))) || (t1.state='6' && (t1.time_end && (t1.time_end >= 115200))) || (t1.state='7' && (t1.time_end && (t1.time_end >= 115200)))) && (state != 524288) group by id_job, time_submit desc
--
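For readers decoding the state filter in the query above, here is a hedged mapping of the numeric states being selected; the values are taken from Slurm's job_states enum in slurm.h and should be verified against your Slurm version:

```python
# Numeric job states filtered in the logged query (state='3','5','6','7'),
# per Slurm's job_states enum -- verify against your version's slurm.h.
JOB_STATES = {
    3: "JOB_COMPLETE",
    5: "JOB_FAILED",
    6: "JOB_TIMEOUT",
    7: "JOB_NODE_FAIL",
}

# The query therefore selects every finished job whose end time is on or
# after the given timestamp -- with a near-zero timestamp, all finished jobs.
print(sorted(JOB_STATES))
```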



Dominik
Comment 9 Dominik Bartkiewicz 2018-12-13 14:28:50 MST
Hi

Have you tracked down what sends these queries?

Dominik
Comment 10 slaton 2018-12-13 17:19:46 MST
Dominik, we've tracked these queries to Bright Cluster Manager. 

After consultation with Bright we confirmed that they are only required for some Bright-based monitoring tools and are not linked to key Bright cluster management or HA Slurm functionality. We have disabled them in our TST environment and are still evaluating the changes before implementing them in PRD.

Will update the ticket once this is completed.
Comment 11 Dominik Bartkiewicz 2018-12-14 08:07:06 MST
Hi

Since we now know what causes this problem, can we lower the severity?

Dominik
Comment 12 slaton 2018-12-14 08:12:10 MST
OK to lower severity.
Comment 13 slaton 2018-12-17 18:49:48 MST
Dominik -

After disabling a number of Bright Cluster Manager queries that our colleagues confirmed were non-critical, we are unfortunately still seeing the same slurmdbd memory behavior. The query you identified also still appears in the logs.

We will engage with Bright support to learn more about this query and whether it can be disabled. 

Regarding the value in the query you pointed out (t1.time_end >= 115200), is that a Unix timestamp? If so, the query is for jobs that ended on or after Friday, January 2, 1970, 8 am -- i.e., every job ever recorded.
Comment 14 Dominik Bartkiewicz 2018-12-18 04:18:25 MST
Hi

Yes, this is a Unix timestamp, and it corresponds to January 2, 1970, 8 am UTC.
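That conversion is easy to check with a couple of lines of Python:

```python
from datetime import datetime, timezone

# The suspicious filter value from the logged query: t1.time_end >= 115200.
ts = 115200
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt.isoformat())  # 1970-01-02T08:00:00+00:00 -- effectively "all jobs ever"
```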

If you increase the debug level to debug2, you will be able to see which host each new connection comes from.

If needed, I can provide a patch to track/stop these queries.

Dominik
Comment 15 Dominik Bartkiewicz 2018-12-18 09:18:23 MST
Created attachment 8690 [details]
quick fix

Patch with extra logging; you can uncomment the second part of it to block such queries.
Comment 16 slaton 2018-12-18 20:46:18 MST
Dominik - 

Alas, it appears that making the Bright metrics changes was insufficient without fully restarting the cmd services.

After doing so, we are no longer seeing the problematic query you identified. Additionally, slurmdbd resident memory appears stable so far, after a few hours.

We will continue to monitor and will confirm after 24 hours whether this is still the case.

For other Slurm + Bright users, these are the metrics we have disabled in BrightCM to (apparently) fix this issue. The vendor has confirmed that this change "will only affect how job metrics are gathered and should have no impact on HA or other critical features."

avgexpfactor
avgjobduration
completedjobs
estimateddelay
failedjobs
queuedjobs
runningjobs
Comment 17 Dominik Bartkiewicz 2018-12-19 07:21:37 MST
Hi

Thanks for this info.

Dominik
Comment 18 Dominik Bartkiewicz 2018-12-20 09:25:36 MST
Hi

Is slurmdbd memory usage still stable?
If so, could we close this ticket as infogiven?

Dominik
Comment 19 slaton 2018-12-20 09:58:56 MST
After more than 24 hours, memory usage by slurmdbd is stable and very low, around 4-5 MB.

Thanks Dominik, you can close the ticket.
Comment 20 Dominik Bartkiewicz 2018-12-20 09:59:59 MST
Glad to hear.
Closing as resolved/infogiven.

Dominik