Ticket 8338 - memory leak in slurmctld
Summary: memory leak in slurmctld
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 19.05.5
Hardware: Linux
Severity: 2 - High Impact
Assignee: Dominik Bartkiewicz
QA Contact: Douglas Wightman
URL:
Depends on:
Blocks:
 
Reported: 2020-01-15 02:15 MST by hpc-ops
Modified: 2020-04-20 05:28 MDT

See Also:
Site: Ghent
Version Fixed: 19.05.6


Attachments
Valgrind output (386.16 KB, text/plain)
2020-01-15 02:15 MST, hpc-ops
slurm config (3.16 KB, text/plain)
2020-01-15 02:16 MST, hpc-ops
slurm log (11.64 KB, text/plain)
2020-01-15 02:16 MST, hpc-ops
patch proposal (1.12 KB, patch)
2020-01-15 04:44 MST, Dominik Bartkiewicz

Description hpc-ops 2020-01-15 02:15:30 MST
Created attachment 12741 [details]
Valgrind output

On a small cluster (10 nodes), the memory leak can easily reach 1 GB per day.
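For reference, leaks like this are typically captured by running slurmctld in the foreground under Valgrind with full leak checking; a minimal sketch (the log file path is an assumption, adjust for your install):

```shell
# Stop the production daemon first, then run slurmctld in the
# foreground (-D) with verbose logging (-vvv) under Valgrind.
# The log file path below is an assumption -- use any writable location.
valgrind --leak-check=full --num-callers=32 \
         --log-file=/tmp/slurmctld-valgrind.log \
         slurmctld -D -vvv
```

After reproducing the leak (e.g. letting the daemon run for a while and then shutting it down cleanly), the "definitely lost" records in the log file point at the allocation sites.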
Comment 1 hpc-ops 2020-01-15 02:16:02 MST
Created attachment 12742 [details]
slurm config
Comment 2 hpc-ops 2020-01-15 02:16:23 MST
Created attachment 12743 [details]
slurm log
Comment 4 Dominik Bartkiewicz 2020-01-15 02:56:06 MST
Hi

I think I have found the cause of this leak.
I will provide a fix/patch soon.
Could you send me output from:
"scontrol show res"

Dominik
Comment 5 hpc-ops 2020-01-15 03:05:08 MST
Dear Dominik,

scontrol show res
ReservationName=bbres StartTime=2019-12-13T16:01:14 EndTime=2020-12-12T16:01:14 Duration=365-00:00:00
   Nodes=node3307.joltik.os NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=joltik Flags=
     NodeName=node3307.joltik.os CoreIDs=8-15
   TRES=cpu=8
   Users=vsc43020 Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=tier2q1maintenance StartTime=2020-01-31T08:00:00 EndTime=2021-01-30T08:00:00 Duration=365-00:00:00
   Nodes=node3300.joltik.os,node3301.joltik.os,node3302.joltik.os,node3303.joltik.os,node3304.joltik.os,node3305.joltik.os,node3306.joltik.os,node3307.joltik.os,node3308.joltik.os,node3309.joltik.os NodeCnt=10 CoreCnt=320 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES,ALL_NODES
   TRES=cpu=320
   Users=(null) Accounts=gvo00002 Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
Comment 6 Dominik Bartkiewicz 2020-01-15 04:44:47 MST
Created attachment 12744 [details]
patch proposal

This patch should fix the memory leak.
Sorry this took so long, but I ran additional tests to reduce the probability of introducing new issues.
Comment 9 Dominik Bartkiewicz 2020-01-16 14:33:58 MST
Hi

Did you have a chance to test this patch?

Dominik
Comment 10 hpc-ops 2020-01-17 02:40:52 MST
Dear Dominik,

Thanks for the patch, it solved the memory leak.

Balazs
Comment 12 Dominik Bartkiewicz 2020-01-21 01:52:58 MST
Hi

These commits fix the leak in _core_bitmap_to_array() and three smaller leaks in reservation-related code.
All of these fixes will be included in 19.05.6 and later.

https://github.com/SchedMD/slurm/commit/ffb20605
https://github.com/SchedMD/slurm/commit/0713c41a
https://github.com/SchedMD/slurm/commit/164bafcc

I'll go ahead and close this out.

Dominik