Ticket 8338

Summary: memory leak in slurmctld
Product: Slurm
Reporter: hpc-ops
Component: slurmctld
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED
QA Contact: Douglas Wightman <wightman>
Severity: 2 - High Impact
Priority: ---
CC: bart, csc-slurm-tickets
Version: 19.05.5
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8597
https://bugs.schedmd.com/show_bug.cgi?id=8899
Site: Ghent
Version Fixed: 19.05.6
Attachments: Valgrind output
slurm config
slurm log
patch proposal

Description hpc-ops 2020-01-15 02:15:30 MST
Created attachment 12741 [details]
Valgrind output

On a small cluster (10 nodes), the memory leak can easily reach 1 GB per day.
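
A growth figure like the one above can be estimated from periodic samples of the daemon's resident memory. The following is a minimal sketch, not part of Slurm: the function names are hypothetical, and it assumes the samples come from the VmRSS line of /proc/<pid>/status for the slurmctld process on Linux.

```python
def parse_vm_rss_kib(status_text):
    """Return the VmRSS value (in KiB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line looks like: "VmRSS:     123456 kB"
            return int(line.split()[1])
    raise ValueError("VmRSS line not found")


def growth_per_day_gib(samples):
    """Estimate memory growth in GiB/day.

    samples: list of (unix_seconds, rss_kib) pairs.
    Uses an ordinary least-squares slope so one noisy sample
    does not dominate the estimate.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    slope_kib_per_second = num / den
    # 86400 seconds per day, 1024**2 KiB per GiB
    return slope_kib_per_second * 86400 / (1024 ** 2)
```

A steady positive slope over several hours of samples suggests a leak rather than normal working-set fluctuation.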
Comment 1 hpc-ops 2020-01-15 02:16:02 MST
Created attachment 12742 [details]
slurm config
Comment 2 hpc-ops 2020-01-15 02:16:23 MST
Created attachment 12743 [details]
slurm log
Comment 4 Dominik Bartkiewicz 2020-01-15 02:56:06 MST
Hi

I think I have found the cause of this leak.
I will provide a patch soon.
Could you send me the output of:
"scontrol show res"

Dominik
Comment 5 hpc-ops 2020-01-15 03:05:08 MST
Dear Dominik,

scontrol show res
ReservationName=bbres StartTime=2019-12-13T16:01:14 EndTime=2020-12-12T16:01:14 Duration=365-00:00:00
   Nodes=node3307.joltik.os NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=joltik Flags=
     NodeName=node3307.joltik.os CoreIDs=8-15
   TRES=cpu=8
   Users=vsc43020 Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

ReservationName=tier2q1maintenance StartTime=2020-01-31T08:00:00 EndTime=2021-01-30T08:00:00 Duration=365-00:00:00
   Nodes=node3300.joltik.os,node3301.joltik.os,node3302.joltik.os,node3303.joltik.os,node3304.joltik.os,node3305.joltik.os,node3306.joltik.os,node3307.joltik.os,node3308.joltik.os,node3309.joltik.os NodeCnt=10 CoreCnt=320 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES,ALL_NODES
   TRES=cpu=320
   Users=(null) Accounts=gvo00002 Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
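
For anyone who wants to inspect output like the above programmatically, here is a minimal sketch that turns it into one dictionary per reservation. The helper name is hypothetical and this is not a Slurm API. It assumes records are separated by blank lines and that no value contains whitespace, which holds for the output shown here but is not guaranteed in general.

```python
import re


def parse_reservations(text):
    """Parse `scontrol show res` text into a list of dicts.

    Assumptions: one blank line between reservations, and
    whitespace-free KEY=VALUE tokens. Repeated keys (for example
    NodeName/CoreIDs when several nodes are listed) keep only the
    last occurrence.
    """
    reservations = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        for token in block.split():
            # Split on the first "=" only, so "TRES=cpu=8" -> ("TRES", "cpu=8")
            key, sep, value = token.partition("=")
            if sep:
                fields[key] = value
        if fields:
            reservations.append(fields)
    return reservations
```

With the output above, the first dictionary would carry CoreIDs "8-15" for the bbres reservation, and the second would carry the MAINT flags of the maintenance reservation.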
Comment 6 Dominik Bartkiewicz 2020-01-15 04:44:47 MST
Created attachment 12744 [details]
patch proposal

This patch should fix the memory leak.
Sorry that it took so long, but I ran additional tests to reduce the probability of introducing new issues.
Comment 9 Dominik Bartkiewicz 2020-01-16 14:33:58 MST
Hi

Did you have a chance to test this patch?

Dominik
Comment 10 hpc-ops 2020-01-17 02:40:52 MST
Dear Dominik,

Thanks for the patch, it solved the memory leak.

Balazs
Comment 12 Dominik Bartkiewicz 2020-01-21 01:52:58 MST
Hi

The commits below fix a leak in _core_bitmap_to_array() and three smaller leaks in reservation-related code.
All of these fixes will be included in 19.05.6 and later.

https://github.com/SchedMD/slurm/commit/ffb20605
https://github.com/SchedMD/slurm/commit/0713c41a
https://github.com/SchedMD/slurm/commit/164bafcc

I'll go ahead and close this out.

Dominik