| Summary: | memory leak in slurmctld | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | hpc-ops |
| Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | Douglas Wightman <wightman> |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | bart, csc-slurm-tickets |
| Version: | 19.05.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: |
https://bugs.schedmd.com/show_bug.cgi?id=8597 https://bugs.schedmd.com/show_bug.cgi?id=8899 |
||
| Site: | Ghent | Version Fixed: | 19.05.6 |
| Attachments: | Valgrind output, slurm config, slurm log, patch proposal |
||
Created attachment 12742 [details]
slurm config
Created attachment 12743 [details]
slurm log
Hi,
I think I have found the cause of this leak. I will provide a fix/patch soon. Could you send me the output from "scontrol show res"?
Dominik

Dear Dominik,
scontrol show res
ReservationName=bbres StartTime=2019-12-13T16:01:14 EndTime=2020-12-12T16:01:14 Duration=365-00:00:00
Nodes=node3307.joltik.os NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=joltik Flags=
NodeName=node3307.joltik.os CoreIDs=8-15
TRES=cpu=8
Users=vsc43020 Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
ReservationName=tier2q1maintenance StartTime=2020-01-31T08:00:00 EndTime=2021-01-30T08:00:00 Duration=365-00:00:00
Nodes=node3300.joltik.os,node3301.joltik.os,node3302.joltik.os,node3303.joltik.os,node3304.joltik.os,node3305.joltik.os,node3306.joltik.os,node3307.joltik.os,node3308.joltik.os,node3309.joltik.os NodeCnt=10 CoreCnt=320 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES,ALL_NODES
TRES=cpu=320
Users=(null) Accounts=gvo00002 Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
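
For anyone who wants to check their own cluster for reservations that pin specific cores, as bbres does above with CoreIDs=8-15, the output can be scanned with a short script. This is only a sketch: `parse_reservations` is a hypothetical helper, not part of Slurm, and it assumes the whitespace-separated key=value layout shown in the paste above. If a reservation spans several nodes with their own CoreIDs lines, this simplified version keeps only the last one.

```python
def parse_reservations(text):
    """Parse `scontrol show res` text into one dict per reservation."""
    reservations = []
    for token in text.split():
        # Split on the first "=" only, so "TRES=cpu=8" keeps "cpu=8" as its value.
        key, sep, value = token.partition("=")
        if not sep:
            continue  # skip anything that is not a key=value token
        if key == "ReservationName":
            reservations.append({})  # a new record starts here
        if reservations:
            reservations[-1][key] = value
    return reservations


# Sample copied from the output posted in this ticket.
sample = """\
ReservationName=bbres StartTime=2019-12-13T16:01:14 EndTime=2020-12-12T16:01:14 Duration=365-00:00:00
Nodes=node3307.joltik.os NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=joltik Flags=
NodeName=node3307.joltik.os CoreIDs=8-15
TRES=cpu=8
Users=vsc43020 Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
"""

# List reservations that are restricted to specific cores.
for res in parse_reservations(sample):
    if "CoreIDs" in res:
        print(res["ReservationName"], res["NodeName"], res["CoreIDs"])
```
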
Created attachment 12744 [details]
patch proposal
This patch should fix the memory leak.
Sorry that this took so long, but I ran additional tests to reduce the risk of introducing new issues.
Hi,
Did you have a chance to test this patch?
Dominik

Dear Dominik,
Thanks for the patch, it solved the memory leak.
Balazs

Hi,
These commits fix the leak in _core_bitmap_to_array() and three smaller leaks in reservation-related code. All of these fixes will be included in 19.05.6 and above.
https://github.com/SchedMD/slurm/commit/ffb20605
https://github.com/SchedMD/slurm/commit/0713c41a
https://github.com/SchedMD/slurm/commit/164bafcc
I'll go ahead and close this out.
Dominik |
Created attachment 12741 [details]
Valgrind output
In a small cluster (10 nodes), the memory leak can easily reach 1 GB per day.
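
A growth figure like the one above can be estimated from periodic readings of the slurmctld resident set size. The sketch below fits a least-squares line through (time, RSS) samples and reports the slope per day. `growth_per_day` is a hypothetical helper, and the readings are made-up illustration values, not measurements from this ticket; on Linux the real samples could be taken from the VmRSS field of /proc/<pid>/status.

```python
def growth_per_day(samples):
    """Least-squares slope of RSS over time, in bytes per day.

    samples: list of (unix_time_seconds, rss_bytes) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    # Slope is bytes per second; scale to bytes per day.
    return cov / var_t * 86400


# Hypothetical readings taken 6 hours apart, growing by 250 MiB each time.
MIB = 1024 * 1024
readings = [(i * 6 * 3600, (500 + 250 * i) * MIB) for i in range(5)]
print(f"{growth_per_day(readings) / MIB:.0f} MiB/day")
```

A fitted slope is less sensitive to a single noisy reading than the difference between the first and last sample, which helps when the daemon's memory use fluctuates with load.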