Ticket 5045 - suspected slurmd memory leak
Summary: suspected slurmd memory leak
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.11.4
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-04-09 22:44 MDT by Ben Matthews
Modified: 2018-04-16 12:01 MDT

See Also:
Site: UCAR


Attachments
slurm.conf from offending compute node (12.92 KB, text/plain)
2018-04-09 23:02 MDT, Ben Matthews

Description Ben Matthews 2018-04-09 22:44:56 MDT
So, in attempting to verify that my new spank plugin (see Bug 5007) is not leaking memory I've noticed that even without the plugin loaded, slurmd seems to leak memory.

So, baseline after running a while, but only running a few jobs:

-bash-4.2# ps -u root v | grep -E 'PID|slurmd' | grep -v grep
  PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
 5429 ?        Ss     0:01      4   197 390770 6252  0.0 /usr/local/sbin/slurmd -f /etc/slurm/slurm.conf -D

After running 100 jobs to completion and waiting a couple minutes:

-bash-4.2# ps -u root v | grep -E 'PID|slurmd' | grep -v grep
  PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
 5429 ?        Ss     0:02      4   197 1988306 13500  0.0 /usr/local/sbin/slurmd -f /etc/slurm/slurm.conf -D

and another 100 jobs:

-bash-4.2# ps -u root v | grep -E 'PID|slurmd' | grep -v grep
  PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
 5429 ?        Ss     0:03      4   197 1988306 15276  0.0 /usr/local/sbin/slurmd -f /etc/slurm/slurm.conf -D


This is reproducible on a couple different test nodes running slightly different stacks (CentOS7, SLES12, etc). I assume it's also happening in production, but we have quite a bit of memory on our nodes, so we hadn't noticed yet. 
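To track this kind of RSS growth more systematically than repeated ps calls, a minimal sketch (Linux-only, reading VmRSS from /proc; the script name, defaults, and argument handling are illustrative, not from this ticket):

```shell
#!/bin/sh
# rss_watch.sh - periodically sample a process's resident set size.
# Usage: rss_watch.sh <pid> [samples] [interval_seconds]

rss_kb() {
    # VmRSS of process $1, in kB, taken from /proc/<pid>/status
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

pid=${1:-$$}        # default to this shell so the sketch runs standalone
samples=${2:-3}
interval=${3:-1}

for i in $(seq "$samples"); do
    printf '%s %s kB\n' "$(date '+%F %T')" "$(rss_kb "$pid")"
    sleep "$interval"
done
```

Run against slurmd, e.g. rss_watch.sh "$(pgrep -x slurmd)" 60 60, and compare samples before and after a batch of jobs.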

My test job is pretty trivial:

#-----
#!/bin/bash

#SBATCH --account=sssg0001
#SBATCH -p dav
#SBATCH -t 1:00:00
#SBATCH -N1
#SBATCH -n1
#SBATCH --reservation=root_18
#SBATCH -w caldera02

hostname
#-----

(and the reservation/-w flags don't seem to be important, but they are useful for getting me the test node in the production cluster). 


This is 17.11.4 + the recent database CVE patch. 

Is there some state that slurmd should be keeping a while after jobs finish? If not, what diagnostics can I provide?
Comment 1 Ben Matthews 2018-04-09 23:02:52 MDT
Created attachment 6572
slurm.conf from offending compute node
Comment 2 Tim Wickberg 2018-04-09 23:27:16 MDT
> Is there some state that slurmd should be keeping a while after jobs finish?

Yes, slurmd does hang on to some stuff post-job completion.

There are a handful of different cleanup tasks that will eventually flush this out. If you do notice *actual* leaks - not just a small gain in RSS - please let me know. If the slurmd cleans this up properly at shutdown, _it is not a leak_, and I'd request you stop throwing that term around loosely.

Things that will influence this:

- User / group id caching. Enabling send_gids will mitigate this to a certain extent.

- MUNGE credential anti-replay caching. Lowering cred_expire (default 120 seconds) can clean this up faster, although the exact timing on that is not guaranteed - a lot of these cleanup tasks are run lazily.

- sbcast transfer state in progress.
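For reference, the first two tunables above live in slurm.conf; a hypothetical fragment (option spellings as I understand them for this release - check the slurm.conf man page - and the cred_expire value is an example, not a recommendation):

```conf
# Resolve the user's group list at submission and ship it with the
# launch request, rather than resolving and caching it on each node.
LaunchParameters=send_gids

# Shorten the MUNGE credential lifetime so anti-replay state is
# eligible for cleanup sooner (default 120 seconds).
AuthInfo=cred_expire=60
```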

> If not, what diagnostics can I provide?

If you'd like to run under valgrind - which would demonstrate if there are actual leaks (which I will admit may well exist) - you should use --enable-memory-leak-debug at configure time. Otherwise slurmd will intentionally skip a lot of cleanup tasks which will throw a ton of false positive warnings, as freeing memory before process termination is a waste of time in production.
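The workflow described above would look roughly like this; the configure flag is from this ticket, while the prefix, parallelism, and valgrind options are illustrative:

```shell
# Rebuild with exit-time cleanup compiled in, so valgrind's leak
# report is meaningful (otherwise slurmd skips frees at shutdown).
./configure --enable-memory-leak-debug --prefix=/usr/local
make -j4 && make install

# Run slurmd in the foreground under valgrind; inspect the log
# after a clean shutdown of the daemon.
valgrind --leak-check=full --show-leak-kinds=definite \
    --log-file=/tmp/slurmd-valgrind.log \
    /usr/local/sbin/slurmd -D -f /etc/slurm/slurm.conf
```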
Comment 3 Ben Matthews 2018-04-10 00:04:02 MDT
 
> - User / group id caching. Enabling send_gids will mitigate this to a
> certain extent.
> 
All jobs were run from a single user/group. I assume this wouldn't cause the group caching to expand? 

Presumably, this also partially mitigates that problem:
-bash-4.2# grep CacheGroups /etc/slurm/slurm.conf
CacheGroups=0


> - MUNGE credential anti-replay caching. Lowering cred_expire (default 120
> seconds) can clean this up faster, although the exact timing on that is not
> guaranteed - a lot of these cleanup tasks are run lazily.
> 

After right around an hour, the memory usage is exactly the same and I don't think we've changed that parameter. Probably not this. 

-bash-4.2# date && ps -u root v | grep -E 'PID|slurmd' | grep -v grep
Mon Apr  9 23:39:54 MDT 2018
  PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
 5429 ?        Ss     0:03      4   197 1988306 15276  0.0 /usr/local/sbin/slurmd -f /etc/slurm/slurm.conf -D

and after a couple hundred more jobs:

-bash-4.2# date && ps -u root v | grep -E 'PID|slurmd' | grep -v grep
Mon Apr  9 23:42:47 MDT 2018
  PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
 5429 ?        Ss     0:04      4   197 2054870 17092  0.0 /usr/local/sbin/slurmd -f /etc/slurm/slurm.conf -D



> - sbcast transfer state in progress.
> 
Not using sbcast as far as I know

> > If not, what diagnostics can I provide?
> 
> If you'd like to run under valgrind - which would demonstrate if there are
> actual leaks (which I will admit may well exist) - you should use
> --enable-memory-leak-debug at configure time. Otherwise slurmd will
> intentionally skip a lot of cleanup tasks which will throw a ton of false
> positive warnings, as freeing memory before process termination is a waste
> of time in production.

I'll give that a try.

Interestingly, I don't see this on the RHEL6.4 machines - just the new RHEL7 test system that I'm playing on, so maybe it's something more subtle than a real leak, or maybe we have something in this particular build. Hopefully valgrind will say something useful. Probably should have tried that first, but I was hoping this was something obvious that you were aware of. Thanks for the info
Comment 4 Ben Matthews 2018-04-10 00:05:50 MDT
and I absolutely acknowledge that this is quibbling over tiny amounts of memory. Better to catch it early :-)
Comment 5 Tim Wickberg 2018-04-16 12:01:36 MDT
Marking resolved/infogiven for now. Please reopen if you do find a leak.

As I'd mentioned out of band before, setting MALLOC_MMAP_THRESHOLD_=131072 may help avoid mmap fragmentation with recent glibc versions, which I think you've been misinterpreting as a leak in slurmd.
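One way to apply that, assuming slurmd runs under systemd (the drop-in path is illustrative; the threshold value is the one suggested above):

```conf
# /etc/systemd/system/slurmd.service.d/malloc.conf
# Cap glibc's dynamic mmap threshold so large short-lived allocations
# are served by mmap and returned to the kernel on free.
[Service]
Environment=MALLOC_MMAP_THRESHOLD_=131072
```

After adding the drop-in, run systemctl daemon-reload and restart slurmd for it to take effect.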