So, in attempting to verify that my new spank plugin (see Bug 5007) is not leaking memory, I've noticed that even without the plugin loaded, slurmd seems to leak memory.

Baseline after running a while, but only running a few jobs:

-bash-4.2# ps -u root v | grep -E 'PID|slurmd' | grep -v grep
  PID TTY      STAT   TIME  MAJFL TRS    DRS   RSS %MEM COMMAND
 5429 ?        Ss     0:01      4 197  390770  6252  0.0 /usr/local/sbin/slurmd -f /etc/slurm/slurm.conf -D

After running 100 jobs to completion and waiting a couple of minutes:

-bash-4.2# ps -u root v | grep -E 'PID|slurmd' | grep -v grep
  PID TTY      STAT   TIME  MAJFL TRS    DRS   RSS %MEM COMMAND
 5429 ?        Ss     0:02      4 197 1988306 13500  0.0 /usr/local/sbin/slurmd -f /etc/slurm/slurm.conf -D

And another 100 jobs:

-bash-4.2# ps -u root v | grep -E 'PID|slurmd' | grep -v grep
  PID TTY      STAT   TIME  MAJFL TRS    DRS   RSS %MEM COMMAND
 5429 ?        Ss     0:03      4 197 1988306 15276  0.0 /usr/local/sbin/slurmd -f /etc/slurm/slurm.conf -D

This is reproducible on a couple of different test nodes running slightly different stacks (CentOS7, SLES12, etc.). I assume it's also happening in production, but we have quite a bit of memory on our nodes, so we hadn't noticed yet.

My test job is pretty trivial:

#-----
#!/bin/bash
#SBATCH --account=sssg0001
#SBATCH -p dav
#SBATCH -t 1:00:00
#SBATCH -N1
#SBATCH -n1
#SBATCH --reservation=root_18
#SBATCH -w caldera02
hostname
#-----

(The reservation and -w flags don't seem to be important, but they are useful for getting me the test node in the production cluster.)

This is 17.11.4 plus the recent database CVE patch.

Is there some state that slurmd should be keeping for a while after jobs finish? If not, what diagnostics can I provide?
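For anyone wanting to reproduce the measurements, the ad-hoc ps checks above can be wrapped in a small helper that logs timestamped RSS samples. This is just a sketch (the sample_rss name is mine, and it reads /proc directly, so it's Linux-only):

```shell
# sample_rss PID: print "epoch_seconds rss_kib" for the given process,
# reading VmRSS from /proc/PID/status (Linux-specific).
sample_rss() {
  local pid=$1
  local rss
  rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
  echo "$(date +%s) $rss"
}

# Example: append a sample every 60 seconds while jobs run, e.g.
#   while sleep 60; do sample_rss "$(pidof slurmd)" >> /var/tmp/slurmd-rss.log; done
```

Plotting or diffing that log makes it easier to correlate RSS growth with job completions than eyeballing ps output.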
Created attachment 6572 [details] slurm.conf from offending compute node
> Is there some state that slurmd should be keeping a while after jobs finish?

Yes, slurmd does hang on to some state after job completion. There are a handful of different cleanup tasks that will eventually flush this out. If you do notice *actual* leaks - not just a small gain in RSS - please let me know. If slurmd cleans this up properly at shutdown, _it is not a leak_, and I'd ask that you not throw that term around loosely.

Things that will influence this:

- User/group id caching. Enabling send_gids will mitigate this to a certain extent.
- MUNGE credential anti-replay caching. Lowering cred_expire (default 120 seconds) can clean this up faster, although the exact timing on that is not guaranteed - a lot of these cleanup tasks run lazily.
- sbcast transfer state in progress.

> If not, what diagnostics can I provide?

If you'd like to run under valgrind - which would show whether there are actual leaks (which I will admit may well exist) - you should use --enable-memory-leak-debug at configure time. Otherwise slurmd intentionally skips a lot of cleanup tasks, which will produce a ton of false-positive warnings, since freeing memory just before process termination is a waste of time in production.
> - User / group id caching. Enabling send_gids will mitigate this to a
>   certain extent.

All jobs were run from a single user/group, so I assume this wouldn't cause the group cache to expand. Presumably this also partially mitigates that problem:

-bash-4.2# grep CacheGroups /etc/slurm/slurm.conf
CacheGroups=0

> - MUNGE credential anti-replay caching. Lowering cred_expire (default 120
>   seconds) can clean this up faster, although the exact timing on that is
>   not guaranteed - a lot of these cleanup tasks are run lazily.

After right around an hour the memory usage is exactly the same, and I don't think we've changed that parameter, so probably not this:

-bash-4.2# date && ps -u root v | grep -E 'PID|slurmd' | grep -v grep
Mon Apr 9 23:39:54 MDT 2018
  PID TTY      STAT   TIME  MAJFL TRS    DRS   RSS %MEM COMMAND
 5429 ?        Ss     0:03      4 197 1988306 15276  0.0 /usr/local/sbin/slurmd -f /etc/slurm/slurm.conf -D

and after a couple hundred more jobs:

-bash-4.2# date && ps -u root v | grep -E 'PID|slurmd' | grep -v grep
Mon Apr 9 23:42:47 MDT 2018
  PID TTY      STAT   TIME  MAJFL TRS    DRS   RSS %MEM COMMAND
 5429 ?        Ss     0:04      4 197 2054870 17092  0.0 /usr/local/sbin/slurmd -f /etc/slurm/slurm.conf -D

> - sbcast transfer state in progress.

Not using sbcast as far as I know.

> > If not, what diagnostics can I provide?
>
> If you'd like to run under valgrind - which would demonstrate if there are
> actual leaks (which I will admit may well exist) - you should use
> --enable-memory-leak-debug at configure time.

I'll give that a try. Interestingly, I don't see this on the RHEL6.4 machines - just the new RHEL7 test system that I'm playing on - so maybe it's something more subtle than a real leak, or maybe it's something in this particular build. Hopefully valgrind will say something useful.
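For the record, the sort of invocation I'm planning for the valgrind run, against a build configured with --enable-memory-leak-debug. This is a sketch - the flags are standard memcheck options (--show-leak-kinds requires valgrind >= 3.9), and the paths just mirror the transcripts above:

```
valgrind --leak-check=full --show-leak-kinds=definite \
         --log-file=/var/tmp/slurmd-vg.%p.log \
         /usr/local/sbin/slurmd -D -f /etc/slurm/slurm.conf
```

Running slurmd in the foreground (-D) keeps valgrind attached to the daemon itself rather than a forked child.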
Probably should have tried that first, but I was hoping this was something obvious that you were aware of. Thanks for the info.
and I absolutely acknowledge that this is quibbling over tiny amounts of memory. Better to catch it early :-)
Marking resolved/infogiven for now. Please reopen if you do find a leak. As I'd mentioned out of band before, setting MALLOC_MMAP_THRESHOLD_=131072 may help avoid mmap fragmentation with recent glibc versions, which I think you've been misinterpreting as a leak in slurmd.
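For anyone hitting this later, one way to apply that tunable to a systemd-managed slurmd is a drop-in unit. This is a sketch - the drop-in path is an assumption; adjust it (or use a plain environment export) to match your init system:

```
# /etc/systemd/system/slurmd.service.d/malloc.conf
[Service]
Environment=MALLOC_MMAP_THRESHOLD_=131072
```

After `systemctl daemon-reload && systemctl restart slurmd`, the fixed mmap threshold disables glibc's dynamic threshold adjustment, which can reduce the fragmentation being mistaken for a leak. Note the trailing underscore in the variable name is intentional.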