| Summary: | suspected slurmd memory leak | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ben Matthews <matthews> |
| Component: | slurmd | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 17.11.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | UCAR | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf from offending compute node | | |
Description

Ben Matthews 2018-04-09 22:44:56 MDT

Created attachment 6572 [details]
slurm.conf from offending compute node
---

Tim Wickberg

> Is there some state that slurmd should be keeping a while after jobs finish?

Yes, slurmd does hang on to some state post-job completion. There are a handful of different cleanup tasks that will eventually flush this out. If you do notice *actual* leaks - not just a small gain in RSS - please let me know. If slurmd cleans this up properly at shutdown, _it is not a leak_, and I'd request you stop throwing that term around loosely.

Things that will influence this:

- User/group id caching. Enabling send_gids will mitigate this to a certain extent.
- MUNGE credential anti-replay caching. Lowering cred_expire (default 120 seconds) can clean this up faster, although the exact timing on that is not guaranteed - a lot of these cleanup tasks are run lazily.
- sbcast transfer state in progress.

> If not, what diagnostics can I provide?

If you'd like to run under valgrind - which would demonstrate if there are actual leaks (which I will admit may well exist) - you should use --enable-memory-leak-debug at configure time. Otherwise slurmd will intentionally skip a lot of cleanup tasks, which will throw a ton of false-positive warnings, as freeing memory before process termination is a waste of time in production.

---

Ben Matthews

> - User / group id caching. Enabling send_gids will mitigate this to a
> certain extent.

All jobs were run from a single user/group. I assume this wouldn't cause the group caching to expand? Presumably, this also partially mitigates that problem:

```
-bash-4.2# grep CacheGroups /etc/slurm/slurm.conf
CacheGroups=0
```

> - MUNGE credential anti-replay caching. Lowering cred_expire (default 120
> seconds) can clean this up faster, although the exact timing on that is not
> guaranteed - a lot of these cleanup tasks are run lazily.

After right around an hour, the memory usage is exactly the same, and I don't think we've changed that parameter. Probably not this.
```
-bash-4.2# date && ps -u root v | grep -E 'PID|slurmd' | grep -v grep
Mon Apr 9 23:39:54 MDT 2018
  PID TTY      STAT   TIME MAJFL   TRS     DRS   RSS %MEM COMMAND
 5429 ?        Ss     0:03     4   197 1988306 15276  0.0 /usr/local/sbin/slurmd -f /etc/slurm/slurm.conf -D
```

and after a couple hundred more jobs:

```
-bash-4.2# date && ps -u root v | grep -E 'PID|slurmd' | grep -v grep
Mon Apr 9 23:42:47 MDT 2018
  PID TTY      STAT   TIME MAJFL   TRS     DRS   RSS %MEM COMMAND
 5429 ?        Ss     0:04     4   197 2054870 17092  0.0 /usr/local/sbin/slurmd -f /etc/slurm/slurm.conf -D
```

> - sbcast transfer state in progress.

Not using sbcast as far as I know.

> > If not, what diagnostics can I provide?
>
> If you'd like to run under valgrind - which would demonstrate if there are
> actual leaks (which I will admit may well exist) - you should use
> --enable-memory-leak-debug at configure time. Otherwise slurmd will
> intentionally skip a lot of cleanup tasks which will throw a ton of false
> positive warnings, as freeing memory before process termination is a waste
> of time in production.

I'll give that a try. Interestingly, I don't see this on the RHEL 6.4 machines - just the new RHEL 7 test system that I'm playing on - so maybe it's something more subtle than a real leak, or maybe we have something in this particular build. Hopefully valgrind will say something useful. I probably should have tried that first, but I was hoping this was something obvious that you were aware of. Thanks for the info, and I absolutely acknowledge that this is quibbling over tiny amounts of memory. Better to catch it early :-)

---

Tim Wickberg

Marking resolved/infogiven for now. Please reopen if you do find a leak.

As I'd mentioned out of band before, setting MALLOC_MMAP_THRESHOLD_=131072 may help avoid mmap fragmentation with recent glibc versions, which I think you've been misinterpreting as a leak in slurmd.
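For reference, the two mitigations discussed in the thread (send_gids and credential expiry) are slurm.conf settings. A minimal sketch, assuming 17.11-era option names - the cred_expire value here is illustrative, so verify both against the slurm.conf(5) man page for your version:

```
# Sketch only - check option names/defaults for your Slurm release.
LaunchParameters=send_gids    # slurmctld sends gids, easing slurmd's uid/gid cache pressure
AuthInfo=cred_expire=60       # shorten MUNGE credential anti-replay lifetime (default 120 s)
```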
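MALLOC_MMAP_THRESHOLD_ is a glibc allocator tunable read from the environment, so it has to be set in slurmd's environment before startup. One hypothetical way to do that on a systemd-managed node - the drop-in path is illustrative, not from the thread:

```
# /etc/systemd/system/slurmd.service.d/malloc.conf  (hypothetical drop-in path)
[Service]
Environment=MALLOC_MMAP_THRESHOLD_=131072
```

After `systemctl daemon-reload` and a slurmd restart, the threshold applies to all allocations in the daemon.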
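As an aside, the RSS growth between the two `ps v` samples in the thread is easy to pull out with awk. A small sketch (the `rss_kb` helper name is ours; field 8 is RSS per the header `PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND`):

```shell
#!/bin/sh
# Extract the RSS column (KiB) from a "ps v"-style line; RSS is field 8
# in the header "PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND".
rss_kb() { printf '%s\n' "$1" | awk '{print $8}'; }

# The two slurmd samples quoted above (command truncated for brevity).
before='5429 ? Ss 0:03 4 197 1988306 15276 0.0 /usr/local/sbin/slurmd -D'
after='5429 ? Ss 0:04 4 197 2054870 17092 0.0 /usr/local/sbin/slurmd -D'

echo "RSS delta: $(( $(rss_kb "$after") - $(rss_kb "$before") )) KiB"
```

That yields an 1816 KiB gain over a couple hundred jobs - the "small gain in RSS" scale Tim distinguishes from an actual leak.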