Ticket 4635

Summary: Scheduler doesn't schedule jobs because of "wrong" memory measurement.
Product: Slurm Reporter: Jan van Haarst <jan>
Component: Scheduling    Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID QA Contact:
Severity: 6 - No support contract    
Priority: --- CC: alexander, satya.devel
Version: 17.11.6   
Hardware: Linux   
OS: Linux   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: Patch for bug 4635

Description Jan van Haarst 2018-01-17 08:02:26 MST

    
Comment 1 Jan van Haarst 2018-01-17 08:11:17 MST
Today I ran into a problem where jobs didn't get scheduled on empty nodes with an empty queue, apparently because Slurm thinks the nodes can't handle the requested memory.

Example:

haars001@nfs01:~$ scontrol show nodes  | grep -e Addr  -e CPU -e Mem | paste - - - | column -t | grep CPUAlloc=0 | sort -k11 | head -1
CPUAlloc=0   CPUErr=0  CPUTot=16  CPULoad=0.01   NodeAddr=node032  NodeHostName=node032  Version=16.05  OS=Linux  RealMemory=64337    AllocMem=0       FreeMem=26528   Sockets=2  Boards=1
haars001@nfs01:~$ ssh node032 free -m
              total        used        free      shared  buff/cache   available
Mem:          64337        1705       26522         128       36109       59353
Swap:         65535           0       65535

So Slurm thinks there is only ~26 GB free (FreeMem=26528), while the OS says about 58 GB is actually available.

As far as I could find, Slurm gets that information here:
https://github.com/SchedMD/slurm/blob/e4eaf07f3f5b5a1e00bf6c397e2bb8f456e79062/src/slurmd/slurmd/get_mach_stat.c#L287

*free_mem = (((uint64_t )info.freeram)*info.mem_unit)/(1024*1024);

So it uses the Linux sysinfo() call to obtain the amount of free RAM.

free, on the other hand, calculates the available memory like this:
https://gitlab.com/procps-ng/procps/blob/master/free.c#L368

Which leads to:
https://gitlab.com/procps-ng/procps/blob/master/proc/sysinfo.c#L794 

  /* zero? might need fallback for 2.6.27 <= kernel <? 3.14 */
  if (!kb_main_available) {
#ifdef __linux__
    if (linux_version_code < LINUX_VERSION(2, 6, 27))
      kb_main_available = kb_main_free;
    else {
      FILE_TO_BUF(VM_MIN_FREE_FILE, vm_min_free_fd);
      kb_min_free = (unsigned long) strtoull(buf,&tail,10);

      watermark_low = kb_min_free * 5 / 4; /* should be equal to sum of all 'low' fields in /proc/zoneinfo */

      mem_available = (signed long)kb_main_free - watermark_low
      + kb_inactive_file + kb_active_file - MIN((kb_inactive_file + kb_active_file) / 2, watermark_low)
      + kb_slab_reclaimable - MIN(kb_slab_reclaimable / 2, watermark_low);

      if (mem_available < 0) mem_available = 0;
      kb_main_available = (unsigned long)mem_available;
    }
#else
    kb_main_available = kb_main_free;
#endif /* linux */
  }


So a quick solution might be for Slurm to include buffered RAM in the memory calculation (if that makes sense), as it is readily available from sysinfo:
https://github.com/torvalds/linux/blob/master/include/uapi/linux/sysinfo.h

More info on free versus available memory can be found here: https://www.linuxatemyram.com/

With kind regards,
Jan van Haarst
Comment 2 Satya Mishra 2018-05-21 12:42:50 MDT
Here's the output from free on a node that's drained because of "Low RealMemory":
             total       used       free     shared    buffers     cached
Mem:     264436104  227618396   36817708     130940     530672  180670364
-/+ buffers/cache:   46417360  218018744
Swap:      4194300     328932    3865368

There's 36 GB of completely free memory, and 218 GB effectively free once cached data is dropped, but Slurm thinks the node is low on memory. Please fix this, or let me bypass the check.
Comment 3 Alexander Åhman 2018-10-08 02:57:57 MDT
Any updates on this problem?
I have verified this behaviour on Slurm 17.11.6, and since the relevant code is unchanged in Git master I assume the behaviour is the same in the latest version.

This is starting to become a big problem for us: many nodes are left unused because Slurm thinks they have no free memory when they in fact have 100+ GB free. I will patch this myself for now, but how is it supposed to work?

It would be good if someone from SchedMD could at least confirm or close this bug.
Comment 4 Jacob Jenson 2018-10-08 09:04:46 MDT
Jan,

We need to have this issue covered by a support contract before we can assign a support engineer. If you or someone from your site would like to discuss support options please email me directly at jacob@schedmd.com

Jacob
Comment 5 Alexander Åhman 2018-10-08 13:56:06 MDT
(In reply to Jacob Jenson from comment #4)
> Jan,
> 
> We need to have this issue covered by a support contract before we can
> assign a support engineer. If you or someone from your site would like to
> discuss support options please email me directly at jacob@schedmd.com

Understood, I will contact you to discuss a support contract. But this is a bug that should have been fixed years ago. When Slurm can't measure the available memory on a node, that node effectively becomes unusable, which is very bad. Why have a workload manager that can't schedule jobs correctly?

I'm working on a patch for this bug for use in our own cluster. I don't know how open you are to contributions, but I'll share my code if you want.
Comment 6 Satya Mishra 2018-10-08 14:06:07 MDT
I'm not sure this is a real bug. In my case it was a misconfigured RealMemory in slurm.conf.

"slurmd -C" gives the right RealMemory to use for slurm.conf
Comment 7 Alexander Åhman 2018-10-10 03:27:58 MDT
(In reply to Satya Mishra from comment #6)
> I'm not sure this is a real bug. In my case it was misconfigured RealMemory
> in slurm.conf
> 
> "slurmd -C" gives the right RealMemory to use for slurm.conf

Yes, it's a bug alright.
You are correct that "slurmd -C" correctly detects the amount of installed memory, but that's not the problem here. The problem is that slurmd does not correctly measure the amount of free memory at runtime. So if the Linux kernel decides to use 50 GB as disk cache, Slurm will think those 50 GB of RAM are permanently gone when in fact they are not.
Comment 8 Alexander Åhman 2018-10-10 03:51:08 MDT
Created attachment 7986 [details]
Patch for bug 4635
Comment 9 Alexander Åhman 2018-10-10 03:51:44 MDT
I have patched slurmd and the modified version is deployed in our cluster. So far everything is working as it should.

This patch only works if the line "MemAvailable" is present in /proc/meminfo. That should be the case on all modern Linux kernels; on an older kernel it falls back to the standard Slurm behaviour and returns only free memory.

This patch is verified to work on Slurm 17.11.6, and since the function "get_free_mem" is unchanged in Git master it should work on newer versions of Slurm too.

Here is a link to the commit that added "MemAvailable" to the kernel (in 2014), with some interesting information about how to calculate free memory:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34e431b0ae398fc54ea69ff85ec700722c9da773

Enjoy!
Comment 10 Tim Wickberg 2018-10-10 09:51:40 MDT
(In reply to Alexander Åhman from comment #9)
> I have patched slurmd and the modified version is deployed in our cluster.
> So far everything is working as it should.
> 
> This patch will only work if the line "MemAvailable" is present in
> /proc/meminfo. It should be the case in all modern Linux kernels but if you
> happen to run an older version it will fall back and use the standard Slurm
> way and only return free memory.
> 
> This patch is verified to work on Slurm 17.11.6 but since the function
> "get_free_mem" is the same in Git master this patch will work for the new
> version of Slurm too.

Thank you for the patch submission, but this patch is incorrect, and will not be accepted.

The FreeMem value will not influence whether the node is permitted to register or not, and you're modifying it to no longer print out the correct informational value.

- Tim
Comment 11 Jan van Haarst 2018-10-18 04:12:00 MDT
I'm a bit surprised by the comment "The FreeMem value will not influence whether the node is permitted to register or not".
If that is the case, then how _does_ the scheduler decide whether a job can run on a node, given the amount of RAM the job requests?

What I see is that the FreeMem value is way off on all nodes in our cluster.
Most of the time it is too low, but sometimes it is too high compared to MemAvailable.

What we see in the field is that jobs do not run, because the scheduler seems to think there isn't enough RAM available for the job.