Today I had the problem that jobs didn't get scheduled on empty nodes with an empty queue, apparently because Slurm thinks the node can't handle the requested memory.

Example:

    haars001@nfs01:~$ scontrol show nodes | grep -e Addr -e CPU -e Mem | paste - - - | column -t | grep CPUAlloc=0 | sort -k11 | head -1
    CPUAlloc=0  CPUErr=0  CPUTot=16  CPULoad=0.01  NodeAddr=node032  NodeHostName=node032  Version=16.05  OS=Linux  RealMemory=64337  AllocMem=0  FreeMem=26528  Sockets=2  Boards=1

    haars001@nfs01:~$ ssh node032 free -m
                  total        used        free      shared  buff/cache   available
    Mem:          64337        1705       26522         128       36109       59353
    Swap:         65535           0       65535

So Slurm thinks that there is ~25G available, whereas the OS says that there is about 57G available.

As far as I could find, Slurm gets that info like this:
https://github.com/SchedMD/slurm/blob/e4eaf07f3f5b5a1e00bf6c397e2bb8f456e79062/src/slurmd/slurmd/get_mach_stat.c#L287

    *free_mem = (((uint64_t )info.freeram)*info.mem_unit)/(1024*1024);

So it uses the Linux sysinfo call to request the amount of free RAM.

free, on the other hand, calculates the available memory like this:
https://gitlab.com/procps-ng/procps/blob/master/free.c#L368

which leads to:
https://gitlab.com/procps-ng/procps/blob/master/proc/sysinfo.c#L794

    /* zero? might need fallback for 2.6.27 <= kernel <? 3.14 */
    if (!kb_main_available) {
    #ifdef __linux__
        if (linux_version_code < LINUX_VERSION(2, 6, 27))
            kb_main_available = kb_main_free;
        else {
            FILE_TO_BUF(VM_MIN_FREE_FILE, vm_min_free_fd);
            kb_min_free = (unsigned long) strtoull(buf, &tail, 10);
            /* should be equal to sum of all 'low' fields in /proc/zoneinfo */
            watermark_low = kb_min_free * 5 / 4;
            mem_available = (signed long)kb_main_free - watermark_low
                + kb_inactive_file + kb_active_file
                - MIN((kb_inactive_file + kb_active_file) / 2, watermark_low)
                + kb_slab_reclaimable
                - MIN(kb_slab_reclaimable / 2, watermark_low);
            if (mem_available < 0)
                mem_available = 0;
            kb_main_available = (unsigned long)mem_available;
        }
    #else
        kb_main_available = kb_main_free;
    #endif /* linux */
    }

So a quick solution might be that Slurm includes the buffer RAM in the memory calculation (if that makes sense), as that is easily available:
https://github.com/torvalds/linux/blob/master/include/uapi/linux/sysinfo.h

More info on free versus available can be found here:
https://www.linuxatemyram.com/

With kind regards,
Jan van Haarst
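The discrepancy is easy to reproduce outside of Slurm. The sketch below is my own illustration (the function names are mine, not Slurm's or procps'): it compares, on the same machine, sysinfo()'s freeram — which is what Slurm reports as FreeMem — against the kernel's MemAvailable estimate from /proc/meminfo. On a node with a warm page cache, such as node032 above, the second value is typically far larger:

```c
#include <stdio.h>
#include <stdint.h>
#include <sys/sysinfo.h>

/* Free RAM the way Slurm's get_mach_stat.c computes it: sysinfo()
 * freeram, which excludes page cache and reclaimable slab.
 * Returns MiB, or 0 on error. */
uint64_t free_mem_sysinfo_mib(void)
{
	struct sysinfo info;

	if (sysinfo(&info) != 0)
		return 0;
	return ((uint64_t)info.freeram * info.mem_unit) / (1024 * 1024);
}

/* "Available" memory as the kernel itself estimates it: the
 * MemAvailable field of /proc/meminfo (present since Linux 3.14).
 * Returns MiB, or 0 if the field is missing. */
uint64_t avail_mem_meminfo_mib(void)
{
	FILE *fp = fopen("/proc/meminfo", "r");
	char line[256];
	unsigned long kib = 0;

	if (!fp)
		return 0;
	while (fgets(line, sizeof(line), fp)) {
		if (sscanf(line, "MemAvailable: %lu kB", &kib) == 1)
			break;
	}
	fclose(fp);
	return (uint64_t)kib / 1024;
}
```

On node032 above these two measurements would correspond roughly to the 26528 MiB FreeMem that Slurm sees versus the 59353 MiB that free reports as available.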
Here's output from free on a node that's drained because of "Low RealMemory":

                 total       used       free     shared    buffers     cached
    Mem:     264436104  227618396   36817708     130940     530672  180670364
    -/+ buffers/cache:  46417360  218018744
    Swap:      4194300     328932    3865368

There are 36GB of absolutely free memory and 218GB of really free memory, since cached data can be cleared, but Slurm thinks the node is low on memory. Please fix this, or let me bypass the check.
Any updates on this problem? I have verified this behaviour on Slurm version 17.11.6, and since there is no change to the relevant source code in Git master I assume the behaviour is unchanged in the latest version.

This is starting to become a big problem for us: many nodes are left unused because Slurm thinks they have no free memory available when they in fact have more than 100G free. I will patch this myself for now, but how is it supposed to work? It would be good if someone from SchedMD could at least confirm or close this bug.
Jan,

We need to have this issue covered by a support contract before we can assign a support engineer. If you or someone from your site would like to discuss support options, please email me directly at jacob@schedmd.com

Jacob
(In reply to Jacob Jenson from comment #4)
> Jan,
>
> We need to have this issue covered by a support contract before we can
> assign a support engineer. If you or someone from your site would like to
> discuss support options please email me directly at jacob@schedmd.com

Understood, I will contact you to discuss a support contract. But this is a bug that should have been fixed years ago. When Slurm can't measure the available memory of a node, that node can no longer be used --> very bad. Why have a workload manager that can't schedule jobs correctly?

I'm working on a patch for this bug for use on our own cluster. I don't know how open you are to contributions, but if you want I will share my code.
I'm not sure this is a real bug. In my case it was a misconfigured RealMemory in slurm.conf.

"slurmd -C" gives the right RealMemory value to use in slurm.conf.
(In reply to Satya Mishra from comment #6)
> I'm not sure this is a real bug. In my case it was misconfigured RealMemory
> in slurm.conf
>
> "slurmd -C" gives the right RealMemory to use for slurm.conf

Yes, it's a bug alright. You are correct that "slurmd -C" correctly detects the amount of installed memory, but that's not the problem here. The problem is that slurmd is not correctly measuring the amount of free memory at runtime. So, if the Linux kernel decides to use 50G as disk cache, Slurm will think that those 50G of RAM are permanently gone when they in fact are not.
Created attachment 7986 [details] Patch for bug 4635
I have patched slurmd and the modified version is deployed in our cluster. So far everything is working as it should.

This patch will only work if the line "MemAvailable" is present in /proc/meminfo. That should be the case on all modern Linux kernels, but if you happen to run an older version the patch falls back to the standard Slurm way and only returns free memory.

This patch is verified to work on Slurm 17.11.6, but since the function "get_free_mem" is unchanged in Git master the patch should work on newer versions of Slurm too.

Here is a link to the commit that added "MemAvailable" to the kernel (in 2014), with some interesting information about how to calculate free memory:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34e431b0ae398fc54ea69ff85ec700722c9da773

Enjoy!
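For readers without access to the attachment: based on the description above, the general shape of such a change would be to prefer MemAvailable when /proc/meminfo provides it and fall back to the old sysinfo() behaviour otherwise. This is my paraphrase of the approach, not the attached patch itself, and the function name is hypothetical:

```c
#include <stdio.h>
#include <stdint.h>
#include <sys/sysinfo.h>

/* Hypothetical replacement for the logic in Slurm's get_free_mem():
 * report MemAvailable (in MiB) when the kernel provides it, otherwise
 * fall back to plain free RAM, as the unpatched code computes it. */
uint64_t get_free_mem_mib(void)
{
	struct sysinfo info;
	FILE *fp = fopen("/proc/meminfo", "r");
	char line[256];
	unsigned long kib;

	if (fp) {
		while (fgets(line, sizeof(line), fp)) {
			if (sscanf(line, "MemAvailable: %lu kB", &kib) == 1) {
				fclose(fp);
				return (uint64_t)kib / 1024;
			}
		}
		fclose(fp);
	}

	/* Fallback for kernels older than 3.14, which have no
	 * MemAvailable line: free RAM via sysinfo(), as before. */
	if (sysinfo(&info) != 0)
		return 0;
	return ((uint64_t)info.freeram * info.mem_unit) / (1024 * 1024);
}
```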
(In reply to Alexander Åhman from comment #9)
> I have patched slurmd and the modified version is deployed in our cluster.
> So far everything is working as it should.
>
> This patch will only work if the line "MemAvailable" is present in
> /proc/meminfo. It should be the case in all modern Linux kernels but if you
> happen to run an older version it will fall back and use the standard Slurm
> way and only return free memory.
>
> This patch is verified to work on Slurm 17.11.6 but since the function
> "get_free_mem" is the same in Git master this patch will work for the new
> version of Slurm too.

Thank you for the patch submission, but this patch is incorrect and will not be accepted. The FreeMem value will not influence whether the node is permitted to register or not, and you're modifying it to no longer print out the correct informational value.

- Tim
I'm a bit surprised by the comment "The FreeMem value will not influence whether the node is permitted to register or not". If that is the case, then what _is_ the way the scheduler decides whether a job is allowed to run on a node, given the amount of RAM it requests?

What I see is that the FreeMem value is way off on all nodes of our cluster. Most of the time it is too low, but sometimes it is too high compared to MemAvailable. What we see in the field is that jobs do not get run, as the scheduler seems to think there isn't enough RAM available for the job.