We are running into the same issue as in bug 10255. It states that it is resolved with a patch in 20.02.7. Of course, we just installed 20.02.6 yesterday and had to take measures to remedy the issue in real time, so we were forced to roll the slurmd part back to 20.02.4. The slurmctld, slurmdbd and default are all pointing to 20.02.6. A few questions: 1. How critical is this bug? Is it just a cosmetic issue, or is there cause for concern in job behavior? 2. When will 20.02.7 be released so that we can install and test that version? 3. If it will be a while, can you provide a patch for this issue so that we can install and test it? Thanks, Tony.
Created attachment 16921 bug10336_20026_custom.patch (In reply to Anthony DelSorbo from comment #0) > We are running into the same issue as in bug 10255. It states that it is > resolved with a patch in 20.02.7 Of course, we just installed 20.02.6 > yesterday and had to take measures to remedy the issue in real time. So, we > were forced to back down the slurmd part to 20.02.4. The slurmctld, > slurmdbd and default are all pointing to 20.02.6. > > A few questions: > > 1. How critical is this bug? Is it just a cosmetic issue or is there cause > for concern in job behavior. It only marks the extern step as OUT_OF_MEMORY. This is reflected in the accounting, but there is no 'real' impact on the execution of the job itself. > 2. When will 20.02.7 be released so that we can install and test that version We have no official release date yet. > 3. If it will be a while, can you provide a patch for this issue so that we > can install and test it? Sure, I have attached the raw patch for you to test. The commit applied to the latest version works for your version too, so the patch is the same. https://github.com/SchedMD/slurm/commit/272c636d507e1dc59d987da478d42f6713d88ae1 Off-topic: Which kernel version are your nodes on? I am experiencing some oddities with recent kernels and cgroups v1.
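For anyone who wants to see how many jobs were affected in accounting, here is a minimal sketch that looks for extern steps recorded as OUT_OF_MEMORY. It assumes `sacct` is run with `--parsable2 --noheader --format=JobID,State`, so each line looks like `JobID|State`. The function names `extern_oom_steps` and `query_sacct` are hypothetical helpers, not part of Slurm.

```python
import subprocess
from typing import List


def extern_oom_steps(sacct_text: str) -> List[str]:
    """Return the step IDs of extern steps whose state is OUT_OF_MEMORY.

    Expects lines of the form "JobID|State", as produced by
    sacct --parsable2 --noheader --format=JobID,State.
    """
    hits = []
    for line in sacct_text.splitlines():
        if not line.strip():
            continue
        step_id, _, state = line.partition("|")
        if step_id.endswith(".extern") and state.strip() == "OUT_OF_MEMORY":
            hits.append(step_id)
    return hits


def query_sacct(job_id: str) -> str:
    """Run sacct for one job (requires a working Slurm accounting setup)."""
    result = subprocess.run(
        ["sacct", "-j", job_id, "--noheader", "--parsable2",
         "--format=JobID,State"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

On a cluster, `extern_oom_steps(query_sacct("<jobid>"))` would list the affected extern steps for that job. A job whose other steps completed normally while only the extern step shows OUT_OF_MEMORY matches the accounting-only symptom described above.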
(In reply to Felip Moll from comment #1) Thanks for your replies, Felip. > Off-topic: Which kernel version are your nodes on? I am experiencing some > oddities in recent kernels and cgroups v1. Kernel: 3.10.0-1127.19.1.el7.x86_64 Would you elaborate on the oddities you're experiencing? Would I see the same here? Or is it because you may be in a development environment? Best, Tony.
> Kernel: 3.10.0-1127.19.1.el7.x86_64 I am not seeing these issues on this kernel version. > Would you elaborate on the oddities you're experiencing? Would I see the > same here? Or is it because you may be in a development environment? You should be safe. Briefly, we register as listeners for events on the cgroup hierarchy to count the number of OOMs every step receives during its execution. There is one event listener per step, and in recent kernels (5.x) I see that an event generated on one step is broadcast to other steps, which means that if one step gets OOMed, the others detect it and mark themselves as OOMed too. This doesn't happen on 3.x versions. The investigation is still in progress, but it is annoying at the moment... cgroups v1. Let me know if the patch fixes the issue for you.
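As an illustration of the kind of listener described above, here is a minimal sketch of the cgroup v1 memory controller's OOM notification interface: create an eventfd, open `memory.oom_control`, and write `"<event_fd> <control_fd>"` to `cgroup.event_control`. This is not Slurm's code. The function names are hypothetical, the cgroup directory is supplied by the caller, and `os.eventfd` assumes Linux with Python 3.10 or later.

```python
import os
import struct
from typing import Tuple


def registration_line(event_fd: int, control_fd: int) -> str:
    """Build the string cgroup v1 expects in cgroup.event_control."""
    return f"{event_fd} {control_fd}"


def decode_event_count(buf: bytes) -> int:
    """Decode the 8-byte native-endian counter read from an eventfd."""
    (count,) = struct.unpack("=Q", buf)
    return count


def register_oom_listener(cgroup_dir: str) -> Tuple[int, int]:
    """Register for OOM events on one cgroup v1 memory cgroup directory.

    Returns (event_fd, control_fd); the caller is responsible for
    closing both. Needs write permission on cgroup.event_control.
    """
    event_fd = os.eventfd(0)  # Linux only, Python 3.10+
    control_fd = os.open(
        os.path.join(cgroup_dir, "memory.oom_control"), os.O_RDONLY
    )
    with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as f:
        f.write(registration_line(event_fd, control_fd))
    return event_fd, control_fd


def wait_for_oom_events(event_fd: int) -> int:
    """Block until at least one event arrives; return the event count."""
    return decode_event_count(os.read(event_fd, 8))
```

With one such listener per step cgroup, each counter should only grow for OOM events in its own cgroup. The oddity described above would show up as several listeners waking for an event that originated in just one of them.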
*** Ticket 10349 has been marked as a duplicate of this ticket. ***
Hi Anthony and Ahmed, If there are no more questions, I am closing this issue. Please mark it as OPEN again and ask if your concerns are still present. Thanks.
(In reply to Felip Moll from comment #5) Felip - Thanks for the info. Agreed on course of action. Best, Tony.