Ticket 10336

Summary: Job extern step always ends in OUT_OF_MEMORY in 20.02.6 - patch available?
Product: Slurm Reporter: Anthony DelSorbo <anthony.delsorbo>
Component: slurmd    Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact    
CC: ahmed.mazaty, felip.moll
Version: 20.02.6   
Hardware: Linux   
OS: Linux   
Site: NOAA
NOAA Site: NESCC
Attachments: bug10336_20026_custom.patch

Description Anthony DelSorbo 2020-12-02 08:21:42 MST
We are running into the same issue as in bug 10255.  That ticket states the issue is resolved with a patch in 20.02.7.  Of course, we just installed 20.02.6 yesterday and had to take measures to remedy the issue in real time, so we were forced to downgrade the slurmd component to 20.02.4.  The slurmctld, slurmdbd, and default installations are all still pointing to 20.02.6.

A few questions:

1. How critical is this bug?  Is it just a cosmetic issue, or is there cause for concern about job behavior?
2. When will 20.02.7 be released, so that we can install and test that version?
3. If it will be a while, can you provide a patch for this issue so that we can install and test it?


Thanks,

Tony.
Comment 1 Felip Moll 2020-12-02 08:59:30 MST
Created attachment 16921 [details]
bug10336_20026_custom.patch

(In reply to Anthony DelSorbo from comment #0)
> We are running into the same issue as in bug 10255.  It states that it is
> resolved with a patch in 20.02.7  Of course, we just installed 20.02.6
> yesterday and had to take measures to remedy the issue in real time.  So, we
> were forced to back down the slurmd part to 20.02.4.  The slurmctld,
> slurmdbd and default are all pointing to 20.02.6.
> 
> A few questions:
> 
> 1. How critical is this bug?  Is it just a cosmetic issue or is there cause
> for concern in job behavior.

It just marks the extern step as OUT_OF_MEMORY. This is reflected in the accounting, and there's no 'real' impact on the execution of the job itself.
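
For reference, the symptom shows up in the accounting records. A quick way to see it (the job ID below is just an example; the sacct fields are standard):

  # With this bug, the .extern step is reported as OUT_OF_MEMORY while the
  # batch step and the job record itself complete normally.
  sacct -j 12345 --format=JobID,JobName,State,ExitCode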

> 2. When will 20.02.7 be released so that we can install and test that version

We have no official release date yet.

> 3. If it will be a while, can you provide a patch for this issue so that we
> can install and test it?

Sure, I'm attaching the raw patch for you to test.
The commit that was applied to the latest version works on your version too, so it is the same fix.

https://github.com/SchedMD/slurm/commit/272c636d507e1dc59d987da478d42f6713d88ae1
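
In case it helps, this is roughly how I would apply it to a 20.02.6 source tree (the -p1 level and paths are assumptions; check them against the patch header before applying):

  cd slurm-20.02.6
  patch -p1 < bug10336_20026_custom.patch
  # (appending .patch to the GitHub commit URL above yields the same change)
  ./configure        # reuse your usual configure options here
  make -j
  make install       # then restart slurmd on the affected nodes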



Off-topic: Which kernel version are your nodes on? I am experiencing some oddities in recent kernels and cgroups v1.
Comment 2 Anthony DelSorbo 2020-12-02 09:04:54 MST
(In reply to Felip Moll from comment #1)
Thanks for your replies Felip.
> 
> Off-topic: Which kernel version are your nodes on? I am experiencing some
> oddities in recent kernels and cgroups v1.

Kernel: 3.10.0-1127.19.1.el7.x86_64

Would you elaborate on the oddities you're experiencing?  Would I see the same here?  Or is it because you may be in a development environment?

Best,

Tony.
Comment 3 Felip Moll 2020-12-02 09:51:46 MST
> Kernel: 3.10.0-1127.19.1.el7.x86_64

I am not seeing these issues on this kernel version.

> Would you elaborate on the oddities you're experiencing?  Would I see the
> same here?  Or is it because you may be in a development environment?

You should be safe. Briefly, we register as listeners for events on the cgroup hierarchy to count the number of OOMs every step receives during its execution. There is one event listener per step, and in recent kernels (5.x) I see an event generated on one step being broadcast to the other steps, which means that if one step gets OOMed, the others detect the event and mark themselves as OOMed too. This doesn't happen on 3.x kernels.
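
If you are curious, you can inspect the cgroup v1 interface involved directly. A minimal sketch, assuming the default /sys/fs/cgroup mount and Slurm's usual uid_*/job_*/step_* layout (adjust the uid and job IDs to a real job on your system):

  # memory.oom_control exposes oom_kill_disable and under_oom per cgroup;
  # slurmd registers an eventfd against this file through cgroup.event_control
  # and is woken up when an OOM event fires in that step's cgroup.
  base=/sys/fs/cgroup/memory/slurm/uid_1000/job_12345
  for step in "$base"/step_*; do
      echo "== $step =="
      cat "$step/memory.oom_control"
  done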

The investigation is still ongoing, but it is annoying at the moment... cgroups v1.


Let me know if the patch fixes the issue for you.
Comment 4 Felip Moll 2020-12-04 04:59:12 MST
*** Ticket 10349 has been marked as a duplicate of this ticket. ***
Comment 5 Felip Moll 2020-12-07 06:16:24 MST
Hi Anthony and Ahmed,

If there are no more questions I am closing this issue.

Please mark it as OPEN again and just ask if your concerns are still present.

Thanks
Comment 6 Anthony DelSorbo 2020-12-07 08:43:16 MST
(In reply to Felip Moll from comment #5)
Felip - Thanks for the info.  Agreed on the course of action.

Best,

Tony.