Ticket 10336 - Job extern step always ends in OUT_OF_MEMORY in 20.02.6 - patch available?
Summary: Job extern step always ends in OUT_OF_MEMORY in 20.02.6 - patch available?
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 20.02.6
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
QA Contact:
URL:
Duplicates: 10349
Depends on:
Blocks:
 
Reported: 2020-12-02 08:21 MST by Anthony DelSorbo
Modified: 2020-12-07 08:43 MST
CC List: 2 users

See Also:
Site: NOAA
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: NESCC
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
bug10336_20026_custom.patch (1.51 KB, patch)
2020-12-02 08:59 MST, Felip Moll

Description Anthony DelSorbo 2020-12-02 08:21:42 MST
We are running into the same issue as in bug 10255. That ticket states it is resolved with a patch in 20.02.7. Of course, we just installed 20.02.6 yesterday and had to take measures to remedy the issue in real time, so we were forced to roll the slurmd component back to 20.02.4. The slurmctld, slurmdbd, and default installations are all still pointing to 20.02.6.

A few questions:

1. How critical is this bug? Is it just a cosmetic issue, or is there cause for concern about job behavior?
2. When will 20.02.7 be released so that we can install and test that version?
3. If it will be a while, can you provide a patch for this issue so that we can install and test it?


Thanks,

Tony.
Comment 1 Felip Moll 2020-12-02 08:59:30 MST
Created attachment 16921 [details]
bug10336_20026_custom.patch

(In reply to Anthony DelSorbo from comment #0)
> We are running into the same issue as in bug 10255. That ticket states it is
> resolved with a patch in 20.02.7. Of course, we just installed 20.02.6
> yesterday and had to take measures to remedy the issue in real time, so we
> were forced to roll the slurmd component back to 20.02.4. The slurmctld,
> slurmdbd, and default installations are all still pointing to 20.02.6.
> 
> A few questions:
> 
> 1. How critical is this bug? Is it just a cosmetic issue, or is there cause
> for concern about job behavior?

It just marks the extern step as OUT_OF_MEMORY. This is reflected in the accounting, but there is no real impact on the execution of the job itself.

> 2. When will 20.02.7 be released so that we can install and test that version?

We have no official release date yet.

> 3. If it will be a while, can you provide a patch for this issue so that we
> can install and test it?

Sure, I have attached the raw patch for you to test.
The commit that was applied to the latest version applies to your version as well, so it is the same fix:

https://github.com/SchedMD/slurm/commit/272c636d507e1dc59d987da478d42f6713d88ae1



Off-topic: which kernel version are your nodes on? I am experiencing some oddities with recent kernels and cgroups v1.
Comment 2 Anthony DelSorbo 2020-12-02 09:04:54 MST
(In reply to Felip Moll from comment #1)
Thanks for your replies, Felip.
> 
> Off-topic: Which kernel version are your nodes on? I am experiencing some
> oddities in recent kernels and cgroups v1.

Kernel: 3.10.0-1127.19.1.el7.x86_64

Would you elaborate on the oddities you're experiencing? Would I see the same here, or is it because you may be in a development environment?

Best,

Tony.
Comment 3 Felip Moll 2020-12-02 09:51:46 MST
> Kernel: 3.10.0-1127.19.1.el7.x86_64

I am not seeing these issues with this kernel version.

> Would you elaborate on the oddities you're experiencing?  Would I see the
> same here?  Or is it because you may be in a development environment?

You should be safe. Briefly, we register as listeners for events on the cgroup hierarchy in order to count the number of OOM events each step receives during its execution. There is one event listener per step, but in recent kernels (5.x) I see an event generated on one step being broadcast to other steps, which means that if one step gets OOMed, the others detect it and mark themselves as OOMed too. This does not happen on 3.x kernels.
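
For illustration only, here is a minimal sketch, not taken from Slurm's source, of how such a per-step OOM listener can be registered under cgroups v1: an eventfd is tied to the step cgroup's memory.oom_control file through cgroup.event_control, and each read of the eventfd returns the number of OOM events recorded since the last read. The step cgroup path below is hypothetical.

/*
 * Hedged sketch of a cgroup v1 per-step OOM event listener.
 * The step cgroup path is a made-up example, not a path Slurm guarantees.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical per-step memory cgroup; a real setup derives this per step. */
    const char *cg = "/sys/fs/cgroup/memory/slurm/uid_1000/job_42/step_extern";
    char path[512], reg[64];
    uint64_t count;

    int efd = eventfd(0, 0);                 /* counter the kernel bumps on each OOM */

    snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
    int oom_fd = open(path, O_RDONLY);

    snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
    int ctl_fd = open(path, O_WRONLY);

    if (efd < 0 || oom_fd < 0 || ctl_fd < 0) {
        perror("listener setup");
        return 1;
    }

    /* Writing "<eventfd> <oom_control fd>" ties this eventfd to OOM events
     * of this cgroup only, i.e. one listener per step. */
    snprintf(reg, sizeof(reg), "%d %d", efd, oom_fd);
    if (write(ctl_fd, reg, strlen(reg)) < 0) {
        perror("cgroup.event_control");
        return 1;
    }

    /* Blocks until the kernel signals an OOM in this cgroup; the value read
     * is the number of events since the last read. */
    if (read(efd, &count, sizeof(count)) == sizeof(count))
        printf("step saw %llu OOM event(s)\n", (unsigned long long)count);

    close(ctl_fd);
    close(oom_fd);
    close(efd);
    return 0;
}

Since each eventfd is registered against a single step cgroup, notifications should in principle stay isolated per step, which is what makes the cross-step events described above surprising.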

The investigation is still ongoing, but it is annoying at the moment... cgroups v1.


Let me know if the patch fixes the issue for you.
Comment 4 Felip Moll 2020-12-04 04:59:12 MST
*** Ticket 10349 has been marked as a duplicate of this ticket. ***
Comment 5 Felip Moll 2020-12-07 06:16:24 MST
Hi Anthony and Ahmed,

If there are no more questions, I am closing this ticket.

Please mark it as OPEN again and ask if your concerns are still present.

Thanks
Comment 6 Anthony DelSorbo 2020-12-07 08:43:16 MST
(In reply to Felip Moll from comment #5)
Felip - Thanks for the info. Agreed on the course of action.

Best,

Tony.