Ticket 5241

Summary: Modify Slurm to report OOM event when using hugepages
Product: Slurm Reporter: Moe Jette <jette>
Component: slurmstepd Assignee: Unassigned Developer <dev-unassigned>
Status: OPEN --- QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: brian.gilmer, bsantos, david.gloe, fullop, mej, schedmd, tim
Version: 17.11.7   
Hardware: Cray XC   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=4520
Site: LANL

Description Moe Jette 2018-05-31 17:41:31 MDT
Cray's UP07 release reportedly includes a new hugepages library that works with cgroups, including the recording of OOM events.

Tentatively, Slurm will be modified to report the event with a message similar to the one currently issued for other memory OOM events, something like this:
srun: error: nid12345: tasks 67-78: Out of Hugepages Memory
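
As a rough illustration of the proposed message format only (not Slurm's actual code, which is written in C), the sketch below builds such a line from a node name and a list of task IDs. The function names `compress_task_ids` and `format_hugepages_oom` are hypothetical, and the singular "task" wording for a single task ID is an assumption; the ticket only shows the plural form.

```python
def compress_task_ids(task_ids):
    """Collapse task IDs into a range string, e.g. [67, 68, 69, 71] -> "67-69,71"."""
    ids = sorted(set(task_ids))
    if not ids:
        return ""
    ranges = []
    start = prev = ids[0]
    for tid in ids[1:]:
        if tid == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = tid
            continue
        # Run ended; record it and start a new one.
        ranges.append((start, prev))
        start = prev = tid
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def format_hugepages_oom(node, task_ids):
    """Build a message shaped like the example in this ticket (hypothetical helper)."""
    unique_ids = set(task_ids)
    # Assumption: use the singular "task" when only one task is affected.
    label = "task" if len(unique_ids) == 1 else "tasks"
    return f"srun: error: {node}: {label} {compress_task_ids(unique_ids)}: Out of Hugepages Memory"
```

For example, `format_hugepages_oom("nid12345", range(67, 79))` reproduces the sample line above.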

No date has yet been set for delivery of this feature in Slurm.