Ticket 5241 - Modify Slurm to report OOM event when using hugepages
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 17.11.7
Hardware: Cray XC Linux
Severity: 5 - Enhancement
Assignee: Unassigned Developer
Reported: 2018-05-31 17:41 MDT by Moe Jette
Modified: 2019-12-20 11:28 MST
Site: LANL


Description Moe Jette 2018-05-31 17:41:31 MDT
Cray's UP07 release is reported to include a new hugepages library that works with cgroups, including the recording of OOM events.

Tentatively, Slurm will be modified to report the event with a message similar to the one it currently provides for other memory OOM events, something like this:
srun: error: nid12345: tasks 67-78: Out of Hugepages Memory
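
To make the proposed message format concrete, here is a minimal Python sketch that builds such a line from a node name and a list of affected task IDs. It is only an illustration, not Slurm code: Slurm is written in C and has its own range-formatting helpers. The function names `compress_ranges` and `format_hugepages_oom` are hypothetical, and the singular "task" label for a single ID is an assumption.

```python
def compress_ranges(task_ids):
    """Collapse task IDs into a compact string such as '1,3-5,9'."""
    ids = sorted(set(task_ids))
    parts = []
    i = 0
    while i < len(ids):
        start = end = ids[i]
        # Extend the current run while the next ID is consecutive.
        while i + 1 < len(ids) and ids[i + 1] == end + 1:
            i += 1
            end = ids[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


def format_hugepages_oom(node, task_ids):
    """Build a message in the style proposed in this ticket (hypothetical helper)."""
    ids = sorted(set(task_ids))
    # Assumption: a single task is labeled "task", several are labeled "tasks".
    label = "task" if len(ids) == 1 else "tasks"
    return f"srun: error: {node}: {label} {compress_ranges(ids)}: Out of Hugepages Memory"


if __name__ == "__main__":
    print(format_hugepages_oom("nid12345", range(67, 79)))
```

Called with node `nid12345` and tasks 67 through 78, the sketch reproduces the sample line above.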

No date has yet been set for delivery of this feature in Slurm.