Summary: | Improve feedback on Out Of Memory conditions | ||
---|---|---|---|
Product: | Slurm | Reporter: | Alejandro Sanchez <alex> |
Component: | slurmd | Assignee: | Alejandro Sanchez <alex> |
Status: | OPEN --- | QA Contact: | |
Severity: | 5 - Enhancement | ||
Priority: | --- | CC: | felip.moll, kaizaad |
Version: | 18.08.x | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6765, https://bugs.schedmd.com/show_bug.cgi?id=9737, https://bugs.schedmd.com/show_bug.cgi?id=10122 | ||
Site: | SchedMD | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | Google sites: | --- |
HPCnow Sites: | --- | HPE Sites: | --- |
IBM Sites: | --- | NOAA Site: | --- |
NoveTech Sites: | --- | Nvidia HWinf-CS Sites: | --- |
OCF Sites: | --- | Recursion Pharma Sites: | --- |
SFW Sites: | --- | SNIC sites: | --- |
Tzag Elita Sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | 3 - High | Emory-Cloud Sites: | --- |
Description
Alejandro Sanchez
2018-01-04 04:45:54 MST
Part of this enhancement has already been addressed here: https://github.com/SchedMD/slurm/commit/943c4a130f39dbb1fb

Perhaps modify the API so that we get rid of SIG_OOM and instead add one or more new members that reflect an oom-kill event and/or memory usage hitting the limit, perhaps displaying the latter as a SystemComment.

Try to detect kernels that make the different OOM counts available in the event file (https://patchwork.kernel.org/patch/9737381/) and use those counts instead of the manual eventfd() monitoring.

*** Ticket 6765 has been marked as a duplicate of this ticket. ***
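
The kernel-detection idea in the description could be sketched roughly as follows. This is only an illustration, not the slurmd implementation: `cgroup_flat_key` is a hypothetical helper name, and it assumes that kernels carrying the linked patch expose an `oom_kill <count>` line in the cgroup's flat-keyed file (`memory.oom_control` on cgroup v1, `memory.events` on cgroup v2). The helper works on the file contents already read into a buffer and reports whether the counter is present, so the caller can fall back to eventfd() monitoring when it is not.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Slurm): look up "<key> <number>" in the
 * text of a flat-keyed cgroup file, one "key value" pair per line.
 * Returns 0 and stores the value in *out when the key is present,
 * -1 when the key is missing or its value is malformed.
 */
static int cgroup_flat_key(const char *text, const char *key,
                           unsigned long long *out)
{
    assert(text != NULL && key != NULL && out != NULL);

    size_t klen = strlen(key);
    const char *line = text;

    while (*line != '\0') {
        /* Match the key only at the start of a line and followed by a
         * space, so "oom_kill" does not match "oom_kill_disable". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ' ') {
            const char *num = line + klen + 1;
            char *end = NULL;

            /* Reject signs and whitespace that strtoull() would accept. */
            if (*num < '0' || *num > '9')
                return -1;
            errno = 0;
            unsigned long long v = strtoull(num, &end, 10);
            if (errno != 0 || (*end != '\n' && *end != '\0'))
                return -1;
            *out = v;
            return 0;
        }

        /* Advance to the next line, if any. */
        const char *nl = strchr(line, '\n');
        if (nl == NULL)
            break;
        line = nl + 1;
    }
    return -1; /* key absent: assume an older kernel, use the fallback */
}
```

A caller would read the event file for the job or step cgroup once at teardown, call `cgroup_flat_key(buf, "oom_kill", &count)`, and only set up or consult the eventfd() path when the call returns -1.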