Ticket 4288

Summary:	SLUB errors handling in SLURM
Product:	Slurm	Reporter:	Damien <damien.leong>
Component:	Other	Assignee:	Alejandro Sanchez <alex>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	alex
Version:	16.05.4
Hardware:	Linux
OS:	Linux
Site:	Monash University	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---

Description Damien 2017-10-22 04:49:03 MDT

Hi Slurm Support

Comment 1 Damien 2017-10-22 04:58:40 MDT

Hi Slurm Support

This is an info-query: how does Slurm handle SLUB errors on compute nodes? Will the nodes be taken offline and drained?

Do any Slurm users complain of SLUB errors when using Slurm, and how do they deal with them?


Kindly advise. Thanks.

Cheers

Damien

Comment 2 Alejandro Sanchez 2017-10-23 04:32:15 MDT

Hi Damien.

> This is an info-query: how does Slurm handle SLUB errors on compute nodes?
> Will the nodes be taken offline and drained?

Slurm currently neither tracks compute nodes' syslog messages nor interacts with the OS to look for SLUB errors. Nodes are not set to DRAIN, and no other action is taken.
 
> Do any Slurm users complain of SLUB errors when using Slurm, and how do
> they deal with them?

I only see these two bugs related to SLUB/SLAB errors:
https://bugs.schedmd.com/show_bug.cgi?id=3648
https://bugs.schedmd.com/show_bug.cgi?id=3874

There are cgroup.conf options to constrain memory.kmem.limit_in_bytes, whose accounting, I believe, includes[1] stack pages, slab pages and sockets memory pressure.

The latest commit related to these options (included since Slurm 17.02.5) is:

https://github.com/SchedMD/slurm/commit/ba32ac482194e5b

Slurm itself only monitors the responsiveness of the slurmd daemon on the nodes and delegates other monitoring/health work to the site-specific HealthCheckProgram, which you could use to run a test that detects SLUB errors and takes the appropriate action if needed. For KNL nodes, Slurm provides the UmeCheckInterval option to detect Uncorrectable Memory Errors (UME), and the node is set to DOWN if any are detected, but there is nothing similar for SLUB at present.
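
As a rough illustration of the HealthCheckProgram approach, here is a minimal Python sketch, not an official or tested Slurm script. It reads the kernel log with dmesg and, if any line looks like a SLUB error, drains the node with scontrol. The regular expressions, the Reason string, the function names and the use of the short hostname as the NodeName are all assumptions; adjust them to what your kernel actually logs and to how your nodes are named in slurm.conf.

```python
#!/usr/bin/env python3
"""Hypothetical HealthCheckProgram sketch: drain the node on SLUB errors."""
import re
import socket
import subprocess

# Assumed patterns; check them against the messages your kernel really emits.
SLUB_PATTERNS = [
    re.compile(r"SLUB: Unable to allocate memory"),
    re.compile(r"BUG kmalloc-\S+"),
]


def find_slub_errors(log_text):
    """Return the log lines that match any of the assumed SLUB patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in SLUB_PATTERNS)
    ]


def drain_command(node, reason):
    """Build the scontrol command line that sets the node to DRAIN."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]


def main():
    try:
        # dmesg may be restricted or missing; then we simply see no output.
        log_text = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        ).stdout
    except OSError:
        log_text = ""
    if find_slub_errors(log_text):
        # Assumption: NodeName in slurm.conf equals the short hostname.
        node = socket.gethostname().split(".")[0]
        try:
            subprocess.run(
                drain_command(node, "SLUB errors in kernel log"), check=False
            )
        except OSError:
            pass  # scontrol not available on this host
    return 0


if __name__ == "__main__":
    main()
```

A script like this would be referenced by HealthCheckProgram in slurm.conf and run at the configured HealthCheckInterval. Note that it re-reads the whole kernel ring buffer on every run, so a node drained once will be drained again after being resumed until the old messages rotate out or the node is rebooted.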

[1] https://lwn.net/Articles/529927/

Comment 3 Alejandro Sanchez 2017-11-07 09:22:50 MST

Hi Damien, is there anything else we can assist you with on this bug? Thanks.

Comment 4 Damien 2017-11-07 17:39:39 MST

Thanks. Please close this ticket.

Cheers

Damien



(In reply to Alejandro Sanchez from comment #3)
> Hi Damien, is there anything else we can assist you with on this bug? Thanks.