Ticket 4288

Summary: SLUB errors handling in SLURM
Product: Slurm Reporter: Damien <damien.leong>
Component: Other Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: alex
Version: 16.05.4   
Hardware: Linux   
OS: Linux   
Site: Monash University

Description Damien 2017-10-22 04:49:03 MDT
Hi Slurm Support
Comment 1 Damien 2017-10-22 04:58:40 MDT
Hi Slurm Support

This is an info-query: how does Slurm handle SLUB errors on compute nodes? Will the nodes be taken offline and drained?

Does any Slurm user complain of SLUB errors when using Slurm? And how do they deal with them?


Kindly advise. Thanks.

Cheers

Damien
Comment 2 Alejandro Sanchez 2017-10-23 04:32:15 MDT
Hi Damien.

> This is an info-query: how does Slurm handle SLUB errors on compute nodes?
> Will the nodes be taken offline and drained?

Slurm does not currently track compute nodes' syslog messages, nor does it interact with the OS to look for SLUB errors. Nodes are not set to DRAIN and no action is taken.

> Does any Slurm user complain of SLUB errors when using Slurm? And how do
> they deal with them?

I only see these two bugs related to SLUB/SLAB errors:
https://bugs.schedmd.com/show_bug.cgi?id=3648
https://bugs.schedmd.com/show_bug.cgi?id=3874

There are cgroup.conf options to constrain memory.kmem.limit_in_bytes, whose accounting I believe includes[1] stack pages, slab pages and sockets memory pressure.

Last commit related to these options (included since Slurm 17.02.5) is:

https://github.com/SchedMD/slurm/commit/ba32ac482194e5b

Slurm itself only monitors slurmd daemon responsiveness on the nodes and delegates other monitoring/health work to the site-specific HealthCheckProgram, which you could use to run a test that detects SLUB errors and takes the appropriate action if needed. For KNL nodes Slurm provides the UmeCheckInterval option to detect Uncorrectable Memory Errors (UME), and the node is set to DOWN if any are detected, but there's nothing similar for SLUB at present.
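
As a rough illustration of the HealthCheckProgram approach, here is a minimal Python sketch. It is not something Slurm ships. The message patterns in SLUB_PATTERNS, the helper names (find_slub_errors, build_drain_command), and the assumption that the short hostname equals the Slurm NodeName are all hypothetical and would need adjusting per site. It reads the kernel ring buffer with dmesg and, if any line looks like a SLUB error, drains the node with scontrol.

```python
#!/usr/bin/env python3
"""Hypothetical health check sketch: drain the node if dmesg shows SLUB errors."""
import re
import socket
import subprocess

# Assumed message shapes; adjust them to what your kernel actually logs.
SLUB_PATTERNS = [
    re.compile(r"SLUB: Unable to allocate memory"),
    re.compile(r"BUG \S+ \((?:Not tainted|Tainted: [^)]*)\):"),
]


def find_slub_errors(log_text):
    """Return the kernel log lines that match any of the SLUB patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(p.search(line) for p in SLUB_PATTERNS)
    ]


def build_drain_command(node, reason):
    """Build the scontrol invocation that sets the node to DRAIN."""
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]


def main():
    # slurmd runs the health check as root, so dmesg should be readable.
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    hits = find_slub_errors(result.stdout)
    if hits:
        # Assumption: the short hostname matches the NodeName in slurm.conf.
        node = socket.gethostname().split(".")[0]
        subprocess.run(build_drain_command(node, "SLUB errors in kernel log"), check=False)
    return 0
```

To use it, the script that HealthCheckProgram points to in slurm.conf would end with a call such as sys.exit(main()). Old messages stay in the ring buffer until it is cleared or the node reboots, so a node keeps being re-drained until then; a site may prefer to track a timestamp of the last seen error instead.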

[1] https://lwn.net/Articles/529927/
Comment 3 Alejandro Sanchez 2017-11-07 09:22:50 MST
Hi Damien, is there anything else we can assist you with on this bug? Thanks.
Comment 4 Damien 2017-11-07 17:39:39 MST
Thanks. Please close this ticket.

Cheers

Damien



(In reply to Alejandro Sanchez from comment #3)
> Hi Damien, is there anything else we can assist you with on this bug? Thanks.