| Summary: | SLUB errors handling in SLURM | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Damien <damien.leong> |
| Component: | Other | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | alex |
| Version: | 16.05.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Monash University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Damien
2017-10-22 04:49:03 MDT
Hi Slurm Support,

This is an info query: how does Slurm handle SLUB errors on compute nodes? Will the nodes be taken offline and drained?

Do any Slurm users complain of SLUB errors when using Slurm? And how do they deal with them?

Kindly advise. Thanks.

Cheers
Damien

Hi Damien.

> This is an info query: how does Slurm handle SLUB errors on compute nodes?
> Will the nodes be taken offline and drained?

Slurm does not currently track compute nodes' syslog messages, nor does it interact with the OS to look for SLUB errors. Nodes are not set to DRAIN and no action is taken.

> Do any Slurm users complain of SLUB errors when using Slurm? And how do
> they deal with them?

I only see these two bugs related to SLUB/SLAB errors:

https://bugs.schedmd.com/show_bug.cgi?id=3648
https://bugs.schedmd.com/show_bug.cgi?id=3874

There are cgroup.conf options to constrain memory.kmem.limit_in_bytes, whose accounting I believe includes[1] stack pages, slab pages and sockets memory pressure. The last commit related to these options (included since Slurm 17.02.5) is:

https://github.com/SchedMD/slurm/commit/ba32ac482194e5b

Slurm itself only monitors the responsiveness of the slurmd daemon on the nodes and delegates other monitoring/health work to the configured HealthCheckProgram, which you could use to run a test that detects SLUB errors and takes the appropriate action if needed. For KNL nodes, Slurm provides the UmeCheckInterval option to detect Uncorrectable Memory Errors (UME), and the node is set to DOWN if any are detected, but there is nothing similar for SLUB at present.

[1] https://lwn.net/Articles/529927/

Hi Damien, is there anything else we can assist you with on this bug? Thanks.

Thanks, please close this ticket.

Cheers
Damien

(In reply to Alejandro Sanchez from comment #3)
> Hi Damien, is there anything else we can assist you with on this bug? Thanks.
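
For anyone landing on this ticket later, the HealthCheckProgram suggestion in the reply above could be sketched roughly as follows. This is only an illustration under several assumptions that do not come from the thread: that SLUB reports appear in `dmesg` as lines of the form `BUG <cache> (<taint>): <message>`, that the script runs with enough privilege to read the kernel log, that `scontrol` is in the PATH, and that the Slurm NodeName equals the short hostname. The function names and the regular expression are hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical HealthCheckProgram sketch: drain the node if dmesg shows SLUB reports."""
import re
import socket
import subprocess

# Assumed report format, e.g. "BUG kmalloc-64 (Not tainted): Redzone overwritten".
# Generic "BUG: ..." and "kernel BUG at ..." lines are deliberately not matched.
SLUB_REPORT = re.compile(r"\bBUG (\S+?)(?: \([^)]*\))?: (.+)")


def find_slub_errors(log_text):
    """Return (cache_name, message) pairs for lines that look like SLUB reports."""
    hits = []
    for line in log_text.splitlines():
        match = SLUB_REPORT.search(line)
        if match:
            hits.append((match.group(1), match.group(2).strip()))
    return hits


def drain_node(node, reason):
    """Ask the controller to drain the node (assumes scontrol is in PATH)."""
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"],
        check=True,
    )


def main():
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, errors="replace", check=False
        )
    except OSError:
        return  # dmesg is not available, so there is nothing to check
    errors = find_slub_errors(result.stdout)
    if not errors:
        return
    cache, message = errors[0]
    node = socket.gethostname().split(".")[0]  # assumption: NodeName == short hostname
    try:
        drain_node(node, f"SLUB error in {cache}: {message}")
    except (OSError, subprocess.CalledProcessError):
        pass  # scontrol missing or failed; a real check would log this somewhere


if __name__ == "__main__":
    main()
```

A script like this would be referenced from slurm.conf through `HealthCheckProgram` (with `HealthCheckInterval` controlling how often it runs). Note that the kernel log keeps old messages until it is cleared or the node reboots, so a production check would probably want to remember which reports it has already acted on.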