Ticket 5062 - slurmstepd stuck, showing nodes down
Summary: slurmstepd stuck, showing nodes down
Status: RESOLVED DUPLICATE of ticket 4733
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 17.11.4
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-04-12 10:08 MDT by Naveed Near-Ansari
Modified: 2018-04-13 10:08 MDT

See Also:
Site: Caltech
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
cgroup.conf (82 bytes, application/octet-stream) - 2018-04-13 08:22 MDT, Naveed Near-Ansari
slurm.conf (2.85 KB, application/octet-stream) - 2018-04-13 08:22 MDT, Naveed Near-Ansari
slurmstep_hung (16.01 KB, application/octet-stream) - 2018-04-13 08:47 MDT, Naveed Near-Ansari
cgroup.conf (82 bytes, application/octet-stream) - 2018-04-13 08:47 MDT, Naveed Near-Ansari
slurm.conf (2.85 KB, application/octet-stream) - 2018-04-13 08:47 MDT, Naveed Near-Ansari

Description Naveed Near-Ansari 2018-04-12 10:08:13 MDT
Some nodes are showing as down:

any          up 14-00:00:0     12  down* hpc-22-[19,22],hpc-23-[07,09,12,14,22-23],hpc-25-10,hpc-92-05,hpc-93-[06,15]

Upon investigation, each of these still has a slurmstepd process running on it:

[naveed@hpc-22-19 ~]$ ps aux | grep slurm
root      21822  0.0  0.0 17120736 14680 ?      Sl   Apr08   0:01 /central/slurm/install/d/sbin/slurmd
root      99409  0.0  0.0 371588  4308 ?        Sl   Apr10   0:04 slurmstepd: [1443.extern]
root      99422  0.0  0.0 755900  6632 ?        Sl   Apr10   0:00 slurmstepd: [1443.0]

dmesg is showing many SLUB errors:

[196430.357796] SLUB: Unable to allocate memory on node -1 (gfp=0x8020)
[196430.357797]   cache: blkdev_ioc(18:step_0), object size: 104, buffer size: 104, default order: 1, min order: 0
[196430.357798]   node 0: slabs: 11, objs: 858, free: 0
[196430.357798]   node 1: slabs: 10, objs: 741, free: 0
[196430.358048] __get_request: dev 8:0: request aux data allocation failed, iosched may be disturbed

Memory:
[naveed@hpc-22-19 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:         192080       15381      175257          89        1441      175197
Swap:         65535          18       65517


Have you seen this before, and do you have any suggestions for preventing it?
Comment 1 Alejandro Sanchez 2018-04-13 02:52:59 MDT
Hi Naveed. The 'Sl' process state means interruptible sleep in a multi-threaded process. We have three bugs open reporting a similar problem (4733, 4810, 4690) that we may eventually collapse into a single bug to address the same issue for everyone.

In order to confirm you're experiencing the same error:

- Can you gdb attach to a couple of these Sl stepds, execute 'bt', and attach the output here (one non-interactive way to do this is sketched after this list)? Let's see if the backtrace looks like the deadlocks we've already identified in the other logs.
- Exact GLIBC version, including all vendor sub-numbering.
- Output from /proc/cpuinfo (just the first processor is fine).
- Do you have multithreading turned off in your nodes?
- Can you also attach slurm.conf and cgroup.conf?

Thanks.
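
For reference, a minimal non-interactive way to capture such a backtrace, assuming gdb is available on the node (the PID here is the stuck stepd from the ps output in the description; adjust as needed):

# Attach to the stuck stepd and print a backtrace without an
# interactive session; run as root. Batch mode detaches on exit.
gdb -p 99409 -batch -ex 'bt'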
Comment 2 Naveed Near-Ansari 2018-04-13 08:22:32 MDT
Created attachment 6627 [details]
cgroup.conf

On slurmstepd: [1444.0]:

(gdb) bt
#0  0x00002aaaabfe732a in wait4 () from /usr/lib64/libc.so.6
#1  0x0000000000410086 in _spawn_job_container (job=0x649a50) at mgr.c:1107
#2  job_manager (job=job@entry=0x649a50) at mgr.c:1216
#3  0x000000000040c9f7 in main (argc=1, argv=0x7fffffffed88) at slurmstepd.c:172


On slurmstepd: [1444.extern]:

(gdb) bt
#0  0x00002aaaabd15ef7 in pthread_join () from /usr/lib64/libpthread.so.0
#1  0x000000000041078f in _wait_for_io (job=0x647760) at mgr.c:2219
#2  job_manager (job=job@entry=0x647760) at mgr.c:1397
#3  0x000000000040c9f7 in main (argc=1, argv=0x7fffffffed88) at slurmstepd.c:172

Glibc:
glibc-2.17-157.el7_3.1.i686
glibc-2.17-157.el7_3.1.x86_64


Hyperthreading is off:

[naveed@hpc-25-10 ~]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
stepping        : 4
microcode       : 0x200002c
cpu MHz         : 2100.000
cache size      : 22528 KB
physical id     : 0
siblings        : 16
core id         : 0
cpu cores       : 16
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
bogomips        : 4200.00
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
From: "bugs@schedmd.com" <bugs@schedmd.com>
Date: Friday, April 13, 2018 at 1:53 AM
To: Naveed Near-Ansari <naveed@caltech.edu>
Subject: [Bug 5062] slurmstepd stuck, showing nodes down

Alejandro Sanchez<mailto:alex@schedmd.com> changed bug 5062<https://bugs.schedmd.com/show_bug.cgi?id=5062>
What

Removed

Added

CC



alex@schedmd.com

Assignee

support@schedmd.com

alex@schedmd.com

Comment # 1<https://bugs.schedmd.com/show_bug.cgi?id=5062#c1> on bug 5062<https://bugs.schedmd.com/show_bug.cgi?id=5062> from Alejandro Sanchez<mailto:alex@schedmd.com>

Hi Naveed. Process Sl state means Interruptible sleep in a multi-threaded

process. We've 3 bugs open reporting a similar problem (4733, 4810, 4690) that

we might collapse eventually into a single bug to address the same issue for

everyone.



In order to confirm you're experiencing the same error:



- Can you gdb attach to a couple of these Sl stepds and execute 'bt' and attach

here the output? Let's see if the backtace looks like the deadlocks we've

already identified in the other logs.

- Exact GLIBC version, including all vendor sub-numbering.

- Output from /proc/cpuinfo (just the first processor is fine).

- Do you have multithreading turned off in your nodes?

- Can you also attach slurm.conf and cgroup.conf?



Thanks.

________________________________
You are receiving this mail because:
·         You reported the bug.
Comment 3 Naveed Near-Ansari 2018-04-13 08:22:33 MDT
Created attachment 6628 [details]
slurm.conf
Comment 4 Alejandro Sanchez 2018-04-13 08:26:01 MDT
Sorry, will you attach gdb again and report back the output of 'thread apply all bt' and 'thread apply all bt full'? Thanks.
Comment 5 Naveed Near-Ansari 2018-04-13 08:47:11 MDT
Created attachment 6629 [details]
slurmstep_hung

From: "bugs@schedmd.com" <bugs@schedmd.com>
Date: Friday, April 13, 2018 at 7:26 AM
To: Naveed Near-Ansari <naveed@caltech.edu>
Subject: [Bug 5062] slurmstepd stuck, showing nodes down

Comment # 4<https://bugs.schedmd.com/show_bug.cgi?id=5062#c4> on bug 5062<https://bugs.schedmd.com/show_bug.cgi?id=5062> from Alejandro Sanchez<mailto:alex@schedmd.com>

Sorry, will you reattach again and report back 'thread apply all bt' and

'thread apply all bt full'? Thanks
Comment # 3<https://bugs.schedmd.com/show_bug.cgi?id=5062#c3> on bug 5062<https://bugs.schedmd.com/show_bug.cgi?id=5062> from Naveed Near-Ansari<mailto:naveed@caltech.edu>

Created attachment 6628 [details]<attachment.cgi?id=6628> [details]<attachment.cgi?id=6628&action=edit>

slurm.conf

________________________________
You are receiving this mail because:
·         You reported the bug.
Comment 6 Naveed Near-Ansari 2018-04-13 08:47:12 MDT
Created attachment 6630 [details]
cgroup.conf
Comment 7 Naveed Near-Ansari 2018-04-13 08:47:12 MDT
Created attachment 6631 [details]
slurm.conf
Comment 8 Naveed Near-Ansari 2018-04-13 09:21:02 MDT
Let me know if you need anything else on this. If not, I'll return the nodes to service.

Thanks,

Naveed
Comment 9 Alejandro Sanchez 2018-04-13 09:24:58 MDT
Yes, sorry. I'd need you to attach gdb again to those steps and execute these:

(gdb) thread apply all bt 
and 
(gdb) thread apply all bt full

At first sight it looks like the same issue as in the other 3 bugs, but these gdb commands will confirm it. Thanks.
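
A sketch of one way to capture both outputs in a single non-interactive run (<stepd_pid> is a placeholder for the stuck stepd's PID):

# Batch mode disables pagination, so the full per-thread output
# lands in the file instead of stopping at a pager prompt.
gdb -p <stepd_pid> -batch \
    -ex 'thread apply all bt' \
    -ex 'thread apply all bt full' > stepd_backtraces.txt 2>&1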
Comment 10 Naveed Near-Ansari 2018-04-13 09:38:39 MDT
I included those in the attachment slurmstep_hung in the last message. I was getting weird template errors when emailing in the body of the message, so I thought that way would be safer.

From: "bugs@schedmd.com" <bugs@schedmd.com>
Date: Friday, April 13, 2018 at 8:25 AM
To: Naveed Near-Ansari <naveed@caltech.edu>
Subject: [Bug 5062] slurmstepd stuck, showing nodes down

Comment # 9<https://bugs.schedmd.com/show_bug.cgi?id=5062#c9> on bug 5062<https://bugs.schedmd.com/show_bug.cgi?id=5062> from Alejandro Sanchez<mailto:alex@schedmd.com>

Yes sorry I'd need you gdb attach again to those steps and execute these:



(gdb) thread apply all bt

and

(gdb) thread apply all bt full



at first sight it looks like the same issue in the other 3 bugs, but these gdb

commands will confirm it. Thanks.

________________________________
You are receiving this mail because:
·         You reported the bug.
Comment 11 Alejandro Sanchez 2018-04-13 10:08:06 MDT
The backtrace is the same as the one reported in all three of those bugs. There is a strange interaction between slurmd/stepd forking and calls to glibc's malloc(). The other sites also reported that glibc version, and we are still not sure whether the problem comes from glibc itself managing the arenas [1] or whether it is a Slurm problem. I'm marking this as a duplicate of ticket 4733 so we don't have four bugs with the same problem. Thanks for the reported information.

[1] Arena
A structure that is shared among one or more threads which contains references to one or more heaps, as well as linked lists of chunks within those heaps which are "free". Threads assigned to each arena will allocate memory from that arena's free lists.
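
For illustration only, a minimal C sketch of the fork-versus-malloc hazard described above. This is not Slurm code; it only mimics the shape of the suspected race (an allocating thread holding an arena lock at the instant another thread forks), and whether it ever hangs depends on the suspected glibc bug actually being present:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Allocate in a tight loop so there is a good chance this thread holds
 * a malloc arena lock at the moment the main thread calls fork(). */
static void *alloc_loop(void *arg)
{
    (void) arg;
    for (;;) {
        void *p = malloc(4096);
        free(p);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, alloc_loop, NULL);

    for (int i = 0; i < 100000; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: only the forking thread survives, but the malloc
             * arena locks are inherited in whatever state they were in.
             * glibc is supposed to reset them across fork(); if that
             * fails (the suspected glibc issue), this malloc() blocks
             * forever and the parent sits in wait4(), the same shape as
             * the stepd stuck in _spawn_job_container above. */
            void *p = malloc(64);
            free(p);
            _exit(0);
        }
        waitpid(pid, NULL, 0);  /* parent waits, as slurmstepd does */
    }
    puts("no deadlock observed in this run");
    return 0;
}

Built with something like 'gcc -pthread repro.c'; on a fixed glibc it should simply run to completion.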

*** This ticket has been marked as a duplicate of ticket 4733 ***