Ticket 5121

Summary: slurmstepd hangs for no apparent reason
Product: Slurm
Reporter: Ryan Cox <ryan_cox>
Component: slurmstepd
Assignee: Tim Wickberg <tim>
Status: RESOLVED DUPLICATE
Severity: 3 - Medium Impact
Version: 17.11.5
Hardware: Linux
OS: Linux
Site: BYU - Brigham Young University

Description Ryan Cox 2018-05-01 17:09:30 MDT
We've been seeing a lot of job steps that appear to be deadlocked, and we're not sure why or how long it has been going on.  We have been fighting occasional network issues caused by a certain vendor's firmware, so for a while we chalked it up to that.

We just noticed that it doesn't seem to be related: scontrol listpids hangs and the nodes are unreachable by slurmctld.  Here is a gdb session from a seemingly deadlocked slurmstepd (attached with -p):
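(For reference, that looks roughly like this, with 25473 being the stepd PID from the ps output further down:

    # gdb -p 25473
    (gdb) thread apply all bt
)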
(gdb) thread apply all bt

Thread 5 (Thread 0x7f2e34638700 (LWP 17776)):
#0  0x00007f2e36688334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f2e366835d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007f2e366834a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f2e3735dda0 in log_msg (level=level@entry=LOG_LEVEL_DEBUG2, fmt=fmt@entry=0x42d828 "step_terminate_monitor will run for %d secs", args=args@entry=0x7f2e34637d48) at log.c:960
#4  0x00007f2e3735f366 in debug2 (fmt=fmt@entry=0x42d828 "step_terminate_monitor will run for %d secs") at log.c:1192
#5  0x000000000041e7cc in _monitor (arg=0x13275c0) at step_terminate_monitor.c:117
#6  0x00007f2e36681aa1 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f2e363cebcd in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f2e2f6f3700 (LWP 17771)):
#0  0x00007f2e363de32e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007f2e36362da0 in _L_lock_4044 () from /lib64/libc.so.6
#2  0x00007f2e3635db40 in arena_get2 () from /lib64/libc.so.6
#3  0x00007f2e36360a9e in malloc () from /lib64/libc.so.6
#4  0x00007f2e37411727 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false, file=file@entry=0x7f2e37440f92 "pack.c", line=line@entry=155, func=func@entry=0x7f2e3742b428 <__func__.8630> "init_buf") at xmalloc.c:86
#5  0x00007f2e37366590 in init_buf (size=16384) at pack.c:155
#6  0x000000000041c8ed in _handle_accept (arg=0x0) at req.c:384
#7  0x00007f2e36681aa1 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f2e363cebcd in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f2e2f7f4700 (LWP 25477)):
#0  0x00007f2e363de32e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007f2e36362da0 in _L_lock_4044 () from /lib64/libc.so.6
#2  0x00007f2e3635db40 in arena_get2 () from /lib64/libc.so.6
#3  0x00007f2e3636080d in calloc () from /lib64/libc.so.6
#4  0x00007f2e3741171d in slurm_xmalloc (size=30, clear=clear@entry=true, file=file@entry=0x7f2e374465ff "xstring.c", line=line@entry=104, func=func@entry=0x7f2e374466c8 <__func__.5549> "makespace") at xmalloc.c:84
#5  0x00007f2e37411b10 in makespace (str=str@entry=0x7f2e2f7f3b88, needed=<optimized out>) at xstring.c:104
#6  0x00007f2e37411cdc in _xstrcat (str1=str1@entry=0x7f2e2f7f3b88, str2=<optimized out>, str2@entry=0x7f2e34029269 "%s: oom-kill event count: %lu") at xstring.c:135
#7  0x00007f2e3735da44 in vxstrfmt (fmt=fmt@entry=0x7f2e34029269 "%s: oom-kill event count: %lu", ap=ap@entry=0x7f2e2f7f3d98) at log.c:828
#8  0x00007f2e3735e01e in log_msg (level=level@entry=LOG_LEVEL_INFO, fmt=fmt@entry=0x7f2e34029269 "%s: oom-kill event count: %lu", args=args@entry=0x7f2e2f7f3d98) at log.c:1039
#9  0x00007f2e3735f192 in info (fmt=fmt@entry=0x7f2e34029269 "%s: oom-kill event count: %lu") at log.c:1165
#10 0x00007f2e34026140 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:489
#11 0x00007f2e36681aa1 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f2e363cebcd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f2e34537700 (LWP 25476)):
#0  0x00007f2e36688334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f2e366835d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007f2e366834a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f2e3735dda0 in log_msg (level=level@entry=LOG_LEVEL_DEBUG3, fmt=fmt@entry=0x42be1b "Called _msg_socket_accept", args=args@entry=0x7f2e34536c48) at log.c:960
#4  0x00007f2e3735f402 in debug3 (fmt=fmt@entry=0x42be1b "Called _msg_socket_accept") at log.c:1201
#5  0x0000000000416f8b in _msg_socket_accept (obj=0x133eb00, objs=<optimized out>) at req.c:321
#6  0x00007f2e37417a18 in _poll_handle_event (objList=0x13384a0, obj=0x133eb00, revents=1) at eio.c:495
#7  _poll_dispatch (objList=0x13384a0, map=0x7f2e30000900, nfds=1, pfds=0x7f2e300008d0) at eio.c:447
#8  eio_handle_mainloop (eio=0x133ccd0) at eio.c:335
#9  0x000000000041cd1c in _msg_thr_internal (job_arg=0x13275c0) at req.c:245
#10 0x00007f2e36681aa1 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f2e363cebcd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f2e37884700 (LWP 25473)):
#0  0x00007f2e36688334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f2e366835d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007f2e366834a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f2e3735dda0 in log_msg (level=level@entry=LOG_LEVEL_DEBUG3, fmt=fmt@entry=0x42f978 "%s: parameter '%s' set to '%s' for '%s'", args=args@entry=0x7ffe0dce7588) at log.c:960
#4  0x00007f2e3735f402 in debug3 (fmt=fmt@entry=0x42f978 "%s: parameter '%s' set to '%s' for '%s'") at log.c:1201
#5  0x0000000000427fee in xcgroup_set_param (cg=cg@entry=0x7f2e2fffc400 <step_freezer_cg>, param=param@entry=0x7f2e2fdfb5a9 "freezer.state", content=<optimized out>, content@entry=0x7f2e2fdfb5b7 "THAWED") at xcgroup.c:579
#6  0x00007f2e2fdfb034 in _slurm_cgroup_resume (id=id@entry=25473) at proctrack_cgroup.c:358
#7  0x00007f2e2fdfb368 in proctrack_p_signal (id=25473, signal=9) at proctrack_cgroup.c:523
#8  0x0000000000420f73 in proctrack_g_signal (cont_id=25473, signal=signal@entry=9) at proctrack.c:391
#9  0x0000000000410057 in _spawn_job_container (job=0x13275c0) at mgr.c:1131
#10 job_manager (job=job@entry=0x13275c0) at mgr.c:1216
#11 0x000000000040c856 in main (argc=1, argv=0x7ffe0dce8a38) at slurmstepd.c:172

These are the last lines in the log:
[2018-04-23T10:58:27.618] Launching batch job 24531145 for UID 22916
[2018-04-23T10:58:27.626] [24531144.batch] task/cgroup: /slurm/uid_22916/job_24531144: alloc=16384MB mem.limit=16384MB memsw.limit=17203MB
[2018-04-23T10:58:27.626] [24531144.batch] task/cgroup: /slurm/uid_22916/job_24531144/step_batch: alloc=16384MB mem.limit=16384MB memsw.limit=17203MB
[2018-04-23T10:58:27.643] [24531146.batch] task/cgroup: /slurm/uid_22916/job_24531146: alloc=16384MB mem.limit=16384MB memsw.limit=17203MB
[2018-04-23T10:58:27.643] [24531146.batch] task/cgroup: /slurm/uid_22916/job_24531146/step_batch: alloc=16384MB mem.limit=16384MB memsw.limit=17203MB
[2018-04-23T10:58:27.662] [24531145.batch] task/cgroup: /slurm/uid_22916/job_24531145: alloc=16384MB mem.limit=16384MB memsw.limit=17203MB
[2018-04-23T10:58:27.662] [24531145.batch] task/cgroup: /slurm/uid_22916/job_24531145/step_batch: alloc=16384MB mem.limit=16384MB memsw.limit=17203MB
[2018-04-23T11:26:28.957] [24531144.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2018-04-23T11:26:30.349] [24531144.batch] done with job
[2018-04-23T11:26:30.453] [24531144.extern] _oom_event_monitor: oom-kill event count: 1
[2018-04-23T11:26:30.607] [24531144.extern] done with job
[2018-04-23T11:26:58.224] [24531145.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2018-04-23T11:26:58.229] [24531145.batch] done with job
[2018-04-23T11:26:58.331] [24531145.extern] _oom_event_monitor: oom-kill event count: 1
[2018-04-23T11:26:58.481] [24531145.extern] done with job
[2018-04-23T11:27:05.720] [24531146.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2018-04-23T11:27:05.725] [24531146.batch] done with job
[2018-04-23T11:27:05.828] [24531146.extern] _oom_event_monitor: oom-kill event count: 1
[2018-04-23T11:27:05.981] [24531146.extern] done with job
[2018-04-23T11:28:04.594] [24531143.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2018-04-23T11:28:04.598] [24531143.batch] done with job
[2018-04-23T11:28:04.710] [24531143.extern] _oom_event_monitor: oom-kill event count: 1
[2018-04-23T11:28:05.039] [24531143.extern] done with job
[2018-04-23T12:16:45.352] _run_prolog: run job script took usec=3732
[2018-04-23T12:16:45.352] _run_prolog: prolog with lock for job 24532061 ran for 0 seconds
[2018-04-23T12:16:45.377] [24532061.extern] task/cgroup: /slurm/uid_21131/job_24532061: alloc=65536MB mem.limit=64254MB memsw.limit=67530MB
[2018-04-23T12:16:45.377] [24532061.extern] task/cgroup: /slurm/uid_21131/job_24532061/step_extern: alloc=65536MB mem.limit=64254MB memsw.limit=67530MB
[2018-04-23T12:16:45.403] Launching batch job 24532061 for UID 21131
[2018-04-23T12:16:45.434] [24532061.batch] task/cgroup: /slurm/uid_21131/job_24532061: alloc=65536MB mem.limit=64254MB memsw.limit=67530MB
[2018-04-23T12:16:45.434] [24532061.batch] task/cgroup: /slurm/uid_21131/job_24532061/step_batch: alloc=65536MB mem.limit=64254MB memsw.limit=67530MB
[2018-04-23T15:14:26.920] [24532061.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2018-04-23T15:14:28.161] [24532061.batch] done with job
[2018-04-27T11:11:31.000] error: Timeout waiting for completion of 28 threads

# ps -ef |grep slurm
root      4291 31363  0 17:06 pts/0    00:00:00 grep slurm
root     22521     1  0 Apr09 ?        00:00:03 slurmd
root     25473     1  0 Apr23 ?        00:00:00 slurmstepd: [24532061.extern]
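In case it helps anyone looking at this: the owner of the logging mutex that threads 1, 2, and 5 are queued on can be read straight out of gdb (a sketch; this assumes debuginfo is loaded and that the lock taken at log.c:960 is the static log_lock in log.c):

    (gdb) thread 5
    (gdb) frame 3                          # the log_msg frame at log.c:960
    (gdb) print log_lock.__data.__owner    # LWP id of the thread holding it
    (gdb) info threads                     # match that LWP to a thread

Given the traces, the owner ought to be thread 3, which took the lock in log_msg and is now stuck under vxstrfmt in glibc's malloc.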

Another node has:
(gdb) thread apply all bt

Thread 4 (Thread 0x7fa490cc0700 (LWP 9199)):
#0  0x00007fa49378832e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007fa4937d156a in _L_lock_227 () from /lib64/libc.so.6
#2  0x00007fa4937d150d in arena_thread_freeres () from /lib64/libc.so.6
#3  0x00007fa4937d15b2 in __libc_thread_freeres () from /lib64/libc.so.6
#4  0x00007fa493a2bbc8 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fa493778bcd in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fa490dc1700 (LWP 17595)):
#0  0x00007fa49378832e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007fa49370ce90 in _L_lock_4559 () from /lib64/libc.so.6
#2  0x00007fa493708081 in arena_get2 () from /lib64/libc.so.6
#3  0x00007fa49370a80d in calloc () from /lib64/libc.so.6
#4  0x00007fa4947bb71d in slurm_xmalloc (size=30, clear=clear@entry=true, file=file@entry=0x7fa4947f05ff "xstring.c", line=line@entry=104,
    func=func@entry=0x7fa4947f06c8 <__func__.5549> "makespace") at xmalloc.c:84
#5  0x00007fa4947bbb10 in makespace (str=str@entry=0x7fa490dc0b88, needed=<optimized out>) at xstring.c:104
#6  0x00007fa4947bbcdc in _xstrcat (str1=str1@entry=0x7fa490dc0b88, str2=<optimized out>, str2@entry=0x7fa4915d4269 "%s: oom-kill event count: %lu")
    at xstring.c:135
#7  0x00007fa494707a44 in vxstrfmt (fmt=fmt@entry=0x7fa4915d4269 "%s: oom-kill event count: %lu", ap=ap@entry=0x7fa490dc0d98) at log.c:828
#8  0x00007fa49470801e in log_msg (level=level@entry=LOG_LEVEL_INFO, fmt=fmt@entry=0x7fa4915d4269 "%s: oom-kill event count: %lu", args=args@entry=0x7fa490dc0d98)
    at log.c:1039
#9  0x00007fa494709192 in info (fmt=fmt@entry=0x7fa4915d4269 "%s: oom-kill event count: %lu") at log.c:1165
#10 0x00007fa4915d1140 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:489
#11 0x00007fa493a2baa1 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fa493778bcd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fa4918e1700 (LWP 17594)):
#0  0x00007fa493a32334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fa493a2d5d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007fa493a2d4a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fa494707da0 in log_msg (level=level@entry=LOG_LEVEL_DEBUG3, fmt=fmt@entry=0x42be1b "Called _msg_socket_accept", args=args@entry=0x7fa4918e0c48)
    at log.c:960
#4  0x00007fa494709402 in debug3 (fmt=fmt@entry=0x42be1b "Called _msg_socket_accept") at log.c:1201
#5  0x0000000000416f8b in _msg_socket_accept (obj=0xeed380, objs=<optimized out>) at req.c:321
#6  0x00007fa4947c1a18 in _poll_handle_event (objList=0xee6e70, obj=0xeed380, revents=1) at eio.c:495
#7  _poll_dispatch (objList=0xee6e70, map=0x7fa48c000900, nfds=1, pfds=0x7fa48c0008d0) at eio.c:447
#8  eio_handle_mainloop (eio=0xeeb530) at eio.c:335
#9  0x000000000041cd1c in _msg_thr_internal (job_arg=0xed5f90) at req.c:245
#10 0x00007fa493a2baa1 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fa493778bcd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fa494c2e700 (LWP 17591)):
#0  0x00007fa493a32334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fa493a2d5d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00007fa493a2d4a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fa494707da0 in log_msg (level=level@entry=LOG_LEVEL_DEBUG2, fmt=fmt@entry=0x42eb98 "%s: rmdir(%s): %m", args=args@entry=0x7fff79d24668) at log.c:960
#4  0x00007fa494709366 in debug2 (fmt=fmt@entry=0x42eb98 "%s: rmdir(%s): %m") at log.c:1192
#5  0x0000000000427085 in xcgroup_delete (cg=cg@entry=0x7fa491beed40 <user_memory_cg>) at xcgroup.c:493
#6  0x00007fa4919e69f6 in jobacct_gather_cgroup_memory_fini (slurm_cgroup_conf=slurm_cgroup_conf@entry=0x7fa491beab20 <slurm_cgroup_conf>)
    at jobacct_gather_cgroup_memory.c:155
#7  0x00007fa4919e5d3f in fini () at jobacct_gather_cgroup.c:261
#8  0x00007fa494716f6d in plugin_unload (plug=0xeeaf90) at plugin.c:305
#9  0x00007fa494717383 in plugin_context_destroy (c=0xeea290) at plugin.c:477
#10 0x00007fa4947344fe in jobacct_gather_fini () at slurm_jobacct_gather.c:335
#11 0x00007fa49472caab in acct_gather_profile_fini () at slurm_acct_gather_profile.c:252
#12 0x0000000000410031 in _spawn_job_container (job=0xed5f90) at mgr.c:1124
#13 job_manager (job=job@entry=0xed5f90) at mgr.c:1216
#14 0x000000000040c856 in main (argc=1, argv=0x7fff79d25b08) at slurmstepd.c:172


# ps -ef |grep slurm
root     10911 10171  0 17:06 pts/0    00:00:00 grep slurm
root     17591     1  0 Apr21 ?        00:00:00 slurmstepd: [24521908.extern]
root     17598     1  0 Apr21 ?        00:00:00 slurmstepd: [24521908.batch]
root     18220     1  0 Mar31 ?        00:00:07 slurmd

The last log lines on that node:
[2018-04-21T12:46:47.310] [24521855.batch] task/cgroup: /slurm/uid_20449/job_24521855/step_batch: alloc=24576MB mem.limit=24576MB memsw.limit=25804MB
[2018-04-21T12:46:47.733] launch task 24521855.0 request from 20449.20482@192.168.216.21 (port 33411)
[2018-04-21T12:46:47.759] [24521855.0] task/cgroup: /slurm/uid_20449/job_24521855: alloc=24576MB mem.limit=24576MB memsw.limit=25804MB
[2018-04-21T12:46:47.759] [24521855.0] task/cgroup: /slurm/uid_20449/job_24521855/step_0: alloc=24576MB mem.limit=24576MB memsw.limit=25804MB
[2018-04-21T12:47:05.356] [24521855.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2018-04-21T12:47:06.713] [24521855.0] done with job
[2018-04-21T12:47:06.713] [24521855.batch] done with job
[2018-04-21T12:47:06.792] [24521855.extern] _oom_event_monitor: oom-kill event count: 1
[2018-04-21T12:47:06.876] [24521855.extern] done with job
[2018-04-21T14:09:29.067] _run_prolog: run job script took usec=3745
[2018-04-21T14:09:29.067] _run_prolog: prolog with lock for job 24521908 ran for 0 seconds
[2018-04-21T14:09:29.095] [24521908.extern] task/cgroup: /slurm/uid_20631/job_24521908: alloc=32768MB mem.limit=31934MB memsw.limit=33572MB
[2018-04-21T14:09:29.095] [24521908.extern] task/cgroup: /slurm/uid_20631/job_24521908/step_extern: alloc=32768MB mem.limit=31934MB memsw.limit=33572MB
[2018-04-21T14:09:29.125] Launching batch job 24521908 for UID 20631
[2018-04-21T14:09:29.154] [24521908.batch] task/cgroup: /slurm/uid_20631/job_24521908: alloc=32768MB mem.limit=31934MB memsw.limit=33572MB
[2018-04-21T14:09:29.154] [24521908.batch] task/cgroup: /slurm/uid_20631/job_24521908/step_batch: alloc=32768MB mem.limit=31934MB memsw.limit=33572MB
[2018-04-21T14:09:31.563] launch task 24521908.0 request from 20631.20703@192.168.216.21 (port 65156)
[2018-04-21T14:09:31.590] [24521908.0] task/cgroup: /slurm/uid_20631/job_24521908: alloc=32768MB mem.limit=31934MB memsw.limit=33572MB
[2018-04-21T14:09:31.590] [24521908.0] task/cgroup: /slurm/uid_20631/job_24521908/step_0: alloc=32768MB mem.limit=31934MB memsw.limit=33572MB
[2018-04-21T17:10:01.157] [24521908.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2018-04-21T17:10:02.493] [24521908.0] done with job
[2018-04-22T18:18:21.776] error: forward_thread to m7-20-15: Connection refused
[2018-04-27T11:11:31.008] error: Timeout waiting for completion of 43 threads

Any ideas?
Comment 1 Tim Wickberg 2018-05-01 17:33:27 MDT
Hey guys -

This is a duplicate of what we're now tracking in bug 5103. There is a patch attached there (marked obsolete, as the corresponding commits have since landed on the 17.11 branch) that should prevent this.

We expect to have 17.11.6 out soon to address this. It does seem like some recent RHEL glibc changes exposed this unsafe thread behavior and made it a lot more likely to deadlock.
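To illustrate the shape of the problem (a minimal sketch, not the actual stepd code): the logging path serializes on a single mutex and allocates while holding it, so if the holder ever blocks inside glibc's malloc -- the arena locks in your traces -- every thread that subsequently tries to log piles up behind it.

    /* Illustrative sketch only -- not Slurm source. */
    #include <pthread.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>

    static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

    static void log_msg(const char *fmt, ...)
    {
        va_list ap;
        char *buf;

        pthread_mutex_lock(&log_lock);  /* threads 1, 2 and 5 block here */

        buf = malloc(1024);             /* thread 3 is wedged in here, on
                                         * a glibc arena lock, while still
                                         * holding log_lock */
        if (buf) {
            va_start(ap, fmt);
            vsnprintf(buf, 1024, fmt, ap);
            va_end(ap);
            fputs(buf, stderr);
            free(buf);
        }
        pthread_mutex_unlock(&log_lock);
    }

    int main(void)
    {
        /* Harmless normally; wedges the whole process if the malloc()
         * above can never return. */
        log_msg("%s: oom-kill event count: %lu\n", "_oom_event_monitor", 1UL);
        return 0;
    }

That's also why scontrol listpids hangs: the stepd's message thread (thread 2, sitting in _msg_socket_accept -> debug3) is one of the threads waiting on that lock, so the request is never accepted.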

- Tim

*** This ticket has been marked as a duplicate of ticket 5103 ***
Comment 2 Ryan Cox 2018-05-02 08:43:41 MDT
We're actually still on RHEL6 for a few more months.  We might apply that patch today.

Thanks.