Ticket 4733 - Jobs occasionally stuck in CG state: slurmstepd for .0 on one node does not exit when MPI job does
Status: RESOLVED DUPLICATE of ticket 5103
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 17.11.2
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Alejandro Sanchez
Duplicates: 5062 5091
 
Reported: 2018-02-04 22:04 MST by Christopher Samuel
Modified: 2018-04-26 11:01 MDT

Site: Swinburne


Attachments
slurmstepd stack trace (15.25 KB, text/x-log), 2018-02-04 22:04 MST, Christopher Samuel
ompi-1.10.7-lockup.tar.gz (30.97 KB, application/gzip), 2018-02-20 17:48 MST, Christopher Samuel
slurmstepd.gdb-full (140.41 KB, text/plain), 2018-02-24 04:21 MST, Christopher Samuel
slurmstepd.0.gdb (4.62 KB, text/plain), 2018-02-25 21:13 MST, Christopher Samuel
slurmstepd.0.pstack (2.56 KB, text/plain), 2018-02-25 21:13 MST, Christopher Samuel
slurmstepd-gdb-bt-with-debuginfo.txt (196.23 KB, text/plain), 2018-02-27 20:44 MST, Christopher Samuel
stack-slurmd.log (269.82 KB, text/x-log), 2018-02-27 20:59 MST, Christopher Samuel
stack-slurmstepd.log (246.67 KB, text/x-log), 2018-02-27 20:59 MST, Christopher Samuel

Description Christopher Samuel 2018-02-04 22:04:09 MST
Created attachment 6064 [details]
slurmstepd stack trace

Hi there,

We are seeing simple test jobs getting stuck in the 'CG' state: the MPI tasks exit and the user's script completes, but the .0 step does not clean up on one (random) node.

The script in question does:

--------------------8< snip snip 8<--------------------
#!/bin/bash
#SBATCH --nodes=96
#SBATCH --ntasks-per-node=36
#SBATCH --mem=180gb
#SBATCH --time=00:10:00

#module load iomkl/2016.2
module load intel/2016.2.181-gcc-6.4.0 openmpi/3.0.0

mpirun xhpl
--------------------8< snip snip 8<--------------------

Looking at the affected nodes I see no processes by the user there, pstree -a shows:

  |-slurmd
  |   `-17*[{slurmd}]
  |-slurmstepd         
  |   |-sleep 1000000
  |   `-3*[{slurmstepd}]
  |-slurmstepd              
  |   `-20*[{slurmstepd}]

ps shows:

root     153087  0.0  0.0 1353580 3232 ?        Sl   11:30   0:00 /apps/slurm/latest/sbin/slurmd
root     168684  0.0  0.0 303036  4524 ?        Sl   15:22   0:00 slurmstepd: [12212.extern]
root     168693  0.0  0.0 1492292 4992 ?        Sl   15:22   0:00 slurmstepd: [12212.0]

pstack trace attached due to length.

The last syslog messages from the node (where slurmd log is going) are:

Feb  5 15:22:15 john104 slurmstepd[168693]: task/cgroup: /slurm/uid_1040/job_12212: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb  5 15:22:15 john104 slurmstepd[168693]: task/cgroup: /slurm/uid_1040/job_12212/step_0: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB

squeue shows:

# squeue -j 12212
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             12212   skylake hpl.slur  rhumble CG       3:42      1 john104

and:

# squeue -j 12212 -s
         STEPID     NAME PARTITION     USER      TIME NODELIST
        12212.0    orted   skylake  rhumble     34:35 john[7-44,46-52,54-55,58-73,75-98,100-107]
   12212.Extern   extern   skylake  rhumble     34:36 john[6-44,46-52,54-55,58-73,75-98,100-107]

So you can see it's just one node that's left in this odd state.


Looking at wchan with ps:

[root@john104 ~]# ps -Ao pid,cmd,wchan | grep slurmstepd
168684 slurmstepd: [12212.extern]  do_wait
168693 slurmstepd: [12212.0]       futex_wait_queue_me
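
(For completeness, the above is gathered with commands along these lines; pstack comes from the gdb package on RHEL7, and the process-name pattern is just illustrative.)

--------------------8< snip snip 8<--------------------
#!/bin/bash
# Illustrative collection script: dump the kernel wait channel and a
# user-space stack for every slurmstepd still running on the node.
for pid in $(pgrep -f '^slurmstepd:'); do
    echo "=== PID $pid: $(ps -o cmd= -p "$pid") ==="
    echo "wchan: $(ps -o wchan= -p "$pid")"
    pstack "$pid"
done
--------------------8< snip snip 8<--------------------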


I pulled in commit d2c838070 (mentioned in bug 4718) before realising that this issue is subtly different, in that it's not a problem with the extern job step; it happens both with and without that commit.

If I kill -9 the .0 slurmstepd then slurmd will clean the job up correctly, but just a normal kill doesn't touch it.
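
For reference, the manual workaround amounts to the following (job ID illustrative):

--------------------8< snip snip 8<--------------------
# SIGKILL the wedged .0 step daemon; slurmd then finishes the job cleanup.
pkill -9 -f 'slurmstepd: \[12212\.0\]'
--------------------8< snip snip 8<--------------------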

All the best,
Chris
Comment 1 Alejandro Sanchez 2018-02-05 03:22:05 MST
Hi Chris,

Besides the mentioned commit:

https://github.com/SchedMD/slurm/commit/d2c8380709

There's another patch pending to be reviewed and checked in, tracked in bug:

https://bugs.schedmd.com/show_bug.cgi?id=4718

which I believe would solve this too. That patch will most likely be included in 17.11.3, which should be released early this week.

I'll discuss it internally, though, just in case the issue reported here is different from the one solved there. Thanks.
Comment 3 Christopher Samuel 2018-02-05 04:35:07 MST
On 05/02/18 21:22, bugs@schedmd.com wrote:

> I'll discuss it internally though just in case the issue reported
> here is different to what is solved there. Thanks.

Thanks Alejandro, look forward to hearing back on this!
Comment 5 Alejandro Sanchez 2018-02-05 08:58:18 MST
Christopher, we've internally discussed this and we believe this should be fixed in 17.11.3. 

Would you mind testing that version (when released) to see whether the steps still get stuck, and confirming here whether it is fixed? Thanks!
Comment 6 Christopher Samuel 2018-02-05 15:54:57 MST
On 06/02/18 02:58, bugs@schedmd.com wrote:

> Christopher, we've internally discussed this and we believe this
> should be fixed in 17.11.3.

Great, thanks for that.

> Would you mind testing in that version (when released) and see if the
> steps still get stuck or not and confirm here whether it is fixed?
> Thanks!.

Sure! Happy to do that. I'm not working today but will be tomorrow 
(Wednesday).

All the best,
Chris
Comment 7 Alejandro Sanchez 2018-02-07 03:44:05 MST
Christopher, I'm marking this as a duplicate of bug 4718. We are fairly confident the following related commit will fix the issue here. Please reopen after testing if it does not.

https://github.com/SchedMD/slurm/commit/108502e9504

*** This ticket has been marked as a duplicate of ticket 4718 ***
Comment 8 Christopher Samuel 2018-02-07 04:27:15 MST
On 07/02/18 21:44, bugs@schedmd.com wrote:

> Christopher, I'm marking this as duplicate of bug 4718 We are fairly 
> confident following related commit will fix issue here. Please reopen
> otherwise after testing.

Sorry - please reopen.  We are still seeing the problem with 17.11.3-2.

     14200   skylake hpl.slur  rhumble CG       0:35      1 john11

[one of many]

[root@john11 ~]# ps auxww | fgrep slurm
root     351901  0.0  0.0 355000  1844 ?        Sl   13:51   0:01 /apps/slurm/latest/sbin/slurmd
root     413204  0.0  0.0 306140  1536 ?        Sl   21:14   0:00 slurmstepd: [14200.extern]
root     413214  0.0  0.0 430408  1592 ?        Sl   21:14   0:00 slurmstepd: [14200.0]

[root@john11 ~]# scontrol show slurmd
scontrol: error: slurm_slurmd_info: Socket timed out on send/recv operation
slurm_load_slurmd_status: Socket timed out on send/recv operation

pstree -a shows:

   ├─slurmd
   │   └─4*[{slurmd}]
   ├─slurmstepd
   │   ├─sleep 1000000
   │   └─4*[{slurmstepd}]
   ├─slurmstepd
   │   └─7*[{slurmstepd}]

slurmd stack according to pstack:

[root@john11 ~]# pstack 351901
Thread 5 (Thread 0x2b50d9cb3700 (LWP 413551)):
#0  0x00002b50dabe170d in read () from /usr/lib64/libpthread.so.0
#1  0x00002b50da52007e in stepd_connect (directory=<optimized out>,
nodename=<optimized out>, jobid=jobid@entry=14200,
stepid=stepid@entry=0,
protocol_version=protocol_version@entry=0x2b50d9cb2bde) at stepd_api.c:281
#2  0x0000000000415e0e in _signal_jobstep (jobid=jobid@entry=14200,
stepid=0, signal=signal@entry=18, flags=flags@entry=0) at req.c:3217
#3  0x0000000000415f13 in _kill_all_active_steps (jobid=14200,
sig=sig@entry=18, flags=flags@entry=0, batch=batch@entry=true) at req.c:4467
#4  0x0000000000417d0a in _rpc_terminate_job
(msg=msg@entry=0x2b50e0018520) at req.c:5414
#5  0x00000000004197ac in slurmd_req (msg=msg@entry=0x2b50e0018520) at
req.c:355
#6  0x000000000040c671 in _service_connection (arg=<optimized out>) at
slurmd.c:544
#7  0x00002b50dabdae25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00002b50daee734d in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x2b50de069700 (LWP 414724)):
#0  0x00002b50dabe170d in read () from /usr/lib64/libpthread.so.0
#1  0x00002b50da52007e in stepd_connect (directory=<optimized out>,
nodename=<optimized out>, jobid=14200, stepid=0,
protocol_version=protocol_version@entry=0x2b50e40019f8) at stepd_api.c:281
#2  0x000000000040d9ac in _fill_registration_msg (msg=0x2b50e400d930) at
slurmd.c:688
#3  send_registration_msg (status=status@entry=0,
startup=startup@entry=true) at slurmd.c:565
#4  0x000000000041999c in slurmd_req (msg=msg@entry=0x2b50e400d890) at
req.c:386
#5  0x000000000040c671 in _service_connection (arg=<optimized out>) at
slurmd.c:544
#6  0x00002b50dabdae25 in start_thread () from /usr/lib64/libpthread.so.0
#7  0x00002b50daee734d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x2b50df565700 (LWP 416434)):
#0  0x00002b50dabe170d in read () from /usr/lib64/libpthread.so.0
#1  0x00002b50da52007e in stepd_connect (directory=<optimized out>,
nodename=<optimized out>, jobid=14200, stepid=0,
protocol_version=protocol_version@entry=0x2b50e80112f8) at stepd_api.c:281
#2  0x000000000040d9ac in _fill_registration_msg (msg=0x2b50e80008d0) at
slurmd.c:688
#3  send_registration_msg (status=status@entry=0,
startup=startup@entry=true) at slurmd.c:565
#4  0x000000000041999c in slurmd_req (msg=msg@entry=0x2b50e8000ae0) at
req.c:386
#5  0x000000000040c671 in _service_connection (arg=<optimized out>) at
slurmd.c:544
#6  0x00002b50dabdae25 in start_thread () from /usr/lib64/libpthread.so.0
#7  0x00002b50daee734d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x2b50df666700 (LWP 417058)):
#0  0x00002b50dabe170d in read () from /usr/lib64/libpthread.so.0
#1  0x00002b50da52007e in stepd_connect (directory=<optimized out>,
nodename=<optimized out>, jobid=14200, stepid=0,
protocol_version=protocol_version@entry=0x2b50ec010db8) at stepd_api.c:281
#2  0x000000000041a91f in _get_step_list () at req.c:3475
#3  _rpc_daemon_status (msg=0x2b50ec0008d0) at req.c:3526
#4  slurmd_req (msg=msg@entry=0x2b50ec0008d0) at req.c:426
#5  0x000000000040c671 in _service_connection (arg=<optimized out>) at
slurmd.c:544
#6  0x00002b50dabdae25 in start_thread () from /usr/lib64/libpthread.so.0
#7  0x00002b50daee734d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x2b50d9bafa00 (LWP 351901)):
#0  0x00002b50dabe198d in accept () from /usr/lib64/libpthread.so.0
#1  0x00002b50da4e3197 in slurm_accept_msg_conn (fd=<optimized out>,
addr=<optimized out>) at slurm_protocol_socket_implementation.c:454
#2  0x000000000040f16f in _msg_engine () at slurmd.c:437
#3  main (argc=1, argv=0x7fff09133838) at slurmd.c:373


slurmstepd for .0:

[root@john11 ~]# pstack 413214
Thread 8 (Thread 0x2b836469d700 (LWP 413219)):
#0  0x00002b83600fda3d in poll () from /usr/lib64/libc.so.6
#1  0x00002b835ef10737 in _poll_internal (shutdown_time=<optimized out>, 
nfds=2, pfds=0x2b83680008d0) at eio.c:364
#2  eio_handle_mainloop (eio=0x1fa91f0) at eio.c:328
#3  0x000000000041ca78 in _msg_thr_internal (job_arg=0x1f61850) at req.c:245
#4  0x00002b835fdfbe25 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00002b836010834d in clone () from /usr/lib64/libc.so.6
Thread 7 (Thread 0x2b83657e2700 (LWP 413220)):
#0  0x00002b83600fda3d in poll () from /usr/lib64/libc.so.6
#1  0x00002b835ef10737 in _poll_internal (shutdown_time=<optimized out>, 
nfds=3, pfds=0x2b836c000ac0) at eio.c:364
#2  eio_handle_mainloop (eio=0x2b836c0008d0) at eio.c:328
#3  0x00002b83652cb31c in _agent (unused=<optimized out>) at agent.c:328
#4  0x00002b835fdfbe25 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00002b836010834d in clone () from /usr/lib64/libc.so.6
Thread 6 (Thread 0x2b83659e4700 (LWP 413222)):
#0  0x00002b8360115eec in __lll_lock_wait_private () from 
/usr/lib64/libc.so.6
#1  0x00002b836017760d in _L_lock_27 () from /usr/lib64/libc.so.6
#2  0x00002b83601775bd in arena_thread_freeres () from /usr/lib64/libc.so.6
#3  0x00002b8360177662 in __libc_thread_freeres () from /usr/lib64/libc.so.6
#4  0x00002b835fdfbe38 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00002b836010834d in clone () from /usr/lib64/libc.so.6
Thread 5 (Thread 0x2b835ecce700 (LWP 413553)):
#0  0x00002b8360115eec in __lll_lock_wait_private () from 
/usr/lib64/libc.so.6
#1  0x00002b83600927d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2  0x00002b836008aca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00002b83600900fe in malloc () from /usr/lib64/libc.so.6
#4  0x00002b835ef0a878 in slurm_xmalloc (size=size@entry=24, 
clear=clear@entry=false, file=file@entry=0x2b835ef39992 "pack.c", 
line=line@entry=152, func=func@entry=0x2b835ef240a5 <__func__.7819> 
"init_buf") at xmalloc.c:86
#5  0x00002b835ee60bd0 in init_buf (size=16384) at pack.c:152
#6  0x000000000041c6ab in _handle_accept (arg=0x0) at req.c:384
#7  0x00002b835fdfbe25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00002b836010834d in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x2b836459c700 (LWP 414726)):
#0  0x00002b8360115eec in __lll_lock_wait_private () from 
/usr/lib64/libc.so.6
#1  0x00002b83600927d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2  0x00002b836008aca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00002b83600900fe in malloc () from /usr/lib64/libc.so.6
#4  0x00002b835ef0a878 in slurm_xmalloc (size=size@entry=24, 
clear=clear@entry=false, file=file@entry=0x2b835ef39992 "pack.c", 
line=line@entry=152, func=func@entry=0x2b835ef240a5 <__func__.7819> 
"init_buf") at xmalloc.c:86
#5  0x00002b835ee60bd0 in init_buf (size=16384) at pack.c:152
#6  0x000000000041c6ab in _handle_accept (arg=0x0) at req.c:384
#7  0x00002b835fdfbe25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00002b836010834d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x2b83658e3700 (LWP 416436)):
#0  0x00002b8360115eec in __lll_lock_wait_private () from 
/usr/lib64/libc.so.6
#1  0x00002b83600927d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2  0x00002b836008aca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00002b83600900fe in malloc () from /usr/lib64/libc.so.6
#4  0x00002b835ef0a878 in slurm_xmalloc (size=size@entry=24, 
clear=clear@entry=false, file=file@entry=0x2b835ef39992 "pack.c", 
line=line@entry=152, func=func@entry=0x2b835ef240a5 <__func__.7819> 
"init_buf") at xmalloc.c:86
#5  0x00002b835ee60bd0 in init_buf (size=16384) at pack.c:152
#6  0x000000000041c6ab in _handle_accept (arg=0x0) at req.c:384
#7  0x00002b835fdfbe25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00002b836010834d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x2b8363b7d700 (LWP 417060)):
#0  0x00002b8360115eec in __lll_lock_wait_private () from 
/usr/lib64/libc.so.6
#1  0x00002b83600927d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2  0x00002b836008aca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00002b83600900fe in malloc () from /usr/lib64/libc.so.6
#4  0x00002b835ef0a878 in slurm_xmalloc (size=size@entry=24, 
clear=clear@entry=false, file=file@entry=0x2b835ef39992 "pack.c", 
line=line@entry=152, func=func@entry=0x2b835ef240a5 <__func__.7819> 
"init_buf") at xmalloc.c:86
#5  0x00002b835ee60bd0 in init_buf (size=16384) at pack.c:152
#6  0x000000000041c6ab in _handle_accept (arg=0x0) at req.c:384
#7  0x00002b835fdfbe25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00002b836010834d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x2b835ebcbfc0 (LWP 413214)):
#0  0x00002b835fdfcf57 in pthread_join () from /usr/lib64/libpthread.so.0
#1  0x00000000004106ef in _wait_for_io (job=0x1f61850) at mgr.c:2219
#2  job_manager (job=job@entry=0x1f61850) at mgr.c:1397
#3  0x000000000040c957 in main (argc=1, argv=0x7ffe3331f1f8) at 
slurmstepd.c:172

slurmstepd for the extern step:

Thread 5 (Thread 0x2b36e3df5700 (LWP 413207)):
#0  0x00002b36e4f26945 in pthread_cond_wait@@GLIBC_2.3.2 () from 
/usr/lib64/libpthread.so.0
#1  0x00002b36e3fab028 in _watch_tasks (arg=<optimized out>) at 
slurm_jobacct_gather.c:242
#2  0x00002b36e4f22e25 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00002b36e522f34d in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x2b36e96c3700 (LWP 413208)):
#0  0x00002b36e51f61ad in nanosleep () from /usr/lib64/libc.so.6
#1  0x00002b36e5226ec4 in usleep () from /usr/lib64/libc.so.6
#2  0x00002b36e3fa5039 in _timer_thread (args=<optimized out>) at 
slurm_acct_gather_profile.c:189
#3  0x00002b36e4f22e25 in start_thread () from /usr/lib64/libpthread.so.0
#4  0x00002b36e522f34d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x2b36e97c4700 (LWP 413209)):
#0  0x00002b36e5224a3d in poll () from /usr/lib64/libc.so.6
#1  0x00002b36e4037737 in _poll_internal (shutdown_time=<optimized out>, 
nfds=2, pfds=0x2b36ec0008d0) at eio.c:364
#2  eio_handle_mainloop (eio=0x2297060) at eio.c:328
#3  0x000000000041ca78 in _msg_thr_internal (job_arg=0x2256ca0) at req.c:245
#4  0x00002b36e4f22e25 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00002b36e522f34d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x2b36ea6f2700 (LWP 413210)):
#0  0x00002b36e5224a3d in poll () from /usr/lib64/libc.so.6
#1  0x00002b36e99cd9a2 in _oom_event_monitor (x=<optimized out>) at 
task_cgroup_memory.c:466
#2  0x00002b36e4f22e25 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00002b36e522f34d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x2b36e3cf2fc0 (LWP 413204)):
#0  0x00002b36e51f5e6a in wait4 () from /usr/lib64/libc.so.6
#1  0x000000000040ffe6 in _spawn_job_container (job=0x2256ca0) at mgr.c:1107
#2  job_manager (job=job@entry=0x2256ca0) at mgr.c:1216
#3  0x000000000040c957 in main (argc=1, argv=0x7ffdfe65e528) at 
slurmstepd.c:172


Thanks,
Chris
Comment 9 Alejandro Sanchez 2018-02-07 09:49:04 MST
Would you mind attaching slurmd.log and slurmctld.log so that I can see the messages from a bit before the job started until the current state?
Comment 10 Alejandro Sanchez 2018-02-07 09:50:51 MST
Also, was job 14200 a batch job? An salloc/srun one? Or an incoming ssh session adopted into the job? Can you grab the information related to its submission: the parameters and, if possible, the batch script?
Comment 11 Christopher Samuel 2018-02-07 15:40:10 MST
On 08/02/18 03:50, bugs@schedmd.com wrote:

> Also was job 14200 a batch job? salloc/srun one? or adopted incoming
> ssh? Can you grab the information related to its submission?
> paramters and potentially the batch script.

As per the initial report, these are just batch jobs.

Script for the one in question is:

--------------------8< snip snip 8<--------------------
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=36
#SBATCH --mem=180gb
#SBATCH --time=02:0:00

#module load iomkl/2016.2
module load intel/2016.2.181-gcc-6.4.0 openmpi/3.0.0

mpirun xhpl

--------------------8< snip snip 8<--------------------

OpenMPI appears to have been configured with:

   Configure command line: 
'--prefix=/apps/skylake/software/compiler/intel/2016.2.181-gcc-6.4.0/openmpi/3.0.0' 
'--enable-shared' '--enable-mpi-thread-multiple' '--with-verbs' 
'--enable-mpirun-prefix-by-default' '--with-hwloc=' '--disable-dlopen' 
'--with-slurm=/apps/slurm/latest' '--with-pmi=/apps/slurm/latest' 
'--with-ucx=/apps/skylake/software/core/ucx/1.2.1'


So mpirun launches orted via srun.

Job information with scontrol show job -ddd:

JobId=14200 JobName=hpl.slurm
    UserId=rhumble(1040) GroupId=hpcadmin(10190) MCS_label=N/A
    Priority=18671 Nice=0 Account=hpcadmin QOS=normal
    JobState=COMPLETING Reason=None Dependency=(null)
    Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
    DerivedExitCode=0:0
    RunTime=00:00:35 TimeLimit=02:00:00 TimeMin=N/A
    SubmitTime=2018-02-07T20:18:38 EligibleTime=2018-02-07T20:18:38
    StartTime=2018-02-07T21:14:40 EndTime=2018-02-07T21:15:15 Deadline=N/A
    PreemptTime=None SuspendTime=None SecsPreSuspend=0
    LastSchedEval=2018-02-07T21:14:40
    Partition=skylake AllocNode:Sid=farnarkle2:322324
    ReqNodeList=(null) ExcNodeList=(null)
    NodeList=john11
    BatchHost=john6
    NumNodes=1 NumCPUs=36 NumTasks=288 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    TRES=cpu=288,mem=1440G,node=8,billing=288
    Socks/Node=* NtasksPerN:B:S:C=36:0:*:* CoreSpec=*
      Nodes=john[6-13] CPU_IDs=0-35 Mem=184320 GRES_IDX=
    MinCPUsNode=36 MinMemoryNode=180G MinTmpDiskNode=0
    Features=(null) DelayBoot=00:00:00
    Gres=(null) Reservation=(null)
    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
    Command=/home/rhumble/bench/hpl/8node/hpl.slurm
    WorkDir=/home/rhumble/bench/hpl/8node
    StdErr=/home/rhumble/bench/hpl/8node/slurm-14200.out
    StdIn=/dev/null
    StdOut=/home/rhumble/bench/hpl/8node/slurm-14200.out
    Power=


Slurmctld logs for the job:

[2018-02-07T20:18:38.470] _slurm_rpc_submit_batch_job: JobId=14200 InitPrio=18615 usec=366
[2018-02-07T21:14:40.747] sched: Allocate JobID=14200 NodeList=john[6-13] #CPUs=288 Partition=skylake
[2018-02-07T21:15:15.956] _job_complete: JobID=14200 State=0x1 NodeCnt=8 WEXITSTATUS 0
[2018-02-07T21:15:15.956] _job_complete: JobID=14200 State=0x8003 NodeCnt=8 done
[2018-02-07T21:16:40.883] Resending TERMINATE_JOB request JobId=14200 Nodelist=john11
[2018-02-07T21:17:40.973] Resending TERMINATE_JOB request JobId=14200 Nodelist=john11
[2018-02-07T21:18:40.067] Resending TERMINATE_JOB request JobId=14200 Nodelist=john11
[2018-02-07T21:19:40.672] Resending TERMINATE_JOB request JobId=14200 Nodelist=john11

Slurmd logs (sent to syslog) are:

Feb  7 21:14:40 john11 slurmd[351901]: _run_prolog: run job script took usec=10344
Feb  7 21:14:40 john11 slurmd[351901]: _run_prolog: prolog with lock for job 14200 ran for 0 seconds
Feb  7 21:14:40 john11 slurmstepd[413204]: task/cgroup: /slurm/uid_1040/job_14200: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb  7 21:14:40 john11 slurmstepd[413204]: task/cgroup: /slurm/uid_1040/job_14200/step_extern: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb  7 21:14:41 john11 slurmd[351901]: launch task 14200.0 request from 1040.10190@192.168.44.106 (port 28894)
Feb  7 21:14:41 john11 slurmstepd[413214]: task/cgroup: /slurm/uid_1040/job_14200: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb  7 21:14:41 john11 slurmstepd[413214]: task/cgroup: /slurm/uid_1040/job_14200/step_0: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB

Just prior to that (no intervening messages) I see:

Feb  7 21:13:40 john11 slurmd[351901]: _run_prolog: run job script took usec=10307
Feb  7 21:13:40 john11 slurmd[351901]: _run_prolog: prolog with lock for job 14187 ran for 0 seconds
Feb  7 21:13:40 john11 slurmstepd[412824]: task/cgroup: /slurm/uid_1040/job_14187: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb  7 21:13:40 john11 slurmstepd[412824]: task/cgroup: /slurm/uid_1040/job_14187/step_extern: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb  7 21:13:40 john11 slurmd[351901]: launch task 14187.0 request from 1040.10190@192.168.44.106 (port 4318)
Feb  7 21:13:40 john11 slurmstepd[412834]: task/cgroup: /slurm/uid_1040/job_14187: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb  7 21:13:40 john11 slurmstepd[412834]: task/cgroup: /slurm/uid_1040/job_14187/step_0: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb  7 21:14:14 john11 slurmstepd[412834]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Feb  7 21:14:14 john11 slurmstepd[412834]: done with job
Feb  7 21:14:15 john11 slurmstepd[412824]: _oom_event_monitor: oom-kill event count: 1
Feb  7 21:14:15 john11 slurmstepd[412824]: done with job

So that's one that cleaned up properly from the look of it.

cheers,
Chris
Comment 12 Christopher Samuel 2018-02-07 16:09:52 MST
Here's the WCHAN info for the Slurm processes, in case that's of use.

[root@john11 ~]# ps -Ao pid,cmd:30,wchan | grep slurm
351901 /apps/slurm/latest/sbin/slurmd inet_csk_accept
413204 slurmstepd: [14200.extern]     do_wait
413214 slurmstepd: [14200.0]          futex_wait_queue_me
Comment 13 Christopher Samuel 2018-02-08 00:05:42 MST
On 08/02/18 09:40, Christopher Samuel wrote:

> OpenMPI appears to have been configured with:
> 
>    Configure command line: 
> '--prefix=/apps/skylake/software/compiler/intel/2016.2.181-gcc-6.4.0/openmpi/3.0.0' 
> '--enable-shared' '--enable-mpi-thread-multiple' '--with-verbs' 
> '--enable-mpirun-prefix-by-default' '--with-hwloc=' '--disable-dlopen' 
> '--with-slurm=/apps/slurm/latest' '--with-pmi=/apps/slurm/latest' 
> '--with-ucx=/apps/skylake/software/core/ucx/1.2.1'

On a hunch I requested a duplicate build with the only change being to
drop the final --with-ucx=[...] build option.

Removing that from OpenMPI seems to make the issue go away
when we run the same binary with the different stack.

Now *why* enabling UCX in OpenMPI would cause slurmstepd to
occasionally get hung up is another question and one I don't
have any idea about.

So we can work around it with this, but it does still appear
to be an issue with Slurm if it can't clear up after these.

Hope this helps your investigation!

All the best,
Chris
Comment 14 Alejandro Sanchez 2018-02-08 04:28:41 MST
(In reply to Christopher Samuel from comment #13)
> On 08/02/18 09:40, Christopher Samuel wrote:
> 
> > OpenMPI appears to have been configured with:
> > 
> >    Configure command line: 
> > '--prefix=/apps/skylake/software/compiler/intel/2016.2.181-gcc-6.4.0/openmpi/3.0.0' 
> > '--enable-shared' '--enable-mpi-thread-multiple' '--with-verbs' 
> > '--enable-mpirun-prefix-by-default' '--with-hwloc=' '--disable-dlopen' 
> > '--with-slurm=/apps/slurm/latest' '--with-pmi=/apps/slurm/latest' 
> > '--with-ucx=/apps/skylake/software/core/ucx/1.2.1'
> 
> On a hunch I requested a duplicate build with the only change being to
> drop the final --with-ucx=[...] build option.
> 
> Removing that from OpenMPI seems to make the issue go away
> when we run the same binary with the different stack.
> 
> Now *why* enabling UCX in OpenMPI would cause slurmstepd to
> occasionally get hung up is another question and one I don't
> have any idea about.
> 
> So we can work around it with this, but it does still appear
> to be an issue with Slurm if it can't clear up after these.
> 
> Hope this helps your investigation!
> 
> All the best,
> Chris

Interesting. Are you using pmi1, pmi2 or pmix?

There's a commit adding notes to the Slurm mpi guide webpage:

https://slurm.schedmd.com/mpi_guide.html

on how to build Slurm with PMIx, but it seems the changes have not yet been pushed to the website. I'll ask a colleague to refresh the site so you can read them there instead of from the commit patches:

https://github.com/SchedMD/slurm/commit/34209c471a29aeb5cf44e35
and continuation commit:
https://github.com/SchedMD/slurm/commit/baa4f239eb6e732321bddb8

There's also a very useful how-to documented by the PMIx developers themselves that can be found here:

https://pmix.org/support/how-to/slurm-support/

This document includes a "Using the UCX Extension".

I'm wondering if you built all the software pieces so that they are compatible with each other and with the right config options. Did you configure Slurm with --with-ucx=path-to-UCX-installation?
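
For illustration, something along these lines (the paths are placeholders, and PMIx support additionally needs --with-pmix pointing at a compatible PMIx installation):

--------------------8< snip snip 8<--------------------
# Placeholder paths; adjust for your site.
./configure --prefix=/apps/slurm/17.11.3 \
            --with-pmix=/path/to/pmix \
            --with-ucx=/path/to/ucx
make -j && make install
--------------------8< snip snip 8<--------------------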

There's also a set of SLURM_ environment variables that enable Direct-connect and Early-wireup, and that can specify the UCX network devices, which come into play here.

Would you mind reviewing all of these guidelines' recommendations and procedures and comparing them against the way you built Slurm, PMI[1|2|x] and OpenMPI?

Thanks.
Comment 16 Alejandro Sanchez 2018-02-08 10:15:54 MST
Also, are you using the same compiler to build the different software components? And which version of libc are you using? Thanks!
Comment 18 Alejandro Sanchez 2018-02-09 08:41:11 MST
Christopher, we're interested in your glibc version since similar stack traces have been reported to us in Bug 4690, and that issue appears to be caused by a bug in the glibc malloc arena locks, where there seems to be a deadlock between malloc and fork. The glibc bug is described in the following link:

https://bugzilla.redhat.com/show_bug.cgi?id=906468

Fixed In Version:	glibc-2.17-162.el7
Comment 19 Christopher Samuel 2018-02-09 15:23:07 MST
Hi  Alejandro,

On Saturday, 10 February 2018 2:41:11 AM AEDT bugs@schedmd.com wrote:

> Christopher, we're interested in your glibc version since we've been
> reported with similar stack traces in Bug 4690, and it appears to be caused
> by a bug in glibc malloc arena locks, where there seems to be a deadlock
> between malloc and fork. This bug is described in the following link:

Sorry for the delay, I don't work Fridays so I didn't see your replies.

I'm back in the office on Monday so I'll look further but we are running RHEL7 
and the glibc version on compute nodes is:

glibc-2.17-196.el7_4.2.x86_64

So that's meant to have those bug fixes. :-/

All the best,
Chris
Comment 20 Christopher Samuel 2018-02-10 03:47:35 MST
Hi  Alejandro,

On 08/02/18 22:28, bugs@schedmd.com wrote:

> Interesting. Are you using pmi1, pmi2 or pmix?

PMI2,  I had tried out PMIx last year, without success.

> This document includes a "Using the UCX Extension".

Thanks, will take a look.

> I'm wondering if you built all the software pieces so that they are compatible
> with each other and using the right config options. Did you configure Slurm
> with --with-ucx=path-to-UCX-installation?

Hmm, I used to have that when testing, but I think I removed that before
starting to do production builds.  So no, we don't have UCX support
enabled in Slurm! :-/

> There's also a bunch of SLURM_ variables enabling Direct-connect, Early-wireup
> and also able to specify UCX network devices that come into play here.
> 
> Would you mind reviewing all these guidelines' recommendations and procedures
> and compare them against the way you built Slurm, PMI[1|2|x] and OpenMPI?

Not a problem - thanks for this.

All the best,
Chris
Comment 21 Christopher Samuel 2018-02-10 03:49:18 MST
On 09/02/18 04:15, bugs@schedmd.com wrote:

> Also are you using the same compiler to build the different soft components?

We use the RHEL7 GCC for Slurm, but of course OpenMPI has to be built
with multiple compilers because the Fortran F90 .mod file is
compiler-implementation dependent.  The version being tested is
built with the Intel compilers, because that's what we expect most
users building code will be using.

All the best,
Chris
Comment 22 Christopher Samuel 2018-02-11 15:31:38 MST
On 08/02/18 18:05, Christopher Samuel wrote:

> Now *why* enabling UCX in OpenMPI would cause slurmstepd to
> occasionally get hung up is another question and one I don't
> have any idea about.

I've now seen a non-UCX OpenMPI build cause a job to get stuck in CG,
but this time it's the extern step (the sleep is gone) and this pstack
output makes it look like we might have multiple calls to log_msg()
causing this deadlock.

[root@john21 ~]# pstack 366079
Thread 4 (Thread 0x2b9103abc700 (LWP 366084)):
#0  0x00002b90ff22142d in __lll_lock_wait () from /usr/lib64/libpthread.so.0
#1  0x00002b90ff21cdcb in _L_lock_812 () from /usr/lib64/libpthread.so.0
#2  0x00002b90ff21cc98 in pthread_mutex_lock () from 
/usr/lib64/libpthread.so.0
#3  0x00002b90fe277771 in log_msg (level=level@entry=LOG_LEVEL_DEBUG3, 
fmt=fmt@entry=0x42c853 "Called _msg_socket_accept", 
args=args@entry=0x2b9103abbb78) at log.c:959
#4  0x00002b90fe278d14 in debug3 (fmt=fmt@entry=0x42c853 "Called 
_msg_socket_accept") at log.c:1185
#5  0x0000000000416c59 in _msg_socket_accept (obj=0x13195f0, 
objs=<optimized out>) at req.c:321
#6  0x00002b90fe32f928 in _poll_handle_event (objList=0x1324560, 
obj=0x13195f0, revents=1) at eio.c:495
#7  _poll_dispatch (objList=0x1324560, map=0x2b9104000900, nfds=1, 
pfds=0x2b91040008d0) at eio.c:447
#8  eio_handle_mainloop (eio=0x135b1a0) at eio.c:335
#9  0x000000000041ca78 in _msg_thr_internal (job_arg=0x131bfd0) at req.c:245
#10 0x00002b90ff21ae25 in start_thread () from /usr/lib64/libpthread.so.0
#11 0x00002b90ff52734d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x2b9103fd6700 (LWP 366085)):
#0  0x00002b90ff534eec in __lll_lock_wait_private () from 
/usr/lib64/libc.so.6
#1  0x00002b90ff4b18c8 in _L_lock_2251 () from /usr/lib64/libc.so.6
#2  0x00002b90ff4aa0ef in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00002b90ff4afbca in calloc () from /usr/lib64/libc.so.6
#4  0x00002b90fe329871 in slurm_xmalloc (size=30, 
clear=clear@entry=true, file=file@entry=0x2b90fe35e0fe "xstring.c", 
line=line@entry=104, func=func@entry=0x2b90fe35e1bd <__func__.4940> 
"makespace") at xmalloc.c:84
#5  0x00002b90fe329c74 in makespace (str=str@entry=0x2b9103fd5bd8, 
needed=<optimized out>) at xstring.c:104
#6  0x00002b90fe329e35 in _xstrcat (str1=str1@entry=0x2b9103fd5bd8, 
str2=<optimized out>, str2@entry=0x2b9103cc9019 "%s: oom-kill event 
count: %lu") at xstring.c:135
#7  0x00002b90fe2773c7 in vxstrfmt (fmt=fmt@entry=0x2b9103cc9019 "%s: 
oom-kill event count: %lu", ap=ap@entry=0x2b9103fd5cb8) at log.c:827
#8  0x00002b90fe2779f3 in log_msg (level=level@entry=LOG_LEVEL_INFO, 
fmt=fmt@entry=0x2b9103cc9019 "%s: oom-kill event count: %lu", 
args=args@entry=0x2b9103fd5cb8) at log.c:1038
#9  0x00002b90fe278aa4 in info (fmt=fmt@entry=0x2b9103cc9019 "%s: 
oom-kill event count: %lu") at log.c:1149
#10 0x00002b9103cc5aa1 in _oom_event_monitor (x=<optimized out>) at 
task_cgroup_memory.c:489
#11 0x00002b90ff21ae25 in start_thread () from /usr/lib64/libpthread.so.0
#12 0x00002b90ff52734d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x2b9108b14700 (LWP 366390)):
#0  0x00002b90ff534eec in __lll_lock_wait_private () from 
/usr/lib64/libc.so.6
#1  0x00002b90ff59660d in _L_lock_27 () from /usr/lib64/libc.so.6
#2  0x00002b90ff5965bd in arena_thread_freeres () from /usr/lib64/libc.so.6
#3  0x00002b90ff596662 in __libc_thread_freeres () from /usr/lib64/libc.so.6
#4  0x00002b90ff21ae38 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00002b90ff52734d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x2b90fdfeafc0 (LWP 366079)):
#0  0x00002b90ff22142d in __lll_lock_wait () from /usr/lib64/libpthread.so.0
#1  0x00002b90ff21cdcb in _L_lock_812 () from /usr/lib64/libpthread.so.0
#2  0x00002b90ff21cc98 in pthread_mutex_lock () from 
/usr/lib64/libpthread.so.0
#3  0x00002b90fe277771 in log_msg (level=level@entry=LOG_LEVEL_DEBUG2, 
fmt=fmt@entry=0x42faad "%s: rmdir(%s): %m", 
args=args@entry=0x7ffdc3da1348) at log.c:959
#4  0x00002b90fe278c78 in debug2 (fmt=fmt@entry=0x42faad "%s: rmdir(%s): 
%m") at log.c:1176
#5  0x0000000000427c77 in xcgroup_delete (cg=cg@entry=0x2b91038b6640 
<user_memory_cg>) at xcgroup.c:493
#6  0x00002b91036ad5e8 in jobacct_gather_cgroup_memory_fini 
(slurm_cgroup_conf=slurm_cgroup_conf@entry=0x2b91038b2420 
<slurm_cgroup_conf>) at jobacct_gather_cgroup_memory.c:155
#7  0x00002b91036aca58 in fini () at jobacct_gather_cgroup.c:261
#8  0x00002b90fe2862bf in plugin_unload (plug=0x13218d0) at plugin.c:305
#9  0x00002b90fe2866ee in plugin_context_destroy (c=0x1319450) at 
plugin.c:477
#10 0x00002b90fe2a3674 in jobacct_gather_fini () at 
slurm_jobacct_gather.c:335
#11 0x00002b90fe29bcd7 in acct_gather_profile_fini () at 
slurm_acct_gather_profile.c:252
#12 0x0000000000410062 in _spawn_job_container (job=0x131bfd0) at mgr.c:1124
#13 job_manager (job=job@entry=0x131bfd0) at mgr.c:1216
#14 0x000000000040c957 in main (argc=1, argv=0x7ffdc3da27b8) at 
slurmstepd.c:172

Now I'm not sure this is the same scenario; I'll go and look at the
previous stack traces I've attached to see if it's plausible there too.

cheers,
Chris
Comment 23 Christopher Samuel 2018-02-11 15:39:07 MST
On 12/02/18 09:31, Christopher Samuel wrote:

> I've now seen a non-UCX OpenMPI build cause a job to get stuck in CG,
> but this time it's the extern step (the sleep is gone) and this ptrace
> output makes it look like we might have multiple calls to log_msg() be
> causing this deadlock.

This does look like a different issue; the stack trace is quite
different from what we were seeing before.

Looking back at that I can see why you'd think a glibc bug in
fork() (which is implemented via clone()) would be implicated,
as it seems the only slurmstepd calls there are main() and
then _wait_for_io().

All the best,
Chris
Comment 24 Alejandro Sanchez 2018-02-12 04:04:40 MST
From this second backtrace in comment 22, it looks like:

Thread 4 waits for log_msg() lock:
...
  log_msg -->
  ...
    __lll_lock_wait

Thread 3 owns the log_msg() lock, but waits for the malloc() arena one:
...
  log_msg --> acquires the lock
  ...
    slurm_xmalloc -->
    ...
      __lll_lock_wait_private


Thread 2 waits on the glibc arena lock after clone():
start_thread -->
...
  arena_thread_freeres
    ...
      __lll_lock_wait_private

Thread 1 waits on the log_msg lock after the jobacct_gather/cgroup memory cleanup:

jobacct_gather_cgroup_memory_fini -->
  xcgroup_delete -->
    debug2 --> (Since it can't rmdir() the cgroup, attempts to log_msg())
      log_msg -->
        pthread_mutex_lock 

To me it looks like Thread 2 waits for Thread 3 and vice versa (thus the deadlock), and Threads 1 and 4 wait for Thread 3, which owns the log_msg lock.

I'll keep investigating whether there are newer bugs related to these locks in glibc.

- Can you reproduce this at will? If so, could you indicate the steps?
- The Slurm webpage has been refreshed with the PMIx docs, but that doesn't look related:

https://slurm.schedmd.com/mpi_guide.html#pmix

- In this second backtrace, was that extern stepd the only one stuck, or were there more? If so, could you attach the backtraces from those too?

- Maybe the slurm.conf/cgroup.conf and the log messages from slurmd.log around the life-cycle of the step with pid 366079 would be helpful too.

Thanks.
Comment 25 Christopher Samuel 2018-02-12 04:27:02 MST
Hi Alejandro,

On Monday, 12 February 2018 10:04:40 PM AEDT bugs@schedmd.com wrote:

> From this second backtrace in comment 22, it looks like:
> I'll keep investigating whether there are newer bugs relate to these locks
> in glibc.

Thanks, I've not found anything obvious but I didn't have a lot of time to
pursue that.

> - Can you reproduce at will? If so could you indicate the steps?

It doesn't happen all the time; my colleague is submitting a lot of HPL jobs
that take about 30 seconds each (they're GPU ones) using an OpenMPI that was
built with UCX support.  We get a handful of them that hit this deadlock (so
it seems to be a race condition of some form).

> - Slurm webpage is refreshed with pmix docs, but doesn't look related:
> https://slurm.schedmd.com/mpi_guide.html#pmix

Thanks.

> - In this second backtrace, that extern stepd was the only one stuck or
> there were more? If so, could you attach the backtrace from those too?

No, that was all there was for that job, just the extern slurmstepd.

> - Maybe the slurm.conf/cgroup.conf and log messages from slurmd.log around
> the life-cycle of step with pid 366079 would be helpful too.

Here's cgroup.conf:

--------------------------8< snip snip 8<--------------------------
###
#
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#--
###
CgroupAutomount=yes
ConstrainCores=yes
TaskAffinity=no

ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRAMSpace=100
AllowedSwapSpace=10

# Disable kmem cgroup due to OPA specific issue
ConstrainKmemSpace=no

# CS; Wed Feb  7 09:44:13 AEDT 2018
# Constrain devices via cgroups (to limits access to GPUs etc)
ConstrainDevices=yes
--------------------------8< snip snip 8<--------------------------

Here are all the log messages from that node for that job and a few earlier ones:

Feb 12 01:19:50 john21 slurmd[350152]: _run_prolog: run job script took usec=36671
Feb 12 01:19:50 john21 slurmd[350152]: _run_prolog: prolog with lock for job 19439 ran for 0 seconds
Feb 12 01:19:50 john21 slurmstepd[365368]: task/cgroup: /slurm/uid_1040/job_19439: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb 12 01:19:50 john21 slurmstepd[365368]: task/cgroup: /slurm/uid_1040/job_19439/step_extern: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb 12 01:19:51 john21 slurmd[350152]: launch task 19439.0 request from 1040.10190@192.168.44.114 (port 45267)
Feb 12 01:19:51 john21 slurmstepd[365379]: task/cgroup: /slurm/uid_1040/job_19439: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb 12 01:19:51 john21 slurmstepd[365379]: task/cgroup: /slurm/uid_1040/job_19439/step_0: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb 12 01:20:20 john21 slurmstepd[365379]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Feb 12 01:20:20 john21 slurmstepd[365379]: done with job
Feb 12 01:20:21 john21 slurmstepd[365368]: _oom_event_monitor: oom-kill event count: 1
Feb 12 01:20:21 john21 slurmstepd[365368]: done with job
Feb 12 01:20:50 john21 slurmd[350152]: _run_prolog: run job script took usec=35217
Feb 12 01:20:50 john21 slurmd[350152]: _run_prolog: prolog with lock for job 19452 ran for 0 seconds
Feb 12 01:20:50 john21 slurmstepd[365724]: task/cgroup: /slurm/uid_1040/job_19452: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb 12 01:20:50 john21 slurmstepd[365724]: task/cgroup: /slurm/uid_1040/job_19452/step_extern: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb 12 01:20:50 john21 slurmd[350152]: launch task 19452.0 request from 1040.10190@192.168.44.114 (port 6356)
Feb 12 01:20:50 john21 slurmstepd[365734]: task/cgroup: /slurm/uid_1040/job_19452: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb 12 01:20:50 john21 slurmstepd[365734]: task/cgroup: /slurm/uid_1040/job_19452/step_0: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb 12 01:21:19 john21 slurmstepd[365734]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Feb 12 01:21:19 john21 slurmstepd[365734]: done with job
Feb 12 01:21:20 john21 slurmstepd[365724]: _oom_event_monitor: oom-kill event count: 1
Feb 12 01:21:20 john21 slurmstepd[365724]: done with job
Feb 12 01:21:50 john21 slurmd[350152]: _run_prolog: run job script took usec=38899
Feb 12 01:21:50 john21 slurmd[350152]: _run_prolog: prolog with lock for job 19465 ran for 0 seconds
Feb 12 01:21:50 john21 slurmstepd[366079]: task/cgroup: /slurm/uid_1040/job_19465: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb 12 01:21:50 john21 slurmstepd[366079]: task/cgroup: /slurm/uid_1040/job_19465/step_extern: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb 12 01:21:50 john21 slurmd[350152]: launch task 19465.0 request from 1040.10190@192.168.44.114 (port 32980)
Feb 12 01:21:50 john21 slurmstepd[366089]: task/cgroup: /slurm/uid_1040/job_19465: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb 12 01:21:50 john21 slurmstepd[366089]: task/cgroup: /slurm/uid_1040/job_19465/step_0: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb 12 01:22:20 john21 slurmstepd[366089]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Feb 12 01:22:20 john21 slurmstepd[366089]: done with job

Hope that helps!

All the best,
Chris
Comment 26 Christopher Samuel 2018-02-14 16:51:52 MST
On 12/02/18 22:26, Chris Samuel wrote:

> It doesn't happen all the time, my colleague is submitting a lot of HPL jobs
> that take about 30 seconds each (they're GPU ones) using OpenMPI that was
> built with UCX support.  We get a handfull of them that hit this deadlock (so
> it seems to be a race condition of some form).

We had a couple hit the same problem when configured without UCX (so
just using Slurm's usual PMI2 support, nothing fancy).

Unfortunately I did an 'scontrol reconfigure' to make a change to
PrologFlags take effect (enabling X11 forwarding), and that caused
both of the hung jobs to get cleaned up.

I had grabbed a copy of the job script first:

-----------------------8< snip snip 8<-----------------------
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=36
#SBATCH --mem=180gb
#SBATCH --time=02:0:00

#module load iomkl/2016.2
#module load intel/2016.2.181-gcc-6.4.0 openmpi/3.0.0
module load intel/2016.2.181-gcc-6.4.0 openmpi-no-ucx/3.0.0

mpirun xhpl
-----------------------8< snip snip 8<-----------------------

I had taken a stack trace yesterday but lost it when my laptop
got into a weird state overnight. :-(

Pretty sure it was the same we had seen before though.

So it might be that the race is still present, just easier to hit
with UCX enabled.

All the best,
Chris
Comment 27 Alejandro Sanchez 2018-02-20 04:12:03 MST
You mean that after setting PrologFlags=X11 the hung jobs stopped being stuck and finally got cleaned up, and you're not seeing stuck jobs anymore with this flag set? I'm asking this since the X11 flag implicitly enables Contain, and consequently Alloc, and the stepd behavior is then different from regular steps.
Comment 28 Christopher Samuel 2018-02-20 04:14:38 MST
On 20/02/18 22:12, bugs@schedmd.com wrote:

> You mean that after setting PrologFlags=X11 the hung jobs to stop being stuck
> and got finally cleaned up, and you're not seeing stuck jobs anymore with this
> flag set? I'm asking this since X11 flag implicitly enables Contain and thus
> Alloc consequently, and the stepd behavior is different from regular steps.

No, we're still seeing jobs that get stuck in CG state.

Saw one that was with OpenMPI 1.10.1 so definitely not UCX related as 
there was no support in that version.

My guess is that it's an odd race condition.

I'm not at work at the moment, but will upload what I caught from that 
hang tomorrow when I'm in the office.

All the best,
Chris
Comment 29 Christopher Samuel 2018-02-20 17:48:10 MST
Created attachment 6197 [details]
ompi-1.10.7-lockup.tar.gz

On 20/02/18 22:14, Christopher Samuel wrote:

> I'm not at work at the moment, but will upload what I caught from that 
> hang tomorrow when I'm in the office.

Here's the logs I promised last night.

I got the OMPI version number slightly wrong from memory last night;
it was 1.10.7, not 1.10.1.

The tarball includes the output from ompi_info too in case that helps.

All the best,
Chris
Comment 30 Doug Jacobsen 2018-02-23 07:37:10 MST
FYI, I just replicated a similar issue on a Cray system.

Relevant thread trace from the hung slurmstepd:

Thread 4 (Thread 0x2aaab1edb700 (LWP 171580)):
#0  0x00002aaaabdf201b in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00002aaaabd76b72 in arena_get2.isra () from /lib64/libc.so.6
#2  0x00002aaaabd77412 in malloc () from /lib64/libc.so.6
#3  0x00002aaaaae1cf48 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false, file=file@entry=0x2aaaaae4c512 "pack.c",
    line=line@entry=152, func=func@entry=0x2aaaaae36a05 <__func__.8403> "init_buf") at xmalloc.c:86
#4  0x00002aaaaad727f0 in init_buf (size=16384) at pack.c:152
#5  0x000000000041d296 in _handle_accept (arg=0x0) at req.c:384
#6  0x00002aaaab090724 in start_thread () from /lib64/libpthread.so.0
#7  0x00002aaaabde5c1d in clone () from /lib64/libc.so.6



Looks like it is hanging on a malloc while attempting to process an RPC (sadly, probably the RPC to terminate the step...)
Comment 31 Doug Jacobsen 2018-02-23 07:39:39 MST
sorry, full backtrace:


Thread 4 (Thread 0x2aaab1edb700 (LWP 171580)):
#0  0x00002aaaabdf201b in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00002aaaabd76b72 in arena_get2.isra () from /lib64/libc.so.6
#2  0x00002aaaabd77412 in malloc () from /lib64/libc.so.6
#3  0x00002aaaaae1cf48 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false, file=file@entry=0x2aaaaae4c512 "pack.c",
    line=line@entry=152, func=func@entry=0x2aaaaae36a05 <__func__.8403> "init_buf") at xmalloc.c:86
#4  0x00002aaaaad727f0 in init_buf (size=16384) at pack.c:152
#5  0x000000000041d296 in _handle_accept (arg=0x0) at req.c:384
#6  0x00002aaaab090724 in start_thread () from /lib64/libpthread.so.0
#7  0x00002aaaabde5c1d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x2aaab3d4d700 (LWP 170560)):
#0  0x00002aaaabdf201b in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00002aaaabe4a494 in arena_thread_freeres () from /lib64/libc.so.6
#2  0x00002aaaabe4a552 in __libc_thread_freeres () from /lib64/libc.so.6
#3  0x00002aaaab09073c in start_thread () from /lib64/libpthread.so.0
#4  0x00002aaaabde5c1d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x2aaab1fdc700 (LWP 170557)):
#0  0x00002aaaabddd37d in poll () from /lib64/libc.so.6
#1  0x00002aaaaae230b7 in poll (__timeout=-1, __nfds=<optimized out>, __fds=0x2aaab40008d0) at /usr/include/bits/poll2.h:46
#2  _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2aaab40008d0) at eio.c:364
#3  eio_handle_mainloop (eio=0x6cf6f0) at eio.c:328
#4  0x000000000041d678 in _msg_thr_internal (job_arg=0x6bd550) at req.c:245
#5  0x00002aaaab090724 in start_thread () from /lib64/libpthread.so.0
#6  0x00002aaaabde5c1d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2aaaaab34ac0 (LWP 170554)):
#0  0x00002aaaab0919ad in pthread_join () from /lib64/libpthread.so.0
#1  0x0000000000410fce in _wait_for_io (job=0x6bd550) at mgr.c:2219
#2  job_manager (job=job@entry=0x6bd550) at mgr.c:1397
#3  0x000000000040d17a in main (argc=1, argv=0x7fffffffedd8) at slurmstepd.c:172
(gdb)


Looks like threads 3 and 4 may both be involved (similar to the earlier discussion).
Comment 32 Alejandro Sanchez 2018-02-23 07:53:18 MST
I'm wondering if this fix has something to do with these deadlocks:

https://bugzilla.redhat.com/show_bug.cgi?id=1332917

Fixed In Version:	glibc-2.22-15.fc23, glibc-2.23.1-7.fc24

Doug, what's your system glibc version?

Christopher reported glibc-2.17-196.el7_4.2.x86_64, which most probably doesn't include the fix above.
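
Something like the following run on the compute node image would show it:

--------------------8< snip snip 8<--------------------
rpm -qa 'glibc*' | sort
--------------------8< snip snip 8<--------------------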
Comment 33 Doug Jacobsen 2018-02-23 07:58:56 MST
We are presently using Cray CLE6.0up04, which is based upon SuSE 12sp2

We've been using up04 for some time (since last June) but I'd never seen this issue before.  I believe there were some glibc updates associated with the recent security patches for Meltdown, Spectre, the user namespaces issues, and the huge dirty COW.

Possibly the security update generated a regression.

We installed those patches on 2/6.

Our current image has:
glibc-profile-2.22-62.3.4.x86_64
glibc-locale-32bit-2.22-62.3.4.x86_64
glibc-2.22-62.3.4.x86_64
glibc-profile-32bit-2.22-62.3.4.x86_64
glibc-info-2.22-62.3.4.noarch
glibc-locale-2.22-62.3.4.x86_64
glibc-32bit-2.22-62.3.4.x86_64
glibc-i18ndata-2.22-62.3.4.noarch
glibc-html-2.22-62.3.4.noarch


The previous image has (where I don't think I saw this issue):
glibc-2.22-61.3.x86_64
glibc-32bit-2.22-61.3.x86_64
glibc-profile-32bit-2.22-61.3.x86_64
glibc-html-2.22-61.3.noarch
glibc-locale-32bit-2.22-61.3.x86_64
glibc-info-2.22-61.3.noarch
glibc-profile-2.22-61.3.x86_64
glibc-locale-2.22-61.3.x86_64
glibc-i18ndata-2.22-61.3.noarch
Comment 34 Christopher Samuel 2018-02-23 15:08:50 MST
On Saturday, 24 February 2018 1:58:56 AM AEDT bugs@schedmd.com wrote:

> We've been using up04 for some time (last June) but I've never seen this
> issue. I believe there were some glibc updates associated with the recent
> security patches for meltdown, spectre, user namespaces issues, and the
> hugedirty COW.

Interesting. I've been wondering whether this might be related; it feels to me
like we only started hearing about this after those fixes appeared.

My colleague who was seeing this is away this weekend but I'll ask if they 
remember that too (I was away in the USA when the news broke and the fixes were 
released).

Of course, with microcode updates (later reverted), kernel changes and glibc
changes, it's going to be interesting trying to untangle this (if it is the
cause).

Doug: which Slurm version are you running?   When was that last changed?

cheers!
Chris
Comment 35 Doug Jacobsen 2018-02-23 17:24:10 MST
We are using slurm 17.11.3 as of 2/6 (same day as the other patches), but were using 17.11.2 from 1/9 (and 17.02.9 before that)
Comment 36 Christopher Samuel 2018-02-23 17:33:10 MST
Hi Doug,

On 24/02/18 11:24, bugs@schedmd.com wrote:

> We are using slurm 17.11.3 as of 2/6 (same day as the other patches),
> but were using 17.11.2 from 1/9 (and 17.02.9 before that)

Did you see this deadlock on 17.02.9 ?

cheers,
Chris
Comment 37 Doug Jacobsen 2018-02-23 17:38:32 MST
it's possible it has happened with 17.02, but I haven't seen it until just today.  Given that it is a glibc bug, I personally suspect the recent updates to glibc.
Comment 38 Christopher Samuel 2018-02-24 04:21:42 MST
Created attachment 6234 [details]
slurmstepd.gdb-full

On 24/02/18 11:38, bugs@schedmd.com wrote:

> it's possible it has happened with 17.02, but I haven't seen it until
> just today.  Given that it is a glibc bug, I personally suspect the
> recent updates to glibc.

I think they suspect a glibc bug, but reading the latest one that
Alejandro found, it's not clear to me whether it's a different bug from
the RHEL7 one it's a clone of, or the same one just targeting Fedora.

The bug it's cloned from was fixed last year and our glibc should
have that patch in it. Errata here:

https://access.redhat.com/errata/RHSA-2017:1916

Another interpretation of the bug is that it's the same symptoms
but a different cause, and they only fixed it for Fedora.

I've checked and we have the latest glibc updates in production,
the last errata was this one:

https://access.redhat.com/errata/RHBA-2017:3296

So my previous thought that there were glibc updates for Meltdown &
Spectre was wrong.

In case it's useful I attach a gdb "thread apply all bt full" from
one of our currently stuck jobs in case it gives more context.

All the best,
Chris
Comment 39 Christopher Samuel 2018-02-25 21:13:34 MST
Created attachment 6236 [details]
slurmstepd.0.gdb

On 24/02/18 22:21, Christopher Samuel wrote:

> In case it's useful I attach a gdb "thread apply all bt full" from
> one of our currently stuck jobs in case it gives more context.

I managed to trigger the same issue for a job launched with "srun"
using my colleague's HPL binary and input deck.  However, unlike when
launched with mpirun, this is stuck in the 'R' state.

[csamuel@farnarkle1 ~]$ squeue | fgrep 850
          39928_850   skylake hpl.srun  csamuel  R      21:50      8 
john[54-61]

That's 1 job element out of 1000 that got stuck.
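
(The submission itself was nothing special; roughly speaking it was along these lines, with the script name taken from the squeue output above and hypothetical otherwise:)

--------------------8< snip snip 8<--------------------
# Approximate reproducer: a 1000-element array of short HPL runs whose
# batch script launches xhpl with srun rather than mpirun.
sbatch --array=1-1000 hpl.srun
--------------------8< snip snip 8<--------------------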

Interestingly, Slurm has already cleaned up the parent job array at
this point, so asking about the job shows no info:

[csamuel@farnarkle1 ~]$ squeue -j 39928
              JOBID PARTITION     NAME     USER ST       TIME  NODES 
NODELIST(REASON)
[csamuel@farnarkle1 ~]$

(that does work for another running job array)

I've attached the stack trace (grabbed via pstack and gdb) in case
this gives more insight than the mpirun case.

All the best,
Chris
Comment 40 Christopher Samuel 2018-02-25 21:13:34 MST
Created attachment 6237 [details]
slurmstepd.0.pstack
Comment 41 Christopher Samuel 2018-02-25 21:22:06 MST
On 26/02/18 15:13, Christopher Samuel wrote:

> I managed to trigger the same issue for a job launched with "srun"
> using my colleagues HPL binary and input deck.  However, unlike when
> launched with mpirun this is stuck in the 'R' state.

It occurred to me that this might be the case with the mpirun jobs too,
until they hit their walltime limit and Slurm tries to clean them up.

Trying to scancel that srun job does indeed get it wedged in the 'CG' state.

[csamuel@farnarkle1 ~]$ scancel 39928_850
[csamuel@farnarkle1 ~]$ squeue | fgrep 850
          39928_850   skylake hpl.srun  csamuel CG      28:13      2 
john[54-55]

Eventually it cleans up everything except the step on the node that
has wedged.

[csamuel@farnarkle1 ~]$ squeue | fgrep 850
          39928_850   skylake hpl.srun  csamuel CG      28:13      2 
john[54-55]
[csamuel@farnarkle1 ~]$ squeue | fgrep 850
          39928_850   skylake hpl.srun  csamuel CG      28:13      1 john55

Which matches what we eventually see with the mpirun jobs.

All the best,
Chris
Comment 42 Christopher Samuel 2018-02-25 21:24:49 MST
On 26/02/18 15:21, Christopher Samuel wrote:

> It occurred to me that this might be the case with the mpirun jobs too,
> until they hit their walltime limit and Slurm tries to clean them up.

Just to make a liar out of me I caught an mpirun job which had gone
straight from 'R' to 'CG' and was wedged.

All the best,
Chris
Comment 43 Christopher Samuel 2018-02-27 20:44:18 MST
Created attachment 6251 [details]
slurmstepd-gdb-bt-with-debuginfo.txt

Now that we've managed to get the required debuginfo packages
installed, I've attached a repeat stack trace from one of the
slurmstepds, taken via gdb after grabbing a core file with "gcore".

All the best,
Chris
Comment 44 Christopher Samuel 2018-02-27 20:59:23 MST
Created attachment 6252 [details]
stack-slurmd.log

Apologies for spamming you with backtraces, but I thought it would be
useful to have two corresponding ones for both the slurmd and slurmstepd
so you can see what connections were being made from slurmd to the step
daemon in case that might help understand the trigger.

Both of these were obtained by using gdb to analyse core files
captured with "gcore", with all the necessary debug symbols installed.

All the best,
Chris
Comment 45 Christopher Samuel 2018-02-27 20:59:24 MST
Created attachment 6253 [details]
stack-slurmstepd.log
Comment 46 Christopher Samuel 2018-02-28 04:01:38 MST
On 24/02/18 22:21, Christopher Samuel wrote:

> I think they suspect a glibc bug

I've been corresponding with a glibc maintainer; they've looked at
the backtraces and the glibc version, and their comment is:

# The backtraces show that the threads are blocked on free_list_lock.
#
# During fork, we reinitialize this lock (because we reconstruct all
# data protected by it), so the parent state at the point of fork should
# not matter as far as this lock is concerned.
#
# As far as I can see, free_list_lock is correctly handled inside
# malloc. The lock/unlock calls are properly paired, and the critical
# sections are really short.
#
# We've also not seen anything like that, so it's probably some
# application-specific issue.
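
For what it's worth, the pattern they're describing there is the
standard pthread_atfork one. A minimal sketch of that technique
(illustration only, not glibc's actual code; all names below are made
up) looks something like this:

--------------------8< snip snip 8<--------------------
/* Sketch: if the fork child rebuilds all the data a lock protects, it
 * can simply re-initialise that lock in its atfork child handler, so
 * whatever state the parent's lock was in at fork() time doesn't
 * matter. */
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

static void prepare(void) { pthread_mutex_lock(&list_lock);   }  /* parent, pre-fork  */
static void parent(void)  { pthread_mutex_unlock(&list_lock); }  /* parent, post-fork */
static void child(void)
{
	/* child, post-fork: discard the inherited lock state and rebuild
	 * the (hypothetical) free list it protects from scratch */
	pthread_mutex_init(&list_lock, NULL);
}

int main(void)
{
	pthread_atfork(prepare, parent, child);

	pid_t pid = fork();
	if (pid == 0) {
		/* safe to take the freshly re-initialised lock in the child */
		pthread_mutex_lock(&list_lock);
		puts("child: lock usable after fork");
		pthread_mutex_unlock(&list_lock);
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	return 0;
}
--------------------8< snip snip 8<--------------------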

So it seems unlikely that this is glibc related. Certainly I didn't
see this at ${JOB-1}, where I was helping a lot with a RHEL7-based
cluster.

All the best,
Chris
Comment 47 Alejandro Sanchez 2018-02-28 08:41:58 MST
Hi Christopher, thanks for the new information. We're still trying to figure out the root cause of these deadlocks. We'll keep you informed.
Comment 48 Pablo Llopis 2018-03-01 05:16:10 MST
Hi there,

I have also experienced jobs stuck in the CG state, with similar symptoms to those described by Christopher: slurmstepd blocks and is unable to clean up finished jobs.

I am not sure if this is the same bug, as I have noticed certain differences: 

In my case, the jobs initially do not exit because they get stuck for a while in the uninterruptible D state. At this point I can see the user process stuck in write() due to a storage problem, which is causing the uninterruptible sleep, and slurmstepd is blocked in wait4(), in the kernel's do_wait.  I think this is expected, since the user job is stuck and Slurm will be unable to SIGKILL it.

However, after the process is released from this state (hours later) and is destroyed, I find that slurmstepd seems deadlocked (wchan shows do_signal_stop).
My stacktrace of slurmstepd also looks a bit different from Christopher's as I see no memory allocation/freeing codepaths:

Thread 5 (Thread 0x2b08cc466700 (LWP 14992)):
#0  0x00002b08cd1a1945 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x000000000054955f in _watch_tasks (arg=0x0) at slurm_jobacct_gather.c:244
#2  0x00002b08cd19de25 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00002b08cd4aa34d in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x2b08d0f23700 (LWP 14993)):
#0  0x00002b08cd1a442d in __lll_lock_wait () from /usr/lib64/libpthread.so.0
#1  0x00002b08cd19fdcb in _L_lock_812 () from /usr/lib64/libpthread.so.0
#2  0x00002b08cd19fc98 in pthread_mutex_lock () from /usr/lib64/libpthread.so.0
#3  0x000000000047e6ab in log_msg (level=LOG_LEVEL_DEBUG2, fmt=0x5edcfb "profile signalling type %s", args=0x2b08d0f22d60) at log.c:967
#4  0x000000000047f3f9 in debug2 (fmt=0x5edcfb "profile signalling type %s") at log.c:1182
#5  0x000000000054d697 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:174
#6  0x00002b08cd19de25 in start_thread () from /usr/lib64/libpthread.so.0
#7  0x00002b08cd4aa34d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x2b08d1024700 (LWP 14994)):
#0  0x00002b08cd1a442d in __lll_lock_wait () from /usr/lib64/libpthread.so.0
#1  0x00002b08cd19fdcb in _L_lock_812 () from /usr/lib64/libpthread.so.0
#2  0x00002b08cd19fc98 in pthread_mutex_lock () from /usr/lib64/libpthread.so.0
#3  0x000000000047e6ab in log_msg (level=LOG_LEVEL_DEBUG3, fmt=0x5c8343 "Called _msg_socket_accept", args=0x2b08d1023bc0) at log.c:967
#4  0x000000000047f4c5 in debug3 (fmt=0x5c8343 "Called _msg_socket_accept") at log.c:1191
#5  0x000000000043d6ae in _msg_socket_accept (obj=0xb04260, objs=0xaff850) at req.c:336
#6  0x0000000000453f73 in _poll_handle_event (revents=1, obj=0xb04260, objList=0xaff850) at eio.c:493
#7  0x0000000000453d39 in _poll_dispatch (pfds=0x2b08d40008d0, nfds=1, map=0x2b08d4000900, objList=0xaff850) at eio.c:445
#8  0x0000000000453895 in eio_handle_mainloop (eio=0xb04300) at eio.c:333
#9  0x000000000043d2de in _msg_thr_internal (job_arg=0xaea000) at req.c:242
#10 0x00002b08cd19de25 in start_thread () from /usr/lib64/libpthread.so.0
#11 0x00002b08cd4aa34d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x2b08d237b700 (LWP 30612)):
#0  0x00002b08cd1a46ad in write () from /usr/lib64/libpthread.so.0
#1  0x0000000000482856 in cbuf_put_fd (srcbuf=0xb083b0, pdstfd=0x2b08d237a9c4, len=79) at cbuf.c:1243
#2  0x0000000000482c93 in cbuf_reader (src=0xb08350, len=79, putf=0x482829 <cbuf_put_fd>, dst=0x2b08d237a9c4) at cbuf.c:1411
#3  0x0000000000481d37 in cbuf_read_to_fd (src=0xb08350, dstfd=2, len=79) at cbuf.c:865
#4  0x000000000047e5ee in _log_printf (log=0xaced10, cb=0xb08350, stream=0x2b08cd76f1c0 <_IO_2_1_stderr_>, fmt=0x5d22d6 "%s: %s%s\n") at log.c:947
#5  0x000000000047eb37 in log_msg (level=LOG_LEVEL_ERROR, fmt=0x5c8de0 "*** %s ON %s CANCELLED AT %s ***", args=0x2b08d237abe0) at log.c:1058
#6  0x000000000047f0bc in error (fmt=0x5c8de0 "*** %s ON %s CANCELLED AT %s ***") at log.c:1140
#7  0x000000000043fe4c in _handle_signal_container (fd=10, job=0xaea000, uid=0) at req.c:838
#8  0x000000000043e564 in _handle_request (fd=10, job=0xaea000, uid=0, gid=0) at req.c:511
#9  0x000000000043e0f8 in _handle_accept (arg=0x0) at req.c:455
#10 0x00002b08cd19de25 in start_thread () from /usr/lib64/libpthread.so.0
#11 0x00002b08cd4aa34d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x2b08cc32f5c0 (LWP 14990)):
#0  0x00002b08cd470e6a in wait4 () from /usr/lib64/libc.so.6
#1  0x0000000000430d8e in _wait_for_any_task (job=0xaea000, waitflag=true) at mgr.c:1989
#2  0x00000000004312fb in _wait_for_all_tasks (job=0xaea000) at mgr.c:2097
#3  0x000000000042f731 in job_manager (job=0xaea000) at mgr.c:1306
#4  0x000000000042a6a4 in main (argc=1, argv=0x7ffe8ea856d8) at slurmstepd.c:183
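
My reading of that trace, for what it's worth: thread 2 has taken the log mutex inside log_msg() and is then blocked in write(), presumably because nothing is draining that fd, so threads 3 and 4 (and anything else that wants to log) queue up behind it on the same mutex. A minimal standalone sketch of that shape (not Slurm code, every name below is made up) hangs the same way:

--------------------8< snip snip 8<--------------------
/* Sketch only: one thread takes a mutex and then performs a blocking
 * write() to an fd nobody reads; every other thread that needs the
 * same mutex (e.g. for logging) then waits on it indefinitely. */
#include <pthread.h>
#include <string.h>
#include <unistd.h>

static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
static int pipefd[2];                /* stand-in for the stderr/eio fd */

static void *holder(void *arg)
{
	char buf[4096];
	(void) arg;
	memset(buf, 'x', sizeof(buf));

	pthread_mutex_lock(&log_lock);   /* like log_msg() taking its mutex */
	for (;;)                         /* no reader: once the pipe buffer */
		if (write(pipefd[1], buf, sizeof(buf)) < 0)  /* fills, write() */
			break;                   /* blocks here forever */
	pthread_mutex_unlock(&log_lock);
	return NULL;
}

static void *waiter(void *arg)
{
	(void) arg;
	sleep(1);                        /* let the holder block first */
	pthread_mutex_lock(&log_lock);   /* like the debug2()/debug3()
	                                    threads: never returns */
	pthread_mutex_unlock(&log_lock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pipe(pipefd);                    /* nobody ever reads pipefd[0] */
	pthread_create(&t1, NULL, holder, NULL);
	pthread_create(&t2, NULL, waiter, NULL);
	pthread_join(t2, NULL);          /* hangs, same shape as the stuck stepd */
	return 0;
}
--------------------8< snip snip 8<--------------------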

slurmd is also unresponsive (scontrol show slurmd times out). Restarting slurmd does not help as it sees a stale job socket in the spool directory and tries to connect() to it, resulting in more blocking.

We are running slurm 17.02.6, with the MVAPICH2 patch (http://mvapich.cse.ohio-state.edu/download/mvapich/osu-shmempmi-slurm-17.02.2.patch), on an up-to-date CentOS 7.4 (and also glibc-2.17-196).  
In this case, these were sbatch jobs which invoke MVAPICH2-2.2's mpiexec, which was compiled with these flags: '--enable-hybrid' '--enable-dependency-tracking' '--enable-cache' '--enable-cxx' '--enable-fortran=all' '--enable-threads=multiple' 'CFLAGS=-m64 -O2 -pipe -Wall -Wshadow' 'CXXFLAGS=-m64 -O2 -pipe -Wall -Weffc++ -Wshadow' 'FCFLAGS=-m64 -O2 -pipe -Wall' 'FFLAGS=-m64 -O2 -pipe -Wall'

Does this look like the same bug to you, or do you want me to report it as a different bug? Or do you think this looks more like a PEBKAC? :)

Cheers,
Pablo
Comment 49 Pablo Llopis 2018-03-01 06:53:39 MST
Sorry, a correction to my previous comment: in this case Slurm was compiled *without* the aforementioned patch.
Comment 53 Patrick Glomski 2018-03-26 09:55:56 MDT
We have the same issue: all user processes are dead and one node from the job is stuck in the 'completing' state. A Slurm extern step and a .0 step (with no child processes) are left on the node, and slurmstepd has to be manually killed to return the node to service.

    1 16986 16985 16985 ?           -1 Sl       0   0:00 /usr/sbin/slurmd
    1 18508 18507 18507 ?           -1 Sl       0   0:04 slurmstepd: [5181.extern]
18508 18513 18513 18507 ?           -1 S        0   0:00  \_ sleep 1000000
    1 18516 18515 18515 ?           -1 Sl       0   0:01 slurmstepd: [5181.0]

Please advise if you would like me to collect any debugging information.

Thanks,
Patrick
Comment 54 Brian F Gilmer 2018-04-09 15:23:55 MDT
Hello,

At Cray we have seen this problem on a CS system, which is an IB cluster, and on a VM cluster.  We tried a patch commenting out the debug code; here is a summary of our backtrace.  We saw the problem on both 17.02 and 17.11.

Thread 193
#0 poll() from /usr/lib64/libc.so.6
#1 _poll_internal eio.c:364
#2 eio_handle_mainloop at eio.c:328
#3 _io_thr io.c:1479
#4 start_thread() /usr/lib64/libpthread.so
#5 clone() /usr/lib64/libc.so.6
 
Thread 192
<same as thread 2>
...
Thread 3
<same as thread 2>
 
Thread 2
#0 ___lll_lock_wait_private from /usr/lib64/libc.so.6
#1 _L_lock_2251() from /usr/lib64/libc.so.6
#2 arena_get2.isra.3() from /usr/lib64/libc.so.6
#3 malloc() from /usr/lib64/libc.so.6
#4 slurm_xmalloc() pack.c
#5 init_buf() pack.c
#6 in _handle_accept req.c
#7 start_thread() /usr/lib64/libpthread.so.0
#8 clone() from /usr/lib64/libc.so
 
Thread 1
#0 pthread_join() /usr/lib64/libpthread.so.0
#1 acct_gather_profile_fini() at slurm_acct_gather_profile.c
#2 job_manager at mgr.c
#3 main slurmstepd.c:172


On the VM the problem happens about once a day.  The workflow is a lot of serial jobs on the same node.  The memory footprint is small, so this looks like it may be associated with memory fragmentation.

On the CS system the memory is large.  The problem can be encountered when running large, short MPI jobs.

I checked the Slurm code in the call stack and it has not changed in the last three major releases.

According to comment 46 this is not a glibc/malloc issue.

Where are we at with this issue?
Comment 55 Alejandro Sanchez 2018-04-10 04:36:48 MDT
All three tickets (4690, 4733 and 4810) seem to report the same issue. We're still working to pin down the root cause.
Comment 56 Tim Wickberg 2018-04-11 21:52:46 MDT
Hi folks -

Apologies for the bug update; we have three nearly identical versions of this bug (bugs 4690, 4733 and 4810), and I'm trying to narrow down the possible sources.

As best we can tell at the moment, this does seem to only affect a limited number of glibc versions - most of which appear to be 2.17 (plus extensive vendor patching).

Even with that version, we're seeing inconsistent results trying to reproduce, and I'm starting to lean towards this issue being caused by an interaction between CPU, GLIBC, and some subtle timing differences in how the slurmstepd manages the extern step vs. normal steps.

What would help us at the moment is if you could confirm:

- Exact GLIBC version, including all vendor sub-numbering.
- Output from /proc/cpuinfo (just the first processor is fine).

E.g., at the moment we're able to reproduce on a box with:

- glibc-2.17-196.el7.x86_64
- Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz

thanks,
- Tim
Comment 57 Christopher Samuel 2018-04-11 22:03:57 MDT
On 12/04/18 13:52, bugs@schedmd.com wrote:

> What would help us at the moment is if you could confirm:
> 
> - Exact GLIBC version, including all vendor sub-numbering.
> - Output from /proc/cpuinfo (just the first processor is fine).

For us:

glibc-2.17-196.el7_4.2.x86_64

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
stepping	: 4
microcode	: 0x2000043
cpu MHz		: 2300.000
cache size	: 25344 KB
physical id	: 0
siblings	: 18
core id		: 0
cpu cores	: 18
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl 
xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor 
ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 
x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm 
abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_pt spec_ctrl 
ibpb_support tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust 
bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq 
rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt 
xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm 
ida arat pln pts
bogomips	: 4600.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:
Comment 58 Christopher Samuel 2018-04-11 22:08:20 MDT
On 10/04/18 07:23, bugs@schedmd.com wrote:

> On the CS system the memory is large.  The problem can be encountered when
> running large, short MPI jobs.

We were also seeing this with lots of short, wide MPI jobs (I *think* 
they were GPU HPL runs on a selection of nodes but I'm seeking 
confirmation).
Comment 59 Patrick Glomski 2018-04-12 07:13:23 MDT
(In reply to Tim Wickberg from comment #56)
> What would help us at the moment is if you could confirm:
> 
> - Exact GLIBC version, including all vendor sub-numbering.
> - Output from /proc/cpuinfo (just the first processor is fine).

glibc-2.17-196.el7_4.2.x86_64
Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
Comment 60 Alejandro Sanchez 2018-04-13 10:08:06 MDT
*** Ticket 5062 has been marked as a duplicate of this ticket. ***
Comment 61 Tim Wickberg 2018-04-20 14:59:16 MDT
*** Ticket 5091 has been marked as a duplicate of this ticket. ***
Comment 62 Tim Wickberg 2018-04-26 11:01:07 MDT
Hey folks -

Thank you all for your patience while we've been working through this. We believe we now understand the root cause of these deadlocks, and are working on a series of fixes to address them.

It does appear to be an issue within Slurm itself related to unsafe use of pthread_kill, and not a glibc bug as we'd suspected. (Although recent changes to glibc do seem to make this race condition significantly easier to trigger - Slurm has inadvertently relied on this unsafe behavior working for quite some time.)
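
As a rough illustration of the kind of hazard involved (a deliberately simplified sketch, not the actual Slurm code paths): pthread_kill() is only safe while the target thread is known to still exist, so signalling a thread ID without synchronising against the target's lifetime is a race.

--------------------8< snip snip 8<--------------------
/* Simplified sketch of the hazard (not Slurm's actual code): once a
 * detached thread has terminated its pthread_t is no longer a valid
 * thread ID, and POSIX leaves pthread_kill() on it undefined - the ID
 * may even have been reused for a different thread. */
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

static void *worker(void *arg)
{
	(void) arg;
	/* ... does some work and returns; for a detached thread the ID
	 * becomes invalid the moment this function returns ... */
	return NULL;
}

int main(void)
{
	pthread_t tid;

	signal(SIGUSR1, SIG_IGN);     /* make any actual delivery harmless */
	pthread_create(&tid, NULL, worker, NULL);
	pthread_detach(tid);

	sleep(1);                     /* worker has almost certainly exited */

	/* RACY: nothing guarantees 'tid' still names a live thread here.
	 * One safer pattern is to keep the thread joinable, since the ID
	 * then stays valid until pthread_join() has been called on it. */
	pthread_kill(tid, SIGUSR1);

	return 0;
}
--------------------8< snip snip 8<--------------------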

I'm closing this as a duplicate of the new bug 5103, which will track those improvements.

cheers,
- Tim

*** This ticket has been marked as a duplicate of ticket 5103 ***