Created attachment 6064 [details]
slurmstepd stack trace

Hi there,

We are seeing simple test jobs getting stuck in the 'CG' state: the MPI tasks exit and the user's script finishes, but the .0 step does not clean up on one (random) node. The script in question does:

--------------------8< snip snip 8<--------------------
#!/bin/bash
#SBATCH --nodes=96
#SBATCH --ntasks-per-node=36
#SBATCH --mem=180gb
#SBATCH --time=00:10:00

#module load iomkl/2016.2
module load intel/2016.2.181-gcc-6.4.0 openmpi/3.0.0

mpirun xhpl
--------------------8< snip snip 8<--------------------

Looking at the affected nodes I see no processes owned by the user there; pstree -a shows:

|-slurmd
| `-17*[{slurmd}]
|-slurmstepd
| |-sleep 1000000
| `-3*[{slurmstepd}]
|-slurmstepd
| `-20*[{slurmstepd}]

ps shows:

root 153087 0.0 0.0 1353580 3232 ? Sl 11:30 0:00 /apps/slurm/latest/sbin/slurmd
root 168684 0.0 0.0 303036 4524 ? Sl 15:22 0:00 slurmstepd: [12212.extern]
root 168693 0.0 0.0 1492292 4992 ? Sl 15:22 0:00 slurmstepd: [12212.0]

A pstack trace is attached due to its length.

The last syslog messages from the node (which is where the slurmd log is going) are:

Feb 5 15:22:15 john104 slurmstepd[168693]: task/cgroup: /slurm/uid_1040/job_12212: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB
Feb 5 15:22:15 john104 slurmstepd[168693]: task/cgroup: /slurm/uid_1040/job_12212/step_0: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB

squeue shows:

# squeue -j 12212
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12212 skylake hpl.slur rhumble CG 3:42 1 john104

and:

# squeue -j 12212 -s
STEPID NAME PARTITION USER TIME NODELIST
12212.0 orted skylake rhumble 34:35 john[7-44,46-52,54-55,58-73,75-98,100-107]
12212.Extern extern skylake rhumble 34:36 john[6-44,46-52,54-55,58-73,75-98,100-107]

So you can see it's just one node that's left in this odd state. Looking at wchan with ps:

[root@john104 ~]# ps -Ao pid,cmd,wchan | grep slurmstepd
168684 slurmstepd: [12212.extern] do_wait
168693 slurmstepd: [12212.0] futex_wait_queue_me

I pulled in commit d2c838070 mentioned in bug 4718 before realising that this is subtly different, in that it's not an issue with the extern job step, so this is happening both with and without that commit.

If I kill -9 the .0 slurmstepd then slurmd will clean the job up correctly, but a normal kill doesn't touch it.

All the best,
Chris
Hi Chris,

Besides the mentioned commit:

https://github.com/SchedMD/slurm/commit/d2c8380709

there's another patch pending review and check-in, tracked in bug:

https://bugs.schedmd.com/show_bug.cgi?id=4718

which I believe would solve this too. That patch will most likely be included in 17.11.3, which should be released early this week.

I'll discuss it internally though just in case the issue reported here is different to what is solved there. Thanks.
On 05/02/18 21:22, bugs@schedmd.com wrote:

> I'll discuss it internally though just in case the issue reported
> here is different to what is solved there. Thanks.

Thanks Alejandro, looking forward to hearing back on this!
Christopher, we've internally discussed this and we believe this should be fixed in 17.11.3.

Would you mind testing that version (when released) to see whether the steps still get stuck, and confirming here whether it is fixed? Thanks!
On 06/02/18 02:58, bugs@schedmd.com wrote:

> Christopher, we've internally discussed this and we believe this
> should be fixed in 17.11.3.

Great, thanks for that.

> Would you mind testing that version (when released) to see whether the
> steps still get stuck, and confirming here whether it is fixed? Thanks!

Sure! Happy to do that. I'm not working today but will be tomorrow (Wednesday).

All the best,
Chris
Christopher, I'm marking this as a duplicate of bug 4718. We are fairly confident the following related commit will fix the issue here. Please reopen after testing if it does not.

https://github.com/SchedMD/slurm/commit/108502e9504

*** This ticket has been marked as a duplicate of ticket 4718 ***
On 07/02/18 21:44, bugs@schedmd.com wrote: > Christopher, I'm marking this as duplicate of bug 4718 We are fairly > confident following related commit will fix issue here. Please reopen > otherwise after testing. Sorry - please reopen. We are still seeing the problem with 17.11.3-2. 14200 skylake hpl.slur rhumble CG 0:35 1 john11 [one of many] [root@john11 ~]# ps auxww | fgrep slurm root 351901 0.0 0.0 355000 1844 ? Sl 13:51 0:01 /apps/slurm/latest/sbin/slurmd root 413204 0.0 0.0 306140 1536 ? Sl 21:14 0:00 slurmstepd: [14200.extern] root 413214 0.0 0.0 430408 1592 ? Sl 21:14 0:00 slurmstepd: [14200.0] [root@john11 ~]# scontrol show slurmd scontrol: error: slurm_slurmd_info: Socket timed out on send/recv operation slurm_load_slurmd_status: Socket timed out on send/recv operation pstree -a shows: ├─slurmd │ └─4*[{slurmd}] ├─slurmstepd │ ├─sleep 1000000 │ └─4*[{slurmstepd}] ├─slurmstepd │ └─7*[{slurmstepd}] slurmd stack according to pstack: [root@john11 ~]# pstack 351901 Thread 5 (Thread 0x2b50d9cb3700 (LWP 413551)): #0 0x00002b50dabe170d in read () from /usr/lib64/libpthread.so.0 #1 0x00002b50da52007e in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=jobid@entry=14200, stepid=stepid@entry=0, protocol_version=protocol_version@entry=0x2b50d9cb2bde) at stepd_api.c:281 #2 0x0000000000415e0e in _signal_jobstep (jobid=jobid@entry=14200, stepid=0, signal=signal@entry=18, flags=flags@entry=0) at req.c:3217 #3 0x0000000000415f13 in _kill_all_active_steps (jobid=14200, sig=sig@entry=18, flags=flags@entry=0, batch=batch@entry=true) at req.c:4467 #4 0x0000000000417d0a in _rpc_terminate_job (msg=msg@entry=0x2b50e0018520) at req.c:5414 #5 0x00000000004197ac in slurmd_req (msg=msg@entry=0x2b50e0018520) at req.c:355 #6 0x000000000040c671 in _service_connection (arg=<optimized out>) at slurmd.c:544 #7 0x00002b50dabdae25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00002b50daee734d in clone () from /usr/lib64/libc.so.6 Thread 4 (Thread 0x2b50de069700 (LWP 414724)): #0 0x00002b50dabe170d in read () from /usr/lib64/libpthread.so.0 #1 0x00002b50da52007e in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=14200, stepid=0, protocol_version=protocol_version@entry=0x2b50e40019f8) at stepd_api.c:281 #2 0x000000000040d9ac in _fill_registration_msg (msg=0x2b50e400d930) at slurmd.c:688 #3 send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:565 #4 0x000000000041999c in slurmd_req (msg=msg@entry=0x2b50e400d890) at req.c:386 #5 0x000000000040c671 in _service_connection (arg=<optimized out>) at slurmd.c:544 #6 0x00002b50dabdae25 in start_thread () from /usr/lib64/libpthread.so.0 #7 0x00002b50daee734d in clone () from /usr/lib64/libc.so.6 Thread 3 (Thread 0x2b50df565700 (LWP 416434)): #0 0x00002b50dabe170d in read () from /usr/lib64/libpthread.so.0 #1 0x00002b50da52007e in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=14200, stepid=0, protocol_version=protocol_version@entry=0x2b50e80112f8) at stepd_api.c:281 #2 0x000000000040d9ac in _fill_registration_msg (msg=0x2b50e80008d0) at slurmd.c:688 #3 send_registration_msg (status=status@entry=0, startup=startup@entry=true) at slurmd.c:565 #4 0x000000000041999c in slurmd_req (msg=msg@entry=0x2b50e8000ae0) at req.c:386 #5 0x000000000040c671 in _service_connection (arg=<optimized out>) at slurmd.c:544 #6 0x00002b50dabdae25 in start_thread () from /usr/lib64/libpthread.so.0 #7 0x00002b50daee734d in clone () from /usr/lib64/libc.so.6 Thread 2 (Thread 
0x2b50df666700 (LWP 417058)): #0 0x00002b50dabe170d in read () from /usr/lib64/libpthread.so.0 #1 0x00002b50da52007e in stepd_connect (directory=<optimized out>, nodename=<optimized out>, jobid=14200, stepid=0, protocol_version=protocol_version@entry=0x2b50ec010db8) at stepd_api.c:281 #2 0x000000000041a91f in _get_step_list () at req.c:3475 #3 _rpc_daemon_status (msg=0x2b50ec0008d0) at req.c:3526 #4 slurmd_req (msg=msg@entry=0x2b50ec0008d0) at req.c:426 #5 0x000000000040c671 in _service_connection (arg=<optimized out>) at slurmd.c:544 #6 0x00002b50dabdae25 in start_thread () from /usr/lib64/libpthread.so.0 #7 0x00002b50daee734d in clone () from /usr/lib64/libc.so.6 Thread 1 (Thread 0x2b50d9bafa00 (LWP 351901)): #0 0x00002b50dabe198d in accept () from /usr/lib64/libpthread.so.0 #1 0x00002b50da4e3197 in slurm_accept_msg_conn (fd=<optimized out>, addr=<optimized out>) at slurm_protocol_socket_implementation.c:454 #2 0x000000000040f16f in _msg_engine () at slurmd.c:437 #3 main (argc=1, argv=0x7fff09133838) at slurmd.c:373 slurmstepd for .0: [root@john11 ~]# pstack 413214 Thread 8 (Thread 0x2b836469d700 (LWP 413219)): #0 0x00002b83600fda3d in poll () from /usr/lib64/libc.so.6 #1 0x00002b835ef10737 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2b83680008d0) at eio.c:364 #2 eio_handle_mainloop (eio=0x1fa91f0) at eio.c:328 #3 0x000000000041ca78 in _msg_thr_internal (job_arg=0x1f61850) at req.c:245 #4 0x00002b835fdfbe25 in start_thread () from /usr/lib64/libpthread.so.0 #5 0x00002b836010834d in clone () from /usr/lib64/libc.so.6 Thread 7 (Thread 0x2b83657e2700 (LWP 413220)): #0 0x00002b83600fda3d in poll () from /usr/lib64/libc.so.6 #1 0x00002b835ef10737 in _poll_internal (shutdown_time=<optimized out>, nfds=3, pfds=0x2b836c000ac0) at eio.c:364 #2 eio_handle_mainloop (eio=0x2b836c0008d0) at eio.c:328 #3 0x00002b83652cb31c in _agent (unused=<optimized out>) at agent.c:328 #4 0x00002b835fdfbe25 in start_thread () from /usr/lib64/libpthread.so.0 #5 0x00002b836010834d in clone () from /usr/lib64/libc.so.6 Thread 6 (Thread 0x2b83659e4700 (LWP 413222)): #0 0x00002b8360115eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6 #1 0x00002b836017760d in _L_lock_27 () from /usr/lib64/libc.so.6 #2 0x00002b83601775bd in arena_thread_freeres () from /usr/lib64/libc.so.6 #3 0x00002b8360177662 in __libc_thread_freeres () from /usr/lib64/libc.so.6 #4 0x00002b835fdfbe38 in start_thread () from /usr/lib64/libpthread.so.0 #5 0x00002b836010834d in clone () from /usr/lib64/libc.so.6 Thread 5 (Thread 0x2b835ecce700 (LWP 413553)): #0 0x00002b8360115eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6 #1 0x00002b83600927d8 in _L_lock_1579 () from /usr/lib64/libc.so.6 #2 0x00002b836008aca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6 #3 0x00002b83600900fe in malloc () from /usr/lib64/libc.so.6 #4 0x00002b835ef0a878 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false, file=file@entry=0x2b835ef39992 "pack.c", line=line@entry=152, func=func@entry=0x2b835ef240a5 <__func__.7819> "init_buf") at xmalloc.c:86 #5 0x00002b835ee60bd0 in init_buf (size=16384) at pack.c:152 #6 0x000000000041c6ab in _handle_accept (arg=0x0) at req.c:384 #7 0x00002b835fdfbe25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00002b836010834d in clone () from /usr/lib64/libc.so.6 Thread 4 (Thread 0x2b836459c700 (LWP 414726)): #0 0x00002b8360115eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6 #1 0x00002b83600927d8 in _L_lock_1579 () from /usr/lib64/libc.so.6 #2 0x00002b836008aca0 in 
arena_get2.isra.3 () from /usr/lib64/libc.so.6 #3 0x00002b83600900fe in malloc () from /usr/lib64/libc.so.6 #4 0x00002b835ef0a878 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false, file=file@entry=0x2b835ef39992 "pack.c", line=line@entry=152, func=func@entry=0x2b835ef240a5 <__func__.7819> "init_buf") at xmalloc.c:86 #5 0x00002b835ee60bd0 in init_buf (size=16384) at pack.c:152 #6 0x000000000041c6ab in _handle_accept (arg=0x0) at req.c:384 #7 0x00002b835fdfbe25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00002b836010834d in clone () from /usr/lib64/libc.so.6 Thread 3 (Thread 0x2b83658e3700 (LWP 416436)): #0 0x00002b8360115eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6 #1 0x00002b83600927d8 in _L_lock_1579 () from /usr/lib64/libc.so.6 #2 0x00002b836008aca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6 #3 0x00002b83600900fe in malloc () from /usr/lib64/libc.so.6 #4 0x00002b835ef0a878 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false, file=file@entry=0x2b835ef39992 "pack.c", line=line@entry=152, func=func@entry=0x2b835ef240a5 <__func__.7819> "init_buf") at xmalloc.c:86 #5 0x00002b835ee60bd0 in init_buf (size=16384) at pack.c:152 #6 0x000000000041c6ab in _handle_accept (arg=0x0) at req.c:384 #7 0x00002b835fdfbe25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00002b836010834d in clone () from /usr/lib64/libc.so.6 Thread 2 (Thread 0x2b8363b7d700 (LWP 417060)): #0 0x00002b8360115eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6 #1 0x00002b83600927d8 in _L_lock_1579 () from /usr/lib64/libc.so.6 #2 0x00002b836008aca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6 #3 0x00002b83600900fe in malloc () from /usr/lib64/libc.so.6 #4 0x00002b835ef0a878 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false, file=file@entry=0x2b835ef39992 "pack.c", line=line@entry=152, func=func@entry=0x2b835ef240a5 <__func__.7819> "init_buf") at xmalloc.c:86 #5 0x00002b835ee60bd0 in init_buf (size=16384) at pack.c:152 #6 0x000000000041c6ab in _handle_accept (arg=0x0) at req.c:384 #7 0x00002b835fdfbe25 in start_thread () from /usr/lib64/libpthread.so.0 #8 0x00002b836010834d in clone () from /usr/lib64/libc.so.6 Thread 1 (Thread 0x2b835ebcbfc0 (LWP 413214)): #0 0x00002b835fdfcf57 in pthread_join () from /usr/lib64/libpthread.so.0 #1 0x00000000004106ef in _wait_for_io (job=0x1f61850) at mgr.c:2219 #2 job_manager (job=job@entry=0x1f61850) at mgr.c:1397 #3 0x000000000040c957 in main (argc=1, argv=0x7ffe3331f1f8) at slurmstepd.c:172 slurmstepd for the extern step: Thread 5 (Thread 0x2b36e3df5700 (LWP 413207)): #0 0x00002b36e4f26945 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #1 0x00002b36e3fab028 in _watch_tasks (arg=<optimized out>) at slurm_jobacct_gather.c:242 #2 0x00002b36e4f22e25 in start_thread () from /usr/lib64/libpthread.so.0 #3 0x00002b36e522f34d in clone () from /usr/lib64/libc.so.6 Thread 4 (Thread 0x2b36e96c3700 (LWP 413208)): #0 0x00002b36e51f61ad in nanosleep () from /usr/lib64/libc.so.6 #1 0x00002b36e5226ec4 in usleep () from /usr/lib64/libc.so.6 #2 0x00002b36e3fa5039 in _timer_thread (args=<optimized out>) at slurm_acct_gather_profile.c:189 #3 0x00002b36e4f22e25 in start_thread () from /usr/lib64/libpthread.so.0 #4 0x00002b36e522f34d in clone () from /usr/lib64/libc.so.6 Thread 3 (Thread 0x2b36e97c4700 (LWP 413209)): #0 0x00002b36e5224a3d in poll () from /usr/lib64/libc.so.6 #1 0x00002b36e4037737 in _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2b36ec0008d0) at eio.c:364 #2 
eio_handle_mainloop (eio=0x2297060) at eio.c:328 #3 0x000000000041ca78 in _msg_thr_internal (job_arg=0x2256ca0) at req.c:245 #4 0x00002b36e4f22e25 in start_thread () from /usr/lib64/libpthread.so.0 #5 0x00002b36e522f34d in clone () from /usr/lib64/libc.so.6 Thread 2 (Thread 0x2b36ea6f2700 (LWP 413210)): #0 0x00002b36e5224a3d in poll () from /usr/lib64/libc.so.6 #1 0x00002b36e99cd9a2 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:466 #2 0x00002b36e4f22e25 in start_thread () from /usr/lib64/libpthread.so.0 #3 0x00002b36e522f34d in clone () from /usr/lib64/libc.so.6 Thread 1 (Thread 0x2b36e3cf2fc0 (LWP 413204)): #0 0x00002b36e51f5e6a in wait4 () from /usr/lib64/libc.so.6 #1 0x000000000040ffe6 in _spawn_job_container (job=0x2256ca0) at mgr.c:1107 #2 job_manager (job=job@entry=0x2256ca0) at mgr.c:1216 #3 0x000000000040c957 in main (argc=1, argv=0x7ffdfe65e528) at slurmstepd.c:172 Thanks, Chris
Would you mind attaching slurmd.log and slurmctld.log so that I can see the messages from a bit before the job started until the current state?
Also, was job 14200 a batch job, an salloc/srun one, or an adopted incoming ssh session? Can you grab the information related to its submission? The parameters and, potentially, the batch script.
On 08/02/18 03:50, bugs@schedmd.com wrote: > Also was job 14200 a batch job? salloc/srun one? or adopted incoming > ssh? Can you grab the information related to its submission? > paramters and potentially the batch script. As per initial report these are just batch jobs. Script for the one in question is: --------------------8< snip snip 8<-------------------- #!/bin/bash #SBATCH --nodes=8 #SBATCH --ntasks-per-node=36 #SBATCH --mem=180gb #SBATCH --time=02:0:00 #module load iomkl/2016.2 module load intel/2016.2.181-gcc-6.4.0 openmpi/3.0.0 mpirun xhpl --------------------8< snip snip 8<-------------------- OpenMPI appears to have been configured with: Configure command line: '--prefix=/apps/skylake/software/compiler/intel/2016.2.181-gcc-6.4.0/openmpi/3.0.0' '--enable-shared' '--enable-mpi-thread-multiple' '--with-verbs' '--enable-mpirun-prefix-by-default' '--with-hwloc=' '--disable-dlopen' '--with-slurm=/apps/slurm/latest' '--with-pmi=/apps/slurm/latest' '--with-ucx=/apps/skylake/software/core/ucx/1.2.1' So mpirun launches orted via srun. Job information with scontrol show job -ddd: JobId=14200 JobName=hpl.slurm UserId=rhumble(1040) GroupId=hpcadmin(10190) MCS_label=N/A Priority=18671 Nice=0 Account=hpcadmin QOS=normal JobState=COMPLETING Reason=None Dependency=(null) Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 DerivedExitCode=0:0 RunTime=00:00:35 TimeLimit=02:00:00 TimeMin=N/A SubmitTime=2018-02-07T20:18:38 EligibleTime=2018-02-07T20:18:38 StartTime=2018-02-07T21:14:40 EndTime=2018-02-07T21:15:15 Deadline=N/A PreemptTime=None SuspendTime=None SecsPreSuspend=0 LastSchedEval=2018-02-07T21:14:40 Partition=skylake AllocNode:Sid=farnarkle2:322324 ReqNodeList=(null) ExcNodeList=(null) NodeList=john11 BatchHost=john6 NumNodes=1 NumCPUs=36 NumTasks=288 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=288,mem=1440G,node=8,billing=288 Socks/Node=* NtasksPerN:B:S:C=36:0:*:* CoreSpec=* Nodes=john[6-13] CPU_IDs=0-35 Mem=184320 GRES_IDX= MinCPUsNode=36 MinMemoryNode=180G MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 Gres=(null) Reservation=(null) OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=/home/rhumble/bench/hpl/8node/hpl.slurm WorkDir=/home/rhumble/bench/hpl/8node StdErr=/home/rhumble/bench/hpl/8node/slurm-14200.out StdIn=/dev/null StdOut=/home/rhumble/bench/hpl/8node/slurm-14200.out Power= Slurmctld logs for the job: [2018-02-07T20:18:38.470] _slurm_rpc_submit_batch_job: JobId=14200 InitPrio=18615 usec=366 [2018-02-07T21:14:40.747] sched: Allocate JobID=14200 NodeList=john[6-13] #CPUs=288 Partition=skylake [2018-02-07T21:15:15.956] _job_complete: JobID=14200 State=0x1 NodeCnt=8 WEXITSTATUS 0 [2018-02-07T21:15:15.956] _job_complete: JobID=14200 State=0x8003 NodeCnt=8 done [2018-02-07T21:16:40.883] Resending TERMINATE_JOB request JobId=14200 Nodelist=john11 [2018-02-07T21:17:40.973] Resending TERMINATE_JOB request JobId=14200 Nodelist=john11 [2018-02-07T21:18:40.067] Resending TERMINATE_JOB request JobId=14200 Nodelist=john11 [2018-02-07T21:19:40.672] Resending TERMINATE_JOB request JobId=14200 Nodelist=john11 Slurmd logs (sent to syslog) are: Feb 7 21:14:40 john11 slurmd[351901]: _run_prolog: run job script took usec=10344 Feb 7 21:14:40 john11 slurmd[351901]: _run_prolog: prolog with lock for job 14200 ran for 0 seconds Feb 7 21:14:40 john11 slurmstepd[413204]: task/cgroup: /slurm/uid_1040/job_14200: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 7 21:14:40 john11 slurmstepd[413204]: task/cgroup: /slurm/uid_1040/job_14200/step_extern: alloc=184320MB 
mem.limit=184320MB memsw.limit=202752MB Feb 7 21:14:41 john11 slurmd[351901]: launch task 14200.0 request from 1040.10190@192.168.44.106 (port 28894) Feb 7 21:14:41 john11 slurmstepd[413214]: task/cgroup: /slurm/uid_1040/job_14200: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 7 21:14:41 john11 slurmstepd[413214]: task/cgroup: /slurm/uid_1040/job_14200/step_0: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Just prior to that (no intervening messages) I see: Feb 7 21:13:40 john11 slurmd[351901]: _run_prolog: run job script took usec=10307 Feb 7 21:13:40 john11 slurmd[351901]: _run_prolog: prolog with lock for job 14187 ran for 0 seconds Feb 7 21:13:40 john11 slurmstepd[412824]: task/cgroup: /slurm/uid_1040/job_14187: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 7 21:13:40 john11 slurmstepd[412824]: task/cgroup: /slurm/uid_1040/job_14187/step_extern: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 7 21:13:40 john11 slurmd[351901]: launch task 14187.0 request from 1040.10190@192.168.44.106 (port 4318) Feb 7 21:13:40 john11 slurmstepd[412834]: task/cgroup: /slurm/uid_1040/job_14187: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 7 21:13:40 john11 slurmstepd[412834]: task/cgroup: /slurm/uid_1040/job_14187/step_0: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 7 21:14:14 john11 slurmstepd[412834]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused Feb 7 21:14:14 john11 slurmstepd[412834]: done with job Feb 7 21:14:15 john11 slurmstepd[412824]: _oom_event_monitor: oom-kill event count: 1 Feb 7 21:14:15 john11 slurmstepd[412824]: done with job So that's one that cleaned up properly from the look of it. cheers, Chris
Here's the WCHAN info for the Slurm processes, in case that's of use.

[root@john11 ~]# ps -Ao pid,cmd:30,wchan | grep slurm
351901 /apps/slurm/latest/sbin/slurmd inet_csk_accept
413204 slurmstepd: [14200.extern] do_wait
413214 slurmstepd: [14200.0] futex_wait_queue_me
On 08/02/18 09:40, Christopher Samuel wrote: > OpenMPI appears to have been configured with: > > Configure command line: > '--prefix=/apps/skylake/software/compiler/intel/2016.2.181-gcc-6.4.0/openmpi/3.0.0' > '--enable-shared' '--enable-mpi-thread-multiple' '--with-verbs' > '--enable-mpirun-prefix-by-default' '--with-hwloc=' '--disable-dlopen' > '--with-slurm=/apps/slurm/latest' '--with-pmi=/apps/slurm/latest' > '--with-ucx=/apps/skylake/software/core/ucx/1.2.1' On a hunch I requested a duplicate build with the only change being to drop the final --with-ucx=[...] build option. Removing that from OpenMPI seems to make the issue go away when we run the same binary with the different stack. Now *why* enabling UCX in OpenMPI would cause slurmstepd to occasionally get hung up is another question and one I don't have any idea about. So we can work around it with this, but it does still appear to be an issue with Slurm if it can't clear up after these. Hope this helps your investigation! All the best, Chris
(In reply to Christopher Samuel from comment #13) > On 08/02/18 09:40, Christopher Samuel wrote: > > > OpenMPI appears to have been configured with: > > > > Configure command line: > > '--prefix=/apps/skylake/software/compiler/intel/2016.2.181-gcc-6.4.0/openmpi/3.0.0' > > '--enable-shared' '--enable-mpi-thread-multiple' '--with-verbs' > > '--enable-mpirun-prefix-by-default' '--with-hwloc=' '--disable-dlopen' > > '--with-slurm=/apps/slurm/latest' '--with-pmi=/apps/slurm/latest' > > '--with-ucx=/apps/skylake/software/core/ucx/1.2.1' > > On a hunch I requested a duplicate build with the only change being to > drop the final --with-ucx=[...] build option. > > Removing that from OpenMPI seems to make the issue go away > when we run the same binary with the different stack. > > Now *why* enabling UCX in OpenMPI would cause slurmstepd to > occasionally get hung up is another question and one I don't > have any idea about. > > So we can work around it with this, but it does still appear > to be an issue with Slurm if it can't clear up after these. > > Hope this helps your investigation! > > All the best, > Chris Interesting. Are you using pmi1, pmi2 or pmix? There's a commit adding notes to the Slurm mpi guide webpage: https://slurm.schedmd.com/mpi_guide.html on how to build Slurm with PMIx, but it seems the changes have not yet been pushed to the website. I'll ask a team colleague to refresh the web so you can read them from the web instead that from the commit patches: https://github.com/SchedMD/slurm/commit/34209c471a29aeb5cf44e35 and continuation commit: https://github.com/SchedMD/slurm/commit/baa4f239eb6e732321bddb8 There's also a very useful how-to documented by the PMIx developers themselves that can be found here: https://pmix.org/support/how-to/slurm-support/ This document includes a "Using the UCX Extension". I'm wondering if you built all the software pieces so that they are compatible with each other and using the right config options. Did you configure Slurm with --with-ucx=path-to-UCX-installation? There's also a bunch of SLURM_ variables enabling Direct-connect, Early-wireup and also able to specify UCX network devices that come into play here. Would you mind reviewing all these guidelines' recommendations and procedures and compare them against the way you built Slurm, PMI[1|2|x] and OpenMPI? Thanks.
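To make the layering in those guides concrete, here is a rough sketch of the build order they describe. Every prefix and version below is a placeholder rather than an actual site path, and the definitive option set should be taken from the linked documentation, not from this snippet:

# Illustrative only -- placeholder prefixes/versions, standard autotools flow.
# 1) UCX
cd ucx-1.2.1     && ./configure --prefix=/opt/ucx  && make -j && make install
# 2) PMIx (the external PMIx that both Slurm and Open MPI will link against)
cd pmix-2.0      && ./configure --prefix=/opt/pmix && make -j && make install
# 3) Slurm, with PMIx and UCX support enabled as the how-to describes
cd slurm-17.11   && ./configure --prefix=/apps/slurm/17.11 \
                        --with-pmix=/opt/pmix --with-ucx=/opt/ucx \
                 && make -j && make install
# 4) Open MPI, pointed at Slurm and at the same external PMIx and UCX
cd openmpi-3.0.0 && ./configure --prefix=/opt/openmpi \
                        --with-slurm --with-pmix=/opt/pmix --with-ucx=/opt/ucx \
                 && make -j && make install
# Jobs would then request the PMIx plugin explicitly, e.g.:
#   srun --mpi=pmix ./xhpl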
Also, are you using the same compiler to build the different software components? And which version of libc are you using? Thanks!
Christopher, we're interested in your glibc version because we've received reports of similar stack traces in Bug 4690, and it appears to be caused by a bug in glibc's malloc arena locks, where there seems to be a deadlock between malloc and fork. This bug is described in the following link:

https://bugzilla.redhat.com/show_bug.cgi?id=906468

Fixed In Version: glibc-2.17-162.el7
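As a quick sanity check on a node, something like the following rough sketch works on RHEL/CentOS; it assumes the packagers referenced that bugzilla number in the package changelog, which Red Hat normally does:

# Show the installed glibc and look for the arena-lock fix (RHBZ#906468)
# in its changelog; no match suggests the package predates the fix.
rpm -q glibc
rpm -q --changelog glibc | grep -m1 906468 \
    || echo "no mention of RHBZ#906468 in the glibc changelog"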
Hi Alejandro, On Saturday, 10 February 2018 2:41:11 AM AEDT bugs@schedmd.com wrote: > Christopher, we're interested in your glibc version since we've been > reported with similar stack traces in Bug 4690, and it appears to be caused > by a bug in glibc malloc arena locks, where there seems to be a deadlock > between malloc and fork. This bug is described in the following link: Sorry for the delay, I don't work Fridays so I didn't see your replies. I'm back in the office on Monday so I'll look further but we are running RHEL7 and the glibc version on compute nodes is: glibc-2.17-196.el7_4.2.x86_64 So that's meant to have those bug fixes. :-/ All the best, Chris
Hi Alejandro, On 08/02/18 22:28, bugs@schedmd.com wrote: > Interesting. Are you using pmi1, pmi2 or pmix? PMI2, I had tried out PMIx last year, without success. > This document includes a "Using the UCX Extension". Thanks, will take a look. > I'm wondering if you built all the software pieces so that they are compatible > with each other and using the right config options. Did you configure Slurm > with --with-ucx=path-to-UCX-installation? Hmm, I used to have that when testing, but I think I removed that before starting to do production builds. So no, we don't have UCX support enabled in Slurm! :-/ > There's also a bunch of SLURM_ variables enabling Direct-connect, Early-wireup > and also able to specify UCX network devices that come into play here. > > Would you mind reviewing all these guidelines' recommendations and procedures > and compare them against the way you built Slurm, PMI[1|2|x] and OpenMPI? Not a problem - thanks for this. All the best, Chris
On 09/02/18 04:15, bugs@schedmd.com wrote: > Also are you using the same compiler to build the different soft components? We use the RHEL7 GCC for Slurm, but of course OpenMPI has to be built with multiple compilers because the Fortran F90 .mod file is compiler implementation dependent. The version that's being tested is built with the Intel compilers because that's our expectation of what the users building code will mostly be using. All the best, Chris
On 08/02/18 18:05, Christopher Samuel wrote: > Now *why* enabling UCX in OpenMPI would cause slurmstepd to > occasionally get hung up is another question and one I don't > have any idea about. I've now seen a non-UCX OpenMPI build cause a job to get stuck in CG, but this time it's the extern step (the sleep is gone) and this ptrace output makes it look like we might have multiple calls to log_msg() be causing this deadlock. [root@john21 ~]# pstack 366079 Thread 4 (Thread 0x2b9103abc700 (LWP 366084)): #0 0x00002b90ff22142d in __lll_lock_wait () from /usr/lib64/libpthread.so.0 #1 0x00002b90ff21cdcb in _L_lock_812 () from /usr/lib64/libpthread.so.0 #2 0x00002b90ff21cc98 in pthread_mutex_lock () from /usr/lib64/libpthread.so.0 #3 0x00002b90fe277771 in log_msg (level=level@entry=LOG_LEVEL_DEBUG3, fmt=fmt@entry=0x42c853 "Called _msg_socket_accept", args=args@entry=0x2b9103abbb78) at log.c:959 #4 0x00002b90fe278d14 in debug3 (fmt=fmt@entry=0x42c853 "Called _msg_socket_accept") at log.c:1185 #5 0x0000000000416c59 in _msg_socket_accept (obj=0x13195f0, objs=<optimized out>) at req.c:321 #6 0x00002b90fe32f928 in _poll_handle_event (objList=0x1324560, obj=0x13195f0, revents=1) at eio.c:495 #7 _poll_dispatch (objList=0x1324560, map=0x2b9104000900, nfds=1, pfds=0x2b91040008d0) at eio.c:447 #8 eio_handle_mainloop (eio=0x135b1a0) at eio.c:335 #9 0x000000000041ca78 in _msg_thr_internal (job_arg=0x131bfd0) at req.c:245 #10 0x00002b90ff21ae25 in start_thread () from /usr/lib64/libpthread.so.0 #11 0x00002b90ff52734d in clone () from /usr/lib64/libc.so.6 Thread 3 (Thread 0x2b9103fd6700 (LWP 366085)): #0 0x00002b90ff534eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6 #1 0x00002b90ff4b18c8 in _L_lock_2251 () from /usr/lib64/libc.so.6 #2 0x00002b90ff4aa0ef in arena_get2.isra.3 () from /usr/lib64/libc.so.6 #3 0x00002b90ff4afbca in calloc () from /usr/lib64/libc.so.6 #4 0x00002b90fe329871 in slurm_xmalloc (size=30, clear=clear@entry=true, file=file@entry=0x2b90fe35e0fe "xstring.c", line=line@entry=104, func=func@entry=0x2b90fe35e1bd <__func__.4940> "makespace") at xmalloc.c:84 #5 0x00002b90fe329c74 in makespace (str=str@entry=0x2b9103fd5bd8, needed=<optimized out>) at xstring.c:104 #6 0x00002b90fe329e35 in _xstrcat (str1=str1@entry=0x2b9103fd5bd8, str2=<optimized out>, str2@entry=0x2b9103cc9019 "%s: oom-kill event count: %lu") at xstring.c:135 #7 0x00002b90fe2773c7 in vxstrfmt (fmt=fmt@entry=0x2b9103cc9019 "%s: oom-kill event count: %lu", ap=ap@entry=0x2b9103fd5cb8) at log.c:827 #8 0x00002b90fe2779f3 in log_msg (level=level@entry=LOG_LEVEL_INFO, fmt=fmt@entry=0x2b9103cc9019 "%s: oom-kill event count: %lu", args=args@entry=0x2b9103fd5cb8) at log.c:1038 #9 0x00002b90fe278aa4 in info (fmt=fmt@entry=0x2b9103cc9019 "%s: oom-kill event count: %lu") at log.c:1149 #10 0x00002b9103cc5aa1 in _oom_event_monitor (x=<optimized out>) at task_cgroup_memory.c:489 #11 0x00002b90ff21ae25 in start_thread () from /usr/lib64/libpthread.so.0 #12 0x00002b90ff52734d in clone () from /usr/lib64/libc.so.6 Thread 2 (Thread 0x2b9108b14700 (LWP 366390)): #0 0x00002b90ff534eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6 #1 0x00002b90ff59660d in _L_lock_27 () from /usr/lib64/libc.so.6 #2 0x00002b90ff5965bd in arena_thread_freeres () from /usr/lib64/libc.so.6 #3 0x00002b90ff596662 in __libc_thread_freeres () from /usr/lib64/libc.so.6 #4 0x00002b90ff21ae38 in start_thread () from /usr/lib64/libpthread.so.0 #5 0x00002b90ff52734d in clone () from /usr/lib64/libc.so.6 Thread 1 (Thread 0x2b90fdfeafc0 (LWP 366079)): #0 
0x00002b90ff22142d in __lll_lock_wait () from /usr/lib64/libpthread.so.0 #1 0x00002b90ff21cdcb in _L_lock_812 () from /usr/lib64/libpthread.so.0 #2 0x00002b90ff21cc98 in pthread_mutex_lock () from /usr/lib64/libpthread.so.0 #3 0x00002b90fe277771 in log_msg (level=level@entry=LOG_LEVEL_DEBUG2, fmt=fmt@entry=0x42faad "%s: rmdir(%s): %m", args=args@entry=0x7ffdc3da1348) at log.c:959 #4 0x00002b90fe278c78 in debug2 (fmt=fmt@entry=0x42faad "%s: rmdir(%s): %m") at log.c:1176 #5 0x0000000000427c77 in xcgroup_delete (cg=cg@entry=0x2b91038b6640 <user_memory_cg>) at xcgroup.c:493 #6 0x00002b91036ad5e8 in jobacct_gather_cgroup_memory_fini (slurm_cgroup_conf=slurm_cgroup_conf@entry=0x2b91038b2420 <slurm_cgroup_conf>) at jobacct_gather_cgroup_memory.c:155 #7 0x00002b91036aca58 in fini () at jobacct_gather_cgroup.c:261 #8 0x00002b90fe2862bf in plugin_unload (plug=0x13218d0) at plugin.c:305 #9 0x00002b90fe2866ee in plugin_context_destroy (c=0x1319450) at plugin.c:477 #10 0x00002b90fe2a3674 in jobacct_gather_fini () at slurm_jobacct_gather.c:335 #11 0x00002b90fe29bcd7 in acct_gather_profile_fini () at slurm_acct_gather_profile.c:252 #12 0x0000000000410062 in _spawn_job_container (job=0x131bfd0) at mgr.c:1124 #13 job_manager (job=job@entry=0x131bfd0) at mgr.c:1216 #14 0x000000000040c957 in main (argc=1, argv=0x7ffdc3da27b8) at slurmstepd.c:172 Now I'm not sure this is the same scenario, I'll go and look at the previous stack traces I've put in to see if it's plausible there too. cheers, Chris
On 12/02/18 09:31, Christopher Samuel wrote:

> I've now seen a non-UCX OpenMPI build cause a job to get stuck in CG,
> but this time it's the extern step (the sleep is gone) and this ptrace
> output makes it look like we might have multiple calls to log_msg() be
> causing this deadlock.

This does look like a different issue; the stack trace is quite different to what we were seeing before. Looking back at the earlier trace I can see why you'd suspect a glibc bug in fork() (which is implemented via clone()) was implicated, as the only slurmstepd calls there are main() and then wait_for_io().

All the best,
Chris
From this second backtrace in comment 22, it looks like:

Thread 4 waits for the log_msg() lock:
  ... log_msg --> ... __lll_lock_wait

Thread 3 owns the log_msg() lock, but waits for the malloc() arena one:
  ... log_msg --> acquires the lock ... slurm_xmalloc --> ... __lll_lock_wait_private

Thread 2 waits on the glibc arena lock after clone():
  start_thread --> ... arena_thread_freeres ... __lll_lock_wait_private

Thread 1 waits on the log_msg lock after the jobacct_gather/cgroup memory cleanup:
  jobacct_gather_cgroup_memory_fini --> xcgroup_delete --> debug2 -->
  (since it can't rmdir() the cgroup, it attempts to log_msg())
  log_msg --> pthread_mutex_lock

To me it looks like Thread 2 waits for 3 and vice versa (thus the deadlock), while 1 and 4 wait for 3, which owns the log_msg lock. I'll keep investigating whether there are newer bugs related to these locks in glibc.

- Can you reproduce at will? If so, could you indicate the steps?
- The Slurm webpage is now refreshed with the PMIx docs, but that doesn't look related: https://slurm.schedmd.com/mpi_guide.html#pmix
- In this second backtrace, was that extern stepd the only one stuck, or were there more? If so, could you attach the backtraces from those too?
- Maybe the slurm.conf/cgroup.conf and the log messages from slurmd.log around the life-cycle of the step with pid 366079 would be helpful too.

Thanks.
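If it helps to confirm that reading on a live hung stepd, a rough gdb recipe follows; the symbol name log_lock is only an assumption based on src/common/log.c, and the owner field layout is glibc/NPTL specific:

# Attach to the stuck slurmstepd, dump every thread's stack, and print the
# logging mutex so its __data.__owner (an LWP id) can be matched to a thread.
gdb -batch -p <slurmstepd_pid> \
    -ex 'set pagination off' \
    -ex 'thread apply all bt' \
    -ex "print 'log.c'::log_lock"   # assumed name of the static mutex in log.c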
Hi Alejandro, On Monday, 12 February 2018 10:04:40 PM AEDT bugs@schedmd.com wrote: > From this second backtrace in comment 22, it looks like: > I'll keep investigating whether there are newer bugs relate to these locks > in glibc. Thanks, I've not found anything obvious but I didn't have a lot of time to pursue that. > - Can you reproduce at will? If so could you indicate the steps? It doesn't happen all the time, my colleague is submitting a lot of HPL jobs that take about 30 seconds each (they're GPU ones) using OpenMPI that was built with UCX support. We get a handfull of them that hit this deadlock (so it seems to be a race condition of some form). > - Slurm webpage is refreshed with pmix docs, but doesn't look related: > https://slurm.schedmd.com/mpi_guide.html#pmix Thanks. > - In this second backtrace, that extern stepd was the only one stuck or > there were more? If so, could you attach the backtrace from those too? No, that was all there was for that job, just the extern slurmstepd. > - Maybe the slurm.conf/cgroup.conf and log messages from slurmd.log around > the life-cycle of step with pid 366079 would be helpful too. Here's cgroup.conf: --------------------------8< snip snip 8<-------------------------- ### # # Slurm cgroup support configuration file # # See man slurm.conf and man cgroup.conf for further # information on cgroup configuration parameters #-- ### CgroupAutomount=yes ConstrainCores=yes TaskAffinity=no ConstrainRAMSpace=yes ConstrainSwapSpace=yes AllowedRAMSpace=100 AllowedSwapSpace=10 # Disable kmem cgroup due to OPA specific issue ConstrainKmemSpace=no # CS; Wed Feb 7 09:44:13 AEDT 2018 # Constrain devices via cgroups (to limits access to GPUs etc) ConstrainDevices=yes --------------------------8< snip snip 8<-------------------------- Here are all the log messages from that node for that job and a few earlier ones: Feb 12 01:19:50 john21 slurmd[350152]: _run_prolog: run job script took usec=36671 Feb 12 01:19:50 john21 slurmd[350152]: _run_prolog: prolog with lock for job 19439 ran for 0 seconds Feb 12 01:19:50 john21 slurmstepd[365368]: task/cgroup: /slurm/uid_1040/job_19439: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 12 01:19:50 john21 slurmstepd[365368]: task/cgroup: /slurm/uid_1040/job_19439/step_extern: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 12 01:19:51 john21 slurmd[350152]: launch task 19439.0 request from 1040.10190@192.168.44.114 (port 45267) Feb 12 01:19:51 john21 slurmstepd[365379]: task/cgroup: /slurm/uid_1040/job_19439: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 12 01:19:51 john21 slurmstepd[365379]: task/cgroup: /slurm/uid_1040/job_19439/step_0: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 12 01:20:20 john21 slurmstepd[365379]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused Feb 12 01:20:20 john21 slurmstepd[365379]: done with job Feb 12 01:20:21 john21 slurmstepd[365368]: _oom_event_monitor: oom-kill event count: 1 Feb 12 01:20:21 john21 slurmstepd[365368]: done with job Feb 12 01:20:50 john21 slurmd[350152]: _run_prolog: run job script took usec=35217 Feb 12 01:20:50 john21 slurmd[350152]: _run_prolog: prolog with lock for job 19452 ran for 0 seconds Feb 12 01:20:50 john21 slurmstepd[365724]: task/cgroup: /slurm/uid_1040/job_19452: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 12 01:20:50 john21 slurmstepd[365724]: task/cgroup: /slurm/uid_1040/job_19452/step_extern: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 12 01:20:50 john21 slurmd[350152]: 
launch task 19452.0 request from 1040.10190@192.168.44.114 (port 6356) Feb 12 01:20:50 john21 slurmstepd[365734]: task/cgroup: /slurm/uid_1040/job_19452: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 12 01:20:50 john21 slurmstepd[365734]: task/cgroup: /slurm/uid_1040/job_19452/step_0: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 12 01:21:19 john21 slurmstepd[365734]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused Feb 12 01:21:19 john21 slurmstepd[365734]: done with job Feb 12 01:21:20 john21 slurmstepd[365724]: _oom_event_monitor: oom-kill event count: 1 Feb 12 01:21:20 john21 slurmstepd[365724]: done with job Feb 12 01:21:50 john21 slurmd[350152]: _run_prolog: run job script took usec=38899 Feb 12 01:21:50 john21 slurmd[350152]: _run_prolog: prolog with lock for job 19465 ran for 0 seconds Feb 12 01:21:50 john21 slurmstepd[366079]: task/cgroup: /slurm/uid_1040/job_19465: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 12 01:21:50 john21 slurmstepd[366079]: task/cgroup: /slurm/uid_1040/job_19465/step_extern: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 12 01:21:50 john21 slurmd[350152]: launch task 19465.0 request from 1040.10190@192.168.44.114 (port 32980) Feb 12 01:21:50 john21 slurmstepd[366089]: task/cgroup: /slurm/uid_1040/job_19465: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 12 01:21:50 john21 slurmstepd[366089]: task/cgroup: /slurm/uid_1040/job_19465/step_0: alloc=184320MB mem.limit=184320MB memsw.limit=202752MB Feb 12 01:22:20 john21 slurmstepd[366089]: error: Failed to send MESSAGE_TASK_EXIT: Connection refused Feb 12 01:22:20 john21 slurmstepd[366089]: done with job Hope that helps! All the best, Chris
On 12/02/18 22:26, Chris Samuel wrote:

> It doesn't happen all the time, my colleague is submitting a lot of HPL jobs
> that take about 30 seconds each (they're GPU ones) using OpenMPI that was
> built with UCX support. We get a handfull of them that hit this deadlock (so
> it seems to be a race condition of some form).

We had a couple of jobs hit the same problem with OpenMPI configured without UCX (so just using Slurm's usual PMI2 support, nothing fancy).

Unfortunately I did an 'scontrol reconfigure' to make a change to the active PrologFlags to enable X11 forwarding, and that caused both hung jobs to get cleaned up. I had grabbed a copy of the job script first:

-----------------------8< snip snip 8<-----------------------
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=36
#SBATCH --mem=180gb
#SBATCH --time=02:0:00

#module load iomkl/2016.2
#module load intel/2016.2.181-gcc-6.4.0 openmpi/3.0.0
module load intel/2016.2.181-gcc-6.4.0 openmpi-no-ucx/3.0.0

mpirun xhpl
-----------------------8< snip snip 8<-----------------------

I had taken a stack trace yesterday but lost it when my laptop got into a weird state overnight. :-( I'm pretty sure it was the same as we had seen before though.

So it might be that the race is still present, just easier to hit with UCX enabled.

All the best,
Chris
You mean that after setting PrologFlags=X11 the hung jobs stopped being stuck and finally got cleaned up, and you're not seeing stuck jobs anymore with this flag set? I'm asking since the X11 flag implicitly enables Contain, and consequently Alloc, and the stepd behavior is different from regular steps.
On 20/02/18 22:12, bugs@schedmd.com wrote: > You mean that after setting PrologFlags=X11 the hung jobs to stop being stuck > and got finally cleaned up, and you're not seeing stuck jobs anymore with this > flag set? I'm asking this since X11 flag implicitly enables Contain and thus > Alloc consequently, and the stepd behavior is different from regular steps. No, we're still seeing jobs that get stuck in CG state. Saw one that was with OpenMPI 1.10.1 so definitely not UCX related as there was no support in that version. My guess is that it's an odd race condition. I'm not at work at the moment, but will upload what I caught from that hang tomorrow when I'm in the office. All the best, Chris
Created attachment 6197 [details] ompi-1.10.7-lockup.tar.gz On 20/02/18 22:14, Christopher Samuel wrote: > I'm not at work at the moment, but will upload what I caught from that > hang tomorrow when I'm in the office. Here's the logs I promised last night. I got the OMPI version number slightly wrong by memory last night, it was 1.10.7 not 1.10.1. The tarball includes the output from ompi_info too in case that helps. All the best, Chris
FYI, I just replicated a similar issue on a Cray system. Relevant thread trace from the hung slurmstepd:

Thread 4 (Thread 0x2aaab1edb700 (LWP 171580)):
#0  0x00002aaaabdf201b in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00002aaaabd76b72 in arena_get2.isra () from /lib64/libc.so.6
#2  0x00002aaaabd77412 in malloc () from /lib64/libc.so.6
#3  0x00002aaaaae1cf48 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false, file=file@entry=0x2aaaaae4c512 "pack.c", line=line@entry=152, func=func@entry=0x2aaaaae36a05 <__func__.8403> "init_buf") at xmalloc.c:86
#4  0x00002aaaaad727f0 in init_buf (size=16384) at pack.c:152
#5  0x000000000041d296 in _handle_accept (arg=0x0) at req.c:384
#6  0x00002aaaab090724 in start_thread () from /lib64/libpthread.so.0
#7  0x00002aaaabde5c1d in clone () from /lib64/libc.so.6

It looks like it is hanging on a malloc while attempting to process an RPC (sadly, probably the RPC to terminate the step...).
Sorry, full backtrace:

Thread 4 (Thread 0x2aaab1edb700 (LWP 171580)):
#0  0x00002aaaabdf201b in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00002aaaabd76b72 in arena_get2.isra () from /lib64/libc.so.6
#2  0x00002aaaabd77412 in malloc () from /lib64/libc.so.6
#3  0x00002aaaaae1cf48 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false, file=file@entry=0x2aaaaae4c512 "pack.c", line=line@entry=152, func=func@entry=0x2aaaaae36a05 <__func__.8403> "init_buf") at xmalloc.c:86
#4  0x00002aaaaad727f0 in init_buf (size=16384) at pack.c:152
#5  0x000000000041d296 in _handle_accept (arg=0x0) at req.c:384
#6  0x00002aaaab090724 in start_thread () from /lib64/libpthread.so.0
#7  0x00002aaaabde5c1d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x2aaab3d4d700 (LWP 170560)):
#0  0x00002aaaabdf201b in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00002aaaabe4a494 in arena_thread_freeres () from /lib64/libc.so.6
#2  0x00002aaaabe4a552 in __libc_thread_freeres () from /lib64/libc.so.6
#3  0x00002aaaab09073c in start_thread () from /lib64/libpthread.so.0
#4  0x00002aaaabde5c1d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x2aaab1fdc700 (LWP 170557)):
#0  0x00002aaaabddd37d in poll () from /lib64/libc.so.6
#1  0x00002aaaaae230b7 in poll (__timeout=-1, __nfds=<optimized out>, __fds=0x2aaab40008d0) at /usr/include/bits/poll2.h:46
#2  _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x2aaab40008d0) at eio.c:364
#3  eio_handle_mainloop (eio=0x6cf6f0) at eio.c:328
#4  0x000000000041d678 in _msg_thr_internal (job_arg=0x6bd550) at req.c:245
#5  0x00002aaaab090724 in start_thread () from /lib64/libpthread.so.0
#6  0x00002aaaabde5c1d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2aaaaab34ac0 (LWP 170554)):
#0  0x00002aaaab0919ad in pthread_join () from /lib64/libpthread.so.0
#1  0x0000000000410fce in _wait_for_io (job=0x6bd550) at mgr.c:2219
#2  job_manager (job=job@entry=0x6bd550) at mgr.c:1397
#3  0x000000000040d17a in main (argc=1, argv=0x7fffffffedd8) at slurmstepd.c:172
(gdb)

It looks like threads 3 and 4 may both be involved (similar to the earlier discussion).
I'm wondering if this fix has something to do with these deadlocks:

https://bugzilla.redhat.com/show_bug.cgi?id=1332917

Fixed In Version: glibc-2.22-15.fc23, glibc-2.23.1-7.fc24

Doug, what's your system glibc version? Christopher reported glibc-2.17-196.el7_4.2.x86_64, which most probably doesn't include the fix above.
We are presently using Cray CLE6.0up04, which is based upon SuSE 12sp2. We've been using up04 for some time (since last June) but I've never seen this issue before. I believe there were some glibc updates associated with the recent security patches for Meltdown, Spectre, the user namespaces issues, and the Huge Dirty COW. Possibly the security update introduced a regression. We installed those patches on 2/6.

Our current image has:

glibc-profile-2.22-62.3.4.x86_64
glibc-locale-32bit-2.22-62.3.4.x86_64
glibc-2.22-62.3.4.x86_64
glibc-profile-32bit-2.22-62.3.4.x86_64
glibc-info-2.22-62.3.4.noarch
glibc-locale-2.22-62.3.4.x86_64
glibc-32bit-2.22-62.3.4.x86_64
glibc-i18ndata-2.22-62.3.4.noarch
glibc-html-2.22-62.3.4.noarch

The previous image (where I don't think I saw this issue) had:

glibc-2.22-61.3.x86_64
glibc-32bit-2.22-61.3.x86_64
glibc-profile-32bit-2.22-61.3.x86_64
glibc-html-2.22-61.3.noarch
glibc-locale-32bit-2.22-61.3.x86_64
glibc-info-2.22-61.3.noarch
glibc-profile-2.22-61.3.x86_64
glibc-locale-2.22-61.3.x86_64
glibc-i18ndata-2.22-61.3.noarch
On Saturday, 24 February 2018 1:58:56 AM AEDT bugs@schedmd.com wrote: > We've been using up04 for some time (last June) but I've never seen this > issue. I believe there were some glibc updates associated with the recent > security patches for meltdown, spectre, user namespaces issues, and the > hugedirty COW. Interesting, I've been wondering whether this might be related, it feels to me like we only started hearing about that after those fixes appeared. My colleague who was seeing this is away this weekend but I'll ask if they remember that too (I was away in the USA when the news broke and the fixes were released). Of course with microcode updates (later reverted), kernel changes and glibc changes it's going to be interesting trying to untangle this (if it is the cause).. Doug: which Slurm version are you running? When was that last changed? cheers! Chris
We are using slurm 17.11.3 as of 2/6 (same day as the other patches), but were using 17.11.2 from 1/9 (and 17.02.9 before that)
Hi Doug, On 24/02/18 11:24, bugs@schedmd.com wrote: > We are using slurm 17.11.3 as of 2/6 (same day as the other patches), > but were using 17.11.2 from 1/9 (and 17.02.9 before that) Did you see this deadlock on 17.02.9 ? cheers, Chris
it's possible it has happened with 17.02, but I haven't seen it until just today. Given that it is a glibc bug, I personally suspect the recent updates to glibc.
Created attachment 6234 [details]
slurmstepd.gdb-full

On 24/02/18 11:38, bugs@schedmd.com wrote:

> it's possible it has happened with 17.02, but I haven't seen it until
> just today. Given that it is a glibc bug, I personally suspect the
> recent updates to glibc.

I think they suspect a glibc bug, but reading the latest one that Alejandro found, it's not clear to me whether that's a different bug from the RHEL7 one it's a clone of, or the same bug just targeting Fedora. The bug it's cloned from was fixed last year and our glibc should have that patch in it. Errata here:

https://access.redhat.com/errata/RHSA-2017:1916

Another interpretation is that it has the same symptoms but a different cause, and they only fixed it for Fedora.

I've checked and we have the latest glibc updates in production; the last errata was this one:

https://access.redhat.com/errata/RHBA-2017:3296

So my previous thought that there were glibc updates for Meltdown & Spectre was wrong.

In case it's useful I attach a gdb "thread apply all bt full" from one of our currently stuck jobs in case it gives more context.

All the best,
Chris
Created attachment 6236 [details]
slurmstepd.0.gdb

On 24/02/18 22:21, Christopher Samuel wrote:

> In case it's useful I attach a gdb "thread apply all bt full" from
> one of our currently stuck jobs in case it gives more context.

I managed to trigger the same issue for a job launched with "srun", using my colleague's HPL binary and input deck. However, unlike when launched with mpirun, this one is stuck in the 'R' state.

[csamuel@farnarkle1 ~]$ squeue | fgrep 850
39928_850 skylake hpl.srun csamuel R 21:50 8 john[54-61]

That's 1 job element out of 1000 that got stuck. Interestingly, Slurm has already cleaned up the parent job array at this point, so asking about the job shows no info:

[csamuel@farnarkle1 ~]$ squeue -j 39928
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[csamuel@farnarkle1 ~]$

(that does work for another running job array)

I've attached the stack trace (grabbed via pstack and gdb) in case this gives more insight than the mpirun case.

All the best,
Chris
Created attachment 6237 [details] slurmstepd.0.pstack
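For anyone wanting to reproduce the pattern described above, it was simply a large array of short srun-launched HPL runs. A sketch of an equivalent submission follows; the array size is from this comment, the per-node numbers are borrowed from the earlier batch scripts in this ticket, and the HPL input deck is not included:

# Sketch only -- module names and resource sizes are illustrative.
cat > hpl.srun <<'EOF'
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=36
#SBATCH --mem=180gb
#SBATCH --time=00:10:00
module load intel/2016.2.181-gcc-6.4.0 openmpi/3.0.0
srun xhpl
EOF
sbatch --array=1-1000 hpl.srun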
On 26/02/18 15:13, Christopher Samuel wrote:

> I managed to trigger the same issue for a job launched with "srun",
> using my colleague's HPL binary and input deck. However, unlike when
> launched with mpirun, this one is stuck in the 'R' state.

It occurred to me that this might be the case with the mpirun jobs too, until they hit their walltime limit and Slurm tries to clean them up.

Trying to scancel that srun job does indeed get it wedged in the 'CG' state.

[csamuel@farnarkle1 ~]$ scancel 39928_850
[csamuel@farnarkle1 ~]$ squeue | fgrep 850
39928_850 skylake hpl.srun csamuel CG 28:13 2 john[54-55]

Eventually it cleans up everything except the step on the node that has wedged.

[csamuel@farnarkle1 ~]$ squeue | fgrep 850
39928_850 skylake hpl.srun csamuel CG 28:13 2 john[54-55]
[csamuel@farnarkle1 ~]$ squeue | fgrep 850
39928_850 skylake hpl.srun csamuel CG 28:13 1 john55

Which matches what we eventually see with the mpirun jobs.

All the best,
Chris
On 26/02/18 15:21, Christopher Samuel wrote: > It occurred to me that this might be the case with the mpirun jobs too, > until they hit their walltime limit and Slurm tries to clean them up. Just to make a liar out of me I caught an mpirun job which had gone straight from 'R' to 'CG' and was wedged. All the best, Chris
Created attachment 6251 [details]
slurmstepd-gdb-bt-with-debuginfo.txt

Now that we've managed to get the required debuginfo packages installed, I've attached a repeat stack trace from one of the slurmstepds, taken via gdb after grabbing a core file with "gcore".

All the best,
Chris
Created attachment 6252 [details]
stack-slurmd.log

Apologies for spamming you with backtraces, but I thought it would be useful to have two corresponding ones for both the slurmd and the slurmstepd, so you can see what connections were being made from slurmd to the step daemon, in case that helps you understand the trigger.

Both of these were obtained by using gdb to analyse core files captured with "gcore", with all the necessary debug symbols installed.

All the best,
Chris
Created attachment 6253 [details] stack-slurmstepd.log
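For reference, the cores and backtraces above were grabbed with gcore and gdb as described; a rough sketch of that collection follows (the gdb options are just one way of scripting it, and the stepd PID is a placeholder):

# Grab cores from the live daemons without killing them, then produce full
# backtraces offline (the slurm debuginfo packages need to be installed).
gcore -o slurmd.core     $(pidof slurmd)
gcore -o slurmstepd.core <stuck_stepd_pid>
gdb -batch -ex 'set pagination off' -ex 'thread apply all bt full' \
    /apps/slurm/latest/sbin/slurmd     slurmd.core.*     > stack-slurmd.log
gdb -batch -ex 'set pagination off' -ex 'thread apply all bt full' \
    /apps/slurm/latest/sbin/slurmstepd slurmstepd.core.* > stack-slurmstepd.log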
On 24/02/18 22:21, Christopher Samuel wrote:

> I think they suspect a glibc bug

I've been corresponding with a glibc maintainer; they've looked at the backtraces and the glibc version, and their comment is:

# The backtraces show that the threads are blocked on free_list_lock.
#
# During fork, we reinitialize this lock (because we reconstruct all
# data protected by it), so the parent state at the point of fork should
# not matter as far as this lock is concerned.
#
# As far as I can see, free_list_lock is correctly handled inside
# malloc. The lock/unlock calls are properly paired, and the critical
# sections are really short.
#
# We've also not seen anything like that, so it's probably some
# application-specific issue.

So it seems unlikely that this is glibc related. Certainly I didn't see this at ${JOB-1}, where I was helping a lot with a RHEL7-based cluster.

All the best,
Chris
Hi Christopher, thanks for the new information. We're still trying to figure out the root cause of these deadlocks. We'll keep you informed.
Hi there,

I have also experienced jobs stuck in CG state with symptoms similar to those described by Christopher: slurmstepd blocks and is unable to clean up finished jobs. I am not sure whether this is the same bug, as I have noticed some differences.

In my case, the jobs initially do not exit because they get stuck for a while in uninterruptible D state. At that point I can see the user process stuck in write() due to a storage problem, which is what causes the uninterruptible sleep, while slurmstepd is blocked in wait4() (the kernel's do_wait). I think this is expected, since the user job is stuck and Slurm is unable to SIGKILL it. However, after the process is finally released from that state (hours later) and exits, I find that slurmstepd appears deadlocked (wchan shows do_signal_stop).

My slurmstepd stack trace also looks a bit different from Christopher's, as I see no memory allocation/freeing code paths:

Thread 5 (Thread 0x2b08cc466700 (LWP 14992)):
#0 0x00002b08cd1a1945 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1 0x000000000054955f in _watch_tasks (arg=0x0) at slurm_jobacct_gather.c:244
#2 0x00002b08cd19de25 in start_thread () from /usr/lib64/libpthread.so.0
#3 0x00002b08cd4aa34d in clone () from /usr/lib64/libc.so.6

Thread 4 (Thread 0x2b08d0f23700 (LWP 14993)):
#0 0x00002b08cd1a442d in __lll_lock_wait () from /usr/lib64/libpthread.so.0
#1 0x00002b08cd19fdcb in _L_lock_812 () from /usr/lib64/libpthread.so.0
#2 0x00002b08cd19fc98 in pthread_mutex_lock () from /usr/lib64/libpthread.so.0
#3 0x000000000047e6ab in log_msg (level=LOG_LEVEL_DEBUG2, fmt=0x5edcfb "profile signalling type %s", args=0x2b08d0f22d60) at log.c:967
#4 0x000000000047f3f9 in debug2 (fmt=0x5edcfb "profile signalling type %s") at log.c:1182
#5 0x000000000054d697 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:174
#6 0x00002b08cd19de25 in start_thread () from /usr/lib64/libpthread.so.0
#7 0x00002b08cd4aa34d in clone () from /usr/lib64/libc.so.6

Thread 3 (Thread 0x2b08d1024700 (LWP 14994)):
#0 0x00002b08cd1a442d in __lll_lock_wait () from /usr/lib64/libpthread.so.0
#1 0x00002b08cd19fdcb in _L_lock_812 () from /usr/lib64/libpthread.so.0
#2 0x00002b08cd19fc98 in pthread_mutex_lock () from /usr/lib64/libpthread.so.0
#3 0x000000000047e6ab in log_msg (level=LOG_LEVEL_DEBUG3, fmt=0x5c8343 "Called _msg_socket_accept", args=0x2b08d1023bc0) at log.c:967
#4 0x000000000047f4c5 in debug3 (fmt=0x5c8343 "Called _msg_socket_accept") at log.c:1191
#5 0x000000000043d6ae in _msg_socket_accept (obj=0xb04260, objs=0xaff850) at req.c:336
#6 0x0000000000453f73 in _poll_handle_event (revents=1, obj=0xb04260, objList=0xaff850) at eio.c:493
#7 0x0000000000453d39 in _poll_dispatch (pfds=0x2b08d40008d0, nfds=1, map=0x2b08d4000900, objList=0xaff850) at eio.c:445
#8 0x0000000000453895 in eio_handle_mainloop (eio=0xb04300) at eio.c:333
#9 0x000000000043d2de in _msg_thr_internal (job_arg=0xaea000) at req.c:242
#10 0x00002b08cd19de25 in start_thread () from /usr/lib64/libpthread.so.0
#11 0x00002b08cd4aa34d in clone () from /usr/lib64/libc.so.6

Thread 2 (Thread 0x2b08d237b700 (LWP 30612)):
#0 0x00002b08cd1a46ad in write () from /usr/lib64/libpthread.so.0
#1 0x0000000000482856 in cbuf_put_fd (srcbuf=0xb083b0, pdstfd=0x2b08d237a9c4, len=79) at cbuf.c:1243
#2 0x0000000000482c93 in cbuf_reader (src=0xb08350, len=79, putf=0x482829 <cbuf_put_fd>, dst=0x2b08d237a9c4) at cbuf.c:1411
#3 0x0000000000481d37 in cbuf_read_to_fd (src=0xb08350, dstfd=2, len=79) at cbuf.c:865
#4 0x000000000047e5ee in _log_printf (log=0xaced10, cb=0xb08350, stream=0x2b08cd76f1c0 <_IO_2_1_stderr_>, fmt=0x5d22d6 "%s: %s%s\n") at log.c:947
#5 0x000000000047eb37 in log_msg (level=LOG_LEVEL_ERROR, fmt=0x5c8de0 "*** %s ON %s CANCELLED AT %s ***", args=0x2b08d237abe0) at log.c:1058
#6 0x000000000047f0bc in error (fmt=0x5c8de0 "*** %s ON %s CANCELLED AT %s ***") at log.c:1140
#7 0x000000000043fe4c in _handle_signal_container (fd=10, job=0xaea000, uid=0) at req.c:838
#8 0x000000000043e564 in _handle_request (fd=10, job=0xaea000, uid=0, gid=0) at req.c:511
#9 0x000000000043e0f8 in _handle_accept (arg=0x0) at req.c:455
#10 0x00002b08cd19de25 in start_thread () from /usr/lib64/libpthread.so.0
#11 0x00002b08cd4aa34d in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x2b08cc32f5c0 (LWP 14990)):
#0 0x00002b08cd470e6a in wait4 () from /usr/lib64/libc.so.6
#1 0x0000000000430d8e in _wait_for_any_task (job=0xaea000, waitflag=true) at mgr.c:1989
#2 0x00000000004312fb in _wait_for_all_tasks (job=0xaea000) at mgr.c:2097
#3 0x000000000042f731 in job_manager (job=0xaea000) at mgr.c:1306
#4 0x000000000042a6a4 in main (argc=1, argv=0x7ffe8ea856d8) at slurmstepd.c:183

slurmd is also unresponsive ('scontrol show slurmd' times out). Restarting slurmd does not help, as it sees a stale job socket in the spool directory and tries to connect() to it, resulting in more blocking.

We are running Slurm 17.02.6 with the MVAPICH2 patch (http://mvapich.cse.ohio-state.edu/download/mvapich/osu-shmempmi-slurm-17.02.2.patch), on an up-to-date CentOS 7.4 (with glibc-2.17-196). In this case these were sbatch jobs invoking MVAPICH2-2.2's mpiexec, which was compiled with these flags:

'--enable-hybrid' '--enable-dependency-tracking' '--enable-cache' '--enable-cxx' '--enable-fortran=all' '--enable-threads=multiple' 'CFLAGS=-m64 -O2 -pipe -Wall -Wshadow' 'CXXFLAGS=-m64 -O2 -pipe -Wall -Weffc++ -Wshadow' 'FCFLAGS=-m64 -O2 -pipe -Wall' 'FFLAGS=-m64 -O2 -pipe -Wall'

Does this look like the same bug to you, or would you like me to report a different bug? Or does it look more like a PEBKAC? :)

Cheers,
Pablo
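To make the state in the trace above concrete: threads 3 and 4 are waiting on the logging mutex inside log_msg(), while thread 2 holds that mutex and is itself blocked in write() on fd 2 (stderr), so none of them can ever make progress. Below is a minimal, self-contained sketch of that pattern. It is not Slurm code; the names are invented for illustration, and a never-drained pipe stands in for a stderr whose consumer has gone away.

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
static int log_fd = -1;                 /* stands in for the stderr fd in the trace */

static void *logger(void *arg)
{
    char buf[4096];
    (void)arg;
    memset(buf, 'x', sizeof(buf));
    pthread_mutex_lock(&log_lock);      /* like log_msg() taking its lock */
    for (;;)                            /* keep writing until the pipe is full and write() blocks */
        write(log_fd, buf, sizeof(buf));
    /* never reached: the lock is never released */
}

static void *other_logger(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&log_lock);      /* like the debug2()/debug3() threads in the trace */
    write(log_fd, "hi\n", 3);
    pthread_mutex_unlock(&log_lock);
    return NULL;
}

int main(void)
{
    int pipefd[2];
    pthread_t t1, t2;

    pipe(pipefd);                       /* read end never read: simulates a stalled log consumer */
    log_fd = pipefd[1];

    pthread_create(&t1, NULL, logger, NULL);
    sleep(1);                           /* let the pipe fill so t1 blocks in write() holding the lock */
    pthread_create(&t2, NULL, other_logger, NULL);
    sleep(1);

    if (pthread_mutex_trylock(&log_lock) != 0)
        printf("log lock held by a thread stuck in write(): every logger now deadlocks\n");
    return 0;                           /* exiting the process reaps the stuck threads */
}

Compiled with -pthread, this hangs both logging threads exactly the way the wchan/backtrace output shows, which is why a stalled stderr (or any undrained output path) can wedge every thread that tries to log.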
Sorry, I need to amend my previous comment: in this case Slurm was compiled *without* the aforementioned patch.
We have the same issue: all user processes are dead and one node from the job is stuck in the 'completing' state. There is a Slurm extern step and a .0 step (with no child processes) left on the node. slurmstepd has to be manually killed to return the node to service.

    1 16986 16985 16985 ?  -1 Sl  0  0:00 /usr/sbin/slurmd
    1 18508 18507 18507 ?  -1 Sl  0  0:04 slurmstepd: [5181.extern]
18508 18513 18513 18507 ?  -1 S   0  0:00  \_ sleep 1000000
    1 18516 18515 18515 ?  -1 Sl  0  0:01 slurmstepd: [5181.0]

Please advise if you would like me to collect any debugging information.

Thanks,
Patrick
Hello,

At Cray we have seen this problem on a CS system, which is an IB cluster, and on a VM cluster. We tried a patch commenting out the debug code; here is a summary of our backtrace. We saw the problem on both 17.02 and 17.11.

Thread 193
#0 poll () from /usr/lib64/libc.so.6
#1 _poll_internal at eio.c:364
#2 eio_handle_mainloop at eio.c:328
#3 _io_thr at io.c:1479
#4 start_thread () from /usr/lib64/libpthread.so.0
#5 clone () from /usr/lib64/libc.so.6

Thread 192 <same as thread 2>
...
Thread 3 <same as thread 2>

Thread 2
#0 __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 _L_lock_2251 () from /usr/lib64/libc.so.6
#2 arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3 malloc () from /usr/lib64/libc.so.6
#4 slurm_xmalloc () at pack.c
#5 init_buf () at pack.c
#6 _handle_accept at req.c
#7 start_thread () from /usr/lib64/libpthread.so.0
#8 clone () from /usr/lib64/libc.so.6

Thread 1
#0 pthread_join () from /usr/lib64/libpthread.so.0
#1 acct_gather_profile_fini () at slurm_acct_gather_profile.c
#2 job_manager at mgr.c
#3 main at slurmstepd.c:172

On the VM the problem happens about once a day. The workload is a lot of serial jobs on the same node. The memory footprint is small, so this looks like it may be associated with memory fragmentation. On the CS system the memory is large; the problem can be encountered when running large, short MPI jobs.

I checked the Slurm code in the call stack and it has not changed in the last 3 major releases. According to comment #46 this is not glibc/malloc.

Where are we at with this issue?
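The notable frames in the trace above are in thread 2, which is waiting on a glibc-internal malloc arena lock (__lll_lock_wait_private via arena_get2) that is never released. One generic way a process ends up in that state is a signal, delivered for example with pthread_kill(), interrupting a thread while it is inside malloc() and then running handler code that also allocates; malloc() is not async-signal-safe (see signal-safety(7)), and the arena lock is not recursive. The sketch below is not Slurm code and the names are invented; it only demonstrates that general pattern, and whether it deadlocks on a given run is timing- and glibc-dependent.

#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static _Atomic long progress;           /* counts completed malloc/free iterations */

static void handler(int sig)
{
    (void)sig;
    /* Undefined behaviour on purpose: malloc() is not async-signal-safe.
     * If the signal landed while the thread was inside malloc() and holds
     * an arena lock, this call can wait forever on that same lock. */
    void *p = malloc(64);
    free(p);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {                          /* spend most of the time inside malloc/free */
        void *p = malloc((size_t)(1 + rand() % 4096));
        free(p);
        progress++;
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    long last = 0;

    signal(SIGUSR1, handler);
    pthread_create(&t, NULL, worker, NULL);

    for (int i = 0; i < 10000; i++) {
        pthread_kill(t, SIGUSR1);       /* interrupt the worker, possibly mid-malloc */
        usleep(1000);
        if (i % 1000 == 999) {          /* is the worker still making progress? */
            if (progress == last) {
                puts("worker made no progress: likely self-deadlocked inside malloc");
                return 1;
            }
            last = progress;
        }
    }
    puts("no deadlock this run (the race is timing-dependent)");
    return 0;
}

This is only meant to show why a stepd thread can appear permanently stuck inside glibc's allocator without any Slurm-level lock being involved; it is not a claim about where Slurm performs the unsafe call.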
Bugs 4733, 4810, and 4690 all appear to report the same issue. We are still investigating the root cause.
Hi folks -

Apologies for the bug update; we have three nearly identical versions of this bug (bug 4690, 4733, 4810), and I'm trying to narrow down the possible sources.

As best we can tell at the moment, this seems to affect only a limited number of glibc versions - most of which appear to be 2.17 (plus extensive vendor patching).

Even with that version, we're seeing inconsistent results trying to reproduce, and I'm starting to lean towards this issue being caused by an interaction between the CPU, glibc, and some subtle timing differences in how the slurmstepd manages the extern step vs. normal steps.

What would help us at the moment is if you could confirm:

- Exact GLIBC version, including all vendor sub-numbering.
- Output from /proc/cpuinfo (just the first processor is fine).

E.g., at the moment we're able to reproduce on a box with:

- glibc-2.17-196.el7.x86_64
- Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz

thanks,
- Tim
On 12/04/18 13:52, bugs@schedmd.com wrote:
> What would help us at the moment is if you could confirm:
>
> - Exact GLIBC version, including all vendor sub-numbering.
> - Output from /proc/cpuinfo (just the first processor is fine).

For us:

glibc-2.17-196.el7_4.2.x86_64

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
stepping        : 4
microcode       : 0x2000043
cpu MHz         : 2300.000
cache size      : 25344 KB
physical id     : 0
siblings        : 18
core id         : 0
cpu cores       : 18
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_pt spec_ctrl ibpb_support tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
bogomips        : 4600.00
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
On 10/04/18 07:23, bugs@schedmd.com wrote:
> On the CS system the memory is large. The problem can be encountered when
> running large, short MPI jobs.

We were also seeing this with lots of short, wide MPI jobs (I *think* they were GPU HPL runs on a selection of nodes, but I'm seeking confirmation).
(In reply to Tim Wickberg from comment #56)
> What would help us at the moment is if you could confirm:
>
> - Exact GLIBC version, including all vendor sub-numbering.
> - Output from /proc/cpuinfo (just the first processor is fine).

glibc-2.17-196.el7_4.2.x86_64
Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
*** Ticket 5062 has been marked as a duplicate of this ticket. ***
*** Ticket 5091 has been marked as a duplicate of this ticket. ***
Hey folks -

Thank you all for your patience while we've been working through this. We believe we now understand the root cause of these deadlocks, and are working on a series of fixes to address them.

It does appear to be an issue within Slurm itself related to unsafe use of pthread_kill(), and not a glibc bug as we'd suspected. (Although recent changes to glibc do seem to make this race condition significantly easier to trigger - Slurm has inadvertently relied on this unsafe behavior working for quite some time.)

I'm closing this as a duplicate of the new bug 5103, which will track those improvements.

cheers,
- Tim

*** This ticket has been marked as a duplicate of ticket 5103 ***
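For background on the class of problem named above (and not as a description of the actual bug 5103 changes): the usual way to keep pthread_kill()-delivered signals safe is to do nothing non-trivial in the handler itself. The handler performs only an async-signal-safe write() to a pipe, and an ordinary thread picks the event up and does the logging, locking, or allocation outside signal context. A minimal sketch of that self-pipe pattern, with invented names, follows.

#include <poll.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static int sig_pipe[2];                 /* [0] read end, [1] write end */

static void handler(int sig)
{
    char c = (char)sig;
    write(sig_pipe[1], &c, 1);          /* async-signal-safe; no locks, no malloc */
}

static void *signal_worker(void *arg)
{
    struct pollfd pfd = { .fd = sig_pipe[0], .events = POLLIN };
    (void)arg;
    for (;;) {
        if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
            char c;
            if (read(sig_pipe[0], &c, 1) == 1)
                /* Safe to log, lock, or allocate here: normal thread context. */
                printf("handling signal %d outside signal context\n", (int)c);
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t t;

    pipe(sig_pipe);
    signal(SIGUSR1, handler);
    pthread_create(&t, NULL, signal_worker, NULL);

    pthread_kill(t, SIGUSR1);           /* deliver a signal to a specific thread */
    sleep(1);
    return 0;
}

Whether Slurm's fixes take this exact shape is tracked in bug 5103; the sketch only illustrates why moving work out of the signal path removes the deadlock window.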