Ticket 6890

Summary: slurmstepd hanging issues
Product: Slurm
Reporter: Wei Feinstein <wfeinstein>
Component: slurmstepd
Assignee: Marshall Garey <marshall>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 17.11.4   
Hardware: Linux   
OS: Linux   
Site: LBNL - Lawrence Berkeley National Laboratory

Description Wei Feinstein 2019-04-19 16:48:04 MDT
Multi-node jobs are getting stuck in the CG (completing) state, and squeue shows only the one node that the job is stuck on.

For example: grep 16343163 jobcomp.log
JobId=16343163 UserId=junholee(42713) GroupId=msd(505) Name=W-Vac-S JobState=TIMEOUT Partition=lr6 TimeLimit=4320 StartTime=2019-04-16T08:18:44 EndTime=2019-04-19T08:18:45 NodeList=n0024.lr6,n0025.lr6,n0026.lr6,n0027.lr6,n0028.lr6,n0029.lr6,n0031.lr6,n0032.lr6,n0033.lr6,n0034.lr6,n0035.lr6,n0036.lr6,n0037.lr6,n0038.lr6,n0039.lr6,n0040.lr6,n0041.lr6,n0042.lr6,n0043.lr6,n0044.lr6,n0045.lr6,n0046.lr6,n0048.lr6,n0049.lr6 NodeCnt=24 ProcCnt=768 WorkDir=/global/home/users/junholee/work/WS2/Vac_W/relax/wSOC/7x7 ReservationName= Gres= Account=ac_minnehaha QOS=lr_normal WcKey= Cluster=perceus-00 SubmitTime=2019-04-16T08:18:37 EligibleTime=2019-04-16T08:18:37 DerivedExitCode=0:0 ExitCode=0:0 


squeue --state=CG
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          16343163       lr6  W-Vac-S junholee CG 3-00:00:01      1 n0048.lr6

Processes on the node:
ps -eaf |grep slurm
root      21998  21938  0 15:44 pts/0    00:00:00 grep --color=auto slurm
root     140700      1  0 Mar28 ?        00:00:11 /usr/sbin/slurmd
root     451838      1  0 Apr16 ?        00:01:27 slurmstepd: [16343163.0]

gdb attached to the slurmstepd PID (451838):
(gdb) bt
#0  0x00002b581ef00f47 in pthread_join () from /lib64/libpthread.so.0
#1  0x000000000040fd1c in _wait_for_io (job=0x1eab430) at mgr.c:2219
#2  job_manager (job=job@entry=0x1eab430) at mgr.c:1397
#3  0x000000000040bdc9 in main (argc=1, argv=0x7ffe06dc4c08) at slurmstepd.c:172

(gdb) info thread
  Id   Target Id         Frame 
  17   Thread 0x2b5821163700 (LWP 451841) "slurmstepd" 0x00002b581f20720d in poll () from /lib64/libc.so.6
  16   Thread 0x2b5822593700 (LWP 451843) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  15   Thread 0x2b581e228700 (LWP 19542) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  14   Thread 0x2b5821062700 (LWP 19651) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  13   Thread 0x2b5822492700 (LWP 19841) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  12   Thread 0x2b5822694700 (LWP 20083) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  11   Thread 0x2b5820653700 (LWP 20284) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  10   Thread 0x2b5820754700 (LWP 20483) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  9    Thread 0x2b5820855700 (LWP 20647) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  8    Thread 0x2b5820956700 (LWP 20850) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  7    Thread 0x2b5820a57700 (LWP 21037) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  6    Thread 0x2b5820b58700 (LWP 21213) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  5    Thread 0x2b5820c59700 (LWP 21414) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  4    Thread 0x2b5820d5a700 (LWP 21587) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  3    Thread 0x2b5820e5b700 (LWP 21777) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  2    Thread 0x2b5820f5c700 (LWP 21992) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
* 1    Thread 0x2b581e126100 (LWP 451838) "slurmstepd" 0x00002b581ef00f47 in pthread_join () from /lib64/libpthread.so.0


(gdb) up
#1  0x000000000040fd1c in _wait_for_io (job=0x1eab430) at mgr.c:2219
2219	mgr.c: No such file or directory.
(gdb) list
2214	in mgr.c

(gdb) thread 3
[Switching to thread 3 (Thread 0x2b5820e5b700 (LWP 21777))]
#0  0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
(gdb) list
2214	in mgr.c
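
The thread listing above is easier to read once it is grouped by the function each thread is stopped in. A minimal Python sketch, assuming the `info thread` output has been pasted into a string (the names `tally_threads`, `THREAD_ROW` and `SAMPLE` are made up for illustration, and `SAMPLE` holds only four of the 17 rows from the listing):

```python
import re
from collections import Counter

# Matches one row of gdb's thread table, for example:
#   16   Thread 0x2b5822593700 (LWP 451843) "slurmstepd" 0x... in __lll_lock_wait_private () from ...
# The leading "*" marks the currently selected thread and is optional.
THREAD_ROW = re.compile(
    r'^\*?\s*(?P<id>\d+)\s+Thread\s+0x[0-9a-f]+\s+\(LWP\s+(?P<lwp>\d+)\)'
    r'.*?\bin\s+(?P<func>\w+)\s*\('
)


def tally_threads(info_threads_text):
    """Count threads per innermost function in pasted thread-table text."""
    counts = Counter()
    for line in info_threads_text.splitlines():
        match = THREAD_ROW.match(line.strip())
        if match:  # header rows and blank lines are skipped
            counts[match.group("func")] += 1
    return counts


# Four rows copied from the listing in this ticket (not the full table).
SAMPLE = '''\
  Id   Target Id         Frame
  17   Thread 0x2b5821163700 (LWP 451841) "slurmstepd" 0x00002b581f20720d in poll () from /lib64/libc.so.6
  16   Thread 0x2b5822593700 (LWP 451843) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  2    Thread 0x2b5820f5c700 (LWP 21992) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
* 1    Thread 0x2b581e126100 (LWP 451838) "slurmstepd" 0x00002b581ef00f47 in pthread_join () from /lib64/libpthread.so.0
'''

if __name__ == "__main__":
    for func, count in tally_threads(SAMPLE).most_common():
        print(f"{count:3d}  {func}")
```

Applied to the full listing, this would group threads 2 through 16 (15 threads) under `__lll_lock_wait_private`, with thread 17 in `poll` and thread 1 in `pthread_join`.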

We are seeing this on a few users' jobs and are not sure why. Is there something else I can run to identify the I/O this job is waiting on?
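
One further thing that could be collected is a full backtrace of every thread, since `info thread` shows only the innermost frame of each one. A sketch, assuming gdb is installed on the node and the script runs as a user allowed to attach to the process (typically root). The helper names `gdb_backtrace_argv` and `dump_all_backtraces` are hypothetical; `-p`, `-batch`, `-ex` and `thread apply all bt` are standard gdb options and commands:

```python
import subprocess
import sys


def gdb_backtrace_argv(pid):
    """Build a gdb command line that prints every thread's stack, then exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_all_backtraces(pid, timeout=60):
    """Attach gdb to a live process and return its combined output as text.

    In batch mode gdb runs the given commands and then quits, which
    detaches from the process again.
    """
    result = subprocess.run(
        gdb_backtrace_argv(pid),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        timeout=timeout,
    )
    return result.stdout


if __name__ == "__main__":
    # Usage: python dump_bt.py <pid>   (the script name is arbitrary)
    if len(sys.argv) == 2:
        print(dump_all_backtraces(int(sys.argv[1])))
```

For the process in this ticket the PID argument would be 451838. Without the matching source files and debug symbols gdb can still print function names and line numbers, but `list` will keep reporting only `in mgr.c` as it does above.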

slurmctld keeps resending the terminate request:
[2019-04-19T15:36:49.223] Resending TERMINATE_JOB request JobId=16343163 Nodelist=n0048.lr6
Comment 2 Marshall Garey 2019-04-19 17:09:50 MDT
Hi Jackie,

This looks like a duplicate of bug 5103 and bug 5545, which you had previously reported. You'll need to upgrade to fix this. When we release 19.05 next month, 17.11 will no longer be supported, so I recommend upgrading to the latest 18.08 (currently 18.08.7).

I'm marking this as a duplicate of bug 5545. If you still have problems with hanging slurmstepd processes after upgrading, please open a new ticket.

- Marshall

*** This ticket has been marked as a duplicate of ticket 5545 ***