Jobs submitted with multiple nodes are showing up in a CG state, and squeue only shows the one node the job is stuck on. For example:

grep 16343163 jobcomp.log
JobId=16343163 UserId=junholee(42713) GroupId=msd(505) Name=W-Vac-S JobState=TIMEOUT Partition=lr6 TimeLimit=4320 StartTime=2019-04-16T08:18:44 EndTime=2019-04-19T08:18:45 NodeList=n0024.lr6,n0025.lr6,n0026.lr6,n0027.lr6,n0028.lr6,n0029.lr6,n0031.lr6,n0032.lr6,n0033.lr6,n0034.lr6,n0035.lr6,n0036.lr6,n0037.lr6,n0038.lr6,n0039.lr6,n0040.lr6,n0041.lr6,n0042.lr6,n0043.lr6,n0044.lr6,n0045.lr6,n0046.lr6,n0048.lr6,n0049.lr6 NodeCnt=24 ProcCnt=768 WorkDir=/global/home/users/junholee/work/WS2/Vac_W/relax/wSOC/7x7 ReservationName= Gres= Account=ac_minnehaha QOS=lr_normal WcKey= Cluster=perceus-00 SubmitTime=2019-04-16T08:18:37 EligibleTime=2019-04-16T08:18:37 DerivedExitCode=0:0 ExitCode=0:0

squeue --state=CG
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
16343163       lr6  W-Vac-S junholee CG 3-00:00:01      1 n0048.lr6

Processes on the stuck node:

ps -eaf | grep slurm
root      21998  21938  0 15:44 pts/0    00:00:00 grep --color=auto slurm
root     140700      1  0 Mar28 ?        00:00:11 /usr/sbin/slurmd
root     451838      1  0 Apr16 ?        00:01:27 slurmstepd: [16343163.0]

gdb attached to the slurmstepd PID (451838):

(gdb) bt
#0  0x00002b581ef00f47 in pthread_join () from /lib64/libpthread.so.0
#1  0x000000000040fd1c in _wait_for_io (job=0x1eab430) at mgr.c:2219
#2  job_manager (job=job@entry=0x1eab430) at mgr.c:1397
#3  0x000000000040bdc9 in main (argc=1, argv=0x7ffe06dc4c08) at slurmstepd.c:172

(gdb) info thread
  Id   Target Id         Frame
  17   Thread 0x2b5821163700 (LWP 451841) "slurmstepd" 0x00002b581f20720d in poll () from /lib64/libc.so.6
  16   Thread 0x2b5822593700 (LWP 451843) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  15   Thread 0x2b581e228700 (LWP 19542) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  14   Thread 0x2b5821062700 (LWP 19651) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  13   Thread 0x2b5822492700 (LWP 19841) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  12   Thread 0x2b5822694700 (LWP 20083) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  11   Thread 0x2b5820653700 (LWP 20284) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  10   Thread 0x2b5820754700 (LWP 20483) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  9    Thread 0x2b5820855700 (LWP 20647) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  8    Thread 0x2b5820956700 (LWP 20850) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  7    Thread 0x2b5820a57700 (LWP 21037) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  6    Thread 0x2b5820b58700 (LWP 21213) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  5    Thread 0x2b5820c59700 (LWP 21414) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  4    Thread 0x2b5820d5a700 (LWP 21587) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  3    Thread 0x2b5820e5b700 (LWP 21777) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
  2    Thread 0x2b5820f5c700 (LWP 21992) "slurmstepd" 0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
* 1    Thread 0x2b581e126100 (LWP 451838) "slurmstepd" 0x00002b581ef00f47 in pthread_join () from /lib64/libpthread.so.0

(gdb) up
#1  0x000000000040fd1c in _wait_for_io (job=0x1eab430) at mgr.c:2219
2219    mgr.c: No such file or directory.
(gdb) list
2214    in mgr.c
(gdb) thread 3
[Switching to thread 3 (Thread 0x2b5820e5b700 (LWP 21777))]
#0  0x00002b581f21fb9c in __lll_lock_wait_private () from /lib64/libc.so.6
(gdb) list
2214    in mgr.c

We're seeing this on a few users' jobs and are not sure why. Is there something else I can run to identify the I/O wait this job is pending on? slurmctld keeps sending out:

[2019-04-19T15:36:49.223] Resending TERMINATE_JOB request JobId=16343163 Nodelist=n0048.lr6
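In case a fuller picture is useful, this is roughly what I can collect next (just a sketch, assuming I can still attach gdb to the stuck slurmstepd on n0048.lr6; the output paths are only examples):

# Attach to the hung step daemon and log every thread's full backtrace,
# so the frames above __lll_lock_wait_private show where each thread is blocked
gdb -p 451838 \
    -ex 'set pagination off' \
    -ex 'set logging file /tmp/slurmstepd-16343163-bt.txt' \
    -ex 'set logging on' \
    -ex 'thread apply all bt full' \
    -ex 'detach' \
    -ex 'quit'

# Alternatively, grab a core of the hung process for offline analysis
gcore -o /tmp/slurmstepd-16343163 451838

I can attach the resulting backtrace dump or core here if that would help narrow down what _wait_for_io is still waiting on.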
Hi Jackie,

This looks like a duplicate of bug 5103 and bug 5545, which you had previously reported. You'll need to upgrade to fix this. When we release 19.05 next month, 17.11 will no longer be supported, so I recommend upgrading to the latest 18.08 (currently 18.08.7).

I'm marking this as a duplicate of bug 5545. If you still have problems with hanging slurmstepds after upgrading, please open a new ticket.

- Marshall

*** This ticket has been marked as a duplicate of ticket 5545 ***
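In the meantime, if you need to clear a node that is wedged with a completing job, one common way to do it (a sketch using standard scontrol commands; substitute the node and job actually affected, and note that downing a node will also kill any other jobs running on it):

# On the stuck node: kill the hung step daemon for that job
pkill -9 -f 'slurmstepd: \[16343163'

# From an admin host: force Slurm to clear the completing job, then return the node to service
scontrol update NodeName=n0048.lr6 State=DOWN Reason="hung slurmstepd, job stuck in CG"
scontrol update NodeName=n0048.lr6 State=RESUME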