Ticket 601

Summary: straggling sruns
Product: Slurm Reporter: Don Lipari <lipari1>
Component: Bluegene select plugin Assignee: Danny Auble <da>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: da
Version: 2.6.x   
Hardware: IBM BlueGene   
OS: Linux   
Site: LLNL Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 14.03.4 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: gdb logs
gdb log for process 29686
back trace of orphan srun

Description Don Lipari 2014-02-26 04:48:32 MST
Created attachment 648 [details]
gdb logs

I noticed today three sruns that are still running long after the job completed.

UID        PID  PPID  C STIME TTY          TIME CMD
petert   23029     1  0 Feb12 ?        00:00:00 srun -n 32 /g/g22/petert/ALLcrzuseq
petert   29686     1  0 Feb14 ?        00:00:00 srun -N 2 -n 32 /g/g22/petert/ALLcrzuseq
petert   29687 29686  0 Feb14 ?        00:00:00 srun -N 2 -n 32 /g/g22/petert/ALLcrzuseq

gdb logs from two of the above are attached.
Comment 1 Danny Auble 2014-02-26 04:54:24 MST
Based on the dates, these are quite old indeed ;).

And there is nothing in the slurmctld log talking about the db waiting for the jobs to finish?

Since they are so old, I am guessing there isn't anything in the logs about these jobs or how they finished.

I am also guessing the processes aren't still running in the database?  And the resources they were running on have run jobs since these finished?

Could you give me the sacct information on these?

Could you also send the gdb information of pid 29686?  29687 is the child process.
Comment 2 Don Lipari 2014-02-26 05:11:18 MST
Created attachment 649 [details]
gdb log for process 29686
Comment 3 Don Lipari 2014-02-26 05:12:35 MST
(In reply to Danny Auble from comment #1)
> Based on the dates, these are quite old indeed ;).
> 
> And there is nothing in the slurmctld log talking about the db waiting for
> the jobs to finish?

No.  They've rolled off.

> Since they are so old I am guessing there isn't anything in the logs about
> these jobs or how they finished.
> 
> I am also guessing the processes aren't still running in the database?  And
> the resources they were running on have run jobs since these finished?
> 
> Could you give me the sacct information on these?

I have yet to match up the jobs to the hung sruns.  Here's what I've got:

$ sacct -X -u petert -S 02/12/14
       JobID      User    JobName    Account    Cluster  Partition        QOS   NNodes      State               Start    Elapsed ExitCode 
------------ --------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------------- ---------- -------- 
173512          petert     sxterm    science     rzuseq     pdebug     normal       32    TIMEOUT 2014-02-12T09:40:12   01:00:01      0:1 
173556          petert     sxterm    science     rzuseq     pdebug     normal       32    TIMEOUT 2014-02-12T10:40:30   01:00:13      0:1 
173578          petert     sxterm    science     rzuseq     pdebug     normal       32    TIMEOUT 2014-02-12T12:10:09   01:00:04      0:1 
173591          petert     sxterm    science     rzuseq     pdebug     normal       32  COMPLETED 2014-02-12T15:59:36   00:58:37      0:0 
173593          petert     sxterm    science     rzuseq     pdebug     normal       32  COMPLETED 2014-02-12T16:58:20   00:02:26      0:0 
173595          petert     sxterm    science     rzuseq     pdebug     normal       32    TIMEOUT 2014-02-12T17:00:55   01:00:19      0:1 
180181          petert     sxterm    science     rzuseq     pdebug     normal        2    TIMEOUT 2014-02-14T05:53:33   01:00:12      0:1 
180714          petert     sxterm    science     rzuseq     pdebug     normal        2  COMPLETED 2014-02-14T09:55:50   00:23:22      0:0 
180720          petert     sxterm    science     rzuseq     pdebug     normal        2    TIMEOUT 2014-02-14T10:19:17   01:00:00      0:1 
189434          petert     sxterm    science     rzuseq     pdebug     normal        2    TIMEOUT 2014-02-19T11:39:41   01:00:18      0:1 

> Could you also send the gdb information of pid 29686?  29687 is the child
> process.

Attached.
Comment 4 Danny Auble 2014-02-26 05:21:46 MST
If you go to the last gdb log you sent for 29686 (stuck in runjob code), frame 8 should give you the job ID.

print *job

IBM might want to look at that backtrace.

This one appears to still be running a job with no sign of ending ;).  I bet if you signalled it to shut down it would work just fine.

Before/if you do that, let's see what Slurm thinks the state of that job is, as we can find out from frame 8.

The first one is different; it appears not to have a child and is stuck launching the step.  I'll look more into that one.  If you could go into frame 1 and

print *step_params

that would tell us which job this belongs to.
Comment 5 Don Lipari 2014-02-26 05:29:56 MST
(In reply to Danny Auble from comment #4)
> If you go to the last gdb you sent for 29686 (stuck in runjob code) frame 8
> should give you the job id.
> 
> print *job

jobid = 180714

> IBM might want to look at that backtrace.
> 
> This one appears to still be running a job with no sign of ending ;).  I bet
> if you signalled it to shut down it would work just fine.
> 
> Before/if you do that, let's see what Slurm thinks the state of that job is,
> as we can find out from frame 8.

$ sacct -j 180714
       JobID      User    JobName    Account    Cluster  Partition        QOS   NNodes      State               Start    Elapsed ExitCode 
------------ --------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------------- ---------- -------- 
180714          petert     sxterm    science     rzuseq     pdebug     normal        2  COMPLETED 2014-02-14T09:55:50   00:23:22      0:0 
180714.batch                batch    science     rzuseq                              1  CANCELLED 2014-02-14T09:55:50   00:23:22  255:126 
180714.0               /g/g22/pe+    science     rzuseq                              2  COMPLETED 2014-02-14T10:08:48   00:01:53      0:0 
180714.1               /g/g22/pe+    science     rzuseq                              2  COMPLETED 2014-02-14T10:10:52   00:00:04      0:0 
180714.2               /g/g22/pe+    science     rzuseq                              2  COMPLETED 2014-02-14T10:10:59   00:00:04      0:0 
180714.3               /g/g22/pe+    science     rzuseq                              2  CANCELLED 2014-02-14T10:15:12   00:04:00  255:126 
 
> The first one is different; it appears not to have a child and is stuck
> launching the step.  I'll look more into that one.  If you could go into
> frame 1 and
> 
> print *step_params
> 
> that would tell us which job this belongs to.

(gdb) p *step_params
$1 = {ckpt_interval = 0, cpu_count = 32, cpu_freq = 4294967294, exclusive = 0, features = 0x0, immediate = 0, job_id = 173591, pn_min_memory = 0, 
  ckpt_dir = 0x1045f7a8 "/g/g22/petert", gres = 0x0, name = 0x104c2108 "/g/g22/petert/ALLcrzuseq", network = 0x0, profile = 0, no_kill = 0 '\000', min_nodes = 32, 
  max_nodes = 32, node_list = 0x0, overcommit = false, plane_size = 65534, relative = 65534, resv_port_cnt = 65534, task_count = 32, task_dist = 1, time_limit = 0, 
  uid = 8615, verbose_level = 0}

$ sacct -j 173591
       JobID      User    JobName    Account    Cluster  Partition        QOS   NNodes      State               Start    Elapsed ExitCode 
------------ --------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------------- ---------- -------- 
173591          petert     sxterm    science     rzuseq     pdebug     normal       32  COMPLETED 2014-02-12T15:59:36   00:58:37      0:0 
173591.batch                batch    science     rzuseq                              1  CANCELLED 2014-02-12T15:59:36   00:58:37  255:126 
173591.0               ./ALLcrzu+    science     rzuseq                             32  COMPLETED 2014-02-12T16:44:57   00:00:06      0:0 
173591.1               /g/g22/pe+    science     rzuseq                             32  CANCELLED 2014-02-12T16:51:25   00:06:48  255:126
Comment 6 Danny Auble 2014-02-26 05:39:20 MST
Thanks Don, does 180714 also have a step # available?

173591 is a strange one; it just looks like it is waiting for something to happen, but it never does.

I am wondering if it was scancelled by the user while it was waiting for a creation.  Based on the gdb output, the step never was allocated, but I am guessing it was either 173591.1 or a non-existent 173591.2.

Could you ask the user how many steps should run in that particular job?  And if they are all started at once or one after another.
Comment 7 Danny Auble 2014-02-26 06:05:03 MST
Also in the gdb of 173591 in frame 1 could you

print destroy_step;
Comment 8 Don Lipari 2014-02-26 06:11:21 MST
(In reply to Danny Auble from comment #7)
> Also in the gdb of 173591 in frame 1 could you
> 
> print destroy_step;

sanity check... (gdb) p *step_params
$1 = {ckpt_interval = 0, cpu_count = 32, cpu_freq = 4294967294, exclusive = 0, features = 0x0, immediate = 0, job_id = 173591, pn_min_memory = 0, 
  ckpt_dir = 0x1045f7a8 "/g/g22/petert", gres = 0x0, name = 0x104c2108 "/g/g22/petert/ALLcrzuseq", network = 0x0, profile = 0, no_kill = 0 '\000', min_nodes = 32, 
  max_nodes = 32, node_list = 0x0, overcommit = false, plane_size = 65534, relative = 65534, resv_port_cnt = 65534, task_count = 32, task_dist = 1, time_limit = 0, 
  uid = 8615, verbose_level = 0}

as requested...
(gdb) p destroy_step
$2 = 0
Comment 9 Don Lipari 2014-02-26 06:16:45 MST
(In reply to Danny Auble from comment #6)
> Thanks Don, does 180714 also have a step # available?
> 
> 173591 is a strange one; it just looks like it is waiting for something to
> happen, but it never does.
> 
> I am wondering if it was scancelled by the user while it was waiting for a
> creation.  Based on the gdb output, the step never was allocated, but I am
> guessing it was either 173591.1 or a non-existent 173591.2.
> 
> Could you ask the user how many steps should run in that particular job? 
> And if they are all started at once or one after another.

sxterm is an xterm window from which the user manually launches sruns.

FWIW, from the slurmd.log...

> zgrep 173591 slurmd.log-20140217.gz
[2014-02-12T15:59:36.258] debug:  task_slurmd_batch_request: 173591
[2014-02-12T15:59:36.265] Running spank/prolog for jobid [173591] uid [8615]
[2014-02-12T15:59:36.270] debug:  [job 173591] attempting to run prolog [/usr/sbin/slurm_prolog]
[2014-02-12T15:59:36.433] Launching batch job 173591 for UID 8615
[2014-02-12T15:59:36.529] [173591] laying out the 32 tasks on 32 hosts rzuseq0000 dist 2
[2014-02-12T15:59:36.529] [173591] Message thread started pid = 20884
[2014-02-12T15:59:36.529] [173591] task NONE plugin loaded
[2014-02-12T15:59:36.530] [173591] Checkpoint plugin loaded: checkpoint/none
[2014-02-12T15:59:36.530] [173591] mpi type = (null)
[2014-02-12T15:59:36.530] [173591] spank: opening plugin stack /etc/slurm/plugstack.conf
[2014-02-12T15:59:36.530] [173591] mpi type = (null)
[2014-02-12T15:59:36.530] [173591] mpi/none: slurmstepd prefork
[2014-02-12T15:59:36.535] [173591] debug level = 2
[2014-02-12T15:59:36.537] [173591] task 0 (20888) started 2014-02-12T15:59:36
[2014-02-12T15:59:36.537] [173591] task_pre_launch_priv: 173591.4294967294
[2014-02-12T15:59:36.538] [173591] task_pre_launch: 173591.4294967294, task 0
[2014-02-12T16:24:43.415] [173591] auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2014-02-12T16:24:43.416] [173591] Handling REQUEST_STATE
[2014-02-12T16:24:43.416] debug:  found apparently running job 173591
[2014-02-12T16:58:03.974] [173591] Handling REQUEST_STATE
[2014-02-12T16:58:03.974] debug:  found apparently running job 173591
[2014-02-12T16:58:13.417] [173591] task 0 (20888) exited with exit code 0.
[2014-02-12T16:58:13.417] [173591] task_post_term: 173591.4294967294, task 0
[2014-02-12T16:58:13.418] [173591] Message thread exited
[2014-02-12T16:58:13.418] [173591] job 173591 completed with slurm_rc = 0, job_rc = 0
[2014-02-12T16:58:13.418] [173591] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2014-02-12T16:58:13.423] [173591] done with job
[2014-02-12T16:58:13.429] debug:  task_slurmd_release_resources: 173591
[2014-02-12T16:58:13.429] debug:  credential for job 173591 revoked
[2014-02-12T16:58:13.430] debug:  Waiting for job 173591's prolog to complete
[2014-02-12T16:58:13.430] debug:  Finished wait for job 173591's prolog to complete
[2014-02-12T16:58:13.435] Running spank/epilog for jobid [173591] uid [8615]
[2014-02-12T16:58:13.441] debug:  [job 173591] attempting to run epilog [/usr/sbin/slurm_epilog]
[2014-02-12T16:58:13.490] debug:  completed epilog for jobid 173591
[2014-02-12T16:58:13.491] debug:  Job 173591: sent epilog complete msg: rc = 0
Comment 10 Danny Auble 2014-02-26 06:17:23 MST
Thanks Don, could you tell me which proctrack you are using?
Comment 11 Don Lipari 2014-02-26 06:19:15 MST
(In reply to Danny Auble from comment #10)
> Thanks Don, could you tell me which proctrack you are using?

ProctrackType = proctrack/pgid
Comment 12 Danny Auble 2014-03-20 11:46:56 MDT
Don, have you seen this happen since this time?

Did IBM have any ideas on why their component was just sitting there?
Comment 13 Don Lipari 2014-03-20 12:01:50 MDT
(In reply to Danny Auble from comment #12)
> Don have you seen this happen since this time?
> 
> Did IBM have any ideas on why their component was just sitting there?

I have not followed up with IBM regarding this.  The sruns are still there, though, so I will.

I have not seen any more instances of the problem on any of the three systems.
Comment 14 Danny Auble 2014-03-20 12:09:54 MDT
The one that is looking to create a step for job 173591 is just strange.  We are unable to create it.  You can probably end it; try sending a SIGTERM to it and see if that kills it.  I am guessing it will.
Comment 15 Danny Auble 2014-04-24 10:22:26 MDT
Have you seen this happen since?
Comment 16 Don Lipari 2014-04-28 08:17:24 MDT
(In reply to Danny Auble from comment #15)
> Have you seen this happen since?

Yes.  On Vulcan.  There were some sruns from old jobs still running when I updated Slurm on Vulcan last Thursday.  Unfortunately, I had a tight window and elected to kill them off rather than delay the update.

Going forward, I will periodically check for the problem.  Keep the ticket open for the time being.
Comment 17 Don Lipari 2014-05-15 04:52:58 MDT
I noticed a new occurrence of a straggling srun on vulcan today:
$ scontrol version
slurm 14.03.3-2

$ ps -fC srun
UID        PID  PPID  C STIME TTY          TIME CMD
clay8    12013     1  0 May14 ?        00:00:00 srun -ppdebug -N 8 -n 64 /g/g14/clay8/bgq/vasp.5.3/vasp

The back trace will be attached.  It showed me that the job ID was 1135477.

$ sacct -j 1135477 -o jobid,user,nnodes,start,end,state,exit
       JobID      User   NNodes               Start                 End      State ExitCode 
------------ --------- -------- ------------------- ------------------- ---------- -------- 
1135477          clay8        8 2014-05-14T11:17:47 2014-05-14T11:18:35 CANCELLED+      0:0 

> zgrep 1135477 slurmctld.log-20140515.gz              
[2014-05-14T11:17:28.330] debug:  bluegene:submit_job: 1135477 mode=512 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=(null) mps=1-1-1
[2014-05-14T11:17:28.333] debug:  bluegene:submit_job: 1135477 mode=513 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=(null) mps=1-1-1
[2014-05-14T11:17:28.337] debug:  513 can start unassigned job 1135477 at 1400094098 on vulcan0003
[2014-05-14T11:17:28.337] sched: _slurm_rpc_allocate_resources JobId=1135477 NodeList=(null) usec=7716
[2014-05-14T11:17:47.914] debug:  bluegene:submit_job: 1135477 mode=512 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=unassigned mps=1-1-1
[2014-05-14T11:17:47.923] debug:  512(0) can start job 1135477 at 1400091467 on RMP13Ma232809064(vulcan0003) 1
[2014-05-14T11:17:47.924] Queue start of job 1135477 in BG block RMP13Ma232809064
[2014-05-14T11:17:47.924] sched: Allocate JobId=1135477 MidplaneList=vulcan0003[00000x00130]
[2014-05-14T11:18:35.706] Queue termination of job 1135477 in BG block RMP13Ma232809064
[2014-05-14T11:18:35.706] job_signal 9 of running job 1135477 successful
[2014-05-14T11:18:35.706] sched: Cancel of JobId=1135477 by UID=46265, usec=349
[2014-05-14T11:18:35.713] debug:  ba_remove_job_in_block_job_list: Removing sub-block job 1135477 from block RMP13Ma232809064

> grep "2014-05-14 11:18" vulcanlac3-runjob_mux_vulcanlac3.log
2014-05-14 11:18:43.352 (INFO ) [0x400075090c0] 1167:ibm.runjob.mux.client.Listener: accepted
2014-05-14 11:18:43.354 (INFO ) [0x400075090c0] 1167:ibm.runjob.mux.client.Credentials: clay8 12075
2014-05-14 11:18:57.066 (INFO ) [0x400075090c0] 1167:clay8:ibm.runjob.mux.client.Runjob: job 2058564 inserted
2014-05-14 11:18:57.117 (INFO ) [0x400059090c0] 1167:clay8:ibm.runjob.mux.client.Runjob: job 2058564 started
2014-05-14 11:18:58.413 (WARN ) [0x400059090c0] 1167:clay8:ibm.runjob.mux.client.Runjob: Could not read: End of file
2014-05-14 11:18:58.413 (INFO ) [0x400059090c0] 1167:clay8:ibm.runjob.mux.client.Runjob: ~Runjob
Comment 18 Don Lipari 2014-05-15 04:54:02 MDT
Created attachment 857 [details]
back trace of orphan srun
Comment 19 Danny Auble 2014-05-15 04:59:29 MDT
Hum, similar to the other job.  Looks like the srun was still waiting for the job to start, or the block to boot.

Could you send me the full log from

2014-05-14T11:17:47.924 - 2014-05-14T11:18:35.706

Can you tell if the srun is stuck in that log spot?  It looks like the signal came in (at least that is where the thread is) and isn't progressing.

After you do that see what happens if you send a sigterm to the srun process.  If the process dies then it isn't locked up.
Comment 20 Danny Auble 2014-05-15 05:11:31 MDT
Don, it was only these 2 threads in the srun process?
Comment 21 Don Lipari 2014-05-15 05:12:52 MDT
(In reply to Danny Auble from comment #19)
> Hum, similar to the other job.  Looks like the srun was still waiting for
> the job to start, or the block to boot.
> 
> Could you send me the full log from
> 
> 2014-05-14T11:17:47.924 - 2014-05-14T11:18:35.706
> 
> Can you tell if the srun is stuck in that log spot?

I'm not sure how to answer this.  Does this help?

> gstack 12013
Thread 2 (Thread 0x40004c6d0c0 (LWP 12014)):
#0  0x00000400003e3f00 in .__lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000400003dc700 in .pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x0000000010059ad0 in .log_msg ()
#3  0x000000001005aa60 in .debug3 ()
#4  0x0000000010025450 in .eio_message_socket_accept ()
#5  0x0000000010026978 in ._poll_handle_event ()
#6  0x0000000010026514 in ._poll_dispatch ()
#7  0x0000000010025eac in .eio_handle_mainloop ()
#8  0x00000000101d7a0c in ._msg_thr_internal ()
#9  0x00000400003da21c in .start_thread () from /lib64/libpthread.so.0
#10 0x000004000052e57c in .__clone () from /lib64/libc.so.6
Thread 1 (Thread 0x400035ea6a0 (LWP 12013)):
#0  0x00000400003e3f00 in .__lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000400003dc700 in .pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x0000000010059ad0 in .log_msg ()
#3  0x000000001005a980 in .debug ()
#4  0x000000001020dacc in ._signal_while_allocating ()
#5  <signal handler called>
#6  0x000000001020da94 in ._signal_while_allocating ()
#7  <signal handler called>
#8  0x000000001020da94 in ._signal_while_allocating ()
#9  <signal handler called>
#10 0x00000400003e3eb0 in .__lll_lock_wait () from /lib64/libpthread.so.0
#11 0x00000400003dc700 in .pthread_mutex_lock () from /lib64/libpthread.so.0
#12 0x0000000010059ad0 in .log_msg ()
#13 0x000000001005a980 in .debug ()
#14 0x000000001020dacc in ._signal_while_allocating ()
#15 <signal handler called>
#16 0x0000040000521e84 in .__poll () from /lib64/libc.so.6
#17 0x000000001005697c in ._fd_writeable ()
#18 0x0000000010059954 in ._log_printf ()
#19 0x000000001005a060 in .log_msg ()
#20 0x000000001005a8a0 in .slurm_info ()
#21 0x00000000101d7854 in ._wait_for_allocation_response ()
#22 0x00000000101d5e8c in .slurm_allocate_resources_blocking ()
#23 0x000000001020e81c in .allocate_nodes ()
#24 0x000000001020a9a8 in .create_srun_job ()
#25 0x0000000010023c94 in .srun ()
#26 0x0000000010024ff8 in .main ()


> It looks like the
> signal came in (at least that is where the thread is) and isn't progressing.
> 
> After you do that see what happens if you send a sigterm to the srun
> process.  If the process dies then it isn't locked up.

The process remained:
kill -TERM 12013
# vulcanlac3 /root >  ps -fC srun
UID        PID  PPID  C STIME TTY          TIME CMD
clay8    12013     1  0 May14 ?        00:00:00 srun -ppdebug -N 8 -n 64 /g/g14/clay8/bgq/vasp.5.3/vasp
Comment 22 Don Lipari 2014-05-15 05:13:51 MDT
(In reply to Danny Auble from comment #20)
> Don it was only these 2 threads in the srun process?

Yes.  I did not filter anything out of the gdb output.
Comment 23 Danny Auble 2014-05-15 05:22:23 MDT
Don, in frame 17 of thread 1 could you see if you can tell what errno is?
Comment 24 Don Lipari 2014-05-15 05:25:13 MDT
(In reply to Danny Auble from comment #23)
> Don, in frame 17 of thread 1 could you see if you can tell what errno is?

#17 0x000000001005697c in _fd_writeable (fd=2) at log.c:249
249     log.c: No such file or directory.
        in log.c
(gdb) p errno
$1 = 0
Comment 25 Danny Auble 2014-05-15 06:33:07 MDT
I have been able to reproduce.
Comment 26 Danny Auble 2014-05-15 06:39:46 MDT
This is now fixed in commit c563b34e3c4cbcd2fcba29ff159ca3c2cbe7b670.  This last gdb trace made it clear what was happening.  As you are probably aware, this should be a very rare occurrence.  The fallout is what you see: the srun hangs and will not run the job.  The slurmctld cleans things up, but the srun job is lost.  This can only happen with srun-allocated jobs; sbatch/salloc jobs will not hit this situation.