| Summary: | straggling sruns | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | Bluegene select plugin | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | da |
| Version: | 2.6.x | | |
| Hardware: | IBM BlueGene | | |
| OS: | Linux | | |
| Site: | LLNL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 14.03.4 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | gdb logs; gdb log for process 29686; back trace of orphan srun | | |
Based off the dates these are quite old indeed ;).

And there is nothing in the slurmctld log talking about the db waiting for the jobs to finish? Since they are so old I am guessing there isn't anything in the logs about these jobs or how they finished.

I am also guessing the processes aren't still running in the database? And the resources they were running on have run jobs since these finished?

Could you give me the sacct information on these?

Could you also send the gdb information of pid 29686? 29687 is the child process.

Created attachment 649 [details]
gdb log for process 29686
(In reply to Danny Auble from comment #1)
> Based off the dates these are quite old indeed ;).
>
> And there is nothing in the slurmctld log talking about the db waiting for
> the jobs to finish?

No. They've rolled off.

> Since they are so old I am guessing there isn't anything in the logs about
> these jobs or how they finished.
>
> I am also guessing the processes aren't still running in the database? And
> the resources they were running on have run jobs since these finished?
>
> Could you give me the sacct information on these?

I have yet to match up the jobs to the hung sruns. Here's what I've got:

$ sacct -X -u petert -S 02/12/14
       JobID      User    JobName    Account    Cluster  Partition        QOS   NNodes      State               Start    Elapsed ExitCode
------------ --------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------------- ---------- --------
      173512    petert     sxterm    science     rzuseq     pdebug     normal       32    TIMEOUT 2014-02-12T09:40:12   01:00:01      0:1
      173556    petert     sxterm    science     rzuseq     pdebug     normal       32    TIMEOUT 2014-02-12T10:40:30   01:00:13      0:1
      173578    petert     sxterm    science     rzuseq     pdebug     normal       32    TIMEOUT 2014-02-12T12:10:09   01:00:04      0:1
      173591    petert     sxterm    science     rzuseq     pdebug     normal       32  COMPLETED 2014-02-12T15:59:36   00:58:37      0:0
      173593    petert     sxterm    science     rzuseq     pdebug     normal       32  COMPLETED 2014-02-12T16:58:20   00:02:26      0:0
      173595    petert     sxterm    science     rzuseq     pdebug     normal       32    TIMEOUT 2014-02-12T17:00:55   01:00:19      0:1
      180181    petert     sxterm    science     rzuseq     pdebug     normal        2    TIMEOUT 2014-02-14T05:53:33   01:00:12      0:1
      180714    petert     sxterm    science     rzuseq     pdebug     normal        2  COMPLETED 2014-02-14T09:55:50   00:23:22      0:0
      180720    petert     sxterm    science     rzuseq     pdebug     normal        2    TIMEOUT 2014-02-14T10:19:17   01:00:00      0:1
      189434    petert     sxterm    science     rzuseq     pdebug     normal        2    TIMEOUT 2014-02-19T11:39:41   01:00:18      0:1

> Could you also send the gdb information of pid 29686? 29687 is the child
> process.

Attached.
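As an aside, the sacct matching above can be scripted rather than read by eye. A minimal sketch, not from the ticket: `--parsable2` (pipe-delimited output, available in sacct of this era) makes the State column easy to filter. Canned text stands in for the live `sacct -X -u petert -S 02/12/14 --parsable2 -o jobid,state` call, and the job ids shown are just the ones from the table above.

```shell
# Hypothetical helper, not from the ticket: pick out jobs that ended in
# TIMEOUT from pipe-delimited sacct output. The here-string stands in for:
#   sacct -X -u petert -S 02/12/14 --parsable2 -o jobid,state
sacct_text='JobID|State
173512|TIMEOUT
173591|COMPLETED
180714|COMPLETED
189434|TIMEOUT'
printf '%s\n' "$sacct_text" | awk -F'|' 'NR > 1 && $2 == "TIMEOUT" { print $1 }'
```

The same pipeline works for any other state (COMPLETED, CANCELLED, and so on) by changing the comparison.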
If you go to the last gdb you sent for 29686 (stuck in runjob code) frame 8 should give you the job id.

print *job

IBM might want to look at that backtrace.

This one appears to still be running a job with no sign of ending ;). I bet if you signalled it to shutdown it would work just fine.

Before/if you do that lets see what Slurm thinks the state of that job is as we can find out from frame 8.

The first one is different, it appears not to have a child and stuck launching the step. I'll look more into that one. If you could go into frame 1 and

print *step_params

That would tell us which job this belongs to.

(In reply to Danny Auble from comment #4)
> If you go to the last gdb you sent for 29686 (stuck in runjob code) frame 8
> should give you the job id.
>
> print *job

jobid = 180714

> IBM might want to look at that backtrace.
>
> This one appears to still be running a job with no sign of ending ;). I bet
> if you signalled it to shutdown it would work just fine.
>
> Before/if you do that lets see what Slurm thinks the state of that job is as
> we can find out from frame 8.

$ sacct -j 180714
       JobID      User    JobName    Account    Cluster  Partition        QOS   NNodes      State               Start    Elapsed ExitCode
------------ --------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------------- ---------- --------
      180714    petert     sxterm    science     rzuseq     pdebug     normal        2  COMPLETED 2014-02-14T09:55:50   00:23:22      0:0
180714.batch               batch      science     rzuseq                              1  CANCELLED 2014-02-14T09:55:50   00:23:22  255:126
    180714.0          /g/g22/pe+     science     rzuseq                              2  COMPLETED 2014-02-14T10:08:48   00:01:53      0:0
    180714.1          /g/g22/pe+     science     rzuseq                              2  COMPLETED 2014-02-14T10:10:52   00:00:04      0:0
    180714.2          /g/g22/pe+     science     rzuseq                              2  COMPLETED 2014-02-14T10:10:59   00:00:04      0:0
    180714.3          /g/g22/pe+     science     rzuseq                              2  CANCELLED 2014-02-14T10:15:12   00:04:00  255:126

> The first one is different, it appears not to have a child and stuck
> launching the step. I'll look more into that one. If you could go into
> frame 1 and
>
> print *step_params
>
> That would tell us which job this belongs to.

(gdb) p *step_params
$1 = {ckpt_interval = 0, cpu_count = 32, cpu_freq = 4294967294, exclusive = 0, features = 0x0, immediate = 0, job_id = 173591, pn_min_memory = 0, ckpt_dir = 0x1045f7a8 "/g/g22/petert", gres = 0x0, name = 0x104c2108 "/g/g22/petert/ALLcrzuseq", network = 0x0, profile = 0, no_kill = 0 '\000', min_nodes = 32, max_nodes = 32, node_list = 0x0, overcommit = false, plane_size = 65534, relative = 65534, resv_port_cnt = 65534, task_count = 32, task_dist = 1, time_limit = 0, uid = 8615, verbose_level = 0}

$ sacct -j 173591
       JobID      User    JobName    Account    Cluster  Partition        QOS   NNodes      State               Start    Elapsed ExitCode
------------ --------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ------------------- ---------- --------
      173591    petert     sxterm    science     rzuseq     pdebug     normal       32  COMPLETED 2014-02-12T15:59:36   00:58:37      0:0
173591.batch               batch      science     rzuseq                             1  CANCELLED 2014-02-12T15:59:36   00:58:37  255:126
    173591.0          ./ALLcrzu+     science     rzuseq                            32  COMPLETED 2014-02-12T16:44:57   00:00:06      0:0
    173591.1          /g/g22/pe+     science     rzuseq                            32  CANCELLED 2014-02-12T16:51:25   00:06:48  255:126

Thanks Don, does 180714 also have a step # available?

173591 is a strange one, it just looks like it is waiting for something to happen but it never does.

I am wondering if it was scancelled by the user while it was waiting for a creation. Based off gdb the step never was allocated but I am guessing it was either 173591.1 or a non-existent 173591.2.

Could you ask the user how many steps should run in that particular job? And if they are all started at once or one after another.

Also in the gdb of 173591 in frame 1 could you

print destroy_step;

(In reply to Danny Auble from comment #7)
> Also in the gdb of 173591 in frame 1 could you
>
> print destroy_step;

sanity check...

(gdb) p *step_params
$1 = {ckpt_interval = 0, cpu_count = 32, cpu_freq = 4294967294, exclusive = 0, features = 0x0, immediate = 0, job_id = 173591, pn_min_memory = 0, ckpt_dir = 0x1045f7a8 "/g/g22/petert", gres = 0x0, name = 0x104c2108 "/g/g22/petert/ALLcrzuseq", network = 0x0, profile = 0, no_kill = 0 '\000', min_nodes = 32, max_nodes = 32, node_list = 0x0, overcommit = false, plane_size = 65534, relative = 65534, resv_port_cnt = 65534, task_count = 32, task_dist = 1, time_limit = 0, uid = 8615, verbose_level = 0}

as requested...

(gdb) p destroy_step
$2 = 0

(In reply to Danny Auble from comment #6)
> Thanks Don, does 180714 also have a step # available?
>
> 173591 is a strange one, it just looks like it is waiting for something to
> happen but it never does.
>
> I am wondering if it was scancelled by the user while it was waiting for a
> creation. Based off gdb the step never was allocated but I am guessing it
> was either 173591.1 or a non-existent 173591.2.
>
> Could you ask the user how many steps should run in that particular job?
> And if they are all started at once or one after another.

sxterm is an xterm window from which the user manually launches srun's.

FWIW, from the slurmd.log...
> zgrep 173591 slurmd.log-20140217.gz
[2014-02-12T15:59:36.258] debug: task_slurmd_batch_request: 173591
[2014-02-12T15:59:36.265] Running spank/prolog for jobid [173591] uid [8615]
[2014-02-12T15:59:36.270] debug: [job 173591] attempting to run prolog [/usr/sbin/slurm_prolog]
[2014-02-12T15:59:36.433] Launching batch job 173591 for UID 8615
[2014-02-12T15:59:36.529] [173591] laying out the 32 tasks on 32 hosts rzuseq0000 dist 2
[2014-02-12T15:59:36.529] [173591] Message thread started pid = 20884
[2014-02-12T15:59:36.529] [173591] task NONE plugin loaded
[2014-02-12T15:59:36.530] [173591] Checkpoint plugin loaded: checkpoint/none
[2014-02-12T15:59:36.530] [173591] mpi type = (null)
[2014-02-12T15:59:36.530] [173591] spank: opening plugin stack /etc/slurm/plugstack.conf
[2014-02-12T15:59:36.530] [173591] mpi type = (null)
[2014-02-12T15:59:36.530] [173591] mpi/none: slurmstepd prefork
[2014-02-12T15:59:36.535] [173591] debug level = 2
[2014-02-12T15:59:36.537] [173591] task 0 (20888) started 2014-02-12T15:59:36
[2014-02-12T15:59:36.537] [173591] task_pre_launch_priv: 173591.4294967294
[2014-02-12T15:59:36.538] [173591] task_pre_launch: 173591.4294967294, task 0
[2014-02-12T16:24:43.415] [173591] auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2014-02-12T16:24:43.416] [173591] Handling REQUEST_STATE
[2014-02-12T16:24:43.416] debug: found apparently running job 173591
[2014-02-12T16:58:03.974] [173591] Handling REQUEST_STATE
[2014-02-12T16:58:03.974] debug: found apparently running job 173591
[2014-02-12T16:58:13.417] [173591] task 0 (20888) exited with exit code 0.
[2014-02-12T16:58:13.417] [173591] task_post_term: 173591.4294967294, task 0
[2014-02-12T16:58:13.418] [173591] Message thread exited
[2014-02-12T16:58:13.418] [173591] job 173591 completed with slurm_rc = 0, job_rc = 0
[2014-02-12T16:58:13.418] [173591] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2014-02-12T16:58:13.423] [173591] done with job
[2014-02-12T16:58:13.429] debug: task_slurmd_release_resources: 173591
[2014-02-12T16:58:13.429] debug: credential for job 173591 revoked
[2014-02-12T16:58:13.430] debug: Waiting for job 173591's prolog to complete
[2014-02-12T16:58:13.430] debug: Finished wait for job 173591's prolog to complete
[2014-02-12T16:58:13.435] Running spank/epilog for jobid [173591] uid [8615]
[2014-02-12T16:58:13.441] debug: [job 173591] attempting to run epilog [/usr/sbin/slurm_epilog]
[2014-02-12T16:58:13.490] debug: completed epilog for jobid 173591
[2014-02-12T16:58:13.491] debug: Job 173591: sent epilog complete msg: rc = 0

Thanks Don, could you tell me which proctrack you are using?

(In reply to Danny Auble from comment #10)
> Thanks Don, could you tell me which proctrack you are using?

ProctrackType = proctrack/pgid

Don have you seen this happen since this time?

Did IBM have any ideas on why their component was just sitting there?

(In reply to Danny Auble from comment #12)
> Don have you seen this happen since this time?
>
> Did IBM have any ideas on why their component was just sitting there?

I have not followed up w/ IBM regarding this. The sruns are still there, though, so I will.

I have not seen any more instances of the problem on any of the three systems.

The one that is looking to create a step for job 173591 is just strange. We are unable to create it. You can probably end it; try sending a SIGTERM to it and see if that kills it. I am guessing it will.

Have you seen this happen since?

(In reply to Danny Auble from comment #15)
> Have you seen this happen since?

Yes. On Vulcan.
There were some sruns from old jobs still running when I updated slurm on vulcan last Thursday. Unfortunately, I had a tight window and elected to kill them off rather than delay the update.

Going forward, I will periodically check for the problem. Keep the ticket open for the time being.

I noticed a new occurrence of a straggling srun on vulcan today:
$ scontrol version
slurm 14.03.3-2
$ ps -fC srun
UID PID PPID C STIME TTY TIME CMD
clay8 12013 1 0 May14 ? 00:00:00 srun -ppdebug -N 8 -n 64 /g/g14/clay8/bgq/vasp.5.3/vasp
The back trace will be attached. It showed me that the job ID was 1135477
$ sacct -j 1135477 -o jobid,user,nnodes,start,end,state,exit
JobID User NNodes Start End State ExitCode
------------ --------- -------- ------------------- ------------------- ---------- --------
1135477 clay8 8 2014-05-14T11:17:47 2014-05-14T11:18:35 CANCELLED+ 0:0
> zgrep 1135477 slurmctld.log-20140515.gz
[2014-05-14T11:17:28.330] debug: bluegene:submit_job: 1135477 mode=512 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=(null) mps=1-1-1
[2014-05-14T11:17:28.333] debug: bluegene:submit_job: 1135477 mode=513 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=(null) mps=1-1-1
[2014-05-14T11:17:28.337] debug: 513 can start unassigned job 1135477 at 1400094098 on vulcan0003
[2014-05-14T11:17:28.337] sched: _slurm_rpc_allocate_resources JobId=1135477 NodeList=(null) usec=7716
[2014-05-14T11:17:47.914] debug: bluegene:submit_job: 1135477 mode=512 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=unassigned mps=1-1-1
[2014-05-14T11:17:47.923] debug: 512(0) can start job 1135477 at 1400091467 on RMP13Ma232809064(vulcan0003) 1
[2014-05-14T11:17:47.924] Queue start of job 1135477 in BG block RMP13Ma232809064
[2014-05-14T11:17:47.924] sched: Allocate JobId=1135477 MidplaneList=vulcan0003[00000x00130]
[2014-05-14T11:18:35.706] Queue termination of job 1135477 in BG block RMP13Ma232809064
[2014-05-14T11:18:35.706] job_signal 9 of running job 1135477 successful
[2014-05-14T11:18:35.706] sched: Cancel of JobId=1135477 by UID=46265, usec=349
[2014-05-14T11:18:35.713] debug: ba_remove_job_in_block_job_list: Removing sub-block job 1135477 from block RMP13Ma232809064
> grep "2014-05-14 11:18" vulcanlac3-runjob_mux_vulcanlac3.log
2014-05-14 11:18:43.352 (INFO ) [0x400075090c0] 1167:ibm.runjob.mux.client.Listener: accepted
2014-05-14 11:18:43.354 (INFO ) [0x400075090c0] 1167:ibm.runjob.mux.client.Credentials: clay8 12075
2014-05-14 11:18:57.066 (INFO ) [0x400075090c0] 1167:clay8:ibm.runjob.mux.client.Runjob: job 2058564 inserted
2014-05-14 11:18:57.117 (INFO ) [0x400059090c0] 1167:clay8:ibm.runjob.mux.client.Runjob: job 2058564 started
2014-05-14 11:18:58.413 (WARN ) [0x400059090c0] 1167:clay8:ibm.runjob.mux.client.Runjob: Could not read: End of file
2014-05-14 11:18:58.413 (INFO ) [0x400059090c0] 1167:clay8:ibm.runjob.mux.client.Runjob: ~Runjob
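The two greps above bracket the interesting window: slurmctld queued termination of job 1135477 at 11:18:35, yet the runjob_mux shows the client inserting a job at 11:18:57 and then hitting end of file. A small sketch, not from the ticket, of interleaving such excerpts by timestamp to read the sequence in one place; the one-line canned stand-ins (and the `slurmctld:`/`runjob_mux:` source tags) are illustrative, not real log format.

```shell
# Sketch (not from the ticket): merge slurmctld and runjob_mux excerpts by
# timestamp. Canned stand-in lines replace the real log files; real lines
# would first be normalized to the same "YYYY-MM-DD HH:MM:SS.mmm" prefix.
ctld_log='2014-05-14 11:18:35.706 slurmctld: Queue termination of job 1135477 in BG block RMP13Ma232809064'
mux_log='2014-05-14 11:18:57.066 runjob_mux: job 2058564 inserted
2014-05-14 11:18:58.413 runjob_mux: Could not read: End of file'
{ printf '%s\n' "$ctld_log"; printf '%s\n' "$mux_log"; } | sort
```

Because the timestamp prefix sorts lexicographically in chronological order, plain `sort` gives the merged timeline.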
Created attachment 857 [details]
back trace of orphan srun
Hum, similar to the other job. Looks like the srun was still waiting for the job to start, or the block to boot.

Could you send me the full log from

2014-05-14T11:17:47.924 - 2014-05-14T11:18:35.706

Can you tell if the srun is stuck in that log spot? It looks like the signal came in (at least that is where the thread is) and isn't progressing.

After you do that see what happens if you send a sigterm to the srun process. If the process dies then it isn't locked up.

Don, it was only these 2 threads in the srun process?

(In reply to Danny Auble from comment #19)
> Hum, similar to the other job. Looks like the srun was still waiting for
> the job to start, or the block to boot.
>
> Could you send me the full log from
>
> 2014-05-14T11:17:47.924 - 2014-05-14T11:18:35.706
>
> Can you tell if the srun is stuck in that log spot?

I'm not sure how to answer this. Does this help?

> gstack 12013
Thread 2 (Thread 0x40004c6d0c0 (LWP 12014)):
#0  0x00000400003e3f00 in .__lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000400003dc700 in .pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x0000000010059ad0 in .log_msg ()
#3  0x000000001005aa60 in .debug3 ()
#4  0x0000000010025450 in .eio_message_socket_accept ()
#5  0x0000000010026978 in ._poll_handle_event ()
#6  0x0000000010026514 in ._poll_dispatch ()
#7  0x0000000010025eac in .eio_handle_mainloop ()
#8  0x00000000101d7a0c in ._msg_thr_internal ()
#9  0x00000400003da21c in .start_thread () from /lib64/libpthread.so.0
#10 0x000004000052e57c in .__clone () from /lib64/libc.so.6
Thread 1 (Thread 0x400035ea6a0 (LWP 12013)):
#0  0x00000400003e3f00 in .__lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000400003dc700 in .pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x0000000010059ad0 in .log_msg ()
#3  0x000000001005a980 in .debug ()
#4  0x000000001020dacc in ._signal_while_allocating ()
#5  <signal handler called>
#6  0x000000001020da94 in ._signal_while_allocating ()
#7  <signal handler called>
#8  0x000000001020da94 in ._signal_while_allocating ()
#9  <signal handler called>
#10 0x00000400003e3eb0 in .__lll_lock_wait () from /lib64/libpthread.so.0
#11 0x00000400003dc700 in .pthread_mutex_lock () from /lib64/libpthread.so.0
#12 0x0000000010059ad0 in .log_msg ()
#13 0x000000001005a980 in .debug ()
#14 0x000000001020dacc in ._signal_while_allocating ()
#15 <signal handler called>
#16 0x0000040000521e84 in .__poll () from /lib64/libc.so.6
#17 0x000000001005697c in ._fd_writeable ()
#18 0x0000000010059954 in ._log_printf ()
#19 0x000000001005a060 in .log_msg ()
#20 0x000000001005a8a0 in .slurm_info ()
#21 0x00000000101d7854 in ._wait_for_allocation_response ()
#22 0x00000000101d5e8c in .slurm_allocate_resources_blocking ()
#23 0x000000001020e81c in .allocate_nodes ()
#24 0x000000001020a9a8 in .create_srun_job ()
#25 0x0000000010023c94 in .srun ()
#26 0x0000000010024ff8 in .main ()

> It looks like the
> signal came in (at least that is where the thread is) and isn't progressing.
>
> After you do that see what happens if you send a sigterm to the srun
> process. If the process dies then it isn't locked up.

The process remained:

kill -TERM 12013
# vulcanlac3 /root > ps -fC srun
UID        PID  PPID  C STIME TTY          TIME CMD
clay8    12013     1  0 May14 ?        00:00:00 srun -ppdebug -N 8 -n 64 /g/g14/clay8/bgq/vasp.5.3/vasp

(In reply to Danny Auble from comment #20)
> Don it was only these 2 threads in the srun process?

Yes. I did not filter anything out of the gdb output.

Don, in frame 17 of thread 1 could you see if you can tell what errno is?

(In reply to Danny Auble from comment #23)
> Don, in frame 17 of thread 1 could you see if you can tell what errno is?

#17 0x000000001005697c in _fd_writeable (fd=2) at log.c:249
249     log.c: No such file or directory.
        in log.c
(gdb) p errno
$1 = 0

I have been able to reproduce. This is now fixed in commit c563b34e3c4cbcd2fcba29ff159ca3c2cbe7b670.

This last gdb trace made it clear what was happening. As you are probably aware, this should be a very rare occurrence.
The fallout is what you see: the srun will hang and will not run the job. The slurmctld cleans things up, but the srun job will be lost. This would only happen on srun-allocated jobs; sbatch/salloc jobs will not have this situation.
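The SIGTERM experiment used earlier in the ticket generalizes into a quick liveness probe. A sketch, not from the ticket: send SIGTERM, wait briefly, and check with `kill -0` whether the process is still there. A scratch shell that ignores TERM stands in for the wedged srun, since a process stuck inside a signal handler (as this srun was) likewise survives the signal.

```shell
# Sketch (not from the ticket): probe whether a process is wedged by sending
# SIGTERM and then testing for survival with kill -0. The scratch child that
# traps (ignores) TERM imitates the stuck srun's behavior.
sh -c 'trap "" TERM; sleep 30' &
pid=$!
sleep 1                      # give the child time to install its trap
kill -TERM "$pid"
sleep 1
if kill -0 "$pid" 2>/dev/null; then
    echo "process $pid survived SIGTERM"
    kill -KILL "$pid"        # SIGKILL cannot be caught or ignored
else
    echo "process $pid exited on SIGTERM"
fi
```

A healthy srun exits on SIGTERM; one that survives it, like the ones in this ticket, is a candidate for a backtrace before the SIGKILL.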
Created attachment 648 [details]
gdb logs

I noticed today three sruns that are still running long after their jobs completed.

UID        PID  PPID  C STIME TTY          TIME CMD
petert   23029     1  0 Feb12 ?        00:00:00 srun -n 32 /g/g22/petert/ALLcrzuseq
petert   29686     1  0 Feb14 ?        00:00:00 srun -N 2 -n 32 /g/g22/petert/ALLcrzuseq
petert   29687 29686  0 Feb14 ?        00:00:00 srun -N 2 -n 32 /g/g22/petert/ALLcrzuseq

gdb logs from two of the above are attached.
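The listing above is how the stragglers were spotted: sruns reparented to init (PPID 1) after their jobs finished. That check can be automated; a sketch, not from the ticket, using canned `ps` text so the pipeline is self-contained. On a live node you would pipe `ps -eo user,pid,ppid,comm` instead, and note that a deliberately detached srun (e.g. under nohup) would also match.

```shell
# Sketch (not from the ticket): flag srun processes reparented to init
# (PPID 1), as the straggling sruns in this report were. Canned text stands
# in for: ps -eo user,pid,ppid,comm
ps_text='USER     PID  PPID COMMAND
petert 23029     1 srun
petert 29686     1 srun
petert 29687 29686 srun
root    4242  4000 bash'
printf '%s\n' "$ps_text" | awk 'NR > 1 && $3 == 1 && $4 == "srun" { print $2 }'
```

Note that 29687 is not flagged: it is the child of srun 29686, consistent with the ticket's observation that 29687 was the child process.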