We have an intermittent problem where an MPI-based application will abort, but the srun will be "stuck" until the job hits its walltime and is killed.

The slurmd log shows:

[2023-10-18T09:28:21.752] Launching batch job 1478837 for UID 58732
[2023-10-18T09:28:37.515] launch task StepId=1478837.0 request from UID:58732 GID:1222 HOST:10.128.0.24 PORT:49856
[2023-10-18T09:28:37.515] task/affinity: lllp_distribution: JobId=1478837 binding: threads, dist 2
[2023-10-18T09:28:37.515] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2023-10-18T09:28:37.515] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [1478837]: mask_cpu, 0x000000000000000000000000000000FE,0x0000000000000000000000000000FE00,0x00000000000000000000000000FE0000,0x000000000000000000000000FE000000,0x0000000000000000000000FE00000000,0x00000000000000000000FE0000000000,0x000000000000000000FE000000000000,0x0000000000000000FE00000000000000
[2023-10-18T09:31:22.317] [1478837.0] error: *** STEP 1478837.0 ON frontier00001 CANCELLED AT 2023-10-18T09:31:22 ***
[2023-10-18T09:33:37.000] [1478837.0] get_exit_code task 0 died by signal: 15
[2023-10-18T09:33:38.505] [1478837.0] done with job
[2023-10-18T09:49:22.206] error: forward_thread to frontier00294 (10.128.4.172:6818): Connection timed out
[2023-10-18T09:49:22.214] error: forward_thread to frontier00351 (10.128.5.144:6818): Connection timed out

Note that the two nodes with the forward_thread errors are down and are not in this job; I believe those messages are unrelated.

[root@frontier00001 ~]# sinfo -n frontier00294,frontier00351
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
batch*       up 7-00:00:00      1 drain$  frontier00294
batch*       up 7-00:00:00      1  down$  frontier00351

srun showed:

srun: error: frontier06442: task 43344: Exited with exit code 255
srun: Terminating StepId=1478837.0
slurmstepd: error: *** STEP 1478837.0 ON frontier00001 CANCELLED AT 2023-10-18T09:31:22 ***
srun: error: frontier08642: task 58415: Terminated
# ... lots of Terminated messages
srun: error: frontier06443: tasks 43352-43355,43357-43359: Terminated
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection timed out
srun: error: frontier06445: tasks 43368,43370-43372,43374-43375: Terminated
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection timed out
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection timed out
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection timed out

On all of the nodes, the only processes remaining for the user are on the batch node:

[root@frontier00001 ~]# ps aux|grep srun
reubendb 40642  1.1  0.0 1638196 78196 ?      Sl   09:28   0:47 srun -u -N 8000 -n 64000 -c 7 --gpu-bind=closest --gpus-per-task=1 /lustre/orion/stf016/scratch/reubendb/workspaces/frontier/Scratch/cholla/Cosmo_8000_Nodes/1697555966.1081736/build_directory/./bin/cholla.FOM.frontier cosmo.txt
reubendb 40644  0.0  0.0  156612  4952 ?      S    09:28   0:00 srun -u -N 8000 -n 64000 -c 7 --gpu-bind=closest --gpus-per-task=1 /lustre/orion/stf016/scratch/reubendb/workspaces/frontier/Scratch/cholla/Cosmo_8000_Nodes/1697555966.1081736/build_directory/./bin/cholla.FOM.frontier cosmo.txt
root     42654  0.0  0.0   10268   884 pts/2  S+   10:37   0:00 grep --color=auto srun

[root@frontier00001 ~]# gstack 40642
Thread 5 (Thread 0x15554a7d4700 (LWP 40854) "srun"):
#0  0x0000155554cec70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000155554f6a1b5 in _check_io_timeout (_sls=0x6264c0) at step_launch.c:1858
#2  0x0000155554ce56ea in start_thread () from /lib64/libpthread.so.0
#3  0x000015555408da6f in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x15554bbe8700 (LWP 40647) "srun"):
#0  0x00001555540811c9 in poll () from /lib64/libc.so.6
#1  0x0000155554f93ee2 in poll (__timeout=-1, __nfds=<optimized out>, __fds=0x15553c38abb0) at /usr/include/bits/poll2.h:46
#2  _poll_internal (shutdown_time=<optimized out>, nfds=168, pfds=0x15553c38abb0) at eio.c:351
#3  eio_handle_mainloop (eio=<optimized out>) at eio.c:314
#4  0x0000155554f67e48 in _io_thr_internal (cio_arg=0x824c60) at step_io.c:814
#5  0x0000155554ce56ea in start_thread () from /lib64/libpthread.so.0
#6  0x000015555408da6f in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x15554bce9700 (LWP 40646) "srun"):
#0  0x00001555540811c9 in poll () from /lib64/libc.so.6
#1  0x0000155554f93ee2 in poll (__timeout=-1, __nfds=<optimized out>, __fds=0x155544000b70) at /usr/include/bits/poll2.h:46
#2  _poll_internal (shutdown_time=<optimized out>, nfds=169, pfds=0x155544000b70) at eio.c:351
#3  eio_handle_mainloop (eio=<optimized out>) at eio.c:314
#4  0x0000155554f68cf2 in _msg_thr_internal (arg=<optimized out>) at step_launch.c:1110
#5  0x0000155554ce56ea in start_thread () from /lib64/libpthread.so.0
#6  0x000015555408da6f in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x15554bfda700 (LWP 40645) "srun"):
#0  0x0000155553fc1ccc in sigtimedwait () from /lib64/libc.so.6
#1  0x0000155554cf13cc in sigwait () from /lib64/libpthread.so.0
#2  0x00000000004151ba in _srun_signal_mgr (job_ptr=0x492fb0) at srun_job.c:2349
#3  0x0000155554ce56ea in start_thread () from /lib64/libpthread.so.0
#4  0x000015555408da6f in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x155555529740 (LWP 40642) "srun"):
#0  0x0000155554cec70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000155554f6c42c in slurm_step_launch_wait_finish (ctx=0x626480) at step_launch.c:691
#2  0x000000000040cfad in launch_g_step_wait (job=job@entry=0x492fb0, got_alloc=got_alloc@entry=false, opt_local=opt_local@entry=0x424fa0 <opt>) at launch.c:1488
#3  0x00000000004113cb in _launch_one_app (data=<optimized out>) at srun.c:305
#4  0x0000000000412005 in _launch_app (got_alloc=false, srun_job_list=0x0, job=0x492fb0) at srun.c:600
#5  srun (ac=<optimized out>, av=0x7fffffff3b78) at srun.c:226
#6  0x0000000000417410 in main (argc=<optimized out>, argv=<optimized out>) at srun.wrapper.c:17

[root@frontier00001 ~]# gstack 40644
#0  0x0000155554cf06ce in read () from /lib64/libpthread.so.0
#1  0x0000000000416e6c in read (__nbytes=1, __buf=0x7fffffff389f, __fd=<optimized out>) at /usr/include/bits/unistd.h:44
#2  _shepherd_spawn (got_alloc=false, srun_job_list=0x0, job=0x492fb0) at srun_job.c:2304
#3  create_srun_job (p_job=p_job@entry=0x425ca8 <job>, got_alloc=got_alloc@entry=0x7fffffff3943, slurm_started=slurm_started@entry=false, handle_signals=handle_signals@entry=true) at srun_job.c:1524
#4  0x0000000000411a8b in srun (ac=12, av=0x7fffffff3b78) at srun.c:195
#5  0x0000000000417410 in main (argc=<optimized out>, argv=<optimized out>) at srun.wrapper.c:17

What can we debug to see why this srun is hung? Which tasks failed to get their MESSAGE_TASK_EXIT through, and will that be retried or will it eventually time out?
I found the nodes that had the MESSAGE_TASK_EXIT send failures:

frontier06443: Oct 18 09:35:42 frontier06443 slurmstepd[83892]: [1478837.0] error: Failed to send MESSAGE_TASK_EXIT: Connection timed out
frontier06441: Oct 18 09:34:56 frontier06441 slurmstepd[49910]: [1478837.0] error: Failed to send MESSAGE_TASK_EXIT: Connection timed out
frontier06447: Oct 18 09:35:00 frontier06447 slurmstepd[96585]: [1478837.0] error: Failed to send MESSAGE_TASK_EXIT: Connection timed out
frontier06445: Oct 18 09:35:44 frontier06445 slurmstepd[19700]: [1478837.0] error: Failed to send MESSAGE_TASK_EXIT: Connection timed out

Note that these nodes had an interface bounce due to a switch issue (which is what caused MPI to abort in the first place). That is likely the cause of the connection timeouts, but connectivity was restored before slurmd_timeout expired, so the nodes were never marked down.

Is that send ever retried in the slurmstepd? And is there anything in the srun that times out while waiting for all of the tasks to report their exit?
_send_srun_resp_msg() will retry. For this 8000-node job it will retry 12 times, sleeping 0.1s, 0.2s, 0.4s, and then 0.8s repeatedly between attempts, for a total of ~8 seconds (might be off by one retry). The link was down for more than 8 seconds, so it seems likely that we blackholed all of the retries here.

_send_exit_msg() will log a failure of _send_srun_resp_msg() but still return SLURM_SUCCESS. Not that it matters: stepd_send_pending_exit_msgs() ignores the RC anyway while still setting esent to true, so there's never a higher-level retry.

I *think* the behavior I'm expecting to see here is that slurmstepd hangs around until it can confirm it sent the exit message.
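To make that arithmetic concrete, here is a rough back-of-the-envelope sketch (illustrative C only, not the actual Slurm code; the function name is made up) of how the capped-doubling backoff described above adds up for a given retry count:

/*
 * Rough, illustrative sketch only -- not the Slurm source. It just models
 * the sleep schedule described above (0.1s, 0.2s, 0.4s, then 0.8s repeating)
 * to show how little total time a given retry count actually buys.
 */
#include <stdio.h>

/* Total seconds spent sleeping between send attempts for a retry count. */
static double total_backoff(int retries)
{
	double sleep_s = 0.1, total = 0.0;

	for (int i = 0; i < retries; i++) {
		total += sleep_s;
		if (sleep_s < 0.8)	/* double until capped at 0.8s */
			sleep_s *= 2;
	}
	return total;
}

int main(void)
{
	printf("12 retries: ~%.1fs\n", total_backoff(12));	/* ~7.9s */
	printf(" 5 retries: ~%.1fs\n", total_backoff(5));	/* ~2.3s */
	return 0;
}

Even the 12-retry budget amounts to under 8 seconds of waiting, so any outage longer than that blackholes every retry.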
Created attachment 32823 [details]
Proposed patch v1

Compile-tested only - I'll try this on our test cluster later tonight.
Verified that the patch seems to work.

Control: started a 2-node job and downed an interface on node-2. Verified output stopped and MPI aborted. Verified node-2 logged 'Failed to send MESSAGE_TASK_EXIT'. Upped the port. Verified that the srun never returned.

Test-with-patch: started a 2-node job and downed an interface on node-2. Verified output stopped and MPI aborted. Verified node-2 logged 'Failed to send MESSAGE_TASK_EXIT'. Upped the port. Verified that srun returned, stdout showing:

slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: No route to host
srun: error: borg014: tasks 10-15: Terminated
srun: error: borg014: tasks 8-9: Terminated
srun: Force Terminated StepId=46959.0
ezy@borg013:~/system_tests/programs/halo/bin>
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: No route to host

This error message won't show up on an XC or EX system. There are no IP routers, so the error path goes through a MessageTimeout instead.

Also, you should add a test that kills srun, which would leave the slurmstepd orphaned. That error path should be handled with a timeout.
(In reply to Brian F Gilmer from comment #5)
> slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: No route to host
>
> This error message won't show up on an XC or EX system. There are no IP
> routers, so the error path goes through a MessageTimeout instead.

This test was run on an EX. I downed the port on the Slingshot switch (since just ifdown-ing the port on the node only stops Ethernet connectivity, not RDMA). I'm not sure of the exact mechanics here (did the kernel see the interface as down and return "no route", or did the retry handler park the destination and return a NACK?).

> Also, you should add a test that kills srun, which would leave the slurmstepd
> orphaned. That error path should be handled with a timeout.

Ah, good point. My earlier assertion that the behavior I'd expect is "slurmstepd hangs around until it can confirm it sent the exit message" breaks down in this scenario. I guess if you have two entities that need to talk, but either could go away unexpectedly, both have to eventually time out trying to talk to each other.
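Roughly, the kind of bounded wait I have in mind on the slurmstepd side looks like this (an illustrative sketch only, not the proposed patch or actual Slurm code; try_send_task_exit() and the 1-second pacing are stand-ins):

#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Stand-in for whatever actually transmits MESSAGE_TASK_EXIT to srun; here
 * it always fails, as if the peer has gone away, to exercise the deadline. */
static bool try_send_task_exit(void)
{
	return false;
}

/* Keep retrying delivery, but only until a single wall-clock deadline
 * expires, so an orphaned slurmstepd cannot wait forever. */
static bool send_task_exit_bounded(int timeout_sec)
{
	time_t deadline = time(NULL) + timeout_sec;

	do {
		if (try_send_task_exit())
			return true;	/* delivery confirmed */
		sleep(1);		/* fixed pacing; real code could back off */
	} while (time(NULL) < deadline);

	return false;			/* give up and let the stepd exit */
}

int main(void)
{
	printf("delivered: %s\n", send_task_exit_bounded(5) ? "yes" : "no");
	return 0;
}

The same kind of bound would have to exist on the srun side as well, so neither end can wait on the other indefinitely.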
Hey Matt!

I tend to agree with Brian here that this probably needs to time out eventually, but I also think we don't need to be as aggressive about dropping this communication as we are today. I don't think your patch would cause significant problems in most cases, but for jobs without a time limit, waiting forever could be an issue.

I'll take a look and see what we can do here to make things better. For my reference, about how long was the link down for? And do you know what is/was the cause of the link going down and back up?

Thanks!
--Tim
(In reply to Tim McMullan from comment #7)
> I'll take a look and see what we can do here to make things better. For my
> reference, about how long was the link down for? And do you know what
> is/was the cause of the link going down and back up?

We have 4 links per node. All incoming Slurm traffic uses hsn0, since that interface has the IP associated with the hostname. Outgoing traffic may source from any of the links, based on how the kernel chooses to hash routes. hsn0 did not go down, so slurmd was still able to respond to pings; the connection from this node back to the srun likely sourced from a link that did go down. I'm not sure why the kernel didn't update its routes to use a different interface - I'll have to look into that.

In this specific case, a switch rebooted, so 2 links (attached to the same switch) were down for 3:30:

Oct 18 09:31:03 frontier06441 kernel: cxi_core 0000:d5:00.0: cxi2[hsn2] CXI_EVENT_LINK_DOWN
Oct 18 09:31:04 frontier06441 kernel: cxi_core 0000:dd:00.0: cxi3[hsn3] CXI_EVENT_LINK_DOWN
Oct 18 09:34:33 frontier06441 kernel: cxi_core 0000:d5:00.0: cxi2[hsn2] CXI_EVENT_LINK_UP
Oct 18 09:34:33 frontier06441 kernel: cxi_core 0000:dd:00.0: cxi3[hsn3] CXI_EVENT_LINK_UP

This is not a typical scenario (a link being down that long), but it does happen occasionally. In most other cases we've observed, a link will flap due to a high bit-error rate or loss of signal alignment; generally the link can re-train its signal parameters and become operational again in 5-15 seconds. For jobs of fewer than 1024 nodes, the code currently only retries 5 times over about 2.3 seconds, which won't survive even a brief link flap.
Hey Matt,

I'm sorry this took so long, but we finally came to something that made sense to us and pushed it ahead of 23.11.4:

https://github.com/SchedMD/slurm/commit/0cbf687c5f

What we've done here is change the logic so that it is a constant-time wait: regardless of how the message failure happened, it will always wait for msg_timeout. There were a bunch of ways this was dying without really waiting very long, and I think this will help that situation.

Let me know if you have any thoughts on this!

Thanks,
--Tim
Hi Matt,

Unfortunately we have had to push this change off. Additional testing revealed some other problems that are more complicated to fix. We are going to work on fixing the other underlying problem and then implement a very similar approach to fix the problem originally described by this ticket.

Sorry for the false start on this one!

--Tim
*** Ticket 19422 has been marked as a duplicate of this ticket. ***
*** Ticket 16013 has been marked as a duplicate of this ticket. ***
Hi Matt,

I'm sorry for the continued delay here. We've made some substantial progress on the underlying issues that needed solving in 24.05: commits 3efd8109f0 through 0d20e2cd18 were all pushed ahead of 24.05.0, along with other changes that should help reduce the likelihood of this happening naturally.

That said, there is still work being done in this area to fully fix the issue. I'll keep you updated as we make more progress.

Thanks!
--Tim