Ticket 17941

Summary: srun intermittent hangs on abort
Product: Slurm Reporter: Matt Ezell <ezellma>
Component: User Commands    Assignee: Tim McMullan <mcmullan>
Status: OPEN --- QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: brian.gilmer, esteva.m, marshall, tyler, vergaravg
Version: 23.02.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=16013
https://bugs.schedmd.com/show_bug.cgi?id=19118
Site: ORNL-OLCF
Ticket Blocks: 16013    
Attachments: Proposed patch v1

Description Matt Ezell 2023-10-18 09:19:12 MDT
We have an intermittent problem where an MPI-based application aborts, but srun remains "stuck" until the job hits its walltime and is killed.

The slurmd log shows:
[2023-10-18T09:28:21.752] Launching batch job 1478837 for UID 58732
[2023-10-18T09:28:37.515] launch task StepId=1478837.0 request from UID:58732 GID:1222 HOST:10.128.0.24 PORT:49856
[2023-10-18T09:28:37.515] task/affinity: lllp_distribution: JobId=1478837 binding: threads, dist 2
[2023-10-18T09:28:37.515] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic 
[2023-10-18T09:28:37.515] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [1478837]: mask_cpu, 0x000000000000000000000000000000FE,0x0000000000000000000000000000FE00,0x00000000000000000000000000FE0000,0x000000000000000000000000FE000000,0x0000000000000000000000FE00000000,0x00000000000000000000FE0000000000,0x000000000000000000FE000000000000,0x0000000000000000FE00000000000000
[2023-10-18T09:31:22.317] [1478837.0] error: *** STEP 1478837.0 ON frontier00001 CANCELLED AT 2023-10-18T09:31:22 ***
[2023-10-18T09:33:37.000] [1478837.0] get_exit_code task 0 died by signal: 15
[2023-10-18T09:33:38.505] [1478837.0] done with job
[2023-10-18T09:49:22.206] error: forward_thread to frontier00294 (10.128.4.172:6818): Connection timed out
[2023-10-18T09:49:22.214] error: forward_thread to frontier00351 (10.128.5.144:6818): Connection timed out

Note that the two nodes with the forward_thread errors are down and not in this job; I believe those messages are unrelated.

[root@frontier00001 ~]# sinfo -n frontier00294,frontier00351
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*       up 7-00:00:00      1 drain$ frontier00294
batch*       up 7-00:00:00      1  down$ frontier00351

srun showed:
srun: error: frontier06442: task 43344: Exited with exit code 255
srun: Terminating StepId=1478837.0
slurmstepd: error: *** STEP 1478837.0 ON frontier00001 CANCELLED AT 2023-10-18T09:31:22 ***
srun: error: frontier08642: task 58415: Terminated
# ... lots of Terminated messages
srun: error: frontier06443: tasks 43352-43355,43357-43359: Terminated
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection timed out
srun: error: frontier06445: tasks 43368,43370-43372,43374-43375: Terminated
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection timed out
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection timed out
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection timed out

On all the nodes, the only processes remaining for the user are on the batch node:

[root@frontier00001 ~]# ps aux|grep srun
reubendb  40642  1.1  0.0 1638196 78196 ?       Sl   09:28   0:47 srun -u -N 8000 -n 64000 -c 7 --gpu-bind=closest --gpus-per-task=1 /lustre/orion/stf016/scratch/reubendb/workspaces/frontier/Scratch/cholla/Cosmo_8000_Nodes/1697555966.1081736/build_directory/./bin/cholla.FOM.frontier cosmo.txt
reubendb  40644  0.0  0.0 156612  4952 ?        S    09:28   0:00 srun -u -N 8000 -n 64000 -c 7 --gpu-bind=closest --gpus-per-task=1 /lustre/orion/stf016/scratch/reubendb/workspaces/frontier/Scratch/cholla/Cosmo_8000_Nodes/1697555966.1081736/build_directory/./bin/cholla.FOM.frontier cosmo.txt
root      42654  0.0  0.0  10268   884 pts/2    S+   10:37   0:00 grep --color=auto srun
[root@frontier00001 ~]# gstack 40642
Thread 5 (Thread 0x15554a7d4700 (LWP 40854) "srun"):
#0  0x0000155554cec70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000155554f6a1b5 in _check_io_timeout (_sls=0x6264c0) at step_launch.c:1858
#2  0x0000155554ce56ea in start_thread () from /lib64/libpthread.so.0
#3  0x000015555408da6f in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x15554bbe8700 (LWP 40647) "srun"):
#0  0x00001555540811c9 in poll () from /lib64/libc.so.6
#1  0x0000155554f93ee2 in poll (__timeout=-1, __nfds=<optimized out>, __fds=0x15553c38abb0) at /usr/include/bits/poll2.h:46
#2  _poll_internal (shutdown_time=<optimized out>, nfds=168, pfds=0x15553c38abb0) at eio.c:351
#3  eio_handle_mainloop (eio=<optimized out>) at eio.c:314
#4  0x0000155554f67e48 in _io_thr_internal (cio_arg=0x824c60) at step_io.c:814
#5  0x0000155554ce56ea in start_thread () from /lib64/libpthread.so.0
#6  0x000015555408da6f in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x15554bce9700 (LWP 40646) "srun"):
#0  0x00001555540811c9 in poll () from /lib64/libc.so.6
#1  0x0000155554f93ee2 in poll (__timeout=-1, __nfds=<optimized out>, __fds=0x155544000b70) at /usr/include/bits/poll2.h:46
#2  _poll_internal (shutdown_time=<optimized out>, nfds=169, pfds=0x155544000b70) at eio.c:351
#3  eio_handle_mainloop (eio=<optimized out>) at eio.c:314
#4  0x0000155554f68cf2 in _msg_thr_internal (arg=<optimized out>) at step_launch.c:1110
#5  0x0000155554ce56ea in start_thread () from /lib64/libpthread.so.0
#6  0x000015555408da6f in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x15554bfda700 (LWP 40645) "srun"):
#0  0x0000155553fc1ccc in sigtimedwait () from /lib64/libc.so.6
#1  0x0000155554cf13cc in sigwait () from /lib64/libpthread.so.0
#2  0x00000000004151ba in _srun_signal_mgr (job_ptr=0x492fb0) at srun_job.c:2349
#3  0x0000155554ce56ea in start_thread () from /lib64/libpthread.so.0
#4  0x000015555408da6f in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x155555529740 (LWP 40642) "srun"):
#0  0x0000155554cec70c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000155554f6c42c in slurm_step_launch_wait_finish (ctx=0x626480) at step_launch.c:691
#2  0x000000000040cfad in launch_g_step_wait (job=job@entry=0x492fb0, got_alloc=got_alloc@entry=false, opt_local=opt_local@entry=0x424fa0 <opt>) at launch.c:1488
#3  0x00000000004113cb in _launch_one_app (data=<optimized out>) at srun.c:305
#4  0x0000000000412005 in _launch_app (got_alloc=false, srun_job_list=0x0, job=0x492fb0) at srun.c:600
#5  srun (ac=<optimized out>, av=0x7fffffff3b78) at srun.c:226
#6  0x0000000000417410 in main (argc=<optimized out>, argv=<optimized out>) at srun.wrapper.c:17
[root@frontier00001 ~]# gstack 40644
#0  0x0000155554cf06ce in read () from /lib64/libpthread.so.0
#1  0x0000000000416e6c in read (__nbytes=1, __buf=0x7fffffff389f, __fd=<optimized out>) at /usr/include/bits/unistd.h:44
#2  _shepherd_spawn (got_alloc=false, srun_job_list=0x0, job=0x492fb0) at srun_job.c:2304
#3  create_srun_job (p_job=p_job@entry=0x425ca8 <job>, got_alloc=got_alloc@entry=0x7fffffff3943, slurm_started=slurm_started@entry=false, handle_signals=handle_signals@entry=true) at srun_job.c:1524
#4  0x0000000000411a8b in srun (ac=12, av=0x7fffffff3b78) at srun.c:195
#5  0x0000000000417410 in main (argc=<optimized out>, argv=<optimized out>) at srun.wrapper.c:17

What can we debug to see why this srun is hung? Which tasks failed to receive a MESSAGE_TASK_EXIT, and will it retry or time out eventually?
Comment 1 Matt Ezell 2023-10-18 10:41:56 MDT
I found the nodes that had the MESSAGE_TASK_EXIT send failures:

frontier06443: Oct 18 09:35:42 frontier06443 slurmstepd[83892]: [1478837.0] error: Failed to send MESSAGE_TASK_EXIT: Connection timed out
frontier06441: Oct 18 09:34:56 frontier06441 slurmstepd[49910]: [1478837.0] error: Failed to send MESSAGE_TASK_EXIT: Connection timed out
frontier06447: Oct 18 09:35:00 frontier06447 slurmstepd[96585]: [1478837.0] error: Failed to send MESSAGE_TASK_EXIT: Connection timed out
frontier06445: Oct 18 09:35:44 frontier06445 slurmstepd[19700]: [1478837.0] error: Failed to send MESSAGE_TASK_EXIT: Connection timed out

Note that these nodes had an interface bounce due to a switch issue (which is what caused MPI to abort in the first place). That is likely the cause of the connection timeouts, but connectivity was restored before SlurmdTimeout, so the nodes weren't marked down.

Is that ever retried in the slurmstepd? Is there anything in the srun that times out waiting for all the tasks to report that?
Comment 2 Matt Ezell 2023-10-18 12:07:39 MDT
_send_srun_resp_msg() will retry. For this 8000-node job, it will retry 12 times with sleeps of 0.1s, 0.2s, 0.4s, and then 0.8s repeating, for a total of ~8 seconds (might be off by one retry). The link was down for more than 8 seconds, so it seems likely that all of the retries were blackholed here.
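
For illustration, a small standalone sketch of that backoff schedule (not the Slurm source; the delays and retry count are taken from the description above):

#include <stdio.h>

/* Standalone sketch of the backoff described above (not the Slurm source):
 * sleep 0.1 s, then double the delay up to a cap of 0.8 s per retry. */
static double total_backoff(int retries)
{
    double delay = 0.1, total = 0.0;

    for (int i = 0; i < retries; i++) {
        total += delay;
        if (delay < 0.8)
            delay *= 2;    /* 0.1 -> 0.2 -> 0.4 -> 0.8, then hold at 0.8 */
    }
    return total;
}

int main(void)
{
    /* 12 retries: 0.1 + 0.2 + 0.4 + 9 * 0.8 = 7.9 s, i.e. roughly 8 s */
    printf("total backoff for 12 retries: ~%.1f s\n", total_backoff(12));
    return 0;
}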

_send_exit_msg() will log a failure of _send_srun_resp_msg() but still returns SLURM_SUCCESS. It hardly matters, since stepd_send_pending_exit_msgs() ignores the return code anyway (despite setting esent to true), so there is never a higher-level retry.

I *think* the behavior I'm expecting to see here is that slurmstepd hangs around until it can confirm it sent the exit message.
Comment 3 Matt Ezell 2023-10-18 14:36:47 MDT
Created attachment 32823 [details]
Proposed patch v1

Compile-tested only; I'll try this on our test cluster later tonight.
Comment 4 Matt Ezell 2023-10-18 22:27:47 MDT
Verified that the patch seems to work.

Control: started a 2-node job and downed an interface on node-2. Verified output stopped and MPI aborted. Verified node-2 logged 'Failed to send MESSAGE_TASK_EXIT'. Upped the port. Verified that the srun never returned.

Test-with-patch: started a 2-node job and downed an interface on node-2. Verified output stopped and MPI aborted. Verified node-2 logged 'Failed to send MESSAGE_TASK_EXIT'. Upped the port. Verified that srun returned, stdout showing:
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: No route to host
srun: error: borg014: tasks 10-15: Terminated
srun: error: borg014: tasks 8-9: Terminated
srun: Force Terminated StepId=46959.0
ezy@borg013:~/system_tests/programs/halo/bin>
Comment 5 Brian F Gilmer 2023-10-19 07:22:52 MDT
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: No route to host

This error message won't show up on an XC or EX system. There are no IP routers, so the error path goes through a MessageTimeout.

Also, you should add a test that kills srun, which would leave the slurmstepd orphaned. That error path should be handled with a timeout.
Comment 6 Matt Ezell 2023-10-19 07:56:50 MDT
(In reply to Brian F Gilmer from comment #5)
> slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: No route to host
> 
> This error message won't show up on an XC or EX system. There are no IP
> routers, so the error path goes through a MessageTimeout.

This test was run on an EX. I downed the port on the Slingshot switch (since just ifdown-ing the port on the node only stops Ethernet connectivity, not RDMA). I'm not sure of the exact mechanics here (did the kernel see the interface as down and return no route, or did the retry handler park the destination and return a NACK?).

> Also, you should add a test to kill srun, that would leave the slurmstepd
> orphaned. That error path should be handled with a timeout.

Ah, good point. My assertion earlier "I *think* the behavior I'm expecting to see here is that slurmstepd hangs around until it can confirm it sent the exit message" breaks down in this scenario.

I guess if you have two entities that need to talk, but either could go away unexpectedly, then both have to eventually time out trying to talk to each other.
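
As a rough sketch of that idea (an assumption-laden illustration, not Slurm's implementation): the srun side, which the backtrace above shows parked in pthread_cond_wait(), could bound its wait with pthread_cond_timedwait() so it eventually gives up if the remote side never reports in.

#include <pthread.h>
#include <time.h>

/* Sketch only: wait for task-exit notification with an absolute deadline
 * instead of blocking forever, so a vanished peer cannot hang the waiter.
 * Real code would also re-check a predicate in a loop. */
static int wait_for_task_exit(pthread_mutex_t *lock, pthread_cond_t *cond,
                              int timeout_sec)
{
    struct timespec deadline;
    int rc;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    pthread_mutex_lock(lock);
    rc = pthread_cond_timedwait(cond, lock, &deadline);
    pthread_mutex_unlock(lock);

    return rc;    /* ETIMEDOUT once the deadline has passed */
}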
Comment 7 Tim McMullan 2023-10-19 10:36:32 MDT
Hey Matt!

I tend to agree with Brian here that this probably needs to time out eventually, but I also think we don't need to be as aggressive about dropping this communication as we are today.  I don't think your patch would cause significant problems in most cases, but for jobs without a time limit, waiting forever could be an issue.

I'll take a look and see what we can do here to make things better.  For my reference, about how long was the link down?  And do you know what caused the link to go down and come back up?

Thanks!
--Tim
Comment 8 Matt Ezell 2023-10-19 11:47:55 MDT
(In reply to Tim McMullan from comment #7)
> I'll take a look and see what we can do here to make things better.  For my
> reference, about how long was the link down for?  And do you know what
> is/was the cause if the link going down and back up?

We have 4 links per node. All incoming Slurm traffic uses hsn0, since that interface has the IP associated with the hostname. Outgoing traffic may source from any of the links, based on how the kernel chooses to hash routes. hsn0 did not go down, so slurmd was still able to respond to pings. The connection from this node back to the srun likely sourced from a link that did go down. I'm not sure why the kernel didn't update its routes to use a different interface; I'll have to look into that...

In this specific case, a switch rebooted, so 2 links (attached to the same switch) were down for about 3 minutes 30 seconds:

Oct 18 09:31:03 frontier06441 kernel: cxi_core 0000:d5:00.0: cxi2[hsn2] CXI_EVENT_LINK_DOWN
Oct 18 09:31:04 frontier06441 kernel: cxi_core 0000:dd:00.0: cxi3[hsn3] CXI_EVENT_LINK_DOWN
Oct 18 09:34:33 frontier06441 kernel: cxi_core 0000:d5:00.0: cxi2[hsn2] CXI_EVENT_LINK_UP
Oct 18 09:34:33 frontier06441 kernel: cxi_core 0000:dd:00.0: cxi3[hsn3] CXI_EVENT_LINK_UP

This is not a typical scenario (that it's down that long) but it does happen occasionally.

In most other cases we've observed, a link will flap due to high bit-error-rate or loss of signal alignment. Generally the link can re-train its signal parameters and become operational again in 5-15 seconds.

For jobs with fewer than 1024 nodes, it currently retries only 5 times over about 2.3 seconds, which won't survive a link flap.
Comment 20 Tim McMullan 2024-02-13 14:32:13 MST
Hey Matt,

I'm sorry this took so long, but we finally came to something that made sense to us and pushed it ahead of 23.11.4: https://github.com/SchedMD/slurm/commit/0cbf687c5f

What we've done here is change the logic so there is a constant-time wait, regardless of how the message failure happened; it will always wait for msg_timeout.  There were a bunch of ways this was dying without really waiting very long, and I think this will help that situation.
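
A minimal sketch of that constant-time behavior (an illustration only, not the actual commit; send_once() here is a hypothetical stand-in for the real send path):

#include <stdbool.h>
#include <time.h>
#include <unistd.h>

/* Sketch (not the actual commit): keep retrying for a fixed msg_timeout
 * window, regardless of which error each individual send attempt hits.
 * send_once() is a hypothetical placeholder for the real send path. */
extern bool send_once(void *msg);

static bool send_with_msg_timeout(void *msg, int msg_timeout)
{
    time_t start = time(NULL);

    while ((time(NULL) - start) < msg_timeout) {
        if (send_once(msg))
            return true;    /* delivered */
        sleep(1);           /* fixed pause before the next attempt */
    }
    return false;           /* waited a full msg_timeout; give up */
}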

Let me know if you have any thoughts on this!

Thanks,
--Tim
Comment 24 Tim McMullan 2024-02-20 10:37:47 MST
Hi Matt,

Unfortunately, we have had to push this change off.  Additional testing revealed some other problems that are more complicated to fix.  We are going to work on fixing the other underlying problem first, and then implement a very similar approach to fix the problem originally described by this ticket.

Sorry for the false start on this one!
--Tim
Comment 28 Marshall Garey 2024-03-25 14:55:53 MDT
*** Ticket 19422 has been marked as a duplicate of this ticket. ***
Comment 34 Marshall Garey 2024-04-23 08:13:16 MDT
*** Ticket 16013 has been marked as a duplicate of this ticket. ***
Comment 36 Tim McMullan 2024-06-26 06:52:24 MDT
Hi Matt,

I'm sorry for the continued delay here.  We've made some substantial progress on the underlying issues that need solving in 24.05.

3efd8109f0 through 0d20e2cd18 were all pushed ahead of 24.05.0, along with other changes that should help reduce the likelihood of this happening naturally.  That said, there is still work being done in this area to fully fix the issue.

I'll keep you updated as we make more progress.

Thanks!
--Tim