Summary: | srun intermittent hangs on abort
---|---
Product: | Slurm
Component: | User Commands
Reporter: | Matt Ezell <ezellma>
Assignee: | Tim McMullan <mcmullan>
Status: | OPEN
Severity: | 3 - Medium Impact
CC: | brian.gilmer, esteva.m, marshall, tyler, vergaravg
Version: | 23.02.5
Hardware: | Linux
OS: | Linux
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=16013, https://bugs.schedmd.com/show_bug.cgi?id=19118
Site: | ORNL-OLCF
Ticket Blocks: | 16013
Attachments: | Proposed patch v1
Description
Matt Ezell
2023-10-18 09:19:12 MDT
I found the nodes that had the MESSAGE_TASK_EXIT send failures:

frontier06443: Oct 18 09:35:42 frontier06443 slurmstepd[83892]: [1478837.0] error: Failed to send MESSAGE_TASK_EXIT: Connection timed out
frontier06441: Oct 18 09:34:56 frontier06441 slurmstepd[49910]: [1478837.0] error: Failed to send MESSAGE_TASK_EXIT: Connection timed out
frontier06447: Oct 18 09:35:00 frontier06447 slurmstepd[96585]: [1478837.0] error: Failed to send MESSAGE_TASK_EXIT: Connection timed out
frontier06445: Oct 18 09:35:44 frontier06445 slurmstepd[19700]: [1478837.0] error: Failed to send MESSAGE_TASK_EXIT: Connection timed out

Note that these nodes had an interface bounce due to a switch issue (which is what caused MPI to abort in the first place). That is likely the cause of the "Connection timed out" errors, but connectivity was restored before SlurmdTimeout expired, so the nodes weren't marked down.

Is that send ever retried in the slurmstepd? Is there anything in srun that times out waiting for all the tasks to report?

_send_srun_resp_msg() will retry. For this 8000-node job, it will retry 12 times with sleeps of 0.1s, 0.2s, 0.4s, and then 0.8s repeating, for a total of ~8 seconds (might be off by one retry). The link was down for more than 8 seconds, so it seems likely that we blackholed all the retries here.

_send_exit_msg() will log a failure of _send_srun_resp_msg() but still returns SLURM_SUCCESS. Not that it matters: stepd_send_pending_exit_msgs() ignores the return code anyway, despite setting esent to true, so there is never a higher-level retry. (A rough sketch of this path appears after the attachment note below.)

I *think* the behavior I'm expecting to see here is that slurmstepd hangs around until it can confirm it sent the exit message.

Created attachment 32823 [details]
Proposed patch v1
Compile tested only - I'll try this on our test cluster later tonight.
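For reference, here is a minimal sketch of the retry-and-drop pattern described above. It is an illustration only, not the actual Slurm source: the function names, the 1024-node threshold used to pick the retry count, and the stubbed rpc_send() are assumptions built around the numbers quoted in this ticket (12 retries / ~8 s at 8000 nodes, 5 retries / ~2.3 s below 1024 nodes).

```c
/* Hypothetical sketch, NOT the actual Slurm code: geometric backoff capped
 * at 0.8s, a retry count scaled by job size, and a caller that swallows
 * the failure so nothing above it ever retries. */
#include <stdio.h>
#include <unistd.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  (-1)

/* Stand-in for the real RPC send; pretend the link is down the whole time. */
static int rpc_send(const void *msg) { (void)msg; return SLURM_ERROR; }

/* Roughly mirrors the timing described for _send_srun_resp_msg(); the exact
 * scaling of the retry count with node count is an assumption. */
static int send_resp_msg_sketch(const void *msg, int nnodes)
{
	int retries = (nnodes >= 1024) ? 12 : 5;
	useconds_t delay = 100000;		/* 0.1s */

	for (int i = 0; i < retries; i++) {
		if (rpc_send(msg) == SLURM_SUCCESS)
			return SLURM_SUCCESS;
		usleep(delay);
		if (delay < 800000)
			delay *= 2;		/* 0.1 -> 0.2 -> 0.4 -> 0.8s cap */
	}
	return SLURM_ERROR;			/* all retries blackholed */
}

/* Mirrors the problem called out above: the failure is logged, but success
 * is reported anyway, so there is no higher-level retry and srun keeps
 * waiting for an exit message that will never be resent. */
static int send_exit_msg_sketch(const void *msg, int nnodes)
{
	if (send_resp_msg_sketch(msg, nnodes) != SLURM_SUCCESS)
		fprintf(stderr, "error: Failed to send MESSAGE_TASK_EXIT\n");
	return SLURM_SUCCESS;			/* error swallowed here */
}

int main(void)
{
	/* Demo: ~8 seconds of retries at 8000 nodes, then the error is dropped. */
	return send_exit_msg_sketch("task exit", 8000);
}
```

With an outage longer than the roughly 8-second retry window, every attempt is lost, and because the return code is swallowed the exit notification is simply dropped, which matches the intermittent srun hangs in the ticket summary.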
Verified that the patch seems to work.

Control: started a 2-node job and downed an interface on node-2. Verified output stopped and MPI aborted. Verified node-2 logged 'Failed to send MESSAGE_TASK_EXIT'. Upped the port. Verified that the srun never returned.

Test with patch: started a 2-node job and downed an interface on node-2. Verified output stopped and MPI aborted. Verified node-2 logged 'Failed to send MESSAGE_TASK_EXIT'. Upped the port. Verified that srun returned, with stdout showing:

slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: No route to host
srun: error: borg014: tasks 10-15: Terminated
srun: error: borg014: tasks 8-9: Terminated
srun: Force Terminated StepId=46959.0
ezy@borg013:~/system_tests/programs/halo/bin> slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: No route to host

This error message won't show up on an XC or EX system. There are no IP routers, so the error path goes through a MessageTimeout.

Also, you should add a test that kills srun, which would leave the slurmstepd orphaned. That error path should be handled with a timeout.

(In reply to Brian F Gilmer from comment #5)
> slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: No route to host
>
> This error message won't show up on an XC or EX system. There are no IP
> routers, so the error path goes through a MessageTimeout.

This test was run on an EX. I downed the port on the Slingshot switch (just ifdown-ing the port on the node only stops Ethernet connectivity, not RDMA). I'm not sure of the exact mechanics here: did the kernel see the interface as down and return "no route", or did the retry handler park the destination and return a NACK?

> Also, you should add a test that kills srun, which would leave the slurmstepd
> orphaned. That error path should be handled with a timeout.

Ah, good point. My earlier assertion that "slurmstepd hangs around until it can confirm it sent the exit message" breaks down in this scenario. If you have two entities that need to talk, but either could go away unexpectedly, then both have to eventually time out trying to talk to each other.

Hey Matt!

I tend to agree with Brian here that this probably needs to time out eventually, but I also think we don't need to be as aggressive about dropping this communication as we are today. I don't think your patch would cause significant problems in most cases, but for jobs without a time limit, waiting forever could be an issue. I'll take a look and see what we can do here to make things better.

For my reference, about how long was the link down, and do you know what caused it to go down and come back up?

Thanks!
--Tim

(In reply to Tim McMullan from comment #7)
> For my reference, about how long was the link down, and do you know what
> caused it to go down and come back up?

We have 4 links per node. All incoming Slurm traffic uses hsn0, since that has the IP associated with the hostname. Outgoing traffic may source from any of the links, based on how the kernel chooses to hash routes. hsn0 did not go down, so slurmd was still able to respond to pings, but the connection from this node back to the srun likely sourced from a link that did go down. I'm not sure why the kernel didn't update its routes to use a different interface; I'll have to look into that...

In this specific case, a switch rebooted.
So 2 links (attached to the same switch) were down for about 3 minutes 30 seconds:

Oct 18 09:31:03 frontier06441 kernel: cxi_core 0000:d5:00.0: cxi2[hsn2] CXI_EVENT_LINK_DOWN
Oct 18 09:31:04 frontier06441 kernel: cxi_core 0000:dd:00.0: cxi3[hsn3] CXI_EVENT_LINK_DOWN
Oct 18 09:34:33 frontier06441 kernel: cxi_core 0000:d5:00.0: cxi2[hsn2] CXI_EVENT_LINK_UP
Oct 18 09:34:33 frontier06441 kernel: cxi_core 0000:dd:00.0: cxi3[hsn3] CXI_EVENT_LINK_UP

This is not a typical scenario (links are rarely down that long), but it does happen occasionally. In most other cases we've observed, a link will flap due to a high bit-error rate or loss of signal alignment; generally the link can re-train its signal parameters and become operational again in 5-15 seconds. For jobs of fewer than 1024 nodes, the current code only retries 5 times over about 2.3 seconds, which won't survive even a link flap.

Hey Matt,

I'm sorry this took so long, but we finally came to something that made sense to us and pushed it ahead of 23.11.4:

https://github.com/SchedMD/slurm/commit/0cbf687c5f

What we've done here is change the logic so that there is a constant-time wait, regardless of how the message failure happened, and it will always be for msg_timeout (see the sketch at the end of this ticket). There were a bunch of ways this was dying without really waiting very long, and I think this will help that situation.

Let me know if you have any thoughts on this!

Thanks,
--Tim

Hi Matt,

Unfortunately we have had to push this change off. Additional testing revealed some other problems that are more complicated to fix. We are going to fix the other underlying problem first and then implement a very similar approach to fix the problem originally described by this ticket. Sorry for the false start on this one!

--Tim

*** Ticket 19422 has been marked as a duplicate of this ticket. ***

*** Ticket 16013 has been marked as a duplicate of this ticket. ***

Hi Matt,

I'm sorry for the continued delay here. We've made some substantial progress on the underlying issues that need solving in 24.05: commits 3efd8109f0 through 0d20e2cd18 were all pushed ahead of 24.05.0, along with other changes that should help reduce the likelihood of this happening naturally. That said, there is still work being done in this area to fully fix the issue. I'll keep you updated as we make more progress.

Thanks!
--Tim
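For illustration only, the constant-time wait described in the 23.11.4 comment above might look roughly like the sketch below: keep retrying until a deadline derived from the configured message timeout expires, instead of giving up after a fixed, size-dependent number of attempts. This is not the contents of commit 0cbf687c5f; the helper names, the stubbed rpc_send(), and the backoff values are assumptions.

```c
/* Hypothetical sketch of a wait bounded by the message timeout.
 * Illustration only; not the code from commit 0cbf687c5f. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  (-1)

/* Stand-in for the real RPC send; pretend the link stays down. */
static int rpc_send(const void *msg) { (void)msg; return SLURM_ERROR; }

/* Stand-in for the configured message timeout (slurm.conf MessageTimeout),
 * in seconds. */
static int get_msg_timeout(void) { return 10; }

static int send_exit_msg_with_deadline(const void *msg)
{
	time_t deadline = time(NULL) + get_msg_timeout();
	useconds_t delay = 100000;		/* start at 0.1s */

	/* Retry until the send succeeds or the timeout has elapsed,
	 * independent of job size or how the send happens to fail. */
	while (time(NULL) < deadline) {
		if (rpc_send(msg) == SLURM_SUCCESS)
			return SLURM_SUCCESS;
		usleep(delay);
		if (delay < 800000)
			delay *= 2;		/* back off up to 0.8s */
	}
	fprintf(stderr, "error: giving up on MESSAGE_TASK_EXIT after msg_timeout\n");
	return SLURM_ERROR;			/* bounded: never hangs forever */
}

int main(void)
{
	/* Demo: with the link "down", this returns nonzero after ~10 seconds. */
	return send_exit_msg_with_deadline("task exit") == SLURM_SUCCESS ? 0 : 1;
}
```

A deadline-based loop like this addresses both sides of the earlier discussion: the step daemon waits long enough to ride out a short link flap, but an orphaned slurmstepd still gives up after a bounded time rather than waiting forever.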