We recently observed an alarming and unexpected number of nodes draining due to prolog timeouts. Based on poor metrics we observed on the network that slurmctld uses to communicate with the slurmds, in the same region of the network where the prolog timeouts were occurring, we suspected the timeouts were network related. Since our slurmctld can communicate with slurmds across different networks, and since our compute nodes and scheduler node are multihomed on the same two networks (regular Ethernet and InfiniBand), we reconfigured a subset of our slurmds to use the second network to communicate with slurmctld. This change appears to have had the desired effect: we have not seen any new prolog timeouts since making it.
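For context, the per-node address that slurmd/slurmctld traffic uses is normally steered in slurm.conf via NodeAddr (and the controller's address via SlurmctldHost). A minimal sketch of the kind of change described above, using hypothetical hostnames where a "-ib" suffix resolves to a host's IPoIB interface (none of these names or node counts come from this ticket):

```
# slurm.conf (fragment) -- hostnames and hardware values are illustrative only
# Controller reachable at its IPoIB address; note this setting is
# cluster-wide, so mixed-network setups rely on routing/DNS per node.
SlurmctldHost=sched01(sched01-ib)

# Nodes moved to the IB network: NodeName stays the canonical name,
# NodeAddr points at the IPoIB interface used for slurmctld<->slurmd traffic.
NodeName=node[001-064] NodeAddr=node[001-064]-ib CPUs=32 RealMemory=128000 State=UNKNOWN

# Remaining nodes keep their default (Ethernet) addresses.
NodeName=node[065-128] CPUs=32 RealMemory=128000 State=UNKNOWN
```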
*** Ticket 8936 has been marked as a duplicate of this ticket. ***
Hi Jay, Thanks for the report. It's not exactly clear if you still need assistance from SchedMD, or if you are just reporting this for public awareness. Do you need help, or can we mark this as resolved? Thanks, Michael
Hi Michael,

We were interested in your take on our workaround, which was to run Slurm traffic on our InfiniBand network rather than Ethernet for a subset of our cluster. Is this workaround recommended? Are there other steps we can take to mitigate network issues in our Slurm environment by changing any Slurm configs? I realize you'd need more information to answer these questions; please let me know what I can provide.

-Jay
Well, as you said, we will probably need more details before we can say anything about the workaround. So feel free to attach some slurmctld and slurmd logs from before the workaround was implemented so we can see what's going on.
Created attachment 14025 [details] slurmdbd log
Created attachment 14027 [details] slurmctld log
Logs have been uploaded. 32 nodes were moved to our IB network on 4/16, and another 32 were moved on 4/17. Hopefully these logs can help us answer the following:

- Do you think it was indeed the bottleneck in the Ethernet (here's the timing data we collected)?
- Do you think we are actually out of the woods, or is there anything in the new timing data (here you go) that indicates we are on the edge? How can we tell?
- Is having this traffic on IB good practice? Is having some on IB and some on Ethernet good practice?
- Does our proposed longer-term fix make sense?

Maybe they could also help us determine whether the three nodes with recent timeouts are recurrences of the same issue or something else.
Could you attach a recent slurm.conf?

(In reply to jay.kubeck from comment #7)
> Maybe they could also help us determine whether those three nodes with
> recent timeouts are recurrences of the same issue or something else.

Could you attach the logs for the three nodes you are referring to? What are the recent timeouts you are referring to? What makes you certain that the prolog failures were due to prolog timeouts?

Have you looked into changing PrologEpilogTimeout, MessageTimeout, and BatchStartTimeout in slurm.conf?

Thanks,
-Michael
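For reference, the three parameters mentioned above all live in slurm.conf. A sketch with illustrative values only (the ticket does not state what this site should use; defaults noted in comments are from the slurm.conf man page):

```
# slurm.conf (fragment) -- example values, not recommendations from this ticket
# Maximum time (seconds) Slurm allows Prolog and Epilog scripts to run
# before terminating them; if unset, Slurm waits indefinitely.
PrologEpilogTimeout=120

# Round-trip timeout (seconds) for Slurm RPCs; default is 10. Raising it
# can mask a slow network but also delays detection of real failures.
MessageTimeout=20

# Extra time (seconds) allowed for a batch job launch to begin after the
# allocation is made; default is 10. Useful when prologs legitimately run long.
BatchStartTimeout=30
```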
Sorry for the delay. Here are some answers to your questions after some internal discussion:

(In reply to jay.kubeck from comment #7)
> Do you think it was indeed the bottleneck in the Ethernet (here's the timing
> data we collected)?

It's hard to say. You would need to profile your network during the times when you see performance issues; the Slurm logs you attached don't include that kind of timing data.

> Do you think we are actually out of the woods, or anything in new timing
> data (here you go) that indicates we are on the edge? How can we tell?

Again, hard to say. Maybe you forgot to attach the network timing data you are referring to?

> Is having this traffic on IB good practice?

Many sites do this, but it all depends on your site's requirements. We do not see any issues with routing Slurm over Ethernet or IB.

> Is having some on IB and some on Ethernet good practice?

This just adds more complexity to your setup and an additional network to troubleshoot, so we would not recommend mixing networks.

> Does our proposed longer-term fix make sense?

Yes.

> Maybe they could also help us determine whether those three nodes with
> recent timeouts are recurrences of the same issue or something else.

Prolog timeouts can be caused by a few different things, such as slow disks, networking issues, or a taxed CPU, to name a few. In this case, it does seem like a bottleneck in your network was the cause. We recommend that our customers have a dedicated network for Slurm to work off of, to avoid network saturation and to help isolate traffic. Additionally, bug 8984 has a patch that may help with some of these prolog timeout errors, so let's continue any further discussion on that topic there.

Thanks,
Michael
I'm going to go ahead and close this out as information given. Feel free to reopen if you have further questions.

Thanks!
-Michael