We recently observed an alarming and unexpected number of nodes draining due to prolog timeouts. Based on poor metrics we observed on the network that slurmctld uses to communicate with the slurmds, in the same region of the network where the prolog timeouts were occurring, we suspected the timeouts were network related. Since our slurmctld can communicate with slurmds across different networks, and since our compute nodes and scheduler node are multihomed on the same two networks (regular Ethernet and InfiniBand), we reconfigured a subset of our slurmds to use the second network to communicate with slurmctld. This change appears to have had the desired effect: we have not seen any new prolog timeouts since making it.
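For context, the per-node address that slurmd/slurmctld traffic uses is normally steered in slurm.conf via NodeAddr (and the controller's address via SlurmctldHost). A minimal sketch of the kind of change described above, using hypothetical hostnames where a "-ib" suffix resolves to a host's IPoIB interface (none of these names or node counts come from this ticket):

```
# slurm.conf (fragment) -- hostnames and hardware values are illustrative only
# Controller reachable at its IPoIB address; note this setting is
# cluster-wide, so mixed-network setups rely on routing/DNS per node.
SlurmctldHost=sched01(sched01-ib)

# Nodes moved to the IB network: NodeName stays the canonical name,
# NodeAddr points at the IPoIB interface used for slurmctld<->slurmd traffic.
NodeName=node[001-064] NodeAddr=node[001-064]-ib CPUs=32 RealMemory=128000 State=UNKNOWN

# Remaining nodes keep their default (Ethernet) addresses.
NodeName=node[065-128] CPUs=32 RealMemory=128000 State=UNKNOWN
```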
*** Ticket 8936 has been marked as a duplicate of this ticket. ***
Hi Jay, Thanks for the report. It's not exactly clear if you still need assistance from SchedMD, or if you are just reporting this for public awareness. Do you need help, or can we mark this as resolved? Thanks, Michael
Hi Michael,

We were interested in your take on our workaround, which was to run Slurm traffic on our InfiniBand network rather than Ethernet for a subset of our cluster. Is this workaround recommended? Are there other steps we can take to mitigate network issues in our Slurm environment by changing any Slurm configs? I realize you'd need more information to answer these questions; please let me know what I can provide.

-Jay
Well, as you said, we will probably need more details before we can say anything about the workaround. So feel free to attach some slurmctld and slurmd logs from before the workaround was implemented so we can see what's going on.
Created attachment 14025 [details] slurmdbd log
Created attachment 14027 [details] slurmctld log
Logs have been uploaded. 32 nodes were moved to our IB network on 4/16, and another 32 were moved on 4/17. Hopefully these logs can help us answer the following:

- Do you think it was indeed the bottleneck in the Ethernet (here's the timing data we collected)?
- Do you think we are actually out of the woods, or is there anything in the new timing data (here you go) that indicates we are on the edge? How can we tell?
- Is having this traffic on IB good practice? Is having some on IB and some on Ethernet good practice?
- Does our proposed longer-term fix make sense?

Maybe they could also help us determine whether the three nodes with recent timeouts are recurrences of the same issue or something else.
Could you attach a recent slurm.conf?

(In reply to jay.kubeck from comment #7)
> Maybe they could also help us determine whether those three nodes with
> recent timeouts are recurrences of the same issue or something else.

Could you attach the logs for the three nodes you are referring to? What are the recent timeouts you are referring to? What makes you certain that the prolog failures were due to prolog timeouts?

Have you looked into changing PrologEpilogTimeout, MessageTimeout, and BatchStartTimeout in slurm.conf?

Thanks,
-Michael
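For reference, the three parameters mentioned above all live in slurm.conf. A sketch with illustrative values only (the ticket does not state what this site should use; defaults noted in comments are from the slurm.conf man page):

```
# slurm.conf (fragment) -- example values, not recommendations from this ticket
# Maximum time (seconds) Slurm allows Prolog and Epilog scripts to run
# before terminating them; if unset, Slurm waits indefinitely.
PrologEpilogTimeout=120

# Round-trip timeout (seconds) for Slurm RPCs; default is 10. Raising it
# can mask a slow network but also delays detection of real failures.
MessageTimeout=20

# Extra time (seconds) allowed for a batch job launch to begin after the
# allocation is made; default is 10. Useful when prologs legitimately run long.
BatchStartTimeout=30
```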
Sorry for the delay. Here are some answers to your questions after some internal discussion:

(In reply to jay.kubeck from comment #7)
> Do you think it was indeed the bottleneck in the Ethernet (here's the timing
> data we collected)?

It's hard to say. You would need to profile your network during the times when you see performance issues; the Slurm logs you attached don't include that kind of timing data.

> Do you think we are actually out of the woods, or anything in new timing
> data (here you go) that indicates we are on the edge? How can we tell?

Again, hard to say. Maybe you forgot to attach the network timing data you are referring to?

> Is having this traffic on IB good practice?

Many sites do this, but it all depends on your site's requirements. We do not see any issues with routing Slurm over Ethernet or IB.

> Is having some on IB and some on Ethernet good practice?

This just adds more complexity to your setup and an additional network to troubleshoot, so we would not recommend mixing networks.

> Does our proposed longer-term fix make sense?

Yes.

> Maybe they could also help us determine whether those three nodes with
> recent timeouts are recurrences of the same issue or something else.

Prolog timeouts can be caused by a few different things, such as slow disks, networking issues, or a taxed CPU, to name a few. In this case, it does seem like a bottleneck in your network was the cause. We recommend that our customers have a dedicated network for Slurm to work off of, to avoid network saturation and to help isolate traffic. Additionally, bug 8984 has a patch that may help with some of these prolog timeout errors, so let's continue any further discussion on that topic there.

Thanks,
Michael
I'm going to go ahead and close this out as information given. Feel free to reopen if you have further questions.

Thanks!
-Michael