Ticket 8936

Summary:	prolog timeouts
Product:	Slurm	Reporter:	jay.kubeck
Component:	slurmctld	Assignee:	Director of Support <support>
Status:	RESOLVED DUPLICATE	QA Contact:
Severity:	4 - Minor Issue
Priority:	---
Version:	20.02.1
Hardware:	Linux
OS:	Linux
Site:	Yale	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA Site:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---

Description jay.kubeck 2020-04-24 07:09:17 MDT

We recently observed an alarming and unexpected number of nodes draining due to prolog timeouts, following significant changes to the network that slurmctld and the slurmds use to communicate. The timeouts appear to occur when large numbers of short-running jobs are running. That, combined with the recent network changes, prompted us to examine our network, but so far we have not found anything that would explain the timeouts.

Comment 1 Jason Booth 2020-04-24 09:40:22 MDT

Resolving this second ticket as a duplicate.

*** This ticket has been marked as a duplicate of ticket 8935 ***