Ticket 8936

Summary: prolog timeouts
Product: Slurm
Reporter: jay.kubeck
Component: slurmctld
Assignee: Director of Support <support>
Status: RESOLVED DUPLICATE
Severity: 4 - Minor Issue    
Priority: ---    
Version: 20.02.1   
Hardware: Linux   
OS: Linux   
Site: Yale

Description jay.kubeck 2020-04-24 07:09:17 MDT
We recently observed an alarming and unexpected number of nodes draining due to prolog timeouts, following some significant changes to the network that slurmctld and the slurmds use to communicate. The prolog timeouts appear to occur when large numbers of short-running jobs are running. That, combined with the recent network changes, has prompted us to examine our network, but so far we have not found anything that would explain the timeouts.
Comment 1 Jason Booth 2020-04-24 09:40:22 MDT
Resolving this second ticket as a duplicate.

*** This ticket has been marked as a duplicate of ticket 8935 ***