8639 – No indication cloud node is having problems

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Cloud
Version:	19.05.2
Hardware:	Linux Linux

Severity:	4 - Minor Issue
Assignee:	Broderick Gardner
Reported:	2020-03-06 16:32 MST by Will Shanks
Modified:	2020-05-07 14:15 MDT

Site:	UCAR

Description Will Shanks 2020-03-06 16:32:04 MST

The only indication that a cloud node is having problems is a single nondescript log message: "debug3: problems with <NodeName>". Apart from finding this log message, the only way to discover the problem is to notice that jobs are not being scheduled on the affected node.
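
As a stopgap, the log message quoted above can be picked out of the controller log with a short script. The following is a minimal sketch, not a Slurm-provided tool. It assumes the message text matches the quote in this ticket, that the controller is logging at debug3 or higher so the message is actually written, and that log lines are supplied as an iterable of strings. The function names and the threshold parameter are hypothetical.

```python
import re
from collections import Counter

# Pattern assumed from the message quoted in this ticket:
#   "debug3: problems with <NodeName>"
PROBLEM_RE = re.compile(r"debug3: problems with (\S+)")


def count_problem_nodes(lines):
    """Return a Counter mapping node name to the number of matching messages."""
    counts = Counter()
    for line in lines:
        match = PROBLEM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def nodes_over_threshold(lines, threshold=1):
    """Return a sorted list of nodes reported at least `threshold` times."""
    counts = count_problem_nodes(lines)
    return sorted(node for node, seen in counts.items() if seen >= threshold)
```

A file object can be passed directly as `lines`, for example `nodes_over_threshold(open(log_path))`, where `log_path` is the site's own controller log location. Raising `threshold` may help ignore a node that is reported only once while it is still starting up.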

Comment 1 Will Shanks 2020-03-06 16:45:45 MST

We got into this state because our ResumeProgram took too long to start the cloud node. We did not notice immediately because the node did not appear in sinfo at all, and scontrol reported it as "not found".

Comment 2 Will Shanks 2020-03-06 16:52:18 MST

Additionally, what would be considered best practices for monitoring to detect this situation in the future?