Ticket 8639 - No indication cloud node is having problems
Summary: No indication cloud node is having problems
Status: RESOLVED WONTFIX
Alias: None
Product: Slurm
Classification: Unclassified
Component: Cloud
Version: 19.05.2
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Broderick Gardner
Reported: 2020-03-06 16:32 MST by Will Shanks
Modified: 2020-05-07 14:15 MDT

See Also:
Site: UCAR


Description Will Shanks 2020-03-06 16:32:04 MST
The only indication that a cloud node is having problems is a single nondescript log message, "debug3: problems with <NodeName>". Apart from finding this log message, the only way to discover the problem is to notice that jobs are not being scheduled on the affected node.
Comment 1 Will Shanks 2020-03-06 16:45:45 MST
We got into this state because our ResumeProgram took too long to start the cloud node. We did not notice immediately because the node did not appear in sinfo at all, and scontrol reported it as "not found".
Comment 2 Will Shanks 2020-03-06 16:52:18 MST
Additionally, what would be considered best practices for monitoring to detect this situation in the future?
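As a stopgap, one option might be to scan the slurmctld log for the message quoted in the description. The following is only a minimal sketch, not an established best practice. It assumes the message appears in the log exactly as "debug3: problems with <NodeName>" and that slurmctld is logging at debug3 or higher. The function name count_problem_nodes is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: the log line contains the text quoted in the description,
# "debug3: problems with <NodeName>", with the node name as the next token.
PROBLEM_RE = re.compile(r"debug3: problems with (\S+)")


def count_problem_nodes(lines: Iterable[str]) -> Counter:
    """Count how often each node is named in a 'problems with' message."""
    counts: Counter = Counter()
    for line in lines:
        match = PROBLEM_RE.search(line)
        if match:
            # group(1) is the token following "problems with"
            counts[match.group(1)] += 1
    return counts
```

The function accepts any iterable of lines, so an open log file can be passed to it directly, and the resulting counts could feed whatever alerting is already in place.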