Ticket 8639

Summary: No indication cloud node is having problems
Product: Slurm
Reporter: Will Shanks <shanks>
Component: Cloud
Assignee: Broderick Gardner <broderick>
Status: RESOLVED WONTFIX
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: nate
Version: 19.05.2
Hardware: Linux
OS: Linux
Site: UCAR
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---

Description Will Shanks 2020-03-06 16:32:04 MST
The only indication that a cloud node is having problems is a single nondescript log message, "debug3: problems with <NodeName>". Apart from finding this log message, the only way to discover the problem is to notice that jobs are not being scheduled on the affected node.
Comment 1 Will Shanks 2020-03-06 16:45:45 MST
We got into this state because our ResumeProgram took too long to start up the cloud node. We did not notice immediately because the node did not appear in sinfo at all, and scontrol reported it as "not found".
Comment 2 Will Shanks 2020-03-06 16:52:18 MST
Additionally, what would be considered best practices for monitoring to detect this situation in the future?
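As a possible stopgap until there is a better answer, the log message quoted in the description could be watched for directly. The sketch below is only an illustration, not a recommended practice: it assumes the slurmctld log contains the text "debug3: problems with <NodeName>" exactly as quoted above (which in turn assumes the controller is logging at debug3 verbosity), and the function name `count_problem_nodes` is made up for this example. Nothing else about the log line format is assumed.

```python
import re
from collections import Counter

# Assumption: each relevant line contains the text quoted in this ticket,
# "debug3: problems with <NodeName>". Timestamps or other prefixes are ignored.
PROBLEM_RE = re.compile(r"debug3: problems with (\S+)")


def count_problem_nodes(lines):
    """Return a Counter mapping node name -> number of matching log lines."""
    counts = Counter()
    for line in lines:
        match = PROBLEM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A cron job or monitoring check could open the slurmctld log, pass the file object to `count_problem_nodes`, and alert when any node name comes back with a non-zero count.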