| Summary: | No indication cloud node is having problems | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Will Shanks <shanks> |
| Component: | Cloud | Assignee: | Broderick Gardner <broderick> |
| Status: | RESOLVED WONTFIX | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | nate |
| Version: | 19.05.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | UCAR | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Will Shanks
2020-03-06 16:32:04 MST
We got into this state because our ResumeProgram took too long to start up the cloud node. We did not notice immediately, because the node did not appear in sinfo at all and scontrol reported it as "not found". Additionally, what would be considered best practice for monitoring to detect this situation in the future?