| Summary: | Job exit with NODE_FAIL and resumed on another node with same ID | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | ARC Admins <arc-slurm-admins> |
| Component: | Accounting | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 18.08.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | University of Michigan | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
ARC Admins
2019-10-08 07:17:43 MDT
It appears we DO have JobRequeue set to 1 according to scontrol show config. So, I think that accounts for it. But I'll leave this open to be sure. David After reading a bit, I see how a user can request this option on their own as well. So, I'm going to close this. Thanks for listening, though :) David |