Summary: | sacct display which nodes of a job allocation failed | ||
---|---|---|---|
Product: | Slurm | Reporter: | David Bigagli <david> |
Component: | Accounting | Assignee: | Unassigned Developer <dev-unassigned> |
Status: | OPEN --- | QA Contact: | |
Severity: | 5 - Enhancement | ||
Priority: | --- | CC: | akmalm, phils |
Version: | 14.11.8 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | DownUnder GeoSolutions | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Tzag Elita Sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- |
Description
David Bigagli
2015-09-08 21:17:16 MDT
Based on #1913: if a job fails because one of the nodes in its allocation failed, it is not immediately clear which node failed. Moreover, the slurmstepd on each of the other nodes logs a message in the job output that is misleading, because it mentions its own hostname, which has nothing to do with the failed node that caused the job to be terminated. David
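To illustrate the gap, here is a minimal sketch of what can be recovered from accounting output today. It assumes pipe-delimited text with a header row, such as `sacct --parsable2 --format=JobID,State,NodeList` would produce. The helper name `node_fail_jobs`, the job IDs, and the hostnames are made up for the example. For a job in state NODE_FAIL it can only return the whole allocation's node list, not the node that actually failed.

```python
def node_fail_jobs(sacct_text):
    """Return {job_id: node_list} for records whose State is NODE_FAIL.

    Assumes pipe-delimited input whose first non-empty line is a header
    containing at least the JobID, State and NodeList columns.
    """
    lines = [ln for ln in sacct_text.splitlines() if ln.strip()]
    if not lines:
        return {}
    # Map column names to positions so the column order does not matter.
    idx = {name: i for i, name in enumerate(lines[0].split("|"))}
    result = {}
    for ln in lines[1:]:
        fields = ln.split("|")
        if fields[idx["State"]].startswith("NODE_FAIL"):
            result[fields[idx["JobID"]]] = fields[idx["NodeList"]]
    return result


# Hypothetical sample data, not real accounting output.
sample = """JobID|State|NodeList
100|COMPLETED|node[01-04]
101|NODE_FAIL|node[05-08]
101.0|CANCELLED|node[05-08]
"""

print(node_fail_jobs(sample))
```

The result for job 101 is the full allocation `node[05-08]`; nothing in these columns says which of the four nodes went down, which is the information this request asks sacct to display.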