| Summary: | sdiag "failed jobs" count always 0 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
| Component: | Other | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 16.05.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Stanford | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 16.05.6 17.02-pre3 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
sdiag's 'failed' statistic only tracks jobs that fail due to slurmd problems, not jobs that complete but return a non-zero exit code. A non-zero value there indicates some more serious issue within Slurm; it doesn't directly relate to a job's success or failure. It's admittedly confusing, and I'm looking at how to better document and/or possibly rename that statistic to better describe what it's reporting.

- Tim

Aah, thanks for enlightening me! Makes sense now. It could indeed use some clarification in the documentation. :)

Cheers,
--
Kilian

The man page has some minimal clarification with commit 7fc830f7bab7.
Hi!

Looks like sdiag doesn't really report the number of failed jobs anymore:

    # sacct -X -u kilian -o jobid,state
           JobID      State
    ------------ ----------
    8695013       COMPLETED
    8695014       COMPLETED
    8695015       COMPLETED
    8695016       COMPLETED
    8695017       COMPLETED
    9508805         TIMEOUT
    9513365          FAILED
    9513420       COMPLETED
    9513431          FAILED
    9513432          FAILED

    # sdiag | grep Jobs
    Jobs submitted: 13153
    Jobs started: 10851
    Jobs completed: 9424
    Jobs canceled: 1034
    Jobs failed: 0

Cheers,
--
Kilian
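Since sdiag's "Jobs failed" counter only covers slurmd-side failures, a per-state tally of jobs that ended FAILED, TIMEOUT, and so on has to come from accounting data instead. The following is only a rough sketch of one way to do that, not an official recipe: `count_states` is a hypothetical helper name, and it assumes its input is `sacct` output produced with `-n` (no header) and `-P` (pipe-delimited) so that each line looks like `jobid|state`.

```shell
# Hypothetical helper: tally jobs per state from pipe-delimited
# "jobid|state" lines on stdin, one "STATE COUNT" line per state.
count_states() {
    awk -F'|' '
        NF >= 2 && $2 != "" {
            # Keep only the first word, so "CANCELLED by 1234"
            # is counted as "CANCELLED".
            split($2, words, " ")
            count[words[1]]++
        }
        END {
            for (state in count) print state, count[state]
        }
    ' | sort
}

# Assumed usage on a cluster with accounting enabled:
#   sacct -X -n -P -u "$USER" -o jobid,state | count_states
```

For the ten jobs listed in the report above, this would print 6 for COMPLETED, 3 for FAILED, and 1 for TIMEOUT, while `sdiag` would still show `Jobs failed: 0`.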