Hi!

Looks like sdiag doesn't really report the number of failed jobs anymore:

    # sacct -X -u kilian -o jobid,state
           JobID      State
    ------------ ----------
    8695013       COMPLETED
    8695014       COMPLETED
    8695015       COMPLETED
    8695016       COMPLETED
    8695017       COMPLETED
    9508805         TIMEOUT
    9513365          FAILED
    9513420       COMPLETED
    9513431          FAILED
    9513432          FAILED

    # sdiag | grep Jobs
    Jobs submitted: 13153
    Jobs started:   10851
    Jobs completed: 9424
    Jobs canceled:  1034
    Jobs failed:    0

Cheers,
--
Kilian
sdiag's 'failed' statistic only tracks jobs that fail due to slurmd problems, not jobs that complete but return a non-zero exit code. A non-zero value there indicates a more serious issue within Slurm; it doesn't directly relate to a job's success or failure.

It's admittedly confusing, and I'm looking at how to better document and/or possibly rename that statistic so it better describes what it's reporting.

- Tim
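Since sdiag's counter doesn't cover jobs that exit non-zero, a per-state tally has to come from accounting data instead. Below is a minimal sketch of one way to do that, assuming sacct is run with -n (no header) and -P (pipe-delimited output) so the state column isn't truncated. The helper name count_job_states is made up for illustration; it isn't part of Slurm.

```shell
# Hypothetical helper (not part of Slurm): tally jobs per state.
# Expects "jobid|state" lines on stdin, as produced by
#   sacct -X -n -P -o jobid,state
count_job_states() {
    awk -F'|' '
        NF >= 2 {
            # Keep only the first word of the state, so that
            # "CANCELLED by <uid>" is counted as CANCELLED.
            split($2, words, " ")
            count[words[1]]++
        }
        END {
            for (state in count) print state, count[state]
        }
    ' | sort
}

# Usage on a cluster (user name taken from the report above):
#   sacct -X -n -P -u kilian -o jobid,state | count_job_states
```

Reading from stdin keeps the counting separate from the sacct call, so the same function can be used with whatever user or time-range options are appropriate.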
Aah, thanks for enlightening me! Makes sense now. It could indeed use some clarification in the documentation. :)

Cheers,
--
Kilian
The man page now has some minimal clarification as of commit 7fc830f7bab7.