Ticket 3023

Summary: sdiag "failed jobs" count always 0
Product: Slurm
Reporter: Kilian Cavalotti <kilian>
Component: Other
Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue    
Priority: ---    
Version: 16.05.4   
Hardware: Linux   
OS: Linux   
Site: Stanford
Version Fixed: 16.05.6 17.02-pre3

Description Kilian Cavalotti 2016-08-25 15:31:54 MDT
Hi!

Looks like sdiag doesn't really report the number of failed jobs anymore:

# sacct -X -u kilian -o jobid,state
       JobID      State
------------ ----------
8695013       COMPLETED
8695014       COMPLETED
8695015       COMPLETED
8695016       COMPLETED
8695017       COMPLETED
9508805         TIMEOUT
9513365          FAILED
9513420       COMPLETED
9513431          FAILED
9513432          FAILED

# sdiag | grep Jobs
Jobs submitted: 13153
Jobs started:   10851
Jobs completed: 9424
Jobs canceled:  1034
Jobs failed:    0

Cheers,
-- 
Kilian
Comment 1 Tim Wickberg 2016-08-25 16:36:20 MDT
sdiag's 'failed' statistic only tracks jobs failing due to slurmd problems, not jobs that complete but return a non-zero exit code.

A non-zero value there indicates a more serious issue within Slurm; it doesn't directly relate to a job's success or failure.

It's admittedly confusing, and I'm looking at how to better document, and possibly rename, that statistic so it describes what it's actually reporting.

- Tim
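
Since sdiag's counter does not reflect job exit status, a per-state tally has to come from accounting data instead. The following is only a minimal sketch of one way to do that, not something proposed in this ticket: `count_job_states` is a hypothetical helper name, and it assumes pipe-separated, header-less input such as `sacct -X -n -P -o jobid,state` produces.

```shell
# Hypothetical helper: tally job states from "jobid|state" lines on stdin.
# Assumes input shaped like the output of: sacct -X -n -P -o jobid,state
count_job_states() {
    # Keep only the first word of the state so that values such as
    # "CANCELLED by 1234" are counted under CANCELLED.
    awk -F'|' 'NF >= 2 { split($2, s, " "); n[s[1]]++ }
               END { for (k in n) printf "%s %d\n", k, n[k] }' | sort
}

# Example usage (requires access to the Slurm accounting database):
#   sacct -X -n -P -u kilian -o jobid,state | count_job_states
```

Run against the ten jobs listed in the description, this should report 6 COMPLETED, 3 FAILED and 1 TIMEOUT, while sdiag's "Jobs failed" can still legitimately read 0.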
Comment 2 Kilian Cavalotti 2016-08-25 16:49:22 MDT
Aah, thanks for enlightening me! Makes sense now.

It could indeed use some clarification in the documentation. :)

Cheers,
--
Kilian
Comment 3 Tim Wickberg 2016-10-12 14:40:59 MDT
The man page has some minimal clarification as of commit 7fc830f7bab7.