Ticket 3023 - sdiag "failed jobs" count always 0
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 16.05.4
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
Reported: 2016-08-25 15:31 MDT by Kilian Cavalotti
Modified: 2016-10-12 14:40 MDT
Site: Stanford
Version Fixed: 16.05.6, 17.02-pre3
Description Kilian Cavalotti 2016-08-25 15:31:54 MDT
Hi!

Looks like sdiag doesn't really report the number of failed jobs anymore:

# sacct -X -u kilian -o jobid,state
       JobID      State
------------ ----------
8695013       COMPLETED
8695014       COMPLETED
8695015       COMPLETED
8695016       COMPLETED
8695017       COMPLETED
9508805         TIMEOUT
9513365          FAILED
9513420       COMPLETED
9513431          FAILED
9513432          FAILED

# sdiag | grep Jobs
Jobs submitted: 13153
Jobs started:   10851
Jobs completed: 9424
Jobs canceled:  1034
Jobs failed:    0

Cheers,
-- 
Kilian
Comment 1 Tim Wickberg 2016-08-25 16:36:20 MDT
sdiag's 'failed' statistic only tracks jobs failing due to slurmd problems, not jobs that complete but return a non-zero exit code.

A non-zero value there indicates more serious issues within Slurm; it doesn't directly relate to a job's success or failure.

It's admittedly confusing, and I'm looking at how to better document and/or possibly rename that statistic to better describe what it's reporting.

- Tim
Comment 2 Kilian Cavalotti 2016-08-25 16:49:22 MDT
Aah, thanks for enlightening me! Makes sense now.

It could indeed use some clarification in the documentation. :)

Cheers,
--
Kilian
Comment 3 Tim Wickberg 2016-10-12 14:40:59 MDT
The sdiag man page has some minimal clarification as of commit 7fc830f7bab7.
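
To count jobs that ended in FAILED, TIMEOUT, and so on (what the description was after), the accounting records from sacct are the place to look, since sdiag's counter only covers slurmd-side failures. A minimal Python sketch, assuming the default two-column text output of `sacct -X -o jobid,state`; `tally_states` and `SAMPLE` are hypothetical names used for illustration, not part of Slurm:

```python
from collections import Counter

# Sample copied from the sacct output in the description above.
SAMPLE = """\
       JobID      State
------------ ----------
8695013       COMPLETED
8695014       COMPLETED
8695015       COMPLETED
8695016       COMPLETED
8695017       COMPLETED
9508805         TIMEOUT
9513365          FAILED
9513420       COMPLETED
9513431          FAILED
9513432          FAILED
"""


def tally_states(sacct_text):
    """Count jobs per state from `sacct -X -o jobid,state` text output.

    Hypothetical helper: assumes one job per line, with the job ID in the
    first column and the state in the second.
    """
    counts = Counter()
    for line in sacct_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and the dashed separator row.
        if len(fields) < 2 or not fields[0][0].isdigit():
            continue
        # Keep only the first word of the state column.
        counts[fields[1]] += 1
    return counts


if __name__ == "__main__":
    for state, n in sorted(tally_states(SAMPLE).items()):
        print(f"{state}: {n}")
```

On a live system the same function could be fed sacct output captured from the command line; that step is left out so the sketch runs without a Slurm installation.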