Ticket 225

Summary: sacct shows cancelled job when asked for running
Product: Slurm
Reporter: Lloyd Brown <lloyd_brown>
Component: Accounting
Assignee: Danny Auble <da>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact    
Priority: ---    
Version: 2.5.x   
Hardware: Linux   
OS: Linux   
Site: BYU - Brigham Young University

Description Lloyd Brown 2013-02-07 09:20:45 MST
I'm not sure whether this is a misunderstanding on our part, but when we run the following command against one of our nodes, it gives us output that we didn't expect. We're trying to get a list of all the currently running jobs on that node, and it shows us a job that has been cancelled and is no longer running.

Here's the specific command and output:

# sacct -N m7-2-5 -s R
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
80777        base-042-+      m7,m6   sgorrell         96    TIMEOUT      0:1 
80777.batch       batch              sgorrell          1  CANCELLED     0:15 


As I said, if we're misunderstanding, and "-s R" DOESN'T mean "show running jobs", then that's fine.  It just wasn't what we expected.

I'm not sure how to go about diagnosing this.  If you need me to run some commands to get more info to you, I'd be happy to do that.  I just don't know what other info to give you.

Lloyd Brown
Fulton Supercomputing Lab
Brigham Young University
Comment 1 Danny Auble 2013-02-07 09:47:00 MST
Lloyd, I think I see what is happening.  By default we fill in a start time of midnight for the current day.

What I think is happening is that job 80777 was still running at midnight on this node, so it falls inside the query's time window and shows up in the results.

I'll see about removing the default start time when asking for states.  In the meantime you can just add -Snow to your sacct line.  It is probably safer to do that anyway so you are clear what you are asking for.

Let me know if that fixes your issue.
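
The explanation in Comment 1 can be illustrated with a small model of a time-window state filter. This is a simplified sketch, not sacct's actual implementation: the function name `was_running_in_window` and all timestamps are hypothetical, and the job's real start and end times are not given in this ticket. It only shows why a window that begins at midnight can match a job that has already ended, while a window that begins at "now" does not.

```python
from datetime import datetime

def was_running_in_window(job_start, job_end, window_start, window_end):
    """Return True if the job was running at any instant in the window.

    job_end is None for a job that has not finished yet.
    Hypothetical helper: a model of the behavior, not sacct code.
    """
    effective_end = job_end if job_end is not None else window_end
    return job_start <= window_end and effective_end >= window_start

# Hypothetical timestamps chosen only for illustration.
now = datetime(2013, 2, 7, 9, 20)
midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
job_start = datetime(2013, 2, 6, 1, 0)
job_end = datetime(2013, 2, 7, 1, 0)  # ended shortly after midnight

# Window starting at midnight (the default described in Comment 1).
old_default = was_running_in_window(job_start, job_end, midnight, now)

# Window starting at "now" (the effect of adding -Snow).
explicit_now = was_running_in_window(job_start, job_end, now, now)

print(old_default, explicit_now)
```

Under these assumed times, the midnight window matches the finished job and the "now" window does not, which is consistent with the output reported in the description.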
Comment 2 Lloyd Brown 2013-02-07 09:50:14 MST
Ah.  A misunderstanding on my part.  For our purposes, adding that parameter is an easy fix, so if you don't want to change the default behavior, that's fine with me.

Thanks for the explanation.
Comment 3 Danny Auble 2013-02-07 10:15:11 MST
I still think the existing behavior is misleading, or at best confusing.  I just changed the start time to default to now when asking for states.

https://github.com/SchedMD/slurm/commit/af29d9a7a759958a9c5653df374593577cd430d3

I also cleared up the documentation here

https://github.com/SchedMD/slurm/commit/5d2181f4b28c783e2d88f9d5d9236f2663f8c2ec

Both these patches will be in the next 2.5 release.
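
For scripts that may run against releases without this change, one option is to always pass the start time explicitly, as Comment 1 suggests. The sketch below only assembles the command line and does not run it. The helper name `build_sacct_running_cmd` is hypothetical, and it assumes that the separated form `-S now` is parsed the same way as the `-Snow` form used in this ticket.

```python
import shlex

def build_sacct_running_cmd(node):
    """Build an sacct argument list for jobs running on `node` right now.

    Hypothetical helper. It uses only the options that appear in this
    ticket (-N, -s R, -S now) and makes the start time explicit.
    """
    return ["sacct", "-N", node, "-s", "R", "-S", "now"]

cmd = build_sacct_running_cmd("m7-2-5")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run` on a host where sacct is installed. Keeping the start time in the command makes the query's intent clear regardless of which default the installed release uses.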