Ticket 10215

Summary: Terminal job states
Product: Slurm
Reporter: Kris Whetham <kwhetham>
Component: Accounting
Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: ---    
Version: 20.02.5   
Hardware: Linux   
OS: Linux   
Site: FB (PSLA)

Description Kris Whetham 2020-11-13 06:26:09 MST
Hi Sched, 

Which of these Job State Codes are considered terminal (i.e., once a job reaches this state, none of its information as reported by sacct will change)?

We are trying to build an ETL pipeline to pull this data out of SLURM, so at each interval we would like to extract data which will not change in the future.



JOB STATE CODES

BF BOOT_FAIL
    Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued). 
CA CANCELLED
    Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. 
CD COMPLETED
    Job has terminated all processes on all nodes with an exit code of zero. 
DL DEADLINE
    Job terminated on deadline. 
F FAILED
    Job terminated with non-zero exit code or other failure condition. 
NF NODE_FAIL
    Job terminated due to failure of one or more allocated nodes. 
OOM OUT_OF_MEMORY
    Job experienced out of memory error. 
PD PENDING
    Job is awaiting resource allocation. 
PR PREEMPTED
    Job terminated due to preemption. 
R RUNNING
    Job currently has an allocation. 
RQ REQUEUED
    Job was requeued. 
RS RESIZING
    Job is about to change size. 
RV REVOKED
    Sibling was removed from cluster due to other cluster starting the job. 
S SUSPENDED
    Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. 
TO TIMEOUT
    Job terminated upon reaching its time limit. 

Thanks, 
Kris W.
Comment 1 Jason Booth 2020-11-13 11:51:46 MST
Kris - we will look into this request and let you know.
Comment 2 Felip Moll 2020-11-18 09:58:25 MST
I want to give you some extra information.

Internally, we differentiate job states from job state flags. Job state flags give a bit more detail on the status of the job at a certain moment.
Also, there are a few more states/flags than the ones you listed (like JOB_REQUEUE_CRON, which is new in 20.11); I describe them below.

Generally speaking, any "state" 'greater' than JOB_SUSPENDED means that the job is effectively terminated, so a job in states PD, R or S can be modified, but a job in all other states is considered 'done'.
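
As a rough illustration of the "greater than JOB_SUSPENDED" rule, here is a minimal Python sketch. The member order is copied from the JOB_STATE_BASE list in this comment; the class name, the numeric values and the helper is_done() are assumptions made for the example, not Slurm's own API.

```python
from enum import IntEnum


class JobStateBase(IntEnum):
    # Order follows the JOB_STATE_BASE list in this comment.
    # The numeric values are assumed for illustration only.
    PENDING = 0
    RUNNING = 1
    SUSPENDED = 2
    COMPLETED = 3
    CANCELLED = 4
    FAILED = 5
    TIMEOUT = 6
    NODE_FAIL = 7
    PREEMPTED = 8
    BOOT_FAIL = 9
    DEADLINE = 10
    OUT_OF_MEMORY = 11


def is_done(state: JobStateBase) -> bool:
    """Return True when the base state sorts after SUSPENDED."""
    return state > JobStateBase.SUSPENDED


# Base states that can still be modified.
print([s.name for s in JobStateBase if not is_done(s)])
# → ['PENDING', 'RUNNING', 'SUSPENDED']
```

This only covers base states; the flags described further down need separate handling.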

Depending on what happened to the job, sacct will show the state flag instead of the job state.
For example, for a job that has been requeued you will see the state as REQUEUED (the flag).

Note that in this case there will be duplicate entries for the job, since a requeue creates a new entry with the same job ID:

[lipi@llagosti 20.02]$ sacct --duplicates -j 17357
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
17357              wrap      debug       lipi          2   REQUEUED      0:0    <--- That's the requeued job and steps
17357.batch       batch                  lipi          2  CANCELLED     0:15 
17357.extern     extern                  lipi          2  COMPLETED      0:0 
17357              wrap      debug       lipi          1    PENDING      0:0  <-- That's the new entry
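
For an ETL pipeline, the duplicate entries mean that a job ID alone does not identify a single record. The Python sketch below groups job-level rows by job ID while keeping their order. It assumes the data was obtained in pipe-delimited form (for example with sacct --duplicates -P -n --format=JobID,State); the sample text is rewritten by hand from the output above, and group_job_records() is a hypothetical helper name.

```python
from collections import defaultdict

# Rows rewritten by hand from the sacct output above into "JobID|State" form.
SAMPLE = """\
17357|REQUEUED
17357.batch|CANCELLED
17357.extern|COMPLETED
17357|PENDING
"""


def group_job_records(text):
    """Map each job ID to the ordered list of states of its job-level rows.

    Step rows (IDs containing a dot, such as 17357.batch) are skipped.
    """
    records = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        job_id, state = line.split("|")[:2]
        if "." in job_id:
            continue  # job step, not a job-level entry
        records[job_id].append(state)
    return dict(records)


print(group_job_records(SAMPLE))
# → {'17357': ['REQUEUED', 'PENDING']}
```

Here the first entry for 17357 is the requeued one and the second is the new pending entry, so the pipeline would need an extra key (such as the submit time) to tell them apart.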


When sacct shows a flag rather than a base state, you should consider RQ, SE and RV as definitive states. All the other flags are transient and could still change.

/* JOB_STATE_FLAGS */

- CG - JOB_COMPLETING
	String: "COMPLETING"
- SO - JOB_STAGE_OUT
	String: "STAGE_OUT"
- CF - JOB_CONFIGURING
	String: "CONFIGURING"
- RS - JOB_RESIZING
	String: "RESIZING"
- RC - JOB_REQUEUE_CRON
	String: "REQUEUED_CRON"
- RQ - JOB_REQUEUE
	String: "REQUEUED"
- RF - JOB_REQUEUE_FED
	String: "REQUEUE_FED"
- RH - JOB_REQUEUE_HOLD
	String: "REQUEUE_HOLD"
- SE - JOB_SPECIAL_EXIT
	String: "SPECIAL_EXIT"
- ST - JOB_STOPPED
	String: "STOPPED"
- RV - JOB_REVOKED
	String: "REVOKED"
- RD - JOB_RESV_DEL_HOLD
	String: "RESV_DEL_HOLD"
- SI - JOB_SIGNALING
	String: "SIGNALING"

/* JOB_STATE_BASE */

- PD - JOB_PENDING:
	String: "PENDING"
- R - JOB_RUNNING:
	String: "RUNNING"
- S - JOB_SUSPENDED:
	String: "SUSPENDED"
- CD - JOB_COMPLETE:
	String: "COMPLETED"
- CA - JOB_CANCELLED:
	String: "CANCELLED"
- F - JOB_FAILED:
	String: "FAILED"
- TO - JOB_TIMEOUT:
	String: "TIMEOUT"
- NF - JOB_NODE_FAIL:
	String: "NODE_FAIL"
- PR - JOB_PREEMPTED:
	String: "PREEMPTED"
- BF - JOB_BOOT_FAIL:
	String: "BOOT_FAIL"
- DL - JOB_DEADLINE:
	String: "DEADLINE"
- OOM - JOB_OOM:
	String: "OUT_OF_MEMORY"
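
Putting the two lists together, an extraction filter could look like the Python sketch below. The set names and is_safe_to_extract() are hypothetical. The sketch assumes the input is the text of sacct's State column, that a cancelled job may appear as "CANCELLED by <uid>", and that a trailing "+" marks a value truncated by the column width. Anything it does not recognise is treated as not final.

```python
# Base states after SUSPENDED: the job is considered done.
DONE_BASE = {
    "COMPLETED", "CANCELLED", "FAILED", "TIMEOUT", "NODE_FAIL",
    "PREEMPTED", "BOOT_FAIL", "DEADLINE", "OUT_OF_MEMORY",
}

# Flags named as definitive in this comment (RQ, SE, RV).
DEFINITIVE_FLAGS = {"REQUEUED", "SPECIAL_EXIT", "REVOKED"}


def is_safe_to_extract(state_column: str) -> bool:
    """Return True when the State text names a done state or a definitive flag.

    PENDING, RUNNING, SUSPENDED, the transient flags and any unknown or
    truncated value all return False, so the record is retried later.
    """
    words = state_column.split()
    if not words:
        return False
    # Keep the first word ("CANCELLED by 1000" -> "CANCELLED") and drop
    # the "+" truncation marker if present.
    name = words[0].rstrip("+")
    return name in DONE_BASE or name in DEFINITIVE_FLAGS


for text in ["COMPLETED", "CANCELLED by 1000", "REQUEUED", "PENDING", "COMPLETING"]:
    print(text, is_safe_to_extract(text))
```

Defaulting to False for unknown values is a deliberate choice: a record that is skipped is picked up in a later interval, whereas a record extracted too early would go stale.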



Does this clear up your doubts?
Comment 3 Felip Moll 2020-11-23 12:35:13 MST
Hi,

I am marking the bug as INFOGIVEN. Please reopen it if something is not clear.

Thank you!