| Summary: | Terminal job states | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Kris Whetham <kwhetham> |
| Component: | Accounting | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 20.02.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | FB (PSLA) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Kris - we will look into this request and let you know. I want to give you some extra information.
Internally, we differentiate job states from job state flags. Job state flags give a bit more detail on the status of the job at a certain moment.
Also, there are a few more states and flags than the ones you listed (such as JOB_REQUEUE_CRON, which is new in 20.11). I describe them below.
Generally speaking, any "state" 'greater' than JOB_SUSPENDED means that the job is effectively terminated, so a job in state PD, R or S can still be modified, while a job in any other state is considered 'done'.
Depending on what happened to the job, sacct will show the state flag instead of the job state.
For example, for a job that has been requeued you will see the state as REQUEUED (the flag).
(In this case, note that there will be duplicate entries for the job, since a requeue creates a new entry for the job with the same job ID):
[lipi@llagosti 20.02]$ sacct --duplicates -j 17357
JobID        JobName    Partition  Account    AllocCPUS  State      ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
17357        wrap       debug      lipi       2          REQUEUED   0:0       <--- That's the requeued job and steps
17357.batch  batch                 lipi       2          CANCELLED  0:15
17357.extern extern                lipi       2          COMPLETED  0:0
17357        wrap       debug      lipi       1          PENDING    0:0       <--- That's the new entry
In that case, you should consider RQ, SE and RV as definitive states. All the other flags are transient and could potentially change.
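For an ETL job, the rule above could be sketched roughly as follows. This is only an illustration and not Slurm code: the names `DEFINITIVE_FLAGS`, `TRANSIENT_FLAGS` and `flag_is_definitive` are hypothetical, and the strings are simply the ones listed under JOB_STATE_FLAGS in this comment.

```python
from typing import Optional

# Flag strings as they appear under JOB_STATE_FLAGS in this ticket.
# The split into definitive/transient follows the advice above (RQ, SE, RV).
DEFINITIVE_FLAGS = {"REQUEUED", "SPECIAL_EXIT", "REVOKED"}
TRANSIENT_FLAGS = {
    "COMPLETING", "STAGE_OUT", "CONFIGURING", "RESIZING", "REQUEUED_CRON",
    "REQUEUE_FED", "REQUEUE_HOLD", "STOPPED", "RESV_DEL_HOLD", "SIGNALING",
}


def flag_is_definitive(state: str) -> Optional[bool]:
    """Return True/False for a known flag string, None for anything else."""
    if state in DEFINITIVE_FLAGS:
        return True
    if state in TRANSIENT_FLAGS:
        return False
    return None  # not a flag string from the list; presumably a base state


# (JobID, State) pairs taken from the sacct example above.
rows = [("17357", "REQUEUED"), ("17357.batch", "CANCELLED"),
        ("17357.extern", "COMPLETED"), ("17357", "PENDING")]
for job_id, state in rows:
    print(job_id, state, flag_is_definitive(state))
```

Returning `None` for strings that are not flags keeps this check separate from any decision about base states such as PENDING or COMPLETED.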
/* JOB_STATE_FLAGS */
- CG - JOB_COMPLETING
String: "COMPLETING"
- SO - JOB_STAGE_OUT
String: "STAGE_OUT"
- CF - JOB_CONFIGURING
String: "CONFIGURING"
- RS - JOB_RESIZING
String: "RESIZING"
- RC - JOB_REQUEUE_CRON
String: "REQUEUED_CRON"
- RQ - JOB_REQUEUE
String: "REQUEUED"
- RF - JOB_REQUEUE_FED
String: "REQUEUE_FED"
- RH - JOB_REQUEUE_HOLD
String: "REQUEUE_HOLD"
- SE - JOB_SPECIAL_EXIT
String: "SPECIAL_EXIT"
- ST - JOB_STOPPED
String: "STOPPED"
- RV - JOB_REVOKED
String: "REVOKED"
- RD - JOB_RESV_DEL_HOLD
String: "RESV_DEL_HOLD"
- SI - JOB_SIGNALING
String: "SIGNALING"
/* JOB_STATE_BASE */
- PD - JOB_PENDING:
String: "PENDING"
- R - JOB_RUNNING:
String: "RUNNING"
- S - JOB_SUSPENDED:
String: "SUSPENDED"
- CD - JOB_COMPLETE:
String: "COMPLETED"
- CA - JOB_CANCELLED:
String: "CANCELLED"
- F - JOB_FAILED:
String: "FAILED"
- TO - JOB_TIMEOUT:
String: "TIMEOUT"
- NF - JOB_NODE_FAIL:
String: "NODE_FAIL"
- PR - JOB_PREEMPTED:
String: "PREEMPTED"
- BF - JOB_BOOT_FAIL:
String: "BOOT_FAIL"
- DL - JOB_DEADLINE:
String: "DEADLINE"
- OOM - JOB_OOM:
String: "OUT_OF_MEMORY"
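The "greater than JOB_SUSPENDED" rule for base states could be sketched like this. It is a minimal illustration under stated assumptions: the `JobState` enum and `is_terminal` helper are hypothetical names, and the numeric values simply mirror the order of the JOB_STATE_BASE list above, so they should be checked against slurm.h for the Slurm version in use.

```python
from enum import IntEnum


class JobState(IntEnum):
    # Assumption: values follow the order of the JOB_STATE_BASE list above.
    PENDING = 0
    RUNNING = 1
    SUSPENDED = 2
    COMPLETED = 3
    CANCELLED = 4
    FAILED = 5
    TIMEOUT = 6
    NODE_FAIL = 7
    PREEMPTED = 8
    BOOT_FAIL = 9
    DEADLINE = 10
    OUT_OF_MEMORY = 11


def is_terminal(state_string: str) -> bool:
    """True if the base state sorts after SUSPENDED, i.e. the job is 'done'."""
    # sacct may append detail after the state name (e.g. "CANCELLED by 1000"),
    # so only the first word is used for the lookup.
    name = state_string.split()[0]
    # Raises KeyError for flag strings such as "REQUEUED"; those are not
    # base states and need to be handled separately.
    return JobState[name] > JobState.SUSPENDED


for s in ("PENDING", "RUNNING", "SUSPENDED", "COMPLETED", "CANCELLED by 1000"):
    print(s, is_terminal(s))
```

Using an ordered enum keeps the check a single comparison, which matches how the rule is phrased above.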
Does this answer your question?
Hi, I am marking the bug as INFOGIVEN. Please reopen it if something is not clear. Thank you!
Hi Sched,
Which of these Job State Codes are considered terminal (i.e. once a job reaches this state, none of its information as reported by sacct will change)? We are trying to build an ETL pipeline to pull this data out of SLURM, so at each interval we would like to extract data which will not change in the future.
JOB STATE CODES
BF   BOOT_FAIL      Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued).
CA   CANCELLED      Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD   COMPLETED      Job has terminated all processes on all nodes with an exit code of zero.
DL   DEADLINE       Job terminated on deadline.
F    FAILED         Job terminated with non-zero exit code or other failure condition.
NF   NODE_FAIL      Job terminated due to failure of one or more allocated nodes.
OOM  OUT_OF_MEMORY  Job experienced out of memory error.
PD   PENDING        Job is awaiting resource allocation.
PR   PREEMPTED      Job terminated due to preemption.
R    RUNNING        Job currently has an allocation.
RQ   REQUEUED       Job was requeued.
RS   RESIZING       Job is about to change size.
RV   REVOKED        Sibling was removed from cluster due to other cluster starting the job.
S    SUSPENDED      Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
TO   TIMEOUT        Job terminated upon reaching its time limit.
Thanks,
Kris W.