Ticket 11122

Summary: Job number in output file does not match job number of output file
Product: Slurm
Reporter: Jeff Haferman <jlhaferm>
Component: Other
Assignee: Marshall Garey <marshall>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 20.02.5   
Hardware: Linux   
OS: Linux   
Site: NPS HPC

Description Jeff Haferman 2021-03-17 10:56:21 MDT
Strange issue... a user is running an array job, and directing error output via:

#SBATCH --error=./jobout/wrench_%x_%A-%a.err

In an output file named: "wrench_w352_41004292-999.err"
He is seeing a message: "JOB 41005626 ON compute-7-39 CANCELLED"

The question is why is "JOB 41005626" being reported in a file that corresponds to job "41004292" (per the name of the error log)?
Comment 1 Marshall Garey 2021-03-17 11:37:20 MDT
(In reply to Jeff Haferman from comment #0)
> Strange issue... a user is running an array job, and directing error output
> via:
> 
> #SBATCH --error=./jobout/wrench_%x_%A-%a.err
> 
> In an output file named: "wrench_w352_41004292-999.err"
> He is seeing a message: "JOB 41005626 ON compute-7-39 CANCELLED"
> 
> The question is why is "JOB 41005626" being reported in a file that
> corresponds to job "41004292" (per the name of the error log)?

An array job has a few more details regarding job IDs. When an array job is submitted, a single job record with a job ID is created, but it carries special information such as the number of jobs in the array and the job ID of the array. Whenever a job in the array is scheduled, a new job record is created for that job. That new job record has a new job ID, although it still keeps track of the array job ID and its index in the array.

In this case:

* The array job ID is 41004292.
* The job ID is 41005626, not 41004292.
* The job's index in the array is 999.

All of these values can be accessed via different environment variables, which can be found in the sbatch man page (https://slurm.schedmd.com/sbatch.html).

SLURM_ARRAY_TASK_COUNT
    Total number of tasks in a job array. 
SLURM_ARRAY_TASK_ID
    Job array ID (index) number. 
SLURM_ARRAY_TASK_MAX
    Job array's maximum ID (index) number. 
SLURM_ARRAY_TASK_MIN
    Job array's minimum ID (index) number. 
SLURM_ARRAY_TASK_STEP
    Job array's index step size. 
SLURM_ARRAY_JOB_ID
    Job array's master job ID number. 
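
As a rough illustration, a job script could print these identifiers side by side. This is only a sketch: describe_array_task is a made-up helper name, and SLURM_JOB_ID (the task's own job ID) is not in the list above but is also documented in the sbatch man page. Outside of a Slurm job the variables are not set, so the placeholder "unset" is printed instead.

```shell
#!/bin/bash
# Print the identifiers of the current array task.
# describe_array_task is a hypothetical helper name, not a Slurm command.
describe_array_task() {
    # Job ID shared by the whole array
    echo "SLURM_ARRAY_JOB_ID=${SLURM_ARRAY_JOB_ID:-unset}"
    # Index of this task within the array
    echo "SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-unset}"
    # Job ID of this task's own job record
    echo "SLURM_JOB_ID=${SLURM_JOB_ID:-unset}"
}

describe_array_task
```

For the task in this ticket, the three lines would be expected to show 41004292, 999 and 41005626 respectively.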

The meaning of the different "%" options in a filename can be found under the "filename pattern" section of the sbatch man page.

%A
    Job array's master job allocation number. 
%a
    Job array ID (index) number. 
%x
    Job name.
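
To make the mapping concrete, here is a small sketch that mimics how the file name is built from the IDs in this ticket. The expand_pattern function is a hypothetical helper, not part of Slurm, and it handles only four specifiers. It also uses %j (the job ID of the running job, from the same man page section), which could be added to the pattern if the task's own job ID should appear in the file name.

```shell
#!/bin/bash
# expand_pattern is a hypothetical helper, not part of Slurm. It mimics only
# four specifiers to show which ID ends up where in the file name.
# Usage: expand_pattern PATTERN JOB_NAME ARRAY_JOB_ID ARRAY_INDEX JOB_ID
expand_pattern() {
    printf '%s\n' "$1" |
        sed -e "s/%x/$2/g" -e "s/%A/$3/g" -e "s/%a/$4/g" -e "s/%j/$5/g"
}

# Pattern from the ticket: the file name carries the array job ID only.
expand_pattern 'wrench_%x_%A-%a.err' w352 41004292 999 41005626

# Adding %j also puts the task's own job ID in the file name.
expand_pattern 'wrench_%x_%A-%a_%j.err' w352 41004292 999 41005626
```

The first call prints wrench_w352_41004292-999.err, matching the file in question, while the second prints wrench_w352_41004292-999_41005626.err, so the ID in the "CANCELLED" message would also be visible in the file name.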


Does that make sense?
Comment 2 Jeff Haferman 2021-03-17 12:20:28 MDT
Marshall -
Thank you, I figured the explanation would be something like this, but I couldn't quite find it in the documentation.

Appreciate it!
Comment 3 Marshall Garey 2021-03-17 13:01:42 MDT
You're welcome. Closing as infogiven.