Ticket 11122 - Job number in output file does not match job number of output file
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 20.02.5
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
Reported: 2021-03-17 10:56 MDT by Jeff Haferman
Modified: 2021-03-17 13:01 MDT (History)
Site: NPS HPC


Description Jeff Haferman 2021-03-17 10:56:21 MDT
Strange issue... a user is running an array job, and directing error output via:

#SBATCH --error=./jobout/wrench_%x_%A-%a.err

In an output file named "wrench_w352_41004292-999.err", he is seeing the message "JOB 41005626 ON compute-7-39 CANCELLED".

The question is: why is "JOB 41005626" being reported in a file that corresponds to job "41004292" (per the name of the error log)?
Comment 1 Marshall Garey 2021-03-17 11:37:20 MDT
(In reply to Jeff Haferman from comment #0)

An array job has a few more details regarding job IDs. When an array job is submitted, a single job record with a job ID is created, but it carries special information such as the number of jobs in the array and the job ID of the array. Whenever a job in the array is scheduled, a new job record is created for that job. That new job record has a new job ID, although it still keeps track of the array job ID and its index in the array.

In this case:

* The array job ID is 41004292.
* The job ID is 41005626, not 41004292.
* The job's index in the array is 999.
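
To see all three values side by side, a batch script can print them itself. The following is only a minimal sketch: the job name is inferred from the file name in this ticket, the --array range is an assumption (the ticket only shows index 999), and describe_array_task is a hypothetical helper name, not part of Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=w352
#SBATCH --array=1-999            # assumed range; the ticket only shows index 999
#SBATCH --error=./jobout/wrench_%x_%A-%a.err

# Print the three identifiers that Slurm tracks for one array task.
# describe_array_task is a hypothetical helper, not a Slurm command.
describe_array_task() {
    # Outside of a Slurm job these variables are unset, so fall back to "unset".
    echo "job_id=${SLURM_JOB_ID:-unset} array_job_id=${SLURM_ARRAY_JOB_ID:-unset} array_task_id=${SLURM_ARRAY_TASK_ID:-unset}"
}

# Write the line to stderr so it lands in the --error file.
describe_array_task >&2
```

For the task in this ticket, the line written to the error file would be expected to show job_id=41005626 next to array_job_id=41004292 and array_task_id=999, which is the same pairing that the CANCELLED message and the file name show.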

All of these values can be accessed via different environment variables, which can be found in the sbatch man page (https://slurm.schedmd.com/sbatch.html).

SLURM_ARRAY_TASK_COUNT
    Total number of tasks in a job array. 
SLURM_ARRAY_TASK_ID
    Job array ID (index) number. 
SLURM_ARRAY_TASK_MAX
    Job array's maximum ID (index) number. 
SLURM_ARRAY_TASK_MIN
    Job array's minimum ID (index) number. 
SLURM_ARRAY_TASK_STEP
    Job array's index step size. 
SLURM_ARRAY_JOB_ID
    Job array's master job ID number. 
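
As an illustration of how these variables fit together, here is a rough sketch that reports a task's position within its array. It assumes the array was submitted as a plain range with an optional step (for example --array=10-50:10), not as a comma-separated list, and array_task_position is a hypothetical helper name.

```shell
#!/bin/bash

# Report this task's 1-based position within the array, e.g. "3 of 5".
# Assumes a plain range such as --array=10-50:10; a comma-separated
# index list would not fit this arithmetic.
array_task_position() {
    local id="${SLURM_ARRAY_TASK_ID:?not running inside a job array}"
    local min="${SLURM_ARRAY_TASK_MIN:-$id}"
    local step="${SLURM_ARRAY_TASK_STEP:-1}"
    local count="${SLURM_ARRAY_TASK_COUNT:-1}"
    echo "$(( (id - min) / step + 1 )) of ${count}"
}

# Only call the helper when the script is actually running as an array task.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    array_task_position
fi
```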

The meaning of the different "%" options in a filename can be found under the "filename pattern" section of the sbatch man page.

%A
    Job array's master job allocation number. 
%a
    Job array ID (index) number. 
%x
    Job name.
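
To make the substitution concrete, the sketch below mimics how the three patterns above would be filled in for the file from this ticket. It is only an illustration written with sed, not Slurm's own implementation; expand_pattern is a hypothetical helper, it handles only %x, %A and %a, and it assumes the substituted values contain no "/" or "&" characters.

```shell
#!/bin/bash

# Illustration only: expand %x, %A and %a in a filename pattern.
# Usage: expand_pattern PATTERN JOB_NAME ARRAY_JOB_ID ARRAY_INDEX
# expand_pattern is a hypothetical helper, not part of Slurm.
expand_pattern() {
    printf '%s\n' "$1" | sed -e "s/%x/$2/g" -e "s/%A/$3/g" -e "s/%a/$4/g"
}

expand_pattern './jobout/wrench_%x_%A-%a.err' w352 41004292 999
# prints ./jobout/wrench_w352_41004292-999.err
```

Note that none of these three patterns expands to the task's own job ID (41005626 here), which is why that number never appears in the file name. The same man page section also lists %j, the job ID of the running job, for cases where that value is wanted in the name.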


Does that make sense?
Comment 2 Jeff Haferman 2021-03-17 12:20:28 MDT
Marshall -
Thank you, I figured the explanation would be something like this, but I couldn't quite find it in the documentation.

Appreciate it!
Comment 3 Marshall Garey 2021-03-17 13:01:42 MDT
You're welcome. Closing as infogiven.