11122 – Job number in output file does not match job number of output file

Summary: Job number in output file does not match job number of output file

Status:	RESOLVED INFOGIVEN

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Other
Version:	20.02.5
Hardware:	Linux Linux

Severity:	4 - Minor Issue
Assignee:	Marshall Garey
QA Contact:

URL:

Depends on:
Blocks:

Reported:	2021-03-17 10:56 MDT by Jeff Haferman
Modified:	2021-03-17 13:01 MDT
CC List:	0 users

See Also:
Site:	NPS HPC


Description Jeff Haferman 2021-03-17 10:56:21 MDT

Strange issue... a user is running an array job, and directing error output via:

#SBATCH --error=./jobout/wrench_%x_%A-%a.err

In an output file named: "wrench_w352_41004292-999.err"
He is seeing a message: "JOB 41005626 ON compute-7-39 CANCELLED"

The question is why is "JOB 41005626" being reported in a file that corresponds to job "41004292" (per the name of the error log)?

Comment 1 Marshall Garey 2021-03-17 11:37:20 MDT

(In reply to Jeff Haferman from comment #0)
> Strange issue... a user is running an array job, and directing error output
> via:
> 
> #SBATCH --error=./jobout/wrench_%x_%A-%a.err
> 
> In an output file named: "wrench_w352_41004292-999.err"
> He is seeing a message: "JOB 41005626 ON compute-7-39 CANCELLED"
> 
> The question is why is "JOB 41005626" being reported in a file that
> corresponds to job "41004292" (per the name of the error log)?

An array job has a few more details regarding job IDs. When an array job is submitted, a single job record with a job ID is created, but it carries special information such as the number of jobs in the array and the job ID of the array. Whenever a job in the array is scheduled, a new job record is created for that job. That new job record has a new job ID, although it still keeps track of the array job ID and its index in the array.

In this case:

* The array job ID is 41004292.
* The job ID is 41005626, not 41004292.
* The job's index in the array is 999.

All of these values can be accessed via different environment variables, which can be found in the sbatch man page (https://slurm.schedmd.com/sbatch.html).

SLURM_ARRAY_TASK_COUNT
    Total number of tasks in a job array. 
SLURM_ARRAY_TASK_ID
    Job array ID (index) number. 
SLURM_ARRAY_TASK_MAX
    Job array's maximum ID (index) number. 
SLURM_ARRAY_TASK_MIN
    Job array's minimum ID (index) number. 
SLURM_ARRAY_TASK_STEP
    Job array's index step size. 
SLURM_ARRAY_JOB_ID
    Job array's master job ID number. 
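To see the distinction from inside a running array task, a batch script can print the task's own job ID next to the array values. The following is only a minimal sketch: the job name w352 is taken from the file name above, the array range 0-999 is an assumption (the ticket only shows index 999), and describe_ids is a hypothetical helper name. SLURM_JOB_ID is the job ID environment variable documented in the same sbatch man page. Outside of Slurm the variables are unset, so the helper prints "unset" instead.

```shell
#!/bin/bash
#SBATCH --job-name=w352
#SBATCH --array=0-999
#SBATCH --error=./jobout/wrench_%x_%A-%a.err

# Hypothetical helper: print the IDs Slurm exports to an array task.
# Falls back to "unset" when run outside of a Slurm job.
describe_ids() {
    echo "SLURM_ARRAY_JOB_ID=${SLURM_ARRAY_JOB_ID:-unset}"    # array job ID (%A)
    echo "SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-unset}"  # index in the array (%a)
    echo "SLURM_JOB_ID=${SLURM_JOB_ID:-unset}"                # this task's own job ID
}

# Write to stderr so the lines land in the --error file.
describe_ids >&2
```

For the task discussed here, the error file would be expected to show SLURM_ARRAY_JOB_ID=41004292, SLURM_ARRAY_TASK_ID=999 and SLURM_JOB_ID=41005626, which is the ID that appears in the CANCELLED message.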

The meaning of the different "%" options in a filename can be found under the "filename pattern" section of the sbatch man page.

%A
    Job array's master job allocation number. 
%a
    Job array ID (index) number. 
%x
    Job name.
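
As an illustration of why the file name carries 41004292 while the message inside mentions 41005626, the substitution can be imitated in plain bash. This is a sketch of the effect only, not Slurm's implementation, and expand_pattern is a hypothetical helper. It also handles %j, which the same man page section lists as the job ID of the running job; adding it to the pattern is one way to get the task's own job ID into the file name.

```shell
#!/bin/bash

# Hypothetical helper that imitates sbatch filename pattern expansion
# for the four specifiers discussed here. Not how Slurm does it internally.
# Arguments: pattern, job name, array job ID, array index, job ID
expand_pattern() {
    local pattern=$1 job_name=$2 array_job_id=$3 array_index=$4 job_id=$5
    printf '%s\n' "$pattern" | sed \
        -e "s|%x|$job_name|g" \
        -e "s|%A|$array_job_id|g" \
        -e "s|%a|$array_index|g" \
        -e "s|%j|$job_id|g"
}

# Pattern from the ticket: only the array job ID and index appear.
expand_pattern './jobout/wrench_%x_%A-%a.err' w352 41004292 999 41005626
# prints ./jobout/wrench_w352_41004292-999.err

# Same pattern with %j appended: the task's own job ID appears too.
expand_pattern './jobout/wrench_%x_%A-%a_%j.err' w352 41004292 999 41005626
# prints ./jobout/wrench_w352_41004292-999_41005626.err
```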


Does that make sense?

Comment 2 Jeff Haferman 2021-03-17 12:20:28 MDT

Marshall -
Thank you, I figured the explanation would be something like this, but I couldn't quite find it in the documentation.

Appreciate it!

Comment 3 Marshall Garey 2021-03-17 13:01:42 MDT

You're welcome. Closing as infogiven.