| Summary: | Job number in output file does not match job number of output file | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jeff Haferman <jlhaferm> |
| Component: | Other | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 20.02.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NPS HPC | Alineos Sites: | --- |
|
Description
Jeff Haferman
2021-03-17 10:56:21 MDT
(In reply to Jeff Haferman from comment #0)
> Strange issue... a user is running an array job, and directing error output
> via:
>
> #SBATCH --error=./jobout/wrench_%x_%A-%a.err
>
> In an output file named: "wrench_w352_41004292-999.err"
> He is seeing a message: "JOB 41005626 ON compute-7-39 CANCELLED"
>
> The question is why is "JOB 41005626" being reported in a file that
> corresponds to job "41004292" (per the name of the error log)?

An array job has a few more details regarding job IDs. When an array job is submitted, a single job record with a job ID is created, but it carries special information such as the number of jobs in the array and the job ID of the array. Whenever a job in the array is scheduled, a new job record is created for that job. That new job record has a new job ID, although it still keeps track of the array job ID and its index in the array.

In this case:

* The array job ID is 41004292.
* The job ID is 41005626, not 41004292.
* The job's index in the array is 999.

All of these values can be accessed via different environment variables, which can be found in the sbatch man page (https://slurm.schedmd.com/sbatch.html).

SLURM_ARRAY_TASK_COUNT
    Total number of tasks in a job array.
SLURM_ARRAY_TASK_ID
    Job array ID (index) number.
SLURM_ARRAY_TASK_MAX
    Job array's maximum ID (index) number.
SLURM_ARRAY_TASK_MIN
    Job array's minimum ID (index) number.
SLURM_ARRAY_TASK_STEP
    Job array's index step size.
SLURM_ARRAY_JOB_ID
    Job array's master job ID number.

The meaning of the different "%" options in a filename can be found under the "filename pattern" section of the sbatch man page.

%A
    Job array's master job allocation number.
%a
    Job array ID (index) number.
%x
    Job name.

Does that make sense?

Marshall - Thank you, I figured the explanation would be something like this, but I couldn't quite find it in the documentation. Appreciate it!

You're welcome. Closing as infogiven. |
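For anyone landing on this ticket later, the relationship between the IDs can be sketched in shell. This is only an illustration under stated assumptions, not Slurm's own code: `expand_pattern` and `describe_ids` are hypothetical helper names, `expand_pattern` mimics just the `%A`, `%a` and `%x` replacements quoted above, and the values passed in are the ones from this ticket. `SLURM_JOB_ID` is the per-task job ID that Slurm exports alongside the array variables listed above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): expands only %A, %a and %x
# in a filename pattern, roughly as sbatch does for --error/--output.
expand_pattern() {
    local pattern=$1 array_job_id=$2 task_id=$3 job_name=$4
    local out=$pattern
    out=${out//\%A/"$array_job_id"}   # %A: array (master) job ID
    out=${out//\%a/"$task_id"}        # %a: index within the array
    out=${out//\%x/"$job_name"}       # %x: job name
    printf '%s\n' "$out"
}

# Hypothetical helper: prints the three IDs a running array task can see.
# Outside of a Slurm job the variables are not set, so "unset" is printed.
describe_ids() {
    printf 'job %s is task %s of array job %s\n' \
        "${SLURM_JOB_ID:-unset}" \
        "${SLURM_ARRAY_TASK_ID:-unset}" \
        "${SLURM_ARRAY_JOB_ID:-unset}"
}

# The file name is built from the array job ID, not the per-task job ID.
expand_pattern './jobout/wrench_%x_%A-%a.err' 41004292 999 w352
# prints ./jobout/wrench_w352_41004292-999.err
```

Calling `describe_ids` near the top of the batch script would record the per-task job ID (41005626 here) next to the array job ID and index, which makes messages such as "JOB 41005626 ... CANCELLED" easier to match to the right log file.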