| Summary: | Node used for job not reflected in sacct | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Scott Jeschonek <scottjes> |
| Component: | Accounting | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 22.05.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | FB (PSLA) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Hi Scott,

That's interesting; there obviously shouldn't be messages from a job on a node that it didn't run on. Can you send the sacct output showing the node(s) this job ran on? Did you identify this problem while the job was still running, and if so, did you gather the output of 'scontrol show job <jobid>' for this job? If so, we'd like to see that output. What was the message you saw in /var/log/messages? Can you share that as well?

Hi Ben, if a job runs on nodes A, B, C and then requeues on nodes A, B, D, will sacct show only the last set of nodes? Also, Ben, here is what we see on the node itself:
[2023-06-16T12:37:04.459] launch task StepId=672163.0 request from UID:409200051 GID:409200051 HOST:{redacted} PORT:46302
[2023-06-16T12:37:04.460] task/affinity: lllp_distribution: JobId=672163 auto binding off: mask_cpu
[2023-06-16T12:38:22.472] [672163.0] error: spank-auks: unable to unpack auks cred from reply : auks api : request processing failed
[2023-06-16T12:38:22.606] [672163.0] error: Could not open output file /checkpoint/{REDACTED}/xldumps/{redacted}7{redacted}/{redacted}/672163/672163_432_log.out: Permission denied
[2023-06-16T12:38:22.607] [672163.0] error: _fork_all_tasks: IO setup failed: Slurmd could not connect IO
[2023-06-16T12:38:22.659] [672163.0] error: spank-auks: spank_auks_remote_exit: called 0 times
[2023-06-16T12:38:22.659] [672163.0] error: job_manager: exiting abnormally: Slurmd could not connect IO
[2023-06-16T12:38:22.660] [672163.0] get_exit_code task 0 died by signal: 53
[2023-06-16T12:38:22.662] [672163.0] done with job
[2023-06-16T13:08:39.216] epilog for job 672163 ran for 65 seconds
As for the sacct output, that machine does not show up (sacct -j 672163.0 --format Node%5000).
Hi Scott,
Thanks for the additional detail, I think I know what's happening now. When a job is requeued it can run on a different set of nodes than it did the first time it started. A separate job record is created for each time the job ran, since each run has its own start and end time (and potentially its own node list). The default behavior of sacct is to show just the last instance of a job, but you can ask it to show any duplicate records it has for a given job id.
Here's an example of how that might look. I submitted a job and it started on 'node10'.
$ sbatch -n24 -poverflow --wrap='srun sleep 120'
Submitted batch job 8573
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8573 overflow wrap ben R 0:02 1 node10
Then I requeued the job and marked the node down. When the job restarted it ran on a different node (node11).
$ scontrol requeue 8573
$ scontrol update nodename=node10 state=down reason=test
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8573 overflow wrap ben R 0:07 1 node11
When that job completes and I request information about the job record from sacct, it shows just the most recent instance of the job, which ran on node11.
$ sacct -X -j 8573 --format=jobid,partition,state,nodelist
JobID Partition State NodeList
------------ ---------- ---------- ---------------
8573 overflow COMPLETED node11
If I add the '--duplicates' flag (you can also just specify '-D'), it shows both times the job ran, each with its own node list.
$ sacct -X -j 8573 --duplicates --format=jobid,partition,state,nodelist
JobID Partition State NodeList
------------ ---------- ---------- ---------------
8573 overflow REQUEUED node10
8573 overflow COMPLETED node11
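The selection rule behind those two outputs can be sketched in a few lines. This is not Slurm code, just a toy illustration (the record fields and the `show_records` helper are made up for this sketch): each requeue leaves its own record under the same job id, and by default only the most recent record is returned, while `--duplicates` returns them all.

```python
# Toy sketch of sacct's record selection; not real Slurm code.
# Records mimic the example above: one per run of job 8573.
records = [
    {"jobid": 8573, "state": "REQUEUED",  "nodelist": "node10"},  # first run
    {"jobid": 8573, "state": "COMPLETED", "nodelist": "node11"},  # after requeue
]

def show_records(records, jobid, duplicates=False):
    """Return records for a job id; only the last one unless duplicates=True."""
    matches = [r for r in records if r["jobid"] == jobid]
    return matches if duplicates else matches[-1:]

# Default behavior: just the final instance (node11 only).
print(show_records(records, 8573))
# With duplicates=True (like 'sacct -D'): both runs, node10 and node11.
print(show_records(records, 8573, duplicates=True))
```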
Let me know if this doesn't sound like it's what happened in your case.
Thanks,
Ben
Hi Scott,

Were you able to get the information you needed by showing the duplicate records for jobs that had been requeued? Let me know if this wasn't what was going on in your case, or if this ticket is ok to close.

Thanks,
Ben

Thanks for the follow-up; please go ahead and close this case!

Sounds good, closing now.

Thanks,
Ben
A job {JOBID} does not show a specific node when running sacct -j {Job}, but /var/log/messages on that node contains slurmd messages indicating it was part of the job. Why is there a discrepancy in sacct?