A job {JOBID} does not show a specific node when running 'sacct -j {Job}', but slurmd on that node logged a message in /var/log/messages indicating the node was part of the job. Why is there a discrepancy in sacct?
Hi Scott, That's interesting, there obviously shouldn't be messages from a job on a node that it didn't run on. Can you send the sacct output showing the node(s) this job ran on? Did you identify this problem while the job was still running, and if so did you gather the output of 'scontrol show job <jobid>' for this job? If so we'd like to see that output. What was the message you saw in /var/log/messages? Can you share that as well?
Hi Ben, if a job runs on nodes A, B, and C and is then requeued onto nodes A, B, and D, will sacct show only the last set of nodes?
Also, Ben, here is what we see on the node itself:

[2023-06-16T12:37:04.459] launch task StepId=672163.0 request from UID:409200051 GID:409200051 HOST:{redacted} PORT:46302
[2023-06-16T12:37:04.460] task/affinity: lllp_distribution: JobId=672163 auto binding off: mask_cpu
[2023-06-16T12:38:22.472] [672163.0] error: spank-auks: unable to unpack auks cred from reply : auks api : request processing failed
[2023-06-16T12:38:22.606] [672163.0] error: Could not open output file /checkpoint/{REDACTED}/xldumps/{redacted}7{redacted}/{redacted}/672163/672163_432_log.out: Permission denied
[2023-06-16T12:38:22.607] [672163.0] error: _fork_all_tasks: IO setup failed: Slurmd could not connect IO
[2023-06-16T12:38:22.659] [672163.0] error: spank-auks: spank_auks_remote_exit: called 0 times
[2023-06-16T12:38:22.659] [672163.0] error: job_manager: exiting abnormally: Slurmd could not connect IO
[2023-06-16T12:38:22.660] [672163.0] get_exit_code task 0 died by signal: 53
[2023-06-16T12:38:22.662] [672163.0] done with job
[2023-06-16T13:08:39.216] epilog for job 672163 ran for 65 seconds

As for the sacct output, that machine does not show up (sacct -j 672163.0 --format Node%5000).
Hi Scott,

Thanks for the additional detail; I think I know what's happening now. When a job is requeued, it can run on a different set of nodes than it did the first time it started. A unique job record is created for each time the job ran, since the first run has its own start and end time (and potentially node list) that don't apply to the second run. By default, sacct shows only the last instance of a job, but you can ask it to show any duplicate records it has for a given job id.

Here's an example of how that might look. I submitted a job and it started on 'node10':

$ sbatch -n24 -poverflow --wrap='srun sleep 120'
Submitted batch job 8573
$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
   8573  overflow     wrap      ben  R  0:02     1 node10

Then I requeued the job and marked the node as down. When the job restarted, it ran on a different node (node11):

$ scontrol requeue 8573
$ scontrol update nodename=node10 state=down reason=test
$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
   8573  overflow     wrap      ben  R  0:07     1 node11

When that job completes and I request its job record from sacct, it shows only the most recent instance, which ran on node11:

$ sacct -X -j 8573 --format=jobid,partition,state,nodelist
JobID        Partition  State      NodeList
------------ ---------- ---------- ---------------
8573         overflow   COMPLETED  node11

If I add the '--duplicates' flag (you can also just specify '-D'), it shows both times the job ran, each with its own node list:

$ sacct -X -j 8573 --duplicates --format=jobid,partition,state,nodelist
JobID        Partition  State      NodeList
------------ ---------- ---------- ---------------
8573         overflow   REQUEUED   node10
8573         overflow   COMPLETED  node11

Let me know if this doesn't sound like what happened in your case.

Thanks,
Ben
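If you want to post-process duplicate records programmatically, a small sketch (not part of the ticket) is below. It assumes you capture sacct's machine-readable output with the real '--parsable2' flag, e.g. 'sacct -X -D -j 8573 --parsable2 --format=jobid,partition,state,nodelist', and then parse the pipe-delimited text; the sample string here simply mirrors Ben's example for job 8573.

```python
# Sketch: parse pipe-delimited output from `sacct --parsable2` and list
# each run (state + node list) of a requeued job. The SAMPLE text below
# mirrors the 8573 example from this ticket; real output would come from
# running sacct yourself.
SAMPLE = """JobID|Partition|State|NodeList
8573|overflow|REQUEUED|node10
8573|overflow|COMPLETED|node11
"""

def runs_per_job(parsable2_text):
    """Return one (state, nodelist) tuple per job record in the output."""
    lines = parsable2_text.strip().splitlines()
    header = lines[0].split("|")
    state_i = header.index("State")
    nodes_i = header.index("NodeList")
    runs = []
    for line in lines[1:]:
        fields = line.split("|")
        runs.append((fields[state_i], fields[nodes_i]))
    return runs

for state, nodes in runs_per_job(SAMPLE):
    print(state, nodes)
# Prints:
# REQUEUED node10
# COMPLETED node11
```

This makes it easy to spot, for each requeue, which nodes a job touched, which is exactly the information the default (non-duplicate) sacct view hides.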
Hi Scott, Were you able to get the information you needed by showing the duplicate records for jobs that had been requeued? Let me know if this wasn't what was going on in your case or if this ticket is ok to close. Thanks, Ben
Thanks for the follow-up; please go ahead and close this case!
Sounds good, closing now. Thanks, Ben