Ticket 16998

Summary: Node used for job not reflected in sacct
Product: Slurm Reporter: Scott Jeschonek <scottjes>
Component: Accounting Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 22.05.8   
Hardware: Linux   
OS: Linux   
Site: FB (PSLA) Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Scott Jeschonek 2023-06-16 16:45:04 MDT
A job {JOBID} does not list a specific node in its sacct output (when running sacct -j {Job}),

but slurmd on that node logged messages in /var/log/messages indicating the node was part of the job.

Why is there a discrepancy in sacct?
Comment 1 Ben Roberts 2023-06-19 09:28:37 MDT
Hi Scott,

That's interesting, there obviously shouldn't be messages from a job on a node that it didn't run on.  Can you send the sacct output showing the node(s) this job ran on?  Did you identify this problem while the job was still running, and if so did you gather the output of 'scontrol show job <jobid>' for this job?  If so we'd like to see that output.  What was the message you saw in /var/log/messages?  Can you share that as well?
Comment 2 Scott Jeschonek 2023-06-19 11:39:30 MDT
Hi Ben

If a job runs on nodes A, B, C and then requeues onto nodes A, B, D, will sacct show only the last set of nodes?
Comment 3 Scott Jeschonek 2023-06-19 11:43:07 MDT
Also Ben here is what we see on the node itself:

[2023-06-16T12:37:04.459] launch task StepId=672163.0 request from UID:409200051 GID:409200051 HOST:{redacted} PORT:46302
[2023-06-16T12:37:04.460] task/affinity: lllp_distribution: JobId=672163 auto binding off: mask_cpu
[2023-06-16T12:38:22.472] [672163.0] error: spank-auks: unable to unpack auks cred from reply : auks api : request processing failed
[2023-06-16T12:38:22.606] [672163.0] error: Could not open output file /checkpoint/{REDACTED}/xldumps/{redacted}7{redacted}/{redacted}/672163/672163_432_log.out: Permission denied
[2023-06-16T12:38:22.607] [672163.0] error: _fork_all_tasks: IO setup failed: Slurmd could not connect IO
[2023-06-16T12:38:22.659] [672163.0] error: spank-auks: spank_auks_remote_exit: called 0 times
[2023-06-16T12:38:22.659] [672163.0] error: job_manager: exiting abnormally: Slurmd could not connect IO
[2023-06-16T12:38:22.660] [672163.0] get_exit_code task 0 died by signal: 53
[2023-06-16T12:38:22.662] [672163.0] done with job
[2023-06-16T13:08:39.216] epilog for job 672163 ran for 65 seconds


As far as the sacct output goes, that machine does not show up (sacct -j 672163.0 --format Node%5000)
Comment 4 Ben Roberts 2023-06-19 12:26:27 MDT
Hi Scott,

Thanks for the additional detail; I think I know what's happening now.  When a job is requeued, it can run on a different set of nodes than it did the first time it started.  A separate job record is created for each time the job ran, since each run has its own start and end time (and potentially a different node list).  By default, sacct shows only the last instance of a job, but you can ask it to show any duplicate records it has for a given job id.

Here's an example of how that might look.  I submitted a job and it started on 'node10'.

$ sbatch -n24 -poverflow --wrap='srun sleep 120'
Submitted batch job 8573

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8573  overflow     wrap      ben  R       0:02      1 node10



Then I requeued the job and marked the node as down.  When the job restarted, it ran on a different node (node11).

$ scontrol requeue 8573

$ scontrol update nodename=node10 state=down reason=test

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8573  overflow     wrap      ben  R       0:07      1 node11



When that job completes and I request information about the job record from sacct, it shows only the most recent instance of the job, which ran on node11.

$ sacct -X -j 8573 --format=jobid,partition,state,nodelist
JobID         Partition      State        NodeList 
------------ ---------- ---------- --------------- 
8573           overflow  COMPLETED          node11 




If I add the '--duplicates' flag (or the short form '-D'), it shows both times the job ran, each with its own node list.

$ sacct -X -j 8573 --duplicates --format=jobid,partition,state,nodelist
JobID         Partition      State        NodeList 
------------ ---------- ---------- --------------- 
8573           overflow   REQUEUED          node10 
8573           overflow  COMPLETED          node11 
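
The deduplication behavior described above can be sketched in a few lines (illustrative Python only; the record fields and values here are taken from the example output, not from sacct's actual data model):

```python
# Each requeue leaves a separate accounting record under the same JobID.
# Records are listed oldest first, as sacct prints them with --duplicates.
records = [
    {"jobid": "8573", "state": "REQUEUED",  "nodelist": "node10"},
    {"jobid": "8573", "state": "COMPLETED", "nodelist": "node11"},
]

def latest_only(recs):
    """Default sacct view: keep only the last record for each JobID."""
    seen = {}
    for rec in recs:          # later records overwrite earlier ones
        seen[rec["jobid"]] = rec
    return list(seen.values())

def with_duplicates(recs):
    """sacct --duplicates / -D view: return every record, requeues included."""
    return list(recs)

print(latest_only(records))      # only the node11 run is shown
print(with_duplicates(records))  # both runs, node10 and node11
```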


Let me know if this doesn't sound like it's what happened in your case.

Thanks,
Ben
Comment 5 Ben Roberts 2023-06-28 14:24:26 MDT
Hi Scott,

Were you able to get the information you needed by showing the duplicate records for jobs that had been requeued?  Let me know if this doesn't match what was going on in your case, or if this ticket is OK to close.

Thanks,
Ben
Comment 6 Scott Jeschonek 2023-06-28 16:26:20 MDT
Thanks for the follow-up; please go ahead and close this case!
Comment 7 Ben Roberts 2023-06-29 08:22:12 MDT
Sounds good, closing now.

Thanks,
Ben