A job {JOBID} does not show a specific node when running 'sacct -j {Job}', but slurmd on that node logged a message in /var/log/messages indicating the node was part of the job. Why is there a discrepancy in sacct?
Hi Scott, That's interesting, there obviously shouldn't be messages from a job on a node that it didn't run on. Can you send the sacct output showing the node(s) this job ran on? Did you identify this problem while the job was still running, and if so did you gather the output of 'scontrol show job <jobid>' for this job? If so we'd like to see that output. What was the message you saw in /var/log/messages? Can you share that as well?
Hi Ben, if a job runs on nodes A, B, and C and is then requeued onto nodes A, B, and D, will sacct show only the last set of nodes?
Also, Ben, here is what we see on the node itself:

[2023-06-16T12:37:04.459] launch task StepId=672163.0 request from UID:409200051 GID:409200051 HOST:{redacted} PORT:46302
[2023-06-16T12:37:04.460] task/affinity: lllp_distribution: JobId=672163 auto binding off: mask_cpu
[2023-06-16T12:38:22.472] [672163.0] error: spank-auks: unable to unpack auks cred from reply : auks api : request processing failed
[2023-06-16T12:38:22.606] [672163.0] error: Could not open output file /checkpoint/{REDACTED}/xldumps/{redacted}7{redacted}/{redacted}/672163/672163_432_log.out: Permission denied
[2023-06-16T12:38:22.607] [672163.0] error: _fork_all_tasks: IO setup failed: Slurmd could not connect IO
[2023-06-16T12:38:22.659] [672163.0] error: spank-auks: spank_auks_remote_exit: called 0 times
[2023-06-16T12:38:22.659] [672163.0] error: job_manager: exiting abnormally: Slurmd could not connect IO
[2023-06-16T12:38:22.660] [672163.0] get_exit_code task 0 died by signal: 53
[2023-06-16T12:38:22.662] [672163.0] done with job
[2023-06-16T13:08:39.216] epilog for job 672163 ran for 65 seconds

As for the sacct output, that machine does not show up (sacct -j 672163.0 --format Node%5000).
Hi Scott,

Thanks for the additional detail; I think I know what's happening now. When a job is requeued, it can run on a different set of nodes than it did the first time it started. A unique job record is created for each time the job ran, since the first run has its own start and end time (and potentially node list) that don't apply to the second run. By default, sacct shows only the last instance of a job, but you can ask it to show any duplicate records it has for a given job id.

Here's an example of how that might look. I submitted a job and it started on 'node10':

$ sbatch -n24 -poverflow --wrap='srun sleep 120'
Submitted batch job 8573
$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
   8573  overflow     wrap      ben  R  0:02     1 node10

Then I requeued the job and marked the node as down. When the job restarted, it ran on a different node (node11):

$ scontrol requeue 8573
$ scontrol update nodename=node10 state=down reason=test
$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
   8573  overflow     wrap      ben  R  0:07     1 node11

When that job completes and I request its job record from sacct, it shows only the most recent instance, which ran on node11:

$ sacct -X -j 8573 --format=jobid,partition,state,nodelist
JobID        Partition  State      NodeList
------------ ---------- ---------- ---------------
8573         overflow   COMPLETED  node11

If I add the '--duplicates' flag (you can also just specify '-D'), it shows both times the job ran, each with its own node list:

$ sacct -X -j 8573 --duplicates --format=jobid,partition,state,nodelist
JobID        Partition  State      NodeList
------------ ---------- ---------- ---------------
8573         overflow   REQUEUED   node10
8573         overflow   COMPLETED  node11

Let me know if this doesn't sound like what happened in your case.

Thanks,
Ben
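If you want to post-process duplicate records programmatically, a small sketch (not part of the ticket) is below. It assumes you capture sacct's machine-readable output with the real '--parsable2' flag, e.g. 'sacct -X -D -j 8573 --parsable2 --format=jobid,partition,state,nodelist', and then parse the pipe-delimited text; the sample string here simply mirrors Ben's example for job 8573.

```python
# Sketch: parse pipe-delimited output from `sacct --parsable2` and list
# each run (state + node list) of a requeued job. The SAMPLE text below
# mirrors the 8573 example from this ticket; real output would come from
# running sacct yourself.
SAMPLE = """JobID|Partition|State|NodeList
8573|overflow|REQUEUED|node10
8573|overflow|COMPLETED|node11
"""

def runs_per_job(parsable2_text):
    """Return one (state, nodelist) tuple per job record in the output."""
    lines = parsable2_text.strip().splitlines()
    header = lines[0].split("|")
    state_i = header.index("State")
    nodes_i = header.index("NodeList")
    runs = []
    for line in lines[1:]:
        fields = line.split("|")
        runs.append((fields[state_i], fields[nodes_i]))
    return runs

for state, nodes in runs_per_job(SAMPLE):
    print(state, nodes)
# Prints:
# REQUEUED node10
# COMPLETED node11
```

This makes it easy to spot, for each requeue, which nodes a job touched, which is exactly the information the default (non-duplicate) sacct view hides.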
Hi Scott, Were you able to get the information you needed by showing the duplicate records for jobs that had been requeued? Let me know if this wasn't what was going on in your case or if this ticket is ok to close. Thanks, Ben
Thanks for the follow-up; please go ahead and close this case!
Sounds good, closing now. Thanks, Ben