Ticket 11521

Summary: sacct ranges are inaccurate
Product: Slurm
Reporter: Kris Whetham <kwhetham>
Component: User Commands
Assignee: Scott Hilton <scott>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Priority: ---
CC: albert.gil, calebh
Version: 20.11.5
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=12102
Site: FB (PSLA)

Description Kris Whetham 2021-05-04 07:49:01 MDT
sacct returns jobs that end after the -E time. Moreover, they do not appear when querying the next period.

calebh@h2repl:~$ echo $STATES
out_of_memory,resizing,timeout,cancelled,revoked,deadline,completed,requeued,node_fail,failed,preempted,boot_fail
calebh@h2repl:~$ sacct -P -S '2021-04-07T23:30:00' -E '2021-04-07T23:59:59' -s "$STATES" -a -o jobid,state,start,end | grep '04-08'
39125549.0|CANCELLED|2021-04-05T02:07:39|2021-04-08T00:00:22
39125551.0|CANCELLED|2021-04-05T22:06:00|2021-04-08T00:00:22
39125552.0|CANCELLED|2021-04-05T22:09:54|2021-04-08T00:00:23
39125557.0|CANCELLED|2021-04-05T01:59:34|2021-04-08T00:00:22
39125558.0|CANCELLED|2021-04-05T01:59:36|2021-04-08T00:00:23
39125559.0|CANCELLED|2021-04-05T01:59:36|2021-04-08T00:00:23
39125560.0|CANCELLED|2021-04-05T01:59:36|2021-04-08T00:00:22
39125561.0|CANCELLED|2021-04-05T01:59:36|2021-04-08T00:00:22
39125562.0|CANCELLED|2021-04-05T01:59:36|2021-04-08T00:00:22
39125563.0|CANCELLED|2021-04-05T01:59:36|2021-04-08T00:00:22
39228614.2|COMPLETED|2021-04-07T23:56:21|2021-04-08T00:00:12
calebh@h2repl:~$ sacct -P -S '2021-04-08T00:00:00' -E '2021-04-08T00:01:00' -s "$STATES" -a -o jobid,state,start,end | grep '39125549.0'
Comment 2 Scott Hilton 2021-05-04 11:39:50 MDT
Kris,

The first statement makes sense. sacct will select all jobs that were running during a certain period even if they start before or continue beyond the time period specified by the query.
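
As a rough illustration of that selection rule, the sketch below models it as a plain interval-overlap test in Python. It is not sacct's actual implementation: the function names `overlaps` and `ts` are made up for this example, and treating both window boundaries as inclusive is an assumption. The timestamps are those of step 39228614.2 and the first query above.

```python
from datetime import datetime


def ts(text):
    """Parse a timestamp in the format sacct printed above."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S")


def overlaps(start, end, win_start, win_end):
    """True if [start, end] shares any instant with [win_start, win_end].

    Inclusive boundaries are an assumption of this sketch.
    """
    return start <= win_end and end >= win_start


# Step 39228614.2 from the first query's output.
step_start, step_end = ts("2021-04-07T23:56:21"), ts("2021-04-08T00:00:12")
# The -S / -E values of the first query.
win_start, win_end = ts("2021-04-07T23:30:00"), ts("2021-04-07T23:59:59")

selected = overlaps(step_start, step_end, win_start, win_end)
ends_after_window = step_end > win_end
print(selected, ends_after_window)
```

Under this model a record can be selected and still have an End later than the -E time, which matches the rows returned by the first query.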

The fact that these jobs don't show up in the second query puzzles me.

-Scott
Comment 3 Scott Hilton 2021-05-04 11:51:04 MDT
Kris,

Can you try this? I just want to check that grep and the --state option aren't breaking it.

sacct -P -S '2021-04-08T00:00:00' -E '2021-04-08T00:01:00' -a -o jobid,state,start,end -j 39125549,39125551,39125552,39125557,39228614

-Scott
Comment 5 Scott Hilton 2021-05-05 13:46:52 MDT
Kris,

To amend my first comment: "sacct will select all jobs that were running during a certain period even if they start before or continue beyond the time period specified by the query." When -s (--state) is used, the job must have been in one of the given states during the time period.

Most of the filters are applied only to jobs, not to steps, and if a job doesn't pass a filter, none of its steps will be shown.

I would guess that in the second instance the job was not in any of the specified states (it was probably in "running") between 2021-04-08T00:00:00 and 2021-04-08T00:01:00.

Could you run this query so I can see a whole job and all its steps?
sacct -a -o jobid,state,start,end -j 39125549

-Scott
Comment 6 Kris Whetham 2021-05-05 15:08:12 MDT
Hi Scott, thanks for the additional info. Please find the output below.


> Could you run this query so I can see a whole job and all its steps
> sacct -a -o jobid,state,start,end -j 39125549
> 
> -Scott


sacct -a -o jobid,state,start,end -j 39125549
       JobID      State               Start                 End 
------------ ---------- ------------------- ------------------- 
39125549     CANCELLED+ 2021-04-05T02:07:20 2021-04-07T23:59:38 
39125549.ba+  CANCELLED 2021-04-05T02:07:20 2021-04-07T23:59:40 
39125549.ex+  COMPLETED 2021-04-05T02:07:20 2021-04-07T23:59:38 
39125549.0    CANCELLED 2021-04-05T02:07:39 2021-04-08T00:00:22
Comment 8 Scott Hilton 2021-05-06 09:11:04 MDT
Kris,

It looks like step 39125549.0 took a little while to shut down fully: it ended 44 seconds after its parent job. Because the parent job didn't match the second query, the step didn't show up.
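
To make this concrete, here is a minimal Python sketch of a two-level filter applied to the records shown above for job 39125549. It is only a model of the behaviour described in this ticket, not sacct's code: the helper names are hypothetical, the state filter is left out, and inclusive window boundaries are assumed.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"


def ts(text):
    return datetime.strptime(text, FMT)


def overlaps(start, end, win_start, win_end):
    # Inclusive interval overlap (an assumption of this sketch).
    return start <= win_end and end >= win_start


def visible_steps(job, steps, win_start, win_end):
    """Filter the job first; only if it passes are its steps considered."""
    if not overlaps(job["start"], job["end"], win_start, win_end):
        return []
    return [s["id"] for s in steps
            if overlaps(s["start"], s["end"], win_start, win_end)]


# Values copied from the sacct output for job 39125549.
job = {"id": "39125549",
       "start": ts("2021-04-05T02:07:20"), "end": ts("2021-04-07T23:59:38")}
steps = [{"id": "39125549.0",
          "start": ts("2021-04-05T02:07:39"), "end": ts("2021-04-08T00:00:22")}]

# How long the step outlived its parent job.
lag_seconds = int((steps[0]["end"] - job["end"]).total_seconds())

first = visible_steps(job, steps,
                      ts("2021-04-07T23:30:00"), ts("2021-04-07T23:59:59"))
second = visible_steps(job, steps,
                       ts("2021-04-08T00:00:00"), ts("2021-04-08T00:01:00"))
print(lag_seconds, first, second)
```

In this model the step is listed for the first window and hidden for the second, even though the step itself was still active after midnight, because its parent job had already ended at 23:59:38.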

-Scott
Comment 9 Scott Hilton 2021-05-10 09:15:29 MDT
Kris,

Does this answer your question? Do you have any follow-up questions?

-Scott
Comment 10 Kris Whetham 2021-05-10 11:49:27 MDT
Hi Scott, 
Adding Caleb
Comment 11 Kris Whetham 2021-05-10 11:50:53 MDT
Hi Scott, 
Adding Caleb to the case. 

-Kris
Comment 12 calebh 2021-05-10 11:53:57 MDT
Thanks for the info, Scott. To confirm: the time range filtering applies only to jobs and not to job steps? If that is the case, then I have no further questions; feel free to close the ticket.
Comment 13 Scott Hilton 2021-05-10 14:50:30 MDT
Caleb,

The time filtering applies first to jobs, then to steps. For a step to appear, both the job and the step have to be in the time frame.
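
If the goal is to bucket records strictly by their End time, one possible approach (not something proposed in this ticket, only a hedged sketch) is to query a wider window and filter the parsable output on the client side. The Python below assumes pipe-delimited lines in the `jobid|state|start|end` layout used above; `ended_within` is a hypothetical helper, and the sample lines are copied from the first query's output.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"


def ended_within(lines, win_start, win_end):
    """Return IDs of jobid|state|start|end records whose End is in the window."""
    kept = []
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        try:
            end = datetime.strptime(fields[3], FMT)
        except ValueError:
            continue  # skip a header row or any End that is not a timestamp
        if win_start <= end <= win_end:
            kept.append(fields[0])
    return kept


sample = [
    "JobID|State|Start|End",
    "39125549.0|CANCELLED|2021-04-05T02:07:39|2021-04-08T00:00:22",
    "39228614.2|COMPLETED|2021-04-07T23:56:21|2021-04-08T00:00:12",
]

in_window = ended_within(sample,
                         datetime(2021, 4, 7, 23, 30, 0),
                         datetime(2021, 4, 7, 23, 59, 59))
in_next = ended_within(sample,
                       datetime(2021, 4, 8, 0, 0, 0),
                       datetime(2021, 4, 8, 0, 1, 0))
print(in_window, in_next)
```

With this post-filter each record falls into exactly one period, determined by its own End value rather than by its parent job's time range.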

-Scott
Comment 14 Scott Hilton 2021-05-10 14:51:23 MDT
Closing ticket. If you have follow-up questions, feel free to reopen it.

-Scott