Ticket 11521 - sacct ranges are inaccurate
Summary: sacct ranges are inaccurate
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 20.11.5
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Scott Hilton
Reported: 2021-05-04 07:49 MDT by Kris Whetham
Modified: 2021-07-27 03:40 MDT
Site: FB (PSLA)


Description Kris Whetham 2021-05-04 07:49:01 MDT
sacct returns jobs which end past the -E time. Moreover, they cannot be recovered when querying the next period.

calebh@h2repl:~$ echo $STATES
out_of_memory,resizing,timeout,cancelled,revoked,deadline,completed,requeued,node_fail,failed,preempted,boot_fail
calebh@h2repl:~$ sacct -P -S '2021-04-07T23:30:00' -E '2021-04-07T23:59:59' -s "$STATES" -a -o jobid,state,start,end | grep '04-08'
39125549.0|CANCELLED|2021-04-05T02:07:39|2021-04-08T00:00:22
39125551.0|CANCELLED|2021-04-05T22:06:00|2021-04-08T00:00:22
39125552.0|CANCELLED|2021-04-05T22:09:54|2021-04-08T00:00:23
39125557.0|CANCELLED|2021-04-05T01:59:34|2021-04-08T00:00:22
39125558.0|CANCELLED|2021-04-05T01:59:36|2021-04-08T00:00:23
39125559.0|CANCELLED|2021-04-05T01:59:36|2021-04-08T00:00:23
39125560.0|CANCELLED|2021-04-05T01:59:36|2021-04-08T00:00:22
39125561.0|CANCELLED|2021-04-05T01:59:36|2021-04-08T00:00:22
39125562.0|CANCELLED|2021-04-05T01:59:36|2021-04-08T00:00:22
39125563.0|CANCELLED|2021-04-05T01:59:36|2021-04-08T00:00:22
39228614.2|COMPLETED|2021-04-07T23:56:21|2021-04-08T00:00:12
calebh@h2repl:~$ sacct -P -S '2021-04-08T00:00:00' -E '2021-04-08T00:01:00' -s "$STATES" -a -o jobid,state,start,end | grep '39125549.0'
Comment 2 Scott Hilton 2021-05-04 11:39:50 MDT
Kris,

The first statement makes sense. sacct will select all jobs that were running during a certain period even if they start before or continue beyond the time period specified by the query.

The fact that these jobs don't show up in the second query puzzles me.

-Scott
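
As a rough illustration of the selection rule described in this comment, the sketch below checks whether a record's [start, end] interval overlaps the -S/-E window. It is only a model of the stated behaviour, not sacct's implementation: to_epoch and overlaps_window are hypothetical helpers, GNU date is assumed, and whether sacct treats the window boundaries as inclusive is an assumption.

```sh
# Convert a sacct-style timestamp (YYYY-MM-DDTHH:MM:SS) to epoch seconds.
# Assumes GNU date; -u treats every timestamp as UTC so the comparison
# does not depend on the local time zone.
to_epoch() {
    date -u -d "$(printf '%s' "$1" | tr 'T' ' ')" +%s
}

# Hypothetical helper, not part of Slurm: succeeds when the record interval
# [start, end] overlaps the query window [win_start, win_end].
# Arguments: win_start win_end start end
overlaps_window() {
    ow_ws=$(to_epoch "$1"); ow_we=$(to_epoch "$2")
    ow_s=$(to_epoch "$3");  ow_e=$(to_epoch "$4")
    [ "$ow_s" -le "$ow_we" ] && [ "$ow_e" -ge "$ow_ws" ]
}

# Step 39125549.0 from the report, checked against the first query's window.
if overlaps_window '2021-04-07T23:30:00' '2021-04-07T23:59:59' \
                   '2021-04-05T02:07:39' '2021-04-08T00:00:22'; then
    echo "selected"
else
    echo "not selected"
fi
```

Under this model the step is selected even though it ends after the -E time, because it was still running inside the window.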
Comment 3 Scott Hilton 2021-05-04 11:51:04 MDT
Kris,

Can you try this? I just want to check that grep and the states option aren't breaking it.

sacct -P -S '2021-04-08T00:00:00' -E '2021-04-08T00:01:00' -a -o jobid,state,start,end -j 39125549,39125551,39125552,39125557,39228614

-Scott
Comment 5 Scott Hilton 2021-05-05 13:46:52 MDT
Kris,

To amend my first comment: "sacct will select all jobs that were running during a certain period even if they start before or continue beyond the time period specified by the query." When -s (--state) is used, that state must exist in the time period.

Most of the filters are applied only to jobs, not steps, and if a job doesn't pass the filter, none of its steps will be shown.

I would guess that in the second instance the job was not in any of the specified states (it was probably in "running") between 2021-04-08T00:00:00 and 2021-04-08T00:01:00.

Could you run this query so I can see the whole job and all its steps?
sacct -a -o jobid,state,start,end -j 39125549

-Scott
Comment 6 Kris Whetham 2021-05-05 15:08:12 MDT
Hi Scott, Thanks for the additional info - please find output below. 


> Could you run this query so I can see a whole job and all its steps
> sacct -a -o jobid,state,start,end -j 39125549
> 
> -Scott


sacct -a -o jobid,state,start,end -j 39125549
       JobID      State               Start                 End 
------------ ---------- ------------------- ------------------- 
39125549     CANCELLED+ 2021-04-05T02:07:20 2021-04-07T23:59:38 
39125549.ba+  CANCELLED 2021-04-05T02:07:20 2021-04-07T23:59:40 
39125549.ex+  COMPLETED 2021-04-05T02:07:20 2021-04-07T23:59:38 
39125549.0    CANCELLED 2021-04-05T02:07:39 2021-04-08T00:00:22
Comment 8 Scott Hilton 2021-05-06 09:11:04 MDT
Kris,

It looks like 39125549.0 took a little while to fully shut down. Its parent job ended 44 seconds earlier. Because the parent job didn't fit the second query, the step didn't show up.

-Scott
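
The timing in this comment can be checked with a little date arithmetic. The sketch below assumes GNU date and uses a hypothetical to_epoch helper; the timestamps are taken from the sacct output above and the -S value of the second query.

```sh
# Convert a sacct-style timestamp (YYYY-MM-DDTHH:MM:SS) to epoch seconds.
# Assumes GNU date; -u treats every timestamp as UTC.
to_epoch() {
    date -u -d "$(printf '%s' "$1" | tr 'T' ' ')" +%s
}

job_end=$(to_epoch '2021-04-07T23:59:38')       # End of job 39125549
step_end=$(to_epoch '2021-04-08T00:00:22')      # End of step 39125549.0
window_start=$(to_epoch '2021-04-08T00:00:00')  # -S of the second query

# How long the step kept running after its parent job had ended.
echo "step ended $((step_end - job_end)) seconds after the job"

# How long before the second query's window the parent job had ended.
echo "job ended $((window_start - job_end)) seconds before the window"
```

The first figure is the 44 seconds mentioned above; the second shows that the parent job had already ended 22 seconds before the second window opened.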
Comment 9 Scott Hilton 2021-05-10 09:15:29 MDT
Kris,

Does this answer your question? Do you have any follow-up questions?

-Scott
Comment 10 Kris Whetham 2021-05-10 11:49:27 MDT
Hi Scott, 
Adding Caleb
Comment 11 Kris Whetham 2021-05-10 11:50:53 MDT
Hi Scott, 
Adding Caleb to the case. 

-Kris
Comment 12 calebh 2021-05-10 11:53:57 MDT
Thanks for the info, Scott. To confirm: the time range filtering applies only to jobs and not to job steps. If this is the case, then I have no further questions; feel free to close the ticket.
Comment 13 Scott Hilton 2021-05-10 14:50:30 MDT
Caleb,

The time filtering applies first to jobs, then to steps. For a step to appear, both the job and the step have to be in the time frame.

-Scott
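
The two-stage rule described here can be modelled roughly as follows. This is a simplified sketch, not how sacct or slurmdbd actually implement it: filter_records, overlaps_window and to_epoch are hypothetical helpers, GNU date is assumed, "in the time frame" is approximated as a plain interval overlap, and the -s state filter is ignored. The records are the ones from the sacct output for job 39125549 earlier in this ticket, with the truncated step names kept as printed.

```sh
# Convert a sacct-style timestamp to epoch seconds (assumes GNU date, UTC).
to_epoch() {
    date -u -d "$(printf '%s' "$1" | tr 'T' ' ')" +%s
}

# Succeeds when [start, end] overlaps [win_start, win_end].
# Arguments: win_start win_end start end
overlaps_window() {
    ow_ws=$(to_epoch "$1"); ow_we=$(to_epoch "$2")
    ow_s=$(to_epoch "$3");  ow_e=$(to_epoch "$4")
    [ "$ow_s" -le "$ow_we" ] && [ "$ow_e" -ge "$ow_ws" ]
}

# Reads "jobid start end" records on stdin; a job record (no dot in the ID)
# must come before its steps. A step is printed only if its parent job
# passed the window check and the step passes it as well.
# Arguments: win_start win_end
filter_records() {
    fr_ws=$1; fr_we=$2; fr_job_ok=no
    while read -r fr_id fr_start fr_end; do
        case $fr_id in
            *.*)  # step record: requires the parent job to have passed
                if [ "$fr_job_ok" = yes ] &&
                   overlaps_window "$fr_ws" "$fr_we" "$fr_start" "$fr_end"; then
                    echo "$fr_id"
                fi
                ;;
            *)    # job record: decides whether its steps are considered at all
                if overlaps_window "$fr_ws" "$fr_we" "$fr_start" "$fr_end"; then
                    fr_job_ok=yes
                    echo "$fr_id"
                else
                    fr_job_ok=no
                fi
                ;;
        esac
    done
}

records='39125549 2021-04-05T02:07:20 2021-04-07T23:59:38
39125549.ba+ 2021-04-05T02:07:20 2021-04-07T23:59:40
39125549.ex+ 2021-04-05T02:07:20 2021-04-07T23:59:38
39125549.0 2021-04-05T02:07:39 2021-04-08T00:00:22'

echo "first window:"
printf '%s\n' "$records" | filter_records '2021-04-07T23:30:00' '2021-04-07T23:59:59'
echo "second window:"
printf '%s\n' "$records" | filter_records '2021-04-08T00:00:00' '2021-04-08T00:01:00'
```

In this model the first window lists the job and all of its steps, while the second window lists nothing: step 39125549.0 overlaps the second window on its own, but it is dropped because its parent job ended before the window began.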
Comment 14 Scott Hilton 2021-05-10 14:51:23 MDT
Closing ticket. If you have follow-up questions, feel free to reopen it.

-Scott