Ticket 21506 - sacct does not return records for a specific job when searching between dates and the state is CA
Summary: sacct does not return records for a specific job when searching between dates...
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 24.05.4
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Miquel Comas
Reported: 2024-11-23 23:26 MST by Greg Wickham
Modified: 2025-05-20 09:17 MDT
Site: KAUST


Attachments
slurm.conf (4.58 KB, text/plain)
2024-11-25 04:37 MST, Greg Wickham
slurmdbd.conf (874 bytes, text/plain)
2024-11-25 04:37 MST, Greg Wickham
slurmctld.log (4.16 MB, text/plain)
2024-12-19 01:39 MST, Greg Wickham
slurmdbd.log (27.14 KB, text/plain)
2024-12-19 01:39 MST, Greg Wickham
slurmdbd.log with debug flag DB_JOB and Level=verbose (5.95 KB, text/plain)
2025-04-22 03:31 MDT, Greg Wickham

Description Greg Wickham 2024-11-23 23:26:39 MST
$ sacct --start 2024-03-13T00:00:00 --end 2024-03-15T00:00:00 --allusers  -X | grep 32774910
32774910_0    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_1    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_2    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_3    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_4    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_5    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_6    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_10   Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_14   Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 

but specifying the same date range and state "CA" (Cancelled):

$ sacct --start 2024-03-13T00:00:00 --end 2024-03-15T00:00:00 --allusers  -X -s CA | grep 32774910
32774910_7    Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_8    Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_9    Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_11   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_12   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_13   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_15   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_16   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_17   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_18   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_[1+  Q-q100-SM gpu,gpu24+ conf-eccv+          0 CANCELLED+      0:0 

All the missing allocations are presented.

Is this a bug?

(If not, what is necessary to fetch all allocations in _any_ state?)

   -Greg
Comment 1 Greg Wickham 2024-11-24 01:00:18 MST
And:

$ sacct -j 35914175_112 --format jobid,start,end,state
JobID                      Start                 End      State 
------------ ------------------- ------------------- ---------- 
35914175_112 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED 
35914175_11+ 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED 
35914175_11+ 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED 

1/ No entry returned

$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X | grep 35914175_112
$ 

2/ same search but with state "R" but job is completed:

$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X --state R | grep 35914175_112
35914175_112 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED 

3/ request state "CD":

$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X --state CD | grep 35914175_112
35914175_112 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED 


   -Greg
Comment 2 Miquel Comas 2024-11-25 02:34:52 MST
Hi Greg,

indeed, it seems something is wrong, as you should be getting the complete job list when using sacct. Could you upload your slurm.conf and slurmdbd.conf?

Also, what user are you using to submit these jobs and from which user are you making the sacct queries?

Thank you,

Miquel
Comment 3 Greg Wickham 2024-11-25 04:37:36 MST
Created attachment 39862 [details]
slurm.conf
Comment 4 Greg Wickham 2024-11-25 04:37:53 MST
Created attachment 39863 [details]
slurmdbd.conf
Comment 5 Greg Wickham 2024-11-25 04:41:19 MST
(In reply to Miquel Comas from comment #2)
> Hi Greg,
> 
> indeed, it seems something is wrong as you should be getting the complete
> job list when using sacct. Could you upload your slurm.conf and slurmdbd.conf?
> 
> Also, what user are you using to submit these jobs and from which user are
> you making the sacct queries?
> 
> Thank you,
> 
> Miquel

Hi Miquel,


I'm running sacct from my personal user account, however I've just tested the "root" account and the same results are obtained.


I cannot tell you what commands were used during job submission - the jobs ran around March 2024.

   -Greg
Comment 6 Miquel Comas 2024-11-26 02:23:21 MST
Hi Greg,

Thank you for the configs. At first glance, it does not look like there is a misconfiguration. Could you reproduce the problem with these debug options enabled and then share the slurmdbd.log with us?

(in slurmdbd.conf):
> DebugFlags=DB_QUERY,DB_ASSOC
> DebugLevel=debug

[1] https://slurm.schedmd.com/slurmdbd.conf.html#OPT_DB_QUERY
[2] https://slurm.schedmd.com/slurmdbd.conf.html#OPT_DB_ASSOC
[3] https://slurm.schedmd.com/slurmdbd.conf.html#OPT_DebugLevel

Additionally, please also add this debug option and share the slurmctld.log when making the sacct call.

(in slurm.conf):
> DebugFlags=DBD_Agent
> DebugLevel=debug

[4] https://slurm.schedmd.com/slurm.conf.html#OPT_DBD_Agent

Once you have made the calls with the debug options enabled you can revert them to your original state to avoid filling the logs with extra information.

When reproducing the issue, please make the calls to sacct you did in Comment 1. This way we will be able to compare different queries to the database.


Thank you,

Miquel
Comment 7 Miquel Comas 2024-12-04 08:04:58 MST
Hi Greg,

do you need further assistance with your question?

Best regards,
Comment 8 Miquel Comas 2024-12-19 01:23:47 MST
Hi Greg,

are there any updates on the issue?


Thank you,
Comment 9 Greg Wickham 2024-12-19 01:39:38 MST
Created attachment 40201 [details]
slurmctld.log
Comment 10 Greg Wickham 2024-12-19 01:39:54 MST
Created attachment 40202 [details]
slurmdbd.log
Comment 11 Greg Wickham 2024-12-19 01:41:03 MST
Actions taken as requested; log files attached.

Note that "DebugLevel=debug" is not valid in slurm.conf.

"SlurmctldDebug=debug" was used instead.
Comment 12 Miquel Comas 2025-01-02 07:54:11 MST
Hi Greg,

thank you for the logs. I have been digging into them and I would like to request another debug flag to gather more information.
Please add DB_JOB to DebugFlags in slurmdbd.conf. Then reconfigure the database with `sacctmgr reconfigure` and provide the slurmctld and slurmdbd logs for an "sacct -X" call (you can add a --start and --end range if you want) in which the "cancelled" jobs do not appear, followed by an "sacct -X -s CA" call (the same calls made in Comment 1 should suffice).

This will provide the queries that are run in the slurmdbd when information about jobs is gathered, and then we will be able to know if there is a difference in the fields between a "plain" `sacct -X` and the one specifying the cancelled job state that could be causing this issue.


Thank you,
Comment 13 Miquel Comas 2025-01-15 08:06:15 MST
Hi Greg,

were you able to apply these log changes?


Best regards,

Miquel
Comment 14 Miquel Comas 2025-01-24 02:51:37 MST
Hi Greg,

is there any news from your side?

Best regards,
Comment 15 Sergiy Khan 2025-02-25 09:04:17 MST
I accidentally came across this ticket and it reminded me about a similar problem I noticed with sacct years ago. It could only be solved by supplying '-s R'. Here are the comments from my code in case it is useful:

"
Eligible timestamp as 'Unknown'. By default, sacct shows only jobs with Eligible time. Some jobs do not have Eligible time (i.e. it is 'Unknown'). Many such jobs are CANCELLED and have zero usage, but some are not. Such jobs can only be explicitly retrieved by their JobIDRaw, or by supplying '-s R' in addition to the time interval (-S -E) that encompasses the Start time. Note that the time interval alone without '-s R' will not capture such jobs, because their Eligible is unknown.
"

And here is the note from the official documentation: "NOTE: If no -s (--state) option is given sacct will display *eligible* jobs... ".
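
The condition described above can be checked against saved sacct output. The following is a minimal sketch, not part of Slurm: it assumes the listing was produced with --parsable2 (pipe-delimited, header line first), that it includes the JobIDRaw and Eligible columns, and that a missing eligible time is printed as the literal string "Unknown". The function name unknown_eligible is hypothetical.

```shell
# Print the JobIDRaw of every row whose Eligible field is "Unknown".
# Reads pipe-delimited sacct output (header line first) on stdin.
# Assumption: a missing eligible time is shown as the string "Unknown".
unknown_eligible() {
    awk -F'|' '
        NR == 1 {
            # Locate the columns by header name so field order does not matter.
            for (i = 1; i <= NF; i++) {
                if ($i == "JobIDRaw") id = i
                if ($i == "Eligible") el = i
            }
            next
        }
        # Only report rows when both columns were found in the header.
        id && el && $el == "Unknown" { print $id }
    '
}

# Possible usage (assumed invocation, same date range as Comment 1):
#   sacct -S 2024-11-01 -E 2024-11-02 --allusers -X --state=R --parsable2 \
#       --format=JobIDRaw,Eligible,Start,End,State | unknown_eligible
```

If the listing lacks either column, the function prints nothing rather than guessing a field position.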
Comment 16 Miquel Comas 2025-04-21 01:46:21 MDT
Hi Greg,

As Sergiy said:
> It could only be solved by supplying
> '-s R'. Here are the comments from my code in case it is useful:

> "
> Eligible timestamp as 'Unknown'. By default, sacct shows only jobs with
> Eligible time. Some jobs do not have Eligible time (i.e. it is 'Unknown').
> Many such jobs are CANCELLED and have zero usage, but some are not. Such
> jobs can only be explicitly retrieved by their JobIDRaw, or by supplying '-s
> R' in addition to the time interval (-S -E) that encompasses the Start time.
> Note that the time interval alone without '-s R' will not capture such jobs,
> because their Eligible is unknown.
> "

> From the official documentation: "NOTE: If no -s (--state)
> option is given sacct will display *eligible* jobs... ".

If your jobs did not have Eligible time it is possible that you have run into this. Could you confirm if this is the case?

Best regards,
Comment 17 Greg Wickham 2025-04-22 02:26:42 MDT
Thanks Sergiy, and Miquel.

I do not think this has anything to do with being "eligible".

The job is queryable using the Job ID and, although short, it did run to completion:

$ sacct -j 35914175_112 --format=jobid,jobidraw,start,end,state
JobID        JobIDRaw                   Start                 End      State 
------------ ------------ ------------------- ------------------- ---------- 
35914175_112 35917415     2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED 
35914175_11+ 35917415.ba+ 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED 
35914175_11+ 35917415.ex+ 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED 

Even though Slurm has been upgraded and is now on 23.11.4, the original issue still exists:

$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X | grep 35914175_112
$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X --state R | grep 35914175_112
35914175_112 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED 
$

Using a standard query to fetch all job data doesn't return information about this particular array member unless '-s r' is used.
Comment 18 Miquel Comas 2025-04-22 02:36:40 MDT
Hi Greg,

Thank you for your response. This way we can discard more possibilities. Is it feasible for you to apply the requested debug options mentioned in Comment 12 and to upload the new logs?


Thank you for your time,

Miquel
Comment 19 Greg Wickham 2025-04-22 03:30:30 MDT
Miquel,

I'll attach the slurmdbd.log requested.

There was no information in the slurmctld.log.

In comment #12 there's specific mention of a "cancelled" job - the job that is missing isn't cancelled.
Comment 20 Greg Wickham 2025-04-22 03:31:24 MDT
Created attachment 41515 [details]
slurmdbd.log with debug flag DB_JOB and Level=verbose
Comment 21 Greg Wickham 2025-04-22 03:31:43 MDT
$ date ; sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X | grep 35914175_112; date
Tue Apr 22 12:20:38 PM +03 2025
Tue Apr 22 12:20:39 PM +03 2025
$
Comment 22 Sergiy Khan 2025-04-22 06:59:11 MDT
Can you also print out the Eligible column for these jobs?

--format=jobidraw,submit,eligible,start,end,elapsedraw,state
Comment 23 Greg Wickham 2025-04-22 07:02:00 MDT
$ sacct -j 35914175_112 --format=jobidraw,submit,eligible,start,end,elapsedraw,state
JobIDRaw                  Submit            Eligible               Start                 End ElapsedRaw      State 
------------ ------------------- ------------------- ------------------- ------------------- ---------- ---------- 
35917415     2024-10-31T19:38:43             Unknown 2024-11-01T11:31:56 2024-11-01T14:34:13      10937  COMPLETED 
35917415.ba+ 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T14:34:13      10937  COMPLETED 
35917415.ex+ 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T14:34:13      10937  COMPLETED
Comment 24 Sergiy Khan 2025-04-22 07:10:04 MDT
Ok, so Eligible is 'Unknown' and it matches the description of the problem in comments 1 and 15. My conclusion at the time was that it was an unfortunate but expected behaviour since it was documented.
Comment 25 Greg Wickham 2025-04-22 07:11:46 MDT
Sergiy,

Is there a fix?

Any idea how to find out how many jobs aren't being reported when no job state is specified?

  -Greg
Comment 27 Sergiy Khan 2025-04-22 07:30:49 MDT
> Is there a fix?

This issue forced me to always query jobs for a given time interval by explicitly specifying all possible states. It is a workaround, not a fix, I guess.

TZ=UTC sacct --duplicates --allusers --allocations --parsable2 --delimiter='|' --format=Account,AllocCPUS,etc --state=BF,CA,CD,DL,F,NF,OOM,PD,PR,R,RQ,RS,RV,S,TO

See https://slurm.schedmd.com/sacct.html#SECTION_JOB-STATE-CODES

> Any idea to know how many jobs aren't being reported without specifying job
> state?

My old comments in the code suggest that most jobs with Unknown in Eligible have zero run-time (CANCELLED state), but not all of them. How many is not clear.
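
One way to estimate how many jobs the default query omits is to save two listings for the same interval, one without --state and one with the explicit state list from the workaround in this comment, and compare the job IDs. The following is a minimal sketch under stated assumptions: each file holds one JobIDRaw per line (for example from --noheader --parsable2 --format=JobIDRaw), and the function name missing_from_default and the file names are hypothetical. The two queries are not guaranteed to differ only in the Eligible condition, so the result is an estimate.

```shell
# Print job IDs that appear in the all-states listing ($2) but not in the
# default listing ($1). Each file holds one JobIDRaw per line.
missing_from_default() {
    awk '
        FILENAME == ARGV[1] { seen[$1] = 1; next }   # first file: remember its IDs
        NF && !($1 in seen) && !dup[$1]++            # second file: print unseen IDs once
    ' "$1" "$2"
}

# Possible usage (assumed invocations; the date range is illustrative):
#   sacct -S 2024-11-01 -E 2024-11-02 --allusers -X --noheader --parsable2 \
#       --format=JobIDRaw > default.txt
#   sacct -S 2024-11-01 -E 2024-11-02 --allusers -X --noheader --parsable2 \
#       --format=JobIDRaw \
#       --state=BF,CA,CD,DL,F,NF,OOM,PD,PR,R,RQ,RS,RV,S,TO > allstates.txt
#   missing_from_default default.txt allstates.txt | wc -l
```

The comparison uses FILENAME rather than the common NR == FNR idiom so that an empty default listing still reports every ID from the second file.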
Comment 28 Miquel Comas 2025-04-23 08:32:08 MDT
Hi Greg,

It is documented [1] that, unless a job ID is specified, sacct only shows jobs that are eligible.

What should be clarified is why this job in particular has an Unknown Eligible time when it has run and its steps do have one.

> $ sacct -j 35914175_112 --format=jobidraw,submit,eligible,start,end,elapsedraw,state
> JobIDRaw                  Submit            Eligible               Start                 End ElapsedRaw      State 
> ------------ ------------------- ------------------- ------------------- ------------------- ---------- ---------- 
> 35917415     2024-10-31T19:38:43             Unknown 2024-11-01T11:31:56 2024-11-01T14:34:13      10937  COMPLETED 
> 35917415.ba+ 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T14:34:13      10937  COMPLETED 
> 35917415.ex+ 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T14:34:13      10937  COMPLETED

To help dig into why this is happening, could you clarify the following:
- Has it happened with other jobs as well?
- Do you have its job submit line?
- Do you know if this job was requeued?

[1] https://slurm.schedmd.com/sacct.html#OPT_jobs

Best regards,

Miquel
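
To see whether other jobs are affected, rows with an Unknown Eligible time can be tallied by state from a saved listing. The following is a minimal sketch, not part of Slurm: it assumes --parsable2 output (pipe-delimited, header line first) containing the Eligible and State columns, and the function name tally_unknown_by_state is hypothetical. Only the first word of State is used, so that values such as "CANCELLED by <uid>" are grouped together.

```shell
# Count rows whose Eligible field is "Unknown", grouped by job state.
# Reads pipe-delimited sacct output (header line first) on stdin and
# prints "STATE COUNT" lines sorted by state name.
tally_unknown_by_state() {
    awk -F'|' '
        NR == 1 {
            # Locate the columns by header name.
            for (i = 1; i <= NF; i++) {
                if ($i == "Eligible") el = i
                if ($i == "State")    st = i
            }
            next
        }
        el && st && $el == "Unknown" {
            split($st, w, " ")      # keep only the first word of the state
            count[w[1]]++
        }
        END { for (s in count) print s, count[s] }
    ' | sort
}

# Possible usage (assumed invocation):
#   sacct -S 2024-11-01 -E 2024-11-02 --allusers -X --parsable2 \
#       --format=JobIDRaw,Eligible,State \
#       --state=BF,CA,CD,DL,F,NF,OOM,PD,PR,R,RQ,RS,RV,S,TO | tally_unknown_by_state
```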
Comment 29 Miquel Comas 2025-05-05 03:19:57 MDT
Hi Greg,

could you provide the requested information from comment 28 when possible?
> To help dig into why this is happening, could you clarify the following:
> - Has it happened with other jobs as well?
> - Do you have its job submit line?
> - Do you know if this job was requeued?

Best regards,

Miquel
Comment 30 Miquel Comas 2025-05-20 09:17:38 MDT
Hi Greg,

I will be closing this ticket as it has been a month without updates. Please do not hesitate to reopen it if you are able to add this information.

Thank you,

Miquel