Ticket 21506 - sacct does not return records for a specific job when searching between dates and the state is CA
Summary: sacct does not return records for a specific job when searching between dates...
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 24.05.4
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Miquel Comas
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2024-11-23 23:26 MST by Greg Wickham
Modified: 2025-02-25 09:04 MST

See Also:
Site: KAUST
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (4.58 KB, text/plain) - 2024-11-25 04:37 MST, Greg Wickham
slurmdbd.conf (874 bytes, text/plain) - 2024-11-25 04:37 MST, Greg Wickham
slurmctld.log (4.16 MB, text/plain) - 2024-12-19 01:39 MST, Greg Wickham
slurmdbd.log (27.14 KB, text/plain) - 2024-12-19 01:39 MST, Greg Wickham

Description Greg Wickham 2024-11-23 23:26:39 MST
$ sacct --start 2024-03-13T00:00:00 --end 2024-03-15T00:00:00 --allusers  -X | grep 32774910
32774910_0    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_1    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_2    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_3    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_4    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_5    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_6    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_10   Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 
32774910_14   Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0 

but specifying the same date range and state "CA" (Cancelled):

$ sacct --start 2024-03-13T00:00:00 --end 2024-03-15T00:00:00 --allusers  -X -s CA | grep 32774910
32774910_7    Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_8    Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_9    Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_11   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_12   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_13   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_15   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_16   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_17   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_18   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0 
32774910_[1+  Q-q100-SM gpu,gpu24+ conf-eccv+          0 CANCELLED+      0:0 

All of the allocations missing from the first query are returned.

Is this a bug?

(If not, what is necessary to fetch all allocations in _any_ state?)
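
A possible workaround, not verified here, might be to pass -s an explicit comma-separated list of terminal states, which sacct accepts; the state set below is illustrative only:

$ sacct --start 2024-03-13T00:00:00 --end 2024-03-15T00:00:00 --allusers -X -s CD,CA,F,NF,TO,OOM,PR | grep 32774910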

   -Greg
Comment 1 Greg Wickham 2024-11-24 01:00:18 MST
And:

$ sacct -j 35914175_112 --format jobid,start,end,state
JobID                      Start                 End      State 
------------ ------------------- ------------------- ---------- 
35914175_112 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED 
35914175_11+ 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED 
35914175_11+ 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED 

1/ Searching by the date range alone returns no entry:

$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X | grep 35914175_112
$ 

2/ The same search with state "R" returns the job, even though it is COMPLETED:

$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X --state R | grep 35914175_112
35914175_112 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED 

3/ Requesting state "CD" also returns it:

$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X --state CD | grep 35914175_112
35914175_112 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED 


   -Greg
Comment 2 Miquel Comas 2024-11-25 02:34:52 MST
Hi Greg,

indeed, it seems something is wrong, as you should be getting the complete job list when using sacct. Could you upload your slurm.conf and slurmdbd.conf?

Also, which user submitted these jobs, and which user are you using to make the sacct queries?
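
If helpful, one quick sanity check, offered only as a suggestion, is the AdminLevel of the querying account in the accounting database; <username> below is a placeholder:

$ sacctmgr show user <username> format=User,AdminLevel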

Thank you,

Miquel
Comment 3 Greg Wickham 2024-11-25 04:37:36 MST
Created attachment 39862 [details]
slurm.conf
Comment 4 Greg Wickham 2024-11-25 04:37:53 MST
Created attachment 39863 [details]
slurmdbd.conf
Comment 5 Greg Wickham 2024-11-25 04:41:19 MST
(In reply to Miquel Comas from comment #2)
> Hi Greg,
> 
> indeed, it seems something is wrong as you should be getting the complete
> job list when using sacct. Could you upload your slurm.conf and slurmdbd.conf?
> 
> Also, what user are you using to submit these jobs and from which user are
> you making the sacct queries?
> 
> Thank you,
> 
> Miquel

Hi Miquel,


I'm running sacct from my personal user account; however, I've just tested with the "root" account and the results are the same.


I cannot tell you what commands were used during job submission - the jobs ran around March 2023.

   -Greg
Comment 6 Miquel Comas 2024-11-26 02:23:21 MST
Hi Greg,

Thank you for the configs. At first glance, it does not look like there is a misconfiguration. Could you reproduce the problem with these debug options enabled and then share the slurmdbd.log with us?

(in slurmdbd.conf):
> DebugFlags=DB_QUERY,DB_ASSOC
> DebugLevel=debug

[1] https://slurm.schedmd.com/slurmdbd.conf.html#OPT_DB_QUERY
[2] https://slurm.schedmd.com/slurmdbd.conf.html#OPT_DB_ASSOC
[3] https://slurm.schedmd.com/slurmdbd.conf.html#OPT_DebugLevel

Additionally, please add these debug options and share the slurmctld.log from when you make the sacct calls.

(in slurm.conf):
> DebugFlags=DBD_Agent
> DebugLevel=debug

[4] https://slurm.schedmd.com/slurm.conf.html#OPT_DBD_Agent

Once you have made the calls with the debug options enabled, you can revert the settings to their original values to avoid filling the logs with extra information.
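
As a side note, and only as a sketch of one possible alternative, the slurmctld settings can usually be toggled at runtime instead of editing slurm.conf (this assumes the normal debug level is "info"):

$ scontrol setdebugflags +DBD_Agent
$ scontrol setdebug debug
(run the sacct calls, then revert)
$ scontrol setdebug info
$ scontrol setdebugflags -DBD_Agent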

When reproducing the issue, please run the same sacct calls you made in Comment 1. This way we will be able to compare the different queries sent to the database.


Thank you,

Miquel
Comment 7 Miquel Comas 2024-12-04 08:04:58 MST
Hi Greg,

do you need further assistance with your question?

Best regards,
Comment 8 Miquel Comas 2024-12-19 01:23:47 MST
Hi Greg,

are there any updates on the issue?


Thank you,
Comment 9 Greg Wickham 2024-12-19 01:39:38 MST
Created attachment 40201 [details]
slurmctld.log
Comment 10 Greg Wickham 2024-12-19 01:39:54 MST
Created attachment 40202 [details]
slurmdbd.log
Comment 11 Greg Wickham 2024-12-19 01:41:03 MST
Actions taken as requested; log files attached.

Note that "DebugLevel=debug" is not valid in slurmctld.conf.

"SlurmctldDebug=debug" was used instead.
Comment 12 Miquel Comas 2025-01-02 07:54:11 MST
Hi Greg,

thank you for the logs. I have been digging into them and I would like to request another debug flag to gather more information.
Please add DB_JOB to DebugFlags in slurmdbd.conf, then reconfigure the database daemon with `sacctmgr reconfigure`. After that, provide the slurmctld and slurmdbd logs from an "sacct -X" call (you can add a --start and --end range if you want) where the "cancelled" jobs do not appear, and then from "sacct -X -s CA" (the same calls made in Comment 1 should suffice).

This will provide the queries that slurmdbd runs when gathering job information, and then we will be able to see whether there is a difference between a "plain" `sacct -X` query and the one specifying the cancelled job state that could be causing this issue.
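
Putting it together, the intended settings and calls might look like the following (the date range is reused from Comment 1 purely as an example):

(in slurmdbd.conf):
> DebugFlags=DB_QUERY,DB_ASSOC,DB_JOB

$ sacctmgr reconfigure
$ sacct --start 2024-11-01 --end 2024-11-02 --allusers -X
$ sacct --start 2024-11-01 --end 2024-11-02 --allusers -X -s CA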


Thank you,
Comment 13 Miquel Comas 2025-01-15 08:06:15 MST
Hi Greg,

were you able to apply these logging changes?


Best regards,

Miquel
Comment 14 Miquel Comas 2025-01-24 02:51:37 MST
Hi Greg,

is there any news from your side?

Best regards,
Comment 15 Sergiy Khan 2025-02-25 09:04:17 MST
I came across this ticket by accident and it reminded me of a similar problem I noticed with sacct years ago. It could only be solved by supplying '-s R'. Here are the comments from my code in case they are useful:

"
Eligible timestamp as 'Unknown'. By default, sacct shows only jobs with Eligible time. Some jobs do not have Eligible time (i.e. it is 'Unknown'). Many such jobs are CANCELLED and have zero usage, but some are not. Such jobs can only be explicitly retrieved by their JobIDRaw, or by supplying '-s R' in addition to the time interval (-S -E) that encompasses the Start time. Note that the time interval alone without '-s R' will not capture such jobs, because their Eligible is unknown.
"

And here is the relevant note from the official documentation: "NOTE: If no -s (--state) option is given sacct will display *eligible* jobs... ".