$ sacct --start 2024-03-13T00:00:00 --end 2024-03-15T00:00:00 --allusers -X | grep 32774910
32774910_0    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0
32774910_1    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0
32774910_2    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0
32774910_3    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0
32774910_4    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0
32774910_5    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0
32774910_6    Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0
32774910_10   Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0
32774910_14   Q-q100-SM      gpu24 conf-eccv+          8  COMPLETED      0:0

But specifying the same date range and state "CA" (Cancelled):

$ sacct --start 2024-03-13T00:00:00 --end 2024-03-15T00:00:00 --allusers -X -s CA | grep 32774910
32774910_7    Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0
32774910_8    Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0
32774910_9    Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0
32774910_11   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0
32774910_12   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0
32774910_13   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0
32774910_15   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0
32774910_16   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0
32774910_17   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0
32774910_18   Q-q100-SM      gpu24 conf-eccv+          8 CANCELLED+      0:0
32774910_[1+  Q-q100-SM gpu,gpu24+ conf-eccv+          0 CANCELLED+      0:0

All the missing allocations now appear. Is this a bug? (If not, what is necessary to fetch all allocations in _any_ state?)

-Greg
And:

$ sacct -j 35914175_112 --format jobid,start,end,state
       JobID               Start                 End      State
------------ ------------------- ------------------- ----------
35914175_112 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED
35914175_11+ 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED
35914175_11+ 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED

1/ No entry returned:

$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X | grep 35914175_112
$

2/ The same search with state "R", although the job is completed:

$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X --state R | grep 35914175_112
35914175_112 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED

3/ Requesting state "CD":

$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X --state CD | grep 35914175_112
35914175_112 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED

-Greg
Hi Greg,

Indeed, it seems something is wrong, as you should be getting the complete job list when using sacct. Could you upload your slurm.conf and slurmdbd.conf?

Also, which user are you using to submit these jobs, and from which user are you making the sacct queries?

Thank you,

Miquel
Created attachment 39862 [details] slurm.conf
Created attachment 39863 [details] slurmdbd.conf
(In reply to Miquel Comas from comment #2)
> Indeed, it seems something is wrong, as you should be getting the complete
> job list when using sacct. Could you upload your slurm.conf and
> slurmdbd.conf?
>
> Also, which user are you using to submit these jobs, and from which user
> are you making the sacct queries?

Hi Miquel,

I'm running sacct from my personal user account; however, I've just tested the "root" account and the same results are obtained. I cannot tell you what commands were used during job submission - the jobs ran around March 2023.

-Greg
Hi Greg,

Thank you for the configs. At first glance, it does not look like there is a misconfiguration.

Could you reproduce the problem with these debug options enabled and then share the slurmdbd.log with us?

(in slurmdbd.conf):
> DebugFlags=DB_QUERY,DB_ASSOC
> DebugLevel=debug

[1] https://slurm.schedmd.com/slurmdbd.conf.html#OPT_DB_QUERY
[2] https://slurm.schedmd.com/slurmdbd.conf.html#OPT_DB_ASSOC
[3] https://slurm.schedmd.com/slurmdbd.conf.html#OPT_DebugLevel

Additionally, please also add this debug option and share the slurmctld.log when making the sacct call.

(in slurm.conf):
> DebugFlags=DBD_Agent
> DebugLevel=debug

[4] https://slurm.schedmd.com/slurm.conf.html#OPT_DBD_Agent

Once you have made the calls with the debug options enabled, you can revert them to their original state to avoid filling the logs with extra information.

When reproducing the issue, please make the same sacct calls you did in comment 1. This way we will be able to compare the different queries to the database.

Thank you,

Miquel
Hi Greg, do you need further assistance with your question? Best regards,
Hi Greg, are there any updates on the issue? Thank you,
Created attachment 40201 [details] slurmctld.log
Created attachment 40202 [details] slurmdbd.log
Actions taken as requested; log files attached.

Note that "DebugLevel=debug" is not valid in slurm.conf; "SlurmctldDebug=debug" was used instead.
Hi Greg,

Thank you for the logs. I have been digging into them and would like to request another debug flag to gather more information.

Please add DB_JOB to DebugFlags in slurmdbd.conf. Then reconfigure the database with `sacctmgr reconfigure` and provide the slurmctld and slurmdbd logs of an "sacct -X" (you can add a --start and --end range if you want) where the "cancelled" jobs do not appear, and then run "sacct -X -s CA" (the same calls done in comment 1 should suffice).

This will log the queries that are run in slurmdbd when job information is gathered, and we will then be able to see whether a difference in the fields between a plain "sacct -X" and the call specifying the cancelled job state could be causing this issue.

Thank you,
Hi Greg,

Were you able to apply these log changes?

Best regards,

Miquel
Hi Greg,

Is there any news from your side?

Best regards,
I accidentally came across this ticket, and it reminded me of a similar problem I noticed with sacct years ago. It could only be solved by supplying '-s R'. Here are the comments from my code, in case they are useful:

"
Eligible timestamp as 'Unknown'. By default, sacct shows only jobs with an Eligible time. Some jobs do not have an Eligible time (i.e. it is 'Unknown'). Many such jobs are CANCELLED and have zero usage, but some are not. Such jobs can only be explicitly retrieved by their JobIDRaw, or by supplying '-s R' in addition to the time interval (-S -E) that encompasses the Start time. Note that the time interval alone, without '-s R', will not capture such jobs, because their Eligible time is unknown.
"

And here is the relevant note from the official documentation: "NOTE: If no -s (--state) option is given sacct will display *eligible* jobs...".
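As an aside, the affected records are easy to spot programmatically once you have the accounting data in hand. The following is a minimal sketch (not part of sacct itself): it parses `--parsable2`-style output and lists jobs whose Eligible field is "Unknown" - i.e. the ones a plain time-window query would skip. The sample data and helper name are illustrative, modeled on the output shown in this ticket.

```python
# Sketch: find jobs whose Eligible timestamp is "Unknown" in
# `sacct --parsable2 --format=JobIDRaw,Eligible,State` output.
# SAMPLE mimics real sacct output; the job IDs are illustrative.

SAMPLE = """\
JobIDRaw|Eligible|State
35917415|Unknown|COMPLETED
35917416|2024-11-01T11:31:56|COMPLETED
35917417|2024-11-01T11:40:02|CANCELLED"""

def jobs_without_eligible_time(parsable2_output: str) -> list[str]:
    """Return JobIDRaw values whose Eligible column reads 'Unknown'."""
    lines = parsable2_output.splitlines()
    header = lines[0].split("|")
    idx_id = header.index("JobIDRaw")
    idx_el = header.index("Eligible")
    return [row.split("|")[idx_id]
            for row in lines[1:]
            if row.split("|")[idx_el] == "Unknown"]

print(jobs_without_eligible_time(SAMPLE))  # ['35917415']
```

In practice the input would come from piping sacct into the script rather than a hard-coded string.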
Hi Greg,

As Sergiy said:

> It could only be solved by supplying '-s R'. Here are the comments from my
> code in case it is useful:
> "
> Eligible timestamp as 'Unknown'. By default, sacct shows only jobs with
> Eligible time. Some jobs do not have Eligible time (i.e. it is 'Unknown').
> Many such jobs are CANCELLED and have zero usage, but some are not. Such
> jobs can only be explicitly retrieved by their JobIDRaw, or by supplying
> '-s R' in addition to the time interval (-S -E) that encompasses the Start
> time. Note that the time interval alone without '-s R' will not capture
> such jobs, because their Eligible is unknown.
> "
> From the official documentation: "NOTE: If no -s (--state)
> option is given sacct will display *eligible* jobs... ".

If your jobs did not have an Eligible time, it is possible that you have run into this. Could you confirm whether this is the case?

Best regards,
Thanks Sergiy, and Miquel.

I do not think this has anything to do with being "eligible". The job is queryable using its Job ID, and while short, it did run to completion:

$ sacct -j 35914175_112 --format=jobid,jobidraw,start,end,state
       JobID     JobIDRaw               Start                 End      State
------------ ------------ ------------------- ------------------- ----------
35914175_112     35917415 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED
35914175_11+ 35917415.ba+ 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED
35914175_11+ 35917415.ex+ 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED

Even though Slurm has been upgraded and is now on 23.11.4, the original issue still exists:

$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X | grep 35914175_112
$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X --state R | grep 35914175_112
35914175_112 2024-11-01T11:31:56 2024-11-01T14:34:13  COMPLETED
$

A standard query to fetch all job data doesn't return information about this particular array member unless '-s R' is used.
Hi Greg,

Thank you for your response. This way we can rule out more possibilities.

Is it feasible for you to apply the debug options requested in comment 12 and upload the new logs?

Thank you for your time,

Miquel
Miquel,

I've attached the requested slurmdbd.log. There was no relevant information in the slurmctld.log.

Note that comment #12 specifically mentions a "cancelled" job - the job that is missing isn't cancelled.
Created attachment 41515 [details] slurmdbd.log with debug flag DB_JOB and DebugLevel=verbose
$ date; sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X | grep 35914175_112; date
Tue Apr 22 12:20:38 PM +03 2025
Tue Apr 22 12:20:39 PM +03 2025
$
Can you also print out the Eligible column for these jobs?

--format=jobidraw,submit,eligible,start,end,elapsedraw,state
$ sacct -j 35914175_112 --format=jobidraw,submit,eligible,start,end,elapsedraw,state
    JobIDRaw              Submit            Eligible               Start                 End ElapsedRaw      State
------------ ------------------- ------------------- ------------------- ------------------- ---------- ----------
    35917415 2024-10-31T19:38:43             Unknown 2024-11-01T11:31:56 2024-11-01T14:34:13      10937  COMPLETED
35917415.ba+ 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T14:34:13      10937  COMPLETED
35917415.ex+ 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T14:34:13      10937  COMPLETED
Ok, so Eligible is 'Unknown' and it matches the description of the problem in comments 1 and 15. My conclusion at the time was that it was an unfortunate but expected behaviour since it was documented.
Sergiy,

Is there a fix? And is there any way to know how many jobs aren't being reported when no job state is specified?

-Greg
> Is there a fix?

This issue forced me to always query jobs for a given time interval by explicitly specifying all possible states. It is a workaround, not a fix, I guess:

TZ=UTC sacct --duplicates --allusers --allocations --parsable2 --delimiter='|' --format=Account,AllocCPUS,etc --state=BF,CA,CD,DL,F,NF,OOM,PD,PR,R,RQ,RS,RV,S,TO

See https://slurm.schedmd.com/sacct.html#SECTION_JOB-STATE-CODES

> Any idea to know how many jobs aren't being reported without specifying job
> state?

My old comments in the code suggest that most jobs with Unknown in Eligible have zero run time (CANCELLED state), but not all of them. How many exactly is not clear.
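One way to answer the "how many" question is to run the same time-window query twice - once with no --state and once with every state listed, as in the workaround above - and diff the JobID sets. The following is a minimal sketch of that comparison; the sample output strings and job IDs are illustrative stand-ins for real sacct output.

```python
# Sketch: count jobs missed by a plain time-window sacct query by
# comparing it against the same query with all states specified.
# In practice the two inputs would come from e.g.
#   sacct -X --parsable2 --format=JobID -S <start> -E <end>
# with and without --state=BF,CA,CD,DL,F,NF,OOM,PD,PR,R,RQ,RS,RV,S,TO

def job_ids(parsable2_output: str) -> set[str]:
    """Extract the JobID column from `sacct --parsable2` output."""
    lines = [ln for ln in parsable2_output.splitlines() if ln]
    idx = lines[0].split("|").index("JobID")
    return {row.split("|")[idx] for row in lines[1:]}

# Illustrative captures of the two queries:
default_out  = "JobID\n35914175_111\n35914175_113"
allstate_out = "JobID\n35914175_111\n35914175_112\n35914175_113"

missing = job_ids(allstate_out) - job_ids(default_out)
print(sorted(missing))  # ['35914175_112']
```

The size of `missing` is the number of jobs that the default query silently drops for the chosen window.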
Hi Greg,

The fact that, without specifying the job id, sacct only shows jobs which are eligible is documented [1]. What should be clarified is why this job in particular has an Unknown eligible time when it did run and its steps do have one.

> $ sacct -j 35914175_112 --format=jobidraw,submit,eligible,start,end,elapsedraw,state
>     JobIDRaw              Submit            Eligible               Start                 End ElapsedRaw      State
> ------------ ------------------- ------------------- ------------------- ------------------- ---------- ----------
>     35917415 2024-10-31T19:38:43             Unknown 2024-11-01T11:31:56 2024-11-01T14:34:13      10937  COMPLETED
> 35917415.ba+ 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T14:34:13      10937  COMPLETED
> 35917415.ex+ 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T11:31:56 2024-11-01T14:34:13      10937  COMPLETED

To dig into why this is happening, could you clarify:
- Has it happened with other jobs as well?
- Do you have the job's submit line?
- Do you know if this job was requeued?

[1] https://slurm.schedmd.com/sacct.html#OPT_jobs

Best regards,

Miquel
Hi Greg,

Could you provide the information requested in comment 28 when possible?

> To dig into why this is happening, could you clarify:
> - Has it happened with other jobs as well?
> - Do you have the job's submit line?
> - Do you know if this job was requeued?

Best regards,

Miquel
Hi Greg,

I will be closing this ticket, as it has been a month without updates. Please do not hesitate to reopen it if you are able to add this information.

Thank you,

Miquel