$ sacct --start 2024-03-13T00:00:00 --end 2024-03-15T00:00:00 --allusers -X | grep 32774910
32774910_0   Q-q100-SM  gpu24      conf-eccv+ 8  COMPLETED  0:0
32774910_1   Q-q100-SM  gpu24      conf-eccv+ 8  COMPLETED  0:0
32774910_2   Q-q100-SM  gpu24      conf-eccv+ 8  COMPLETED  0:0
32774910_3   Q-q100-SM  gpu24      conf-eccv+ 8  COMPLETED  0:0
32774910_4   Q-q100-SM  gpu24      conf-eccv+ 8  COMPLETED  0:0
32774910_5   Q-q100-SM  gpu24      conf-eccv+ 8  COMPLETED  0:0
32774910_6   Q-q100-SM  gpu24      conf-eccv+ 8  COMPLETED  0:0
32774910_10  Q-q100-SM  gpu24      conf-eccv+ 8  COMPLETED  0:0
32774910_14  Q-q100-SM  gpu24      conf-eccv+ 8  COMPLETED  0:0

But specifying the same date range and state "CA" (Cancelled):

$ sacct --start 2024-03-13T00:00:00 --end 2024-03-15T00:00:00 --allusers -X -s CA | grep 32774910
32774910_7   Q-q100-SM  gpu24      conf-eccv+ 8  CANCELLED+ 0:0
32774910_8   Q-q100-SM  gpu24      conf-eccv+ 8  CANCELLED+ 0:0
32774910_9   Q-q100-SM  gpu24      conf-eccv+ 8  CANCELLED+ 0:0
32774910_11  Q-q100-SM  gpu24      conf-eccv+ 8  CANCELLED+ 0:0
32774910_12  Q-q100-SM  gpu24      conf-eccv+ 8  CANCELLED+ 0:0
32774910_13  Q-q100-SM  gpu24      conf-eccv+ 8  CANCELLED+ 0:0
32774910_15  Q-q100-SM  gpu24      conf-eccv+ 8  CANCELLED+ 0:0
32774910_16  Q-q100-SM  gpu24      conf-eccv+ 8  CANCELLED+ 0:0
32774910_17  Q-q100-SM  gpu24      conf-eccv+ 8  CANCELLED+ 0:0
32774910_18  Q-q100-SM  gpu24      conf-eccv+ 8  CANCELLED+ 0:0
32774910_[1+ Q-q100-SM  gpu,gpu24+ conf-eccv+ 0  CANCELLED+ 0:0

All the missing allocations are listed here. Is this a bug? (If not, what is necessary to fetch all allocations in _any_ state?)

-Greg
And:

$ sacct -j 35914175_112 --format jobid,start,end,state
JobID        Start               End                 State
------------ ------------------- ------------------- ----------
35914175_112 2024-11-01T11:31:56 2024-11-01T14:34:13 COMPLETED
35914175_11+ 2024-11-01T11:31:56 2024-11-01T14:34:13 COMPLETED
35914175_11+ 2024-11-01T11:31:56 2024-11-01T14:34:13 COMPLETED

1/ No entry returned:
$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X | grep 35914175_112
$

2/ Same search with state "R", even though the job is completed:
$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X --state R | grep 35914175_112
35914175_112 2024-11-01T11:31:56 2024-11-01T14:34:13 COMPLETED

3/ Same search with state "CD":
$ sacct --start 2024-11-01 --end 2024-11-02 --format jobid,start,end,state --allusers -X --state CD | grep 35914175_112
35914175_112 2024-11-01T11:31:56 2024-11-01T14:34:13 COMPLETED

-Greg
Hi Greg,

Indeed, it seems something is wrong, as you should be getting the complete job list when using sacct. Could you upload your slurm.conf and slurmdbd.conf?

Also, which user are you using to submit these jobs, and from which user are you making the sacct queries?

Thank you,
Miquel
Created attachment 39862 (slurm.conf)
Created attachment 39863 (slurmdbd.conf)
(In reply to Miquel Comas from comment #2)
> Hi Greg,
>
> indeed, it seems something is wrong as you should be getting the complete
> job list when using sacct. Could you upload your slurm.conf and slurmdbd.conf?
>
> Also, what user are you using to submit these jobs and from which user are
> you making the sacct queries?
>
> Thank you,
>
> Miquel

Hi Miquel,

I'm running sacct from my personal user account; however, I've just tested with the "root" account and get the same results.

I cannot tell you which commands were used at submission time; the jobs ran around March 2024.

-Greg
Hi Greg,

Thank you for the configs. At first glance, there does not appear to be a misconfiguration.

Could you reproduce the problem with these debug options enabled and then share the slurmdbd.log with us?

(in slurmdbd.conf):
> DebugFlags=DB_QUERY,DB_ASSOC
> DebugLevel=debug

[1] https://slurm.schedmd.com/slurmdbd.conf.html#OPT_DB_QUERY
[2] https://slurm.schedmd.com/slurmdbd.conf.html#OPT_DB_ASSOC
[3] https://slurm.schedmd.com/slurmdbd.conf.html#OPT_DebugLevel

Additionally, please add this debug option and share the slurmctld.log covering the sacct calls.

(in slurm.conf):
> DebugFlags=DBD_Agent
> DebugLevel=debug

[4] https://slurm.schedmd.com/slurm.conf.html#OPT_DBD_Agent

Once you have made the calls with the debug options enabled, you can revert them to their original values to avoid filling the logs with extra information.

When reproducing the issue, please repeat the sacct calls you made in Comment 1. This way we will be able to compare the different queries made to the database.

Thank you,
Miquel
Hi Greg, do you need further assistance with your question? Best regards,
Hi Greg, are there any updates on the issue? Thank you,
Created attachment 40201 (slurmctld.log)
Created attachment 40202 (slurmdbd.log)
Actions taken as requested; log files attached. Note that "DebugLevel=debug" is not valid in slurm.conf, so "SlurmctldDebug=debug" was used instead.
Hi Greg,

Thank you for the logs. I have been digging into them, and I would like to request another debug flag to gather more information.

Please add DB_JOB to DebugFlags in slurmdbd.conf. Then reconfigure slurmdbd with `sacctmgr reconfigure` and send me the slurmctld and slurmdbd logs covering an "sacct -X" call (you can add a --start and --end range if you want) where the "cancelled" jobs do not appear, followed by an "sacct -X -s CA" call (the same calls made in Comment 1 should suffice).

This will log the queries that slurmdbd runs when gathering job information, so we will be able to see whether there is a difference in the fields between a "plain" `sacct -X` and one that specifies the cancelled job state that could be causing this issue.

Thank you,
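The reproduction steps requested above could be scripted roughly as follows. This is a minimal sketch, assuming DB_JOB has already been added to DebugFlags in slurmdbd.conf and that the account running it is allowed to call `sacctmgr reconfigure`; the function name `repro_sacct_queries` is hypothetical, and the time window and array job ID in the example are the ones from Comment 1.

```shell
#!/usr/bin/env bash
# Hypothetical helper: runs the two sacct queries from Comment 1 back to back,
# so the slurmdbd log shows both job queries next to each other.
# Assumption: DB_JOB was already added to DebugFlags in slurmdbd.conf.
repro_sacct_queries() {
    local start="$1" end="$2" jobid="$3"

    # Ask slurmdbd to re-read slurmdbd.conf so the new debug flag takes effect.
    sacctmgr reconfigure

    # Query 1: no state filter (the cancelled array tasks were missing here).
    echo "### plain -X"
    sacct --start "$start" --end "$end" --allusers -X | grep "$jobid"

    # Query 2: same window, restricted to the CANCELLED state.
    echo "### -X -s CA"
    sacct --start "$start" --end "$end" --allusers -X -s CA | grep "$jobid"
}

# Example, using the window and array job ID from Comment 1:
# repro_sacct_queries 2024-03-13T00:00:00 2024-03-15T00:00:00 32774910
```

Afterwards, the DebugFlags change can be reverted and `sacctmgr reconfigure` run again to keep the logs small.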
Hi Greg, were you able to apply these log changes? Best regards, Miquel
Hi Greg, is there any news from your side? Best regards,
I came across this ticket by accident, and it reminded me of a similar problem I noticed with sacct years ago. It could only be solved by supplying '-s R'. Here are the comments from my code, in case they are useful:

" Eligible timestamp as 'Unknown'. By default, sacct shows only jobs with Eligible time. Some jobs do not have Eligible time (i.e. it is 'Unknown'). Many such jobs are CANCELLED and have zero usage, but some are not. Such jobs can only be explicitly retrieved by their JobIDRaw, or by supplying '-s R' in addition to the time interval (-S -E) that encompasses the Start time. Note that the time interval alone without '-s R' will not capture such jobs, because their Eligible is unknown. "

And here is the relevant note from the official documentation: "NOTE: If no -s (--state) option is given sacct will display *eligible* jobs... ".
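Following the note above, one way to try fetching allocations in any state is to pass every job state code explicitly together with the time window, so that sacct selects jobs by their state during the window instead of by eligibility. This is a minimal sketch and an assumption extrapolated from the '-s R' and '-s CA' results in this ticket, not a confirmed fix; the helper name `sacct_any_state` is hypothetical, and the state list is copied from the JOB STATE CODES section of the sacct man page, which may differ between Slurm versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list allocations in a time window regardless of state.
# With -s/--state, sacct selects jobs that were in one of the given states
# during the window, instead of jobs that were eligible during it.
# Assumption: these codes match the JOB STATE CODES in the local `man sacct`.
ALL_STATES="BF,CA,CD,DL,F,NF,OOM,PD,PR,R,RQ,RS,RV,S,TO"

sacct_any_state() {
    local start="$1" end="$2"
    shift 2
    # Any extra arguments (e.g. --format ...) are passed through unchanged.
    sacct --start "$start" --end "$end" --allusers -X --state "$ALL_STATES" "$@"
}

# Example, matching the queries in this ticket:
# sacct_any_state 2024-03-13T00:00:00 2024-03-15T00:00:00 | grep 32774910
```

Because -X is kept, only allocation lines are returned, as in the queries above; dropping it would also list the job steps.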