squeue can report how many times a job has restarted, but that data is not recorded anywhere for posterity. It would be useful to have a record of this so that one could reconstruct things like: on average, how many jobs are getting preempted/restarted, and what is the highest number of restarts? Please add this as a field to the slurmdbd job records.
Paul, you can access this information today with a few of the sacct options.

https://slurm.schedmd.com/sacct.html#OPT_duplicates
https://slurm.schedmd.com/sacct.html#OPT_state

The -D option can be used with a specific job, so a user who wants to know how many times a job is restarted can use this option to count those events.

For a broader overview, you can use a timeframe with the state flag. For example:

> sacct -S2020-01-01-9:00 -E2022-10-20-12:00 --state=PREEMPTED

https://slurm.schedmd.com/sacct.html#OPT_PR--PREEMPTED
> PR PREEMPTED
> Job terminated due to preemption.

Would these satisfy your needs?
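A minimal sketch of the per-job counting described above, assuming that `sacct -D -X` prints one allocation record per run of a requeued job, so that the restart count is the number of records minus one. That interpretation, the chosen `--format` fields, and the helper names `count_restarts` and `fetch_restart_count` are assumptions for illustration, not something confirmed in this ticket.

```python
import subprocess


def count_restarts(sacct_output: str) -> int:
    """Count restarts from `sacct -D -X -n -P` output for ONE job.

    Assumption: every non-empty line is one allocation record of the
    same job, so restarts = records - 1.
    """
    records = [line for line in sacct_output.splitlines() if line.strip()]
    return max(len(records) - 1, 0)


def fetch_restart_count(job_id: str) -> int:
    """Run sacct for a single job and count its restarts.

    -D shows duplicate records, -X limits output to allocations (no steps),
    -n drops the header, -P gives '|'-delimited output.
    """
    cmd = ["sacct", "-D", "-X", "-n", "-P", "-j", job_id,
           "--format=JobID,State"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return count_restarts(result.stdout)
```

On a cluster, `fetch_restart_count("<jobid>")` would return the count for that job; `count_restarts` can also be fed saved sacct output directly.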
That's quite a bit of work just to get a value for a counter. A summary counter of how many requeues a job has had would be easier to reference and to plot. Imagine looping over all the jobs in your database to pull the requeue information for historic estimates of how many requeues jobs get on average. Instead of querying a single integer value, you have to pull a bunch of information from the slurmdbd and then do some math to reconstruct the value. Or, if you want to poll this in a loop, you now have massive sacct queries and results coming back from the slurmdbd that must be reduced to the single value you want to plot. This adds overhead to polling the scheduler, compared with looking at a single per-job counter that can be updated.

The data is already easily available in squeue as RestartCnt:

RestartCnt (https://slurm.schedmd.com/squeue.html#OPT_RestartCnt)
The number of restarts for the job. (Valid for jobs only)

Ideally, all the values available to squeue should be available as historic records in the slurmdbd. This would also unify the information and options between squeue and sacct, which arguably should be able to show the same data.

-Paul Edmon-
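To make the overhead concrete, here is a rough sketch of the reconstruction step being described: reducing a whole window of `sacct -D -X -a -n -P -S <start> -E <end> --format=JobID,State` output to an average and a maximum restart count. It assumes each duplicate record for a JobID corresponds to one run of that job; the function name `restart_stats` and the sample data in the usage note are hypothetical.

```python
from collections import Counter


def restart_stats(sacct_output: str) -> tuple[float, int]:
    """Return (average restarts per job, maximum restarts) for many jobs.

    Input is '|'-delimited sacct output with JobID as the first field.
    Assumption: restarts for a job = (records with that JobID) - 1.
    """
    records_per_job = Counter()
    for line in sacct_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        job_id = line.split("|", 1)[0]
        records_per_job[job_id] += 1

    if not records_per_job:
        return 0.0, 0

    restarts = [n - 1 for n in records_per_job.values()]
    return sum(restarts) / len(restarts), max(restarts)
```

For example, feeding it three jobs with 2, 1, and 3 records respectively yields an average of 1.0 and a maximum of 2. Every record in the time window has to be transferred and grouped to get those two numbers, which is the cost a stored per-job counter would avoid.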
I assume my reply was satisfactory, so I am resolving this ticket. Please feel free to re-open it should you have questions related to my last reply.