Ticket 15240

Summary: Record Restart Count in Database
Product: Slurm Reporter: Paul Edmon <pedmon>
Component: Database Assignee: Jason Booth <jbooth>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 23.02.x   
Hardware: Linux   
OS: Linux   
Site: Harvard University Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Paul Edmon 2022-10-19 11:09:09 MDT
squeue has the ability to tell you how many times a job restarted, but that data is not recorded anywhere for posterity.  It would be useful to have a record of this so that one could reconstruct things like: on average, how many jobs are getting preempted/restarted, and what is the highest number of restarts? Please add this as a field to the slurmdbd job records.
Comment 2 Jason Booth 2022-10-20 10:11:46 MDT
Paul, you can access this information today with a few of the sacct options.


https://slurm.schedmd.com/sacct.html#OPT_duplicates
https://slurm.schedmd.com/sacct.html#OPT_state


The -D option can be used with a specific job, so a user who wants to know how many times a job is restarted can use this option to count those events.
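
As a rough illustration of counting those events, the sketch below tallies the records that sacct returns per job ID when duplicates are included. It assumes the text came from something like `sacct -D -X -n -P -j <jobid> --format=JobID,State`, and that each restart leaves one extra allocation record for the same job ID. The function name and the sample text are hypothetical, not actual sacct output from this site.

```python
from collections import Counter


def count_restarts(sacct_output):
    """Return {job_id: restarts} from pipe-delimited `sacct -D -X -n -P` text.

    Assumption: every restart leaves one additional record with the same
    JobID, so restarts = (number of records for the job) - 1.
    """
    records = Counter()
    for line in sacct_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        job_id = line.split("|", 1)[0]  # first field is JobID
        records[job_id] += 1
    return {job_id: n - 1 for job_id, n in records.items()}


# Hypothetical sample: job 100 has three records, job 101 has one.
sample = "100|REQUEUED\n100|PREEMPTED\n100|COMPLETED\n101|COMPLETED\n"
print(count_restarts(sample))
```

The real output could be fed in by capturing the sacct command with `subprocess.run(..., capture_output=True, text=True)` and passing its `stdout` to the function.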

For a broader overview, you can use a timeframe with the state flag.

For example:
> sacct -S2020-01-01-9:00 -E2022-10-20-12:00 --state=PREEMPTED

https://slurm.schedmd.com/sacct.html#OPT_PR--PREEMPTED

> PR PREEMPTED
> Job terminated due to preemption.


Would these satisfy your needs?
Comment 3 Paul Edmon 2022-10-20 10:41:39 MDT
That's quite a bit of work just to get a value for a counter.  A summary 
counter of how many requeues would be easier to reference and plot graphs 
of.  Imagine looping over all the jobs in your database to pull the 
requeue information for historic estimates of how many requeues jobs get 
on average.  Instead of querying a single integer value, you have 
to pull a bunch of information from the slurmdbd and then do some math 
to reconstruct the value.  Or, if you want to poll this in a loop, you now 
have massive sacct queries and returns coming from the slurmdbd to then 
reconstruct into the single value you want to plot.  This adds overhead to 
polling the scheduler compared with looking at a single counter per job 
that can be updated.
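
To make the reconstruction step concrete, here is a minimal sketch of the math that would be needed to get the average and the highest restart count out of duplicate job records. It assumes pipe-delimited text such as `sacct -D -X -n -P --format=JobID,State` might produce over some time window, with one extra record per restart. The function name, dictionary keys, and sample data are hypothetical.

```python
from collections import Counter


def restart_summary(sacct_output):
    """Summarize restarts across all jobs in pipe-delimited sacct text.

    Assumption: restarts for a job = (records sharing its JobID) - 1.
    """
    records = Counter(
        line.split("|", 1)[0]  # first field is JobID
        for line in sacct_output.splitlines()
        if line.strip()
    )
    restarts = [n - 1 for n in records.values()]
    if not restarts:
        return {"jobs": 0, "mean_restarts": 0.0, "max_restarts": 0}
    return {
        "jobs": len(restarts),
        "mean_restarts": sum(restarts) / len(restarts),
        "max_restarts": max(restarts),
    }


# Hypothetical sample: three jobs with 2, 0, and 1 restarts respectively.
sample = (
    "100|REQUEUED\n100|PREEMPTED\n100|COMPLETED\n"
    "101|COMPLETED\n"
    "102|REQUEUED\n102|COMPLETED\n"
)
print(restart_summary(sample))
```

Every record for every job in the window has to be transferred and counted here, whereas a stored per-job counter would reduce this to reading one integer per job.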

The data is already available easily in squeue as RestartCnt:

RestartCnt (https://slurm.schedmd.com/squeue.html#OPT_RestartCnt)
    The number of restarts for the job. (Valid for jobs only)

Ideally, all the values available to squeue should be available as 
historic records in the slurmdbd.  This would also unify the information 
and options between squeue and sacct, which arguably should be able to 
show the same data.

-Paul Edmon-

Comment 4 Jason Booth 2022-12-22 16:06:43 MST
I assume my reply was satisfactory, so I am resolving this ticket. Please feel free to re-open it should you have any questions related to my last reply.