15240 – Record Restart Count in Database

Status:	RESOLVED INFOGIVEN

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Database
Version:	23.02.x
Hardware:	Linux Linux

Severity:	4 - Minor Issue
Assignee:	Jason Booth
QA Contact:

URL:

Depends on:
Blocks:

Reported:	2022-10-19 11:09 MDT by Paul Edmon
Modified:	2022-12-22 16:06 MST
CC List:	0 users

See Also:
Site:	Harvard University

Description Paul Edmon 2022-10-19 11:09:09 MDT

squeue can tell you how many times a job has restarted, but that data is not recorded anywhere for posterity. It would be useful to have a record of this so that one could reconstruct things like how many jobs are getting preempted/restarted on average and what the highest number of restarts is. Please add this as a field to the slurmdbd job records.

Comment 2 Jason Booth 2022-10-20 10:11:46 MDT

Paul, you can access this information today with a few of the sacct options.


https://slurm.schedmd.com/sacct.html#OPT_duplicates
https://slurm.schedmd.com/sacct.html#OPT_state


The -D option can be used with a specific job, so a user who wants to know how many times a job is restarted can use this option to count those events.
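As a rough illustration of counting those events, the sketch below tallies the records that sacct returns for a job when duplicates are requested. It assumes that each requeue leaves one extra record under the same job ID, so the restart count is taken as the number of records minus one; that interpretation, the helper names `count_restarts` and `restarts_for_job`, and the sample states in the usage note are assumptions for illustration, not something confirmed in this ticket.

```python
import subprocess
from collections import Counter


def count_restarts(sacct_output: str) -> dict[str, int]:
    """Map each job ID to (number of records - 1).

    Expects pipe-delimited output with the job ID in the first column,
    e.g. from `sacct -D -X -n -P -o JobID,State`.
    """
    records = Counter()
    for line in sacct_output.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id = line.split("|", 1)[0]
        records[job_id] += 1
    # Assumption: every record beyond the first corresponds to one restart.
    return {job_id: n - 1 for job_id, n in records.items()}


def restarts_for_job(job_id: str) -> int:
    """Hypothetical wrapper: query sacct for one job and count its restarts."""
    out = subprocess.run(
        ["sacct", "-D", "-X", "-n", "-P", "-j", job_id, "-o", "JobID,State"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return count_restarts(out).get(job_id, 0)
```

With made-up output such as `"100|REQUEUED\n100|REQUEUED\n100|COMPLETED\n"`, `count_restarts` would report two restarts for job 100. The parsing is kept separate from the sacct call so it can be checked without a running slurmdbd.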

For a broader overview, you can use a timeframe with the state flag.

For example:
> sacct -S2020-01-01-9:00 -E2022-10-20-12:00 --state=PREEMPTED

https://slurm.schedmd.com/sacct.html#OPT_PR--PREEMPTED

> PR PREEMPTED
> Job terminated due to preemption.


Would these satisfy your needs?

Comment 3 Paul Edmon 2022-10-20 10:41:39 MDT

That's quite a bit of work just to get a value for a counter. A summary counter of how many requeues occurred would be easier to reference and plot. Imagine looping over all the jobs in your database to pull the requeue information for historic estimates of how many requeues jobs get on average. Instead of querying a single integer value, you have to pull a bunch of information from the slurmdbd and then do some math to reconstruct the value. Or, if you want to poll this in a loop, you now have massive sacct queries and returns coming from the slurmdbd that you then have to reconstruct into the single value you want to plot. This adds overhead to polling the scheduler compared with looking at a single per-job counter that can be updated.
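To make the reconstruction cost concrete, here is a minimal sketch of the kind of post-processing involved. It assumes pipe-delimited sacct output gathered with duplicates included (for example via `sacct -D -X -n -P -o JobID,State` over some time window) and assumes that every record beyond the first for a given job ID represents one requeue. The function name `restart_stats` is hypothetical.

```python
from collections import Counter
from statistics import mean


def restart_stats(sacct_output: str) -> tuple[float, int]:
    """Return (average restarts per job, highest restart count).

    Assumption: each job ID appears once per run, so restarts are the
    number of records for that ID minus one.
    """
    records = Counter(
        line.split("|", 1)[0]
        for line in sacct_output.splitlines()
        if line.strip()
    )
    restarts = [n - 1 for n in records.values()]
    if not restarts:
        return 0.0, 0
    return mean(restarts), max(restarts)
```

Every job record in the window has to be fetched and grouped before the two numbers fall out, whereas a stored per-job counter would reduce this to reading one integer field per job.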

The data is already available easily in squeue as RestartCnt:

RestartCnt <https://slurm.schedmd.com/squeue.html#OPT_RestartCnt>
    The number of restarts for the job. (Valid for jobs only)

Ideally all the values available to squeue should be available as historic records in the slurmdbd. This would also unify the information and options between squeue and sacct, which arguably should be able to show the same data.

-Paul Edmon-

Comment 4 Jason Booth 2022-12-22 16:06:43 MST

I assume my reply was satisfactory, so I am resolving this ticket. Please feel free to re-open it should you have questions related to my last reply.