Ticket 11498

Summary: RFE: Configurable number of requeues
Product: Slurm Reporter: Janne Blomqvist <jblomqvist>
Component: Scheduling Assignee: Chad Vizino <chad>
Status: RESOLVED FIXED
Severity: 5 - Enhancement    
Priority: --- CC: fabecassis, kilian, lyeager, tim
Version: 20.11.5   
Hardware: Linux   
OS: Linux   
Site: NVIDIA (PSLA)
Version Fixed: 23.02.0rc1 Target Release: 23.02
DevPrio: 1 - Paid

Description Janne Blomqvist 2021-04-30 06:02:03 MDT
Hi,

currently the maximum number of requeues is hardcoded as 5 in slurmctld.h:

#define MAX_BATCH_REQUEUE 5

Could this be made a configuration parameter?
Comment 3 Kilian Cavalotti 2021-04-30 15:28:15 MDT
Seconded!
Comment 4 Jason Booth 2021-05-04 15:04:00 MDT
Can you define your use case for this in more detail? What are you trying to accomplish with this? Are you trying to requeue failures more than 5 times? If so, what do you hope to accomplish with this?
Comment 5 Felix Abecassis 2021-05-04 17:15:00 MDT
In most situations, we would like to *lower* the maximum number of requeues to 1 or 2.
Comment 7 Tim Wickberg 2021-05-25 10:28:48 MDT
(In reply to Felix Abecassis from comment #5)
> In most situations, we would like to *lower* the maximum number of requeues
> to 1 or 2.

As with all these hard-coded settings, it's not especially difficult to expose yet another tuning parameter to alter... but at the same time we already have a plethora of options, and I'd rather not add another without good reason. Especially if there's some other structural approach we can take to target the same issue.

So, to rephrase: can you elaborate on why you're seeing this as an issue? And are you using RequeueExit or other related options as part of the jobs' workflow?
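
(For context, RequeueExit and RequeueExitHold are the existing exit-code-based requeue options Tim is referring to; a minimal slurm.conf sketch, with illustrative exit codes rather than recommended values:)

```
# slurm.conf -- existing exit-code-based requeue options (illustrative values)
RequeueExit=142,143     # requeue batch jobs that exit with these codes
RequeueExitHold=64      # requeue and hold batch jobs that exit with this code
```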
Comment 8 Janne Blomqvist 2021-06-08 02:34:07 MDT
Hi,

no, we're not using RequeueExit or similar options.

Our logic is basically: if a job is requeued once due to a node failure, it might be a "real" node issue; twice, maybe; but more than that is more likely the job itself doing something that causes the node failure, and we want to avoid unnecessarily draining nodes before we have a chance to look into what causes the problem.

Other sites may have other considerations, hence our request to make it configurable instead of changing the hardcoded value. But just for our own use, we'd be happy with just reducing the hardcoded value too.
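
(Illustratively, until a configuration option exists, reducing the hardcoded value would mean patching the define in slurmctld.h before building; the value 2 below is an example, not a recommendation:)

```c
/* slurmctld.h -- site-local patch, illustrative only */
#define MAX_BATCH_REQUEUE 2
```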

(And just to clarify, since I'm not sure what the behavior is based on reading the docs: we don't want this limit to count requeues due to preemption. The requeue limit for preemption should be considerably higher, maybe even effectively unlimited.)
Comment 21 Tim Wickberg 2022-12-13 12:05:53 MST
Hi Janne -

We're just about ready to land this change, but had one minor edge case pop up in discussion and wanted your opinion on it:

For a job that has started running but then suffers a node failure (so not just a Prolog failure), and has requeuing enabled, it is not currently subject to the 5-run limit.

At the moment, I'm not expecting to limit it with this new MaxBatchRequeue option, but wanted to check that's not an issue for your use case.

thanks,
- Tim
Comment 39 Chad Vizino 2023-01-24 15:37:03 MST
Hi. We've completed this work and it will be in 23.02 when it's released next month.
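
(A minimal slurm.conf sketch for the new option, assuming it landed under the MaxBatchRequeue name discussed in comment 21; check the 23.02 slurm.conf(5) man page for the exact syntax and default:)

```
# slurm.conf (23.02+) -- cap on automatic batch requeues, replacing the
# previously hardcoded MAX_BATCH_REQUEUE value of 5
MaxBatchRequeue=2
```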
Comment 40 Kilian Cavalotti 2023-01-24 15:55:54 MST
(In reply to Chad Vizino from comment #39)
> Hi. We've completed this work and it will be in 23.02 when it's released
> next month.

Wow, nice, thanks Chad!

Cheers,
--
Kilian