Ticket 11498 - RFE: Configurable number of requeues
Summary: RFE: Configurable number of requeues
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.11.5
Hardware: Linux
Severity: 5 - Enhancement
Assignee: Chad Vizino
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-04-30 06:02 MDT by Janne Blomqvist
Modified: 2023-01-24 15:55 MST
CC: 4 users

See Also:
Site: NVIDIA (PSLA)
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 23.02.0rc1
Target Release: 23.02
DevPrio: 1 - Paid
Emory-Cloud Sites: ---


Attachments

Description Janne Blomqvist 2021-04-30 06:02:03 MDT
Hi,

currently the maximum number of requeues is hardcoded as 5 in slurmctld.h:

#define MAX_BATCH_REQUEUE 5

Could this be made a configuration parameter?
Comment 3 Kilian Cavalotti 2021-04-30 15:28:15 MDT
Seconded!
Comment 4 Jason Booth 2021-05-04 15:04:00 MDT
Can you describe your use case for this in more detail? Are you trying to requeue failures more than 5 times, and if so, what do you hope to accomplish with that?
Comment 5 Felix Abecassis 2021-05-04 17:15:00 MDT
In most situations, we would like to *lower* the maximum number of requeues to 1 or 2.
Comment 7 Tim Wickberg 2021-05-25 10:28:48 MDT
(In reply to Felix Abecassis from comment #5)
> In most situations, we would like to *lower* the maximum number of requeues
> to 1 or 2.

As with all these hard-coded settings, it's not especially difficult to expose yet another tuning parameter to alter... but at the same time we already have a plethora of options, and I'd rather not add another without good reason. Especially if there's some other structural approach we can take to target the same issue.

So, to rephrase: can you elaborate on why you're seeing this as an issue? And are you using RequeueExit or other related options as part of the jobs' workflow?
Comment 8 Janne Blomqvist 2021-06-08 02:34:07 MDT
Hi,

no, we're not using RequeueExit or similar options.

Our logic is basically: if a job is requeued once due to a node failure, it might be a real node issue; twice, maybe; but more than that is more likely the job itself doing something that causes the node failure, and we want to avoid unnecessarily draining nodes before we have a chance to look into the cause.

Other sites may have other considerations, hence our request to make it configurable instead of changing the hardcoded value. But just for our own use, we'd be happy with just reducing the hardcoded value too.

(And just to clarify, since I'm not sure what the behavior is based on reading the docs: we don't want requeues due to preemption to count against this limit. The requeue limit for preemption should be considerably higher, maybe even effectively unlimited.)
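For context, the `RequeueExit` options Tim asks about in comment 7 are existing slurm.conf parameters that requeue jobs based on their exit codes. A sketch of how a site might use them (the exit-code values here are illustrative, not a recommendation):

```conf
# Allow batch jobs to be requeued at all.
JobRequeue=1
# Automatically requeue batch jobs that exit with these codes.
RequeueExit=1-9
# Requeue but hold (for inspection) jobs exiting with these codes.
RequeueExitHold=64-127
```

These address exit-code-driven requeueing, which is orthogonal to the node-failure requeue cap requested in this ticket.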
Comment 21 Tim Wickberg 2022-12-13 12:05:53 MST
Hi Janne -

We're just about ready to land this change, but had one minor edge case pop up in discussion and wanted your opinion on it:

For a job that has started running, but then suffers a node failure (so not just a Prolog failure), and has requeuing enabled, it is not currently subject to the 5-run limit.

At the moment, I'm not expecting to limit it with this new MaxBatchRequeue option, but wanted to check that's not an issue for your use case.

thanks,
- Tim
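Going by comments 21 and 39, the resolution exposes the cap as a slurm.conf option. A minimal sketch reflecting the use case from comment 8, assuming the `MaxBatchRequeue` name Tim mentions (the value chosen is per-site):

```conf
# Cap the number of times a batch job may be started after
# failures; the previously hardcoded limit was 5.
MaxBatchRequeue=2
```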
Comment 39 Chad Vizino 2023-01-24 15:37:03 MST
Hi. We've completed this work and it will be in 23.02 when it's released next month.
Comment 40 Kilian Cavalotti 2023-01-24 15:55:54 MST
(In reply to Chad Vizino from comment #39)
> Hi. We've completed this work and it will be in 23.02 when it's released
> next month.

Wow, nice, thanks Chad!

Cheers,
--
Kilian