| Summary: | RFE: Configurable number of requeues | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Janne Blomqvist <jblomqvist> |
| Component: | Scheduling | Assignee: | Chad Vizino <chad> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | | |
| Priority: | --- | CC: | fabecassis, kilian, lyeager, tim |
| Version: | 20.11.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NVIDIA (PSLA) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 23.02.0rc1 | Target Release: | 23.02 |
| DevPrio: | 1 - Paid | Emory-Cloud Sites: | --- |
Description
Janne Blomqvist
2021-04-30 06:02:03 MDT
Seconded!

Can you define your use case for this in more detail? What are you trying to accomplish with this? Are you trying to requeue failures more than 5 times? If so, what do you hope to accomplish with this?

In most situations, we would like to *lower* the maximum number of requeues to 1 or 2.

(In reply to Felix Abecassis from comment #5)
> In most situations, we would like to *lower* the maximum number of requeues
> to 1 or 2.

As with all these hard-coded settings, it's not especially difficult to expose yet another tuning parameter to alter... but at the same time we already have a plethora of options, and I'd rather not add another without good reason, especially if there's some other structural approach we can take to target the same issue.

So, to rephrase: can you elaborate on why you're seeing this as an issue? And are you using RequeueExit or other related options as part of the jobs' workflow?

Hi, no, we're not using RequeueExit or similar options. Our logic is basically that if a job is requeued due to a node failure once, it might be a "real" issue; twice, maybe; but more than that is more likely due to the job itself doing something that causes the node failure, and we want to avoid unnecessarily draining nodes before we have had the opportunity to look into what causes the problem. Other sites may have other considerations, hence our request to make the limit configurable instead of changing the hardcoded value. But just for our own use, we'd be happy with simply reducing the hardcoded value too.

(And just to clarify, since I'm not sure what the behavior is based on reading the docs: we don't want this limit to count against requeuing due to preemption. The requeue limit for preemption should be considerably higher, maybe even effectively unlimited.)

Hi Janne - We're just about ready to land this change, but one minor edge case popped up in discussion and we wanted your opinion on it: a job that has started running but then suffers a node failure (so not just a Prolog failure), and has requeuing enabled, is not currently subject to the 5-run limit. At the moment, I'm not expecting to limit it with this new MaxBatchRequeue option, but wanted to check that's not an issue for your use case.

thanks,
- Tim

Hi. We've completed this work and it will be in 23.02 when it's released next month.

(In reply to Chad Vizino from comment #39)
> Hi. We've completed this work and it will be in 23.02 when it's released
> next month.

Wow, nice, thanks Chad!

Cheers,
-- Kilian
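For readers looking for the configuration side of this discussion, here is a minimal slurm.conf sketch of the requeue-related settings mentioned in the thread. The MaxBatchRequeue name comes from the comments above; its exact placement and default should be verified against the 23.02 slurm.conf(5) man page, and the numeric values and exit codes below are illustrative assumptions only, chosen to match the reporter's preference for a limit of 1 or 2.

```
# slurm.conf (sketch only; verify parameter placement and defaults against
# the 23.02 slurm.conf(5) man page)

# Allow batch jobs to be requeued at all; individual jobs can still opt in
# or out with sbatch --requeue / --no-requeue.
JobRequeue=1

# New option delivered by this ticket for 23.02: makes the previously
# hard-coded limit of 5 batch-job requeues configurable. A value of 2
# matches the reporter's preference for requeueing at most once or twice on
# suspected node problems. (Assumed here to be a top-level parameter.)
MaxBatchRequeue=2

# Related, pre-existing options mentioned in the discussion (exit codes are
# purely illustrative): automatically requeue, or requeue and hold, batch
# jobs that exit with the listed codes.
#RequeueExit=142,143
#RequeueExitHold=199
```

Note, per Tim's comment, that a job which has already started running and is then requeued because of a node failure is not expected to count against MaxBatchRequeue; whether preemption-driven requeues are counted is raised by Janne above but not confirmed in this thread.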