Ticket 17990

Summary: Confusing message/behavior when submitting job to reservation with too long runtime
Product: Slurm
Reporter: Alexander Grund <alexander.grund>
Component: User Commands
Assignee: Jacob Jenson <jacob>
Status: OPEN ---
QA Contact:
Severity: 6 - No support contract
Priority: ---
Version: 23.11.x
Hardware: Linux
OS: Linux
Site: -Other-

Description Alexander Grund 2023-10-24 02:02:46 MDT
We have the following situation:
- a project reservation for a range of nodes
- a maintenance reservation starting shortly after the project reservation for those nodes

Now the user submits an sbatch script to that project reservation with a `--time` value that exceeds the remaining duration of the project reservation and even runs into the maintenance reservation. (Use case: the same script is resubmitted repeatedly over a couple of weeks, so a submission eventually gets close to the end of the reservation.)

The job is queued but stays "Pending" with the reason "(ReqNodeNotAvail, Reserved for maintenance)".
The user complained that this is a bit confusing, and I'm wondering whether it is intentional: should it be possible to submit a job to a reservation with a runtime that exceeds it?

The issue is much worse when the user also specifies the nodes to use via `--nodelist` (a valid subset of the project reservation's nodes). In that case `sbatch` immediately returns with
> sbatch: error: Required nodes outside of the reservation
> sbatch: error: Batch job submission failed: Requested node configuration is not available

The first part of the error in particular is confusing, if not wrong: the specified nodes are part of the reservation! The actual issue is either that the reservation ends before the job could finish with the given time limit, or that the maintenance reservation overlaps with the requested runtime.

It can also be reproduced with `srun` instead of `sbatch`, like `srun --time=15-0:0:10 --reservation=p_1204 --partition=ml --nodelist=node[8015,8019,8022] hostname`
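
As a user-side stopgap, the condition described above (time limit longer than what is left of the reservation) could be checked before submitting. The following is only a minimal sketch, not anything Slurm provides: `parse_slurm_time` and `fits_in_reservation` are hypothetical helper names, the dates are made up, and the reservation end is assumed to be looked up by hand (for example from the `EndTime` field of `scontrol show reservation`). It only handles the numeric time formats accepted by `--time`, not values such as `UNLIMITED`, and it assumes the job would start right away.

```python
from datetime import datetime, timedelta


def parse_slurm_time(spec: str) -> timedelta:
    """Parse a numeric --time value, e.g. '90', '1:30:00' or '15-0:0:10'."""
    days = 0
    if "-" in spec:
        # days-hours[:minutes[:seconds]]
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        fields += [0] * (3 - len(fields))
        hours, minutes, seconds = fields
    else:
        fields = [int(f) for f in spec.split(":")]
        if len(fields) == 1:        # minutes
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:      # minutes:seconds
            hours, minutes, seconds = 0, fields[0], fields[1]
        else:                       # hours:minutes:seconds
            hours, minutes, seconds = fields
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def fits_in_reservation(time_limit: str, now: datetime,
                        reservation_end: datetime) -> bool:
    """True if a job starting at `now` would end before the reservation does."""
    return now + parse_slurm_time(time_limit) <= reservation_end


# Made-up example: one week of the project reservation is left.
now = datetime(2023, 10, 24, 8, 0, 0)
reservation_end = datetime(2023, 10, 31, 8, 0, 0)

print(fits_in_reservation("15-0:0:10", now, reservation_end))  # False
print(fits_in_reservation("2-0", now, reservation_end))        # True
```

A wrapper script around `sbatch` could use such a check to warn the user, or to shorten `--time`, before the job ends up pending or rejected.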

I think the maintenance reservation is part of the reason it claims "Required nodes outside of the reservation", because for another reservation, requesting an overlong time together with a `--nodelist` yields

> srun: error: Problem using reservation
> srun: error: Unable to allocate resources: Requested node configuration is not available

This is different, although not exactly helpful either.
And again, without `--nodelist` the job gets scheduled.

So in summary the request is:
- Fix the inconsistency between using a reservation with and without an explicitly given nodelist (both cases should either error or be scheduled)
- Improve the error message (if any), in particular the incorrect "Required nodes outside of the reservation" part