Ticket 14634

Summary: We should not allow the REPLACE and STATIC_ALLOC flags to be used at the same time in a reservation
Product: Slurm Reporter: Caden Ellis <caden>
Component: reservationsAssignee: Director of Support <support>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 23.02.x   
Hardware: Linux   
OS: Linux   
Site: SchedMD Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed: 23.02.0pre1
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Ticket Depends on:    
Ticket Blocks: 14589    
Attachments: bug14634_2302_v1
bug14624_2302_v2
bug14634_2302_v3
bug14634_2302_v4

Description Caden Ellis 2022-07-27 14:03:39 MDT
Currently, a user can create a reservation and put both the REPLACE flag and the STATIC_ALLOC flag, which are both mutually exclusive. This should throw an error. Note that MAINT now implies STATIC_ALLOC, so using that flag with the REPLACE flag should also throw an error.
Comment 1 Caden Ellis 2022-08-30 16:26:03 MDT
Created attachment 26531 [details]
bug14634_2302_v1

This should remove the ability to create a reservation that uses either the REPLACE or REPLACE_DOWN flag and also uses the STATIC_ALLOC or MAINT flag. These pairs of flags are direct opposites, so this should cause an error.
Comment 2 Caden Ellis 2022-08-30 16:45:39 MDT
caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation user=caden nodecnt=3 starttime=2022-08-30T14:03:00 duration=2 flags=maint,replace
Error creating the reservation: Requested operation not supported on this system


Logs show this:
[2022-08-30T14:00:55.155] REPLACE and REPLACE_DOWN flags cannot be used with STATIC_ALLOC or MAINT flags

MAINT, STATIC_ALLOC, REPLACE, and REPLACE_DOWN all still individually work. 

Bug 14589 is adding this to the reservation flag tests
Comment 3 Caden Ellis 2022-08-30 17:08:58 MDT
Created attachment 26535 [details]
bug14624_2302_v2

Previous patch had a white space error
Comment 4 Marshall Garey 2022-09-08 10:59:43 MDT
Created attachment 26671 [details]
bug14634_2302_v3

This looks good.
This fixes creating a reservation. However, you can still update the reservation to have those flags:

marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall nodecnt=1
Reservation created: resv_bug14634
marshall@voyager:~/slurm/23.02/install/c1$ scontrol update reservationname=resv_bug14634 flags=static_alloc,replace
Reservation updated.
marshall@voyager:~/slurm/23.02/install/c1$ scontrol show reservations
ReservationName=resv_bug14634 StartTime=2022-09-08T10:41:59 EndTime=2022-09-08T10:51:59 Duration=00:10:00
   Nodes=n1-1 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=debug Flags=STATIC,REPLACE
   TRES=cpu=8
   Users=marshall Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)


I think there is some duplicate code in the create and update reservation functions, and ideally we would combine as much of the verification code into functions, but don't worry about this for now. It looks like a bigger cleanup project. For now, just fix the update reservation.



I also noticed that the "not supported" error message wasn't very helpful. You had to look at the slurmctld log to know what happened. There were other error messages that weren't helpful. So, I added a commit that sends helpful error messages to the user command for create reservation errors. I only added error messages in places where I thought the error message was unclear - the remaining places I thought that they already had clear error messages.

TODO:

* Fix static_alloc/maint and replace/replace_down for updating reservations.
* I didn't add error messages for update reservation errors. Can you do that?
* Can you add NEWS and RELEASE_NOTES for your patch? (We don't need it for mine.)

Here are the cases where I thought they had unhelpful error messages. Each example shows before and after the patch.

marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall flags=replace nodes=n1-1
Error creating the reservation: Invalid node name specified

marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall flags=replace nodes=n1-1
scontrol: error: REPLACE or REPLACE_DOWN flags should be used with the NodeCnt reservation option; do not specify Nodes
Error creating the reservation: Invalid node name specified




marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall flags=replace nodecnt=1 corecnt=1
Error creating the reservation: CPU count specification invalid

marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall flags=replace nodecnt=1 corecnt=1
scontrol: error: REPLACE or REPLACE_DOWN flags should be used with the NodeCnt reservation option; do not specify CoreCnt
Error creating the reservation: CPU count specification invalid



marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall flags=replace,maint nodecnt=1
Error creating the reservation: Requested operation not supported on this system

marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall flags=replace,maint nodecnt=1
scontrol: error: REPLACE and REPLACE_DOWN flags cannot be used with STATIC_ALLOC or MAINT flags
Error creating the reservation: Requested operation not supported on this system



marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall nodecnt=1 flags=time_float,weekly
Error creating the reservation: Requested operation not supported on this system

marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall nodecnt=1 flags=time_float,weekly
scontrol: error: Reservation request has mutually exclusive flags. Repeating floating reservations are not supported.
Error creating the reservation: Requested operation not supported on this system





Another thing that I didn't address:

marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall nodes=n1-1 corecnt=12341234
Error creating the reservation: Requested nodes are busy
Note, unless nodes are directly requested a reservation must exist in a single partition.
If no partition is requested the default partition is assumed.

This is correctly rejected because my node doesn't have that many cores; however, the error message is wrong. I didn't look into this so I don't know how hard it will be to fix. Before you worry about this, can you get the other things done? I think this could be pushed separately. This could be a separate bug, or if it turns out to be too difficult I just wouldn't even worry about it (since the reservation is still rejected).
Comment 5 Marshall Garey 2022-09-08 15:48:55 MDT
Add me back when you're ready
Comment 6 Caden Ellis 2022-11-15 13:12:11 MST
Update with the new patch.

1) No change with this error message. Note: it is saying that a core-based reservation can't be updated, but the original reservation wasn't core-based, so that error message is a little off (maybe something more like "can't update a reservation to be core-based" would be better). But it does correctly reject it. I left this alone, because this is checked before all other checks, and has a FIXME comment to support more core-based reservation features. So I didn't want to touch this one.

caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation user=caden starttime=now+20 duration=3 flags=replace nodecnt=1
Reservation created: caden_3
caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update reservationname=caden_3 corecnt=1
Error updating the reservation: Core-based reservation can not be updated
slurm_update error: Core-based reservation can not be updated

2a) Updating REPLACE reservations to have a node list error - Old:

caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+600 users=caden flags=replace NodeCnt=2
Reservation created: caden_9
caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update reservationname=caden_9 nodes=n[0-1]
Error updating the reservation: Requested operation not supported on this system
slurm_update error: Requested operation not supported on this system

2b) New:

caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+600 users=caden flags=replace NodeCnt=2
Reservation created: caden_10
caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update reservationname=caden_10 nodes=n[0-1]
scontrol: error: Reservation can't be updated with Nodes option; it is incompatible with REPLACE[_DOWN]
Error updating the reservation: Requested operation not supported on this system
slurm_update error: Requested operation not supported on this system

3a) Updating to add mutually exclusive flags messages with STATIC and REPLACE - Old:

caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+600 users=caden flags=replace NodeCnt=2
Reservation created: caden_15
caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update reservationname=caden_15 flags=maint
Error updating the reservation: Requested operation not supported on this system
slurm_update error: Requested operation not supported on this system

3b) New:

caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+600 users=caden flags=replace NodeCnt=2
Reservation created: caden_16
caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update reservationname=caden_16 flags=maint
scontrol: error: REPLACE and REPLACE_DOWN flags cannot be used with STATIC_ALLOC or MAINT flags
Error updating the reservation: Requested operation not supported on this system
slurm_update error: Requested operation not supported on this system

4a) Reservation has reoccurring flag, but tries update to have time_float (mutually exclusive) error - No change

This is a different case, because time_float can't be added or removed from a reservation according to the docs. So it doesn't matter that weekly and time float are mutually exclusive, you can't add or remove time_float anyway. This is the error message it gives, which seems close enough, though not crazy informative. I can add more info if you want. There is an info message in slurmctld log saying "reservation __ request to set TIME_FLOAT flag," implying the user should already know it could not be added and that is why it is giving this info message.

caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+600 users=caden flags=weekly NodeCnt=2
Reservation created: caden_14
caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update reservationname=caden_14 flags=time_float
Error updating the reservation: Invalid time specified
slurm_update error: Invalid time specified
caden@kashyyyk:~/slurm/22.05/slurm$

4b i) Reservation has time_float, but reoccurring flag is added in update- Old. We allowed this before

4b ii) New -
caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+180 users=caden NodeCnt=1 flags=time_float
Reservation created: caden_8
caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update res reservationname=caden_8 flags=daily
scontrol: error: Cannot add a reoccurring flag to a floating reservation
Error updating the reservation: Requested operation not supported on this system

5a) Reservation can set multiple reoccurring flags together (either update or creation) - We allowed this

5b) New -
caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+180 users=caden NodeCnt=1 flags=weekly,daily
scontrol: error: Reservation has multiple reoccurring flags. Please specify only one reoccurring flag
Error creating the reservation: Requested operation not supported on this system


Or with update:

caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+180 users=caden NodeCnt=1 flags=weekly
Reservation created: caden_4

caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update res reservationname=caden_4 flags=daily
scontrol: error: Cannot update reservation to have multiple reoccurring flags. Please specify only one reoccurring flag
Error updating the reservation: Requested operation not supported on this system
slurm_update error: Requested operation not supported on this system

Starting with no flags and trying to add multiple:

caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+180 users=caden NodeCnt=1
Reservation created: caden_5

caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update res reservationname=caden_5 flags=daily,weekly
scontrol: error: Cannot update reservation to have multiple reoccurring flags. Please specify only one reoccurring flag
Error updating the reservation: Requested operation not supported on this system
slurm_update error: Requested operation not supported on this system
caden@kashyyyk:~/slurm/22.05/slurm$
Comment 7 Caden Ellis 2022-11-15 13:15:03 MST
Created attachment 27767 [details]
bug14634_2302_v4
Comment 8 Marshall Garey 2022-11-28 17:11:02 MST
Caden,

I think this looks really good. I need to actually run some tests before I push.

Regarding patch 8/10: I think this is good. We don't properly handle multiple reoccurring flags anyway: in _advance_resv_time() they are in an if-else if block, so only one will happen.
Comment 9 Caden Ellis 2022-11-29 11:13:24 MST
Another case:

Reservation has one reoccurring flag, its removed, and a different one is added. Should work (and it does).

caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+180 users=caden NodeCnt=1 flags=weekly
Reservation created: caden_12

caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update res reservationname=caden_12 flags-=weekly
Reservation updated.

caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update res reservationname=caden_12 flags+=daily
Reservation updated.

caden@kashyyyk:~/slurm/22.05/slurm$ scontrol show res
ReservationName=caden_12 StartTime=2022-11-30T11:08:53 EndTime=2022-11-30T11:11:53 Duration=00:03:00
   Nodes=n0 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=normal Flags=DAILY
   TRES=cpu=2
   Users=caden Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)
Comment 10 Marshall Garey 2022-11-29 16:53:49 MST
Caden,

I rebased your fixup commit. This is in the master branch in the following commits:

83abcdab97 (HEAD -> master, origin/master, origin/HEAD) NEWS and RELEASE_NOTES for previous commits
7caf5b329f Remove ability to update a floating reservation to be reoccurring
042bfaba9e Update reservation creation/update to only accept one reoccurring flag
759f21db2c Remove abilitiy to update a resv to have REPLACE and STATIC_ALLOC flags
cf58ba522d Send helpful error messages to user commands for reservation updates
4ff4a9f453 Fix typo of RESERVE_REOCCURRING macro
9f0345d137 Fix indentation
48660b6027 Send helpful error messages to user commands for reservation requests
3519f1b89e Remove ability to have reservation with REPLACE and STATIC_ALLOC flags


This was awesome. Great job.