| Summary: | We should not allow the REPLACE and STATIC_ALLOC flags to be used at the same time in a reservation | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Caden Ellis <caden> |
| Component: | reservations | Assignee: | Director of Support <support> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 23.02.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | SchedMD | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | 23.02.0pre1 | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Ticket Depends on: | |||
| Ticket Blocks: | 14589 | ||
| Attachments: |
bug14634_2302_v1
bug14624_2302_v2 bug14634_2302_v3 bug14634_2302_v4 |
||
|
Description
Caden Ellis
2022-07-27 14:03:39 MDT
Created attachment 26531 [details] bug14634_2302_v1 This should remove the ability to create a reservation that uses either the REPLACE or REPLACE_DOWN flag and also uses the STATIC_ALLOC or MAINT flag. These pairs of flags are direct opposites, so this should cause an error. caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation user=caden nodecnt=3 starttime=2022-08-30T14:03:00 duration=2 flags=maint,replace Error creating the reservation: Requested operation not supported on this system Logs show this: [2022-08-30T14:00:55.155] REPLACE and REPLACE_DOWN flags cannot be used with STATIC_ALLOC or MAINT flags MAINT, STATIC_ALLOC, REPLACE, and REPLACE_DOWN all still individually work. Bug 14589 is adding this to the reservation flag tests Created attachment 26535 [details] bug14624_2302_v2 Previous patch had a white space error Created attachment 26671 [details] bug14634_2302_v3 This looks good. This fixes creating a reservation. However, you can still update the reservation to have those flags: marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall nodecnt=1 Reservation created: resv_bug14634 marshall@voyager:~/slurm/23.02/install/c1$ scontrol update reservationname=resv_bug14634 flags=static_alloc,replace Reservation updated. marshall@voyager:~/slurm/23.02/install/c1$ scontrol show reservations ReservationName=resv_bug14634 StartTime=2022-09-08T10:41:59 EndTime=2022-09-08T10:51:59 Duration=00:10:00 Nodes=n1-1 NodeCnt=1 CoreCnt=8 Features=(null) PartitionName=debug Flags=STATIC,REPLACE TRES=cpu=8 Users=marshall Groups=(null) Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a MaxStartDelay=(null) I think there is some duplicate code in the create and update reservation functions, and ideally we would combine as much of the verification code into functions, but don't worry about this for now. It looks like a bigger cleanup project. For now, just fix the update reservation. I also noticed that the "not supported" error message wasn't very helpful. You had to look at the slurmctld log to know what happened. There were other error messages that weren't helpful. So, I added a commit that sends helpful error messages to the user command for create reservation errors. I only added error messages in places where I thought the error message was unclear - the remaining places I thought that they already had clear error messages. TODO: * Fix static_alloc/maint and replace/replace_down for updating reservations. * I didn't add error messages for update reservation errors. Can you do that? * Can you add NEWS and RELEASE_NOTES for your patch? (We don't need it for mine.) Here are the cases where I thought they had unhelpful error messages. Each example shows before and after the patch. marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall flags=replace nodes=n1-1 Error creating the reservation: Invalid node name specified marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall flags=replace nodes=n1-1 scontrol: error: REPLACE or REPLACE_DOWN flags should be used with the NodeCnt reservation option; do not specify Nodes Error creating the reservation: Invalid node name specified marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall flags=replace nodecnt=1 corecnt=1 Error creating the reservation: CPU count specification invalid marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall flags=replace nodecnt=1 corecnt=1 scontrol: error: REPLACE or REPLACE_DOWN flags should be used with the NodeCnt reservation option; do not specify CoreCnt Error creating the reservation: CPU count specification invalid marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall flags=replace,maint nodecnt=1 Error creating the reservation: Requested operation not supported on this system marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall flags=replace,maint nodecnt=1 scontrol: error: REPLACE and REPLACE_DOWN flags cannot be used with STATIC_ALLOC or MAINT flags Error creating the reservation: Requested operation not supported on this system marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall nodecnt=1 flags=time_float,weekly Error creating the reservation: Requested operation not supported on this system marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall nodecnt=1 flags=time_float,weekly scontrol: error: Reservation request has mutually exclusive flags. Repeating floating reservations are not supported. Error creating the reservation: Requested operation not supported on this system Another thing that I didn't address: marshall@voyager:~/slurm/23.02/install/c1$ scontrol create reservationname=resv_bug14634 start=now end=now+600 users=marshall nodes=n1-1 corecnt=12341234 Error creating the reservation: Requested nodes are busy Note, unless nodes are directly requested a reservation must exist in a single partition. If no partition is requested the default partition is assumed. This is correctly rejected because my node doesn't have that many cores; however, the error message is wrong. I didn't look into this so I don't know how hard it will be to fix. Before you worry about this, can you get the other things done? I think this could be pushed separately. This could be a separate bug, or if it turns out to be too difficult I just wouldn't even worry about it (since the reservation is still rejected). Add me back when you're ready Update with the new patch. 1) No change with this error message. Note: it is saying that a core-based reservation can't be updated, but the original reservation wasn't core-based, so that error message is a little off (maybe something more like "can't update a reservation to be core-based" would be better). But it does correctly reject it. I left this alone, because this is checked before all other checks, and has a FIXME comment to support more core-based reservation features. So I didn't want to touch this one. caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation user=caden starttime=now+20 duration=3 flags=replace nodecnt=1 Reservation created: caden_3 caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update reservationname=caden_3 corecnt=1 Error updating the reservation: Core-based reservation can not be updated slurm_update error: Core-based reservation can not be updated 2a) Updating REPLACE reservations to have a node list error - Old: caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+600 users=caden flags=replace NodeCnt=2 Reservation created: caden_9 caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update reservationname=caden_9 nodes=n[0-1] Error updating the reservation: Requested operation not supported on this system slurm_update error: Requested operation not supported on this system 2b) New: caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+600 users=caden flags=replace NodeCnt=2 Reservation created: caden_10 caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update reservationname=caden_10 nodes=n[0-1] scontrol: error: Reservation can't be updated with Nodes option; it is incompatible with REPLACE[_DOWN] Error updating the reservation: Requested operation not supported on this system slurm_update error: Requested operation not supported on this system 3a) Updating to add mutually exclusive flags messages with STATIC and REPLACE - Old: caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+600 users=caden flags=replace NodeCnt=2 Reservation created: caden_15 caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update reservationname=caden_15 flags=maint Error updating the reservation: Requested operation not supported on this system slurm_update error: Requested operation not supported on this system 3b) New: caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+600 users=caden flags=replace NodeCnt=2 Reservation created: caden_16 caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update reservationname=caden_16 flags=maint scontrol: error: REPLACE and REPLACE_DOWN flags cannot be used with STATIC_ALLOC or MAINT flags Error updating the reservation: Requested operation not supported on this system slurm_update error: Requested operation not supported on this system 4a) Reservation has reoccurring flag, but tries update to have time_float (mutually exclusive) error - No change This is a different case, because time_float can't be added or removed from a reservation according to the docs. So it doesn't matter that weekly and time float are mutually exclusive, you can't add or remove time_float anyway. This is the error message it gives, which seems close enough, though not crazy informative. I can add more info if you want. There is an info message in slurmctld log saying "reservation __ request to set TIME_FLOAT flag," implying the user should already know it could not be added and that is why it is giving this info message. caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+600 users=caden flags=weekly NodeCnt=2 Reservation created: caden_14 caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update reservationname=caden_14 flags=time_float Error updating the reservation: Invalid time specified slurm_update error: Invalid time specified caden@kashyyyk:~/slurm/22.05/slurm$ 4b i) Reservation has time_float, but reoccurring flag is added in update- Old. We allowed this before 4b ii) New - caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+180 users=caden NodeCnt=1 flags=time_float Reservation created: caden_8 caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update res reservationname=caden_8 flags=daily scontrol: error: Cannot add a reoccurring flag to a floating reservation Error updating the reservation: Requested operation not supported on this system 5a) Reservation can set multiple reoccurring flags together (either update or creation) - We allowed this 5b) New - caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+180 users=caden NodeCnt=1 flags=weekly,daily scontrol: error: Reservation has multiple reoccurring flags. Please specify only one reoccurring flag Error creating the reservation: Requested operation not supported on this system Or with update: caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+180 users=caden NodeCnt=1 flags=weekly Reservation created: caden_4 caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update res reservationname=caden_4 flags=daily scontrol: error: Cannot update reservation to have multiple reoccurring flags. Please specify only one reoccurring flag Error updating the reservation: Requested operation not supported on this system slurm_update error: Requested operation not supported on this system Starting with no flags and trying to add multiple: caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+180 users=caden NodeCnt=1 Reservation created: caden_5 caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update res reservationname=caden_5 flags=daily,weekly scontrol: error: Cannot update reservation to have multiple reoccurring flags. Please specify only one reoccurring flag Error updating the reservation: Requested operation not supported on this system slurm_update error: Requested operation not supported on this system caden@kashyyyk:~/slurm/22.05/slurm$ Created attachment 27767 [details] bug14634_2302_v4 Caden, I think this looks really good. I need to actually run some tests before I push. Regarding patch 8/10: I think this is good. We don't properly handle multiple reoccurring flags anyway: in _advance_resv_time() they are in an if-else if block, so only one will happen. Another case: Reservation has one reoccurring flag, its removed, and a different one is added. Should work (and it does). caden@kashyyyk:~/slurm/22.05/slurm$ scontrol create reservation start=now end=now+180 users=caden NodeCnt=1 flags=weekly Reservation created: caden_12 caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update res reservationname=caden_12 flags-=weekly Reservation updated. caden@kashyyyk:~/slurm/22.05/slurm$ scontrol update res reservationname=caden_12 flags+=daily Reservation updated. caden@kashyyyk:~/slurm/22.05/slurm$ scontrol show res ReservationName=caden_12 StartTime=2022-11-30T11:08:53 EndTime=2022-11-30T11:11:53 Duration=00:03:00 Nodes=n0 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=normal Flags=DAILY TRES=cpu=2 Users=caden Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a MaxStartDelay=(null) Caden, I rebased your fixup commit. This is in the master branch in the following commits: 83abcdab97 (HEAD -> master, origin/master, origin/HEAD) NEWS and RELEASE_NOTES for previous commits 7caf5b329f Remove ability to update a floating reservation to be reoccurring 042bfaba9e Update reservation creation/update to only accept one reoccurring flag 759f21db2c Remove abilitiy to update a resv to have REPLACE and STATIC_ALLOC flags cf58ba522d Send helpful error messages to user commands for reservation updates 4ff4a9f453 Fix typo of RESERVE_REOCCURRING macro 9f0345d137 Fix indentation 48660b6027 Send helpful error messages to user commands for reservation requests 3519f1b89e Remove ability to have reservation with REPLACE and STATIC_ALLOC flags This was awesome. Great job. |