| Summary: | Reservation core count changed after controller restart | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | CSC sysadmins <csc-slurm-tickets> |
| Component: | Scheduling | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | alex |
| Version: | 19.05.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CSC - IT Center for Science | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 19.05.6 20.02.pre1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
CSC sysadmins
2019-09-16 01:48:11 MDT
**Ben Roberts:**

Hi Tommi,

I've been trying to reproduce the issue you're describing this morning, but I haven't found a way to make it happen. Is this something you're able to reproduce? If so, can you send me the `scontrol create reservation` command you're using to create the reservation? If you can't reproduce it, can you see if your colleague can find the command they used in their history?

Thanks,
Ben

**CSC sysadmins:**

Grepped from the history:

```
scontrol create reservationname=training1 nodecnt=5 users=root,userx starttime=2019-09-10T08:00:00 duration=10:00:00 Flags=PART_NODES PartitionName=large
scontrol create reservationname=training2 nodecnt=5 users=root,userx starttime=2019-09-11T08:00:00 duration=10:00:00 Flags=PART_NODES PartitionName=large
```

I retested with similar flags and was able to reproduce this bug. A colleague removed the partition from the reservation, and after that the reservation does not change on `scontrol reconfig`:

```
scontrol update ReservationName=training1 PartitionName=
```

**Ben Roberts:**

Hi Tommi,

Thanks for sending those commands; I was able to reproduce the issue as well. It looks like the problem behavior is caused by the PART_NODES flag. This flag isn't necessary for the type of reservation being created: you can keep the partition specification as well as the node count, and the reservation will still get the right nodes. Leaving off the PART_NODES flag shouldn't have any effect, since the requirements for the flag aren't being met. Here's the description of the flag from the documentation:

> This flag can be used to reserve all nodes within the specified partition. PartitionName and Nodes=ALL must be specified or this option is ignored.

The flag is being ignored at reservation creation time, but something is going wrong with the logic when Slurm is restarted. To work around this bug you can leave off the flag, and I'll keep looking into what's happening when Slurm is restarted. Please let me know if you have any questions.
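As a sketch of the workaround Ben describes (node count plus partition, no PART_NODES flag), reusing the partition name, user list, and times from the grepped commands above; the verification step at the end is an assumption about how one would confirm the reservation is stable across a reconfigure:

```shell
# Create a 5-node reservation in the "large" partition WITHOUT the
# PART_NODES flag; NodeCnt plus PartitionName is enough for Slurm to
# place the reservation on the right nodes.
scontrol create reservation reservationname=training1 \
    nodecnt=5 users=root,userx \
    starttime=2019-09-10T08:00:00 duration=10:00:00 \
    partitionname=large

# Record the reserved nodes, then confirm they are unchanged after a
# controller reconfigure (the buggy behavior appeared on restart/reconfig).
scontrol show reservation training1
scontrol reconfigure
scontrol show reservation training1
```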
Thanks,
Ben

**CSC sysadmins:**

Hi,

Thanks for the info. I think we can cope without the PART_NODES flag :)

**Ben Roberts:**

Hi Tommi,

I know you were able to work around this by removing the PART_NODES flag from your reservation creation command. For your reference, I wanted to let you know that fixes have been checked in to address this behavior. In 19.05.6, the PART_NODES flag will be removed after the reservation is created if the reservation wasn't created with all of the partition's nodes. You can see the details of the commit here:

https://github.com/SchedMD/slurm/commit/4bbce568958a0b20f46fbccb069bf9140f7f514e

In 20.02 we will ensure that users specify all nodes when using the PART_NODES flag:

https://github.com/SchedMD/slurm/commit/77ae6880d94852961820a68db8010299da95f523

I'll close this ticket now. Let me know if you have questions about this.

Thanks,
Ben
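For reference, a hypothetical sketch of how the documented intent of PART_NODES would be used once the fixes above are in place; the partition name is an assumption, and on 20.02 the flag is only honored in this all-nodes form:

```shell
# PART_NODES is only meaningful together with PartitionName and Nodes=ALL:
# reserve every node in the partition, tracking partition membership.
scontrol create reservation reservationname=maint1 \
    partitionname=large nodes=ALL flags=PART_NODES \
    users=root starttime=now duration=02:00:00
```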