Hi,

A colleague made a couple of standard reservations:

[2019-09-09T14:32:23.124] sched: Created reservation=training1 users=root nodes=r14c[07,09,21,43,47] cores=200 licenses=(null) tres=cpu=200 watts=4294967294 start=2019-09-10T08:00:00 end=2019-09-10T18:00:00
[2019-09-09T14:32:30.092] sched: Created reservation=training2 users=root nodes=r16c[33,36,44-46] cores=200 licenses=(null) tres=cpu=200 watts=4294967294 start=2019-09-11T08:00:00 end=2019-09-11T18:00:00

A couple of hours later we restarted slurmctld, and suddenly the reservation core count changed from 200 to 26400 (the partition maximum)!

[2019-09-09T16:18:09.861] Recovered state of reservation dlintro
[2019-09-09T16:18:09.861] Recovered state of reservation gputest
[2019-09-09T16:18:09.861] Recovered state of reservation prolog_test
[2019-09-09T16:18:09.861] Recovered state of reservation training1
[2019-09-09T16:18:09.861] Recovered state of reservation training2
[2019-09-09T16:18:09.863] sched: Updated reservation=training1 users=root nodes=r[01-04,13-18]c[01-48],r[05-06]c[01-64],r07c[05-56] cores=26400 licenses=(null) tres=cpu=26400 watts=4294967294 start=2019-09-10T08:00:00 end=2019-09-10T18:00:00
[2019-09-09T16:18:09.863] sched: Updated reservation=training2 users=root nodes=r[01-04,13-18]c[01-48],r[05-06]c[01-64],r07c[05-56] cores=26400 licenses=(null) tres=cpu=26400 watts=4294967294 start=2019-09-11T08:00:00 end=2019-09-11T18:00:00

Config files can be found on the bug: https://bugs.schedmd.com/show_bug.cgi?id=7685
Hi Tommi,

I've been trying to reproduce the issue you're describing this morning, but I haven't found a way to make it happen. Is this something you're able to reproduce? If so, can you send me the 'scontrol create reservation' command you're using to create the reservation? If you can't reproduce it, can you check whether your colleague can find the command in their shell history?

Thanks,
Ben
Grepped from the history:

scontrol create reservationname=training1 nodecnt=5 users=root,userx starttime=2019-09-10T08:00:00 duration=10:00:00 Flags=PART_NODES PartitionName=large
scontrol create reservationname=training2 nodecnt=5 users=root,userx starttime=2019-09-11T08:00:00 duration=10:00:00 Flags=PART_NODES PartitionName=large

I retested with the same flags and was able to reproduce this bug. My colleague removed the partition from the reservation, and after that the reservation no longer changes on 'scontrol reconfig':

scontrol update ReservationName=training1 PartitionName=
Hi Tommi,

Thanks for sending those commands; I was able to reproduce the issue as well. It looks like the problem behavior is caused by the PART_NODES flag. This flag isn't necessary for the type of reservation being created: you can keep the partition specification as well as the node count to have the reservation get the right nodes. Leaving off the PART_NODES flag shouldn't have any effect, since the requirements to use the flag aren't being met anyway. Here's the description of the flag from the documentation:

-----------------
This flag can be used to reserve all nodes within the specified partition. PartitionName and Nodes=ALL must be specified or this option is ignored.
-----------------

The flag is being ignored at reservation creation time, but something is going wrong with that logic when Slurm is restarted. To work around this bug you can leave off the flag, and I'll keep looking into what's happening on restart.

Please let me know if you have any questions.

Thanks,
Ben
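For reference, the workaround described above could look like this as a command. This is a sketch based on the commands quoted earlier in the thread (same names, users, and times), with the PART_NODES flag dropped; adjust the values for your site before using it:

```shell
# Create the reservation with an explicit partition and node count,
# but WITHOUT Flags=PART_NODES. PART_NODES only takes effect when
# Nodes=ALL is also given, so it adds nothing here and triggers the
# restart bug described above.
scontrol create reservation ReservationName=training1 \
    NodeCnt=5 Users=root,userx \
    StartTime=2019-09-10T08:00:00 Duration=10:00:00 \
    PartitionName=large

# After a slurmctld restart, confirm the core count is unchanged:
scontrol show reservation training1
```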
Hi,

Thanks for the info. I think we can cope without the PART_NODES flag :)
Hi Tommi,

I know you were able to work around this by removing the PART_NODES flag from your reservation creation command. For your reference, I wanted to let you know that fixes have been checked in to address this behavior.

The change in 19.05.6 removes the PART_NODES flag after the reservation is created if it wasn't created with all nodes. You can see the details of the commit here:
https://github.com/SchedMD/slurm/commit/4bbce568958a0b20f46fbccb069bf9140f7f514e

In 20.02 we will require users to specify all nodes when using the PART_NODES flag:
https://github.com/SchedMD/slurm/commit/77ae6880d94852961820a68db8010299da95f523

I'll close this ticket now. Let me know if you have questions about this.

Thanks,
Ben