*** Ticket 18251 has been marked as a duplicate of this ticket. ***

Hi David,

Thanks for reporting this issue. We have been taking a look at it and have more or less identified the source. We are currently discussing how to address this; I will write you back as soon as we reach a conclusion.

Best regards,
Ricard.

Would it be possible to get this fixed in 23.02.7? We had opened 18251, which was a duplicate of this, and will be attempting to upgrade to 23.02.x on our December 19th downtime, so having even a patch available before then would be good, but having this as part of 23.02.7 would be even better.

Hello Trey,

Since 23.11 was released last month, 23.02 is now only eligible for security patches and segfault fixes. Right now we are reviewing a proposed solution for this bug, but since it is a change in behavior, it will most likely get patched for 23.11.1.

Best regards,
Ricard.

Would it be possible to get a patch for 23.02 that just doesn't make it into a maintenance release? This would allow OSC to change the behavior locally without changing the behavior in the 23.02 release for all customers. Without a 23.02 patch our options are either to live with the behavior or to upgrade to 23.11, but we are not comfortable going to 23.11 yet and we have limited time windows for major upgrades, i.e. either December 19th 2023 or May 2024.

Hello Trey,

We can send a local patch for 23.02 if needed, but please keep in mind that it would be a best-effort patch that hasn't gone through our regular QA regressions and won't be maintained. Our initial solution for your case seems to produce a regression in another feature, so I'm trying to have a "stable" patch before December 19th.

Best regards,
Ricard.

After some internal discussion OSC has decided to delay our upgrade and go to a 23.11 release that has a fix for this bug. The hope is that the upgrade will be done in early 2024, so a patch for 23.11 would still be useful for validating the fix.

Any update on this getting fixed in a 23.11 release?

Hello Trey,

We are still iterating on proposed fixes for this bug. Right now we have a solution that fixes your initial issue but clashes with the node range (-N <min_nodes>-<max_nodes>) feature in specific circumstances. This is taking a bit of time because the allocation logic has a lot of possible parameter combinations, which need to be tested properly. When are you planning to do the upgrade?

Best regards,
Ricard.

Our upgrade timeline is dependent on this getting resolved, so we haven't scheduled the upgrade. We will be doing a rolling reboot to handle the upgrade, so we don't have to wait for our May 2024 downtime.

Hello Trey,

Sorry for the delay; this took some review iterations, but the final solution for this issue has been reviewed and pushed to master, and it will be included in the coming 24.05 release. The related commits are the following:

* bd70f4df7a NEWS for the ntasks precedence fix
* 7b00991bf9 srun - Removed "can't honor" warning on job allocations
* 57c1685f16 salloc - Disable calculation of ntasks if set by the user in the cli
* 9cfff7276e srun - Disable recalculation of ntasks if set by the user in the cli
* ef04c57c9d Add ntasks_opt_set parameter for explicitly set ntasks

Thanks for reporting it!

Best regards,
Ricard.
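Once a release containing these commits is installed (24.05 or later), a quick sanity check is to rerun the original reproducer. This is a minimal sketch, assuming a cluster whose nodes allow at least 4 tasks each; the expected count of 5 follows from the documented semantics (--ntasks takes precedence, --ntasks-per-node acts as a per-node maximum):

# Confirm the installed Slurm version, then rerun the reproducer.
srun --version
# With the fix, --ntasks=5 should win, so hostname runs 5 times and this prints 5.
srun -l --ntasks=5 --ntasks-per-node=4 hostname | wc -l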
Created attachment 33350 [details]
slurm.conf file

According to the srun man page:

  --ntasks-per-node=<ntasks>
      Request that ntasks be invoked on each node. If used with the --ntasks
      option, the --ntasks option will take precedence and the --ntasks-per-node
      will be treated as a maximum count of tasks per node.

However, it seems to me like --ntasks-per-node takes precedence, since the following srun runs 8 tasks instead of 5.

[dgloe@pea2k ~]$ srun -l --ntasks=5 --ntasks-per-node=4 hostname
1: n022
0: n022
2: n022
3: n022
4: n023
6: n023
5: n023
7: n023
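For contrast, under the behavior documented in the man page the same command would launch only 5 tasks, with at most 4 on any node. The placement below is purely illustrative (a hypothetical split across the same two nodes), not output from an actual run:

[dgloe@pea2k ~]$ srun -l --ntasks=5 --ntasks-per-node=4 hostname
0: n022
1: n022
2: n022
3: n022
4: n023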