Hello: We've upgraded to 20.11.4 (from 19.05.x), and when we try to run our ARM applications we get:
$ ml forge/19.0
$ ddt --offline srun ls
srun: error: Unable to create step for job 20054593: Requested nodes are busy
They used to work with our 19.05 versions. Could this be a support issue with Slurm 20.11.4 on Forge 19, or just a Slurm configuration issue? Cheers, Tony
Tony, Before we jump into more detailed debugging, could you please check whether adding
>export SLURM_OVERLAP=1
before the other commands changes the behavior? cheers, Marcin
Marcin: That worked. Thanks, Tony
Tony, I'm guessing the issue is that `ddt` (probably) calls srun behind the scenes, and one of the major changes in Slurm 20.11 is that we no longer overlap step resources by default. If you can verify how ddt works (one option is to use strace and check whether srun is executed), this will help us understand the case. Another thing to check is whether you need srun in the `ddt` arguments at all; maybe just `ddt ls` will work if srun is already called by ddt. You may want to check Bug 11341 for more details. From the 20.11 change perspective this may be called a duplicate, but I'd like to make sure we fully understand what's happening in your case so that we can find the best long-term solution for you. Exporting SLURM_OVERLAP is something I'd call a workaround for now. Let me know your thoughts. cheers, Marcin
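The two checks suggested above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, it assumes strace is installed and tracing is permitted on the node, and it treats SLURM_OVERLAP as the environment counterpart of srun's `--overlap` option.

```shell
# Sketch only: the function names below are hypothetical helpers,
# not part of Slurm or Forge.

# Run one command with step overlap allowed, without exporting the
# variable for the rest of the shell session.
run_with_overlap() {
    SLURM_OVERLAP=1 "$@"
}

# Record only execve calls of a command and all of its children.
# First argument: log file; remaining arguments: the command to trace.
# Assumes strace is available (-f follows child processes).
trace_execs() {
    trace_log=$1
    shift
    strace -f -e trace=execve -o "$trace_log" "$@"
}

# Count successful execve calls in an strace log whose target binary
# is named srun (failed PATH lookups, which return -1, are ignored).
count_srun_execs() {
    grep -E 'execve\("([^"]*/)?srun",' "$1" | grep -c ' = 0$'
}
```

For instance, `trace_execs ddt.trace ddt --offline ls` followed by `count_srun_execs ddt.trace` would indicate whether ddt launches srun on its own when srun is left out of its arguments, and `run_with_overlap ddt --offline srun ls` applies the workaround to a single invocation.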
Will check this out. Thanks, Tony
Tony, Were you able to check the details? cheers, Marcin
Hi Marcin: Apologies. I've been distracted by other things lately. I will look into that and update. Cheers, Tony
Tony, Were you able to get back to the case? cheers, Marcin
Tony, Let me know if you want to continue working on this. In case of no reply I'll close the bug as info given. cheers, Marcin
Hi Marcin: Apologies. A bit busy at the site at the moment due to hardware expansions. Please close the ticket. Much appreciated. Cheers, Tony