We have a spank plugin which takes a bit of time to clean up after itself in slurm_spank_exit() with context=remote. In 20.11.7, 'srun -N1 true ; srun -N2 true' fails.

Minimal spank plugin:

> #include <slurm/spank.h>
> #include <unistd.h>
>
> #define PLUGIN_NAME test
> SPANK_PLUGIN(PLUGIN_NAME, 1);
>
> int slurm_spank_exit(spank_t sp, int ac, char **av) {
>     if (spank_context() == S_CTX_REMOTE) sleep(1);
>     return (0);
> }

Command to exhibit the buggy behavior:

> $ salloc -N2 bash -ex <<< 'srun -N1 hostname ; srun -N2 hostname'
> salloc: Granted job allocation 18
> salloc: Waiting for resource configuration
> salloc: Nodes node-[1-2] are ready for job
> + srun -N1 hostname
> lyeager-dt
> + srun -N2 hostname
> srun: error: Unable to create step for job 18: Requested node configuration is not available
> salloc: Relinquishing job allocation 18

Felix, my colleague, did a git bisect and found the culprit: https://github.com/SchedMD/slurm/commit/6a2c99edbf

This is a pretty serious issue for us.
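For anyone wanting to try the reproducer: a SPANK plugin like the one above is built as a shared object and listed in plugstack.conf on the compute nodes. The sketch below assumes a default install layout; the file names and paths are illustrative, not taken from the report:

> # Build the plugin as a shared object (requires the Slurm development headers).
> gcc -shared -fPIC -o /usr/lib64/slurm/spank_test.so spank_test.c
>
> # Enable it in /etc/slurm/plugstack.conf (read by slurmd and srun).
> # Format: optional|required <plugin path> [args]
> required /usr/lib64/slurm/spank_test.so

With the plugin loaded, slurmstepd on each node calls slurm_spank_exit() in the S_CTX_REMOTE context during step teardown, which is where the sleep(1) delays step cleanup and exposes the race.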
Thanks for reporting this, and nice job with git bisect finding the commit that started it! That certainly helped me figure out what was happening faster. I can reproduce this, and I believe I have a fix. I'm submitting my patch to our review queue. Would you like me to make the patch available for you to test?

I discovered this bug can happen without a SPANK plugin, with a simple job like this:

#!/bin/bash
#SBATCH -N2
srun -l -N1 bash -c 'hostname; sleep 10' &
sleep 1
srun -l -N2 bash -c 'hostname; sleep 10' &
wait

Before commit 6a2c99edbf96e, the second step wasn't rejected; instead it incorrectly ran in parallel with the first step, thus overallocating CPUs. So commit 6a2c99edbf actually fixed that bug (in addition to the bug I meant to fix with that commit). But unfortunately it then caused the new bug which you reported.
Thanks for the quick confirmation. And it's good to know that this will affect sites and jobs which don't use spank plugins, too. I think we will probably just sit at 20.11.6 until 20.11.8 is released rather than patching 20.11.7. But it would be nice to have the patch public for the sake of other sites who might want to patch. As you know, many sites need to be at 20.11.7 because of the CVE fix.
(In reply to Luke Yeager from comment #8)
> Thanks for the quick confirmation. And it's good to know that this will
> affect sites and jobs which don't use spank plugins, too. I think we will
> probably just sit at 20.11.6 until 20.11.8 is released rather than patching
> 20.11.7.

Sounds good.

> But it would be nice to have the patch public for the sake of other
> sites who might want to patch, I think. As you know, many sites need to be
> at 20.11.7 because of the CVE fix.

As soon as we push the fix I will post the commit here. If someone wants the patch before it's pushed, just make a comment here and I will make it public.
Hey Luke,

This turned out to be a little more complicated than I initially thought, but we got it figured out. We've fixed this and other issues in the following commits:

48009cbd97 Clear allocated cpus for running steps in a job before handling requested nodes on new step.
05f5abda9f Don't reject a step if not enough nodes are available.
49b1b2681e Don't reject a step if it requests a node that is allocated.

These will be in 20.11.8. Thanks for reporting this! Closing this bug as resolved/fixed.
*** Ticket 11783 has been marked as a duplicate of this ticket. ***
I've verified the fix in 20.11.8 - thanks.