NOTE: we have set this to the highest priority due to the short time remaining before our planned upgrade to Slurm 20.11 (scheduled for Wednesday 08:00 CET). If you can provide a quick answer to our primary question, “is this a bug or a feature”, you may then lower the priority of the ticket.

In Slurm 20.02.6, the following job script works as intended (Slurm makes sure that only one copy of “someapp” runs per core), but in Slurm 20.11.5 only one copy of “someapp” runs concurrently per node.

#!/bin/bash
#SBATCH -N 2
#SBATCH --exclusive

# Run 256 tasks on the 2*32 cores allocated to the job
for i in $(seq 1 256); do
    srun -N1 -n1 /somedir/someapp > ./out-${i}.log &
done
wait

Is this a regression, or an intentional change in behaviour in Slurm 20.11? If it is an intentional change, how should the above script be modified for 20.11?

If it is a regression, we will likely delay the planned Slurm upgrade until a fix is available. If it is an intentional change, we will instead have to educate our users on how to run on Slurm 20.11. That is the reason we are looking for a quick reply from you.

The behaviour does not match our interpretation of the srun man page for 20.11. The man page suggests that --exclusive (which is the default) should ensure that one “someapp” runs per core, and that --overlap would result in all 256 instances of “someapp” running concurrently. We have tried various combinations of --exclusive, --overlap, --whole, --mem-per-node=0, … without finding anything that restores the previous behaviour.
Hi Jonas,

Please add #SBATCH -c 1 to the batch script and --exact to the srun calls. In my tests, this combination works. Please let us know if it works for you as well.

Thanks.
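For illustration (an untested sketch, keeping the same 2*32-core layout as the original script), the job script with those two changes would look roughly like this:

#!/bin/bash
#SBATCH -N 2
#SBATCH -c 1
#SBATCH --exclusive

# Run 256 tasks on the 2*32 cores allocated to the job
for i in $(seq 1 256); do
    srun -N1 -n1 --exact /somedir/someapp > ./out-${i}.log &
done
wait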
We've had lots of questions about the change to srun in 20.11. This is a duplicate of bug 10383 comment 63, 11644, 10769, 11448, and probably others. Read bug 10383 comment 63 first, and also the RELEASE_NOTES, looking for the changes to srun, --exact, and --overlap.
(And as Carlos mentioned in comment 1, exclusive means steps don't share resources. So definitely follow Carlos's hint.)
Thank you very much for the quick replies. We got things working with the following (it works with or without -c1):

#!/bin/bash
#SBATCH -N2
#SBATCH --exclusive

for i in $(seq 1 256); do
    srun -N1 -n1 --overlap --exact /somedir/someapp > ./out-${i}.log &
done
wait
Sounds good too. Let's close the bug with the information given, if you agree. Regards.
(I'm working with Jonas and have also looked at this issue.)

We are still unsure why the following snippet, without --overlap, behaves the way it does:

#!/bin/bash
#SBATCH -N2
#SBATCH -c 1
#SBATCH --exclusive

for i in $(seq 1 256); do
    srun -N1 -n1 --exact /somedir/someapp > ./out-${i}.log &
done
wait

This starts 32 processes in parallel on the first node but only ONE on the second node, and then waits until processes start finishing. However, adding the --overlap flag makes it start 32 processes on each node.

Why is the --overlap flag needed here? Why are the two nodes treated differently without that flag?
I suspect that is because my earlier test did not use 2 times the number of cores of the 2 nodes, but just the number of cores of the 2 nodes. In any case, I have now tested with double the number of srun steps: it gives me 32 per node (64 in total) and then waits, not 32 on one node and a single one on the other.

Regarding --overlap, it is needed because you submit twice as many steps as there are cores, so the steps have to overlap on cores.

Regards.
> In any case, I have now tested with double the number of srun steps: it
> gives me 32 per node (64 in total) and then waits, not 32 on one node and a
> single one on the other.

Thanks for running these tests! Then I don't think the behaviour we see is what is intended. We indeed see 32 processes on one node and 1 on the other, and then waiting.

> Regarding --overlap, it is needed because you submit twice as many steps as
> there are cores, so the steps have to overlap on cores.

Perhaps I misunderstand, but we are after the behaviour you say you see with just --exact (64 processes, 32 on each node, then waiting). That would be no overlap, right? For some reason we only get this desired behaviour when we use the combination of both --overlap and --exact, which seems a bit odd.
Ah, ok, I misunderstood your case. I assumed you wanted all 256 running at the same time, and I also made a mistake reading my logs. My test does the same as yours (256 srun calls in a loop): it runs 32+32 with --exact plus --overlap, but in my case it does the same with --exact alone, without --overlap. Also, in my case -c1 does not change anything, but in both cases that is just because of my configuration.

Please send me your slurm.conf; I suspect the answer for the difference is there. Or maybe you use job_submit.lua or something similar that rewrites the default values of some flags when they are not set explicitly.

Thanks.
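As a side note (the job ID below is just a placeholder), one quick way to see which values ended up on the job after any submit plugin has run, and to compare them with the cluster defaults, is:

scontrol show job 12345 | grep -Ei 'OverSubscribe|NumCPUs|MinCPUsNode'
scontrol show config | grep -Ei 'DefMemPer|SelectType'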
Created attachment 19657 [details] slurm.conf
(In reply to Carlos Tripiana Montes from comment #9)
> Or maybe you use job_submit.lua or something similar that rewrites the
> default values of some flags when they are not set explicitly.

Yes, in our job_submit.lua we set job_desc.shared = 0 if the user asked for more than one node (which should be the same as adding --exclusive). But apart from that we shouldn't change anything related to this.
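For illustration only (the script name is a placeholder): forcing job_desc.shared = 0 in the plugin should have the same effect as if the user had submitted with

sbatch --exclusive -N 2 jobscript.sh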
Created attachment 19658 [details] whereami.c

Hey Jonas,

Matching your config, I'm unable to reproduce 32 steps on "node 1" plus 1 step on "node 2" running at the same time, with the others waiting. I'm getting 32 steps on "node 1" plus 32 steps on "node 2", and the rest waiting.

Please use this script:

#!/bin/bash
#SBATCH -N 2
#SBATCH --exclusive

# Run 256 tasks on the 2*32 cores allocated to the job
for i in $(seq 1 256); do
    srun -vvv -N1 -n1 --exact bash -c 'whereami;sleep 60' &> ./out-${i}.log &
done
wait

and send it to the queue as follows:

sbatch -vvv [script]

Post the sbatch output and all the slurm-[jobid].out plus out-[step].log files. Additionally, post slurmctld.log and slurmd.log (for the 2 nodes of the job), as well as your job_submit.lua. I'm not sure, but I suspect job_submit.lua is doing something here.

Please see the attached whereami.c and compile it first for the test.

Regards.
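(A build note, not part of the attachment itself: something along the lines of

gcc -o whereami whereami.c
export PATH="$PWD:$PATH"    # or use the full path to whereami in the srun line

should be enough; the binary just has to be reachable from the compute nodes.)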
Created attachment 19659 [details] Logs from slurmd on nodes and slurmctld
Created attachment 19660 [details] Logs from sbatch and steps
Created attachment 19661 [details] job_submit.lua

This script is a bit hairy, but the function that might be of interest is _set_exclusive_if_needed(). The rest of the code _should_ only affect jobs that request special reservations/QOS, or nodes whose memory differs from DefMemPerCPU. (We should probably have used partitions for this, but the solution we have exists for historical reasons.)
Okay, okay... found it!

The head commit of the 20.11 branch does not fail; 20.11.5 does. The good news is that if you update, you only need --exact, without --overlap. That behaviour makes more sense and is what you were expecting from what is stated in the documentation. The bad news is that if you don't update, you still need --overlap.

I'm going to track down the commit ID and the related bug ID (if any) for the fix and send them to you as soon as I've found them. Give me a while to dig into the git history.

Regards,
https://github.com/SchedMD/slurm/commit/6a2c99edbf96e50463cc2f16d8e5eb955c82a8ab Bug https://bugs.schedmd.com/show_bug.cgi?id=11357
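If it helps (assuming you build from a git checkout of the SchedMD repository), a quick way to check whether a given tree already contains that fix is:

git merge-base --is-ancestor 6a2c99edbf96e50463cc2f16d8e5eb955c82a8ab HEAD \
    && echo "fix included" || echo "fix missing"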
Thank you very, very much for the quick help. I've tested 20.11.7 and verified that it works for us as well. You can close the ticket.
Hi Jonas, That's awesome news. Thank you very much for the feedback. Let's close the bug. Regards.