Hello,

One of our users reported a resource allocation issue that I cannot explain, which occurs when mixing exclusive steps and overlapping steps. Here is a minimal example that reproduces the issue:

#! /bin/bash -l
#SBATCH -N 1
#SBATCH --ntasks-per-node=7
#SBATCH --cpus-per-task 1
#SBATCH --time=0-00:30:00
#SBATCH --partition interactive

SRUN=(srun --exclusive --ntasks=1 --cpus-per-task=1)
PARALLEL=( \
    parallel \
    --delay .2 \
    --jobs "${SLURM_NTASKS}" \
    --joblog parallel.log \
    --line-buffer \
)
NTASKS=1000

"${PARALLEL[@]}" "${SRUN[@]}" "./exec.sh" {} :::: <(seq "${NTASKS}")

The script "exec.sh" called above displays the CPU affinity of all exec.sh processes started via "srun --exclusive" at a given time:

#! /bin/bash
echo "PID $$: Starting, CPU affinities:"
for pid in $(pgrep exec.sh) ; do
    taskset -cp $pid
done
DELAY=$(( (RANDOM % 5) + 2 ))
echo "PID $$: Sleeping for ${DELAY}s"
sleep "$DELAY"

In the beginning everything is fine; each process runs on a different CPU core:

PID 232353: Starting, CPU affinities:
pid 232217's current affinity list: 102
pid 232238's current affinity list: 59
pid 232259's current affinity list: 63
pid 232281's current affinity list: 67
pid 232304's current affinity list: 71
pid 232337's current affinity list: 75
pid 232353's current affinity list: 79
PID 232353: Sleeping for 2s

PID 232444: Starting, CPU affinities:
pid 232238's current affinity list: 59
pid 232259's current affinity list: 63
pid 232281's current affinity list: 67
pid 232304's current affinity list: 71
pid 232337's current affinity list: 75
pid 232353's current affinity list: 79
pid 232444's current affinity list: 102
PID 232444: Sleeping for 2s

As soon as we start one overlapping step, for example with this command used for debugging purposes:

srun --jobid <JOB_ID> --overlap --gres=gpu:0 --pty bash -i

the resource allocated to this overlapping step starts being allocated to multiple exclusive steps at the same time (CPU 102 in the following extract):

PID 234869: Starting, CPU affinities:
pid 234619's current affinity list: 79
pid 234661's current affinity list: 59
pid 234709's current affinity list: 67
pid 234784's current affinity list: 102
pid 234821's current affinity list: 75
pid 234844's current affinity list: 71
pid 234869's current affinity list: 102
PID 234869: Sleeping for 6s

Is this expected behavior?

Thank you for your support,
Best regards,
Hyacinthe
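P.S. For completeness, the CPU binding of the overlapping debug shell itself can be checked from inside it with the same taskset call that exec.sh uses (a small sketch, nothing site-specific assumed):

# Run inside the "srun --overlap ... --pty bash -i" shell:
# prints the CPU affinity of the interactive shell process itself.
taskset -cp $$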
I'm looking into this, though I haven't been able to reproduce it yet. I've also only tested on Slurm 21.08, so I will test on 20.11 to see if the problem exists there.
(In reply to Marshall Garey from comment #2)
> I'm looking into this, though I haven't been able to reproduce it yet. I've
> also only tested on Slurm 21.08, so I will test on 20.11 to see if the
> problem exists there.

Correction: I should have said "to see if I can reproduce the problem there," since you have seen this problem. This is unexpected behavior. And thanks for your reproducer script.
Thanks for having a look at this; the script was provided by one of our users. If this is fixed in a newer Slurm version, we will consider upgrading later.
I'm sorry for the delayed response.

We found some buggy and inconsistent behavior in steps sharing CPUs (--overlap in 20.11/21.08, which was simply the default behavior prior to 20.11). For example, if an --overlap step is submitted after an exclusive step (with the exclusive step still running), the overlap step could share CPUs with the exclusive step. But if the exclusive step is submitted after an overlap step (with the overlap step still running), the exclusive step could not share CPUs with the overlap step.

We made some changes to --overlap for the 22.05 release. In 20.11/21.08, --overlap only allowed sharing CPUs. In 22.05, --overlap allows the step to share CPUs, memory, and GRES. In addition, resources allocated to overlapping steps no longer count towards the resources allocated in the job, meaning an overlapping step will always share its resources with all other steps (overlap or exclusive). This change will make running debugging steps much better and easier, and it should fix the bug that you reported.

See the following commits:

fe9f416ec2
8b00476873
5e446730c8

Unfortunately, we won't be able to backport this to 21.08 since it is a change in behavior, so you'll have to wait for 22.05 for these fixes.

Is there anything else I can help with for this ticket?

- Marshall
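P.S. To illustrate the intended 22.05 behavior, here is a sketch based on the debugging command from your report (the job ID is a placeholder, and I've left out the site-specific --gres option):

# Attach an overlapping debug step to a running job. Under 22.05, this
# step shares CPUs, memory, and GRES with the job's other steps, and its
# resources are not counted against the job's allocation.
srun --jobid <JOB_ID> --overlap --pty bash -i

# From inside that shell, the exclusive steps' affinities can be checked
# the same way exec.sh does; each exclusive step should keep its own CPU
# even while the overlapping shell is running.
for pid in $(pgrep exec.sh); do taskset -cp "$pid"; done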
Thank you very much for the explanation. I think this issue can be closed.
Resolving.