If I salloc or sbatch a job using -N, -c, and --exclusive (really an OverSubscribe=EXCLUSIVE partition), where NNODES * (CPUS_PER_NODE % CPUS_PER_TASK) >= CPUS_PER_TASK, SLURM_TASKS_PER_NODE is set incorrectly in the batch/allocation environment.

For example, on 40-core nodes:

    > salloc --exclusive -N 2 -c 11 -C skylake bash
    > echo $SLURM_TASKS_PER_NODE
    4,3
    > srun hostname | sort | uniq -c
          3 worker2111
          3 worker2112

On 128-core nodes:

    > salloc --exclusive -N 3 -c 33 -C rome bash
    > echo $SLURM_TASKS_PER_NODE
    4(x2),3
    > srun hostname | sort | uniq -c
          3 worker5481
          3 worker5482
          3 worker5483
    > srun env | grep SLURM_TASKS_PER_NODE
    SLURM_TASKS_PER_NODE=3(x3)
    ...

A larger example on 128-core nodes:

    > salloc --exclusive -N 15 -c 5 -C rome bash
    > echo $SLURM_TASKS_PER_NODE
    26(x9),25(x6)

srun itself does the right thing, running only the number of tasks that fit. It also sets SLURM_TASKS_PER_NODE and the other variables correctly inside the srun, but not in the salloc/sbatch environment. There it seems to collect the leftover CPUs from all nodes, create additional tasks out of them, and assign those tasks to the first nodes, even though they don't fit given CPUS_PER_TASK.

This only happens when the total number of tasks is implicit, i.e. inferred from the exclusively allocated CPUs; using -n or --ntasks-per-node works fine. It also appears to affect mpirun (at least with OpenMPI 4), which launches ranks for these extra tasks.
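For clarity, here is a small sketch (not Slurm source code; the function names and the round-robin placement of extra tasks are my assumptions, inferred from the outputs above) of the arithmetic that would produce the observed values:

```python
# Hypothetical reconstruction of the task counts seen above.
# Inputs mirror the examples: nnodes, CPUs per node, --cpus-per-task.

def srun_tasks_per_node(nnodes, cpus_per_node, cpus_per_task):
    # What srun does: only whole tasks that fit on each node.
    return [cpus_per_node // cpus_per_task] * nnodes

def salloc_env_tasks_per_node(nnodes, cpus_per_node, cpus_per_task):
    # What the salloc/sbatch environment appears to report: pool the
    # leftover CPUs from all nodes, carve extra tasks out of the pool,
    # and pile them onto the first nodes even though they don't fit.
    per_node = cpus_per_node // cpus_per_task
    leftover = nnodes * (cpus_per_node % cpus_per_task)
    extra = leftover // cpus_per_task
    counts = [per_node] * nnodes
    for i in range(extra):
        counts[i % nnodes] += 1
    return counts

# 40-core nodes, -N 2 -c 11:
print(srun_tasks_per_node(2, 40, 11))        # [3, 3] -> "3(x2)"
print(salloc_env_tasks_per_node(2, 40, 11))  # [4, 3] -> "4,3"

# 128-core nodes, -N 3 -c 33:
print(salloc_env_tasks_per_node(3, 128, 33))  # [4, 4, 3] -> "4(x2),3"

# 128-core nodes, -N 15 -c 5:
print(salloc_env_tasks_per_node(15, 128, 5))  # 26 on 9 nodes, 25 on 6
```

This reproduces all three reported values ("4,3", "4(x2),3", "26(x9),25(x6)"), which is what suggests the extra tasks come from summing the per-node CPU remainders.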
Hi,

I'll try to reproduce your issue. Please provide your slurm.conf if possible.

Thanks,
Carlos.
Created attachment 27103 [details]
slurm.conf
Hi,

I have been able to reproduce the issue in master, and we are investigating why this extra task shows up *only* in the environment variable. The job steps themselves aren't affected and run the correct number of tasks per node. I'll let you know once this is fixed.

Thanks for reporting,
Carlos.
Hi Dylan,

This has been fixed in the 22.05 and master branches, commits:

    848142a418 Fix salloc SLURM_NTASKS_PER_NODE output env variable when -n not given
    7c86732028 Fix sbatch SLURM_NTASKS_PER_NODE output env variable when -n not given
    355a3df278 Add NEWS for the previous two commits

I'm going to close the bug as fixed. Feel free to reopen it if you find any related issue.

Regards,
Carlos.