Ticket 15081

Summary: SLURM_TASKS_PER_NODE incorrect with --exclusive --cpus-per-task underfit
Product: Slurm    Reporter: Dylan Simon <dsimon>
Component: Scheduling    Assignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: cinek, jdamicis, lgarrison, tripiana
Version: 21.08.8   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=10620
Site: Simons Foundation & Flatiron Institute
Version Fixed: 23.02.0pre1
Attachments: slurm.conf

Description Dylan Simon 2022-09-30 07:22:14 MDT
If I salloc or sbatch a job using -N, -c, and --exclusive (really a partition with OverSubscribe=EXCLUSIVE), where NNODES * (CPUS_PER_NODE % CPUS_PER_TASK) >= CPUS_PER_TASK, SLURM_TASKS_PER_NODE is set incorrectly in the batch/salloc environment.  For example, on 40-core nodes:

> salloc --exclusive -N 2 -c 11 -C skylake bash
> echo $SLURM_TASKS_PER_NODE
4,3
> srun hostname | sort | uniq -c
    3 worker2111
    3 worker2112

On 128-core nodes:

> salloc --exclusive -N 3 -c 33 -C rome bash
> echo $SLURM_TASKS_PER_NODE
4(x2),3
> srun hostname | sort | uniq -c
   3 worker5481
   3 worker5482
   3 worker5483
> srun env | grep SLURM_TASKS_PER_NODE
SLURM_TASKS_PER_NODE=3(x3)
...

Larger example on 128-core:
> salloc --exclusive -N 15 -c 5 -C rome bash
> echo $SLURM_TASKS_PER_NODE
26(x9),25(x6)

srun itself does the right thing: it runs only the number of tasks that fit per node, and inside the srun it sets SLURM_TASKS_PER_NODE and the other variables correctly.  But in the salloc/sbatch environment it does not.  There it seems to pool the leftover CPUs from all the nodes, create extra tasks out of them, and assign those tasks to the first nodes, even though they don't fit given CPUS_PER_TASK.
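The pooled-CPU hypothesis reproduces every value shown above.  A minimal Python sketch of the arithmetic (this is illustrative, not Slurm source; the function names are made up):

```python
def compress(counts):
    """Format a per-node task list the way SLURM_TASKS_PER_NODE does,
    e.g. [4, 4, 3] -> "4(x2),3"."""
    out, i = [], 0
    while i < len(counts):
        j = i
        while j < len(counts) and counts[j] == counts[i]:
            j += 1
        run = j - i
        out.append(f"{counts[i]}(x{run})" if run > 1 else str(counts[i]))
        i = j
    return ",".join(out)

def per_node_tasks(nnodes, cpus_per_node, cpus_per_task):
    # Correct: each node gets only the tasks that fit locally.
    return [cpus_per_node // cpus_per_task] * nnodes

def pooled_tasks(nnodes, cpus_per_node, cpus_per_task):
    # Suspected bug: derive the total task count from the pooled CPUs of
    # the whole allocation, then spread it so the first nodes get extras.
    total = (nnodes * cpus_per_node) // cpus_per_task
    base, extra = divmod(total, nnodes)
    return [base + 1] * extra + [base] * (nnodes - extra)

print(compress(per_node_tasks(2, 40, 11)))   # 3(x2)  -- what srun actually runs
print(compress(pooled_tasks(2, 40, 11)))     # 4,3    -- what salloc exports
print(compress(pooled_tasks(3, 128, 33)))    # 4(x2),3
print(compress(pooled_tasks(15, 128, 5)))    # 26(x9),25(x6)
```

The pooled calculation matches all three reported SLURM_TASKS_PER_NODE values (4,3 / 4(x2),3 / 26(x9),25(x6)), while the per-node calculation matches what srun reports inside the step.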

This only happens when the total number of tasks is implicit and inferred from the exclusively allocated cpus.  Using -n or --ntasks-per-node works fine.

This also seems to affect mpirun (at least with openmpi4), which runs ranks for these extra tasks.
Comment 2 Carlos Tripiana Montes 2022-10-04 06:21:43 MDT
Hi,

I'll try to reproduce your issue. Please provide your slurm.conf if possible.

Thanks,
Carlos.
Comment 3 Dylan Simon 2022-10-04 06:23:39 MDT
Created attachment 27103 [details]
slurm.conf
Comment 5 Carlos Tripiana Montes 2022-10-06 09:10:43 MDT
Hi,

I have been able to reproduce the issue on master, and we are investigating why this extra task appears *only* in the environment variable. The job steps aren't affected by this and run the right number of tasks per node.

I'll let you know once this is fixed.

Thanks for reporting,
Carlos.
Comment 8 Carlos Tripiana Montes 2022-10-25 08:32:35 MDT
Hi Dylan,

This has been fixed in 22.05 and master branches, commits:

848142a418 Fix salloc SLURM_NTASKS_PER_NODE output env variable when -n not given
7c86732028 Fix sbatch SLURM_NTASKS_PER_NODE output env variable when -n not given
355a3df278 Add NEWS for the previous two commits

I'm going to close the bug as fixed. Feel free to reopen it if you find any related issues.

Regards,
Carlos.