Please consider the following simple test script:

#!/bin/bash
#SBATCH --ntasks 30
#SBATCH --partition linlarge
#SBATCH --exclusive

for i in {1..4}; do
    echo $i
    srun -n 1 -c 1 sleep 150 &
done
wait

The nodes in our partition each have 28 cores, therefore 2 nodes are allocated:

JOBID    PARTITION  NAME      USER     ST  TIME  NODES  NODELIST(REASON)
677521   linlarge   slurmtes  ngib740  R   0:03  2      gisath[034,036]

However, only 2 processes get started. Looking at "sacct" I can see that 28 cores (AllocCPUS) were allocated per job step:

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
677521       slurmtest+   linlarge    default         56    RUNNING      0:0
677521.batch      batch               default         28    RUNNING      0:0
677521.0          sleep               default         28    RUNNING      0:0
677521.1          sleep               default         28    RUNNING      0:0

The SLURM output shows the following warnings:

cpu-bind=MASK - gisath034, task 0 0 [7913]: mask 0xfffffff set
1 2 3 4
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
cpu-bind=MASK - gisath034, task 0 0 [7975]: mask 0x1 set
cpu-bind=MASK - gisath036, task 0 0 [20700]: mask 0x1 set

If I change the script and add --exclusive to the srun parameters, I still get the same "srun: Warning: ..." messages, but the behaviour is back to what I expect and one core is allocated per task:

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
677522       slurmtest+   linlarge    default         56    RUNNING      0:0
677522.batch      batch               default         28    RUNNING      0:0
677522.0          sleep               default          1    RUNNING      0:0
677522.1          sleep               default          1    RUNNING      0:0
677522.2          sleep               default          1    RUNNING      0:0
677522.3          sleep               default          1    RUNNING      0:0

Without the #SBATCH --exclusive (and without srun --exclusive) I get yet another allocation, 15 CPUs per task:

677523       slurmtest+   linlarge    default         30    RUNNING      0:0
677523.batch      batch               default         15    RUNNING      0:0
677523.0          sleep               default         15    RUNNING      0:0
677523.1          sleep               default         15    RUNNING      0:0

I'm trying to understand what is happening here and why more than 1 CPU gets allocated to the tasks started with srun. I'm pretty sure we did not see this behavior with SLURM 20.02. Can you please advise whether this is an expected change or a bug in 20.11? Do we need additional configuration settings with SLURM 20.11? Thank you.

Our partition is set up as follows:

PartitionName=linlarge Nodes=gisath[009-352] MaxTime=INFINITE State=UP OverSubscribe=FORCE:1 PriorityTier=30 QoS=linlarge

The nodes like this:

NodeName=gisath[017-352] CPUs=28 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 RealMemory=128000 State=UNKNOWN Weight=1 Feature=athena,broadwell,rhel7,lmem GRES=fv:1

The select type:

SelectType=select/cons_tres   # for gpu
SelectTypeParameters=CR_Core,CR_Pack_Nodes

Let me know if you need more information. Thanks.
Possibly related to the same issue: if I use the following script, SLURM will start 28 tasks on the first node but only a single task on the second node allocated to the job:

#!/bin/bash
#SBATCH --ntasks 40
#SBATCH --partition linlarge
#SBATCH --exclusive

for i in {1..40}; do
    echo $i
    srun -n 1 -c 1 --exclusive sleep 150 &
done
wait

JOBID    PARTITION  NAME      USER     ST  TIME  NODES  NODELIST(REASON)
677547   linlarge   slurmtes  ngib740  R   2:56  2      gisath[018,020]

cat slurm-677547.out

cpu-bind=MASK - gisath018, task 0 0 [14335]: mask 0xfffffff set
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
[the warning above is printed once for each of the 40 srun calls; the repeats are trimmed here]
cpu-bind=MASK - gisath018, task 0 0 [14584]: mask 0x2 set
cpu-bind=MASK - gisath018, task 0 0 [14591]: mask 0x4 set
cpu-bind=MASK - gisath020, task 0 0 [25583]: mask 0x1 set
cpu-bind=MASK - gisath018, task 0 0 [14600]: mask 0x4000 set
cpu-bind=MASK - gisath018, task 0 0 [14608]: mask 0x8000 set
cpu-bind=MASK - gisath018, task 0 0 [14616]: mask 0x20 set
cpu-bind=MASK - gisath018, task 0 0 [14624]: mask 0x40 set
cpu-bind=MASK - gisath018, task 0 0 [14631]: mask 0x20000 set
cpu-bind=MASK - gisath018, task 0 0 [14640]: mask 0x10000 set
cpu-bind=MASK - gisath018, task 0 0 [14647]: mask 0x40000 set
cpu-bind=MASK - gisath018, task 0 0 [14657]: mask 0x1 set
cpu-bind=MASK - gisath018, task 0 0 [14665]: mask 0x80000 set
cpu-bind=MASK - gisath018, task 0 0 [14673]: mask 0x100 set
cpu-bind=MASK - gisath018, task 0 0 [14681]: mask 0x400 set
cpu-bind=MASK - gisath018, task 0 0 [14689]: mask 0x1000 set
cpu-bind=MASK - gisath018, task 0 0 [14697]: mask 0x10 set
cpu-bind=MASK - gisath018, task 0 0 [14705]: mask 0x8 set
cpu-bind=MASK - gisath018, task 0 0 [14714]: mask 0x800 set
cpu-bind=MASK - gisath018, task 0 0 [14722]: mask 0x100000 set
cpu-bind=MASK - gisath018, task 0 0 [14730]: mask 0x80 set
cpu-bind=MASK - gisath018, task 0 0 [14738]: mask 0x2000 set
cpu-bind=MASK - gisath018, task 0 0 [14746]: mask 0x200000 set
cpu-bind=MASK - gisath018, task 0 0 [14754]: mask 0x400000 set
cpu-bind=MASK - gisath018, task 0 0 [14762]: mask 0x200 set
cpu-bind=MASK - gisath018, task 0 0 [14770]: mask 0x4000000 set
cpu-bind=MASK - gisath018, task 0 0 [14778]: mask 0x8000000 set
cpu-bind=MASK - gisath018, task 0 0 [14786]: mask 0x1000000 set
cpu-bind=MASK - gisath018, task 0 0 [14792]: mask 0x800000 set
cpu-bind=MASK - gisath018, task 0 0 [14795]: mask 0x2000000 set
srun: Job 677547 step creation temporarily disabled, retrying (Requested nodes are busy)
[the message above keeps repeating while the remaining steps wait for free resources; the repeats are trimmed here]
Hi Patrick,

First, to answer your question about this warning:

> srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1

srun requested 2 nodes implicitly - the job allocation is 2 nodes, and srun didn't specify a node count, so it uses the node count of the job allocation. But srun also requested 1 CPU and 1 task, and you can't run 1 process on 2 nodes. You can specify the number of nodes (-N1) to silence this warning.

Now to answer your main question: there were some changes to srun in 20.11. From our RELEASE_NOTES document:
https://github.com/SchedMD/slurm/blob/slurm-20-11-5-1/RELEASE_NOTES

 -- By default, a step started with srun will be granted exclusive (or
    non-overlapping) access to the resources assigned to that step. No other
    parallel step will be allowed to run on the same resources at the same
    time. This replaces one facet of the '--exclusive' option's behavior, but
    does not imply the '--exact' option described below. To get the previous
    default behavior - which allowed parallel steps to share all resources -
    use the new srun '--overlap' option.
 -- In conjunction to this non-overlapping step allocation behavior being the
    new default, there is an additional new option for step management
    '--exact', which will allow a step access to only those resources
    requested by the step. This is the second half of the '--exclusive'
    behavior. Otherwise, by default all non-gres resources on each node in
    the allocation will be used by the step, making it so no other parallel
    step will have access to those resources unless both steps have specified
    '--overlap'.

You can find more background about this change here:
https://bugs.schedmd.com/show_bug.cgi?id=10383#c63

In other words, the default behavior for srun is:

* exclusive access to the resources it requests (srun --exclusive)
* all the resources of the job on the node (srun --whole)

These can be overridden by:

* srun --overlap (steps can overlap each other)
* srun --exact (use only exactly the resources requested)

Here's an example. I have 8 cores and 2 threads per core on my nodes.

#!/bin/bash
#SBATCH -n20
set -x
srun --exact -N1 -c1 -n1 whereami
printf "\n\n\n"
srun -N1 -c1 -n1 whereami
printf "\n\n\n"
srun -N2 -n2 whereami | sort

("whereami" is a simple program we wrote that just displays CPU masks. You can also display the masks the way you are already doing, but I can share this program with you if you want.)

$ sbatch 11275.batch
Submitted batch job 202
$ cat slurm-202.out
+ srun --exact -N1 -c1 -n1 whereami
0000 n1-1 - Cpus_allowed: 0101  Cpus_allowed_list: 0,8
+ printf '\n\n\n'
+ srun -N1 -c1 -n1 whereami
0000 n1-1 - Cpus_allowed: 1f1f  Cpus_allowed_list: 0-4,8-12
+ printf '\n\n\n'
+ srun -N2 -n2 whereami
+ sort
0000 n1-1 - Cpus_allowed: 1f1f  Cpus_allowed_list: 0-4,8-12
0001 n1-2 - Cpus_allowed: 1f1f  Cpus_allowed_list: 0-4,8-12

My first step uses --exact, so it gets exactly the CPUs it asked for. My second step doesn't use --exact, so it is given all the resources on the node - this is what you're seeing. My third step uses all the resources in the job, just to show what the CPU bindings are for the entire job across the 2 nodes.

Basically, unless --exact is specified, --cpus-per-task is ignored in a job step. I admit this was surprising to me, even though I knew about this change to srun - I thought that if I specified --cpus-per-task, slurmctld would give me exactly what I asked for. I will look into changing the behavior so that if --cpus-per-task is explicitly requested by the user, it implies --exact. But if it turns out that we don't want to do that, then I will at least write a documentation patch for the srun man page to clarify that --cpus-per-task requires --exact.

Can you run your tests with --exact and let me know if it does what you expect?
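For readers without access to SchedMD's "whereami" helper, a rough shell stand-in (a hypothetical sketch, not the actual tool - the output format differs, and the script name whereami.sh is my own) is to print the task rank, host name, and the CPU affinity that Linux reports for the process:

#!/bin/bash
# whereami.sh - hypothetical stand-in: show which CPUs this step task is bound to.
# SLURM_PROCID is set by srun for each task; /proc/self/status is Linux-specific.
printf '%s %s - ' "${SLURM_PROCID:-?}" "$(hostname -s)"
grep -E '^Cpus_allowed(_list)?:' /proc/self/status | tr -s '\n\t' ' '
echo

Running it inside a step, e.g. `srun --exact -N1 -c1 -n1 ./whereami.sh`, shows the per-step binding much like the cpu-bind=MASK lines already quoted in this ticket.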
> If I change the script and add --exclusive to the srun parameters I still get these same "srun: Warning: ..." but the behaviour is back to expected and one core is allocated per task:

This happens because explicitly requesting --exclusive implicitly sets --exact.
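For reference, a sketch of the first test script from this ticket adapted to the 20.11 step semantics (anticipating the --exact tests in the following comments; partition and sleep duration are taken from the original report):

#!/bin/bash
#SBATCH --ntasks 30
#SBATCH --partition linlarge
#SBATCH --exclusive

for i in {1..4}; do
    echo $i
    # --exact limits each step to the 1 task / 1 CPU it asks for, so the four
    # steps can run in parallel instead of one step holding all 28 cores
    srun -n 1 -c 1 --exact sleep 150 &
done
wait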
Hello - thank you for the explanation. I just did a quick test, but I believe I still see an issue when going across nodes. The following script will request two desktop nodes, each configured with 3 CPUs:

#!/bin/bash
#SBATCH --ntasks 6
#SBATCH --partition desktop

for i in {1..6}; do
    echo $i
    srun -N1 -n1 -c1 --exact sleep 150 &
done
wait

SLURM assigns two nodes:

JOBID    PARTITION  NAME      USER     ST  TIME  NODES  NODELIST(REASON)
1380409  desktop    slurmtes  ngib740  R   2:49  2      giswlx[100-101]

The output file, however, shows that only 1 CPU is used on the second node (giswlx101):

cpu-bind=MASK - giswlx100, task 0 0 [11655]: mask 0x7 set
1 2 3 4 5 6
cpu-bind=MASK - giswlx100, task 0 0 [11752]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [11764]: mask 0x4 set
cpu-bind=MASK - giswlx100, task 0 0 [11767]: mask 0x2 set
cpu-bind=MASK - giswlx101, task 0 0 [25074]: mask 0x1 set
srun: Job 1380409 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1380409 step creation temporarily disabled, retrying (Requested nodes are busy)

If I use "srun -N1 -n1 -c1 --exact --overlap sleep 150 &", it is even worse, with SLURM starting more tasks (5) on giswlx100 than there are configured CPUs:

cpu-bind=MASK - giswlx100, task 0 0 [16047]: mask 0x7 set
1 2 3 4 5 6
cpu-bind=MASK - giswlx100, task 0 0 [16125]: mask 0x4 set
cpu-bind=MASK - giswlx100, task 0 0 [16138]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [16144]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [16154]: mask 0x2 set
cpu-bind=MASK - giswlx100, task 0 0 [16155]: mask 0x2 set
cpu-bind=MASK - giswlx101, task 0 0 [26888]: mask 0x1 set

So am I still missing something in the configuration here? Please advise.

I agree that if --cpus-per-task is specified it would make sense to imply --exact, or at least to issue a warning in the log file that the option is ignored.
(In reply to Patrick from comment #26)
> Hello - thank you for the explanation;
> I just did a quick test but I believe I still see some issue when going
> across nodes. The following script will request two desktops with 3 CPU's
> configured each:
>
> #!/bin/bash
> #SBATCH --ntasks 6
> #SBATCH --partition desktop
>
> for i in {1..6}; do
>     echo $i
>     srun -N1 -n1 -c1 --exact sleep 150 &
> done
>
> wait

Actually, this job doesn't request two nodes with 3 CPUs each. This job only requests 6 tasks on partition "desktop". The tasks don't have to be distributed evenly on two nodes, and they don't even have to be on two nodes. If Slurm can fit the 6 tasks on one node, then it will try to do that, since "block" is the default distribution for tasks across nodes with select/cons_res. You can read more about the distribution in the sbatch/srun/salloc man pages (search for the -m, --distribution option).

To request exactly 3 tasks on 2 nodes, use:

sbatch --ntasks-per-node=3 -N2

So it looks like what actually happened is that 5 tasks were on one node and 1 task was on another node.

> If I use "srun -N1 -n1 -c1 --exact --overlap sleep 150 &", then it is even
> worse with SLURM starting more tasks (5) on giswlx100 than there are
> configured CPU's:

--overlap allows steps to share CPUs.
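A sketch of the even-distribution variant described above, assuming the same 3-CPU "desktop" nodes, with the node/task distribution pinned in the batch directives rather than left to the default block distribution:

#!/bin/bash
#SBATCH --ntasks 6
#SBATCH --ntasks-per-node 3
#SBATCH --nodes 2
#SBATCH --partition desktop

for i in {1..6}; do
    echo $i
    # with 3 tasks placed on each of the 2 nodes, every step can get its own CPU
    srun -N1 -n1 -c1 --exact sleep 150 &
done
wait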
Marshall - the "desktop" nodes are all configured with 3 CPUs:

NodeName=giswlx100 Arch=x86_64 CoresPerSocket=3 CPUAlloc=0 CPUTot=3 CPULoad=2.01

This means that asking for 6 tasks requests 2 hosts with 3 CPUs each. The main issue here is that I'm asking for 6 tasks but can only start (srun) 4 at the same time (SLURM output file):

1 2 3 4 5 6
cpu-bind=MASK - giswlx100, task 0 0 [11752]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [11764]: mask 0x4 set
cpu-bind=MASK - giswlx100, task 0 0 [11767]: mask 0x2 set
cpu-bind=MASK - giswlx101, task 0 0 [25074]: mask 0x1 set
srun: Job 1380409 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1380409 step creation temporarily disabled, retrying (Requested nodes are busy)

To illustrate this better, here's a test on 3 nodes:

#!/bin/bash
#SBATCH --ntasks 9
#SBATCH --ntasks-per-node 3
#SBATCH --partition desktop

for i in {1..9}; do
    echo $i
    srun -N1 -n1 -c1 --exact sleep 30 &
done
wait

SLURM will use 3 CPUs/tasks on the first assigned host (giswlx100), but only a single task on each of the 2 other hosts:

cpu-bind=MASK - giswlx100, task 0 0 [32493]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [32498]: mask 0x4 set
cpu-bind=MASK - giswlx100, task 0 0 [32504]: mask 0x2 set
cpu-bind=MASK - giswlx101, task 0 0 [13746]: mask 0x1 set
cpu-bind=MASK - giswlx102, task 0 0 [27852]: mask 0x1 set
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1388346 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
srun: Job 1388346 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
srun: Job 1388346 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
cpu-bind=MASK - giswlx100, task 0 0 [321]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [338]: mask 0x2 set
cpu-bind=MASK - giswlx102, task 0 0 [28906]: mask 0x1 set
cpu-bind=MASK - giswlx101, task 0 0 [14921]: mask 0x1 set
Patrick,

Thanks for the clarification. I can reproduce what you're seeing in your latest comment, and I agree it seems like a bug. But it's different from the original issue you reported about --cpus-per-task, so I've created bug 11357 to handle this and added you to CC. Let's continue our conversation over there, and we'll leave this bug to handle --cpus-per-task implying --exact.
Patrick,

We've pushed a fix for --cpus-per-task (and --threads-per-core, which had the same issue) to imply --exact. We pushed this to the master branch since it is a change in behavior, and documented the behavior change in NEWS, RELEASE_NOTES, and the srun man page. So this will be in 21.08 when it is released (this August).

If you want this for 20.11, you should be able to cherry-pick the patch (just src/srun/libsrun/launch.c, since the changes to the other files are only documentation changes).

Closing this as resolved/fixed in 21.08.
I realized I never made the commit hash public. Here's the commit with the change: https://github.com/SchedMD/slurm/commit/e01e884f3c294 (Sorry about the extra email.)
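For anyone backporting this to a local 20.11 tree, one possible way (a sketch, not a tested procedure) to apply only the code portion of that commit is:

# assumed layout: a git checkout of your 20.11 Slurm source, with the SchedMD
# GitHub repository available as a remote named "schedmd" (a name chosen here)
git fetch schedmd
# take only the src/srun/libsrun/launch.c part of the fix; the rest of the
# commit is documentation (NEWS, RELEASE_NOTES, man page) changes
git show e01e884f3c294 -- src/srun/libsrun/launch.c | git apply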
Hi Patrick,

I just wanted to give you some updates on this.

It turned out that, due to a bug, -c/--cpus-per-task and --threads-per-core did NOT imply --exact in 21.08.0 through 21.08.4. In 21.08.5, we fixed that bug so that they imply --exact properly. However, we discovered that this broke MPI programs pretty badly, since mpirun can't work like it needs to when --exact is specified. So in 21.08.6+, we reverted -c/--cpus-per-task and --threads-per-core implying --exact.

However, in 22.05 we are going to make -c/--cpus-per-task imply --exact again, but we are also going to change it so that srun does NOT inherit any -c specified by salloc/sbatch. This will give us the best of both worlds: it won't break MPI programs, and it also fixes the issue you highlighted in this ticket (where `srun -cX ...` gives you the job's whole allocation when it really doesn't make sense for it to).

What this means for you: in 21.08.6+, make sure users specify --exact when doing `srun -c ...` to get only the CPUs you expect. In 22.05, you can get rid of the extra --exact.

See bug 13351 comment 76 for more details.

Thanks!
-Michael
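Concretely, based on the description above (a sketch, reusing the sleep test from this ticket), the per-step srun call inside a batch loop would look like:

# Slurm 21.08.6 and later 21.08 releases: spell out --exact so that -c1 is honored
srun -n1 -c1 --exact sleep 150 &

# Slurm 22.05+: -c/--cpus-per-task implies --exact again, so --exact can be dropped
srun -n1 -c1 sleep 150 &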
Thank you for the update, Michael - we'll keep that in mind when we update to the latest SLURM release.