Ticket 10620

Summary: Should SLURM_NTASKS (input env of srun) in batch step environment depend on --threads-per-core (when -N given)?
Product: Slurm
Reporter: Marcin Stolarek <cinek>
Component: Other
Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---
CC: alex, greg.wickham, kevin.buckley, lyeager, tim
Version: 20.11.2
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=15081
https://bugs.schedmd.com/show_bug.cgi?id=16666
Site: Pawsey
Machine Name: chaos

Description Marcin Stolarek 2021-01-13 06:17:11 MST
Splitting this off from Bug 10377, since that one started with a different issue (a regression in 20.11 compared to 20.02), and the issue here was discovered during tests there.

Initial message from Kevin: Bug 10377 comment 6

==SUMMARY==
--threads-per-core changes the values of the *NTASKS* variables on 20.11 (and 20.02) when combined with -N, which is difficult to explain.

==REPRODUCER==
# scontrol show config | grep VERS
SLURM_VERSION           = 20.02.6
# sbatch  -V
slurm 20.02.6
# sbatch  --threads-per-core=2  -o /tmp/tasks --wrap="srun env" && sleep 3 && grep TASKS /tmp/tasks | sort | uniq -c 
Submitted batch job 59502
      1 SLURM_NTASKS=1
      1 SLURM_STEP_NUM_TASKS=1
      1 SLURM_STEP_TASKS_PER_NODE=1
      1 SLURM_TASKS_PER_NODE=1
# sbatch -N1  --threads-per-core=2  -o /tmp/tasks --wrap="srun env" && sleep 3 && grep TASKS /tmp/tasks | sort | uniq -c 
Submitted batch job 59503
      2 SLURM_NTASKS=2
      2 SLURM_STEP_NUM_TASKS=2
      2 SLURM_STEP_TASKS_PER_NODE=2
      2 SLURM_TASKS_PER_NODE=2
-----
# scontrol show config | grep VERS
SLURM_VERSION           = 20.11.2
# sbatch  -V
slurm 20.11.2
# sbatch -N1  --threads-per-core=2  -o /tmp/tasks --wrap="srun env" && sleep 3 && grep TASKS /tmp/tasks | sort | uniq -c 
Submitted batch job 59504
      2 SLURM_NTASKS=2
      2 SLURM_STEP_NUM_TASKS=2
      2 SLURM_STEP_TASKS_PER_NODE=2
      2 SLURM_TASKS_PER_NODE=2
# sbatch   --threads-per-core=2  -o /tmp/tasks --wrap="srun env" && sleep 3 && grep TASKS /tmp/tasks | sort | uniq -c 
Submitted batch job 59505
      1 SLURM_NTASKS=1
      1 SLURM_STEP_NUM_TASKS=1
      1 SLURM_STEP_TASKS_PER_NODE=1
      1 SLURM_TASKS_PER_NODE=1


cheers,
Marcin
Comment 1 Kevin Buckley 2021-01-14 02:04:20 MST
Thought I'd update the table of differing values from 10377,
now that I have had a chance to apply a default of
 --threads-per-core=1
using the cli_filter that Marcin suggested there.

Columns 1-3 we saw before.

Column 1: No explicit #SBATCH --threads-per-core=1
Column 2: Explicit #SBATCH --threads-per-core=1
Column 3: No explicit #SBATCH --threads-per-core=1;
          --threads-per-core=1 added by Lua job_submit script check
Column 4: No explicit #SBATCH --threads-per-core=1;
          --threads-per-core=1 set by Lua cli_filter
          and checked for by Lua job_submit script

                          Col1  Col2  Col3  Col4
SLURM_CPUS_ON_NODE=         48    24    24    24
SLURM_JOB_CPUS_PER_NODE=    48    24    24    24
SLURM_NPROCS=                -     1     -     1
SLURM_NTASKS=                -     1     -     1
SLURM_TASKS_PER_NODE=       48     1    24     1
SLURM_THREADS_PER_CORE=      -     1     -     1

("-" = variable not set)

Not sure if that tells us anything more, other than that if there
is an explicit --threads-per-core=1 before the job_submit check,
we get the three extra variables set, and SLURM_TASKS_PER_NODE
gets set to 1, not 24 (cores) or 48 (hyperthreads).

Hoping that's useful,
Kevin
Comment 6 Marcin Stolarek 2021-05-03 23:46:13 MDT
*** Ticket 11507 has been marked as a duplicate of this ticket. ***
Comment 7 Luke Yeager 2021-05-04 15:54:26 MDT
You should consider changing the title of this bug. The number of tasks launched actually changes - not just the value of an environment variable. The environment variable is correct - Slurm is launching more tasks for you. Perhaps you already understand that, but it's not clear from the title of this bug.

> $ salloc     --threads-per-core=2 srun -l hostname
> 0: node1
> $ salloc -N1 --threads-per-core=2 srun -l hostname
> 0: node1
> 1: node1
It's definitely surprising to me that adding the "-N1" flag changes this behavior. I filed a similar complaint in bug#11303 (where a seemingly harmless flag changes the behavior).
Comment 8 Marcin Stolarek 2021-05-05 02:26:03 MDT
Luke,

> not just the value of an environment variable.

Yes, it's "not just", since SLURM_NTASKS is an input environment variable for srun. With that in mind I felt the title/short summary was OK, but I have added that information too now.

cheers,
Marcin
Comment 9 Luke Yeager 2021-05-05 08:10:23 MDT
Oh! I didn't realize srun picked the number of tasks to run based on an environment variable - I thought there was some communication between srun and either slurmd or slurmctld. Now I understand the title better, thanks.

> $ salloc -n3 env SLURM_NTASKS=2 srun -l hostname
> 0: node1
> 1: node1
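Luke's experiment above can be sketched as the following task-count resolution, where an explicit -n on the command line wins, and otherwise srun falls back to the SLURM_NTASKS input environment variable. This is a simplified illustration with a hypothetical helper name, not the actual Slurm source:

```python
import os


def resolve_ntasks(cli_ntasks=None, env=os.environ):
    """Sketch of srun-style task-count resolution (simplified).

    An explicit command-line -n takes precedence; otherwise
    SLURM_NTASKS from the environment is used as the input
    default; otherwise no task count is resolved at this stage.
    """
    if cli_ntasks is not None:
        return cli_ntasks
    val = env.get("SLURM_NTASKS")
    return int(val) if val is not None else None
```

This mirrors the quoted example: the allocation was created with -n3, but overriding SLURM_NTASKS=2 in srun's environment made srun launch two tasks.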
Comment 15 Marcin Stolarek 2021-05-24 02:45:24 MDT
Since we have a number of issues here, I'll keep you posted on all code/documentation fixes. We started with documentation improvement of SLURM_CPUS_ON_NODE in bbe3cd794d308[1]

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/bbe3cd794d308081f6ff6ac09d642980bc63b0ed
Comment 24 Marcin Stolarek 2021-07-16 05:30:23 MDT
Just to let you know the status of this bug. We pushed another documentation fix for SLURM_JOB_CPUS_PER_NODE - b4cced8d62140f9c7e1[1].

The issue with the --threads-per-core=2 -N1 specification resulting in two tasks is something we're discussing internally. The code responsible for it has been in place since the initial implementation of --threads-per-core, --cores-per-socket, and --sockets-per-node.

Its behavior is driven by the assumption that a user wants to run one task per CPU (understood here as the specified total number of threads), so the number of requested tasks is set implicitly.

We'll be adding more detailed documentation of it and additional verbose-level messages to make it easier for users to understand, but we don't want to actually change this behavior: it's very old, and some sites may now depend on it (since it has always worked this way, someone may not even realize they rely on -N1 --threads-per-core=2 starting two tasks).

If you don't want this, specifying the number of tasks directly will prevent it, so you can simply add -n1 as a command-line argument.
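The workaround can be sketched as follows: an explicit task count short-circuits the implicit bump. This is a simplified illustration with a hypothetical helper name, not the actual Slurm code, and it considers only --threads-per-core:

```python
def effective_ntasks(explicit_n, nodes_exact, threads_per_core):
    """Sketch: explicit -n takes precedence over the implicit bump.

    explicit_n:       value of -n, or None if not given.
    nodes_exact:      value of -N when given as a single number, else None.
    threads_per_core: value of --threads-per-core, or None.
    """
    if explicit_n is not None:
        return explicit_n  # user gave -n: trust it, no bump
    if nodes_exact is not None and threads_per_core is not None:
        return nodes_exact * threads_per_core  # implicit bump
    return nodes_exact or 1  # default: one task
```

So -N1 --threads-per-core=2 alone yields two tasks, while adding -n1 pins the count to one.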

I'll keep you posted on the progress.

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/commit/b4cced8d62140f9c7e10d1cfcea620f57880f3f1
Comment 30 Marcin Stolarek 2021-07-26 01:36:00 MDT
An additional documentation improvement was merged to our main repository[1].

This one documents the most confusing behavior: SLURM_NTASKS (and effectively the number of tasks started by srun under sbatch/salloc) is affected by --threads-per-core. The actual behavior is:
 If the user gave an exact number of nodes (not a min-max range), i.e. -N X where X is a single number, and specified --threads-per-core, --sockets-per-node, or --cores-per-socket, the tools will set the requested number of tasks to the product of those. This is aligned with the general Slurm default of starting one task per CPU.
 There is a slight inconsistency in the way the logic is applied to sbatch/salloc versus srun. For srun, the number of requested tasks is only affected if all of the options were set, while for sbatch/salloc, if one of the "multi-core" options is not provided, we assume it is one for the task-bumping logic.
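The rule described above, including the sbatch/salloc versus srun inconsistency, can be sketched like this. The helper name and parameter shapes are hypothetical; this is a simplified model, not the actual Slurm source:

```python
def implicit_ntasks(nodes_exact, threads_per_core=None,
                    cores_per_socket=None, sockets_per_node=None,
                    for_srun=False):
    """Return the implicitly requested task count, or None if no bump.

    nodes_exact: the node count when -N was given as a single
    number (an exact count, not a min-max range), else None.
    """
    opts = (threads_per_core, cores_per_socket, sockets_per_node)
    if nodes_exact is None or all(o is None for o in opts):
        return None  # no exact -N, or no multi-core option: no bump
    if for_srun and any(o is None for o in opts):
        return None  # srun bumps only when all three options are set
    # sbatch/salloc treat any missing multi-core option as 1
    tpc, cps, spn = (o if o is not None else 1 for o in opts)
    return nodes_exact * spn * cps * tpc
```

For example, sbatch -N1 --threads-per-core=2 yields an implicit count of 2 (matching the reproducer above), while the same options on srun leave the task count unaffected because --sockets-per-node and --cores-per-socket were not also given.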

The behavior has been in the code for more than 10 years, so we are very cautious about changing it, since Slurm acts as a backend for many applications.

I'll keep the bug open, since we'll at least add client-side logging messages at verbose level to make it easier to figure out what's happening if anyone accidentally falls into the discussed case in the future.

Let me know if you have any questions.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/fa3ac862549408429f44c6c3e380984a2168ff67
Comment 35 Marcin Stolarek 2021-09-07 03:10:20 MDT
We pushed one more change. When the client is run at verbose logging level (at least one -v used) and the number of tasks is bumped implicitly, we'll print a message like:

># salloc -N1 -v --threads-per-core=2 true
>salloc: Number of tasks implicitly set to 2
>[...]

to emphasize this and make it easier to debug in the future.

Currently, we don't see an easy way to drop this (admittedly non-obvious) behavior, since it has been in the Slurm code for more than 10 years. It probably affects only a small fraction of use cases, but we can't estimate whether software using Slurm as a backend relies on it today.

Please let me know if you have more questions about this. If there is no reply, I'll close the ticket as "information given". Although we landed a few commits from this case, its resolution is mainly informational.

cheers,
Marcin
Comment 36 Marcin Stolarek 2021-09-07 03:26:42 MDT
PS. I just noticed I didn't reference a specific commit: 3e596f13c8[1]
[1]https://github.com/SchedMD/slurm/commit/3e596f13c884df828ef4d94d1971e1b69f016fee
Comment 37 Kevin Buckley 2021-09-08 01:42:44 MDT
Will close this for you, Marcin (rather than not replying): I can't see
that we can ask for anything else here.