Ticket 8147

Summary: Tasks on SMT Cores
Product: Slurm Reporter: Ulf Markwardt <Ulf.markwardt>
Component: slurmctld Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: cinek, jacob, jbooth
Version: 19.05.2   
Hardware: Linux   
OS: Linux   
CLE Version: Version Fixed: 20.02.2 20.11pre0
Attachments: fix assignment of INFINITE16 to ntasks_per_node (v1)
fix assignment of INFINITE16 to ntasks_per_node (v2)

Description Ulf Markwardt 2019-11-26 05:49:57 MST
Dear Slurm developers,

we are about to integrate a large number of AMD Rome nodes into our HPC system. In preparation, we have tested the SMT awareness of Slurm on another system based on AMD Naples. There we ran into an inconsistency between srun and sbatch when it comes to placing tasks on SMT cores...

Thank you,
Ulf



# Inconsistent Handling of SMT Capabilities

**Problem:** Slurm does not allow submitting batch files that allocate Simultaneous MultiThreading (SMT) capabilities on an AMD EPYC 7601 system. Interactive allocation using `srun`, on the other hand, works.
The target system is an HPC system with two AMD EPYC 7601 processors per node; each processor provides 32 cores, so with SMT this gives 128 tasks in total per node.
SMT is enabled/disabled via `--hint=multithread|nomultithread` and is disabled by default.

## SMT with srun

This works:
```
$ srun --nodes=1 --tasks-per-node=128 --hint=multithread echo hi | grep hi | wc -l
128
```

## SMT with SBATCH

Slurm does not accept the following job
```
#SBATCH -A hpcsupport
#SBATCH -J multi128
#SBATCH --hint=multithread
#SBATCH -N 1
#SBATCH --tasks-per-node=128

srun echo hi | grep hi | wc -l
```

complaining about
```
$ sbatch multithread.batch
sbatch: defined options
sbatch: -------------------- --------------------
sbatch: account             : hpcsupport
sbatch: hint                : multithread
sbatch: nodes               : 1
sbatch: ntasks              : 128
sbatch: verbose             : 1
sbatch: -------------------- --------------------
sbatch: end of defined options
sbatch: Consumable Resources (CR) Node Selection plugin loaded with argument 4372
sbatch: select/cons_tres loaded with argument 4372
sbatch: Linear node selection plugin loaded with argument 4372
sbatch: Cray/Aries node selection plugin loaded
sbatch: error: Batch job submission failed: Requested node configuration is not available
```

### Observations

We played with various settings, but had no success:

**Overcommit**

The job is only accepted after adding `#SBATCH --overcommit`, but the resulting scheduling does not make use of the SMT cores. This is expected behaviour.

**Exclusive**

Things do not change when additionally providing `--hint=multithread` while submitting the batch file:

```
$ sbatch --hint=multithread multithread.batch
```

**SLURM_HINT**

Setting the value of SLURM_HINT to multithread via the export command does not help. Moreover, the option `--hint=multithread` does not change the environment variable SLURM_HINT in either case.

# slurm.conf

Here is the relevant snippet of our slurm.conf:

```
SelectType=select/cons_res
SelectTypeParameters=CR_ONE_TASK_PER_CORE,CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
TaskPluginParam=Cpusets,Autobind=Threads
NodeName=n1 Procs=128  Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 FEATURE=mem256gb,noboost,booston,ghz-2.2,mhz-2200,mhz-1700,mhz-1200 RealMemory=250000 Weight=256
```
Comment 2 Jacob Jenson 2019-11-26 08:18:58 MST
Ulf,

Is this for a system that is being supported by Atos? If so, could you have Atos submit this issue? Most of this might be fixable through configuration options. Due to contract limitations this is the route we have to take. 

Jacob
Comment 4 Marcin Stolarek 2019-11-28 05:06:02 MST
Created attachment 12433 [details]
fix assignment of INFINITE16 to ntasks_per_node (v1)

Ulf,

I can reproduce it. The issue comes from an inconsistent size used by sbatch/srun to internally handle ntasks-per-core, which, in the case of --hint=multithread, defaults to "infinite".

Could you please apply the attached patch and verify if this eliminates the issue for you?

Alternatively, you can explicitly specify --ntasks-per-core (--ntasks-per-core=2 will be enough in this case) or overwrite it in job_submit plugin.

cheers,
Marcin
Comment 8 Ulf Markwardt 2019-11-29 03:04:30 MST
I am out of office until December 1, 2019.
/* For support questions please contact hpcsupport@zih.tu-dresden.de . */

Kind regards,
Ulf Markwardt
Comment 10 Marcin Stolarek 2019-12-10 05:03:07 MST
Comment on attachment 12439 [details]
fix assignment of INFINITE16 to ntasks_per_node (v2)

Ulf,

Were you able to apply and verify the patch from comment 4 ?

cheers,
Marcin
Comment 11 Ulf Markwardt 2019-12-13 02:06:51 MST
We have tested it today. The behavior is still the same :-(
Best,
Ulf
Comment 12 Ulf Markwardt 2019-12-13 03:00:40 MST
We have tested the patch: situation unchanged.
With --ntasks-per-core=2 jobs are accepted.
Comment 13 Marcin Stolarek 2019-12-13 03:46:38 MST
Could you please double-check that Slurm was fully rebuilt and installed from a new build with the patch applied, and that you're using the new sbatch?

If yes, please execute the following commands:
ls -l $(which sbatch)
gdb $(which sbatch)
(gdb) break proc_args.c:872
(gdb) run --hint='multithread' --wrap='sleep 100'
(gdb) n
(gdb) print *ntasks_per_core

and share the full output with us.

cheers,
Marcin
Comment 14 Ulf Markwardt 2020-03-23 04:32:35 MDT
Dear Slurm developers,

sorry for the long delay. We checked that Slurm was rebuilt with the
patch. The patch only partially fixes the issue. We see the following behavior:

1. #SBATCH Directive

If "#SBATCH --hint=multithread" is specified within a jobfile, the job
is rejected with "sbatch: error: Batch job submission failed: Requested
node configuration is not available"


2. Commandline Argument

Submitting the job via "sbatch --hint=multithread jobfile.sh" works and
gives all cores (incl. SMT).


3. Env. Variable SLURM_HINT

Lastly, we experimented with the environment variable SLURM_HINT. While
a submission combining "unset SLURM_HINT" with "#SBATCH
--hint=multithread" is rejected, it works when explicitly setting the
value of SLURM_HINT via "export SLURM_HINT=multithread".


Best
Ulf
Comment 15 Marcin Stolarek 2020-03-25 05:59:20 MDT
Hi Ulf,

>1. #SBATCH Directive [...]
I'm trying to reproduce it with a script like the one from comment 0:
># cat /tmp/testHT 
>#!/bin/bash
>#SBATCH --hint=multithread
>#SBATCH -N 1
>#SBATCH --tasks-per-node=128
>srun echo hi

using unpatched sbatch:
# /mnt/slurm/bin/sbatch /tmp/testHT 
sbatch: error: Batch job submission failed: Requested node configuration is not available

using patched sbatch:
# sbatch /tmp/testHT 
Submitted batch job 122
# grep hi slurm-122.out | wc -l 
128

Important slurm.conf parameters on my side:
# grep SelectTypePa /mnt/slurm/etc/slurm.conf | grep -v ^#
SelectTypeParameters=CR_ONE_TASK_PER_CORE,CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
# grep SelectTy /mnt/slurm/etc/slurm.conf | grep -v ^#
SelectType=select/cons_res
SelectTypeParameters=CR_ONE_TASK_PER_CORE,CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
# grep NodeName= /mnt/slurm/etc/slurm.conf | grep -v ^#
NodeName=test02 NodeHostName=slurmctl CPUs=128   CoreSpecCount=0   Sockets=2 CoresPerSocket=32  ThreadsPerCore=2                        State=UNKNOWN 

Are our configurations and job scripts aligned (parameters and their order)? Do you have any job_submit or cli_filter plugins potentially affecting the job description?


>2. Commandline Argument [...]
Just to be sure - this looks fine for you?


>3. Env. Variable SLURM_HINT
It's probably something I didn't fully explain after your initial message. SLURM_* variables are output variables in terms of sbatch and salloc, so when a job is submitted with sbatch --hint=X, SLURM_HINT will be set in the job environment. They are input variables for srun, so an srun inside a batch script will "inherit" --hint by default (if it's not unset explicitly before srun is executed).

At the same time, all SLURM_* variables are exported to the job environment, so if you export SLURM_HINT, srun will get it even if sbatch was called without --hint - this is what is happening in this case.

When srun is called inside a job allocation, it creates a step in that allocation, so at that point no option can affect the selection of cores (i.e. the slurmctld select plugin activity), but it can change TaskPlugin behavior - task affinity.

In those terms --hint is a little bit special, since depending on the context it affects either both of these or only task affinity. What it does for sbatch/salloc is:
if both --ntasks-per-core and --threads-per-core are unspecified but --hint=multithread is used, it sets --ntasks-per-core=infinite and sets the SLURM_HINT output variable. This variable will be interpreted by srun, which results in --cpu-bind=threads and removal of CR_ONE_TASK_PER_CORE.
Is it possible that you're mixing --hint with --threads-per-core/--ntasks-per-core in your job script from point 1?

cheers,
Marcin
Comment 18 Marcin Stolarek 2020-04-23 03:04:43 MDT
Ulf,

The patch for the issue was merged[1] into slurm-20.02 branch and will be part of 20.02.2 release.

I'm closing this now. Should you have any questions, please reopen.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/e5d9b71bebbeea956997cebd01bf693a1b294b62