Dear Slurm developers, we are about to integrate a large number of AMD Rome nodes into our HPC system. In preparation, we have tested the SMT awareness of Slurm on another system based on AMD Naples. There we have run into an inconsistency between srun and sbatch when it comes to placing tasks on SMT cores... Thank you, Ulf

# Inconsistent Handling of SMT Capabilities

**Problem:** Slurm does not accept batch files that request Simultaneous Multithreading (SMT) capabilities on an AMD EPYC 7601 system. On the other hand, interactive allocation using `srun` works. The target system is an HPC system with two AMD EPYC 7601 processors per node; each processor provides 32 cores, so with SMT enabled there are 128 hardware threads (tasks) in total per node. SMT is enabled/disabled via `--hint=multithread|nomultithread` and is disabled by default.

## SMT with srun

This works:

```
$ srun --nodes=1 --tasks-per-node=128 --hint=multithread echo hi | grep hi | wc -l
128
```

## SMT with sbatch

Slurm does not accept the following job

```
#SBATCH -A hpcsupport
#SBATCH -J multi128
#SBATCH --hint=multithread
#SBATCH -N 1
#SBATCH --tasks-per-node=128
srun echo hi | grep hi | wc -l
```

complaining about

```
$ sbatch multithread.batch
sbatch: defined options
sbatch: -------------------- --------------------
sbatch: account             : hpcsupport
sbatch: hint                : multithread
sbatch: nodes               : 1
sbatch: ntasks              : 128
sbatch: verbose             : 1
sbatch: -------------------- --------------------
sbatch: end of defined options
sbatch: Consumable Resources (CR) Node Selection plugin loaded with argument 4372
sbatch: select/cons_tres loaded with argument 4372
sbatch: Linear node selection plugin loaded with argument 4372
sbatch: Cray/Aries node selection plugin loaded
sbatch: error: Batch job submission failed: Requested node configuration is not available
```

### Observations

We played with various settings, but had no success:

**Overcommit** The job is only accepted when adding `#SBATCH --overcommit`.
However, the resulting scheduling does not make use of the SMT cores. This is expected behaviour.

**Exclusive** Things don't change when providing `--hint=multithread` while submitting the batch file:

```
$ sbatch --hint=multithread multithread.batch
```

**SLURM_HINT** Setting the value of SLURM_HINT to multithread via an `export` command does not help. Moreover, the option `--hint=multithread` does not change the environment variable SLURM_HINT in any case.

# slurm.conf

Here is the relevant snippet of our slurm.conf:

```
SelectType=select/cons_res
SelectTypeParameters=CR_ONE_TASK_PER_CORE,CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
TaskPluginParam=Cpusets,Autobind=Threads
NodeName=n1 Procs=128 Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 FEATURE=mem256gb,noboost,booston,ghz-2.2,mhz-2200,mhz-1700,mhz-1200 RealMemory=250000 Weight=256
```
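As a sanity check for a report like this, the node topology that Slurm compares against the slurm.conf node definition can be inspected directly on a compute node. This is a sketch using standard Linux/Slurm tools, not commands from the original report:

```shell
# Report sockets, cores per socket, and threads per core as the OS sees them:
lscpu | grep -E '^(Socket|Core|Thread)'
# On a node with slurmd installed, this prints the slurm.conf line matching
# the detected hardware (commented out: requires a Slurm installation):
# slurmd -C
```

The `lscpu` line should report `Thread(s) per core: 2` on an SMT-enabled EPYC node, matching `ThreadsPerCore=2` in the configuration above.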
Ulf,

Is this for a system that is supported by Atos? If so, could you have Atos submit this issue? Most of this might be fixable through configuration options. Due to contract limitations, this is the route we have to take.

Jacob
Created attachment 12433 [details]
fix assignemnt of INFINITE16 to ntasks_per_node(v1)

Ulf,

I can reproduce it. The issue comes from an inconsistent integer size used by sbatch/srun to internally handle --ntasks-per-core, which in the case of --hint=multithread defaults to "infinite". Could you please apply the attached patch and verify whether it eliminates the issue for you? Alternatively, you can explicitly specify --ntasks-per-core (--ntasks-per-core=2 will be enough in this case) or override it in a job_submit plugin.

cheers,
Marcin
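For reference, the explicit --ntasks-per-core workaround suggested here can be expressed in the batch script itself. This is a sketch assembled from the job script in the original report (account name, job name, and task counts are taken from there); it requires a Slurm cluster to actually run:

```shell
#!/bin/bash
#SBATCH -A hpcsupport
#SBATCH -J multi128
#SBATCH -N 1
#SBATCH --tasks-per-node=128
#SBATCH --hint=multithread
# Explicitly allowing two tasks per physical core sidesteps the
# problematic "infinite" default that --hint=multithread sets:
#SBATCH --ntasks-per-core=2
srun echo hi | grep hi | wc -l
```

With 64 physical cores per node and two threads per core, `--ntasks-per-core=2` accommodates the requested 128 tasks.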
I am out of office until December 1, 2019. /* For support questions please contact hpcsupport@zih.tu-dresden.de . */ Kind regards, Ulf Markwardt
Comment on attachment 12439 [details]
fix assignemnt of INFINITE16 to ntasks_per_node(v2)

Ulf,

Were you able to apply and verify the patch from comment 4?

cheers,
Marcin
We have tested it today. The behavior is still the same :-( Best, Ulf
We have tested the patch: situation unchanged. With --ntasks-per-core=2 jobs are accepted.
Could you please double-check that Slurm was fully rebuilt and installed from a new build with the patch applied, and that you're using the new sbatch? If yes, please execute the following commands:

# ls -l $(which sbatch)
# gdb $(which sbatch)
(gdb) break proc_args.c:872
(gdb) run --hint='multithread' --wrap='sleep 100'
(gdb) n
(gdb) print *ntasks_per_core

and share the full output with us.

cheers,
Marcin
Dear Slurm developers, sorry for the long delay. We verified that Slurm was rebuilt with the patch. The patch fixes the issue only partly. We see the following behavior:

1. #SBATCH Directive
If "#SBATCH --hint=multithread" is specified within a job file, the job is rejected with "sbatch: error: Batch job submission failed: Requested node configuration is not available".

2. Command-line Argument
Submitting the job via "sbatch --hint=multithread jobfile.sh" works and gives all cores (incl. SMT).

3. Env. Variable SLURM_HINT
Finally, we experimented with the environment variable SLURM_HINT. While submission using the combination "unset SLURM_HINT" and "#SBATCH --hint=multithread" is rejected, it works when explicitly setting the value via "export SLURM_HINT=multithread".

Best
Ulf
Hi Ulf,

>1. #SBATCH Directive [...]

I'm trying to reproduce it with a script like the one from comment 0:

># cat /tmp/testHT
>#!/bin/bash
>#SBATCH --hint=multithread
>#SBATCH -N 1
>#SBATCH --tasks-per-node=128
>srun echo hi

using unpatched sbatch:
# /mnt/slurm/bin/sbatch /tmp/testHT
sbatch: error: Batch job submission failed: Requested node configuration is not available

using patched sbatch:
# sbatch /tmp/testHT
Submitted batch job 122
# grep hi slurm-122.out | wc -l
128

Important slurm.conf parameters on my side:
# grep SelectTy /mnt/slurm/etc/slurm.conf | grep -v ^#
SelectType=select/cons_res
SelectTypeParameters=CR_ONE_TASK_PER_CORE,CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
# grep NodeName= /mnt/slurm/etc/slurm.conf | grep -v ^#
NodeName=test02 NodeHostName=slurmctl CPUs=128 CoreSpecCount=0 Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN

Are our configurations and job scripts aligned (parameters and their order)? Do you have any job_submit or cli_filter plugins potentially affecting the job description?

>2. Commandline Argument [...]

Just to be sure - this looks fine to you?

>3. Env. Variable SLURM_HINT

It's probably something I didn't fully explain after your initial message. SLURM_* variables are output variables in terms of sbatch and salloc, so when a job is submitted with sbatch --hint=X, SLURM_HINT will be set in the job environment. They are input variables for srun, so an srun inside a batch script will "inherit" --hint by default (unless the variable is unset explicitly before srun is executed). At the same time, all exported SLURM_* variables reach the job environment, so if you export SLURM_HINT, srun will pick it up even when sbatch was called without --hint - this is what is happening in this case.
When srun is called inside a job allocation it creates a step within that allocation, so no option given to it can affect the selection of cores (i.e., slurmctld select-plugin activity), but it can change the TaskPlugin behavior - task affinity. In these terms --hint is a little bit special, since depending on the context it affects both or only task affinity. What it does for sbatch/salloc is: if neither --ntasks-per-core nor --threads-per-core is specified but --hint=multithread is used, it sets --ntasks-per-core=infinite and sets the SLURM_HINT output variable. This variable is then interpreted by srun, which results in --cpu-bind=threads and removal of CR_ONE_TASK_PER_CORE. Is it possible that you're mixing --hint with --threads-per-core/--ntasks-per-core in your job script from point 1?

cheers,
Marcin
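The input/output variable behaviour described in the last two comments can be illustrated without a cluster. The sketch below only demonstrates ordinary environment inheritance, which is the mechanism srun relies on; SLURM_HINT is the real variable name, and plain `bash -c` stands in for an srun child process:

```shell
# An exported SLURM_HINT reaches any child process, just as it would reach
# an srun launched inside a batch script:
export SLURM_HINT=multithread
bash -c 'echo "child sees: ${SLURM_HINT}"'
# If it is unset before the child starts, the "srun" no longer inherits it:
unset SLURM_HINT
bash -c 'echo "after unset: ${SLURM_HINT:-unset}"'
```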
Ulf,

The patch for this issue was merged [1] into the slurm-20.02 branch and will be part of the 20.02.2 release. I'm closing this now. Should you have any questions, please reopen.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/e5d9b71bebbeea956997cebd01bf693a1b294b62