Created attachment 12344 [details]
lstopo output

We have an AMD Rome compute node with 2 sockets, 8 NUMA nodes, 128 cores, and 256 threads. When we configure the node with Sockets=2 CoresPerSocket=64 ThreadsPerCore=2, we get these log messages from slurmd:

[2019-11-04T17:51:21.532] Considering each NUMA node as a socket
[2019-11-04T17:51:21.533] error: You are using cons_res or gang scheduling with Fastschedule=0 and node configuration differs from hardware. The node configuration used will be what is in the slurm.conf because of the bitmaps the slurmctld must create before the slurmd registers.
   CPUs=256:256(hw) Boards=1:1(hw) SocketsPerBoard=2:8(hw) CoresPerSocket=64:16(hw) ThreadsPerCore=2:2(hw)

When we change the configuration to Sockets=8 CoresPerSocket=16 ThreadsPerCore=2, we no longer get this message, but --hint=nomultithread no longer works correctly; tasks are placed on both hardware threads of the same core:

nid001002:~ # srun --exclusive -w nid001002 --cpu-bind=v,threads -c 2 --hint=nomultithread -n 2 /lus/dgloe/xthi
cpu-bind=MASK - nid001002, task  0  0 [66084]: mask 0x100000000000000000000000000000001 set
cpu-bind=MASK - nid001002, task  1  1 [66085]: mask 0x200000000000000000000000000000002 set
Hello from rank 000 thread 000 on nid001002 (core affinity = 0,128)
Hello from rank 000 thread 001 on nid001002 (core affinity = 0,128)
Hello from rank 001 thread 000 on nid001002 (core affinity = 1,129)
Hello from rank 001 thread 001 on nid001002 (core affinity = 1,129)

Which configuration should we use? Is the nomultithread behavior a bug?
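The cpu-bind=MASK lines above report one hexadecimal affinity mask per task. As a quick sketch (the decode_mask helper below is hypothetical, not part of Slurm or xthi), such a mask can be expanded into CPU ids to confirm that each task was bound to the two SMT siblings of a single core:

```shell
# Hypothetical helper (not part of Slurm or xthi): list the CPU ids that are
# set in a hex affinity mask such as the ones printed by --cpu-bind=verbose.
# Pure bash arithmetic, one hex digit at a time, since a 256-PU mask
# overflows ordinary shell integer math.
decode_mask() {
  local hex=${1#0x} out="" len i j d bitpos
  len=${#hex}
  for ((i = 0; i < len; i++)); do
    d=$((16#${hex:i:1}))               # value of the i-th hex digit from the left
    for ((j = 3; j >= 0; j--)); do     # that digit covers bits 4*(len-1-i)+0..3
      if (((d >> j) & 1)); then
        bitpos=$((4 * (len - 1 - i) + j))
        out="$bitpos${out:+,$out}"     # prepend, keeping ascending order
      fi
    done
  done
  echo "$out"
}

decode_mask 0x100000000000000000000000000000001   # -> 0,128 (SMT siblings of core 0)
```

The second task's mask, 0x200000000000000000000000000000002, decodes to 1,129 the same way: both logical CPUs of core 1.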
David,

There are a few things here. First of all, you should not use FastSchedule=0. You should see the following error message as one of the first lines of slurmctld/slurmd output:

> error: FastSchedule will be removed in 20.02, as will the FastSchedule=0 functionality. Please consider removing this from your configuration now.

AMD Rome (like the rest of the Epyc line) is another AMD design with multiple NUMA nodes per socket. By default, Slurm considers every NUMA node a separate socket, since from the application's perspective that matters more than the number of physical packages.

I see that you have both --cpu-bind=v,threads and --hint=nomultithread. --hint is implemented as a modification of the same job description member as --cpu-bind, and the only --cpu-bind option allowed together with --hint is verbose. If options other than verbose are specified to --cpu-bind, the value of --hint is ignored.

Could you please re-run the test, dropping "threads" from --cpu-bind? Could you also elaborate on the goal you're trying to achieve with --hint=nomultithread?

cheers,
Marcin
I don't see any mention of --cpu-bind overriding --hint in the srun man page.

A user reports that --hint=nomultithread worked with --cpu-bind=threads in Slurm 18.08.7 (although on a different processor):

[saraha@osprey misc]$ srun --mpi=pmi2 -n2 --hint=nomultithread --cpu-bind=threads -c 2 ./xthi | sort
Hello from rank 000 thread 00 on prod-0065 (core affinity = 0)
Hello from rank 000 thread 01 on prod-0065 (core affinity = 1)
Hello from rank 001 thread 00 on prod-0065 (core affinity = 28)
Hello from rank 001 thread 01 on prod-0065 (core affinity = 29)

[saraha@osprey misc]$ srun --mpi=pmi2 -n2 --cpu-bind=threads -c 2 ./xthi | sort
Hello from rank 000 thread 00 on prod-0065 (core affinity = 0)
Hello from rank 000 thread 01 on prod-0065 (core affinity = 56)
Hello from rank 001 thread 00 on prod-0065 (core affinity = 28)
Hello from rank 001 thread 01 on prod-0065 (core affinity = 84)

What we're trying to do is disable use of SMT threads.
Created attachment 12354 [details]
error message + documentation about --hint ignore (v1)

David,

I can confirm there was a commit changing this behavior [1]; previously the final result depended on the order of salloc/sbatch/srun arguments. The attached patch adds an end-user error message informing that the option is ignored and documents this in the appropriate man pages. I'm targeting the additional end-user messages at 20.02, since this is a user-facing change; however, if you like, the patch should apply cleanly on top of 19.05.

Did you change slurm.conf between the tests in the bug report and comment 3? In comment 3, both attempts resulted in the allocation you wanted to achieve. This behavior depends on the number of "CPUs" you specify in the node configuration line of slurm.conf: if it's equal to the number of cores (not the total number of threads), or you have SelectTypeParameters=CR_ONE_TASK_PER_CORE set in your slurm.conf, cores are treated as one PU (processing unit). If you prefer to keep CPUs equal to the total number of SMT threads, you can achieve one PU per core by adding the --threads-per-core=1 option to your srun.

Like below (I have ThreadsPerCore=2; core 0 has PU0 and PU1, core 1 has PU2 and PU3, and so on):

1. Default, every thread treated as a PU:

# srun --mem=10 -n2 -c2 /bin/bash -c 'taskset -cp $$'
srun: job 899 queued and waiting for resources
srun: job 899 has been allocated resources
pid 29265's current affinity list: 2,3
pid 29264's current affinity list: 0,1

2. Added --threads-per-core=1, so each core is treated as one PU. Each task gets 4 threads = 2 cores; binding is done to all threads in the core:

# srun --mem=10 -n2 --threads-per-core=1 -c2 /bin/bash -c 'taskset -cp $$'
srun: job 898 queued and waiting for resources
srun: job 898 has been allocated resources
pid 29194's current affinity list: 0-3
pid 29195's current affinity list: 4-7

3. Added --cpu-bind=thread to bind processes to only one SMT per core:
# srun --mem=10 -n2 -c2 --threads-per-core=1 --cpu-bind=thread /bin/bash -c 'taskset -cp $$'
srun: job 912 queued and waiting for resources
srun: job 912 has been allocated resources
pid 7512's current affinity list: 4,6
pid 7511's current affinity list: 0,2

I hope you find this helpful.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/4cf80f20736b0ce81471582c56f2aca20ad8df27
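To put the two slurm.conf alternatives described above into concrete terms, a rough sketch follows (node name and hardware counts taken from this report; treat the exact lines as illustrative and check them against the slurm.conf man page for your version before using):

```
# Alternative 1 (sketch): set CPUs to the number of physical cores rather than
# the number of threads, so Slurm treats each core as a single PU.
NodeName=nid001002 CPUs=128 Sockets=8 CoresPerSocket=16 ThreadsPerCore=2

# Alternative 2 (sketch): keep CPUs equal to the thread count, but allocate
# whole cores to tasks cluster-wide.
SelectType=select/cons_res
SelectTypeParameters=CR_Core,CR_ONE_TASK_PER_CORE
```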
David,

Did you try the attached patch? Is there anything else I can help you with regarding this issue?

cheers,
Marcin
Using --cpu-bind=thread --threads-per-core=1 meets our needs. I haven't tried out the patch.
David,

In 20.11 we've added client-side verification for mutually exclusive options that were previously silently ignored, and documented this in the sbatch/srun/salloc manual pages.

Cheers,
Marcin