Ticket 15648 - Can hyperthreading be configured on all nodes and with "--hint=nomultithread" be made to not impact jobs where hyperthreading is an issue?
Summary: Can hyperthreading be configured on all nodes and with "--hint=nomultithread"...
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: - Unsupported Older Versions
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Carlos Tripiana Montes
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-12-16 12:38 MST by Mike Woodson
Modified: 2022-12-21 12:09 MST (History)

See Also:
Site: Cornell ITSG


Attachments

Description Mike Woodson 2022-12-16 12:38:17 MST
Currently running 20.11.8. Hope to upgrade to the newest in March of next year. 

There is a lot of documentation on your site regarding hyperthreading and how to configure it. I am still a bit confused. 

Our cluster is a shared cluster where about 50 faculty have each purchased compute and NFS nodes, which we have combined into a SLURM cluster, utilizing preemption. All compute nodes are in a default partition with a PriorityTier of 10. Each research group then has a priority partition (PriorityTier of 20 or higher) which only they can submit to, which, using preemption, gives them almost instant access to their compute nodes. There is no consistent job type or software used, as users come from all kinds of Engineering and Computer Science fields. 

We have hyperthreading turned off on all compute nodes because several users requested that when we built the cluster in 2018. We were under the assumption that it had to be all on or all off to give users a consistent experience. Since turning it on would cause a performance impact for some users (according to them), we chose off. 

We now have an owner who is insisting that hyperthreading and oversubscription be turned on for his compute node. The options seem to be: 1) turn on hyperthreading on his server and take it out of the default partition, letting his group access it only from their priority partition; 2) take it out of the cluster completely; 3) turn on hyperthreading cluster-wide and possibly impact certain types of jobs; or 4) come up with a way of configuring the cluster such that users who want hyperthreading can have it and users who don't can avoid it (without splitting the cluster into a group with hyperthreading turned on and another group with it off).

His comment on this is:
*****************************************************************
> By over-provisioning, do you mean that multiple jobs can be submitted which will request more resources than is available on the server?
Yes, this is commonly done in hypervisors and in lower cost cloud provider pools. If we can't turn on HT we should be able to overprovision based on some function of system load and manual OP limit setting. I believe SLURM has built in functionality for this.

Regarding HT and using other queues, we are only requesting that HT be turned on for our machines. When we submit to other queues, we run the risk of being preempted, which means that *users who care about repeatability and reliability should not submit to other queues* since they will most likely be preempted and their job will run on a different machine when it gets resubmitted. While HT had performance risks when it first came out almost 20 years ago, modern CPUs have very few use cases where HT is a performance detriment. In the majority of computational settings, HT provides a nontrivial (25%) performance increase.

Since Kevin's group owns the two machines in G2 under his name, *there is no reason that we should have to give up 25% of the performance we paid for in the name of allowing other users to run jobs, especially if those are preemptable jobs*. We already have a heterogeneous G2 setup with different CPU counts, GPU counts, and CPU and GPU types. There is no reason not to add another flag that allows users to choose whether they want a machine with HT or not, if that is something they care about.
*****************************************************************

I have read up on the --hint switch and am trying to test it now, but I don't seem to get the same kind of results as the examples in "https://slurm.schedmd.com/mc_support.html#srun_hints". 

If I understand it correctly, if we were to have the default be "ntasks-per-core=1" and have hyperthreading turned on in the BIOS of each server, each core would have two virtual cores. In scheduling a job, it looks like the second virtual core on a core would also be allocated to that job, but does it mean that the single thread of the task would have full access to the complete resources of the core, or still be limited to the resources of 1 virtual core? And how would we test that? 
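One way to test that (a sketch; it assumes Slurm task affinity/cgroup binding is enabled on a node configured with ThreadsPerCore=2) is to print the CPU mask a task actually receives:

```
# Show which logical CPUs one task is bound to. With --ntasks-per-core=1
# the mask should contain both hyperthread siblings of each allocated core
# (e.g. "0,32"); with --hint=nomultithread it should contain only one
# sibling per core.
srun -N1 -n1 --cpu-bind=verbose grep Cpus_allowed_list /proc/self/status
srun -N1 -n1 --hint=nomultithread --cpu-bind=verbose grep Cpus_allowed_list /proc/self/status

# Map logical CPUs to physical cores to interpret the mask:
lscpu --extended=CPU,CORE,SOCKET
```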

Mike
Comment 1 Carlos Tripiana Montes 2022-12-19 11:01:52 MST
Hi Mike,

I think the core (key) question is this one, at the end of the day:

> If I understand it correctly, if we were to have the default be
> "ntasks-per-core=1" and have hyperthreading turned on in the BIOS of each
> server, each core would have two virtual cores. In scheduling a job, it
> looks like the second virtual core on a core would also be allocated to that
> job, but does it mean that the single thread of the task would have full
> access to the complete resources of the core, or still be limited to the
> resources of 1 virtual core? And how would we test that? 

Based on my testbed, you should do the following:

1. Enable HT in the BIOS, restart the system.
2. Edit slurm.conf, and move from:
CPUs=X*Y CoresPerSocket=X Sockets=Y ThreadsPerCore=1
to:
CPUs=X*Y*2 CoresPerSocket=X Sockets=Y ThreadsPerCore=2
3. Restart the Slurm daemons.
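As a concrete sketch (the node name and counts below are made up for illustration), for a 2-socket, 16-cores-per-socket node the slurm.conf node definition would move from:

```
NodeName=node01 CPUs=32 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1
```

to:

```
NodeName=node01 CPUs=64 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2
```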

If anybody asks for cores, and uses pinning to cores, everything will work as before, even though only 1 logical CPU per core will be in use if you look at something like the top command.

If anybody wants to use logical cpus in their jobs, then they must configure the job accordingly.

The question... the question is that users aren't going to set things up well at first, and will complain about the change. And sometimes MPI flavours do things differently from srun, and pinning is not always what you might think... Well, this is the main reason admins most of the time just disable HT: most HPC codes simply don't run as performant with HT as without it. But I am not saying it's because of HT itself; I am saying the bad thing is to put 2 MPI tasks on 1 physical core (1 per logical CPU). To scale properly, most HPC codes need real cores, as logical CPUs share some hardware resources inside each physical core and those codes heavily use everything a core can offer.

I am a bit reluctant to recommend a global change for only one user, even if they own some hardware, if it can potentially impact the job scripts of all the other users and cause noise and trouble for them until they get used to the new configuration. Because, as you see, there's no magic, transparent way of making such a big change.

I'd suggest putting the node(s) with HT enabled in a separate partition from the ones with ThreadsPerCore=1, and allowing the rest of the users to access them at lower priority.
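A minimal sketch of that layout in slurm.conf (node, partition, and group names here are hypothetical):

```
# HT-enabled node owned by the requesting group
NodeName=ht-node01 CPUs=64 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2

# Owner's high-priority partition (preempts the shared one)
PartitionName=owner-ht Nodes=ht-node01 AllowGroups=owner_grp PriorityTier=20
# Separate low-priority partition for anyone who explicitly wants HT nodes
PartitionName=hyperthread Nodes=ht-node01 PriorityTier=10
```

Jobs from the rest of the cluster would then land on the HT node only if users submit to the hyperthread partition, so existing job scripts keep their current behaviour.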

Regards,
Carlos.
Comment 2 Mike Woodson 2022-12-19 14:32:24 MST
Hi,

So, if I understand what you are suggesting:

It can be done, but all users will have to change the way that they submit jobs, since no one currently uses pinning to cores. 

With HT, MPI is an issue: if two MPI processes land on the same physical core (2 logical CPUs), both want to use all of the resources of the physical core, and since the 2 logical CPUs share some resources on the physical core, it slows things down (unless the cores are pinned). 

If this owner wants HT turned on for their nodes, it sounds like the best solution is to remove the nodes from the default partition and possibly create another low-priority partition (maybe called "hyperthreading") for anyone who wants to use that node. The owner can use their higher-priority partition to submit jobs to their nodes, but since no other nodes will have HT turned on, they will not be able to utilize it elsewhere. 

Is this correct? 

Mike
Comment 3 Carlos Tripiana Montes 2022-12-20 05:57:38 MST
> It can be done, but all users will have to change the way that they submit
> jobs, since no one currently uses pinning to cores. 

Slurm does its own default pinning as well.
MPI flavours do default pinning as well, sometimes differently from Slurm.

But if you fall back to default pinning, it is sometimes different when ThreadsPerCore changes. So it depends on what the user is using/doing, but chances are the pinning could change.
 
> With HT, MPI is an issue: if two MPI processes land on the same physical
> core (2 logical CPUs), both want to use all of the resources of the
> physical core, and since the 2 logical CPUs share some resources on the
> physical core, it slows things down (unless the cores are pinned). 

If the user confuses the concept of a core with a logical CPU, and tries to use each logical CPU as if it were a physical core, then what you are describing happens.

If, for example, you set 1 task per core and ask for the number of cores you need, Slurm by default (if launched with srun) puts 1 MPI task per physical core. But again, chances are that a user has a combination of flags/MPI type that behaves badly after changing ThreadsPerCore.
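As an illustration of the difference (the srun flags are real options; the 16-core/32-thread node layout is assumed):

```
# 1 MPI task per physical core: 16 ranks, each given a full core
# (both hyperthread siblings), which is what most HPC codes want.
srun -N1 --ntasks-per-core=1 -n16 ./mpi_app

# 1 task per logical CPU: 32 ranks, 2 per physical core -- the
# placement that typically hurts performance.
srun -N1 --ntasks-per-core=2 -n32 ./mpi_app
```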

> If this owner wants HT turned on for their nodes, it sounds like the best
> solution is to remove the nodes from the default partition and possibly
> create another low-priority partition (maybe called "hyperthreading") for
> anyone who wants to use that node. The owner can use their higher-priority
> partition to submit jobs to their nodes, but since no other nodes will have
> HT turned on, they will not be able to utilize it elsewhere. 

Maybe it's a conservative option, but it's an alternative where *you know for sure it's not going to affect every single other job in the system*. So you probably want to explore it.

Regards,
Carlos.
Comment 4 Carlos Tripiana Montes 2022-12-21 08:46:38 MST
Hi,

Please let us know if you need further assistance, or whether this can be marked as resolved/infogiven.

Thanks,
Carlos.
Comment 5 Mike Woodson 2022-12-21 10:38:06 MST
I believe that you have given me what I need.

Thanks!

Mike


Comment 6 Carlos Tripiana Montes 2022-12-21 12:09:22 MST
Closing now.

Thanks,
Carlos