| Summary: | hyper threading | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | surendra <surendra.sunkari> |
| Component: | Configuration | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | tim |
| Version: | 17.11.1 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NREL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
I think you're interested in the --hint=[no]multithread option for srun and sbatch. Leave the nodes with hyperthreading enabled, and users can control how their tasks are bound themselves (or you as an admin can use a job submit plugin to do it for them).

A quick example with this node config:

```
NodeName=DEFAULT RealMemory=3000 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2
```

```
$ srun -n8 whereami
0001 v1 - Cpus_allowed: 11 Cpus_allowed_list: 0,4
0006 v1 - Cpus_allowed: 88 Cpus_allowed_list: 3,7
0000 v1 - Cpus_allowed: 11 Cpus_allowed_list: 0,4
0005 v1 - Cpus_allowed: 44 Cpus_allowed_list: 2,6
0007 v1 - Cpus_allowed: 88 Cpus_allowed_list: 3,7
0003 v1 - Cpus_allowed: 22 Cpus_allowed_list: 1,5
0004 v1 - Cpus_allowed: 44 Cpus_allowed_list: 2,6
0002 v1 - Cpus_allowed: 22 Cpus_allowed_list: 1,5
```

```
$ srun -n8 --hint=multithread whereami
0007 v1 - Cpus_allowed: 80 Cpus_allowed_list: 7
0001 v1 - Cpus_allowed: 10 Cpus_allowed_list: 4
0005 v1 - Cpus_allowed: 40 Cpus_allowed_list: 6
0003 v1 - Cpus_allowed: 20 Cpus_allowed_list: 5
0004 v1 - Cpus_allowed: 04 Cpus_allowed_list: 2
0006 v1 - Cpus_allowed: 08 Cpus_allowed_list: 3
0000 v1 - Cpus_allowed: 01 Cpus_allowed_list: 0
0002 v1 - Cpus_allowed: 02 Cpus_allowed_list: 1
```

```
$ srun -n8 --hint=nomultithread whereami
0005 v2 - Cpus_allowed: 10 Cpus_allowed_list: 4
0006 v2 - Cpus_allowed: 02 Cpus_allowed_list: 1
0007 v2 - Cpus_allowed: 20 Cpus_allowed_list: 5
0000 v1 - Cpus_allowed: 01 Cpus_allowed_list: 0
0004 v2 - Cpus_allowed: 01 Cpus_allowed_list: 0
0002 v1 - Cpus_allowed: 02 Cpus_allowed_list: 1
0001 v1 - Cpus_allowed: 10 Cpus_allowed_list: 4
0003 v1 - Cpus_allowed: 20 Cpus_allowed_list: 5
```

Is this an acceptable solution?

A note on the output from comment 1: notice that without --hint=[no]multithread, the tasks are bound to the whole core, but there can be as many tasks on a core as there are hyperthreads on the core. With --hint=multithread, each task is bound to a single hyperthread. With --hint=nomultithread, each task is bound to a single core, and only a single task per core is permitted.

Also, I'm dropping the severity to sev-4. I was told that your site joined support a little early (before training in October), but that all your bugs should be sev-4 for now. I haven't read the contract or email, so I don't know exactly when you'll be able to submit severity 1-3 bugs - just keep in contact with whoever is managing the contract with you (I assume Jacob or Jess).

I'm closing this as resolved/infogiven. I found out that your "official" support contract period starts on Oct 1 (when you can submit any severity ticket), and we're supporting sev-4 tickets until then.

(Forgot to close the ticket - actually closing.) Feel free to reopen if you have further questions regarding hyperthreading.
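The same hint works from sbatch, so a user who wants one binding or the other can set it once in the batch script rather than on every srun. A minimal sketch (the task count matches the example above; `./my_app` is a placeholder application):

```shell
#!/bin/bash
#SBATCH -n 8                     # eight tasks
#SBATCH --hint=nomultithread     # bind one task per physical core

# Job steps inherit the hint from sbatch, so this srun binds each
# task to its own core. Submitting with --hint=multithread instead
# would pack one task per hyperthread, as shown in comment 1.
srun ./my_app
```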
We have some users that want hyperthreads on and some users that want hyperthreads off. We added a prolog that allows the user to request that their job run with or without hyperthreads by setting a comment in the batch file. It appears, though, that Slurm checks the number of cores at some point and downs nodes with the wrong number of cores. Is there a way to configure Slurm to handle this situation? We basically need Slurm to allow us to use either all of the cores or half of the cores.
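On the node-downing symptom: when a node comes up with hyperthreading toggled, the hardware slurmd detects no longer matches that node's NodeName line in slurm.conf, and the controller marks the node down for a core-count mismatch. `slurmd -C` prints what slurmd detects, in slurm.conf format, which makes the mismatch easy to see. A sketch (the hostname and counts below are illustrative, for a 1-socket, 4-core, 2-thread node):

```shell
# Print the node configuration slurmd detects on this host and
# compare it with the node's NodeName definition in slurm.conf.
slurmd -C
# With hyperthreading enabled, illustrative output:
#   NodeName=node01 CPUs=8 Boards=1 SocketsPerBoard=1 \
#     CoresPerSocket=4 ThreadsPerCore=2 RealMemory=3000
# Booting with hyperthreading disabled halves CPUs and drops
# ThreadsPerCore to 1, which no longer matches slurm.conf -
# hence the node being set down.
```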