Ticket 14654 - --threads-per-core=1 gives cannot request more threads per core than job allocation error
Summary: --threads-per-core=1 gives cannot request more threads per core than job allo...
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 21.08.6
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Skyler Malinowski
Reported: 2022-08-01 10:09 MDT by David Gloe
Modified: 2022-08-17 07:39 MDT

Site: CRAY
Cray Sites: Cray Internal
Version Fixed: 21.08.9; 22.05.4; 23.02.0pre1


Attachments
slurm.conf file (29.88 KB, text/plain)
2022-08-01 10:09 MDT, David Gloe

Description David Gloe 2022-08-01 10:09:44 MDT
Created attachment 26092
slurm.conf file

On an internal system, we're seeing an issue where srun by itself works fine, but adding --threads-per-core=1 or --hint=nomultithread fails with "Cannot request more threads per core than the job allocation". This happens both inside an salloc allocation and when running srun directly.

dgloe@hotlum-login:~> srun hostname
x1000c0s7b1n1
dgloe@hotlum-login:~> srun --threads-per-core=1 hostname
srun: error: Unable to create step for job 6276: Cannot request more threads per core than the job allocation
dgloe@hotlum-login:~> srun --hint=nomultithread hostname
srun: error: Unable to create step for job 6277: Cannot request more threads per core than the job allocation
dgloe@hotlum-login:~> salloc --threads-per-core=1
salloc: Granted job allocation 6278
salloc: Waiting for resource configuration
salloc: Nodes x1000c0s7b1n1 are ready for job
dgloe@hotlum-login:~> srun --threads-per-core=1 hostname
srun: error: Unable to create step for job 6278: Cannot request more threads per core than the job allocation
Comment 1 Skyler Malinowski 2022-08-02 13:34:37 MDT
I can reproduce this behavior. It looks to be an issue with select/linear specifically. I will keep you posted as I know more.

Thanks,
Skyler
Comment 3 Skyler Malinowski 2022-08-10 16:45:42 MDT
This is a regression caused in 21.08.6. I have created a patch that is out for review.
Comment 7 Skyler Malinowski 2022-08-16 14:06:28 MDT
Out of curiosity, why are you using `select/linear` instead of `select/cons_tres`? `select/cons_tres` can be used for whole node allocations too.


Also, I noticed in your slurm.conf that you could simplify the node section with the NodeName=DEFAULT meta-node.

```
# slurm.conf

NodeName=DEFAULT RealMemory=512000 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 State=idle
NodeName=x1000c0s0b0n0
NodeName=x1000c0s0b0n1
NodeName=x1000c0s0b1n0
... (omitted) ...
NodeName=x1000c7s7b1n1
```
Comment 8 David Gloe 2022-08-16 14:17:16 MDT
This is how the system was set up by the admins; I'm not sure why they used select/linear. I've recommended that they use select/cons_res, which is what we typically use.

Is there an advantage to using select/cons_tres instead of select/cons_res?
Comment 9 Skyler Malinowski 2022-08-16 14:46:40 MDT
`select/cons_tres` is a superset of `select/cons_res` and has more features than `select/linear`.


https://slurm.schedmd.com/cons_res.html#using_cons_tres

> Slurm's default select/linear plugin is using a best fit algorithm based on
> number of consecutive nodes. The same node allocation approach is used with
> select/cons_res and select/cons_tres for consistency.

> Consumable Trackable Resources (cons_tres) plugin provides all the same
> functionality provided by the Consumable Resources (cons_res) plugin. It also
> includes additional functionality specifically related to GPUs.

> The --exclusive srun option allows users to request nodes in exclusive mode
> even when consumable resources is enabled. See the srun man page for details.
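For reference, switching plugins is a slurm.conf change followed by a restart or reconfigure of the Slurm daemons. A minimal sketch (the SelectTypeParameters value below is an assumption, not taken from this site's configuration; choose whichever consumable-resource granularity the site needs):

```
# slurm.conf -- hedged sketch, not the site's actual configuration
SelectType=select/cons_tres
# CR_Core_Memory treats cores and memory as consumable resources;
# other values such as CR_CPU or CR_Core are possible depending on policy.
SelectTypeParameters=CR_Core_Memory
```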
Comment 12 Skyler Malinowski 2022-08-17 07:39:04 MDT
Commit c728da23f8 merged in for 21.08.9, 22.05.4, and 23.02.0pre1.

Please note that 21.08.9 does not have a planned release date and may never be released. Fixes are always propagated upward, so please consider moving to 22.05.4 should 21.08.9 not be released.

Cheers,
Skyler