| Summary: | --threads-per-core=1 gives "cannot request more threads per core than job allocation" error | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | David Gloe <david.gloe> |
| Component: | Configuration | Assignee: | Skyler Malinowski <skyler> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | alex |
| Version: | 21.08.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | CRAY | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | Cray Internal |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 21.08.9; 22.05.4; 23.02.0pre1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf file | | |
I can reproduce this behavior. It looks to be an issue with select/linear specifically. I will keep you posted as I know more.

Thanks,
Skyler

This is a regression introduced in 21.08.6. I have created a patch that is out for review.

Out of curiosity, why are you using `select/linear` instead of `select/cons_tres`? `select/cons_tres` can be used for whole-node allocations too. Also, I notice in your slurm.conf that you could simplify the node section with the meta node NodeName=DEFAULT:

```
# slurm.conf
NodeName=DEFAULT RealMemory=512000 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 State=idle
NodeName=x1000c0s0b0n0
NodeName=x1000c0s0b0n1
NodeName=x1000c0s0b1n0
... (omitted) ...
NodeName=x1000c7s7b1n1
```

This is how the system was set up by the admins; I'm not sure why they used select/linear. I've recommended that they use select/cons_res, which is what we typically use. Is there an advantage to using select/cons_tres instead of select/cons_res?

`select/cons_tres` is a superset of `select/cons_res` and has more features than `select/linear`.

https://slurm.schedmd.com/cons_res.html#using_cons_tres

> Slurm's default select/linear plugin is using a best fit algorithm based on
> number of consecutive nodes. The same node allocation approach is used with
> select/cons_res and select/cons_tres for consistency.

> Consumable Trackable Resources (cons_tres) plugin provides all the same
> functionality provided by the Consumable Resources (cons_res) plugin. It also
> includes additional functionality specifically related to GPUs.

> The --exclusive srun option allows users to request nodes in exclusive mode
> even when consumable resources is enabled. See the srun man page for details.

Commit c728da23f8 merged in for 21.08.9, 22.05.4, and 23.02.0pre1. Please note that 21.08.9 does not have a planned release date and may not be released at all. Fixes are always propagated upward, so please consider 22.05.4 in the future should 21.08.9 not be released.

Cheers,
Skyler
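As a side note on the select/cons_tres suggestion above, a switch from select/linear could look something like the sketch below. This is a hedged example, not taken from the attached slurm.conf: the partition name, the `SelectTypeParameters` choice, and the use of `OverSubscribe=EXCLUSIVE` to preserve whole-node behavior are all assumptions for illustration.

```
# slurm.conf (sketch, assumed values)
# Replace SelectType=select/linear with the cons_tres plugin:
SelectType=select/cons_tres
# Track cores and memory as consumable resources (one common choice):
SelectTypeParameters=CR_Core_Memory

# To keep whole-node allocations similar to select/linear, force
# exclusive allocation at the partition level:
PartitionName=batch Nodes=ALL Default=YES OverSubscribe=EXCLUSIVE
```

With cons_tres, step-level requests such as --threads-per-core=1 and per-GPU options become available without giving up exclusive whole-node scheduling on that partition.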
Created attachment 26092 [details]
slurm.conf file

On an internal system, we're seeing an issue where srun by itself works fine, but specifying --threads-per-core=1 or --hint=nomultithread fails with "Cannot request more threads per core than the job allocation". This happens both inside an salloc and when running srun by itself.

```
dgloe@hotlum-login:~> srun hostname
x1000c0s7b1n1
dgloe@hotlum-login:~> srun --threads-per-core=1 hostname
srun: error: Unable to create step for job 6276: Cannot request more threads per core than the job allocation
dgloe@hotlum-login:~> srun --hint=nomultithread hostname
srun: error: Unable to create step for job 6277: Cannot request more threads per core than the job allocation
dgloe@hotlum-login:~> salloc --threads-per-core=1
salloc: Granted job allocation 6278
salloc: Waiting for resource configuration
salloc: Nodes x1000c0s7b1n1 are ready for job
dgloe@hotlum-login:~> srun --threads-per-core=1 hostname
srun: error: Unable to create step for job 6278: Cannot request more threads per core than the job allocation
```