Ticket 11238

Summary: Selecting from a pool of available nodes
Product: Slurm
Reporter: Carl Ponder <CPonder>
Component: User Commands
Assignee: Scott Hilton <scott>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---
CC: fabecassis
Version: 20.02.6
Hardware: Linux
OS: Linux
Site: NVIDIA (PSLA)

Description Carl Ponder 2021-03-29 02:57:42 MDT
I'm running on a cluster of heterogeneous nodes and want to run a single-node job for which either of two nodes will work. I'd like to be able to use the form

      srun -N 1 -w node1,node2 ...

which would give me whichever of node1 or node2 becomes available first.
This doesn't work, however:

     srun: error: Required nodelist includes more nodes than permitted by max-node count (2 > 1). Eliminating nodes from the nodelist.

since it expects the node count to match the length of the node list.
Comment 1 Carl Ponder 2021-03-29 03:00:01 MDT
This behavior deviates from the -x convention used to eliminate nodes from consideration. If I wrote a long list that excludes all the other nodes

     srun -N 1 -x node2,node3,... ....

then there's no complaint, even though the remaining pool contains two nodes rather than one.
Comment 2 Carl Ponder 2021-03-29 03:01:05 MDT
Using this convention

    srun -N 1 -w node1 -w node2 ...

doesn't work either: the second setting overrides the first, so I have to wait for node2 even if node1 is available.
Comment 3 Carl Ponder 2021-03-29 03:07:24 MDT
I wouldn't consider this a change of behavior since, currently, the mismatch between the node count and the list length in

     srun -N 1 -w node1,node2 ....

causes an error rather than running the job with different behavior than I'm asking for. (Unless, maybe, someone relies on the consistency check as a fail-safe when putting commands together programmatically.)

If you changed the behavior of

     srun -N 1 -w node1 -w node2 ....

it could potentially affect users who build lists of settings and expect later settings to override earlier ones.
Comment 5 Scott Hilton 2021-03-29 15:11:38 MDT
Carl,

There are two ways I can see to accomplish this type of request with Slurm as-is.

The first is partitions. Specifying a partition accomplishes this, but you may want partitions to be used in a different way, and the list of nodes must be set up in slurm.conf by an admin rather than by the user when they want it.
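For illustration, a partition covering just those two nodes might look like this (the partition name "pair" is hypothetical):

     slurm.conf:
     PartitionName=pair Nodes=node1,node2

     srun -N 1 -p pair ...

Slurm would then schedule the job on whichever of the two nodes becomes available first.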

The second way is using features and constraints. You can assign features to different nodes in slurm.conf. Then, when you want those nodes, specify the constraint:

     srun -N 1 --constraint="intel|amd"

In fact, you could assign each node a feature equal to its name and then specify it with
     slurm.conf: 
     NodeName=node1 Feature=node1

     srun -N 1 -C "node1|node2"

If you use NodeName=node[1-10], you couldn't do this unless you split the definition up.
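For example, splitting the range so node1 and node2 each carry their own name as a feature while the rest stay grouped (a hypothetical sketch; remaining node parameters omitted):

     slurm.conf:
     NodeName=node1 Feature=node1
     NodeName=node2 Feature=node2
     NodeName=node[3-10]

Then srun -N 1 -C "node1|node2" would match either of the two nodes.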

-Scott
Comment 6 Carl Ponder 2021-03-29 15:18:05 MDT
Both of these require intervention from the admins, right?
That's what I'm trying to avoid...
Comment 7 Scott Hilton 2021-03-29 15:45:39 MDT
Carl,

Yes. An admin would need to edit slurm.conf for either of these workarounds. Though once it is set up, it can be used by any user.

I will ask the development team if a change like this is something we could add.

-Scott
Comment 9 Scott Hilton 2021-03-30 10:45:04 MDT
Carl,

Sorry, we are not planning on modifying the -w flag.

Using features + constraints is the way to go. If there is a specific resource in node1 and node2 that is not present in the rest of the nodes, the admin should expose it with either features or GRES. While this does require admin implementation, setting up the nodes in the first place does too.

If you have any further questions let me know.

-Scott
Comment 10 Scott Hilton 2021-04-07 13:13:23 MDT
Carl,

Closing ticket. If you have any follow-up questions, feel free to reopen it.

-Scott