Ticket 12742 - Can Slurm allocate compute ID dynamically based on shape?
Summary: Can Slurm allocate compute ID dynamically based on shape?
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Cloud
Version: 22.05.x
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Nick Ihli
 
Reported: 2021-10-25 22:03 MDT by PDT Partners
Modified: 2021-11-29 22:32 MST

See Also:
Site: PDT


Description PDT Partners 2021-10-25 22:03:04 MDT
Hi,

Is there a way to register a compute with Slurm without specifying its ID?

As an example, let's say we had two configurations listed in Slurm config:
- A: 1-socket machine
- B: 2-socket machine

When setting up B we would like to say something like:
scontrol update <shape B> <IP> <hostname>

Which would result in Slurm dynamically assigning an ID to this compute e.g., compute123.

Once the compute is DOWN the ID would become vacant and could be re-used.

The motivation for this is it would make the process of setting up computes in the cloud environment much simpler for us.

Thanks
Mateusz
Comment 1 Nick Ihli 2021-10-26 19:48:22 MDT
Mateusz,

Yes, you can do that with FUTURE nodes. 

https://slurm.schedmd.com/slurm.conf.html#OPT_FUTURE

The node definition in slurm.conf will have a pseudo-node configuration that is a template for what the node shape/type (sockets, cores, memory, etc.) will look like. These entries do not correspond to a specific host and do not need to exist when the Slurm controller is started; their state is FUTURE. A NodeName is defined, but the names are arbitrary and not tied to an exact host.

When the slurmd is started with the future option, the Slurm controller will pair that slurmd with a matching node from the node configuration in slurm.conf, based on the resource shape (sockets, cores, threads) as reported by "slurmd -C".

Those nodes defined in slurm.conf don't have a real NodeAddr or NodeHostName; instead those get registered dynamically when the slurmd reports to the Slurm controller. Until these nodes are made available, they will not be seen by any Slurm commands, nor will there be any attempt to contact them. You will need to determine how many FUTURE nodes to configure for each type of node.

#slurm.conf
##A node that registers with the following shape will register as one of node00, node01, ..., node09

NodeName=node0[0-9] CPUs=2 CoresPerSocket=2 Sockets=1 ThreadsPerCore=1 State=FUTURE

##A node that registers with the following shape will register as one of g00, g01, ..., g09

NodeName=g0[0-9] CPUs=16 CoresPerSocket=2 Sockets=1 ThreadsPerCore=1 State=FUTURE

When a node is started, the slurmd must be started with the -F option.

slurmd -F

The node will need to have Munge configured and have access to the Slurm configuration as
with a static node, either through a shared file system or using the Configless option.
https://slurm.schedmd.com/configless_slurm.html

slurmd -F --conf-server [slurmctl-primary-server:port]

Once the node is registered, it will remain that way through restarts. If you need to remove the node and relinquish it, then you need to set the state of the node back to FUTURE:

scontrol update node=node00 state=FUTURE

DNS and Host Resolution

If the mapping of the NodeName to the slurmd HostName is not updated in DNS as part of the process for adding a node, Dynamic Future nodes won't know how to communicate with each other. This is due to NodeAddr and NodeHostName not being defined in slurm.conf. The fanout communications between the nodes then need to be disabled by setting the parameter TreeWidth to a high number (e.g., 65533).

#slurm.conf
TreeWidth=65533

If the DNS mapping is made, then the cloud_dns SlurmctldParameter can be used.

#slurm.conf
SlurmctldParameters=cloud_dns


Let me know if this accomplishes what you are after.

Thanks,
Nick
Comment 2 PDT Partners 2021-10-29 14:13:59 MDT
Nick,

Thanks so much, this looks very promising!

Other than CPU/memory - are there some additional tags or settings one can use to distinguish between nodes? E.g., if we had nodes with two different OS versions, could they be recognized somehow? Or GPU vs non-GPU?

Thanks!
Mateusz
Comment 3 Nick Ihli 2021-10-29 15:15:21 MDT
Yes, feature constraints are one method. The main resource that is matched is CPUs; memory and GRES are not matched. But you can use features to differentiate between types. Multiple features can be listed for a node.

For instance the config would look like:

NodeName=node0[0-10] CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=772211 State=FUTURE Gres=gpu:4 FEATURE=36lowmem

You then start the slurmd with the -F<Feature> option. NOTE: no space between -F and the feature name.
             
slurmd -F36lowmem


The Slurm controller will then match that node with one of the pseudo FUTURE nodes that have the feature.

Regards,
Nick
Comment 4 PDT Partners 2021-11-03 14:51:59 MDT
I'm experimenting with using this feature for our setup. Now that Slurm can pick up a slot itself, I'm trying to figure out how to query (from the compute itself) what name it chose - is that possible?

Thanks!
Mateusz
Comment 5 Nick Ihli 2021-11-03 15:50:14 MDT
Your best place to look for that is the NodeHostName field from:

"scontrol show node [nodename]"
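Building on that answer, one way to recover the assigned name from the compute itself is to match the local hostname against the NodeHostName field of `scontrol -o show nodes` output. A minimal sketch, where the sample line is canned stand-in output (the field layout is an assumption about the one-line format, and the hostnames are hypothetical):

```shell
# Sketch: find which NodeName slurmctld assigned to this host by matching
# NodeHostName against the local hostname. A canned sample line stands in
# for live `scontrol -o show nodes` output (hypothetical values).
sample='NodeName=node03 CPUs=2 NodeAddr=10.0.0.5 NodeHostName=ip-10-0-0-5 State=IDLE'
host='ip-10-0-0-5'   # in real use: host=$(hostname)

# On a live cluster the pipeline would instead start with: scontrol -o show nodes | awk ...
assigned=$(printf '%s\n' "$sample" | awk -v h="$host" '{
  name = ""; hit = 0
  for (i = 1; i <= NF; i++) {
    split($i, kv, "=")
    if (kv[1] == "NodeName") name = kv[2]
    if (kv[1] == "NodeHostName" && kv[2] == h) hit = 1
  }
  if (hit) print name
}')
echo "$assigned"   # -> node03 for the sample line
```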
Comment 6 PDT Partners 2021-11-12 11:21:03 MST
I'm trying out this change but I'm running into issues. I sometimes see slurmctl reporting this error:

error: _slurm_rpc_node_registration node=compute13901: Invalid argument

Side note - looking at the compute, the slurmd service there appears to be running and content; maybe that's not ideal?

Q0:
Based on the compute ID it's trying to match and the instance spec, I think it's incorrectly matching the instance shape to our config. Please let me know if this error means something else.

What is the algorithm this feature uses to match on the memory requirements? If I have:

NodeName=compute[1-4] CoresPerSocket=24 FEATURE=spot RealMemory=250000 Sockets=2 ThreadsPerCore=1
NodeName=compute[5-8] CoresPerSocket=24 FEATURE=spot RealMemory=370000 Sockets=2 ThreadsPerCore=1

And I have computes come up with memory:
a/ 249999 (low -1)
b/ 250001 (low +1)
c/ 370001 (high +1)

Q1:
How will a/b/c get matched to the config above?

Q2:
Considering our ability to tell exactly the amount of free memory is imprecise due to how AWS virtualization works, can I drop the memory from the config altogether and e.g. use FEATURE to disambiguate? I.e. should Slurm be fine w/ discovering memory dynamically?

Q3:
Is there a limit on the string length for the FEATURE specification? And are there limitation to characters we're allowed to use in that string?

Thank you!
Mateusz
Comment 7 PDT Partners 2021-11-15 10:15:49 MST
Hi, we would really appreciate some help with the questions above.

Thanks!
Comment 8 PDT Partners 2021-11-15 11:17:12 MST
Hmm... I think I misread one of Nick's replies, i.e., it seems "memory" is _not_ mapped, so I assume Slurm will not disambiguate based on that, correct?
Comment 9 Nick Ihli 2021-11-15 13:36:36 MST
Memory is not matched. Instead, that is where the features method comes in handy. If the CPUs are all the same, then it will find the first node on the list with the matching CPU shape. To have memory come into play, set features on the nodes for the different memory sizes. Look back at the example I gave: there, the node registers with its CPU shape but is then matched with a node carrying the 36lowmem feature.

Features are alphanumeric. The feature tag can be very long.
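To make memory effectively part of the match, the feature approach can be applied per memory tier. A slurm.conf sketch, assuming the node names, CPU counts, and the mem250/mem370 feature tags are placeholders rather than a verified config:

```shell
# slurm.conf -- sketch: identical CPU shapes disambiguated by feature tags
NodeName=compute[1-4] CPUs=48 RealMemory=250000 Feature=mem250 State=FUTURE
NodeName=compute[5-8] CPUs=48 RealMemory=370000 Feature=mem370 State=FUTURE
```

Each instance then starts its slurmd with the tag for its tier, e.g. `slurmd -Fmem250`, so two tiers with the same CPU shape never collide.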

Nick Ihli
Director, Cloud and Sales Engineering
nick@schedmd.com
Comment 10 PDT Partners 2021-11-15 22:37:32 MST
Great, thanks!

Considering that at this point the features map 1:1 to an instance type - is there a way of dropping all CoresPerSocket/ThreadsPerCore/RealMemory/etc settings from the config? I tried simply removing them but that caused srun to fail with: 

srun: error: Memory specification can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available

Thanks,
Mateusz
Comment 11 Nick Ihli 2021-11-16 08:49:23 MST
You will still need information about CPUs and memory. You could just use CPUs= instead of CoresPerSocket/ThreadsPerCore, etc.
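A minimal node line along those lines might look like the following sketch (the counts and the spot feature are placeholders echoing earlier comments in this ticket, not a verified config):

```shell
# slurm.conf -- sketch: flat CPUs= count instead of Sockets/CoresPerSocket/ThreadsPerCore
NodeName=compute[1-4] CPUs=48 RealMemory=250000 Feature=spot State=FUTURE
```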
Comment 12 Nick Ihli 2021-11-22 17:17:57 MST
Are there any other questions on this or can we close the ticket?

Thanks,
Nick
Comment 13 PDT Partners 2021-11-24 10:44:23 MST
This is working well, but I have one more related question. If you prefer, I'd be happy to open a new ticket for it.

The question I have is - can we tell Slurm to start a compute in a specific state? Specifically, we would like it to not schedule any jobs on the node so e.g. DRAIN would work. The reason is that we want the jobs to start only AFTER we know the nodename - we're hoping to rename the compute using that, and we wouldn't want the jobs to experience a hostname change.

Thanks!
Comment 14 Nick Ihli 2021-11-24 14:09:07 MST
Normally you could, but with FUTURE nodes you aren't able to set a separate state like DRAIN in slurm.conf. One method you could look at is to have all the FUTURE nodes in their own partition and set the State of that partition to INACTIVE. Then, once the node is up and you have made the changes you need, you can modify other partitions and add the node to them.

Another option would be to create a reservation on the Future nodes, then remove them as needed after they are ready.
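A sketch of the inactive-partition approach (the staging partition name and node range are hypothetical):

```shell
# slurm.conf -- sketch: park FUTURE nodes in a partition that won't schedule jobs
NodeName=node0[0-9] CPUs=2 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=FUTURE
PartitionName=staging Nodes=node0[0-9] State=INACTIVE
```

Once a node is up and renamed, it can be moved by rewriting a live partition's node list with `scontrol update partition=<name> nodes=<list>`.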
Comment 15 PDT Partners 2021-11-24 22:45:13 MST
I have tried adding an INACTIVE partition as you described, but doing so I had to remove all the nodes from our usual partitions. Unfortunately this caused srun to fail immediately when trying to submit a job with: 

srun: error: Unable to allocate resources: More processors requested than permitted

Is there a way around that?

I'm now looking into trying the reservation as you suggested.
Comment 16 PDT Partners 2021-11-24 23:08:29 MST
Looking at this closer - I don't see a way to modify a reservation to add/remove a single node from it. Is that correct? If so, this would be difficult for us to implement, since that part of the bootstrap happens on the computes and it would be difficult to make the reservation updates atomic.

I'm considering adding "dummy" nodes to our usual partitions to trick srun into thinking there are enough CPUs. Please let me know if you think there is a better way?

I'm also wondering whether moving computes between partitions is erased once we set State=FUTURE. Or would a compute coming back up and grabbing the same nodename end up in the partition it was moved to using scontrol?

Thank you,
Mateusz
Comment 17 PDT Partners 2021-11-25 11:23:35 MST
I was able to work around the srun issue, but it looks like it's not possible to update a node with a different partition name: 

Update of this parameter is not supported: partitionname=spot
Request aborted

Can you please advise of the best way to do that?
Comment 18 Nick Ihli 2021-11-29 15:35:22 MST
To update the partition a node is in, you need to update the node list in the partition.

test_node is my FUTURE node. It came up in the "test" partition:

[root@mgmtnode ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
open*        up   infinite      9   idle node[02-09]
test         up   infinite      5   idle test_node



scontrol update partition=open nodes=node[02-09],test_node

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
open*        up   infinite      9   idle node[02-09],test_node
test         up   infinite      1   idle test_node

You could remove it from the "test" partition:

scontrol update partition=test nodes=

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
open*        up   infinite      9   idle node[02-09],test_node
test         up   infinite      0    n/a 


One method you could look at is to not add those FUTURE nodes to any partition at all, then update the partitions where you want those nodes to go once they are ready for scheduling. Jobs won't be scheduled on those nodes since they have no partition associated with them.

It is important to note that after restarting the controller, those changes are not preserved unless slurm.conf is changed or you start slurmctld with the -R option.

For the reservation method you are correct: you have to update the whole node list, removing whichever node is no longer to be reserved.
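The whole-list reservation update described above might look like this sketch (the reservation name and node ranges are hypothetical, and the parameters are standard scontrol reservation options, unverified against this site's config):

```shell
# Sketch: reserve all FUTURE nodes up front, then shrink the list as nodes become ready
scontrol create reservation reservationname=staging_resv users=root starttime=now duration=UNLIMITED nodes=node0[0-9]

# Release node00 by rewriting the full node list without it (no incremental remove):
scontrol update reservationname=staging_resv nodes=node0[1-9]
```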
Comment 19 PDT Partners 2021-11-29 22:32:07 MST
OK, thanks!

To give you a bit more color - we would ideally want to modify the partition/reservation from the computes themselves. We may have hundreds of them coming up at about the same time, and it's important that the updates are atomic and don't erase changes made by others.

We could of course have the updates done centrally and use another service to arbitrate changes to the partitions, but that's less convenient.

For now we were able to work around the issue of nodes accepting jobs immediately, i.e., it is no longer a problem. But if there were a way to say:

scontrol update node nodename=foo partitions=bar

Or start a FUTURE node in a DRAIN state, or something similar - that would be useful for us :)

Anyway, I appreciate your help!