Hi,

Is there a way to register a compute with Slurm without specifying its ID? As an example, say we had two configurations listed in the Slurm config:

- A: 1-socket machine
- B: 2-socket machine

When setting up B we would like to say something like:

scontrol update <shape B> <IP> <hostname>

which would result in Slurm dynamically assigning an ID to this compute, e.g. compute123. Once the compute is DOWN, the ID would become vacant and could be re-used.

The motivation is that this would make the process of setting up computes in a cloud environment much simpler for us.

Thanks,
Mateusz
Mateusz,

Yes, you can do that with FUTURE nodes:
https://slurm.schedmd.com/slurm.conf.html#OPT_FUTURE

The node definition in slurm.conf is a pseudo-node configuration: a template for what the node shape/type (sockets, cores, memory, etc.) will look like. These entries do not correspond to a specific node and do not need to exist when the Slurm controller is started; their state is FUTURE. A NodeName is defined, but the names are arbitrary and not tied to an exact host.

When slurmd is started with the future option, the Slurm controller pairs that slurmd with a matching node from the node configuration in slurm.conf, based on the resource shape (sockets, cores, threads) as displayed in the output of "slurmd -C". The nodes defined in slurm.conf don't have a real NodeAddr or hostname; those get registered dynamically when the slurmd reports to the controller. Until these nodes are made available, they will not be seen by any Slurm commands, nor will there be any attempt to contact them.

You will need to determine how many FUTURE nodes need to be configured for each type of node.

#slurm.conf
##A node that registers with the following shape will register as one of node00,node01...node09
NodeName=node0[0-9] CPUs=2 CoresPerSocket=2 Sockets=1 ThreadsPerCore=1 State=FUTURE
##A node that registers with the following shape will register as one of g00,g01...g09
NodeName=g0[0-9] CPUs=16 CoresPerSocket=2 Sockets=1 ThreadsPerCore=1 State=FUTURE

When a node is started, slurmd must be started with the -F option:

slurmd -F

The node will need to have Munge configured and have access to the Slurm configuration, as with a static node, either through a shared file system or using the Configless option:
https://slurm.schedmd.com/configless_slurm.html

slurmd -F --conf-server [slurmctl-primary-server:port]

Once the node is registered, it will remain that way through restarts.
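As a concrete illustration of the matching step, "slurmd -C" prints the shape the daemon detects on the host, and it is that shape the controller compares against the FUTURE definitions. The exact values below are hypothetical:

```
$ slurmd -C
NodeName=ip-10-0-0-7 CPUs=2 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=3800
```

A host reporting this shape would be paired with one of the node0[0-9] slots from the example config above.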
If you need to remove the node and relinquish it, set the state of the node back to FUTURE:

scontrol update node=node00 state=FUTURE

DNS and Host Resolution

If the mapping of the NodeName to the slurmd HostName is not updated in DNS as part of the process for adding a node, Dynamic Future nodes won't know how to communicate with each other, because NodeAddr and NodeHostName are not defined in slurm.conf. In that case the fanout communications between the nodes need to be disabled by setting the "TreeWidth" parameter to a high number (e.g. 65533):

#slurm.conf
TreeWidth=65533

If the DNS mapping is made, then the cloud_dns SlurmctldParameter can be used instead:

#slurm.conf
SlurmctldParameters=cloud_dns

Let me know if this accomplishes what you are after.

Thanks,
Nick
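Putting the pieces together, a minimal no-DNS cloud fragment might look roughly like this (a sketch; all names and values are illustrative, and enable_configless is only needed if you serve the config from slurmctld):

```
#slurm.conf
SlurmctldParameters=enable_configless
TreeWidth=65533
NodeName=node0[0-9] CPUs=2 CoresPerSocket=2 Sockets=1 ThreadsPerCore=1 State=FUTURE
```

With working DNS you would drop TreeWidth and use SlurmctldParameters=cloud_dns instead.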
Nick,

Thanks so much, this looks very promising! Other than CPU/memory, are there additional tags or settings one can use to distinguish between nodes? E.g., if we had nodes with two different OS versions, could they be recognized somehow? Or GPU vs non-GPU?

Thanks!
Mateusz
Yes, feature constraints are one method. The main resource that is matched is CPU; memory and GRES are not matched. But you can use features to differentiate between types, and multiple features can be listed for a node. For instance, the config would look like:

NodeName=node0[0-10] CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=772211 State=FUTURE Gres=gpu:4 FEATURE=36lowmem

You then start the slurmd with the -F<Feature> option. NOTE: no space between the -F and the feature.

slurmd -F36lowmem

The Slurm controller will then match that node with one of the pseudo FUTURE nodes that have the feature.

Regards,
Nick
I'm experimenting with using this feature for our setup. Now that Slurm picks a slot itself, I'm trying to figure out how to query (from the compute itself) what name it chose - is that possible? Thanks! Mateusz
Your best place to look for that is the NodeHostName field from: "scontrol show node [nodename]"
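Since the compute doesn't know its assigned NodeName up front, one approach is to list all nodes and match this host's name against each NodeHostName field. Below is a hedged sketch; the scontrol call and hostname are stubbed with sample data (the one-line -o output format) so the parsing logic is visible:

```shell
# Sample stands in for:  nodes=$(scontrol show node -o)
nodes='NodeName=node00 NodeHostName=ip-10-0-0-5 NodeAddr=10.0.0.5 State=IDLE
NodeName=node01 NodeHostName=ip-10-0-0-7 NodeAddr=10.0.0.7 State=IDLE'
me='ip-10-0-0-7'   # in practice: me=$(hostname)

# Pick the NodeName whose NodeHostName matches this host.
mynode=$(printf '%s\n' "$nodes" | awk -v h="$me" '
  index($0, "NodeHostName=" h " ") || $0 ~ ("NodeHostName=" h "$") {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^NodeName=/) { sub("NodeName=", "", $i); print $i }
  }')
echo "$mynode"
```

With the sample data this prints node01, the name Slurm chose for the host whose NodeHostName is ip-10-0-0-7.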
I'm trying out this change but I'm running into issues. I sometimes see slurmctld reporting this error:

error: _slurm_rpc_node_registration node=compute13901: Invalid argument

Side note: looking at the compute, the slurmd service there appears to be running and healthy, which given the error is maybe not ideal?

Q0:
Based on the compute ID it's trying to match and the instance spec, I think it's incorrectly matching the instance shape to our config. Please let me know if this error means something else. What is the algorithm this feature uses to match on the memory requirements? If I have:

NodeName=compute[1-4] CoresPerSocket=24 FEATURE=spot RealMemory=250000 Sockets=2 ThreadsPerCore=1
NodeName=compute[5-8] CoresPerSocket=24 FEATURE=spot RealMemory=370000 Sockets=2 ThreadsPerCore=1

And I have computes come up with memory:
a/ 249999 (low -1)
b/ 250001 (low +1)
c/ 370001 (high +1)

Q1:
How will a/b/c get matched to the config above?

Q2:
Considering that our ability to tell the exact amount of free memory is imprecise due to how AWS virtualization works, can I drop the memory from the config altogether and e.g. use FEATURE to disambiguate? I.e., should Slurm be fine with discovering memory dynamically?

Q3:
Is there a limit on the string length for the FEATURE specification? And are there limitations on the characters we're allowed to use in that string?

Thank you!
Mateusz
Hi, we would really appreciate some help with the questions above. Thanks!
Hmm... I think I misread one of Nick's replies, i.e. it seems "memory" is _not_ mapped, so I assume Slurm will not disambiguate based on that, correct?
Memory is not matched; that is where the features method comes in handy. If the CPUs are all the same, then Slurm will match the first node on the list with that CPU shape. To have memory come into play, set features on the nodes for the different memory sizes. Look back at the example I gave: there, the node registers with its CPU shape but is then matched with a node carrying the 36lowmem feature.

Features are alphanumeric. The number of characters for the feature tag can be very large.

Nick Ihli
Director, Cloud and Sales Engineering
nick@schedmd.com
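Since memory is not part of the matching, one hedged sketch of encoding memory tiers as features, using the earlier two-tier example (all feature names here are illustrative), could be:

```
#slurm.conf
NodeName=compute[1-4] CPUs=48 Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=250000 Feature=lowmem State=FUTURE
NodeName=compute[5-8] CPUs=48 Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=370000 Feature=highmem State=FUTURE
```

A low-memory instance would then start its daemon with "slurmd -Flowmem" and a high-memory one with "slurmd -Fhighmem", so identical CPU shapes still land in the intended range.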
Great, thanks! Considering that at this point the features map 1:1 to an instance type, is there a way of dropping all CoresPerSocket/ThreadsPerCore/RealMemory/etc. settings from the config? I tried simply removing them, but that caused srun to fail with:

srun: error: Memory specification can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available

Thanks,
Mateusz
You will still need information about CPUs and memory. You could just use CPUs= instead of CoresPerSocket/ThreadsPerCore, etc.
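For instance, a hedged sketch of the trimmed-down definition (values illustrative), keeping only CPU count, memory, and the distinguishing feature:

```
#slurm.conf
NodeName=compute[1-8] CPUs=48 RealMemory=250000 Feature=spot State=FUTURE
```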
Are there any other questions on this or can we close the ticket? Thanks, Nick
This is working well, but I have one more related question (I'd be happy to open a new ticket for it if you prefer). Can we tell Slurm to start a compute in a specific state? Specifically, we would like it to not schedule any jobs on the node, so e.g. DRAIN would work. The reason is that we want jobs to start only AFTER we know the nodename - we're hoping to rename the compute using it, and we wouldn't want jobs to experience a hostname change. Thanks!
Normally you could, but with FUTURE nodes you aren't able to set a separate state like DRAIN in slurm.conf. One method you could look at is having all the FUTURE nodes in their own partition and setting the State of that partition to INACTIVE. Then, once a node is up and you have made the changes you need, you can modify the other partitions and add the node to them. Another option would be to create a reservation on the FUTURE nodes, then remove them from it as they become ready.
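Both options might look roughly like this in practice. This is a sketch, not a recipe: partition and reservation names are illustrative, and the reservation command's exact options should be checked against your Slurm version:

```
#slurm.conf: hold the FUTURE nodes in an inactive staging partition
PartitionName=staging Nodes=node0[0-9] State=INACTIVE

# Or hold them with a maintenance reservation instead:
# scontrol create reservation reservationname=staging users=root \
#     starttime=now duration=infinite flags=maint nodes=node0[0-9]
```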
I have tried adding an INACTIVE partition as you described, but doing so required removing all the nodes from our usual partitions. Unfortunately, this caused srun to fail immediately when submitting a job with:

srun: error: Unable to allocate resources: More processors requested than permitted

Is there a way around that? I'm now looking into trying the reservation approach you suggested.
Looking at this closer, I don't see a way to modify a reservation to add/remove a single node from it. Is that correct? If so, this would be difficult for us to implement: that part of the bootstrap happens on the computes, and it would be difficult to make the reservation updates atomic.

I'm considering adding "dummy" nodes to our usual partitions to trick srun into thinking there are enough CPUs. Please let me know if you think there is a better way.

I'm also wondering whether moving computes between partitions is erased once we set State=FUTURE. Or would a compute coming back up and grabbing the same nodename end up in the partition it was moved to using scontrol?

Thank you,
Mateusz
I was able to work around the srun issue, but it looks like it's not possible to update a node with a different partition name:

Update of this parameter is not supported: partitionname=spot
Request aborted

Can you please advise on the best way to do that?
To update the partition a node is in, you need to update the node list in the partition. Here, test_node is my FUTURE node; it came up in the "test" partition:

[root@mgmtnode ~]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
open*        up  infinite     9  idle node[02-09]
test         up  infinite     5  idle test_node

scontrol update partition=open nodes=node[02-09],test_node

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
open*        up  infinite     9  idle node[02-09],test_node
test         up  infinite     1  idle test_node

You could then remove it from the "test" partition:

scontrol update partition=test nodes=

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
open*        up  infinite     9  idle node[02-09],test_node
test         up  infinite     0   n/a

One method you could look at is not adding those FUTURE nodes to any partition at all, then updating the partitions where you want those nodes to go once they are ready for scheduling. Jobs won't be scheduled to those nodes while they have no partition associated with them.

It is important to note that after restarting the controller, those changes are not preserved unless slurm.conf is changed or you start slurmctld with the -R option.

For the reservation method you are correct: you have to update the whole node list, removing whichever node is no longer to be reserved.
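The read-modify-write pattern above can be scripted from a compute, though note it is not atomic: two nodes updating the same partition concurrently can overwrite each other's change. In this hedged sketch the sinfo and scontrol calls are stubbed with sample data so the list-building logic is visible:

```shell
# Sample stands in for:  current=$(sinfo -h -p open -o '%N')
current='node[02-09]'
newnode='test_node'

# Append our node to the partition's node list (read-modify-write; NOT atomic).
if [ -n "$current" ]; then
  newlist="$current,$newnode"
else
  newlist="$newnode"
fi
echo "$newlist"
# in practice: scontrol update partition=open nodes="$newlist"
```

With the sample data this builds node[02-09],test_node, matching the scontrol update shown above.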
OK, thanks! To give you a bit more color: we would ideally want to modify the partition/reservation from the computes themselves. We may have hundreds of those coming up at about the same time, and it's important that the updates are atomic and don't erase changes made by others. We could of course have the updates done centrally and use another service to arbitrate changes to the partitions, but that's less convenient.

For now we were able to work around the issue of nodes accepting jobs immediately, i.e. it is no longer a problem. But if there were a way to say:

scontrol update node nodename=foo partitions=bar

or to start a FUTURE node in a DRAIN state, or something similar, that would be useful for us :)

Anyway, I appreciate your help!