| Summary: | RPC REQUEST_PING Can't find an address, check slurm.conf | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | GSK-ONYX-SLURM <slurm-support> |
| Component: | Configuration | Assignee: | Oriol Vilarrubi <jvilarru> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | jvilarru |
| Version: | 23.02.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | GSK | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | RHEL |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, nodes.conf, partitions.com, gres.conf, us1ghpcgpu001 - /etc/hosts, us1ghpcgpu001 - hostnamectl | | |
Description
GSK-ONYX-SLURM
2023-09-19 04:53:11 MDT
Created attachment 32314 [details]
slurm.conf
Created attachment 32315 [details]
nodes.conf
Created attachment 32316 [details]
partitions.com
Created attachment 32317 [details]
gres.conf
Created attachment 32318 [details]
us1ghpcgpu001 - /etc/hosts
Created attachment 32319 [details]
us1ghpcgpu001 - hostnamectl
Hi Radek,

We have an internal bug open that sounds like it could be related to what you're seeing. It has to do with nodes being added dynamically, and it causes errors of the form:

```
lookup failure for node "<node name>"
```

This bug is being worked on but doesn't have a fix yet. Are you adding these nodes dynamically (https://slurm.schedmd.com/dynamic_nodes.html)?

---

Hi Ben,

Thanks a lot for your prompt response. I'm not adding nodes dynamically; all the nodes have been added to the slurm.conf file.

Thanks,
Radek

---

Hi Ben,

I think the main problem here is related to this:

```
[2023-09-21T12:11:50.264] agent/is_node_resp: node:uptuw522 RPC:REQUEST_PING : Can't find an address, check slurm.conf
```

When it happens, the affected node, uptuw522, and also uptuw497, are seen as idle*:

```
[I am root!@ushpc:~]# sinfo -Nl | grep login
login1    1  admin  drained  96  2:24:2  262144  1906796  1  login,in  NHC: check_hw_swap_f
login2    1  admin  drained  96  2:24:2  262144  1906796  1  login,in  NHC: check_hw_swap_f
login3    1  admin  drained  96  2:24:2  262144  1906796  1  login,in  NHC: check_hw_swap_f
login4    1  admin  drained  96  2:24:2  262144  1906796  1  login,in  NHC: check_hw_swap_f
login5    1  admin  idle     96  2:24:2  262144  1906796  1  login,in  none
uptuw497  1  admin  idle*    20  2:10:1  128000  1900000  1  login,in  none
uptuw498  1  admin  idle     20  2:10:1  257000  1900000  1  login,in  none
uptuw522  1  admin  idle*    20  2:10:1  128000  1900000  1  login,in  none
```

They have both been added outside Bright, if that matters. Nothing in the slurmd logs; everything looks good. Checking the slurmctld logs again after a while:

```
[2023-09-21T12:08:16.334] error: _find_node_record: lookup failure for node "uptuw522.corpnet2.com"
[2023-09-21T12:11:50.264] agent/is_node_resp: node:uptuw522 RPC:REQUEST_PING : Can't find an address, check slurm.conf
[2023-09-21T12:26:41.007] Node uptuw522 now responding
```

and the node is responding.

Thanks,
Radek

---

Hi Oriol,

I think I know where the problem might be.
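A quick way to check from the controller whether a node in this state is resolvable and reachable (the node name is taken from the log above; the commands are standard Linux tools, and 6818 is slurmd's default port, which may differ per site):

```shell
# Does the controller resolve the node name (consults /etc/hosts and DNS)?
getent hosts uptuw522

# Is the node reachable over the network?
ping -c 3 uptuw522

# Is slurmd's port open on the node?
nc -zv uptuw522 6818
```

If `getent` returns nothing, slurmctld cannot translate the node name into an address, which matches the "Can't find an address, check slurm.conf" message.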
Some of the new nodes are connected to the external network only, while the other compute nodes are connected to the internal network. The head node is connected to both. I think it might be something to do with a missing routing table entry, or DNS, or both. I will be checking this now.

The rest of the new nodes are connected to both networks and can communicate with the other nodes without any problems. However, the line:

```
[2023-09-22T04:28:28.285] agent/is_node_resp: node:us1ghpcgpu003 RPC:REQUEST_PING : Can't find an address, check slurm.conf
```

appears for these nodes too. I noticed that the hostname is an FQDN for those nodes, while the other nodes don't have a domain. Does that matter?

Thanks,
Radek

---

Update: I just restarted all compute nodes across the cluster and I haven't seen the issue since 8:23 AM EDT today. The error used to be logged every hour, so it looks promising. Tell me what you think anyway, in case there's anything else I could check.

Happy weekend!
Radek

---

Hi Radek,

Sorry for not answering before. For the nodes to be usable they need to have a routable connection to the controller, they need to be resolvable by name, and they also need to be in time sync with the controller. If you cannot modify the global DNS, or you don't want to, you can also give Slurm the node's IP address directly, using the NodeAddr parameter on each node definition.

So to check, for example, whether uptuw522 is a usable node, you can ping it directly from the controller.

The reason why you sometimes see the domain in the Slurm logs and sometimes not: when you do not see the domain, the log is referring to the node as a Slurm entity, but when you see it with the domain, it is because we tried to resolve it as a DNS name, and that is the default domain to search with, as instructed by the controller; if my memory does not fail me, that is called the DNS search domain.
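As a sketch of the NodeAddr approach mentioned above (the IP addresses and hardware values below are illustrative placeholders, not taken from this cluster's actual configuration):

```
# nodes.conf (included from slurm.conf) -- hypothetical entries.
# NodeAddr lets slurmctld reach a node whose name is not resolvable
# via DNS or /etc/hosts on the controller.
NodeName=uptuw522 NodeAddr=10.0.0.122 CPUs=20 RealMemory=128000 State=UNKNOWN
NodeName=uptuw497 NodeAddr=10.0.0.97  CPUs=20 RealMemory=128000 State=UNKNOWN
```

With NodeAddr set, Slurm uses the given address for RPC traffic regardless of what name resolution returns.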
When you see a node with a state that ends with *, it means the node is not responding. If you want to see the full state of a node, you can use `scontrol show node <nodename>`.

Are you still seeing the NHC time issue?

Regards.

---

Hi Oriol,

All the names are resolvable across the nodes, and I think I found what the problem is. Once the new nodes were added, I restarted slurmctld on the head node and slurmd on the new nodes, forgetting to do that on the rest of the nodes. Every time I wanted to execute an srun job, I was getting an error saying that there's a problem with an address, and I also saw that the new nodes were not responding from time to time.

Last Friday I restarted slurmd on all nodes in the cluster, and since then I don't see anything like "RPC:REQUEST_PING : Can't find an address, check slurm.conf".

What's your take on this? I'm still having the problem with NHC and time synchronization, but that's outside Slurm, so as long as you confirm that the slurmd restart was required (we don't use configless on that cluster), I think we can close the ticket.

Thanks,
Radek

---

Hello Radek,

Yes, that is 100% the issue. Whenever you add new nodes, you need to reload the configuration on every daemon so that they are all aware of the new nodes. Otherwise they will receive messages that refer to those new nodes, and those will not be recognized as valid Slurm nodes. Sorry for not pointing this out earlier; I was assuming that all nodes had been freshly restarted.

The same idea also applies to configuration parameters, though for the majority of those it is enough to issue an scontrol reconfigure command.

I will proceed to close this ticket, but do not hesitate to reopen if needed.

Regards.
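The resolution described above can be sketched as the following admin commands (assuming the daemons run under systemd and that pdsh is available for fan-out; unit names and the fan-out tool may differ per site):

```shell
# On the controller, after editing slurm.conf / nodes.conf:
systemctl restart slurmctld

# On EVERY compute node -- not just the newly added ones -- so that
# all slurmd daemons pick up the updated node list:
pdsh -a 'systemctl restart slurmd'

# For most other parameter changes (not node additions in this
# Slurm version), a full restart is unnecessary:
scontrol reconfigure
```

Restarting slurmd everywhere is what matters here: a daemon running with the old node list rejects RPCs that reference nodes it does not know about.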