Ticket 17715

Summary: RPC REQUEST_PING Can't find an address, check slurm.conf
Product: Slurm Reporter: GSK-ONYX-SLURM <slurm-support>
Component: Configuration Assignee: Oriol Vilarrubi <jvilarru>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: jvilarru
Version: 23.02.4   
Hardware: Linux   
OS: Linux   
Site: GSK Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: RHEL
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm.conf
nodes.conf
partitions.conf
gres.conf
us1ghpcgpu001 - /etc/hosts
us1ghpcgpu001 - hostnamectl

Description GSK-ONYX-SLURM 2023-09-19 04:53:11 MDT
Hi Team,

I've been adding new compute nodes to one of our clusters, which is managed by Bright. These new nodes are RHEL7 and RHEL8 machines and are being provisioned outside the Bright manager.

I noticed that the newly added nodes go down and come back up from time to time. I filtered the slurmctld logs for one of the nodes I'm having the problem with, us1ghpcgpu001:

[...]
[2023-09-19T04:54:42.556] error: _find_node_record: lookup failure for node "us1ghpcgpu001.corpnet2.com"
[2023-09-19T04:56:22.336] agent/is_node_resp: node:us1ghpcgpu001 RPC:REQUEST_PING : Can't find an address, check slurm.conf
[2023-09-19T05:09:43.618] error: Nodes uptuw[497,522],us1ghpcgpu001 not responding
[2023-09-19T05:16:22.389] debug:  Spawning ping agent for apps[1-4],archive[1-2],backup2,cpu-[001-004,101-107,109-139],gpu-[512-518,520-521,607-612,702,801-818,901-902,904-914],uptun[072-074],uptuw[497-498,522],uptuw001gdc,us1ghpcgpu001
[2023-09-19T05:16:22.404] Node us1ghpcgpu001 now responding
[2023-09-19T05:19:43.431] error: _find_node_record: lookup failure for node "us1ghpcgpu001.corpnet2.com"
[2023-09-19T05:24:43.568] error: _find_node_record: lookup failure for node "us1ghpcgpu001.corpnet2.com"
[2023-09-19T05:29:44.003] error: _find_node_record: lookup failure for node "us1ghpcgpu001.corpnet2.com"
[2023-09-19T05:34:44.049] error: _find_node_record: lookup failure for node "us1ghpcgpu001.corpnet2.com"
[2023-09-19T05:39:43.888] error: _find_node_record: lookup failure for node "us1ghpcgpu001.corpnet2.com"
[2023-09-19T05:43:02.748] debug:  Spawning ping agent for apps[1-4],archive[1-2],backup2,cpu-[001-004,101-107,109-139,413-417,419-451,453-460,501-502],uptun[072-074],uptuw498,uptuw001gdc,us1ghpcgpu001
[2023-09-19T05:44:44.332] error: _find_node_record: lookup failure for node "us1ghpcgpu001.corpnet2.com"
[...]

One of the slurmd-related requirements is nhc, which is also installed on the rest of the nodes. Without nhc installed slurmd reports issues; I'm not sure if that matters. So I decided to install and configure nhc on these servers too.

Once nhc was configured and slurmd restarted, I was still having issues with time synchronization:

[2023-09-11T01:18:07.776] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Failed to compare time to server master 1246 times.

I tried to resolve this issue by comparing against the configuration on the nodes managed by Bright, but had no luck. It seems that ntpdate is used along with chronyd, and even though ntpdate was properly configured and stopped complaining, the issue persisted. Finally, I decided to comment out the line in the nhc.conf file that is responsible for comparing time. I'm just mentioning this because it may be important.
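
For reference, a quick way to double-check the chrony/ntpdate side of this from one of the new nodes is something like the following (the server name is only a placeholder):

chronyc tracking
chronyc sources -v
ntpdate -q ntp-master

chronyc tracking shows the offset against the currently selected time source, and ntpdate -q only queries the server without stepping the clock.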

The slurm.conf file is exactly the same across all the clusters; the /etc/slurm directory is shared from the head node with all nodes (this environment doesn't use configless). The /etc/hosts file looks good too, and I can ssh into the nodes from the head node. When I execute scontrol ping, I get the response that it is UP. I'm not sure what else can be checked here.
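
For reference, the checks from the head node were along these lines (node name as an example; note that scontrol ping reports on the slurmctld daemons themselves rather than on a compute node):

scontrol ping
getent hosts us1ghpcgpu001
scontrol show node us1ghpcgpu001 | grep NodeAddr

getent hosts confirms that the name resolves on the head node, and the NodeAddr/NodeHostName fields show which address slurmctld uses to reach the node.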

I'm attaching config files. Let me know if you need anything else.

Thanks,
Radek
Comment 1 GSK-ONYX-SLURM 2023-09-19 04:58:02 MDT
Created attachment 32314 [details]
slurm.conf
Comment 2 GSK-ONYX-SLURM 2023-09-19 04:58:21 MDT
Created attachment 32315 [details]
nodes.conf
Comment 3 GSK-ONYX-SLURM 2023-09-19 04:59:09 MDT
Created attachment 32316 [details]
partitions.conf
Comment 4 GSK-ONYX-SLURM 2023-09-19 04:59:28 MDT
Created attachment 32317 [details]
gres.conf
Comment 5 GSK-ONYX-SLURM 2023-09-19 05:00:16 MDT
Created attachment 32318 [details]
us1ghpcgpu001 - /etc/hosts
Comment 6 GSK-ONYX-SLURM 2023-09-19 05:00:58 MDT
Created attachment 32319 [details]
us1ghpcgpu001 - hostnamectl
Comment 7 Ben Roberts 2023-09-20 08:51:16 MDT
Hi Radek,

We have an internal bug open that sounds like it could be related to what you're seeing. It has to do with nodes being added dynamically, and it causes errors that show:
  lookup failure for node "<node name>"

This bug is being worked on, but doesn't have a fix yet.  Are you adding these nodes dynamically (https://slurm.schedmd.com/dynamic_nodes.html)?
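
For reference, a dynamically added node is one created without a permanent entry in slurm.conf, roughly like one of the following (node name and specs are only illustrative):

scontrol create NodeName=dyn001 CPUs=16 RealMemory=64000 State=CLOUD
slurmd -Z --conf "CPUs=16 RealMemory=64000"

If every node has its own entry in slurm.conf/nodes.conf, they are static nodes and that internal bug is less likely to apply.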
Comment 9 GSK-ONYX-SLURM 2023-09-20 10:11:48 MDT
Hi Ben,

thanks a lot for your prompt response. I'm not adding nodes dynamically. All the nodes have been added into the slurm.conf file. 

Thanks,
Radek
Comment 11 GSK-ONYX-SLURM 2023-09-21 10:34:49 MDT
Hi Ben,

I think the main problem here is related to this:

[2023-09-21T12:11:50.264] agent/is_node_resp: node:uptuw522 RPC:REQUEST_PING : Can't find an address, check slurm.conf

When it happens, the affected node, uptuw522 (and also uptuw497), is seen as idle*:

[I am root!@ushpc:~]# sinfo -Nl | grep login
login1             1         admin     drained 96     2:24:2 262144  1906796      1 login,in NHC: check_hw_swap_f
login2             1         admin     drained 96     2:24:2 262144  1906796      1 login,in NHC: check_hw_swap_f
login3             1         admin     drained 96     2:24:2 262144  1906796      1 login,in NHC: check_hw_swap_f
login4             1         admin     drained 96     2:24:2 262144  1906796      1 login,in NHC: check_hw_swap_f
login5             1         admin        idle 96     2:24:2 262144  1906796      1 login,in none
uptuw497           1         admin       idle* 20     2:10:1 128000  1900000      1 login,in none
uptuw498           1         admin        idle 20     2:10:1 257000  1900000      1 login,in none
uptuw522           1         admin       idle* 20     2:10:1 128000  1900000      1 login,in none

They both have been added outside Bright, if that matters...

Nothing in the slurmd logs, everything looks good.

Checking the slurmctld logs again after a while:

[2023-09-21T12:08:16.334] error: _find_node_record: lookup failure for node "uptuw522.corpnet2.com"
[2023-09-21T12:11:50.264] agent/is_node_resp: node:uptuw522 RPC:REQUEST_PING : Can't find an address, check slurm.conf
[2023-09-21T12:26:41.007] Node uptuw522 now responding

and the node is responding.

Thanks,
Radek
Comment 12 GSK-ONYX-SLURM 2023-09-22 02:54:27 MDT
Hi Oriol,

I think I know where the problem might be. Some of the new nodes are connected to the external network only, while the other compute nodes are connected to the internal network. The head node is connected to both. I think it might have something to do with a missing routing table entry, or DNS, or both. I will be checking this now.

The rest of the new nodes are connected to both networks and can communicate with other nodes without any problems. However, the line:

[2023-09-22T04:28:28.285] agent/is_node_resp: node:us1ghpcgpu003 RPC:REQUEST_PING : Can't find an address, check slurm.conf

appears for these nodes too. I noticed that the hostname is an FQDN on those nodes, while the other nodes don't have a domain. Does that matter?
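
For comparison, this is roughly what I mean (the outputs shown are examples):

hostname -s    # e.g. us1ghpcgpu003
hostname -f    # e.g. us1ghpcgpu003.corpnet2.com

As far as I can tell, the NodeName entries in nodes.conf use the short names, so I wasn't sure whether the FQDN hostname on these nodes could confuse the lookup.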

Thanks,
Radek
Comment 13 GSK-ONYX-SLURM 2023-09-22 09:43:36 MDT
Update - I just restarted all compute nodes across the cluster and I haven't seen the issue since 8:23 AM EDT today. Before that, the error appeared in the log every hour, so it looks promising. Let me know what you think anyway, and whether there's anything else I could check.

Happy weekend!

Radek
Comment 14 Oriol Vilarrubi 2023-09-22 10:01:34 MDT
Hi Radek,

Sorry for not answering earlier. For the nodes to be usable they need a routable connection to the controller, they need to be resolvable by name, and they need to be in time sync with the controller. If you cannot modify the global DNS, or don't want to, you can also add the node IP address directly in Slurm by using the NodeAddr parameter on each node.
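
A node definition with an explicit address would look roughly like this (the address and the other parameters below are placeholders, not taken from your nodes.conf):

NodeName=uptuw522 NodeAddr=10.1.2.52 CPUs=20 RealMemory=128000 State=UNKNOWN

With NodeAddr set, slurmctld contacts that address directly instead of resolving the NodeName via DNS or /etc/hosts.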

So to check, for example, whether uptuw522 is a usable node, you can ping it directly from the controller.
The reason you sometimes see the domain in the Slurm logs and sometimes don't is that without the domain the log is referring to the node as a Slurm entity, whereas with the domain we are trying to resolve it as a DNS name, and that is the default domain to search with as instructed by the controller; if my memory does not fail me, that is called the DNS search domain.
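
That default search domain normally comes from the resolver configuration on the controller, for example something like this in /etc/resolv.conf (values are only an example):

search corpnet2.com
nameserver 10.0.0.2

which is why the lookup failures in your log show the node name with .corpnet2.com appended.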

When you see a node with a state that ends in *, it means the node is not responding. If you want to see the full state information for a node, you can use scontrol show node <nodename>.

Are you still seeing the nhc time issue?

Regards.
Comment 15 GSK-ONYX-SLURM 2023-09-25 00:10:30 MDT
Hi Oriol,

All the names are resolvable across the nodes, and I think I found the problem. Once the new nodes were added, I only restarted slurmctld on the head node and slurmd on the new nodes, forgetting to do that on the rest of the nodes. Every time I wanted to execute an srun job, I was getting an error saying there's a problem with an address. Also, I saw that the new nodes were not responding from time to time.

Last Friday I restarted slurmd on all nodes in the cluster, and since then I haven't seen anything like "RPC:REQUEST_PING : Can't find an address, check slurm.conf". What's your take on this?
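
For the record, the restart itself was nothing special, roughly the following run from the head node (pdsh and the host list placeholder are only an example of how it can be driven):

pdsh -w <all_compute_nodes> systemctl restart slurmd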

I'm still having the problem with nhc and time synchronization, but that's outside Slurm, so as long as you confirm that the slurmd restart was required (we don't use configless on that cluster), I think we can close the ticket.

Thanks,
Radek
Comment 16 Oriol Vilarrubi 2023-09-25 03:47:09 MDT
Hello Radek,

Yes, that is 100% the issue. Whenever you add new nodes you need to reload the configuration on all the daemons so that they are aware of those new nodes. Otherwise they will receive messages that reference those new nodes, and those will not be recognized as valid Slurm nodes.

Sorry for not pointing this out earlier; I was assuming that all nodes had been freshly restarted.

The same idea also applies to configuration parameters, although for the majority of those it is enough to issue an scontrol reconfigure command.
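
As a rough sketch of that workflow (pdsh and the host list placeholder are only examples, not a required tool):

# most slurm.conf parameter changes
scontrol reconfigure

# adding or removing nodes on this version: restart the daemons so they re-read the node table
systemctl restart slurmctld                    # on the controller
pdsh -w <all_nodes> systemctl restart slurmd   # on every node running slurmd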

I will now proceed to close this ticket, but do not hesitate to reopen it if needed.

Regards.