Ticket 2718

Summary:	error: find_node_record: lookup failure for
Product:	Slurm	Reporter:	Simon Michnowicz <simon.michnowicz>
Component:	slurmctld	Assignee:	Tim Wickberg <tim>
Status:	RESOLVED INVALID	QA Contact:
Severity:	4 - Minor Issue
Priority:	---
Version:	14.11.6
Hardware:	Linux
OS:	Linux
Site:	Monash University	Slinky Site:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Attachments:	Our slurm.conf for your consideration

Description Simon Michnowicz 2016-05-10 15:29:12 MDT

Created attachment 3074 [details]
Our slurm.conf for your consideration

Dear Sir/Madam,
We have a cluster called 'monarch' as well as a login node called 'monarch'. Monarch (the node) is not part of the cluster, but has munge keys so people can submit jobs etc. from it.

Our slurmctld.log file is filling up with the same error, logged every 5 minutes:
error: find_node_record: lookup failure for monarch

- The monarch node does not run any Slurm processes (i.e. slurmd)
- The monarch node is listed in /etc/hosts

The only solution I could find is to list monarch as a node in slurm.conf and mark it as down. Some people in our team consider this unsatisfactory, as it may lead to other problems.

Would you be able to offer any suggestions?

I enclose our slurm.conf. Every node there has an entry in /etc/hosts on our slurmctld host.

regards
Simon

Comment 1 Tim Wickberg 2016-05-11 01:50:43 MDT

(In reply to Simon Michnowicz from comment #0)
> Created attachment 3074 [details]
> Our slurm.conf for your consideration
> 
> Dear Sir/Madam
> we have a cluster called 'monarch' as well as a login node called 'monarch'.
> Monarch (the node) is not part of the cluster, but has munge keys so people
> can submit jobs etc from it.
> 
> Our slurmctld.log file is filling up with constant warnings (every 5 minutes)
> error: find_node_record: lookup failure for monarch
> 
> -monarch node does not run any slurm processes (i.e. slurmd)
> -monarch node is listed in /etc/hosts
> 
> The only solution I could find is to list monarch as a node in slurm.conf
> and mark it as down. Some people in our team consider this unsatisfactory as
> it may lead to other problems.

You shouldn't need to add login nodes or other submit hosts to slurm.conf; running without them is common.

Is there a slurmd process running on monarch? I think that's the most likely source of this message - slurmd's attempts to register itself with slurmctld would fail and would produce this message. The five-minute interval matches up with this as well.

slurmd should only run on the compute nodes; it does not need to run on login nodes. Only munge is needed there.
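
The five-minute cadence mentioned above can be checked directly against the controller log. The following is a minimal sketch, not a Slurm tool: the helper names lookup_failures and lookup_failure_counts are hypothetical, and it assumes each slurmctld.log line starts with a bracketed timestamp followed by the message, with the node name as the last field.

```shell
# Sketch only. Both helpers read slurmctld log lines on stdin.
# Assumed line format (an assumption, check your own log):
#   [2016-05-10T15:25:01.123] error: find_node_record: lookup failure for monarch

# Print "<node name> <timestamp>" for every lookup failure, in log order,
# so the spacing between consecutive messages can be read off.
lookup_failures() {
    grep 'find_node_record: lookup failure for' |
        awk '{ print $NF, $1 }'   # last field = node name, first field = timestamp
}

# Print "<node name> <count>" for every node name that triggered the error.
lookup_failure_counts() {
    grep 'find_node_record: lookup failure for' |
        awk '{ count[$NF]++ } END { for (n in count) print n, count[n] }' |
        sort
}
```

For example, lookup_failures < /var/log/slurmctld.log, where the path is an assumption; the actual location is whatever SlurmctldLogFile is set to in slurm.conf.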

Comment 2 Simon Michnowicz 2016-05-11 17:43:39 MDT

Tim,
thanks for that. I can confirm that there are no slurmd daemons on monarch.

$ hostname
monarch

$ ps aux | grep slurm
smichnow 17063  0.0  0.0 112652   964 pts/0    S+   16:38   0:00 grep --color=auto slurm

Is there an easy way to determine the IP of the machine trying to register with the slurm controller? Maybe another machine out there is called 'monarch' and is trying to connect. I may have to resort to Wireshark to trace this.

regards
Simon

Comment 3 Simon Michnowicz 2016-05-12 17:13:30 MDT

Tim,
I found we had nhc running on our login node, and it was trying to mark the login node's state as up/down. You can close the ticket now.
thanks for your assistance.

Simon

Comment 4 Tim Wickberg 2016-05-13 00:58:59 MDT

Ahh, that makes sense. 'scontrol update nodename=monarch state=whatever' would cause that message. Marking closed now.

- Tim
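
The cause identified above (a health check running 'scontrol update' for a host that is not a node in slurm.conf) could be guarded against in a wrapper script. The following is a minimal sketch under stated assumptions, not nhc's own mechanism: node_is_configured is a hypothetical helper, and the commented usage assumes sinfo and scontrol are on PATH and that the node name matches the short hostname.

```shell
# Sketch only. Succeeds if the name given as $1 appears, as a whole line,
# in the list of node names read on stdin (one name per line).
node_is_configured() {
    # -F: fixed string, -x: match the whole line, -q: no output, status only
    grep -Fxq -- "$1"
}

# Possible use in a health-check wrapper (assumptions: "sinfo -h -N -o %N"
# prints one configured node name per line, and new_state holds whatever
# state the health check decided on):
#
#   host=$(hostname -s)
#   if sinfo -h -N -o '%N' | node_is_configured "$host"; then
#       scontrol update nodename="$host" state="$new_state"
#   fi
```

On a login node such as monarch the check fails, so no update is sent and slurmctld has nothing to log. The simpler fix, as in this ticket, is not to run the health check on hosts outside the cluster at all.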