| Summary: | error: find_node_record: lookup failure for | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Simon Michnowicz <simon.michnowicz> |
| Component: | slurmctld | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 14.11.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Monash University | Slinky Site: | --- |
| Attachments: | Our slurm.conf for your consideration | ||
(In reply to Simon Michnowicz from comment #0)
> Created attachment 3074 [details]
> Our slurm.conf for your consideration
>
> Dear Sir/Madam
> we have a cluster called 'monarch' as well as a login node called 'monarch'.
> Monarch (the node) is not part of the cluster, but has munge keys so people
> can submit jobs etc from it.
>
> Our slurmctld.log file is filling up with constant warnings (every 5 minutes)
> error: find_node_record: lookup failure for monarch
>
> -monarch node does not run any slurm processes (i.e. slurmd)
> -monarch node is listed in /etc/hosts
>
> The only solution I could find is to list monarch as a node in slurm.conf
> and mark it as down. Some people in our team consider this unsatisfactory as
> it may lead to other problems.

You shouldn't need to add login nodes or other submit hosts to slurm.conf; running without them is common.

Is there a slurmd process running on monarch? I think that's the most likely source of this message - slurmd's attempts to register itself with slurmctld would fail and would produce this message. The five-minute interval matches up with this as well.

slurmd should only run on the compute nodes, it does not need to run on login nodes - only munge is needed there.

Tim,
thanks for that. I can confirm that there are no slurmd daemons on monarch.

    hostname
    monarch
    ps aux | grep slurm
    smichnow 17063 0.0 0.0 112652 964 pts/0 S+ 16:38 0:00 grep --color=auto slurm

Is there an easy way to determine the IP of the machine trying to register with the slurm controller? Maybe another machine out there is called 'monarch' and is trying to connect. I may have to resort to Wireshark to trace this.

regards
Simon

Tim,
I found we had nhc on our login node and it was trying to mark the state of the login node as up/down. You can close the ticket now.
thanks for your assistance.
Simon

Ahh, that makes sense. 'scontrol update nodename=monarch state=whatever' would cause that message.

Marking closed now.

- Tim
Created attachment 3074 [details]
Our slurm.conf for your consideration

Dear Sir/Madam
we have a cluster called 'monarch' as well as a login node called 'monarch'. Monarch (the node) is not part of the cluster, but has munge keys so people can submit jobs etc from it.

Our slurmctld.log file is filling up with constant warnings (every 5 minutes)
error: find_node_record: lookup failure for monarch

-monarch node does not run any slurm processes (i.e. slurmd)
-monarch node is listed in /etc/hosts

The only solution I could find is to list monarch as a node in slurm.conf and mark it as down. Some people in our team consider this unsatisfactory as it may lead to other problems.

Would you be able to offer any suggestions? I enclose our slurm.conf. Every node there has an entry in /etc/hosts on our slurmctl.

regards
Simon
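
For anyone chasing the same message, the regular cadence of the error turned out to be a useful clue here, since it pointed at something running on a timer (in this case nhc). The following is only a rough sketch of how one might pull that cadence out of slurmctld.log. It assumes each log line starts with a bracketed timestamp such as `[2015-08-20T16:30:01.000]`, which is not shown anywhere in this ticket, and the function name `lookup_failure_intervals` is made up for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# ASSUMPTION: slurmctld.log lines look like
#   [2015-08-20T16:30:01.000] error: find_node_record: lookup failure for monarch
# Only the message text comes from the ticket; the timestamp layout is assumed.
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?\]\s+"
    r"error: find_node_record: lookup failure for (?P<node>\S+)"
)


def lookup_failure_intervals(lines):
    """Return {node_name: [seconds between consecutive lookup failures]}.

    `lines` is any iterable of log lines, e.g. an open file object.
    Lines that do not match the assumed format are ignored.
    """
    stamps = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            when = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
            stamps[match.group("node")].append(when)

    # Pair each timestamp with the next one and take the difference.
    return {
        node: [int((b - a).total_seconds()) for a, b in zip(ts, ts[1:])]
        for node, ts in stamps.items()
    }


# Hypothetical usage against a real log (the path depends on your site):
#   with open("/var/log/slurmctld.log") as fh:
#       print(lookup_failure_intervals(fh))
```

A steady run of identical gaps for one node name suggests a periodic job (cron, a health checker, or a slurmd retrying registration) rather than user activity, which narrows down where to look on the host of that name.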