Ticket 17045

Summary: Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in standby mode
Product: Slurm Reporter: Naresh M <naresh.midatha>
Component: slurmctld    Assignee: Benjamin Witham <benjamin.witham>
Status: RESOLVED TIMEDOUT QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: benjamin.witham
Version: 22.05.6   
Hardware: Linux   
OS: Linux   
Site: SiFive Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm.conf and slave logs
primary_controller.log

Description Naresh M 2023-06-25 23:46:49 MDT
Created attachment 30931 [details]
slurm.conf and slave logs

We are continuously getting the error below on the backup server after changing the IPs of both the slave and the master.

error: Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in standby mode

We are moving our datacenter from one location to another. I moved hpcslave to the new datacenter first and promoted it to master, and we tested the slave in production for more than two days. We then took down hpcmaster and moved it to the new datacenter.

Automatic failover is not working, apparently because of the RPC errors, but if I run "scontrol takeover" manually, it works.

Attached are the slurm.conf and slurmctld logs from the slave.

Let me know if you need anything else.
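
For reference, a minimal sketch of the slurm.conf settings that govern controller failover, assuming the hostnames hpcmaster/hpcslave and a shared state directory (the exact values on this cluster may differ; see the attached slurm.conf for the real ones):

```
# Controllers are listed in priority order; the first entry is the primary.
# After an IP change, both names must resolve to the NEW addresses on every
# node and on both controllers, or registration RPCs go to the wrong host.
SlurmctldHost=hpcmaster
SlurmctldHost=hpcslave

# Both controllers must see the same state directory for takeover to work
# (typically a shared filesystem).
StateSaveLocation=/var/spool/slurmctld

# Seconds the backup waits without contact before assuming the primary is
# down and taking over automatically.
SlurmctldTimeout=120
```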
Comment 1 Benjamin Witham 2023-06-26 09:51:52 MDT
Hello Naresh, 

I have a few questions for you. 

1. Are you getting these errors on both of your slurm controllers or just on your backup controller?
2. Is the backup controller not taking over if the primary controller shuts down? If so, could I get the logs to your primary slurm controller?
3. Have you had any outages of your primary controller since the move?
Comment 2 Naresh M 2023-06-26 10:04:51 MDT
1. Are you getting these errors on both of your slurm controllers or just on your backup controller?

We are getting errors only on the backup controller.

2. Is the backup controller not taking over if the primary controller shuts down? If so, could I get the logs to your primary slurm controller?

Yes, it is not taking over. Let me attach the primary controller logs.

3. Have you had any outages of your primary controller since the move?

No outages so far.
Comment 3 Naresh M 2023-06-26 10:07:16 MDT
Created attachment 30934 [details]
primary_controller.log
Comment 4 Naresh M 2023-06-28 01:55:30 MDT
Hi,

Any update?
Comment 5 Benjamin Witham 2023-06-28 14:50:01 MDT
Hello Naresh, 

I've been looking at your primary controller log, and I see many errors to this effect:

> error: _find_node_record: lookup failure for node "omega01.internal.sifive.com"
> error: _find_node_record: lookup failure for node "omega00.internal.sifive.com"

What is the state of these nodes? Are they responding to pings from the controller?

What are you doing to simulate the failure of the primary controller? I'm not sure that the invalid RPC calls are the reason behind the backup controller failing to take over.

Could you send me the output of sdiag as well?
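
As a sketch of the checks above (standard Slurm CLI, run on a live cluster; the node name is taken from the log excerpt):

```
# Report whether each configured controller answers and which is primary.
scontrol ping

# Check the state Slurm records for one of the nodes failing lookup,
# and whether it is reachable from the controller at all.
scontrol show node omega01.internal.sifive.com
ping -c 3 omega01.internal.sifive.com

# Scheduler diagnostics, including RPC counts broken down by message type.
sdiag
```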
Comment 6 Benjamin Witham 2023-07-20 13:53:34 MDT
Hello Naresh, 

Are you still having trouble with your backup controller failing to take over?
Comment 7 Benjamin Witham 2023-08-07 11:57:02 MDT
Hello Naresh, 

I'm assuming that this isn't a problem anymore, so I'll close this ticket. If you are still experiencing this problem, feel free to reopen this ticket.