Ticket 17045

Summary: Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in standby mode
Product: Slurm Reporter: Naresh M <naresh.midatha>
Component: slurmctld    Assignee: Benjamin Witham <benjamin.witham>
Status: RESOLVED TIMEDOUT QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: benjamin.witham
Version: 22.05.6   
Hardware: Linux   
OS: Linux   
Site: SiFive Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm.conf and slave logs
primary_controller.log

Description Naresh M 2023-06-25 23:46:49 MDT
Created attachment 30931 [details]
slurm.conf and slave logs

We are continuously getting the error below on the backup server after changing the IPs of both the slave and the master.

error: Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in standby mode

We are moving our datacenter from one location to another. I moved hpcslave to the new datacenter first and promoted it to master, and we tested the slave in production for more than two days. We then took down hpcmaster and moved it to the new datacenter.

Automatic failover is not working, apparently because of the RPC errors, but if I run "scontrol takeover" manually, it works.

Attached are the slurm.conf and slurmctld logs from the slave.

Let me know if you need anything else.
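
For reference, a minimal sketch of the slurm.conf settings that govern controller failover, assuming the hostnames hpcmaster/hpcslave and a shared state directory (the exact values on this cluster may differ; see the attached slurm.conf for the real ones):

```
# Controllers are listed in priority order; the first entry is the primary.
# After an IP change, both names must resolve to the NEW addresses on every
# node and on both controllers, or registration RPCs go to the wrong host.
SlurmctldHost=hpcmaster
SlurmctldHost=hpcslave

# Both controllers must see the same state directory for takeover to work
# (typically a shared filesystem).
StateSaveLocation=/var/spool/slurmctld

# Seconds the backup waits without contact before assuming the primary is
# down and taking over automatically.
SlurmctldTimeout=120
```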
Comment 1 Benjamin Witham 2023-06-26 09:51:52 MDT
Hello Naresh, 

I have a few questions for you. 

1. Are you getting these errors on both of your slurm controllers or just on your backup controller?
2. Is the backup controller not taking over if the primary controller shuts down? If so, could I get the logs to your primary slurm controller?
3. Have you had any outages of your primary controller since the move?
Comment 2 Naresh M 2023-06-26 10:04:51 MDT
1. Are you getting these errors on both of your slurm controllers or just on your backup controller?

We are getting errors only on the backup controller.

2. Is the backup controller not taking over if the primary controller shuts down? If so, could I get the logs to your primary slurm controller?

Yes, it is not taking over. Let me attach the primary controller logs.

3. Have you had any outages of your primary controller since the move?

No outages so far.
Comment 3 Naresh M 2023-06-26 10:07:16 MDT
Created attachment 30934 [details]
primary_controller.log
Comment 4 Naresh M 2023-06-28 01:55:30 MDT
Hi,

Any update?
Comment 5 Benjamin Witham 2023-06-28 14:50:01 MDT
Hello Naresh, 

I've been looking at your primary controller log, and I see many errors to this effect:

> error: _find_node_record: lookup failure for node "omega01.internal.sifive.com"
> error: _find_node_record: lookup failure for node "omega00.internal.sifive.com"

What is the state of these nodes? Are they responding to pings from the controller?

What are you doing to simulate the failure of the primary controller? I'm not sure that the invalid RPC calls are the reason behind the backup controller failing to take over.

Could you send me the output of sdiag as well?
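
As a sketch of the checks above (standard Slurm CLI, run on a live cluster; the node name is taken from the log excerpt):

```
# Report whether each configured controller answers and which is primary.
scontrol ping

# Check the state Slurm records for one of the nodes failing lookup,
# and whether it is reachable from the controller at all.
scontrol show node omega01.internal.sifive.com
ping -c 3 omega01.internal.sifive.com

# Scheduler diagnostics, including RPC counts broken down by message type.
sdiag
```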
Comment 6 Benjamin Witham 2023-07-20 13:53:34 MDT
Hello Naresh, 

Are you still having trouble with your backup controller failing to take over?
Comment 7 Benjamin Witham 2023-08-07 11:57:02 MDT
Hello Naresh, 

I'm assuming that this isn't a problem anymore, so I'll close this ticket. If you are still experiencing this problem, feel free to reopen this ticket.