| Summary: | Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in standby mode | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Naresh M <naresh.midatha> |
| Component: | slurmctld | Assignee: | Benjamin Witham <benjamin.witham> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | benjamin.witham |
| Version: | 22.05.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | SiFive | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf and slave logs, primary_controller.log | | |
Hello Naresh, I have a few questions for you.

1. Are you getting these errors on both of your Slurm controllers, or just on your backup controller?
2. Is the backup controller not taking over when the primary controller shuts down? If so, could I get the logs from your primary Slurm controller?
3. Have you had any outages of your primary controller since the move?

1. Are you getting these errors on both of your Slurm controllers, or just on your backup controller?

We are getting errors only on the backup controller.

2. Is the backup controller not taking over when the primary controller shuts down? If so, could I get the logs from your primary Slurm controller?

Yes, it is not taking over; let me attach the primary controller logs.

3. Have you had any outages of your primary controller since the move?

No outages so far.

Created attachment 30934 [details]
primary_controller.log
Hi, any update?

Hello Naresh,
I've been looking at your primary controller log, and I see many errors to this effect:
> error: _find_node_record: lookup failure for node "omega01.internal.sifive.com"
> error: _find_node_record: lookup failure for node "omega00.internal.sifive.com"
What is the state of these nodes? Are they responding to pings from the controller?
What are you doing to simulate the failure of the primary controller? I'm not sure that the invalid RPC calls are the reason behind the backup controller failing to take over.
Could you send me the output of sdiag as well?
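As a hedged aside for context (not stated in the ticket itself): `_find_node_record` lookup failures for fully qualified names often indicate a mismatch between the name a node registers with and the `NodeName` entries in slurm.conf. One way to reconcile the two, sketched with hostnames borrowed from the log excerpts above but otherwise hypothetical values, is to pair a short `NodeName` with an explicit `NodeHostname`:

```
# slurm.conf sketch (hypothetical; assumes the actual config uses short
# NodeNames while the nodes report their FQDNs when registering).
# NodeName is the label Slurm uses internally; NodeHostname is the name
# the node itself reports.
NodeName=omega00 NodeHostname=omega00.internal.sifive.com State=UNKNOWN
NodeName=omega01 NodeHostname=omega01.internal.sifive.com State=UNKNOWN
```

Whether this applies here depends on the attached slurm.conf, which is not reproduced in this thread.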
Hello Naresh, are you still having trouble with your backup controller failing to take over?

Hello Naresh, I'm assuming that this is no longer a problem, so I'll close this ticket. If you are still experiencing it, feel free to reopen this ticket.
Created attachment 30931 [details]
slurm.conf and slave logs

We are continuously getting the error below on the backup server after the IP change for both the slave and the master:

error: Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in standby mode

We are moving our datacenter from one location to another, so I moved hpcslave to the new datacenter first and promoted it to master, and we tested the slave in production for more than two days; then we took down hpcmaster and moved it to the new datacenter. Automatic failover is not working because of the RPC errors, but if I run `scontrol takeover` it works. Attached are the slurm.conf and slurmctld logs from the slave. Let me know if you need anything else.
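For context on the failover behavior described above (a sketch, not taken from the attached config): in Slurm, controller failover order is set by the sequence of `SlurmctldHost` lines in slurm.conf, and each entry may carry an explicit address in parentheses. After a datacenter move that changes IPs, both entries must resolve to the new addresses on every slurmd node and on both controllers, or daemons can keep contacting the wrong controller. The hostnames below come from the report; the addresses are hypothetical placeholders:

```
# slurm.conf sketch (hypothetical addresses). The first SlurmctldHost
# is the primary controller, the second is the backup; the backup waits
# SlurmctldTimeout seconds before assuming control.
SlurmctldHost=hpcmaster(10.0.0.10)
SlurmctldHost=hpcslave(10.0.0.11)
SlurmctldTimeout=120
```

`scontrol takeover` forces the backup to assume control immediately, which is consistent with manual takeover working here while automatic failover does not.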