Ticket 12845

Summary: Help needed with taming a misbehaving node
Product: Slurm Reporter: Jurij Pečar <jurij.pecar>
Component: slurmctld Assignee: Jason Booth <jbooth>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Version: 20.11.7   
Hardware: Linux   
OS: Linux   
Site: EMBL

Description Jurij Pečar 2021-11-12 01:33:06 MST
We're running an el7-based cluster that I now want to start migrating to el8. My first step is adding an el8-based node that I'll use to rebuild our software stack. My issue is that I can't get this new el8 node to behave.

Here's what happens: when slurmd starts on the el8 node, the el7 slurmctld registers the node and marks it idle. Within a few seconds the node goes into the idle* state, "not responding". Looking at the slurmctld debug5 logs, the culprit seems to be

agent/is_node_resp: node:bn05 RPC:REQUEST_ACCT_GATHER_UPDATE : Can't find an address, check slurm.conf

where bn05 is the el8 based node in question.

I haven't been able to drill down to the root cause and eliminate it. Things I've already checked:
* network communication is ok (same vlan, proper MTUs on all involved switches)
* config files are ok (puppet managed, so the same everywhere)
* hosts file is ok (puppet managed, so the same everywhere)
* DNS resolution is ok (both forward and reverse)
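The DNS check above can be sketched as a small script (run from both the controller and the new node; `localhost` below is only a stand-in for bn05):

```shell
#!/bin/sh
# Forward/reverse DNS sanity check for a node name.
check_node() {
    node=$1
    # first forward record for the node
    addr=$(getent hosts "$node" | awk '{print $1; exit}')
    [ -n "$addr" ] || { echo "no forward record for $node"; return 1; }
    # reverse-resolve that address back to a name
    back=$(getent hosts "$addr" | awk '{print $2; exit}')
    echo "$node -> $addr -> $back"
    # forward and reverse resolution should agree
    [ "$back" = "$node" ]
}

check_node localhost   # substitute bn05 on the cluster
```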

I did some grepping through the Slurm source to gain some insight, but it's not really clear where this error originates. I haven't tried gdb yet, though.

Do you have something like a checklist for cases like this?

Thanks for the help.
Comment 1 Jason Booth 2021-11-12 10:33:43 MST
Would you run the following commands on the scheduler and on bn05, then send back the output?

> nslookup bn05
> ping bn05

Please also attach your slurm.conf.

> agent/is_node_resp: node:bn05 RPC:REQUEST_ACCT_GATHER_UPDATE : Can't find an address, check slurm.conf

I assume this is from slurmctld.log? Please also attach your slurmctld and slurmd logs to this issue.
Comment 2 Jurij Pečar 2021-11-13 01:44:07 MST
Funnily enough, after sleeping on it, the next day this node works as expected.
Must have been a negative DNS cache somewhere ...

By the way, while browsing the code I saw lots of error("...") lines that look like they print information I'd be interested in, but I didn't see any of them in the logs, even at debug5 level. How do I get these error lines to reach stdout/stderr or a log file?
Comment 3 Jason Booth 2021-11-15 11:42:46 MST
> Funnily enough, after sleeping on it, the next day this node works as expected.
> Must have been a negative DNS cache somewhere ...

Great.

> By the way, while browsing the code I saw lots of error("...") lines that look like they print information I'd be interested in, but I didn't see any of them in the logs, even at debug5 level. How do I get these error lines to reach stdout/stderr or a log file?

Would you highlight the log lines you are interested in? You can also paste a GitHub link with the line number.

For example: 
https://github.com/SchedMD/slurm/blob/slurm-20-11-7-1/src/slurmctld/node_scheduler.c#L320
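For what it's worth, error() output goes to the normal slurmctld/slurmd logs and should appear even at the default verbosity, so if a particular error() line never shows up, that code path is most likely not being executed. The relevant slurm.conf logging settings look like this (paths and DebugFlags values illustrative):

```
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurm/slurmd.log
DebugFlags=Agent,Route
```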
Comment 4 Jurij Pečar 2021-11-15 12:00:10 MST
For example, a line like this:
https://github.com/SchedMD/slurm/blob/slurm-20-11-7-1/src/common/forward.c#L360
From what I could see, slurm_conf_get_addr was returning SLURM_ERROR here, but I couldn't see exactly what it was passing as "name" to verify that it's something reasonable and expected.
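One way to see the actual argument would be gdb, e.g. a sketch like the following (breakpoint and commands illustrative; slurmctld needs debug symbols installed):

```shell
#!/bin/sh
# Attach to a running slurmctld and print the "name" argument the next
# time slurm_conf_get_addr() is called.
pid=$(pidof slurmctld 2>/dev/null)
if [ -n "$pid" ]; then
    gdb -batch -p "$pid" \
        -ex 'break slurm_conf_get_addr' \
        -ex 'continue' \
        -ex 'print name'
else
    echo "slurmctld not running"
fi
```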

Anyway, you can now close this issue.
Comment 5 Jason Booth 2021-11-15 14:17:22 MST
Resolving