A little background on our setup: we have 3 Cray systems as well as some "external" nodes that don't live within any of the Cray systems. Each Cray system has a slurmctld that runs inside the Cray, plus a backup slurmctld on a VM outside the Cray. Each Cray node has a hostname of the form "nid#####", and these "nid" names are duplicated among the systems. Since we need interactive jobs to work from the shared login nodes, we created hostname aliases of the form "cluster-nid#####". Inside the Cray, the hosts file has entries for both forms; outside the Cray, nodes only have the "cluster-nid#####" entries.

On c4, for example, we use:

NodeName=nid0[0000-0015,0020-0075,0080-0143,0148-0459,0464-0527,0532-0843,0848-0911,0916-1295,1300-2687] NodeHostname=c4-nid0[0000-0015,0020-0075,0080-0143,0148-0459,0464-0527,0532-0843,0848-0911,0916-1295,1300-2687] <rest of the parameters>

This seems to work fine for both batch and interactive jobs.

Problem: last week our primary controller on c4 crashed. The backup on the VM took over, but quickly marked all the nodes as unresponsive and requeued all the running jobs, with messages like:

[2020-02-26T12:29:18.041] error: Unable to resolve "nid00035": Unknown host
[2020-02-26T12:29:18.041] requeue job JobId=268605023 due to failure of node nid01982

# scontrol --local show node=nid00035 | grep Node
NodeName=nid00035 Arch=x86_64 CoresPerSocket=18
   NodeAddr=c4-nid00035 NodeHostName=c4-nid00035 Version=19.05.5

Based on my reading of the docs for NodeName, NodeHostname, and NodeAddr, I would have expected this to use the c4-nid##### hostname instead of the nid##### hostname. That didn't seem to be the case... can you recommend how we should set this up?
Matt,

Can you please provide the following from each cluster:
> getent hosts nid00035
> getent hosts c4-nid00035

Can you please also attach your slurm.conf for each cluster.

Thanks,
--Nate
Sorry, we are having some connectivity issues after the tornado in Nashville severed one of the lab's fiber connections. Here's what I'm able to grab now; I can hopefully get a more complete response tomorrow.

# Primary controller for c4, inside the c4 Cray
c4-sys0:~ # hostname
c4-sys0
c4-sys0:~ # getent hosts nid00035
172.25.32.36    nid00035 c4-nid00035 c0-0c0s8n3
c4-sys0:~ # getent hosts c4-nid00035
172.25.32.36    nid00035 c4-nid00035 c0-0c0s8n3

# Backup controller for c4, outside the Cray
[root@c4-slurm-backup c4]# hostname
c4-slurm-backup.ncrc.gov
[root@c4-slurm-backup c4]# getent hosts nid00035
[root@c4-slurm-backup c4]# getent hosts c4-nid00035
172.25.32.36    c4-nid00035

As you can see, the backup controller doesn't know the 'NodeName' hostname but can resolve the 'NodeHostname' hostname. I suppose I could add the 'NodeName' aliases on the backup controllers (since they only talk to nodes in their own cluster), but I can't do that on the login nodes because the 'NodeName' hostnames are ambiguous/duplicated.
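The asymmetry in the two getent transcripts above is plain files-based NSS resolution. As a minimal Python sketch (the hosts-file contents below are samples modeled on the output above, not pulled from the real systems), this mimics a `getent hosts` lookup and reproduces the "Unknown host" difference between the two controllers:

```python
# Simulate a files-based "getent hosts <name>" lookup against two hosts
# files: the primary (inside the Cray) lists every alias, while the
# backup only lists the cluster-prefixed form.  Addresses and names are
# illustrative samples based on the getent output above.

PRIMARY_HOSTS = "172.25.32.36 nid00035 c4-nid00035 c0-0c0s8n3\n"
BACKUP_HOSTS = "172.25.32.36 c4-nid00035\n"

def getent_hosts(hosts_text, name):
    """Return (addr, [names]) for the first line whose canonical name or
    any alias matches, like an nss 'files' lookup; None if unresolvable."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # strip comments, tokenize
        if len(fields) >= 2 and name in fields[1:]:
            return fields[0], fields[1:]
    return None

# Primary resolves both forms; backup only the prefixed one.
assert getent_hosts(PRIMARY_HOSTS, "nid00035") is not None
assert getent_hosts(PRIMARY_HOSTS, "c4-nid00035") is not None
assert getent_hosts(BACKUP_HOSTS, "nid00035") is None           # "Unknown host"
assert getent_hosts(BACKUP_HOSTS, "c4-nid00035") is not None
```

The login-node constraint is the same shape: once two clusters both define "nid00035", a flat alias list can no longer disambiguate, which is why only the prefixed names can be shared.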
Matt,

Can you please provide a few of the lines surrounding this one:
> [2020-02-26T12:29:18.041] error: Unable to resolve "nid00035": Unknown host

Thanks,
--Nate
(In reply to Nate Rini from comment #4)
> Matt
>
> Can you please provide the a few of the lines above:
> > [2020-02-26T12:29:18.041] error: Unable to resolve "nid00035": Unknown host

Please also run:
> scontrol show fed

Are you using the overridden host names in the database to make "external" clusters in 19.05?

Can you please also attach your slurm.conf for each cluster. I assume this is one of the open systems.

Thanks,
--Nate
(In reply to Nate Rini from comment #4)
> Can you please provide the a few of the lines above:
> > [2020-02-26T12:29:18.041] error: Unable to resolve "nid00035": Unknown host

[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604657
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604659
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604661
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604663
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604665
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604667
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604669
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604672
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604674
[2020-02-26T12:26:00.038] _reconcile_fed_job: origin c3 still has JobId=202604676
[2020-02-26T12:26:00.039] _reconcile_fed_job: origin c3 still has JobId=202604678
[2020-02-26T12:26:00.039] _reconcile_fed_job: origin c3 still has JobId=202604680
[2020-02-26T12:26:00.039] _reconcile_fed_job: origin c3 still has JobId=202604682
[2020-02-26T12:26:00.039] _reconcile_fed_job: origin c3 still has JobId=202604684
[2020-02-26T12:26:00.039] _reconcile_fed_job: origin c3 still has JobId=202604686
[2020-02-26T12:26:00.039] _reconcile_fed_job: origin c3 still has JobId=202604688
[2020-02-26T12:27:39.266] _job_complete: JobId=268605057 WEXITSTATUS 1
[2020-02-26T12:27:39.267] email msg to wesley.ebisuzaki@noaa.gov: Slurm Job_id=268605057 Name=C128_convonly Failed, Run time 00:43:10, FAILED, ExitCode 1
[2020-02-26T12:27:39.267] _job_complete: JobId=268605057 done
[2020-02-26T12:28:50.859] error: _pick_step_nodes: JobId=268605038 StepId=Batch has no step_node_bitmap
[2020-02-26T12:28:51.967] _job_complete: JobId=268605038 WEXITSTATUS 0
[2020-02-26T12:28:51.968] _job_complete: JobId=268605038 done
[2020-02-26T12:29:18.041] error: Unable to resolve "nid00035": Unknown host
[2020-02-26T12:29:18.041] requeue job JobId=268605023 due to failure of node nid01982
[2020-02-26T12:29:18.042] error: Error connecting, bad data: family = 0, port = 0
[2020-02-26T12:29:18.044] error: Unable to resolve "nid00035": Unknown host
[2020-02-26T12:29:18.044] email msg to Bill.Hurlin: Slurm Job_id=268605023 Name=CM4_c192L33_am4p0_piControl_new Failed, Run time 09:05:36, NODE_FAIL, ExitCode 0
[2020-02-26T12:29:18.044] error: Error connecting, bad data: family = 0, port = 0
[2020-02-26T12:29:18.126] error: Unable to resolve "nid00047": Unknown host
[2020-02-26T12:29:18.126] requeue job JobId=268604868 due to failure of node nid01983
[2020-02-26T12:29:18.127] error: Error connecting, bad data: family = 0, port = 0
[2020-02-26T12:29:18.129] error: Unable to resolve "nid00047": Unknown host

[root@c4-slurm-backup c4]# scontrol show fed
Federation: gaea
Self:    c4:192.188.179.96:6817 ID:4 FedState:ACTIVE Features:
Sibling: c3:192.188.179.71:6817 ID:3 FedState:ACTIVE Features: PersistConnSend/Recv:Yes/Yes Synced:Yes
Sibling: es:192.188.179.88:6817 ID:1 FedState:ACTIVE Features: PersistConnSend/Recv:Yes/Yes Synced:Yes

The "external" cluster is "gfdl". It is not in the federation. We don't do any hostname trickery for that, just manually set the IP address in the database.
Created attachment 13268 [details] slurm.conf from c4-sys0
Created attachment 13270 [details] slurm.conf from c3-sys0
Created attachment 13271 [details] slurm.conf from es-slurm
(In reply to Matt Ezell from comment #7)
> The "external" cluster is "gfdl". It is not in the federation. We don't do
> any hostname trickery for that, just manually set the IP address in the
> database.

Any plans (in the near future) to upgrade the supported external clusters to 20.02?
(In reply to Matt Ezell from comment #7)
> [2020-02-26T12:28:50.859] error: _pick_step_nodes: JobId=268605038
> StepId=Batch has no step_node_bitmap

This looks like bug#7499 comment#73 and should be fully fixed in 20.02.
(In reply to Nate Rini from comment #11)
> Any plans (in near future) to upgrade to 20.02 for the supported external
> clusters?

It is on our to-do list, but it's behind upgrading Cray CLE, so we are probably several months out at the earliest.
(In reply to Nate Rini from comment #12)
> This look like bug#7499 comment#73 and should be fully fixed in 20.02.

Is this what caused all the running jobs to abort, or is it just a related problem? That bug is private and I cannot see the details. Is the fix backportable to 19.05, or does it depend on more fundamental changes?
(In reply to Matt Ezell from comment #15)
> Is this what caused all the running jobs to abort, or is this just a related
> problem?

This appears to be a symptom but not the cause of this issue. Is slurmctld SEGFAULTing?

> That bug is private and I cannot see the details.

NERSC agreed to open most of the ticket to the public. It should be open now.

> Is the fix backportable to 19.05, or is it based in more fundamental changes?

The patches change the RPC layer, so backporting would require all binaries to be updated at once to avoid RPC corruption. I don't advise it. Please see bug #7499 comment#82 for the patch list.
(In reply to Nate Rini from comment #16)
> This appears to be a symptom but not the cause of this issue. Is slurmctld
> SEGFAULTing?

slurmctld is SIGABRT'ing as described in bug #8584. When the secondary takes over, we are seeing the behavior described in this bug. I've been treating them as separate issues, but they could be related by more than chronology.
(In reply to Matt Ezell from comment #17)
> slurmctld is SIGABRT'ing as described in bug #8584. When the secondary
> takes over, we are seeing the behavior described in this bug. I've been
> treating them as separate issues, but they could be related by more than
> chronology.

Your backtrace looks different from the one in bug#7499 (kill_step_on_node() vs _attempt_backfill()), but I'll defer to Dominic since he already owns bug#8584. I will continue to pursue these independently.
Matt,

Still working on recreating your issue. It looks like shutting down (or SIGKILL-ing) slurmctld is not sufficient to trigger it.

--Nate
Matt,

Is it possible to get this output?
> sacct -j 268605038 -p -a

Thanks,
--Nate
(In reply to Matt Ezell from comment #0)
> # scontrol --local show node=nid00035 | grep Node
> NodeName=nid00035 Arch=x86_64 CoresPerSocket=18
>    NodeAddr=c4-nid00035 NodeHostName=c4-nid00035 Version=19.05.5
>
> Based on my reading of the docs for NodeName, NodeHostname, and NodeAddr, I
> would have expected this to use the c4-nid##### hostname instead of the
> nid##### hostname. That didn't seem to be the case... can you recommend how
> we should set this up?

It is providing the NodeName as configured for the given cluster.

The scontrol command is talking directly to the primary slurmctld on each cluster:
> [root@mgmtc2s2 ~]# scontrol -M cluster1 show node=node01|grep Node
>    NodeName=node01 Arch=x86_64 CoresPerSocket=6
>    NodeAddr=c1node01 NodeHostName=c1node01 Version=19.05.5
> [root@mgmtc2s2 ~]# scontrol -M cluster2 show node=node01|grep Node
>    NodeName=node01 Arch=x86_64 CoresPerSocket=6
>    NodeAddr=c2node01 NodeHostName=c2node01 Version=19.05.5
> [root@mgmtc2s2 ~]# scontrol -M cluster3 show node=node01|grep Node
>    NodeName=node01 Arch=x86_64 CoresPerSocket=6
>    NodeAddr=c3node01 NodeHostName=c3node01 Version=19.05.5

I admit this is quite confusing, but having Slurm return a more universal node name crosses into RFE territory.
(In reply to Matt Ezell from comment #7)
> [2020-02-26T12:29:18.127] error: Error connecting, bad data: family = 0,
> port = 0
> [2020-02-26T12:29:18.129] error: Unable to resolve "nid00047": Unknown host

Is it possible to get a few pages' worth of logs following these errors?
(In reply to Nate Rini from comment #21)
> It is providing the nodename as configured for the given cluster.
>
> The scontrol command is talking directly to the primary slurmctld on each
> cluster:
> > [root@mgmtc2s2 ~]# scontrol -M cluster1 show node=node01|grep Node
> >    NodeName=node01 Arch=x86_64 CoresPerSocket=6
> >    NodeAddr=c1node01 NodeHostName=c1node01 Version=19.05.5
> > [root@mgmtc2s2 ~]# scontrol -M cluster2 show node=node01|grep Node
> >    NodeName=node01 Arch=x86_64 CoresPerSocket=6
> >    NodeAddr=c2node01 NodeHostName=c2node01 Version=19.05.5
> > [root@mgmtc2s2 ~]# scontrol -M cluster3 show node=node01|grep Node
> >    NodeName=node01 Arch=x86_64 CoresPerSocket=6
> >    NodeAddr=c3node01 NodeHostName=c3node01 Version=19.05.5
>
> I admit this is quite confusing but it crosses into RFE territory to have
> Slurm return a more universal node name.

Sorry if I was unclear here. The 'scontrol' output is as I would expect. It's that the controller tried to reach out to the NodeName hostname (as indicated in the logs) instead of the NodeHostname or NodeAddr hostname that seems wrong.
(In reply to Matt Ezell from comment #23)
> It's that the controller tried to reach out to the NodeName hostname (as
> indicated in the logs) instead of the NodeHostname or NodeAddr hostname that
> seems wrong.

Agreed.

Getting more logs per comment #22 would be helpful. So far I have not been able to recreate this issue locally. I suspect there is a fun race condition here that your site managed to hit.
Created attachment 13335 [details] slurmctld log from c4-slurm-backup
(In reply to Nate Rini from comment #24)
> Agreed.
>
> Getting more logs per comment #22 would be helpful. So far I have not been
> able to recreate this issue locally. I suspect there is a fun race condition
> here that your site managed to hit.

I think attachment 13335 [details] should have 2 full occurrences from the backup controller.
Can you please confirm you're not running in front end mode?
(In reply to Nate Rini from comment #28) > Can you please confirm your not running in front end mode? We are not in frontend mode.
(In reply to Matt Ezell from comment #26)
> I think attachment 13335 [details] should have 2 full occurrences from the
> backup controller.

One of the most confusing parts of reading the logs is that output from all the threads is interleaved. I suggest setting this in your slurm.conf for future issues:
> LogTimeFormat=thread_id
Matt,

Can you please provide the topology.conf for each cluster?

Thanks,
--Nate
(In reply to Nate Rini from comment #31)
> Can you please provide the topology.conf for each cluster?

Interesting angle. Those use the NodeName:

# c3
SwitchName=s0 Nodes=nid0[0004-0067,0072-0127,0132-0383]
SwitchName=s1 Nodes=nid0[0388-0451,0456-0511,0516-0767]
SwitchName=s2 Nodes=nid0[0772-0835,0840-1151]
SwitchName=s3 Nodes=nid0[1156-1219,1224-1535]
SwitchName=root Switches=s[0-3]

# c4
SwitchName=s0 Nodes=nid0[0000-0015,0020-0075,0080-0143,0148-0383]
SwitchName=s1 Nodes=nid0[0384-0459,0464-0527,0532-0767]
SwitchName=s2 Nodes=nid0[0768-0843,0848-0911,0916-1151]
SwitchName=s3 Nodes=nid0[1152-1295,1300-1535]
SwitchName=s4 Nodes=nid0[1536-1919]
SwitchName=s5 Nodes=nid0[1920-2303]
SwitchName=s6 Nodes=nid0[2304-2687]
SwitchName=root Switches=s[0-6]

# es
# no topology.conf file exists
Matt,

How is DNS handled on the controllers? I was not able to trigger the issue even after setting up topology.

Thanks,
--Nate
(In reply to Nate Rini from comment #34)
> How is DNS handled on the controllers?

The compute nodes are in /etc/hosts and not in DNS. On the primary controller, both hostnames are in /etc/hosts, like:

# grep nid00008 /etc/hosts
172.25.32.9     nid00008 c4-nid00008 c0-0c0s2n0

Whereas on the backup controllers, only the c4-nid##### hostnames resolve:

# grep nid00008 /etc/hosts
172.25.32.9     c4-nid00008

If a host is not found in /etc/hosts, the lookup goes out to our center-wide DNS servers per nsswitch:

# grep hosts /etc/nsswitch.conf
# hosts defined in: Class[Dns::Client]
hosts: files dns myhostname

As a workaround, I suppose I could add the "bare" nid hostnames to the backup controllers.
(In reply to Matt Ezell from comment #35)
> Whereas on the backup controllers, only the c4-nid##### hostnames resolve:
>
> As a workaround, I suppose I could add the "bare" nid hostnames to the
> backup controllers.

Okay, now this makes sense. I assumed /etc/hosts (and/or DNS) would be the same within any given cluster. Yes, you need the host names to be resolvable (via getent) on the backup controllers, the same as on the primary controller. The backup controllers are just normal primary controllers when they are active and have the same requirements.

Please tell me if you have any more questions.

Thanks,
--Nate
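That requirement — every NodeHostname resolving on the backup before it can take over — is easy to audit ahead of a failover. Here's a sketch that assumes only the simple single-bracket hostlist ranges used in this ticket (Slurm's real hostlist syntax is richer, and `scontrol show hostnames` is the authoritative expander):

```python
import re
import socket

def expand_hostlist(expr):
    """Expand a simple Slurm hostlist like 'c4-nid0[0000-0002,0005]'.
    Handles a single bracketed range group only (enough for the configs
    in this ticket); zero-padding of the low bound is preserved."""
    m = re.fullmatch(r"([^\[]+)\[([^\]]+)\](.*)", expr)
    if not m:
        return [expr]  # plain hostname, nothing to expand
    prefix, ranges, suffix = m.groups()
    hosts = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo  # single value like "0005"
        for i in range(int(lo), int(hi) + 1):
            hosts.append(f"{prefix}{str(i).zfill(len(lo))}{suffix}")
    return hosts

def unresolvable(hostlist_expr):
    """Names this machine could NOT resolve -- run on the backup
    controller against the NodeHostname list before trusting failover."""
    bad = []
    for host in expand_hostlist(hostlist_expr):
        try:
            socket.getaddrinfo(host, None)
        except socket.gaierror:
            bad.append(host)
    return bad

print(expand_hostlist("c4-nid0[0000-0002,0005]"))
# ['c4-nid00000', 'c4-nid00001', 'c4-nid00002', 'c4-nid00005']
```

Running `unresolvable(...)` on c4-slurm-backup against the NodeHostname expression from slurm.conf would have flagged nothing (those names are in its /etc/hosts), while the same call against the NodeName expression would have returned every node — matching the failure mode in the logs.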
(In reply to Nate Rini from comment #36)
> Okay, now this makes sense. I assumed /etc/hosts (and/or DNS) would be same
> within any given full cluster. Yes, you need to have the host names
> resolvable (via getent) on the backup controllers same as the primary
> controller. The backup controllers are just normal primary controllers when
> they are active and have the same requirements.

Why is it using NodeName? Isn't the point of NodeHostname that everything would use that instead?
(In reply to Matt Ezell from comment #35)
> Whereas on the backup controllers, only the c4-nid##### hostnames resolve:

Are all of the long host names from every cluster included here?

> If a host is not found in /etc/hosts, it goes out to our center-wide DNS
> servers per the nsswitch:

Is there any chance that the center-wide DNS was failing at the time?

(In reply to Matt Ezell from comment #37)
> (In reply to Nate Rini from comment #36)
> > Okay, now this makes sense. I assumed /etc/hosts (and/or DNS) would be same
> > within any given full cluster. Yes, you need to have the host names

Should have said NodeHostname.

> > resolvable (via getent) on the backup controllers same as the primary
> > controller.
>
> Why is it using NodeName?

It should not be using NodeName for any of the networking, but the CLI commands will for user/admin interaction (which can get really confusing). Slurm internally uses NodeName everywhere until the networking code, where it is converted to NodeHostname.

> Isn't the point of NodeHostname that everything would use that instead?

For networking only. On my test fed cluster, I don't have any DNS provider for the NodeName, only the NodeHostname, and everything works as expected (including failover with down nodes).

Did this issue only happen when bug#8584 caused a failover? Is it possible to try to trigger it again with some extra logging, since the current logs don't provide enough detail to find where it is failing?
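The NodeName/NodeHostname split described above can be sketched in a few lines. This is a hypothetical model of the lookup (the table is an illustrative fragment in the shape of the c4 config, not Slurm's actual data structures):

```python
# Sketch of the split: scheduling state is keyed by NodeName, and only
# the communication layer swaps in NodeAddr/NodeHostname before calling
# the resolver.  The table entries below are illustrative.

NODE_TABLE = {
    # NodeName -> comms names, as from slurm.conf NodeName=/NodeHostname= pairs
    "nid00035": {"node_hostname": "c4-nid00035", "node_addr": "c4-nid00035"},
}

def comms_target(node_name):
    """Name the controller should hand to the resolver: NodeAddr if set,
    else NodeHostname, else (last resort) the bare NodeName.  The last
    case is exactly what a backup controller without the bare aliases
    cannot resolve -- the "Unknown host" errors in comment #0."""
    rec = NODE_TABLE.get(node_name, {})
    return rec.get("node_addr") or rec.get("node_hostname") or node_name

assert comms_target("nid00035") == "c4-nid00035"  # normal path
assert comms_target("nid01982") == "nid01982"     # fallback: unresolvable on the backup
```

Under this model, the logged `Unable to resolve "nid00035"` implies the bare NodeName somehow reached the resolver, i.e. the conversion step was skipped or its mapping was missing at the time.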
(In reply to Nate Rini from comment #38)
> (In reply to Matt Ezell from comment #35)
> > Whereas on the backup controllers, only the c4-nid##### hostnames resolve:
> Are all of the long host names from every cluster included here?

No, only the local cluster.

> > If a host is not found in /etc/hosts, it goes out to our center-wide DNS
> > servers per the nsswitch:
> Is there any chance that the center-wide DNS was failing at the time?

None of the compute nodes are in center-wide DNS.

> Did issue only happened with bug#8584 causing a failover? Is it possible to
> try to trigger it again with some extra logging to find where it is failing
> since the logs don't provide enough details currently.

I can try to reproduce this on t4 (the test Cray machine), although the environment is slightly different (SLE15 instead of SLES12). I'll try SIGKILL'ing the primary controller and get back to you after I've had a chance to try it.
(In reply to Matt Ezell from comment #39)
> I can try to reproduce this on t4 (the test Cray machine) although the
> environment is slightly different (SLE15 instead of SLES12). I'll try
> SIGKILL'ing the primary controller and get back to you after I've had a
> chance to try it.

Please set SlurmctldDebug=debug4 and debugflags=agent,network.
(In reply to Nate Rini from comment #40)
> Please set SlurmctldDebug=debug4 and debugflags=agent,network.

# slurmctld
slurmctld: error: Invalid DebugFlag: network
slurmctld: error: DebugFlags invalid: agent,network
slurmctld: fatal: Unable to process configuration file
Created attachment 13390 [details] slurmctld log from t4-slurm-backup
(In reply to Matt Ezell from comment #41)
> # slurmctld
> slurmctld: error: Invalid DebugFlag: network
> slurmctld: error: DebugFlags invalid: agent,network
> slurmctld: fatal: Unable to process configuration file

The network flag was added in 20.02; please just use agent for now:
> debugflags=agent

Please also set:
> LogTimeFormat=thread_id

Thanks,
--Nate
(In reply to Nate Rini from comment #40)
> > I can try to reproduce this on t4 (the test Cray machine) although the
> > environment is slightly different (SLE15 instead of SLES12). I'll try

(In reply to Matt Ezell from comment #42)
> Created attachment 13390 [details]
> slurmctld log from t4-slurm-backup

Were any nodes set to down while logging?
(In reply to Nate Rini from comment #44)
> Were any nodes set to down while logging?

Sorry, it looks like the log I uploaded did not have all the relevant messages in it. I'll re-upload momentarily.

[2020-03-16T23:52:49.588] requeue job JobId=134262404 due to failure of node nid00008
[2020-03-16T23:52:49.588] debug3: select/cons_res: _rm_job_from_res: JobId=134262404 action 0
[2020-03-16T23:52:49.588] debug3: select/cons_res: removed JobId=134262404 from part batch row 0
[2020-03-16T23:52:49.588] debug3: make_node_comp: Node nid00008 being left DOWN
[2020-03-16T23:52:49.588] agent_trigger: pending_wait_time=65534->999 mail_too=F->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=1
[2020-03-16T23:52:49.588] debug: Spawning registration agent for nid[00024-00027] 4 hosts
[2020-03-16T23:52:49.588] agent_trigger: pending_wait_time=999->999 mail_too=F->F Agent_cnt=0 agent_thread_cnt=0 retry_list_size=2
[2020-03-16T23:52:49.588] error: Nodes nid[00008-00023] not responding, setting DOWN
Created attachment 13391 [details] slurmctld log from t4-slurm-backup
(In reply to Matt Ezell from comment #46)
> Created attachment 13391 [details]
> slurmctld log from t4-slurm-backup

There were no resolution errors:
> $ grep -c 'Unable to resolve' slurmctld.log.t4agentdebug
> 0

Were all the hosts added per comment #35 before the test?
(In reply to Nate Rini from comment #47)
> There were no resolution errors:
> > $ grep -c 'Unable to resolve' slurmctld.log.t4agentdebug
> > 0
>
> Were all the hosts added per comment #35 before the test?

No, not on t4:

# grep nid00008 /etc/hosts
172.25.56.9     t4-nid00008
# ping nid00008
ping: nid00008: Name or service not known

So maybe this didn't reproduce in the same way?
(In reply to Matt Ezell from comment #48)
> So maybe this didn't reproduce in the same way?

I would have expected errors like this one for DNS issues:
> error: Unable to resolve "nid00035": Unknown host

This issue looks like slurmds not responding in a timely manner:
> debug2: Error connecting slurm stream socket at 172.25.56.9:6818: Connection timed out

Was slurmd restarted at the same time? The kernel refused the first connection attempts, which suggests slurmd wasn't even listening:
> debug2: Error connecting slurm stream socket at 192.188.179.91:6817: Connection refused
(In reply to Nate Rini from comment #49)
> Was slurmd restarted at the same time? Kernel refused the first connection

No, but we do have some weirdness due to the way we are doing our routing:

# ping -c1 t4-nid00008
PING t4-nid00008 (172.25.56.9) 56(84) bytes of data.
From ncrc-rtr1-v405.ncrc.gov (192.188.179.66) icmp_seq=1 Redirect Host(New nexthop: t4-batch1.ncrc.gov (192.188.179.93))

--- t4-nid00008 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ping -c2 t4-nid00008
PING t4-nid00008 (172.25.56.9) 56(84) bytes of data.
From ncrc-rtr1-v405.ncrc.gov (192.188.179.66) icmp_seq=1 Redirect Host(New nexthop: t4-batch1.ncrc.gov (192.188.179.93))
From ncrc-rtr1-v405.ncrc.gov (192.188.179.66): icmp_seq=1 Redirect Host(New nexthop: t4-batch1.ncrc.gov (192.188.179.93))
64 bytes from t4-nid00008 (172.25.56.9): icmp_seq=1 ttl=63 time=0.885 ms

--- t4-nid00008 ping statistics ---
1 packets transmitted, 1 received, +1 errors, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.885/0.885/0.885/0.000 ms

Maybe I should set up manual routes in the routing table instead of expecting the default gateway to take care of it.
(In reply to Matt Ezell from comment #50)
> Maybe I should setup manual routes in the routing table instead of expecting
> the default gateway to take care of it.

I was hoping to see a replication of the issue in comment #0. So far, none of my internal tests have shown the issue, so I must be missing something. Since your hosts are in /etc/hosts, the routing *shouldn't* matter, but I've been surprised before by libnss.
Matt,

Have you seen any errors like these?
> slurmctld: error: pack_msg: Invalid message version=0, type:4502
> slurmctld: error: _queue_rpc: failed to pack msg_type:4502

Your slurm.conf didn't have any lines with MsgAggregationParams, but I wanted to make sure you didn't somehow have it set up:
> scontrol show config | grep -i -e window -e agg

In bug#8697, we found that some cloud nodes might try to resolve early on startup when message aggregation is active.

I also set up my test cluster to only have the local long-form node names in DNS (at all), and I still don't see the resolution errors.

--Nate
(In reply to Nate Rini from comment #52)
> Have you seen any errors like these?
> > slurmctld: error: pack_msg: Invalid message version=0, type:4502
> > slurmctld: error: _queue_rpc: failed to pack msg_type:4502

No, not on the primary or backup.

> Your slurm.conf didn't have any lines with MsgAggregationParams but I wanted
> to make sure you didn't somehow have it setup:
> > scontrol show config | grep -i -e window -e agg

Nope:

# scontrol show config | grep -i -e window -e agg
MsgAggregationParams    = (null)

> I also setup my test cluster to only have the local long form nodes names in
> DNS (at all) and I still don't see the resolution errors.

Thanks for all your hard work trying to reproduce this - it seems that we don't have a reliable reproducer other than the referenced bug (which I can't reliably reproduce).
(In reply to Matt Ezell from comment #53)
> Thanks for all your hard work trying to reproduce this - it seems that we
> don't have a reliable reproducer other than the referenced bug (which I
> can't reliably reproduce).

At this point, the only way I can reproduce the issue is to remove the entries for the given nodes (NodeHostname) from /etc/hosts, and that doesn't look like the issue observed in comment #0.

With that, I'll close this ticket. Please reply to have it reopened and we can continue from there. If you see this happening again, please call gcore to grab a core from the slurmctld process; that should give us a starting point.