Summary: | Configless Slurm fails due to failing SRV record lookup on EL8 (CentOS 8) | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
Component: | slurmd | Assignee: | Tim McMullan <mcmullan> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | tru |
Version: | 20.11.7 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | DTU Physics | ||
Version Fixed: | 21.08.0 | Target Release: | --- |
Description
Ole.H.Nielsen@fysik.dtu.dk
2021-06-22 07:00:29 MDT
Hi Ole,

Thanks for the report! I've not yet heard of this issue, but in my own test environment I wasn't able to reproduce it quickly. I'm currently testing with CentOS Stream 8, but host/dig can both get the appropriate response without the FQDN attached. My resolv.conf is essentially the same as yours and is also being generated by NetworkManager (I'm using DHCP reservations for all my hosts). I updated/rebooted the test node before the test, so it's very current.

Just as a sanity check and to help me in trying to replicate the issue:

Are the /etc/resolv.conf files the same on the EL7 and EL8 nodes?
Does /etc/hostname have the FQDN or just the hostname on the EL7 and EL8 nodes?
Are the SRV records defined on both DNS servers?
Can you share how you defined the SRV records on the DNS servers?

Hopefully with this I'll be able to match my setup to yours, reproduce, and find what changed!

Thanks!
--Tim

Hi Tim,

Thanks for the quick response. Answers are below:

(In reply to Tim McMullan from comment #1)
> Thanks for the report! I've not yet heard of this issue, but in my own test
> environment I wasn't able to reproduce it quickly. I'm currently testing
> with CentOS Stream 8, but host/dig can both get the appropriate response
> without the FQDN attached.

In my cluster network, the SRV record doesn't resolve on the EL8 node unless I add the FQDN:

[root@h002 ~]# dig +short -t SRV -n _slurmctld._tcp
[root@h002 ~]# dig +short -t SRV -n _slurmctld._tcp.nifl.fysik.dtu.dk.
0 0 6817 que.nifl.fysik.dtu.dk.
[root@h002 ~]# host -t SRV _slurmctld._tcp
Host _slurmctld._tcp not found: 3(NXDOMAIN)
[root@h002 ~]# host -t SRV _slurmctld._tcp.nifl.fysik.dtu.dk.
_slurmctld._tcp.nifl.fysik.dtu.dk has SRV record 0 0 6817 que.nifl.fysik.dtu.dk.

On an EL7 node the "host" command works without the FQDN, but "dig" doesn't:

[root@s004 ~]# dig +short -t SRV -n _slurmctld._tcp
[root@s004 ~]# host -t SRV _slurmctld._tcp
_slurmctld._tcp.nifl.fysik.dtu.dk has SRV record 0 0 6817 que.nifl.fysik.dtu.dk.

Maybe I'm barking up the wrong tree here: it may just be that the "host" command from the bind-utils RPM changed behavior between EL7 (bind-utils-9.11.4) and EL8 (bind-utils-9.11.26), but I couldn't find anything about it in the changelog. I wonder why your CentOS Stream 8 system behaves differently from mine. We have a PC running CentOS Stream 8 where the FQDN is also required:

[root@tesla ~]# dig +short -t SRV -n _slurmctld._tcp
[root@tesla ~]# dig +short -t SRV -n _slurmctld._tcp.fysik.dtu.dk.
0 0 6817 que.fysik.dtu.dk.
[root@tesla ~]# dig +short -t SRV -n _slurmctld._tcp.nifl.fysik.dtu.dk.
0 0 6817 que.nifl.fysik.dtu.dk.

The network for this PC has a campus-wide Infoblox DNS server box, which is completely different from my cluster's CentOS 7.9 BIND DNS server.

> My resolv.conf is essentially the same as yours and is also being generated
> by NetworkManager (I'm using DHCP reservations for all my hosts), I
> updated/rebooted the test node before the test so its very current.
>
> Just as a sanity check and to help me in trying to replicate the issue:
> Are the /etc/resolv.conf files the same on the el7 and el8 nodes?

Yes, verified on EL7 and EL8. The same DHCP server services all cluster nodes.

> Does /etc/hostname have the FQDN or just the hostname on el7 and the el8
> nodes?

The full FQDN:

[root@h002 ~]# cat /etc/hostname
h002.nifl.fysik.dtu.dk

> Are the srv records defined on both DNS servers?

Yes, they are both DNS slave servers using the same authoritative DNS server.

> Can you share how you defined the SRV records on the DNS servers?

In the zone file for nifl.fysik.dtu.dk I have:

_slurmctld._tcp 3600 IN SRV 0 0 6817 que
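For reference, one mechanism that produces exactly this kind of host/dig split is the stub resolver's search list: host expands a bare name using the search/domain entries in /etc/resolv.conf (which matches the EL7 output above, where the answer comes back for the fully expanded name), while dig only consults the search list when the +search flag is given. This does not explain why host behaves differently on EL8, but a quick check along these lines, assuming the nodes' resolv.conf carries a search entry for nifl.fysik.dtu.dk, may help narrow it down:

    # What search domains and options does the stub resolver have?
    grep -E '^(search|domain|options)' /etc/resolv.conf

    # dig ignores the search list unless +search is given
    dig +short -t SRV _slurmctld._tcp
    dig +short +search -t SRV _slurmctld._tcp

    # host applies the search list; -v shows the query names it tries
    host -v -t SRV _slurmctld._tcp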
I have made some new observations today. Having found in Comment 2 that the FQDN is required for lookup of the SRV record (at least in our network, for reasons that I don't understand yet), maybe this is not a DNS issue after all?

When my EL8 node "h002" boots, I get slurmd error messages in the syslog related to DNS lookup failures:

Jun 23 13:50:36 h002 slurmd[1281]: error: resolve_ctls_from_dns_srv: res_nsearch error: Host name lookup failure
Jun 23 13:50:36 h002 slurmd[1281]: error: fetch_config: DNS SRV lookup failed
Jun 23 13:50:36 h002 slurmd[1281]: error: _establish_configuration: failed to load configs
Jun 23 13:50:36 h002 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Jun 23 13:50:36 h002 slurmd[1281]: error: slurmd initialization failed
Jun 23 13:50:36 h002 systemd[1]: slurmd.service: Failed with result 'exit-code'.

Consequently, the slurmd service has failed:

[root@h002 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─core_limit.conf
   Active: failed (Result: exit-code) since Wed 2021-06-23 13:50:36 CEST; 4min 24s ago
  Process: 1281 ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 1281 (code=exited, status=1/FAILURE)

Jun 23 13:50:36 h002.nifl.fysik.dtu.dk systemd[1]: Started Slurm node daemon.
Jun 23 13:50:36 h002.nifl.fysik.dtu.dk systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Jun 23 13:50:36 h002.nifl.fysik.dtu.dk systemd[1]: slurmd.service: Failed with result 'exit-code'.

Strangely, when I now restart slurmd it works just fine:

[root@h002 ~]# systemctl restart slurmd
[root@h002 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─core_limit.conf
   Active: active (running) since Wed 2021-06-23 13:55:05 CEST; 1s ago
 Main PID: 2151 (slurmd)
    Tasks: 2
   Memory: 18.4M
   CGroup: /system.slice/slurmd.service
           └─2151 /usr/sbin/slurmd -D

Jun 23 13:55:05 h002.nifl.fysik.dtu.dk systemd[1]: Started Slurm node daemon.

I've confirmed that this behavior repeats every time I reboot the node h002: slurmd fails when started by systemd during boot, but a few minutes later slurmd starts correctly from systemd. I think this rules out any temporary issue with our DNS servers, also because all other nodes in the cluster work just fine in configless mode. Perhaps the base issue is related to the slurmd error:

fetch_config: DNS SRV lookup failed

Could it be that slurmd is being started too early in the boot process, at a point where some network services have not yet fully come up? In the syslog I see that NetworkManager is started with the exact same timestamp as the slurmd service:

Jun 23 13:50:36 h002 systemd[1]: Starting Network Manager...
Jun 23 13:50:36 h002 NetworkManager[1224]: <info>  [1624449036.3693] NetworkManager (version 1.30.0-7.el8) is starting... (for the first time)
Jun 23 13:50:36 h002 NetworkManager[1224]: <info>  [1624449036.3697] Read config: /etc/NetworkManager/NetworkManager.conf
Jun 23 13:50:36 h002 systemd[1]: Started Network Manager.
Jun 23 13:50:36 h002 NetworkManager[1224]: <info>  [1624449036.3722] bus-manager: acquired D-Bus service "org.freedesktop.NetworkManager"
Jun 23 13:50:36 h002 systemd[1]: Starting Network Manager Wait Online...
Jun 23 13:50:36 h002 systemd[1]: Reached target Network.

I tried delaying the startup of slurmd for some seconds by adding an ExecStartPre line to /usr/lib/systemd/system/slurmd.service:

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
# Testing a delay:
ExecStartPre=/bin/sleep 30

After rebooting again, the slurmd service has now started correctly during the boot process:

[root@h002 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─core_limit.conf
   Active: active (running) since Wed 2021-06-23 14:12:12 CEST; 3min 11s ago
  Process: 1273 ExecStartPre=/bin/sleep 30 (code=exited, status=0/SUCCESS)
 Main PID: 2055 (slurmd)
    Tasks: 2
   Memory: 19.5M
   CGroup: /system.slice/slurmd.service
           └─2055 /usr/sbin/slurmd -D

Jun 23 14:11:42 h002.nifl.fysik.dtu.dk systemd[1]: Starting Slurm node daemon...
Jun 23 14:12:12 h002.nifl.fysik.dtu.dk systemd[1]: Started Slurm node daemon.

With this experiment it seems to me that the issue may be a race condition between the start of the slurmd and NetworkManager services. What are your thoughts on this?

Thanks,
Ole
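For reference, two standard systemd tools can show this boot-time ordering directly (nothing here is Slurm-specific; the unit names are the ones appearing in the logs above):

    # Show the chain of units slurmd waited on during the last boot,
    # annotated with when each became active
    systemd-analyze critical-chain slurmd.service

    # Interleave the slurmd and NetworkManager journal entries from this boot
    # to see which one came up first
    journalctl -b -u slurmd.service -u NetworkManager.service -o short-precise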
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #3)
> With this experiment it seems to me that the issue may be a race condition
> between the start of the slurmd and NetworkManager services.
>
> What are your thoughts on this?

Thanks for the additional observations! I think that could explain a lot of what we are seeing here. Would you be able to add "Wants=network-online.target" to the slurmd unit file on an EL8 node and see if that makes any difference? This may delay the slurmd a little bit more and let the network come up all the way without an added sleep.
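A sketch of how that could be tried without editing the packaged unit file, using a systemd drop-in (the drop-in file name below is just illustrative; it is also worth checking that NetworkManager-wait-online.service is enabled, since network-online.target only waits for the network when such a service backs it). Create /etc/systemd/system/slurmd.service.d/network-online.conf containing:

    [Unit]
    Wants=network-online.target
    After=network-online.target

then reload systemd and make sure the wait-online service will run at boot:

    # With NetworkManager, network-online.target is implemented by
    # NetworkManager-wait-online.service
    systemctl enable NetworkManager-wait-online.service
    systemctl daemon-reload
    systemctl restart slurmd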
Following up on my idea in Comment 3 that slurmd starts before the network is fully started, I've found that the slurmd.service file may be depending incorrectly on network.target:

[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target

As discussed in the RHEL 8 documentation:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_networking/systemd-network-targets-and-services_configuring-and-managing-networking#differences-between-the-network-and-network-online-systemd-target_systemd-network-targets-and-services
we may in fact require network-online.target in the slurmd.service file so that Configless Slurm works correctly:

[Unit]
Description=Slurm node daemon
After=munge.service network-online.target remote-fs.target

I've removed the ExecStartPre delay from Comment 3 and rebooted the system. Now slurmd starts correctly at boot time:

[root@h002 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─core_limit.conf
   Active: active (running) since Wed 2021-06-23 14:28:25 CEST; 18s ago
 Main PID: 1657 (slurmd)
    Tasks: 2
   Memory: 19.4M
   CGroup: /system.slice/slurmd.service
           └─1657 /usr/sbin/slurmd -D

Jun 23 14:28:25 h002.nifl.fysik.dtu.dk systemd[1]: Started Slurm node daemon.

Therefore I would like to propose the following patch introducing the systemd network-online.target:

[root@h002 ~]# diff -c /usr/lib/systemd/system/slurmd.service /usr/lib/systemd/system/slurmd.service.orig
*** /usr/lib/systemd/system/slurmd.service      2021-06-23 14:37:15.890078007 +0200
--- /usr/lib/systemd/system/slurmd.service.orig 2021-05-28 10:42:26.000000000 +0200
***************
*** 1,6 ****
  [Unit]
  Description=Slurm node daemon
! After=munge.service network-online.target remote-fs.target
  #ConditionPathExists=/etc/slurm/slurm.conf

  [Service]
--- 1,6 ----
  [Unit]
  Description=Slurm node daemon
! After=munge.service network.target remote-fs.target
  #ConditionPathExists=/etc/slurm/slurm.conf

  [Service]

Does this seem to be a correct conclusion?

Thanks,
Ole

Looks like we came to basically the same conclusion here :)

I'll work on getting this included!

(In reply to Tim McMullan from comment #6)
> Looks like we came to basically the same conclusion here :)
>
> I'll work on getting this included!

Thanks! I'm confused about whether we want to use Wants= or After= or possibly both in the service file. This is defined in https://www.freedesktop.org/software/systemd/man/systemd.unit.html but I don't fully understand the distinction. There is a comment: "It is a common pattern to include a unit name in both the After= and Wants= options, in which case the unit listed will be started before the unit that is configured with these options."

One issue remains, though: why do my EL8 systems require the FQDN in DNS lookups, whereas yours don't?

Thanks,
Ole

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #7)
> Thanks! I'm confused whether we want to use wants= or after= or possibly
> both in the service file. This is defined in
> https://www.freedesktop.org/software/systemd/man/systemd.unit.html but I
> don't fully understand the distinction. There is a comment "It is a common
> pattern to include a unit name in both the After= and Wants= options, in
> which case the unit listed will be started before the unit that is
> configured with these options."

I read through https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ and https://www.freedesktop.org/software/systemd/man/systemd.special.html and between them I think the right thing to do is have network-online.target in both After= and Wants=. This quote in particular makes me think that: "systemd automatically adds dependencies of type Wants= and After= for this target unit to all SysV init script service units with an LSB header referring to the "$network" facility."

> One issue remains, though: Why do my EL8 systems require the FQDN in DNS
> lookups, whereas yours don't?

I do find that strange. I'm going to tweak my environment some to match yours as best I can and see if it behaves the same way. My DNS is run on FreeBSD with unbound, so it won't be perfect... but it's worth a try at least.
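As a quick sanity check on the Wants=/After= question above, systemd can report the merged values it actually uses for the unit, including anything contributed by drop-in files:

    # Effective ordering and dependency lists for slurmd
    systemctl show -p After -p Wants slurmd.service

    # The unit file plus all drop-ins that modify it
    systemctl cat slurmd.service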
Hi Ole,

We've pushed changes to the unit files for Slurm for 21.08+!

I've tried (mostly) replicating your setup with the details you provided and haven't seen the same behavior yet... nor have I found a good explanation for the difference you mention between EL7 and EL8. Have you found anything?

Thanks!
--Tim

(In reply to Tim McMullan from comment #11)
> Hi Ole,
>
> We've push changes to the unit files for slurm for 21.08+!
>
> I've tried (mostly) replicating your setup with the details you provided and
> haven't seen the same behavior yet... nor have I found a good explanation
> for the difference you mention between EL7 and EL8. Have you found anything?

As you can see in Comment 3, the issue is due to a race condition between the network being up before or after slurmd is started. The winner of the race condition may depend on many things.

IMHO, the correct and safe solution is to start slurmd only after the network-online target! This is crucial in the case of Configless Slurm.

Thanks,
Ole

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #12)
> (In reply to Tim McMullan from comment #11)
> > Hi Ole,
> >
> > We've push changes to the unit files for slurm for 21.08+!
> >
> > I've tried (mostly) replicating your setup with the details you provided and
> > haven't seen the same behavior yet... nor have I found a good explanation
> > for the difference you mention between EL7 and EL8. Have you found anything?
>
> As you can see in Comment 3, the issue is due to a race condition between
> the network being up before or after slurmd is started. The winner of the
> race condition may depend on many things.
>
> IMHO, the correct and safe solution is to start slurmd only after the
> network-online target! This is crucial in the case of Configless Slurm.
>
> Thanks,
> Ole

Sorry Ole, the situation I was trying to replicate was the difference in behavior of dig and host!

https://github.com/SchedMD/slurm/commit/e1e7926
and
https://github.com/SchedMD/slurm/commit/e88f7ff

Fix the issue in 21.08 :)

Thanks!
-Tim

(In reply to Tim McMullan from comment #13)
> Sorry Ole, the situation I was trying to replicate was the difference in
> behavior of dig and host!

I've verified again that on all our EL7 servers dig/host both work correctly, and on all EL8 servers (CentOS 8.3, 8.4, Stream, AlmaLinux 8.4) they don't (the FQDN is required). Additionally, I have access to the Slurm cluster at another university, and on their EL7 nodes dig/host both work correctly. They have installed an AlmaLinux 8.4 node which shows the same behavior:

$ cat /etc/redhat-release
AlmaLinux release 8.4 (Electric Cheetah)
$ host -t SRV _slurmctld._tcp
Host _slurmctld._tcp not found: 3(NXDOMAIN)
$ host -t SRV _slurmctld._tcp.grendel.cscaa.dk
_slurmctld._tcp.grendel.cscaa.dk has SRV record 0 0 6817 in4.grendel.cscaa.dk.
$ dig +short -t SRV -n _slurmctld._tcp
$ dig +short -t SRV -n _slurmctld._tcp.grendel.cscaa.dk
0 0 6817 in4.grendel.cscaa.dk.

So I believe the DNS SRV record problem is not due to our particular network or DNS setup. Could you possibly ask some other Slurm sites and SchedMD colleagues to check the dig/host behavior on any available EL8 nodes?

Thanks,
Ole
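For sites doing that comparison, a minimal set of data points to collect on each EL7/EL8 node might be the following (this is just generic resolver tooling; _slurmctld._tcp is the SRV label used by configless Slurm):

    # bind-utils provides both host and dig
    rpm -q bind-utils

    # The stub resolver configuration actually in use
    cat /etc/resolv.conf

    # Verbose host output shows which names it tries via the search list
    host -v -t SRV _slurmctld._tcp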
Hey Ole,

I did some asking around internally, and the RHEL 8 systems we have aren't exhibiting the issue. Since this doesn't seem to be impacting Slurm functionality and I'm not finding the issue thus far, it might be better to see if Red Hat might know what's going on with the host/dig behavior differences?

Thanks,
--Tim

(In reply to Tim McMullan from comment #13)
> Sorry Ole, the situation I was trying to replicate was the difference in
> behavior of dig and host!
>
> https://github.com/SchedMD/slurm/commit/e1e7926
> and
> https://github.com/SchedMD/slurm/commit/e88f7ff
>
> Fix the issue in 21.08 :)

Today there is a thread "slumctld don't start at boot" on the slurm-users list where a site is running Rocky Linux 8.4 and slurmctld fails to start. Do you think you can push the change also to 20.11.9, because this issue may be hitting more broadly on EL 8.4 systems?

Thanks,
Ole

Hi Ole,

I did some chatting internally, and right now the feeling is to leave it the way it is for 20.11. The issue doesn't seem to be impacting many people right now, and we are reluctant to change the unit files just in case we introduce something unexpected... and it's very easy to handle if there is a problem, since it doesn't require any code changes.

I'm going to resolve this for now, since the issue is resolved in 21.08.

Thanks!
--Tim