Summary: | Configless Slurm fails due to failing SRV record lookup on EL8 (CentOS 8) | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
Component: | slurmd | Assignee: | Tim McMullan <mcmullan> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | tru |
Version: | 20.11.7 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | DTU Physics | ||
Version Fixed: | 21.08.0 | Target Release: | --- |
Description
Ole.H.Nielsen@fysik.dtu.dk
2021-06-22 07:00:29 MDT
Hi Ole,

Thanks for the report! I've not yet heard of this issue, but in my own test environment I wasn't able to reproduce it quickly. I'm currently testing with CentOS Stream 8, but host/dig can both get the appropriate response without the FQDN attached. My resolv.conf is essentially the same as yours and is also being generated by NetworkManager (I'm using DHCP reservations for all my hosts). I updated/rebooted the test node before the test, so it's very current.

Just as a sanity check and to help me in trying to replicate the issue:

Are the /etc/resolv.conf files the same on the EL7 and EL8 nodes?
Does /etc/hostname have the FQDN or just the hostname on the EL7 and EL8 nodes?
Are the SRV records defined on both DNS servers?
Can you share how you defined the SRV records on the DNS servers?

Hopefully with this I'll be able to match my setup to yours, reproduce, and find what changed!

Thanks!
--Tim

Hi Tim,

Thanks for the quick response. Answers are below:

(In reply to Tim McMullan from comment #1)
> Thanks for the report! I've not yet heard of this issue, but in my own test
> environment I wasn't able to reproduce it quickly. I'm currently testing
> with CentOS Stream 8, but host/dig can both get the appropriate response
> without the FQDN attached.

In my cluster network, the SRV record doesn't resolve on the EL8 node unless I add the FQDN:

[root@h002 ~]# dig +short -t SRV -n _slurmctld._tcp
[root@h002 ~]# dig +short -t SRV -n _slurmctld._tcp.nifl.fysik.dtu.dk.
0 0 6817 que.nifl.fysik.dtu.dk.
[root@h002 ~]# host -t SRV _slurmctld._tcp
Host _slurmctld._tcp not found: 3(NXDOMAIN)
[root@h002 ~]# host -t SRV _slurmctld._tcp.nifl.fysik.dtu.dk.
_slurmctld._tcp.nifl.fysik.dtu.dk has SRV record 0 0 6817 que.nifl.fysik.dtu.dk.

On an EL7 node the "host" command works without the FQDN, but "dig" doesn't:

[root@s004 ~]# dig +short -t SRV -n _slurmctld._tcp
[root@s004 ~]# host -t SRV _slurmctld._tcp
_slurmctld._tcp.nifl.fysik.dtu.dk has SRV record 0 0 6817 que.nifl.fysik.dtu.dk.

Maybe I'm barking up the wrong tree here: it may just be that the "host" command from the bind-utils RPM changed behavior between EL7 (bind-utils-9.11.4) and EL8 (bind-utils-9.11.26), but I couldn't find anything about it in the changelog. I wonder why your CentOS Stream 8 system behaves differently from mine. We have a PC running CentOS Stream 8 where the FQDN is also required:

[root@tesla ~]# dig +short -t SRV -n _slurmctld._tcp
[root@tesla ~]# dig +short -t SRV -n _slurmctld._tcp.fysik.dtu.dk.
0 0 6817 que.fysik.dtu.dk.
[root@tesla ~]# dig +short -t SRV -n _slurmctld._tcp.nifl.fysik.dtu.dk.
0 0 6817 que.nifl.fysik.dtu.dk.

The network for this PC has a campus-wide Infoblox DNS server box, which is completely different from my cluster's CentOS 7.9 BIND DNS server.

> My resolv.conf is essentially the same as yours and is also being generated
> by NetworkManager (I'm using DHCP reservations for all my hosts), I
> updated/rebooted the test node before the test so its very current.
>
> Just as a sanity check and to help me in trying to replicate the issue:
> Are the /etc/resolv.conf files the same on the el7 and el8 nodes?

Yes, verified on EL7 and EL8. The same DHCP server services all cluster nodes.

> Does /etc/hostname have the FQDN or just the hostname on el7 and the el8
> nodes?

The full FQDN:

[root@h002 ~]# cat /etc/hostname
h002.nifl.fysik.dtu.dk

> Are the srv records defined on both DNS servers?

Yes, they are both DNS slave servers using the same authoritative DNS server.

> Can you share how you defined the SRV records on the DNS servers?

In the zone file for nifl.fysik.dtu.dk I have:

_slurmctld._tcp 3600 IN SRV 0 0 6817 que
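For reference, one mechanism that produces exactly this kind of host/dig split is the stub resolver's search list: host expands a bare name using the search/domain entries in /etc/resolv.conf (which matches the EL7 output above, where the answer comes back for the fully expanded name), while dig only consults the search list when the +search flag is given. This does not explain why host behaves differently on EL8, but a quick check along these lines, assuming the nodes' resolv.conf carries a search entry for nifl.fysik.dtu.dk, may help narrow it down:

    # What search domains and options does the stub resolver have?
    grep -E '^(search|domain|options)' /etc/resolv.conf

    # dig ignores the search list unless +search is given
    dig +short -t SRV _slurmctld._tcp
    dig +short +search -t SRV _slurmctld._tcp

    # host applies the search list; -v shows the query names it tries
    host -v -t SRV _slurmctld._tcp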
I have made some new observations today. Having found in Comment 2 that the FQDN is required for lookup of the SRV record (at least in our network, for reasons that I don't understand yet), maybe this is not a DNS issue after all?

When my EL8 node "h002" boots, I get slurmd error messages in the syslog related to DNS lookup failures:

Jun 23 13:50:36 h002 slurmd[1281]: error: resolve_ctls_from_dns_srv: res_nsearch error: Host name lookup failure
Jun 23 13:50:36 h002 slurmd[1281]: error: fetch_config: DNS SRV lookup failed
Jun 23 13:50:36 h002 slurmd[1281]: error: _establish_configuration: failed to load configs
Jun 23 13:50:36 h002 systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Jun 23 13:50:36 h002 slurmd[1281]: error: slurmd initialization failed
Jun 23 13:50:36 h002 systemd[1]: slurmd.service: Failed with result 'exit-code'.

Consequently, the slurmd service has failed:

[root@h002 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─core_limit.conf
   Active: failed (Result: exit-code) since Wed 2021-06-23 13:50:36 CEST; 4min 24s ago
  Process: 1281 ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 1281 (code=exited, status=1/FAILURE)

Jun 23 13:50:36 h002.nifl.fysik.dtu.dk systemd[1]: Started Slurm node daemon.
Jun 23 13:50:36 h002.nifl.fysik.dtu.dk systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Jun 23 13:50:36 h002.nifl.fysik.dtu.dk systemd[1]: slurmd.service: Failed with result 'exit-code'.

Strangely, when I now restart slurmd it works just fine:

[root@h002 ~]# systemctl restart slurmd
[root@h002 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─core_limit.conf
   Active: active (running) since Wed 2021-06-23 13:55:05 CEST; 1s ago
 Main PID: 2151 (slurmd)
    Tasks: 2
   Memory: 18.4M
   CGroup: /system.slice/slurmd.service
           └─2151 /usr/sbin/slurmd -D

Jun 23 13:55:05 h002.nifl.fysik.dtu.dk systemd[1]: Started Slurm node daemon.

I've confirmed that this behavior repeats every time I reboot the node h002: slurmd fails when started by systemd during boot, but a few minutes later slurmd starts correctly from systemd. I think this rules out any temporary issue with our DNS servers, also because all other nodes in the cluster work just fine in configless mode. Perhaps the base issue is related to the slurmd error:

fetch_config: DNS SRV lookup failed

Could it be that slurmd is being started too early in the boot process, at a point where some network services have not yet fully come up? In the syslog I see that NetworkManager is started with the exact same timestamp as the slurmd service:

Jun 23 13:50:36 h002 systemd[1]: Starting Network Manager...
Jun 23 13:50:36 h002 NetworkManager[1224]: <info>  [1624449036.3693] NetworkManager (version 1.30.0-7.el8) is starting... (for the first time)
Jun 23 13:50:36 h002 NetworkManager[1224]: <info>  [1624449036.3697] Read config: /etc/NetworkManager/NetworkManager.conf
Jun 23 13:50:36 h002 systemd[1]: Started Network Manager.
Jun 23 13:50:36 h002 NetworkManager[1224]: <info>  [1624449036.3722] bus-manager: acquired D-Bus service "org.freedesktop.NetworkManager"
Jun 23 13:50:36 h002 systemd[1]: Starting Network Manager Wait Online...
Jun 23 13:50:36 h002 systemd[1]: Reached target Network.

I tried delaying the startup of slurmd for some seconds by adding an ExecStartPre line to /usr/lib/systemd/system/slurmd.service:

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
# Testing a delay:
ExecStartPre=/bin/sleep 30

After rebooting again, the slurmd service has now started correctly during the boot process:

[root@h002 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─core_limit.conf
   Active: active (running) since Wed 2021-06-23 14:12:12 CEST; 3min 11s ago
  Process: 1273 ExecStartPre=/bin/sleep 30 (code=exited, status=0/SUCCESS)
 Main PID: 2055 (slurmd)
    Tasks: 2
   Memory: 19.5M
   CGroup: /system.slice/slurmd.service
           └─2055 /usr/sbin/slurmd -D

Jun 23 14:11:42 h002.nifl.fysik.dtu.dk systemd[1]: Starting Slurm node daemon...
Jun 23 14:12:12 h002.nifl.fysik.dtu.dk systemd[1]: Started Slurm node daemon.

With this experiment it seems to me that the issue may be a race condition between the start of the slurmd and NetworkManager services. What are your thoughts on this?

Thanks,
Ole
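For reference, two standard systemd tools can show this boot-time ordering directly (nothing here is Slurm-specific; the unit names are the ones appearing in the logs above):

    # Show the chain of units slurmd waited on during the last boot,
    # annotated with when each became active
    systemd-analyze critical-chain slurmd.service

    # Interleave the slurmd and NetworkManager journal entries from this boot
    # to see which one came up first
    journalctl -b -u slurmd.service -u NetworkManager.service -o short-precise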
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #3)
> With this experiment it seems to me that the issue may be a race condition
> between the start of the slurmd and NetworkManager services.
>
> What are your thoughts on this?

Thanks for the additional observations! I think that could explain a lot of what we are seeing here. Would you be able to add "Wants=network-online.target" to the slurmd unit file on an EL8 node and see if that makes any difference? This may delay the slurmd a little bit more and let the network come up all the way without an added sleep.
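A sketch of how that could be tried without editing the packaged unit file, using a systemd drop-in (the drop-in file name below is just illustrative; it is also worth checking that NetworkManager-wait-online.service is enabled, since network-online.target only waits for the network when such a service backs it). Create /etc/systemd/system/slurmd.service.d/network-online.conf containing:

    [Unit]
    Wants=network-online.target
    After=network-online.target

then reload systemd and make sure the wait-online service will run at boot:

    # With NetworkManager, network-online.target is implemented by
    # NetworkManager-wait-online.service
    systemctl enable NetworkManager-wait-online.service
    systemctl daemon-reload
    systemctl restart slurmd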
Following up on my idea in Comment 3 that slurmd starts before the network is fully started, I've found that the slurmd.service file may be depending incorrectly on network.target:

[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target

As discussed in the RHEL 8 documentation:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_networking/systemd-network-targets-and-services_configuring-and-managing-networking#differences-between-the-network-and-network-online-systemd-target_systemd-network-targets-and-services
we may in fact require network-online.target in the slurmd.service file so that Configless Slurm works correctly:

[Unit]
Description=Slurm node daemon
After=munge.service network-online.target remote-fs.target

I've removed the ExecStartPre delay from Comment 3 and rebooted the system. Now slurmd starts correctly at boot time:

[root@h002 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─core_limit.conf
   Active: active (running) since Wed 2021-06-23 14:28:25 CEST; 18s ago
 Main PID: 1657 (slurmd)
    Tasks: 2
   Memory: 19.4M
   CGroup: /system.slice/slurmd.service
           └─1657 /usr/sbin/slurmd -D

Jun 23 14:28:25 h002.nifl.fysik.dtu.dk systemd[1]: Started Slurm node daemon.

Therefore I would like to propose the following patch introducing the systemd network-online.target:

[root@h002 ~]# diff -c /usr/lib/systemd/system/slurmd.service /usr/lib/systemd/system/slurmd.service.orig
*** /usr/lib/systemd/system/slurmd.service      2021-06-23 14:37:15.890078007 +0200
--- /usr/lib/systemd/system/slurmd.service.orig 2021-05-28 10:42:26.000000000 +0200
***************
*** 1,6 ****
  [Unit]
  Description=Slurm node daemon
! After=munge.service network-online.target remote-fs.target
  #ConditionPathExists=/etc/slurm/slurm.conf

  [Service]
--- 1,6 ----
  [Unit]
  Description=Slurm node daemon
! After=munge.service network.target remote-fs.target
  #ConditionPathExists=/etc/slurm/slurm.conf

  [Service]

Does this seem to be a correct conclusion?

Thanks,
Ole

Looks like we came to basically the same conclusion here :)

I'll work on getting this included!

(In reply to Tim McMullan from comment #6)
> Looks like we came to basically the same conclusion here :)
>
> I'll work on getting this included!

Thanks! I'm confused about whether we want to use Wants= or After= or possibly both in the service file. This is defined in https://www.freedesktop.org/software/systemd/man/systemd.unit.html but I don't fully understand the distinction. There is a comment: "It is a common pattern to include a unit name in both the After= and Wants= options, in which case the unit listed will be started before the unit that is configured with these options."

One issue remains, though: why do my EL8 systems require the FQDN in DNS lookups, whereas yours don't?

Thanks,
Ole

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #7)
> Thanks! I'm confused whether we want to use wants= or after= or possibly
> both in the service file. This is defined in
> https://www.freedesktop.org/software/systemd/man/systemd.unit.html but I
> don't fully understand the distinction. There is a comment "It is a common
> pattern to include a unit name in both the After= and Wants= options, in
> which case the unit listed will be started before the unit that is
> configured with these options."

I read through https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ and https://www.freedesktop.org/software/systemd/man/systemd.special.html and between them I think the right thing to do is have network-online.target in both After= and Wants=. This quote in particular makes me think that: "systemd automatically adds dependencies of type Wants= and After= for this target unit to all SysV init script service units with an LSB header referring to the "$network" facility."

> One issue remains, though: Why do my EL8 systems require the FQDN in DNS
> lookups, whereas yours don't?

I do find that strange. I'm going to tweak my environment some to match yours as best I can and see if it behaves the same way. My DNS is run on FreeBSD with unbound, so it won't be perfect... but it's worth a try at least.
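As a quick sanity check on the Wants=/After= question above, systemd can report the merged values it actually uses for the unit, including anything contributed by drop-in files:

    # Effective ordering and dependency lists for slurmd
    systemctl show -p After -p Wants slurmd.service

    # The unit file plus all drop-ins that modify it
    systemctl cat slurmd.service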
Hi Ole,

We've pushed changes to the unit files for Slurm for 21.08+!

I've tried (mostly) replicating your setup with the details you provided and haven't seen the same behavior yet... nor have I found a good explanation for the difference you mention between EL7 and EL8. Have you found anything?

Thanks!
--Tim

(In reply to Tim McMullan from comment #11)
> Hi Ole,
>
> We've push changes to the unit files for slurm for 21.08+!
>
> I've tried (mostly) replicating your setup with the details you provided and
> haven't seen the same behavior yet... nor have I found a good explanation
> for the difference you mention between EL7 and EL8. Have you found anything?

As you can see in Comment 3, the issue is due to a race condition between the network being up before or after slurmd is started. The winner of the race condition may depend on many things.

IMHO, the correct and safe solution is to start slurmd only after the network-online target! This is crucial in the case of Configless Slurm.

Thanks,
Ole

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #12)
> (In reply to Tim McMullan from comment #11)
> > Hi Ole,
> >
> > We've push changes to the unit files for slurm for 21.08+!
> >
> > I've tried (mostly) replicating your setup with the details you provided and
> > haven't seen the same behavior yet... nor have I found a good explanation
> > for the difference you mention between EL7 and EL8. Have you found anything?
>
> As you can see in Comment 3, the issue is due to a race condition between
> the network being up before or after slurmd is started. The winner of the
> race condition may depend on many things.
>
> IMHO, the correct and safe solution is to start slurmd only after the
> network-online target! This is crucial in the case of Configless Slurm.
>
> Thanks,
> Ole

Sorry Ole, the situation I was trying to replicate was the difference in behavior of dig and host!

https://github.com/SchedMD/slurm/commit/e1e7926
and
https://github.com/SchedMD/slurm/commit/e88f7ff

Fix the issue in 21.08 :)

Thanks!
-Tim

(In reply to Tim McMullan from comment #13)
> Sorry Ole, the situation I was trying to replicate was the difference in
> behavior of dig and host!

I've verified again that on all our EL7 servers dig/host both work correctly, and on all EL8 servers (CentOS 8.3, 8.4, Stream, AlmaLinux 8.4) they don't (the FQDN is required). Additionally, I have access to the Slurm cluster at another university, and on their EL7 nodes dig/host both work correctly. They have installed an AlmaLinux 8.4 node which shows the same behavior:

$ cat /etc/redhat-release
AlmaLinux release 8.4 (Electric Cheetah)
$ host -t SRV _slurmctld._tcp
Host _slurmctld._tcp not found: 3(NXDOMAIN)
$ host -t SRV _slurmctld._tcp.grendel.cscaa.dk
_slurmctld._tcp.grendel.cscaa.dk has SRV record 0 0 6817 in4.grendel.cscaa.dk.
$ dig +short -t SRV -n _slurmctld._tcp
$ dig +short -t SRV -n _slurmctld._tcp.grendel.cscaa.dk
0 0 6817 in4.grendel.cscaa.dk.

So I believe the DNS SRV record problem is not due to our particular network or DNS setup. Could you possibly ask some other Slurm sites and SchedMD colleagues to check the dig/host behavior on any available EL8 nodes?

Thanks,
Ole
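For sites doing that comparison, a minimal set of data points to collect on each EL7/EL8 node might be the following (this is just generic resolver tooling; _slurmctld._tcp is the SRV label used by configless Slurm):

    # bind-utils provides both host and dig
    rpm -q bind-utils

    # The stub resolver configuration actually in use
    cat /etc/resolv.conf

    # Verbose host output shows which names it tries via the search list
    host -v -t SRV _slurmctld._tcp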
Hey Ole,

I did some asking around internally, and the RHEL 8 systems we have aren't exhibiting the issue. Since this doesn't seem to be impacting Slurm functionality and I'm not finding the issue thus far, it might be better to see if Red Hat might know what's going on with the host/dig behavior differences?

Thanks,
--Tim

(In reply to Tim McMullan from comment #13)
> Sorry Ole, the situation I was trying to replicate was the difference in
> behavior of dig and host!
>
> https://github.com/SchedMD/slurm/commit/e1e7926
> and
> https://github.com/SchedMD/slurm/commit/e88f7ff
>
> Fix the issue in 21.08 :)

Today there is a thread "slumctld don't start at boot" on the slurm-users list where a site is running Rocky Linux 8.4 and slurmctld fails to start. Do you think you can push the change also to 20.11.9, because this issue may be hitting more broadly on EL 8.4 systems?

Thanks,
Ole

Hi Ole,

I did some chatting internally, and right now the feeling is to leave it the way it is for 20.11. The issue doesn't seem to be impacting many people right now, and we are reluctant to change the unit files just in case we introduce something unexpected... and it's very easy to handle if there is a problem, since it doesn't require any code changes.

I'm going to resolve this for now, since the issue is resolved in 21.08.

Thanks!
--Tim