We recently migrated our slurmctld service from an old server to a new host. Since we use Configless Slurm, we had to update the DNS SRV record in our cluster's local DNS server zone file to point to the new slurmctld server:

_slurmctld._tcp 3600 IN SRV 0 0 6817 que2

However, we experienced the issue that all slurmd daemons on the compute nodes did NOT recognize the change in the DNS SRV record, even long after the DNS record's TTL had expired. We reported this in bug 20070 comment 10, and the workaround was to restart all slurmd's quickly.

Now I've made a careful test to isolate and confirm the issue:

I've drained a compute node (running AlmaLinux 8.10) and installed a BIND (bind-9.11.36-14.el8_10.x86_64) DNS server locally on the node, listening on the localhost address 127.0.0.1. This DNS server only serves the cluster's DNS zone "nifl.fysik.dtu.dk", where the top of the zone file is configured as follows (notice the TTL of 600 seconds on the SRV record):

$TTL 86400
@       IN SOA  d023.nifl.fysik.dtu.dk. postmaster.fysik.dtu.dk. (
                        2024072203      ; serial
                        600             ; refresh every 10 minutes
                        600             ; retry every 10 minutes
                        6048000         ; expire after 10 weeks
                        86400 )         ; default of 1 day
        IN NS   d023.nifl.fysik.dtu.dk.
_slurmctld._tcp 600     IN SRV  0 0 6817 que2
(lines deleted)

I modified the NetworkManager interface configuration file /etc/sysconfig/network-scripts/ifcfg-eno8303 to contain:

DNS1=127.0.0.1
DOMAIN="nifl.fysik.dtu.dk fysik.dtu.dk"
(lines deleted)

and now /etc/resolv.conf correctly reflects this:

$ cat /etc/resolv.conf
# Generated by NetworkManager
search nifl.fysik.dtu.dk fysik.dtu.dk
nameserver 127.0.0.1

The local DNS server is working correctly, as shown by a DNS lookup of the slurmctld server "que2":

# host -t SRV _slurmctld._tcp.`dnsdomainname`
_slurmctld._tcp.nifl.fysik.dtu.dk has SRV record 0 0 6817 que2.nifl.fysik.dtu.dk.

I then restarted slurmd and it's working correctly (slurmd.log is attached).
The contents of the Configless cache are correct:

$ ls -la /var/spool/slurmd/conf-cache
total 40
drwxr-xr-x. 2 root  root    131 Jul 22 09:56 .
drwxr-xr-x. 3 slurm slurm   274 Jul 22 09:56 ..
-rw-r--r--  1 root  root    640 Jul 22 09:56 acct_gather.conf
-rw-r--r--  1 root  root    609 Jul 22 09:56 cgroup.conf
-rw-r--r--  1 root  root    271 Jul 22 09:56 gres.conf
-rw-r--r--  1 root  root    132 Jul 22 09:56 job_container.conf
-rw-r--r--  1 root  root  16812 Jul 22 09:56 slurm.conf
-rw-r--r--  1 root  root   2003 Jul 22 09:56 topology.conf

At this point I edited the DNS zone file (and updated the serial number) to point to a different host, "que":

_slurmctld._tcp 600 IN SRV 0 0 6817 que

This new DNS SRV record is looked up correctly:

$ host -t SRV _slurmctld._tcp.`dnsdomainname`
_slurmctld._tcp.nifl.fysik.dtu.dk has SRV record 0 0 6817 que.nifl.fysik.dtu.dk.

Here is the ISSUE: Even after more than one hour had passed (compare this to the TTL of 600 seconds), there is NO INDICATION in the slurmd.log file that the change of the DNS SRV record has been noticed. User commands such as sinfo and squeue continue to work with the previously cached information in /var/spool/slurmd/conf-cache (the cached files have unchanged timestamps).

Expected behavior: After the TTL of the DNS SRV record has expired, slurmd is expected to query the DNS server (here 127.0.0.1) again for the DNS SRV record. That doesn't seem to happen!

The order of precedence for determining which configuration source to use, as explained in https://slurm.schedmd.com/configless_slurm.html, doesn't explain the behavior we're experiencing here. I believe there is an issue that bears fixing.

As a final test, I restarted the slurmd service, and this fails because the server pointed to by the DNS SRV record isn't responding:

$ systemctl restart slurmd
Job for slurmd.service failed because the control process exited with error code.
See "systemctl status slurmd.service" and "journalctl -xe" for details.
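The expected behavior described above amounts to caching the SRV answer only for its TTL and re-resolving once that TTL has expired. A minimal Python sketch of this idea follows; the class and function names are hypothetical illustrations, not slurmd's actual implementation, and the resolver is injected as a callable so the TTL logic can be shown without a live DNS server:

```python
import time


class SrvCache:
    """Cache a DNS SRV answer and re-resolve once its TTL expires.

    'resolve' is any callable returning (target, port, ttl_seconds);
    'clock' defaults to time.monotonic and is injectable for testing.
    """

    def __init__(self, resolve, clock=time.monotonic):
        self.resolve = resolve
        self.clock = clock
        self.target = None
        self.port = None
        self.expires_at = 0.0  # nothing cached yet, so the first call resolves

    def controller(self):
        # Re-query the DNS server only when the cached answer's TTL has run out.
        if self.clock() >= self.expires_at:
            target, port, ttl = self.resolve()
            self.target, self.port = target, port
            self.expires_at = self.clock() + ttl
        return self.target, self.port


# Simulated scenario from the report: the SRV record changes from
# "que2" to "que" while a 600-second TTL is in effect.
answers = iter([("que2", 6817, 600), ("que", 6817, 600)])
now = [0.0]
cache = SrvCache(resolve=lambda: next(answers), clock=lambda: now[0])

print(cache.controller())   # first lookup: ('que2', 6817)
now[0] = 300                # TTL not yet expired: cached answer reused
print(cache.controller())   # still ('que2', 6817)
now[0] = 700                # TTL expired: re-resolves to ('que', 6817)
print(cache.controller())
```

With a TTL-honoring cache like this, the changed record would be picked up within 600 seconds; the reported behavior suggests slurmd resolves the SRV record once at startup and never again.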
$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─core_limit.conf
   Active: failed (Result: exit-code) since Mon 2024-07-22 11:15:12 CEST; 13min ago
  Process: 64803 ExecStart=/usr/sbin/slurmd --systemd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 64803 (code=exited, status=1/FAILURE)

Jul 22 11:15:03 d023.nifl.fysik.dtu.dk systemd[1]: Starting Slurm node daemon...
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk slurmd[64804]: slurmd: error: _fetch_child: failed to fetch remote configs: Unable to contact slurm controller (connect failure)
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk slurmd[64804]: error: _fetch_child: failed to fetch remote configs: Unable to contact slurm controller (connect failure)
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk slurmd[64803]: slurmd: error: _establish_configuration: failed to load configs
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk slurmd[64803]: slurmd: error: slurmd initialization failed
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk slurmd[64803]: error: _establish_configuration: failed to load configs
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk slurmd[64803]: error: slurmd initialization failed
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk systemd[1]: slurmd.service: Failed with result 'exit-code'.
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk systemd[1]: Failed to start Slurm node daemon.

Could you kindly look into the issue and suggest any errors on my side, or any additional tests that I should make?

Thanks,
Ole
Created attachment 37873 [details]
slurmd log file
Hello Ole,

Thanks for the detailed writeup. I'm going to replicate this in my environment after carefully reviewing it, but it does look like slurmd does not honor the TTL field in the SRV record.

I'll work on verifying the problem and proposing a patch if confirmed. Just to make sure, this is not causing any pressing issue on your end?

Kind regards,
Daniel
Hi Daniel,

Thanks a lot for your quick response!

(In reply to Daniel Armengod from comment #2)
> Thanks for the detailed writeup. I'm going to replicate this in my
> environment after carefully reviewing it, but it does look like slurmd does
> not honor the TTL field in the SRV record.

Thanks for sharing your initial assessment, which matches my view of the issue; let's see what comes up after your analysis.

> I'll work on verifying the problem and proposing a patch if confirmed. Just
> to make sure, this is not causing any pressing issue on your end?

That's correct: a few weeks ago we worked around the pressing issue by quickly restarting all slurmd's after changing the DNS SRV record. However, other Slurm users may be surprised in the future if they encounter the same issue after updating the DNS SRV record, so it would be good to get a fix or a documentation update.

Best regards,
Ole
I would think that the desired response of slurmd, when it discovers that the TTL of the DNS SRV record has expired, is that it should create a new slurmd process, just like what happens with an "scontrol reconfigure" (from 23.11):

   Instruct all slurmctld and slurmd daemons to re-read the configuration
   file. This mechanism can be used to modify configuration parameters set
   in slurm.conf(5) without interrupting running jobs. Starting in 23.11,
   this command operates by creating new processes for the daemons, then
   passing control to the new processes when or if they start up
   successfully. This allows it to gracefully catch configuration problems
   and keep running with the previous configuration if there is a problem.

Probably you'll agree on this?

BTW, have you reproduced the issue in your own environment?

Thanks,
Ole
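The graceful handover quoted above (start with the new configuration, but keep running on the old one if the new one fails to load) can be sketched in a few lines of Python. This is an illustration of the pattern only; the function names are hypothetical and this is not Slurm source code:

```python
def reconfigure(current_config, load_new_config):
    """Try to adopt a new configuration, keeping the old one on failure.

    Mirrors the 23.11 'scontrol reconfigure' idea quoted above: control is
    only handed over if the new configuration loads successfully.
    'load_new_config' is any callable returning the new config or raising.
    """
    try:
        new_config = load_new_config()
    except Exception as err:
        # The new configuration could not be loaded; keep the previous one
        # so the daemon stays up, as the manual page describes.
        print(f"reconfigure failed ({err}); keeping previous configuration")
        return current_config
    return new_config


# An unreachable controller leaves the old configuration in place:
def fetch_from_dead_controller():
    raise ConnectionError("Unable to contact slurm controller")

config = {"SlurmctldHost": "que2"}
config = reconfigure(config, fetch_from_dead_controller)
print(config)  # still {'SlurmctldHost': 'que2'}
```

Applied to this bug, a TTL-expiry event could feed into exactly this path: attempt startup against the new SRV target, and fall back to the cached configuration if the new controller is unreachable.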
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #6)
> I have thought that the desired response of slurmd when it discovers that
> the TTL of the DNS SRV record has expired, is that it should create a new
> slurmd process, just like what happens with an "scontrol reconfigure" (from
> 23.11):

Actually, I meant to say that if the DNS SRV record has *changed*, then slurmd should be restarted. It's only after the TTL has expired that slurmd can discover whether the record's value has changed.

/Ole