Ticket 20462

Summary: Configless slurmd does not honor changes in the DNS SRV record
Product: Slurm
Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: slurmd
Assignee: Daniel Armengod <armengod>
Status: RESOLVED FIXED
QA Contact: Documentation <docs>
Severity: 4 - Minor Issue
CC: armengod, stephen
Version: 23.11.8
Hardware: Linux
OS: Linux
Site: DTU Physics
Version Fixed: 24.05.4
Attachments: slurmd log file

Description Ole.H.Nielsen@fysik.dtu.dk 2024-07-22 03:36:13 MDT
We recently migrated our slurmctld service from an old server to a new host.  Since we use Configless Slurm, we had to update the DNS SRV record in our cluster's local DNS server zone file to point to the new slurmctld server:

_slurmctld._tcp 3600 IN SRV 0 0 6817 que2
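
For reference, the record and its remaining TTL can be checked with dig; an illustrative command and answer for this zone (the output line is sketched from the record above):

$ dig +noall +answer -t SRV _slurmctld._tcp.nifl.fysik.dtu.dk
_slurmctld._tcp.nifl.fysik.dtu.dk. 3600 IN SRV 0 0 6817 que2.nifl.fysik.dtu.dk.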

However, we experienced the issue that all slurmd daemons on the compute nodes did NOT recognize changes in the DNS SRV record, even long after the DNS record's TTL had expired.  We reported this in bug 20070 comment 10, and the workaround was to restart all slurmd's quickly.  Now I've made a careful test to isolate and confirm the issue:

I've drained a compute node (running AlmaLinux 8.10) and installed a BIND DNS server (bind-9.11.36-14.el8_10.x86_64) locally on the node, listening on the localhost address 127.0.0.1.  This DNS server only serves the cluster's DNS zone "nifl.fysik.dtu.dk", where the top of the zone file is configured as follows (notice the TTL of 600 seconds on the SRV record):

$TTL 86400
@ IN SOA        d023.nifl.fysik.dtu.dk. postmaster.fysik.dtu.dk. (
                2024072203              ; serial
                600                     ; refresh every 10 minutes
                600                     ; retry every 10 minutes
                6048000                 ; expire after 10 weeks
                86400 )                 ; default of 1 day
  IN NS d023.nifl.fysik.dtu.dk.
_slurmctld._tcp 600 IN SRV 0 0 6817 que2
(lines deleted)
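
For completeness, the zone itself was declared in the local named configuration roughly as follows (an illustrative sketch; the exact config file path and zone file name are assumptions that depend on the BIND packaging):

zone "nifl.fysik.dtu.dk" IN {
        type master;
        file "nifl.fysik.dtu.dk.zone";   # the zone file quoted above
};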

I modified the NetworkManager interface configuration file /etc/sysconfig/network-scripts/ifcfg-eno8303 to contain:

DNS1=127.0.0.1
DOMAIN="nifl.fysik.dtu.dk fysik.dtu.dk"
(lines deleted) 
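
NetworkManager then has to pick up the change; a sketch of one way to apply it (the exact commands are an assumption, and a full NetworkManager restart would work as well):

$ nmcli connection reload
$ nmcli connection up eno8303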

and now the /etc/resolv.conf correctly reflects this:

$ cat /etc/resolv.conf 
# Generated by NetworkManager
search nifl.fysik.dtu.dk fysik.dtu.dk
nameserver 127.0.0.1

The local DNS server is working correctly as shown by DNS lookup of the slurmctld server "que2":

# host -t SRV _slurmctld._tcp.`dnsdomainname`
_slurmctld._tcp.nifl.fysik.dtu.dk has SRV record 0 0 6817 que2.nifl.fysik.dtu.dk.

I then restarted slurmd and it's working correctly (slurmd.log is attached).  The contents of the Configless cache are correct:

$ ls -la /var/spool/slurmd/conf-cache
total 40
drwxr-xr-x. 2 root  root    131 Jul 22 09:56 .
drwxr-xr-x. 3 slurm slurm   274 Jul 22 09:56 ..
-rw-r--r--  1 root  root    640 Jul 22 09:56 acct_gather.conf
-rw-r--r--  1 root  root    609 Jul 22 09:56 cgroup.conf
-rw-r--r--  1 root  root    271 Jul 22 09:56 gres.conf
-rw-r--r--  1 root  root    132 Jul 22 09:56 job_container.conf
-rw-r--r--  1 root  root  16812 Jul 22 09:56 slurm.conf
-rw-r--r--  1 root  root   2003 Jul 22 09:56 topology.conf

At this point I edited the DNS zone file (and updated the serial number) to point to a different host "que":

_slurmctld._tcp 600 IN SRV 0 0 6817 que
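
The zone was then reloaded so that BIND serves the updated record (a sketch, assuming rndc is configured for the local server):

# rndc reload nifl.fysik.dtu.dk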

This new DNS SRV record is looked up correctly:

$ host -t SRV _slurmctld._tcp.`dnsdomainname`
_slurmctld._tcp.nifl.fysik.dtu.dk has SRV record 0 0 6817 que.nifl.fysik.dtu.dk.

Here is the ISSUE: even after more than one hour had passed (compare this to the TTL of 600 seconds), there was NO INDICATION in the slurmd.log file that slurmd had noticed the changed DNS SRV record.  User commands such as sinfo and squeue continue to work with the previously cached information in /var/spool/slurmd/conf-cache (the cached files have unchanged timestamps).

Expected behavior: after the TTL of the DNS SRV record has expired, slurmd is expected to query the DNS server (here 127.0.0.1) again for the SRV record.  That doesn't seem to happen!  The order of precedence for determining which configuration source to use, as explained in https://slurm.schedmd.com/configless_slurm.html, doesn't explain the behavior we're experiencing here.  I believe there is an issue that bears fixing.
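
One way to confirm that slurmd never re-queries DNS after startup would be to watch for DNS traffic on the loopback interface while slurmd runs (a sketch; with the local DNS server above, no queries from slurmd should show up after the initial startup):

# tcpdump -ni lo port 53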

As a final test, I restarted the slurmd service, and this fails because the server pointed to by the DNS SRV record isn't responding:

$ systemctl restart slurmd
Job for slurmd.service failed because the control process exited with error code.
See "systemctl status slurmd.service" and "journalctl -xe" for details.

$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmd.service.d
           └─core_limit.conf
   Active: failed (Result: exit-code) since Mon 2024-07-22 11:15:12 CEST; 13min ago
  Process: 64803 ExecStart=/usr/sbin/slurmd --systemd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 64803 (code=exited, status=1/FAILURE)

Jul 22 11:15:03 d023.nifl.fysik.dtu.dk systemd[1]: Starting Slurm node daemon...
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk slurmd[64804]: slurmd: error: _fetch_child: failed to fetch remote configs: Unable to contact slurm controller (connect failure)
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk slurmd[64804]: error: _fetch_child: failed to fetch remote configs: Unable to contact slurm controller (connect failure)
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk slurmd[64803]: slurmd: error: _establish_configuration: failed to load configs
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk slurmd[64803]: slurmd: error: slurmd initialization failed
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk slurmd[64803]: error: _establish_configuration: failed to load configs
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk slurmd[64803]: error: slurmd initialization failed
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk systemd[1]: slurmd.service: Failed with result 'exit-code'.
Jul 22 11:15:12 d023.nifl.fysik.dtu.dk systemd[1]: Failed to start Slurm node daemon.

Could you kindly look into the issue and suggest any errors on my side, or any additional tests that I should make?

Thanks,
Ole
Comment 1 Ole.H.Nielsen@fysik.dtu.dk 2024-07-22 03:36:30 MDT
Created attachment 37873 [details]
slurmd log file
Comment 2 Daniel Armengod 2024-07-22 04:47:08 MDT
Hello Ole,

Thanks for the detailed writeup. I'm going to replicate this in my environment after carefully reviewing it, but it does look like slurmd does not honor the TTL field in the SRV record.

I'll work on verifying the problem and proposing a patch if confirmed. Just to make sure, this is not causing any pressing issue on your end?

Kind regards,
Daniel
Comment 5 Ole.H.Nielsen@fysik.dtu.dk 2024-07-22 05:16:57 MDT
Hi Daniel,

Thanks a lot for your quick response!

(In reply to Daniel Armengod from comment #2)
> Thanks for the detailed writeup. I'm going to replicate this in my
> environment after carefully reviewing it, but it does look like slurmd does
> not honor the TTL field in the SRV record.

Thanks for tentatively confirming my view of the issue; let's see what comes up after your analysis.

> I'll work on verifying the problem and proposing a patch if confirmed. Just
> to make sure, this is not causing any pressing issue on your end?

That's correct: a few weeks ago we worked around it by quickly restarting all slurmd's after changing the DNS SRV record, and this resolved the pressing issue.
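
Such a quick restart can be done in parallel across all nodes, e.g. with ClusterShell (an illustrative command, assuming clush is configured with a group covering the compute nodes):

$ clush -ba 'systemctl restart slurmd'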

Other Slurm users may be surprised in the future if they encounter the same issue after updating the DNS SRV record, so it would be good to get a fix or a documentation update.

Best regards,
Ole
Comment 6 Ole.H.Nielsen@fysik.dtu.dk 2024-07-24 03:49:30 MDT
I would have thought that the desired response of slurmd, when it discovers that the TTL of the DNS SRV record has expired, is to create a new slurmd process, just like what happens with an "scontrol reconfigure" (from the 23.11 man page):

              Instruct all slurmctld and slurmd daemons to re-read the
              configuration file.  This mechanism can be used to modify
              configuration parameters set in slurm.conf(5) without
              interrupting running jobs.  Starting in 23.11, this command
              operates by creating new processes for the daemons, then
              passing control to the new processes when or if they start
              up successfully.  This allows it to gracefully catch
              configuration problems and keep running with the previous
              configuration if there is a problem.

Hopefully you'll agree on this?  By the way, have you reproduced the issue in your own environment?

Thanks,
Ole
Comment 8 Ole.H.Nielsen@fysik.dtu.dk 2024-07-25 12:52:11 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #6)
> I have thought that the desired response of slurmd when it discovers that
> the TTL of the DNS SRV record has expired, is that it should create a new
> slurmd process, just like what happens with an "scontrol reconfigure" (from
> 23.11):

Actually, I meant to say that if the DNS SRV record has *changed*, then slurmd should be restarted.  It's only after the TTL has expired that slurmd would discover whether the value has changed.

/Ole
Comment 9 Daniel Armengod 2024-09-14 02:05:14 MDT
Hi Ole,

Just getting a chance to put down in writing what we had the opportunity to discuss in-person at SLUG'24.

Regardless of whether configless slurmd discovers its ctld through --conf-server or DNS SRV, fetching the config is a one-time operation. The config's "freshness" is not associated in any way with the DNS record's TTL field; this is akin to fetching a generic resource over the net.

I'm going to propose a patch to slightly alter the wording in the configless documentation to emphasize the fact that discovery is a one-time operation performed at startup and that the daemon runs normally afterwards.

Kind regards,
Daniel
Comment 10 Ole.H.Nielsen@fysik.dtu.dk 2024-09-16 06:16:40 MDT
Hi Daniel,

It was nice to meet you at the SLUG'24 meeting last week.  Thanks for taking the time to discuss this case with me.

I appreciate the clarification that the DNS SRV record's TTL value isn't used for anything at all by slurmd after starting up - this was unexpected to me, given the normal semantics of DNS TTL values!

(In reply to Daniel Armengod from comment #9)
> Just getting a chance to put down in writing what we had the opportunity to
> discuss in-person at SLUG'24.
> 
> Regardless of whether configless slurmd discovers its ctld through
> --conf-server or DNS SRV, fetching the config is a one-time operation. The
> config's "freshness" is not associated in any way with the DNS record's TTL
> field; this is akin to fetching a generic resource over the net.
> 
> I'm going to propose a patch to slightly alter the wording in the configless
> documentation to emphasize the fact that discovery is a one-time operation
> performed at startup and that the daemon runs normally afterwards.

It will be good to get a clarification in the configless documentation.

Conclusion: in a configless setup, all slurmd's have to be restarted if the DNS SRV record is changed.  Is this correct?

I have a followup question:  If a site doesn't use configless, the slurmd reads slurm.conf from some file at startup and caches the value of SlurmctldHost.  What will happen when SlurmctldHost is changed in the slurm.conf file, similar to the present migration scenario?

My expectation is that slurmd will ignore the updated slurm.conf file until slurmd is explicitly restarted with "systemctl restart slurmd".  I don't know whether an "scontrol reconfig" command executed on the new SlurmctldHost will be rejected or ignored by the slurmd's.

It would be good to have documentation on migrating SlurmctldHost that covers both the configless scenario and a local slurm.conf file.

Thanks,
Ole
Comment 11 Daniel Armengod 2024-09-20 02:17:23 MDT
Hi Ole,

>It was nice to meet you at the SLUG'24 meeting last week.

Same! :)

>I have a followup question:  If a site doesn't use configless, the slurmd reads slurm.conf from some file at startup and caches the value of SlurmctldHost.  What will happen when SlurmctldHost is changed in the slurm.conf file, similar to the present migration scenario?

"Hot"-moving the slurmctld is not _really_ a common operation. As you surmise, the extant slurmds will fail to contact the slurmctld and become stuck until timeouts occur. They *will* acknowledge and execute a reconfig RPC from the new ctld, but not until all running operations finish, which, given that they cannot contact the slurmctld, will take a long time.

The recommended plan of action is to shut down the ctld, update slurm.conf, bring it up at a new location, and then restart all slurmds. It's cleaner that way.
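
A rough sketch of that sequence (illustrative only; the parallel-shell tooling and any details beyond the steps stated above are assumptions):

# On the old host:
systemctl stop slurmctld
# Update slurm.conf (SlurmctldHost) and/or the DNS SRV record to point at the new host, then there:
systemctl start slurmctld
# Finally, restart slurmd on all compute nodes, e.g.:
clush -ba 'systemctl restart slurmd'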

>It would be good to have a documentation of migrating SlurmctldHost which works both with the configless scenario and with a local slurm.conf file.

A documentation patch is in the works. I originally thought to clarify only the SRV TTL issue, but I'll keep this in mind.
Comment 12 Ole.H.Nielsen@fysik.dtu.dk 2024-09-20 03:02:58 MDT
Hi Daniel,

(In reply to Daniel Armengod from comment #11)
> >I have a followup question:  If a site doesn't use configless, the slurmd reads slurm.conf from some file at startup and caches the value of SlurmctldHost.  What will happen when SlurmctldHost is changed in the slurm.conf file, similar to the present migration scenario?
> 
> "Hot"-moving the slurmctld is not _really_ a common operation. As you
> surmise the extant slurmds will fail to contact the slurmctl and become
> stuck until timeouts happen. They *will* acknowledge and execute a reconfig
> RPC from the new ctld, but not until all running operations finish, which
> due to the fact that they cannot contact the slurmctld, will take a long
> time.

I agree that moving the slurmctld to a new server is a rare operation, but we need to be able to do so reliably and without losing running jobs (which became possible with 23.11) when we have to move the slurmctld to a new OS or new hardware.

Are you saying that slurmd would be blocking trying to contact the old SlurmctldHost until timeout, and only then perform the reconfig operation from the new SlurmctldHost?

> The recommended plan of action is to shut down the ctld, update slurm.conf,
> bring it up at a new location, and then restart all slurmds. It's cleaner
> that way.

Yes, I agree that this is the clean and reliable way!

I have now updated my Slurm Wiki page on this topic: https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#migrate-the-slurmctld-service-to-another-server  Could you perhaps look at the page quickly to see if I make any incorrect statements?

> >It would be good to have a documentation of migrating SlurmctldHost which works both with the configless scenario and with a local slurm.conf file.
> 
> A documentation patch is in the works. I originally thought to clarify only
> the SRV TTL issue, but I'll keep this in mind.

Thanks a lot for improving the documentation!  I'm sure it will help other users as well.

Please close this case now.

Best regards,
Ole
Comment 13 Jason Booth 2024-09-24 15:11:48 MDT
Ole,

Daniel brought this to my attention. I just want to clarify the role of our
support folks when dealing with external documentation requests like this. We
are going to have to decline your request to review the wiki link.

I would rather see changes and improvements in the official documentation.
Requests like this can give a false impression that SchedMD endorses what is
found on your wiki, and this is something we want to avoid. I do understand
that we have done this on occasion in the past, but it is something we are
trying to refrain from doing.

I would ask that you keep your questions in the ticket and directed at the
issue you are trying to solve.

> I have now updated my Slurm Wiki page on this topic: 
> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#migrate-the-slurmctld-service-to-another-server  
> Could you perhaps look at the page quickly to see if I make any incorrect 
> statements?

Regarding your other questions: I will have Daniel answer those in the next
update to this issue.
Comment 14 Ole.H.Nielsen@fysik.dtu.dk 2024-09-24 15:12:03 MDT
I'm out of the office, back on September 26.

Best regards,
Ole Holm Nielsen
Comment 16 Daniel Armengod 2024-09-25 05:54:58 MDT
Hello Ole,

>Are you saying that slurmd would be blocking trying to contact the old SlurmctldHost until timeout, and only then perform the reconfig operation from the new SlurmctldHost?

Yes. As part of the reconfigure process slurmd must join all extant RPC handler threads. Threads that need to communicate with the controller will hang until timeout, at which point reconfigure will proceed.

Kind regards,
Daniel
Comment 19 Ole.H.Nielsen@fysik.dtu.dk 2024-09-26 04:13:01 MDT
Hi Jason,

Thanks for the note, and I fully agree with you that my Wiki is exclusively my own responsibility and that SchedMD should not comment upon it.

Best regards,
Ole

(In reply to Jason Booth from comment #13)
> Daniel brought this to my attention. I just want to clarify the role of our
> support folks when dealing with external documentation requests like this.
> We are going to have to decline your request to review the wiki link.
> 
> I would rather see changes and improvements in the official documentation.
> Requests like this can give a false impression that SchedMD endorses what is
> found on your wiki, and this is something we want to avoid. I do understand
> that we have done this on occasion in the past, but it is something we are
> trying to refrain from doing.
Comment 20 Ole.H.Nielsen@fysik.dtu.dk 2024-09-26 04:15:50 MDT
Hi Daniel,

(In reply to Daniel Armengod from comment #16)
> >Are you saying that slurmd would be blocking trying to contact the old SlurmctldHost until timeout, and only then perform the reconfig operation from the new SlurmctldHost?
> 
> Yes. As part of the reconfigure process slurmd must join all extant RPC
> handler threads. Threads that need to communicate with the controller will
> hang until timeout, at which point reconfigure will proceed.

Thanks a lot for the clarification.  It seems that restarting all slurmd's is the correct approach when migrating the slurmctld.

Best regards,
Ole
Comment 21 Daniel Armengod 2024-09-30 01:33:57 MDT
Hello Ole,

Thank you very much for reporting this issue. The doc patch will land in 24.05.4.

If there's nothing else, I'm closing this ticket.

-Daniel