| Summary: | Procedure for migrating slurmctld to another server |
| --- | --- |
| Product: | Slurm |
| Component: | slurmctld |
| Version: | 23.11.8 |
| Hardware: | Linux |
| OS: | Linux |
| Status: | RESOLVED INFOGIVEN |
| Severity: | 4 - Minor Issue |
| Priority: | --- |
| Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
| Assignee: | Stephen Kendall <stephen> |
| Site: | DTU Physics |
Description
Ole.H.Nielsen@fysik.dtu.dk
2024-06-04 03:42:20 MDT
Stephen Kendall (comment #1):

> Is it required that no running jobs are allowed in the cluster?

It is on versions prior to 23.11. One of the major changes in 23.11 was to allow 'slurmstepd' to receive an updated 'SlurmctldHost' setting so that running jobs will report back to the new controller when they finish.

> Are there any other steps which we must consider?

The procedure for 23.11 would be:

1. Stop slurmctld
2. Update SRV record
3. Migrate to new machine
4. Start slurmctld
5. If nodes are not communicating, run 'scontrol reconfigure' or restart slurmd

You should also pay attention to the TTL on the existing SRV record and any places in your environment where it might be cached.

The procedure would be even simpler if you can use the same IP address on the new server, but this might not be practical.

Let me know if you have any further questions.

Best regards,
Stephen

Ole.H.Nielsen@fysik.dtu.dk (comment #2):

Hi Stephen,

Thanks for your great reply:

(In reply to Stephen Kendall from comment #1)
> > Is it required that no running jobs are allowed in the cluster?
>
> It is on versions prior to 23.11. One of the major changes in 23.11 was to
> allow 'slurmstepd' to receive an updated 'SlurmctldHost' setting so that
> running jobs will report back to the new controller when they finish.

I had missed this point, but there is a page "Change SlurmctldHost settings without breaking running jobs" in the SC23 and SLUG23 presentations which I hadn't noticed before.

> > Are there any other steps which we must consider?
>
> The procedure for 23.11 would be:
>
> 1. Stop slurmctld
> 2. Update SRV record

Or, if not using Configless, copy slurm.conf to all nodes.

> 3. Migrate to new machine

This would include copying the StateSaveLocation directory to the new host and making sure the permissions allow the SlurmUser to read and write it.

> 4. Start slurmctld
> 5. If nodes are not communicating, run 'scontrol reconfigure' or restart slurmd
>
> You should also pay attention to the TTL on the existing SRV record and any
> places in your environment where it might be cached.

Thanks, good point! Could you elaborate on the caching? I don't see where this caching might be.

> The procedure would be even simpler if you can use the same IP address on
> the new server but this might not be practical.
>
> Let me know if you have any further questions.

Yes, could you kindly update the FAQ https://slurm.schedmd.com/faq.html#controller or provide the new instructions in some other documentation page?

Thanks a lot,
Ole

Stephen Kendall (comment #3):

> > 2. Update SRV record
> Or, if not using Configless, copy slurm.conf to all nodes.
>
> > 3. Migrate to new machine
> This would include copying the StateSaveLocation directory to the new host
> and making sure the permissions allow the SlurmUser to read and write it.

Correct.

> Could you elaborate on the caching?

Caching can vary based on the system configuration. By checking '/etc/resolv.conf', I can tell that my Ubuntu workstation uses a local loopback address to query the 'systemd-resolved' service for DNS, which then looks at the local DNS server. My AlmaLinux VM, on the other hand, points directly to the default gateway on its virtual network.

A local DNS service or intermediate server between the client and the authoritative DNS server may cache DNS records, but it is expected to check back up the chain once the TTL is reached. It should also have a mechanism for manually expiring cached DNS records. With systemd-resolved on my system, that is done with the command 'resolvectl flush-caches' (or 'systemd-resolve --flush-caches' on older versions).

If your compute nodes query a local DNS service or a separate DNS server that is not the authoritative server, you should check the relevant documentation on clearing the DNS cache as part of the migration to ensure they receive the new SRV record. However, it wouldn't surprise me if they are directly querying the authoritative server, in which case I don't expect caching to be an issue. Of course, you could simply lower the TTL ahead of the migration to ensure that downstream clients will promptly receive the new record regardless of DNS configuration.

> could you kindly update the FAQ or provide the new instructions in some other documentation page?

We are constantly making improvements to the documentation and will include the topics discussed here in those improvements. Keep in mind that it will take some time to update as we plan the exact changes to make, put them through review, and deploy them to the site.

Best regards,
Stephen
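The DNS checks discussed above can be illustrated with standard tools. In the following sketch the zone 'cluster.example.com' and the name server 'ns1.cluster.example.com' are placeholders, not values taken from this ticket:

    # Which resolver do the nodes actually use?
    cat /etc/resolv.conf
    resolvectl status                # only meaningful with systemd-resolved

    # Query the SRV record that configless slurmd uses to locate slurmctld,
    # and note the TTL that the local resolver reports:
    dig SRV _slurmctld._tcp.cluster.example.com

    # Query the authoritative server directly to see the updated record right away:
    dig SRV _slurmctld._tcp.cluster.example.com @ns1.cluster.example.com

    # Expire a local systemd-resolved cache after the record has been changed:
    resolvectl flush-caches          # 'systemd-resolve --flush-caches' on older versions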
Ole.H.Nielsen@fysik.dtu.dk (comment #4):

Hi Stephen,

I just noted a detail in step 5 of the instructions:

(In reply to Stephen Kendall from comment #1)
> The procedure for 23.11 would be:
>
> 1. Stop slurmctld
> 2. Update SRV record
> 3. Migrate to new machine
> 4. Start slurmctld
> 5. If nodes are not communicating, run 'scontrol reconfigure' or restart slurmd

With 23.11, 'scontrol reconfigure' will restart slurmctld, unlike in older versions. This means that 'scontrol reconfigure' in step 5 is redundant. Would your step 5 therefore be to restart slurmd on any failing nodes?

Thanks,
Ole

Stephen Kendall (comment #5):

Good catch, I think that is a better step 5. My intent was to reconfigure the 'slurmd' daemons, but if they are not communicating with the controller, it would be simplest to restart the daemons directly.

> With 23.11 a 'scontrol reconfigure' will restart slurmctld

To be exact, it creates new daemon processes and passes control to them if they start successfully. In most cases this is functionally equivalent to a normal restart, but it incurs essentially no downtime and recovers seamlessly in case of failure (e.g., config errors). There are still a few things it can't change: TCP ports and authentication mechanisms.
https://slurm.schedmd.com/scontrol.html#OPT_reconfigure

Best regards,
Stephen

Ole.H.Nielsen@fysik.dtu.dk (comment #6):

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #4)
> (In reply to Stephen Kendall from comment #1)
> > The procedure for 23.11 would be:
> >
> > 1. Stop slurmctld
> > 2. Update SRV record
> > 3. Migrate to new machine
> > 4. Start slurmctld
> > 5. If nodes are not communicating, run 'scontrol reconfigure' or restart slurmd
>
> With 23.11 a 'scontrol reconfigure' will restart slurmctld, unlike the older
> versions. This means that 'scontrol reconfigure' in step 5 is redundant.
> Would your step 5 therefore be to restart slurmd on any failing nodes?

The new 'scontrol reconfigure' functionality in 23.11 is a great improvement!

I would like to add a new step to the above procedure, following step 3:

3½. Update 'SlurmctldHost' in the slurmdbd server's slurm.conf and restart the slurmdbd service.

We have to make slurmdbd aware of the new 'SlurmctldHost' name :-) Is there a better way?

Ole

Stephen Kendall (comment #7):

The 'slurmdbd' daemon doesn't actually read from 'slurm.conf', although if you run any commands from the dbd host, those will read from 'slurm.conf', so you should keep it consistent with the 'slurmctld' host.

Best regards,
Stephen
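Pulling the steps discussed so far together, a minimal sketch of the migration on a configless 23.11 cluster might look as follows. The hostname 'newctl', the StateSaveLocation path, SlurmUser being 'slurm', and the use of rsync and clush are illustrative assumptions, not values prescribed in this ticket:

    # 1. On the old controller: stop slurmctld so the state files are quiescent.
    systemctl stop slurmctld

    # 2. Update the _slurmctld._tcp SRV record in DNS (configless), or
    #    distribute a slurm.conf with the new SlurmctldHost to all nodes.

    # 3. Copy the StateSaveLocation to the new controller and make sure
    #    the SlurmUser can read and write it (hostname and path are placeholders).
    rsync -a /var/spool/slurmctld/ newctl:/var/spool/slurmctld/
    ssh newctl chown -R slurm:slurm /var/spool/slurmctld

    # 4. On the new controller: start slurmctld.
    systemctl start slurmctld

    # 5. If nodes remain "not responding", restart slurmd on them, e.g. with clush:
    clush -ba systemctl restart slurmd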
Ole.H.Nielsen@fysik.dtu.dk (comment #8):

Hi Stephen,

(In reply to Stephen Kendall from comment #7)
> The 'slurmdbd' daemon doesn't actually read from 'slurm.conf', although if
> you run any commands from the dbd host, those will read from 'slurm.conf' so
> you should keep it consistent with the 'slurmctld' host.

Thanks for reminding me that slurmdbd doesn't need slurm.conf! I think this case is now fully documented, so please close it now.

Thanks for your support,
Ole

Stephen Kendall (comment #9):

Glad I could help. Let us know if any further questions come up.

Best regards,
Stephen

Ole.H.Nielsen@fysik.dtu.dk (comment #10):

We have performed the procedure for migrating slurmctld to another server as discussed in the present case. Even though it's not needed, we decided to make a system reservation of all nodes during the migration.

Prior to the migration we had increased the timeouts in slurm.conf to avoid slurmd crashes:

SlurmctldTimeout=3600
SlurmdTimeout=3600

Also, in the DNS server we decreased the SRV record TTL to 600 seconds, so that slurmds ought to read a new _slurmctld._tcp server name after at most 600 seconds.

There is an issue which I'd like to report: after slurmctld had been started successfully on the new server, we discovered that slurmds were in fact still caching the configuration files from the old slurmctld server in the /var/spool/slurmd/conf-cache folder, even after more than 600 seconds (the DNS TTL value) had passed. Apparently, slurmd doesn't honor the new DNS SRV record, or it doesn't update its cache of the IP address after the TTL has expired. Therefore slurmds weren't contacting the new slurmctld server as expected, so the nodes were "not responding" from the viewpoint of slurmctld.

We quickly had to fix the problem, so we restarted all slurmds (by "clush -ba systemctl restart slurmd"). This fixed the problem, and the cluster became healthy again. I think slurmd's handling of the DNS SRV record should be investigated to ensure that a changed record gets honored correctly.

For the record, I'm collecting my slurmctld migration experiences in the Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#migrate-the-slurmctld-service-to-another-server

Thanks,
Ole

Stephen Kendall (comment #11):

Thanks for that update. I'm not immediately sure what might cause that issue, but I think it deviates enough from the original topic of this ticket that it would be better handled in a new ticket. Please include in that new ticket any relevant logs, configs, commands run, and references to this ticket and your wiki page.

Best regards,
Stephen

Ole.H.Nielsen@fysik.dtu.dk (comment #12):

(In reply to Stephen Kendall from comment #11)
> Thanks for that update. I'm not immediately sure what might cause that
> issue, but I think it deviates enough from the original topic of this ticket
> that it would be better handled in a new ticket. Please include in that new
> ticket any relevant logs, configs, commands run, and references to this
> ticket and your wiki page.

I have confirmed the DNS SRV issue reported in comment 10 and reported it in a new bug 20462.

/Ole
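For reference, a few post-migration sanity checks along the lines of what comment #10 describes. These use standard Slurm and clustershell commands; the conf-cache path is the one reported above, and grepping a cached slurm.conf inside it is an assumption about how the configless cache is laid out:

    # Confirm that clients resolve and reach the new controller:
    scontrol ping
    scontrol show config | grep -i SlurmctldHost

    # List nodes that are down or drained, with the reason (e.g. "Not responding"):
    sinfo -R

    # On a compute node, inspect the configuration cached by configless slurmd:
    grep SlurmctldHost /var/spool/slurmd/conf-cache/slurm.conf

    # If nodes still point at the old controller, restart slurmd on them:
    clush -ba systemctl restart slurmd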