| Summary: | Procedure for migrating slurmctld to another server |
| --- | --- |
| Product: | Slurm |
| Component: | slurmctld |
| Version: | 23.11.8 |
| Hardware: | Linux |
| OS: | Linux |
| Status: | RESOLVED INFOGIVEN |
| Severity: | 4 - Minor Issue |
| Priority: | --- |
| Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
| Assignee: | Stephen Kendall <stephen> |
| Site: | DTU Physics |
Description
Ole.H.Nielsen@fysik.dtu.dk
2024-06-04 03:42:20 MDT
Stephen Kendall (comment #1):

> Is it required that no running jobs are allowed in the cluster?

It is on versions prior to 23.11. One of the major changes in 23.11 was to allow 'slurmstepd' to receive an updated 'SlurmctldHost' setting so that running jobs will report back to the new controller when they finish.

> Are there any other steps which we must consider?

The procedure for 23.11 would be:

1. Stop slurmctld
2. Update SRV record
3. Migrate to new machine
4. Start slurmctld
5. If nodes are not communicating, run 'scontrol reconfigure' or restart slurmd

You should also pay attention to the TTL on the existing SRV record and any places in your environment where it might be cached.

The procedure would be even simpler if you can use the same IP address on the new server, but this might not be practical.

Let me know if you have any further questions.

Best regards,
Stephen

Ole.H.Nielsen@fysik.dtu.dk (comment #2):

Hi Stephen,

Thanks for your great reply:

(In reply to Stephen Kendall from comment #1)
> > Is it required that no running jobs are allowed in the cluster?
>
> It is on versions prior to 23.11. One of the major changes in 23.11 was to
> allow 'slurmstepd' to receive an updated 'SlurmctldHost' setting so that
> running jobs will report back to the new controller when they finish.

I had missed this point, but there is a page "Change SlurmctldHost settings without breaking running jobs" in the SC23 and SLUG23 presentations which I hadn't noticed before.

> > Are there any other steps which we must consider?
>
> The procedure for 23.11 would be:
>
> 1. Stop slurmctld
> 2. Update SRV record

Or, if not using Configless, copy slurm.conf to all nodes.

> 3. Migrate to new machine

This would include copying the StateSaveLocation directory to the new host and making sure the permissions allow the SlurmUser to read and write it.

> 4. Start slurmctld
> 5. If nodes are not communicating, run 'scontrol reconfigure' or restart slurmd
>
> You should also pay attention to the TTL on the existing SRV record and any
> places in your environment where it might be cached.

Thanks, good point! Could you elaborate on the caching? I don't see where this caching might be.

> The procedure would be even simpler if you can use the same IP address on
> the new server but this might not be practical.
>
> Let me know if you have any further questions.

Yes, could you kindly update the FAQ https://slurm.schedmd.com/faq.html#controller or provide the new instructions in some other documentation page?

Thanks a lot,
Ole

Stephen Kendall (comment #3):

> > 2. Update SRV record
> Or, if not using Configless, copy slurm.conf to all nodes.
>
> > 3. Migrate to new machine
> This would include copying the StateSaveLocation directory to the new host
> and making sure the permissions allow the SlurmUser to read and write it.

Correct.

> Could you elaborate on the caching?

Caching can vary based on the system configuration. By checking '/etc/resolv.conf', I can tell that my Ubuntu workstation uses a local loopback address to query the 'systemd-resolved' service for DNS, which then looks at the local DNS server. My AlmaLinux VM, on the other hand, points directly to the default gateway on its virtual network.

A local DNS service or intermediate server between the client and the authoritative DNS server may cache DNS records, but it is expected to check back up the chain once the TTL is reached. It should also have a mechanism for manually expiring cached DNS records. With systemd-resolved on my system, that is done with the command 'resolvectl flush-caches' (or 'systemd-resolve --flush-caches' on older versions).

If your compute nodes query a local DNS service or a separate DNS server that is not the authoritative server, you should check the relevant documentation on clearing the DNS cache as part of the migration to ensure they receive the new SRV record. However, it wouldn't surprise me if they are directly querying the authoritative server, in which case I don't expect caching to be an issue. Of course, you could simply lower the TTL ahead of the migration to ensure that downstream clients will promptly receive the new record regardless of DNS configuration.

> could you kindly update the FAQ or provide the new instructions in some other documentation page?

We are constantly making improvements to the documentation and will include the topics discussed here in those improvements. Keep in mind that it will take some time to update as we plan the exact changes to make, put them through review, and deploy them to the site.

Best regards,
Stephen
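The DNS checks discussed above can be illustrated with standard tools. In the following sketch the zone 'cluster.example.com' and the name server 'ns1.cluster.example.com' are placeholders, not values taken from this ticket:

    # Which resolver do the nodes actually use?
    cat /etc/resolv.conf
    resolvectl status                # only meaningful with systemd-resolved

    # Query the SRV record that configless slurmd uses to locate slurmctld,
    # and note the TTL that the local resolver reports:
    dig SRV _slurmctld._tcp.cluster.example.com

    # Query the authoritative server directly to see the updated record right away:
    dig SRV _slurmctld._tcp.cluster.example.com @ns1.cluster.example.com

    # Expire a local systemd-resolved cache after the record has been changed:
    resolvectl flush-caches          # 'systemd-resolve --flush-caches' on older versions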
Ole.H.Nielsen@fysik.dtu.dk (comment #4):

Hi Stephen,

I just noted a detail in step 5 of the instructions:

(In reply to Stephen Kendall from comment #1)
> The procedure for 23.11 would be:
>
> 1. Stop slurmctld
> 2. Update SRV record
> 3. Migrate to new machine
> 4. Start slurmctld
> 5. If nodes are not communicating, run 'scontrol reconfigure' or restart slurmd

With 23.11, 'scontrol reconfigure' will restart slurmctld, unlike in older versions. This means that 'scontrol reconfigure' in step 5 is redundant. Would your step 5 therefore be to restart slurmd on any failing nodes?

Thanks,
Ole

Stephen Kendall (comment #5):

Good catch, I think that is a better step 5. My intent was to reconfigure the 'slurmd' daemons, but if they are not communicating with the controller, it would be simplest to restart the daemons directly.

> With 23.11 a 'scontrol reconfigure' will restart slurmctld

To be exact, it creates new daemon processes and passes control to them if they start successfully. In most cases this is functionally equivalent to a normal restart, but it incurs essentially no downtime and recovers seamlessly in case of failure (e.g., config errors). There are still a few things it can't change: TCP ports and authentication mechanisms.
https://slurm.schedmd.com/scontrol.html#OPT_reconfigure

Best regards,
Stephen

Ole.H.Nielsen@fysik.dtu.dk (comment #6):

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #4)
> (In reply to Stephen Kendall from comment #1)
> > The procedure for 23.11 would be:
> >
> > 1. Stop slurmctld
> > 2. Update SRV record
> > 3. Migrate to new machine
> > 4. Start slurmctld
> > 5. If nodes are not communicating, run 'scontrol reconfigure' or restart slurmd
>
> With 23.11 a 'scontrol reconfigure' will restart slurmctld, unlike the older
> versions. This means that 'scontrol reconfigure' in step 5 is redundant.
> Would your step 5 therefore be to restart slurmd on any failing nodes?

The new 'scontrol reconfigure' functionality in 23.11 is a great improvement!

I would like to add a new step to the above procedure, following step 3:

3½. Update 'SlurmctldHost' in the slurmdbd server's slurm.conf and restart the slurmdbd service.

We have to make slurmdbd aware of the new 'SlurmctldHost' name :-) Is there a better way?

Ole

Stephen Kendall (comment #7):

The 'slurmdbd' daemon doesn't actually read from 'slurm.conf', although if you run any commands from the dbd host, those will read from 'slurm.conf', so you should keep it consistent with the 'slurmctld' host.

Best regards,
Stephen
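Pulling the steps discussed so far together, a minimal sketch of the migration on a configless 23.11 cluster might look as follows. The hostname 'newctl', the StateSaveLocation path, SlurmUser being 'slurm', and the use of rsync and clush are illustrative assumptions, not values prescribed in this ticket:

    # 1. On the old controller: stop slurmctld so the state files are quiescent.
    systemctl stop slurmctld

    # 2. Update the _slurmctld._tcp SRV record in DNS (configless), or
    #    distribute a slurm.conf with the new SlurmctldHost to all nodes.

    # 3. Copy the StateSaveLocation to the new controller and make sure
    #    the SlurmUser can read and write it (hostname and path are placeholders).
    rsync -a /var/spool/slurmctld/ newctl:/var/spool/slurmctld/
    ssh newctl chown -R slurm:slurm /var/spool/slurmctld

    # 4. On the new controller: start slurmctld.
    systemctl start slurmctld

    # 5. If nodes remain "not responding", restart slurmd on them, e.g. with clush:
    clush -ba systemctl restart slurmd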
Ole.H.Nielsen@fysik.dtu.dk (comment #8):

Hi Stephen,

(In reply to Stephen Kendall from comment #7)
> The 'slurmdbd' daemon doesn't actually read from 'slurm.conf', although if
> you run any commands from the dbd host, those will read from 'slurm.conf' so
> you should keep it consistent with the 'slurmctld' host.

Thanks for reminding me that slurmdbd doesn't need slurm.conf! I think this case is now fully documented, so please close it now.

Thanks for your support,
Ole

Stephen Kendall (comment #9):

Glad I could help. Let us know if any further questions come up.

Best regards,
Stephen

Ole.H.Nielsen@fysik.dtu.dk (comment #10):

We have performed the procedure for migrating slurmctld to another server as discussed in the present case. Even though it's not needed, we decided to make a system reservation of all nodes during the migration.

Prior to the migration we had increased the timeouts in slurm.conf to avoid slurmd crashes:

SlurmctldTimeout=3600
SlurmdTimeout=3600

Also, in the DNS server we decreased the SRV record TTL to 600 seconds, so that slurmds ought to read a new _slurmctld._tcp server name after at most 600 seconds.

There is an issue which I'd like to report: after slurmctld had been started successfully on the new server, we discovered that slurmds were in fact still caching the configuration files from the old slurmctld server in the /var/spool/slurmd/conf-cache folder, even after more than 600 seconds (the DNS TTL value) had passed. Apparently, slurmd doesn't honor the new DNS SRV record, or it doesn't update its cache of the IP address after the TTL has expired. Therefore slurmds weren't contacting the new slurmctld server as expected, so the nodes were "not responding" from the viewpoint of slurmctld.

We quickly had to fix the problem, so we restarted all slurmds (by "clush -ba systemctl restart slurmd"). This fixed the problem, and the cluster became healthy again. I think slurmd's handling of the DNS SRV record should be investigated to ensure that a changed record gets honored correctly.

For the record, I'm collecting my slurmctld migration experiences in the Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#migrate-the-slurmctld-service-to-another-server

Thanks,
Ole

Stephen Kendall (comment #11):

Thanks for that update. I'm not immediately sure what might cause that issue, but I think it deviates enough from the original topic of this ticket that it would be better handled in a new ticket. Please include in that new ticket any relevant logs, configs, commands run, and references to this ticket and your wiki page.

Best regards,
Stephen

Ole.H.Nielsen@fysik.dtu.dk (comment #12):

(In reply to Stephen Kendall from comment #11)
> Thanks for that update. I'm not immediately sure what might cause that
> issue, but I think it deviates enough from the original topic of this ticket
> that it would be better handled in a new ticket. Please include in that new
> ticket any relevant logs, configs, commands run, and references to this
> ticket and your wiki page.

I have confirmed the DNS SRV issue reported in comment 10 and reported it in a new bug 20462.

/Ole
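For reference, a few post-migration sanity checks along the lines of what comment #10 describes. These use standard Slurm and clustershell commands; the conf-cache path is the one reported above, and grepping a cached slurm.conf inside it is an assumption about how the configless cache is laid out:

    # Confirm that clients resolve and reach the new controller:
    scontrol ping
    scontrol show config | grep -i SlurmctldHost

    # List nodes that are down or drained, with the reason (e.g. "Not responding"):
    sinfo -R

    # On a compute node, inspect the configuration cached by configless slurmd:
    grep SlurmctldHost /var/spool/slurmd/conf-cache/slurm.conf

    # If nodes still point at the old controller, restart slurmd on them:
    clush -ba systemctl restart slurmd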