Ticket 20931 - Multi-cluster functionality broken after slurmdbd upgrade to 24.05 from 23.02
Summary: Multi-cluster functionality broken after slurmdbd upgrade to 24.05 from 23.02
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Federation
Version: 24.05.3
Hardware: Linux
OS: Linux
Severity: 4 - Minor Issue
Assignee: Joel Criado
QA Contact: Tim Wickberg
 
Reported: 2024-09-16 03:22 MDT by VUB HPC
Modified: 2024-09-30 09:56 MDT
CC List: 2 users

Site: VUB


Attachments
Slurm config file for 24.05 (19.64 KB, text/plain)
2024-09-20 03:06 MDT, VUB HPC

Description VUB HPC 2024-09-16 03:22:54 MDT
We have a multi-cluster setup that we recently upgraded from Slurm 23.02 to 24.05. We carried out the upgrade gradually, starting with slurmdbd, as described in https://slurm.schedmd.com/upgrades.html.
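For reference, the version of each daemon at a given stage of such an upgrade can be confirmed with checks along these lines (a generic sketch, not our exact procedure):

---------------
# on the slurmdbd host, after its upgrade
$ slurmdbd -V
# on each controller, still on the old release at this point
$ slurmctld -V
# on the submit hosts / login nodes
$ sinfo --version
---------------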

The upgrade of slurmdbd went fine: it started running on 24.05.3 without issue, and the 23.02 controllers successfully reconnected to it. However, the multi-cluster functionality broke, and adding the "-M" option to any Slurm command resulted in the following error:

---------------
$ sinfo -M hydra
sinfo: error: Cluster 'hydra' has an unknown select plugin_id 4294967294
sinfo: error: 'hydra' can't be reached now, or it is an invalid entry for --cluster.  Use 'sacctmgr list clusters' to see available clusters.
$ sinfo -M chimera
sinfo: error: Cluster 'chimera' has an unknown select plugin_id 4294967294
sinfo: error: 'chimera' can't be reached now, or it is an invalid entry for --cluster.  Use 'sacctmgr list clusters' to see available clusters.
---------------

This issue was resolved as soon as the controllers of each cluster were also upgraded to 24.05.3. However, we expected to be able to carry out the upgrade without impacting multi-cluster functionality. Is this a known limitation of multi-cluster setups, or is there something we can do to avoid it in future upgrades?
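For reference, something along these lines can be used to see what slurmdbd has registered for each cluster and which version the local client commands are running (a generic sketch; output omitted):

---------------
# what slurmdbd knows about each cluster (the RPC column reflects the
# protocol version the cluster registered with)
$ sacctmgr list clusters format=Cluster,ControlHost,ControlPort,RPC
# version of the client commands on the submit host
$ sinfo --version
---------------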
Comment 1 Jason Booth 2024-09-16 09:39:22 MDT
This seems unexpected and possibly a regression. Would you attach the slurm.conf from both clusters, as well as from the submit host where you tried to submit jobs?
Comment 6 VUB HPC 2024-09-20 03:06:45 MDT
Created attachment 38844 [details]
Slurm config file for 24.05

Config file used for 24.05 is attached.

What was changed between 23.02 and this one (an illustrative sketch follows below):
- SwitchType was dropped
- ENFORCE_BINDING_GRES was added to SelectTypeParameters
- CgroupAutomount was dropped
- in the topology file we switched to using full hostnames instead of short ones.
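As an illustrative sketch of those changes (this is not the attached file; the SelectType/SelectTypeParameters values and the node and switch names are made up):

---------------
# slurm.conf (24.05)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,ENFORCE_BINDING_GRES   # ENFORCE_BINDING_GRES added
# SwitchType line removed
# cgroup.conf: CgroupAutomount line removed
# topology.conf: full hostnames instead of short names, e.g.
SwitchName=sw01 Nodes=node[001-010].hydra.example.org
---------------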
Comment 7 Joel Criado 2024-09-20 03:42:49 MDT
Hi,

We have identified the root problem and are working on a suitable solution. I will update you when we have one.

As you detected, there is an incompatibility in the cluster information exchange when slurmdbd is on 24.05 and the rest of the components are still on 23.02. Sorry for the trouble caused by this issue.

Also, as you pointed out, once you bring the rest to 24.05 it should work just fine.

Kind regards,
Joel
Comment 8 Joel Criado 2024-09-24 07:28:34 MDT
Hi,

After carefully analyzing the issue, we have decided not to fix it. This means the incompatibility will remain between versions 24.05 and 23.02 in the scenario described in my previous comment. We apologize for any inconvenience this might have caused on your side.

The issue arises from a series of bad internal notes that led us to modify a part of the code one version sooner than we should have. Those changes caused the incompatibility. Since then, we have applied further changes to that part of the code, so reverting everything would put the code in a bad place. That is why we are taking this decision.
We will be more careful going forward so this kind of problem is not repeated. Sorry again for the inconvenience.

Kind regards,
Joel
Comment 9 VUB HPC 2024-09-24 07:45:10 MDT
OK, thanks for the info.

It does not concern us any more, as we have already upgraded, but I would highly recommend making a note of this in the upgrade documentation.
Comment 10 Ryan Novosielski 2024-09-26 13:24:28 MDT
Minor apologies for speaking out of turn here and modifying a ticket that is not mine, but it has been two days since this was discovered, and the upgrade guide /does not/ mention this incompatibility. That means you could have customers right now, with service contracts, in the process of going through upgrades that are going to cause a production outage. I have reopened this ticket and raised the severity until that problem is fixed.

I would be supremely unhappy with SchedMD if I scheduled a no-downtime upgrade that is specifically listed as supported in the upgrade guide and then ran into this show-stopper.
Comment 12 Jason Booth 2024-09-26 17:03:52 MDT
Hi Ryan. I understand your concern about wanting possible critical upgrade issues fixed and/or documented. The likelihood of encountering these two issues is remote, and they are either already fixed or a minor issue with a client command that can be easily remedied. So I am a little confused as to why you believe this is a severity 1 issue.

1. The first issue only occurs if you upgrade from 23.02 with slurmstepd processes that are suspended, or if you suspend these 23.02 steps after upgrading. This was fixed in the following commit.

https://github.com/SchedMD/slurm/commit/78e674911830dfbd8a3f50ff775ea784f012a88f

This workflow is not common so we expect very little impact from this issue.
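For sites that want to rule this out ahead of time, a check along these lines before upgrading the compute node daemons should show whether any suspended jobs exist (sketch only; resume them or let them finish first if any are listed):

---------------
# list suspended jobs before upgrading slurmd/slurmstepd
$ squeue --states=SUSPENDED --format="%i %j %u %T"
# resume a suspended job if needed
$ scontrol resume <jobid>
---------------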


2. The second issue deals with the -M option. This only affects 23.02 when upgrading to 23.11, or 23.02 -> 24.05. We refactored the way plugins work to fix several issues. While unfortunate, fixing these issues meant that we could not preserve "-M" option compatibility with 23.02. This was not called out in the 23.11 release notes as clearly as it could have been, and I have already opened a conversation internally about this. With that in mind, it should not be an issue for most sites.

Should a site encounter this, updating the client commands would be needed. This does not affect Slurm's ability to schedule, start, or complete workload.
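As a sketch of that workaround (how the newer client binaries are installed depends on your packaging, so this is only an illustration):

---------------
# before: client commands still on 23.02 on the submit host
$ sinfo --version
# after installing the 24.05 client commands on that host:
$ sinfo -M hydra          # -M works again once the client matches
---------------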


If your site has encountered this issue then please open a new ticket and we can 
continue the discussion there.

I am a bit hesitant to document this in the upgrade guide, or to start that pattern, since we tend to handle these types of updates in the RELEASE_NOTES. As stated above, making this clearer in the RELEASE_NOTES document would be my preference.

I do have a task to go and chat with folks internally about this, but I do not see why we would need to keep this issue open at a severity 1 "cluster down" status for a site that is not related to your cluster/site.
Comment 13 Ryan Novosielski 2024-09-27 00:51:05 MDT
RELEASE_NOTES, upgrade guide... the point is that right now it's in neither place, you basically have to know about this bug report in order to avoid stepping in it, and there doesn't seem to be any urgency from SchedMD to remedy that. Are you aware of another place where a site that is planning to upgrade right now could find this information, other than VUB's warning to us on the slurm-users mailing list?

Sites have workloads that rely on -M, which I know because mine is one of them. It's common to see this for applications like Open OnDemand in a multi-cluster environment, and some user scripts use it for one reason or another.
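To give a concrete (made-up) example of the kind of usage that breaks, this is how -M typically shows up in user scripts and portal submissions in a multi-cluster environment:

---------------
$ sbatch -M chimera job.sh       # submit to a specific cluster
$ squeue -M all -u $USER         # check jobs across all clusters
---------------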

You can't just discover that something might break someone's production environment and not notify your user community. Again, I would be really aggravated if I discovered that a documented upgrade process was broken, that the vendor was fully aware of it and simply chose not to note it anywhere, and that I now had to decide on the spot whether to do a client upgrade I had not planned or budgeted time for (because the documentation explicitly states that it's permissible to upgrade in stages in the right order) or roll back the upgrade entirely.
Comment 15 Bas van der Vlies 2024-09-30 09:56:57 MDT
Hello,

Thanks for your message. I am out of the office from 27/Sep - 7/Oct 2024.

Regards