| Summary: | Multi-cluster functionality broken after slurmdbd upgrade to 24.05 from 23.02 | | |
|---|---|---|---|
Product: | Slurm | Reporter: | VUB HPC <hpcadmin> |
Component: | Federation | Assignee: | Joel Criado <joel> |
Status: | RESOLVED INFOGIVEN | QA Contact: | Tim Wickberg <tim> |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | bas.vandervlies, novosirj |
Version: | 24.05.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | VUB | |
Attachments: | Slurm config file for 24.05 |
Description
VUB HPC 2024-09-16 03:22:54 MDT

This seems unexpected and possibly a regression. Would you attach the slurm.conf from both clusters, and from the submit host where you tried to submit jobs?

Created attachment 38844 [details]
Slurm config file for 24.05

Config file used for 24.05 is in the attachment.
What was changed between 23.02 and this one:
- SwitchType was dropped
- ENFORCE_BINDING_GRES was added to SelectTypeParameters
- CgroupAutomount was dropped
- in the topology file we switched to using the full hostname instead of the short one.
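For illustration, the changes listed above would look roughly like the following in slurm.conf. This is a sketch, not the attached config; the parameter values are placeholders, and CgroupAutomount actually lives in cgroup.conf rather than slurm.conf:

```conf
# 23.02 -> 24.05 changes described above (illustrative placeholders only)

# Dropped: SwitchType is no longer set explicitly
#SwitchType=switch/none

# Dropped from cgroup.conf: CgroupAutomount was removed
#CgroupAutomount=yes

# ENFORCE_BINDING_GRES appended to the existing SelectTypeParameters value
SelectTypeParameters=CR_Core_Memory,ENFORCE_BINDING_GRES
```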
Hi,

We have identified the root problem and we are working on a suitable solution for it. I will update you when we have one. As you detected, there is an incompatibility regarding cluster info when slurmdbd is on 24.05 and the rest is on 23.02. Sorry for the trouble caused by this issue. Also, as you pointed out, once you bring the rest to 24.05 it should work just fine.

Kind regards, Joel

Hi,

After carefully analyzing the issue, we have decided not to fix it. This means that the incompatibility will remain between versions 24.05 and 23.02 in the scenario described in my previous comment. We apologize for any inconvenience this might have caused on your side.

The issue arises from a series of bad internal notes that led us to modify a part of the code one version sooner than it should have been. Those changes caused the incompatibility. After that, we applied more changes to that part of the code, so reverting everything would put the code in a bad place. That is why we are taking this decision. We will be more careful going forward so this kind of problem is not repeated.

Sorry again for the inconvenience.

Kind regards, Joel

OK, thanks for the info. It does not concern us any more as we already upgraded, but I would highly recommend making a note of this in the upgrade documentation.

Minor apologies for speaking out of turn here and modifying a ticket that is not mine, but it has been two days since this was discovered, and the upgrade guide *does not* mention this incompatibility. That means you could have customers right now, with service contracts, in the process of going through upgrades that are going to cause a production outage. I have reopened this ticket and raised the severity until that problem is fixed. I would be supremely unhappy with SchedMD if I scheduled a no-downtime upgrade that is specifically listed as supported in the upgrade guide and then ran into this show-stopper.

Hi Ryan.
I understand your concern here with wanting possible critical upgrade issues fixed and/or documented. The likelihood of encountering these two issues is remote, and they are either fixed or a minor issue with a client command that can be easily remedied. So I am a little confused as to why you believe this is a severity 1 issue.

1. The first issue only occurs if you upgrade from 23.02 with slurmstepd's that are suspended, or if you suspend those 23.02 steps after upgrading. This was fixed in the following commit: https://github.com/SchedMD/slurm/commit/78e674911830dfbd8a3f50ff775ea784f012a88f This workflow is not common, so we expect very little impact from this issue.

2. The second issue deals with the -M option. It only affects 23.02 when upgrading to 23.11 or from 23.02 -> 24.05. We refactored the way plugins work to fix several issues. While unfortunate, fixing those issues meant that we could not preserve "-M" switch option compatibility from 23.02. This was not called out in the 23.11 release notes as clearly as it could have been, and I have already opened a conversation internally about it.

With that in mind, it should not be an issue for most sites. Should a site encounter this, then updating the client command would be needed. This does not affect Slurm's ability to schedule, start, or complete workload. If your site has encountered this issue, please open a new ticket and we can continue the discussion there.

I am a bit hesitant to document this in the upgrade guide, or to start that pattern, since we tend to handle these types of updates in the RELEASE_NOTES. As stated above, making this clearer in the RELEASE_NOTES document would be my preference, and I do have a task to go and chat with folks internally about it. But I do not see why we would need to keep this issue open at a severity 1 cluster-down status for a site that is not related to your cluster/site.
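The "-M" incompatibility described above can be caught ahead of a staged upgrade with a simple version-skew check. The following is a hypothetical sketch, not a SchedMD-provided tool; the version strings are hard-coded for illustration, where in practice a site might take them from `sbatch --version` and from the cluster records in slurmdbd:

```shell
#!/bin/sh
# Hypothetical pre-upgrade check: warn if client commands and slurmdbd differ
# in major release, since 23.02 clients lose "-M" against 23.11+/24.05 daemons.
client_ver="23.02.7"   # placeholder; e.g. from `sbatch --version`
dbd_ver="24.05.3"      # placeholder; the slurmdbd release being upgraded to

# Extract the "major.minor" release, e.g. "23.02" from "23.02.7".
major() { printf '%s\n' "$1" | cut -d. -f1-2; }

if [ "$(major "$client_ver")" != "$(major "$dbd_ver")" ]; then
    echo "WARNING: client $client_ver vs slurmdbd $dbd_ver: -M may not work"
fi
```

Running this with the placeholder versions prints the warning, flagging the skew before any user hits it in production.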
RELEASE_NOTES, upgrade guide... the point is that right now it's in neither place. You basically have to know about this bug report in order to avoid stepping in this, and there doesn't seem to be any urgency from SchedMD to remedy that. Are you aware of another place where a site that is planning to upgrade right now could find this information, apart from VUB's warning to us on the slurm-users mailing list?

Sites have workloads that rely on -M, which I know because mine is one of them. It's common to see this for applications like Open OnDemand in a multi-cluster environment, and some user scripts use it for one reason or another. You can't just discover that something might break someone's production environment and not notify your user community.

Again, I would be really aggravated if I discovered that a documented upgrade process was broken, that the vendor was fully aware of it and chose not to note it anywhere, and that I now had to figure out on demand whether to do a client upgrade I had not planned or budgeted time for -- because the documentation explicitly states that it's permissible to do it in stages in the right order -- or roll back the upgrade entirely.

Hello, Thanks for your message. I am out of the office from 27/Sep - 7/Oct 2024. Regards