| Summary: | Issues After Slurm Upgrade | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Jonathan Avalishvili <jonathan.avalishvili> |
| Component: | Other | Assignee: | Ben Glines <ben.glines> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 23.02.0 | | |
| Hardware: | Cray CS | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=18973 | | |
| Site: | CRAY | Cray Sites: | VIRGINIA TECH |
| Attachments: | Messages and SLURM logs; messages log; slurmdbd log | | |
Description
Jonathan Avalishvili
2024-01-19 12:20:42 MST
Created attachment 34193 [details]
Messages and SLURM logs
Asked customer to provide logs. Uploading to this case.
Created attachment 34194 [details]
messages log

Created attachment 34195 [details]
slurmdbd log
Note that we are experiencing this issue on two clusters. The first one requeued all the existing jobs. The second cluster is this one, and I have only let two existing jobs requeue so far. If we run out of time, we may let them all requeue and ask users to resubmit; we seem to be able to get new jobs to run without these errors, so we suspect the problem is limited to old jobs. I updated the severity of this case since the customer opened our case as a critical down.

Ben Glines
Were there any running jobs at the time of the upgrade? If there are old slurmstepd's from 20.11 trying to talk to the newer 23.02 slurmctld, they won't be able to communicate.

Ben Glines
Disregard my last reply. I am able to reproduce your issue by following the same upgrade sequence that you did, and I see that any pending jobs from 20.11 are terminated. Unfortunately, I'm not sure there is any way to prevent this issue and save all of your pending jobs other than fixing the bug with a patch. I may be able to formulate a patch, but it will take time to test and review it before I can send it over for you to apply and rebuild Slurm. I can't promise a patch that will save your pending jobs, but I can certainly try if you'd like me to. It would require you applying my patch to Slurm's source code, and then building Slurm from that patched code.

Jonathan Avalishvili
There were no jobs running. We had a reservation in place to hold jobs in pending, and I verified there were no jobs in R or CG states. Nodes were rebooted before the first Slurm upgrade, then drained, daemons were shut down, old packages removed, new ones installed, and all daemons brought back up. The same sequence was followed for the second Slurm upgrade. The slurmdbd was not up after the first upgrade, as it was being upgraded as well.

Jonathan Avalishvili
We decided to let the old jobs run, fail, and requeue, and to ask users to cancel and resubmit. New jobs seem to be fine so far.
Since we decided to release these systems (we're already a day past the maintenance window), we can drop the severity of this case down a notch. We do want to keep this open, however, in case we run into new issues with new jobs in the short term, and also to get to a root cause.

Ben Glines
(In reply to Jonathan Avalishvili from comment #9)

> We decided to let the old jobs run and fail and requeue and ask users to
> cancel and resubmit. New jobs seem to be fine so far.

That's unfortunate that you had to do that; sorry about that. I'll be tracking this bug to resolve the issue in the future for other sites that run into it, so thank you for reporting this behavior to us.

> Since we decided to release these systems (we're already a day past
> maintenance window), we can drop the severity of this case down a notch.

Setting the severity to 4.

> We do want to keep this open, however, in case we do run into new issues
> with new jobs in the short term, and also to get to a root cause.

Okay, sounds good. I haven't noticed any other issues with new running jobs in my testing, but we can leave this open.

The root cause of this issue appears to be a Slurm bug, not anything you did wrong. In case you're interested in the details: the protocol version for the RPC that the slurmctld sends to the slurmd to start the prolog for a job ends up being whatever version the job was submitted with. In your case, the REQUEST_PROLOG_LAUNCH message was sent with a protocol version of 20.11, since that is the version the job was originally submitted under. The message is packed on the slurmctld before being sent out to the slurmd, but your slurmctld is 23.02, and since 20.11 is more than two major versions older than 23.02 (20.11 -> 21.08 -> 22.05 -> 23.02), the RPC fails and the job is terminated. The fix for this is simple, but I need to do more testing to ensure there are no unintended consequences.

Jonathan Avalishvili
We can close out this ticket. Thank you for your assistance.
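The version-window rule Ben describes can be sketched as follows. This is a minimal illustration, not Slurm source code: the release names are real Slurm major releases, but `is_protocol_compatible` is a hypothetical helper modeling the rule that a peer's protocol version must be no more than two major releases older than the daemon it is talking to.

```python
# Hedged sketch of Slurm's protocol compatibility window (not actual
# Slurm source). Real Slurm encodes protocol versions numerically;
# here we model major releases as an ordered list of strings.

RELEASES = ["20.02", "20.11", "21.08", "22.05", "23.02"]

def is_protocol_compatible(peer: str, current: str) -> bool:
    """A peer packed at `peer` can talk to a daemon at `current` only
    if it is the same release or at most two major releases older."""
    gap = RELEASES.index(current) - RELEASES.index(peer)
    return 0 <= gap <= 2

# A job submitted under 20.11 packs its REQUEST_PROLOG_LAUNCH RPC at
# 20.11; a 23.02 slurmctld is three major releases newer, so the RPC
# fails and the job is terminated:
print(is_protocol_compatible("20.11", "23.02"))  # False (gap of 3)
print(is_protocol_compatible("21.08", "23.02"))  # True (gap of 2)
```

Under this model, jobs submitted under 21.08 or later would have survived the upgrade; only jobs packed at 20.11 fall outside the two-release window of a 23.02 slurmctld.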
Ben Glines
Sounds good, closing now.