Ticket 18732 - Issues After Slurm Upgrade
Summary: Issues After Slurm Upgrade
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 23.02.0
Hardware: Cray CS Linux
Severity: 4 - Minor Issue
Assignee: Ben Glines
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2024-01-19 12:20 MST by Jonathan Avalishvili
Modified: 2024-02-13 12:47 MST

See Also:
Site: CRAY
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: VIRGINIA TECH
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Messages and SLURM logs (145.99 KB, application/x-gzip)
2024-01-19 12:42 MST, Jonathan Avalishvili
messages log (57.70 MB, application/x-gzip)
2024-01-19 12:50 MST, Jonathan Avalishvili
slurmdbd log (2.98 MB, application/x-gzip)
2024-01-19 12:50 MST, Jonathan Avalishvili

Description Jonathan Avalishvili 2024-01-19 12:20:42 MST
Customer: Virginia Tech
Asset: Cray CS 500 System
Serial Number: 10011763

Customer Contact: Bill Marmagas
zorba@vt.edu


I followed the procedure in https://kb.brightcomputing.com/knowledge-base/upgrading-slurm/ to perform an intermediate Slurm upgrade from 20.11.9 to 22.05.11, and then a second upgrade from 22.05.11 to 23.02.x. I did it in two steps since one cannot go directly from 20.11 to 23.02. The upgrade process itself seemed to go fine.
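[Editor's note: the two-step requirement follows from Slurm's compatibility window, in which daemons and saved state are supported at most two major releases back. A minimal sketch of planning such an upgrade path (the release list and helper are illustrative, not part of Slurm):

```python
# Illustrative sketch: plan a Slurm upgrade path that never jumps more
# than two major releases at once (Slurm's stated support window).
RELEASES = ["20.02", "20.11", "21.08", "22.05", "23.02"]

def upgrade_path(src, dst, max_jump=2):
    """Return the intermediate releases to install, in order,
    jumping at most `max_jump` major releases per step."""
    i, j = RELEASES.index(src), RELEASES.index(dst)
    path = []
    while i < j:
        i = min(i + max_jump, j)  # largest allowed hop toward the target
        path.append(RELEASES[i])
    return path

# 20.11 -> 23.02 is three majors apart, so two hops are required:
print(upgrade_path("20.11", "23.02"))
```

This reproduces the reporter's actual path (20.11 -> 22.05 -> 23.02).]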

However, when I tried releasing the pending jobs by deleting the maintenance reservation, the nodes got drained and the jobs got re-queued in less than a minute with a reason of "Prolog error ." Here are related error messages from the slurmctld log for two jobs:

[2024-01-19T13:07:53.112] _slurm_rpc_resv_delete complete for HPCMaintenance usec=2385

[2024-01-19T13:07:53.383] sched: Allocate JobId=1940124 NodeList=tc-hm001 #CPUs=4 Partition=largemem_q

[2024-01-19T13:07:53.615] prolog_running_decr: Configuration for JobId=1940124 is complete

[2024-01-19T13:07:53.628] error: pack_msg: Invalid message version=9216, type:6017

[2024-01-19T13:07:53.628] error: auth_g_pack: protocol_version 9216 not supported

[2024-01-19T13:07:53.628] error: slurm_buffers_pack_msg: auth_g_pack: REQUEST_LAUNCH_PROLOG has authentication error: No error

[2024-01-19T13:08:10.216] sched/backfill: _start_job: Started JobId=1940585 in largemem_q on tc-hm001

[2024-01-19T13:08:10.424] prolog_running_decr: Configuration for JobId=1940585 is complete

[2024-01-19T13:08:10.457] error: pack_msg: Invalid message version=9216, type:6017

[2024-01-19T13:08:10.457] error: auth_g_pack: protocol_version 9216 not supported

[2024-01-19T13:08:10.457] error: slurm_buffers_pack_msg: auth_g_pack: REQUEST_LAUNCH_PROLOG has authentication error: No error

[2024-01-19T13:08:10.464] Node tc-hm001 now responding

[2024-01-19T13:08:16.020] sched: Created reservation=HPCMaintenance accounts=sysadmin,arcadm nodes=tc[001-308],tc-dgx[001-010],tc-gpu[001-004],tc-hm[001-008],tc-intel[001-016] cores=43776 licenses=(null) tres=cpu=43776 watts=4294967294 start=2024-01-19T13:08:16 end=2025-01-18T13:08:16 MaxStartDelay= Comment=

[2024-01-19T13:08:19.403] Batch JobId=1940124 missing from batch node tc-hm001 (not found BatchStartTime after startup), Requeuing job

[2024-01-19T13:08:19.403] _job_complete: JobId=1940124 WTERMSIG 126

[2024-01-19T13:08:19.403] _job_complete: JobId=1940124 cancelled by node failure

[2024-01-19T13:08:19.403] _job_complete: requeue JobId=1940124 due to node failure

[2024-01-19T13:08:19.403] _job_complete: JobId=1940124 done

[2024-01-19T13:08:20.431] Requeuing JobId=1940124

[2024-01-19T13:08:40.272] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1935569

[2024-01-19T13:08:40.272] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1935571

[2024-01-19T13:08:40.272] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1935572

[2024-01-19T13:08:40.272] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1935574

[2024-01-19T13:08:40.273] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1935575

[2024-01-19T13:08:40.273] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1939847

[2024-01-19T13:08:40.273] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1939848

[2024-01-19T13:08:40.277] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1935386

[2024-01-19T13:08:40.277] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1935384

[2024-01-19T13:08:40.277] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1937160

[2024-01-19T13:09:10.345] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1935569

[2024-01-19T13:09:10.346] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1935571

[2024-01-19T13:09:10.346] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1935572

[2024-01-19T13:09:10.346] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1935574

[2024-01-19T13:09:10.346] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1935575

[2024-01-19T13:09:10.346] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1939847

[2024-01-19T13:09:10.347] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1939848

[2024-01-19T13:09:10.350] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1935386

[2024-01-19T13:09:10.350] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1935384

[2024-01-19T13:09:10.351] error: cons_res: dist_tasks_compute_c_b oversubscribe for JobId=1937160

[2024-01-19T13:09:10.464] error: validate_node_specs: Prolog or job env setup failure on node tc-hm001, draining the node

[2024-01-19T13:09:10.464] drain_nodes: node tc-hm001 state set to DRAIN

[2024-01-19T13:09:11.434] sched: Updated reservation=HPCMaintenance accounts=sysadmin,arcadm nodes=tc[001-308],tc-dgx[001-010],tc-gpu[001-004],tc-hm[001-008],tc-intel[001-016] cores=43776 licenses=(null) tres=cpu=43776 watts=4294967294 start=2024-01-19T13:08:16 end=2025-01-18T13:08:16 MaxStartDelay= Comment=

[2024-01-19T13:09:11.494] Requeuing JobId=1940585
Comment 1 Jonathan Avalishvili 2024-01-19 12:42:56 MST
Created attachment 34193 [details]
Messages and SLURM logs

Asked customer to provide logs. Uploading to this case.
Comment 2 Jonathan Avalishvili 2024-01-19 12:50:15 MST
Created attachment 34194 [details]
messages log

messages log
Comment 3 Jonathan Avalishvili 2024-01-19 12:50:57 MST
Created attachment 34195 [details]
slurmdbd log

slurmdbd log
Comment 4 Jonathan Avalishvili 2024-01-19 12:51:56 MST
Note that we are experiencing this issue on two clusters. The first one requeued all of its existing jobs. The second cluster is this one, and I have only let two existing jobs re-queue so far. If we run out of time, we may let them all re-queue and ask users to resubmit; new jobs seem to run without these errors, so I suspect only jobs submitted before the upgrade are affected.
Comment 5 Jonathan Avalishvili 2024-01-19 13:35:09 MST
I updated the severity of this case since the customer opened their case with us as a critical, system-down issue.
Comment 6 Ben Glines 2024-01-19 13:57:02 MST
Were there any running jobs at the time of the upgrade? If there are old slurmstepd processes from 20.11 trying to talk to the newer 23.02 slurmctld, they won't be able to communicate.
Comment 7 Ben Glines 2024-01-19 17:04:30 MST
Disregard my last reply. I am able to reproduce your issue by following the same upgrade sequence that you did, and I see that any pending jobs from 20.11 are terminated.

Unfortunately, I don't see any way to prevent this issue and save all of your pending jobs other than fixing the bug with a patch. I may be able to formulate one, but it will take time to test and review before I can send it over to you.

I can't promise the patch will save your pending jobs, but I can certainly try if you'd like me to. It would require you to apply my patch to Slurm's source code and then rebuild Slurm from the patched source.
Comment 9 Jonathan Avalishvili 2024-01-19 17:44:14 MST
There were no jobs running. We had a reservation in place to hold jobs in pending, and I verified there were no jobs in R or CG states.

Nodes were rebooted before the first Slurm upgrade and then drained; the daemons were shut down, old packages removed, new ones installed, and all daemons brought back up. The same sequence was followed for the second Slurm upgrade.

The slurmdbd was not up after the first upgrade as it was getting upgraded as well.

We decided to let the old jobs run, fail, and requeue, and to ask users to cancel and resubmit. New jobs seem to be fine so far.

Since we decided to release these systems (we’re already a day past maintenance window), we can drop the severity of this case down a notch.

We do want to keep this open, however, in case we do run into new issues with new jobs in the short term, and also to get to a root cause.
Comment 10 Ben Glines 2024-01-20 19:56:19 MST
(In reply to Jonathan Avalishvili from comment #9)
> We decided to let the old jobs run and fail and requeue and ask users to
> cancel and resubmit. New jobs seem to be fine so far.
I'm sorry you had to resort to that. I'll be tracking this bug so we can resolve the issue for other sites that run into it, so thank you for reporting this behavior to us.

> Since we decided to release these systems (we’re already a day past
> maintenance window), we can drop the severity of this case down a notch.
Setting the severity to 4.

> We do want to keep this open, however, in case we do run into new issues
> with new jobs in the short term, and also to get to a root cause.
Okay, sounds good. I haven't noticed any other issues with newly submitted jobs in my testing, but we can leave this ticket open. The root cause is a Slurm bug, not anything you did wrong.

In case you're interested in the details: the protocol version of the RPC that slurmctld sends to slurmd to launch a job's prolog ends up being whatever version the job was originally submitted under. In your case, the REQUEST_LAUNCH_PROLOG message was packed with the 20.11 protocol version (the 9216 in your log), since that is the version the job was submitted under. The message is packed on the slurmctld before being sent out to the slurmd, but your slurmctld is 23.02, and since 20.11 is more than two major versions older than 23.02 (20.11 -> 21.08 -> 22.05 -> 23.02), the RPC fails and the job is terminated. The fix itself is simple, but I need to do more testing to ensure there are no unintended consequences.
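[Editor's note: the 9216 in the log is consistent with Slurm's encoding of protocol versions as (major << 8) | minor, where 20.11 carries protocol major 36 (36 << 8 = 9216). A minimal sketch of the compatibility check described above; the constant and function names here are mine, not Slurm's:

```python
# Slurm encodes each release's RPC protocol version as (major << 8) | minor.
# These values follow that scheme; 36 << 8 == 9216 matches the error log
# ("protocol_version 9216 not supported"). Names are illustrative.
SLURM_20_11 = (36 << 8) | 0   # 9216, the version the stuck jobs carried
SLURM_21_08 = (37 << 8) | 0
SLURM_22_05 = (38 << 8) | 0
SLURM_23_02 = (39 << 8) | 0

def message_version_supported(msg_version, running=SLURM_23_02, window=2):
    """A daemon accepts messages at most `window` major releases
    older than itself; anything older fails to pack/unpack."""
    oldest_supported = running - (window << 8)  # two majors back: 21.08
    return msg_version >= oldest_supported

# 20.11 is three majors behind 23.02, so its messages are rejected,
# which is what auth_g_pack reported for version 9216.
print(message_version_supported(SLURM_20_11))
```

This is why jobs submitted under 20.11 failed at prolog launch while newly submitted jobs, packed with the 23.02 protocol version, ran normally.]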
Comment 11 Jonathan Avalishvili 2024-01-22 08:08:11 MST
We can close out this ticket. Thank you for your assistance.
Comment 12 Ben Glines 2024-01-22 09:47:10 MST
Sounds good, closing now.