Ticket 16187 - Jobs stuck at CG state & also Unable to run interactive jobs even after slurm update
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 23.02.0
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Nate Rini
Reported: 2023-03-03 12:23 MST by RCC SysAdmin
Modified: 2023-03-07 13:43 MST
Site: University of Chicago
Linux Distro: CentOS


Attachments
slurm controller logs (43.27 MB, application/gzip)
2023-03-03 12:30 MST, RCC SysAdmin
Description RCC SysAdmin 2023-03-03 12:23:02 MST
Hi Team,

We are facing multiple issues with Slurm here at RCC, University of Chicago.
(1) Jobs are getting stuck in the CG (completing) state; see ticket #16004 for reference.
As recommended, we upgraded Slurm to the latest version, 23.02, but we are still facing the same issue.

(2) After the upgrade we hit another issue: interactive jobs fail with authentication errors, while batch jobs work fine. Both worked fine in the previous version, 18.08.

Example:
$ sinteractive -p broadwl --account=pi-mstephens
Submitted batch job 26162436
Access denied: user pcarbo (uid=69822) has no active jobs on this node.
Authentication failed.

(3) The Slurm controller is reporting "slurm_unpack_received_msg" errors in the slurmctld logs. We are not sure what these error messages relate to.

Errors in logs:
[2023-03-03T13:14:54.940] error: slurm_unpack_received_msg: [[midway2-0591.rcc.local]:34116] Message receive failure
[2023-03-03T13:14:54.950] error: slurm_receive_msg [10.50.223.79:34116]: Message receive failure
[2023-03-03T13:14:55.252] error: unpack_header: protocol_version 8448 not supported
[2023-03-03T13:14:55.252] error: unpacking header
[2023-03-03T13:14:55.252] error: destroy_forward: no init
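
A minimal sketch for tallying which peers and protocol versions show up in errors like the ones above, so the offending hosts can be checked first. The regular expressions are written against the five log lines quoted in this ticket only; other slurmctld versions may format these messages differently, and the function and variable names are hypothetical.

```python
import re
from collections import Counter

# Patterns derived from the log lines quoted in this ticket (an assumption:
# other Slurm releases may word or bracket these messages differently).
PEER_RE = re.compile(
    r"error: slurm_unpack_received_msg: \[\[(?P<host>[^\]]+)\]:(?P<port>\d+)\]"
)
VERSION_RE = re.compile(
    r"error: unpack_header: protocol_version (?P<version>\d+) not supported"
)

# Sample input copied from the ticket description.
SAMPLE_LOG = """\
[2023-03-03T13:14:54.940] error: slurm_unpack_received_msg: [[midway2-0591.rcc.local]:34116] Message receive failure
[2023-03-03T13:14:54.950] error: slurm_receive_msg [10.50.223.79:34116]: Message receive failure
[2023-03-03T13:14:55.252] error: unpack_header: protocol_version 8448 not supported
[2023-03-03T13:14:55.252] error: unpacking header
[2023-03-03T13:14:55.252] error: destroy_forward: no init
""".splitlines()


def summarize(lines):
    """Count sending hosts and rejected protocol versions in slurmctld log lines."""
    peers = Counter()
    versions = Counter()
    for line in lines:
        peer = PEER_RE.search(line)
        if peer:
            peers[peer.group("host")] += 1
        version = VERSION_RE.search(line)
        if version:
            versions[int(version.group("version"))] += 1
    return peers, versions


if __name__ == "__main__":
    peers, versions = summarize(SAMPLE_LOG)
    print("peers:", dict(peers))
    print("rejected protocol versions:", dict(versions))
```

The two counters are kept separate on purpose: the log does not tie a rejected protocol version to a specific host, so pairing them by timestamp would be a guess.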


Since these issues severely impact our research activities, we would greatly appreciate your attention and prompt support. Logs are attached for your reference.


Thanks
HPC System Admin
RCC, University of Chicago
Comment 1 RCC SysAdmin 2023-03-03 12:30:52 MST
Created attachment 29136 [details]
slurm controller logs
Comment 2 Jason Booth 2023-03-03 12:55:44 MST
Did you upgrade directly from 18.08, and were there jobs in the queue or jobs left running?
Comment 3 RCC SysAdmin 2023-03-03 13:00:26 MST
We did not upgrade directly. We stepped through each intermediate version to reach the target version:
18.08 > 19.05 > 20.02 > 20.11 > 21.08 > 22.05 > 23.02

None of the previously running jobs could be recovered, so they were resubmitted.
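
A rough sketch of how an upgrade path like this one could be sanity-checked before starting. It assumes the release list below (the seven versions named in this ticket) and a limit of two major releases per hop, which is how Slurm's upgrade guidance of that era is commonly summarized; confirm the limit against the upgrade documentation for the target release. The names `RELEASES` and `check_upgrade_path` are hypothetical.

```python
# Major releases in order, taken from the upgrade path listed in this ticket.
RELEASES = ["18.08", "19.05", "20.02", "20.11", "21.08", "22.05", "23.02"]


def check_upgrade_path(path, max_hop=2):
    """Return the hops in `path` that skip more than `max_hop` major releases.

    max_hop=2 is an assumption about Slurm's supported upgrade window,
    not something stated in this ticket.
    """
    bad_hops = []
    for src, dst in zip(path, path[1:]):
        hop = RELEASES.index(dst) - RELEASES.index(src)
        if hop < 1 or hop > max_hop:
            bad_hops.append((src, dst, hop))
    return bad_hops


if __name__ == "__main__":
    stepped = ["18.08", "19.05", "20.02", "20.11", "21.08", "22.05", "23.02"]
    print(check_upgrade_path(stepped))             # every hop is one release
    print(check_upgrade_path(["18.08", "23.02"]))  # a direct jump is flagged
```

Under these assumptions the stepped path passes, while a direct jump from 18.08 to 23.02 is reported as a single hop of six releases.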
Comment 5 Nate Rini 2023-03-03 14:49:49 MST
Please attach slurm.conf and related files as I assume they have changed since the last ticket.

(In reply to RCC SysAdmin from comment #0)
> (2) After upgrade another issue encountered that interactive jobs throwing
> authentication errors while batch jobs working fine. Both methods were
> working fine in previous version 18.08.
> 
> Example:
> $ sinteractive -p broadwl --account=pi-mstephens
> Submitted batch job 26162436
> Access denied: user pcarbo (uid=69822) has no active jobs on this node.
> Authentication failed.

Is `LaunchParameters=use_interactive_step` set in slurm.conf?

Is `sinteractive` from this project?
> https://github.com/sdsc/sinteractive

Do normal interactive jobs work?
> salloc -X11 -p broadwl --account=pi-mstephens uptime
 

> (3) Slurm controller reporting "slurm_unpack_received_msg" in slurmctld
> logs. Not sure to what these error messages are related.
> [2023-03-03T13:14:55.252] error: unpack_header: protocol_version 8448 not
> supported

These errors are created when an 18.08 binary attempts to communicate with the controllers. During the upgrade, were the old binaries removed?
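
For reference, the number in the error can be decoded by hand. The sketch below assumes Slurm stores the protocol generation in the high byte of a 16-bit value, as the `SLURM_*_PROTOCOL_VERSION` macros in the Slurm source appear to do, and it uses a generation-to-release table reproduced from memory; both should be checked against the source tree of the installed version. The function name is hypothetical.

```python
from typing import Tuple

# Assumed mapping from protocol generation (high byte) to Slurm release.
# Reproduced from memory of Slurm's protocol version macros; verify locally.
GENERATION_TO_RELEASE = {
    33: "18.08",
    34: "19.05",
    35: "20.02",
    36: "20.11",
    37: "21.08",
    38: "22.05",
    39: "23.02",
}


def decode_protocol_version(value: int) -> Tuple[int, int, str]:
    """Split a 16-bit protocol_version into (generation, minor, release)."""
    generation = value >> 8   # high byte
    minor = value & 0xFF      # low byte
    release = GENERATION_TO_RELEASE.get(generation, "unknown")
    return generation, minor, release


if __name__ == "__main__":
    print(decode_protocol_version(8448))  # (33, 0, '18.08')
```

With that table, 8448 is 33 << 8, which corresponds to 18.08 and matches the explanation above.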
Comment 6 Nate Rini 2023-03-07 09:00:23 MST
Reducing ticket severity while waiting for responses to questions in comment#5.
Comment 7 RCC SysAdmin 2023-03-07 13:16:07 MST
Hi Team,

We were able to resolve these issues:

(1) Cleared the old jobs and asked users to resubmit them; no more jobs stuck in CG.

(2) Fixed sinteractive by copying pam_slurm.so and pam_slurm_adopt.so to /usr/lib64/security.

(3) Resolved "protocol_version 8448 not supported" by unlinking all previous versions of Slurm from the existing Slurm executables.
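
A small sketch of a post-upgrade check for item (2), assuming the PAM modules are expected under /usr/lib64/security as on this site and that the adopt module is named `pam_slurm_adopt.so`. The function name and default module list are hypothetical, and the check only confirms that the files exist, not that they were built against the running Slurm version.

```python
from pathlib import Path

# Default module names are an assumption based on the fix described above.
DEFAULT_MODULES = ("pam_slurm.so", "pam_slurm_adopt.so")


def missing_pam_modules(security_dir, names=DEFAULT_MODULES):
    """Return the module file names that are not present in security_dir."""
    directory = Path(security_dir)
    return [name for name in names if not (directory / name).is_file()]


if __name__ == "__main__":
    missing = missing_pam_modules("/usr/lib64/security")
    if missing:
        print("missing PAM modules:", ", ".join(missing))
    else:
        print("all expected PAM modules are present")
```

Running it on each compute node after an upgrade would catch the case where new packages were installed but the PAM modules were left behind.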


Thank you for your support. Please proceed with closing this case.


Regards
RCC HPC Admins
Comment 8 Nate Rini 2023-03-07 13:43:38 MST
Closing per the last response.