Ticket 16187

Summary: Jobs stuck at CG state & also Unable to run interactive jobs even after slurm update
Product: Slurm Reporter: RCC SysAdmin <operator>
Component: Scheduling Assignee: Nate Rini <nate>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: operator
Version: 23.02.0   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=16004
Site: University of Chicago Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: CentOS
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm controller logs

Description RCC SysAdmin 2023-03-03 12:23:02 MST
Hi Team,

We are facing multiple issues with Slurm here at RCC, University of Chicago.
(1) Jobs are getting stuck in the CG (completing) state; see ticket #16004 for reference.
As recommended, we upgraded Slurm to the latest version, 23.02, but we are still facing the same issue.

(2) After the upgrade we encountered another issue: interactive jobs throw authentication errors, while batch jobs work fine. Both methods worked fine in the previous version, 18.08.

Example:
$ sinteractive -p broadwl --account=pi-mstephens
Submitted batch job 26162436
Access denied: user pcarbo (uid=69822) has no active jobs on this node.
Authentication failed.

(3) The Slurm controller is reporting "slurm_unpack_received_msg" errors in the slurmctld logs. We are not sure what these error messages relate to.

Errors in logs:
[2023-03-03T13:14:54.940] error: slurm_unpack_received_msg: [[midway2-0591.rcc.local]:34116] Message receive failure
[2023-03-03T13:14:54.950] error: slurm_receive_msg [10.50.223.79:34116]: Message receive failure
[2023-03-03T13:14:55.252] error: unpack_header: protocol_version 8448 not supported
[2023-03-03T13:14:55.252] error: unpacking header
[2023-03-03T13:14:55.252] error: destroy_forward: no init


Since these issues severely impact our research activities, we would greatly appreciate your attention and prompt support. Logs are attached for your reference.


Thanks
HPC System Admin
RCC, University of Chicago
Comment 1 RCC SysAdmin 2023-03-03 12:30:52 MST
Created attachment 29136 [details]
slurm controller logs
Comment 2 Jason Booth 2023-03-03 12:55:44 MST
Did you upgrade directly from 18.08, and were there jobs in the queue or jobs left running?
Comment 3 RCC SysAdmin 2023-03-03 13:00:26 MST
We did not upgrade directly. We stepped through the intermediate versions to reach the target version:
18.08 > 19.05 > 20.02 > 20.11 > 21.08 > 22.05 > 23.02

None of the previously running jobs could be recovered, so they were resubmitted.
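
Note: the upgrade chain reported in this comment can be sanity-checked with a short sketch like the one below. It assumes that a single Slurm upgrade may span at most two major releases, which is how the upgrade guidance for these versions is commonly described; `MAX_STEP` and the helper name `check_upgrade_chain` are illustrative and not part of Slurm.

```python
# Major Slurm releases between 18.08 and 23.02, oldest first.
RELEASES = ["18.08", "19.05", "20.02", "20.11", "21.08", "22.05", "23.02"]

# Assumption: one upgrade may move forward by at most two entries in RELEASES.
MAX_STEP = 2


def check_upgrade_chain(chain, releases=RELEASES, max_step=MAX_STEP):
    """Return the (src, dst) hops that go backwards or jump too far."""
    bad_hops = []
    for src, dst in zip(chain, chain[1:]):
        step = releases.index(dst) - releases.index(src)
        if step < 1 or step > max_step:
            bad_hops.append((src, dst))
    return bad_hops


# The chain exactly as written in the ticket.
reported = "18.08 > 19.05>20.02>20.11>21.08>22.05>23.02"
chain = [v.strip() for v in reported.split(">")]

print(check_upgrade_chain(chain))               # [] (every hop is one release)
print(check_upgrade_chain(["18.08", "23.02"]))  # [('18.08', '23.02')]
```

Under that assumption, every hop in the reported chain is a single release step, so the path itself would not explain the jobs that could not be recovered.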
Comment 5 Nate Rini 2023-03-03 14:49:49 MST
Please attach slurm.conf and related files as I assume they have changed since the last ticket.

(In reply to RCC SysAdmin from comment #0)
> (2) After upgrade another issue encountered that interactive jobs throwing
> authentication errors while batch jobs working fine. Both methods were
> working fine in previous version 18.08.
> 
> Example:
> $ sinteractive -p broadwl --account=pi-mstephens
> Submitted batch job 26162436
> Access denied: user pcarbo (uid=69822) has no active jobs on this node.
> Authentication failed.

Is `LaunchParameters=use_interactive_step` set in slurm.conf?
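
One way to answer this question from the configuration file is sketched below. It is a minimal parser, not Slurm's own: it assumes `LaunchParameters` sits on its own line, treats the key as case-insensitive, lets the last occurrence win, and does not follow `Include` files. The function name `launch_parameters` and the sample text are hypothetical.

```python
def launch_parameters(conf_text):
    """Return the LaunchParameters values found in slurm.conf-style text."""
    values = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip().lower() == "launchparameters":
            # Assumption: a later definition replaces an earlier one.
            values = [v.strip() for v in value.split(",") if v.strip()]
    return values


# Hypothetical excerpt, not the site's real configuration.
sample = """
ClusterName=example
LaunchParameters=use_interactive_step
"""

print("use_interactive_step" in launch_parameters(sample))  # True
```

On a live system, `scontrol show config` reports the value the controller is actually using, which is more reliable than reading the file.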

Is `sinteractive` from this project?
> https://github.com/sdsc/sinteractive

Do normal interactive jobs work?
> salloc -X11 -p broadwl --account=pi-mstephens uptime
 

> (3) Slurm controller reporting "slurm_unpack_received_msg" in slurmctld
> logs. Not sure to what these error messages are related.
> [2023-03-03T13:14:55.252] error: unpack_header: protocol_version 8448 not
> supported

These errors are created when an 18.08 binary attempts to communicate with the controllers. During the upgrade, were the old binaries removed?
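
To find which hosts are still sending the old protocol, the log lines quoted above can be tallied with a sketch like the following. The regular expressions are written against the message formats shown in this ticket only. The decoding step assumes the protocol version packs a release index in the high byte; index 33 for 18.08 is consistent with the statement above but should be treated as an assumption here.

```python
import re
from collections import Counter

# e.g. "slurm_receive_msg [10.50.223.79:34116]: Message receive failure"
RECV_FAIL = re.compile(r"slurm_receive_msg \[([^\]]+):\d+\]: Message receive failure")
# e.g. "unpack_header: protocol_version 8448 not supported"
BAD_PROTO = re.compile(r"unpack_header: protocol_version (\d+) not supported")


def summarize(log_lines):
    """Count failing peer addresses and rejected protocol versions."""
    peers, versions = Counter(), Counter()
    for line in log_lines:
        m = RECV_FAIL.search(line)
        if m:
            peers[m.group(1)] += 1
        m = BAD_PROTO.search(line)
        if m:
            versions[int(m.group(1))] += 1
    return peers, versions


# Lines copied from the ticket description.
log = [
    "[2023-03-03T13:14:54.950] error: slurm_receive_msg [10.50.223.79:34116]: Message receive failure",
    "[2023-03-03T13:14:55.252] error: unpack_header: protocol_version 8448 not supported",
]

peers, versions = summarize(log)
print(dict(peers))        # {'10.50.223.79': 1}
print(dict(versions))     # {8448: 1}
print(divmod(8448, 256))  # (33, 0): high byte 33, low byte 0
```

Running this over the full slurmctld log gives a list of addresses to check for leftover old binaries or daemons.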
Comment 6 Nate Rini 2023-03-07 09:00:23 MST
Reducing ticket severity while waiting for responses to questions in comment#5.
Comment 7 RCC SysAdmin 2023-03-07 13:16:07 MST
Hi Team,

We were able to resolve these issues:

(1) Cleared the old jobs and asked users to resubmit them; there are no more jobs stuck in CG.

(2) sinteractive was fixed by copying pam_slurm.so and pam_slurm_adopt.so to /usr/lib64/security.

(3) "protocol_version 8448 not supported" resolved by unlinking all previous versions of slurm from existing slurm executables.
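
Note: a check along these lines can confirm that no client command on a node still resolves to an older build. It is a sketch that assumes each command prints a line of the form `slurm <version>` for `--version`; the helper names and the sample data are hypothetical, and `collect()` only reports commands found on the current `PATH`.

```python
import re
import shutil
import subprocess

VERSION_RE = re.compile(r"^slurm\s+(\S+)")


def parse_version(output):
    """Extract the version from text such as 'slurm 23.02.0'."""
    m = VERSION_RE.match(output.strip())
    return m.group(1) if m else None


def mismatched(outputs, expected_prefix):
    """Return commands whose reported version does not start with expected_prefix.

    outputs maps a command name to the text printed by `<command> --version`.
    """
    bad = {}
    for cmd, text in outputs.items():
        version = parse_version(text)
        if version is None or not version.startswith(expected_prefix):
            bad[cmd] = version
    return bad


def collect(commands=("sinfo", "squeue", "srun", "salloc", "sbatch", "scontrol")):
    """Run `<command> --version` for each command found on PATH."""
    outputs = {}
    for cmd in commands:
        path = shutil.which(cmd)
        if path is None:
            continue
        result = subprocess.run([path, "--version"], capture_output=True, text=True)
        outputs[cmd] = result.stdout
    return outputs


# Hypothetical sample: one current command and one stale command.
sample = {"sinfo": "slurm 23.02.0\n", "srun": "slurm 18.08.8\n"}
print(mismatched(sample, "23.02"))  # {'srun': '18.08.8'}
```

On a node, `mismatched(collect(), "23.02")` would return an empty dictionary once every command resolves to the upgraded installation.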


Thank you for your support. Please proceed with closing this case.


Regards
RCC HPC Admins
Comment 8 Nate Rini 2023-03-07 13:43:38 MST
Closing per the last response.