Hi Team,

We are facing multiple issues with Slurm here at RCC, University of Chicago.

(1) Jobs were getting stuck in the CG (completing) state; see bug ticket #16004. As recommended, we upgraded Slurm to the latest version, 23.02, but we still see the same issue.

(2) After the upgrade we hit another issue: interactive jobs throw authentication errors, while batch jobs work fine. Both methods worked in the previous version, 18.08.

Example:
$ sinteractive -p broadwl --account=pi-mstephens
Submitted batch job 26162436
Access denied: user pcarbo (uid=69822) has no active jobs on this node.
Authentication failed.

(3) The Slurm controller is reporting "slurm_unpack_received_msg" errors in the slurmctld logs. We are not sure what these error messages relate to.

Errors in logs:
[2023-03-03T13:14:54.940] error: slurm_unpack_received_msg: [[midway2-0591.rcc.local]:34116] Message receive failure
[2023-03-03T13:14:54.950] error: slurm_receive_msg [10.50.223.79:34116]: Message receive failure
[2023-03-03T13:14:55.252] error: unpack_header: protocol_version 8448 not supported
[2023-03-03T13:14:55.252] error: unpacking header
[2023-03-03T13:14:55.252] error: destroy_forward: no init

Since these issues severely impact our research activities, we would highly appreciate your attention and prompt support. Logs are attached for your reference.

Thanks,
HPC System Admin
RCC, University of Chicago
Created attachment 29136 [details] slurm controller logs
Did you upgrade directly from 18.08, and were there jobs in the queue or jobs left running?
No, we did not upgrade directly. We stepped through the intermediate versions to reach the target version: 18.08 > 19.05 > 20.02 > 20.11 > 21.08 > 22.05 > 23.02. None of the previously running jobs could be recovered, so they were resubmitted.
Please attach slurm.conf and related files, as I assume they have changed since the last ticket.

(In reply to RCC SysAdmin from comment #0)
> (2) After upgrade another issue encountered that interactive jobs throwing
> authentication errors while batch jobs working fine. Both methods were
> working fine in previous version 18.08.
>
> Example:
> $ sinteractive -p broadwl --account=pi-mstephens
> Submitted batch job 26162436
> Access denied: user pcarbo (uid=69822) has no active jobs on this node.
> Authentication failed.

Is `LaunchParameters=use_interactive_step` set in slurm.conf?

Is `sinteractive` from this project?
> https://github.com/sdsc/sinteractive

Do normal interactive jobs work?
> salloc -X11 -p broadwl --account=pi-mstephens uptime

> (3) Slurm controller reporting "slurm_unpack_received_msg" in slurmctld
> logs. Not sure to what these error messages are related.
> [2023-03-03T13:14:55.252] error: unpack_header: protocol_version 8448 not
> supported

These errors are created when an 18.08 binary attempts to communicate with the controllers. During the upgrade, were the old binaries removed?
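The link between `protocol_version 8448` and an 18.08 binary can be checked with a small script. The sketch below assumes that Slurm packs the protocol version as a major byte shifted left by 8 bits, and that the major-byte values in the table match the `SLURM_*_PROTOCOL_VERSION` macros in the Slurm source; both should be verified against your own source tree. The function names `decode_protocol_version` and `unsupported_versions` are made up for this illustration and are not part of Slurm.

```python
import re

# ASSUMPTION: major-byte values taken from the SLURM_*_PROTOCOL_VERSION
# macros in the Slurm source. Verify against the tree you actually run.
PROTOCOL_MAJOR_TO_RELEASE = {
    33: "18.08",
    34: "19.05",
    35: "20.02",
    36: "20.11",
    37: "21.08",
    38: "22.05",
    39: "23.02",
}

# Matches the slurmctld error line quoted in this ticket.
VERSION_RE = re.compile(r"protocol_version (\d+) not supported")


def decode_protocol_version(value):
    """Map a numeric protocol version to a release name (hypothetical helper)."""
    major = value >> 8  # the release is carried in the high byte
    return PROTOCOL_MAJOR_TO_RELEASE.get(major, f"unknown (major byte {major})")


def unsupported_versions(log_lines):
    """Count rejected protocol versions, keyed by decoded release name."""
    counts = {}
    for line in log_lines:
        match = VERSION_RE.search(line)
        if match:
            release = decode_protocol_version(int(match.group(1)))
            counts[release] = counts.get(release, 0) + 1
    return counts
```

Feeding the slurmctld log through `unsupported_versions` gives a quick count of which old releases are still contacting the controller, which helps confirm whether stale binaries remain on any node.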
Reducing ticket severity while waiting for responses to the questions in comment #5.
Hi Team,

We were able to resolve these issues:

(1) Cleared the old jobs and asked users to resubmit them; no more CG errors.
(2) sinteractive was fixed by copying pam_slurm.so and pam_slurm_adopt.so to /usr/lib64/security.
(3) "protocol_version 8448 not supported" was resolved by unlinking all previous versions of Slurm from the existing Slurm executables.

Thank you for your support. Please proceed with closing this case.

Regards,
RCC HPC Admins
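One way to confirm that no older Slurm executables remain reachable after a cleanup like the one in (3) is to resolve each client command in every search directory and look for more than one distinct target. This is a minimal sketch, not a Slurm tool: the helper names `find_slurm_commands` and `duplicated` are hypothetical, and the list of command names is only an assumed sample of the usual client binaries.

```python
import os

# ASSUMPTION: a sample of common Slurm client commands; extend as needed.
DEFAULT_NAMES = ("srun", "sbatch", "salloc", "sinfo", "squeue", "scontrol")


def find_slurm_commands(path_dirs, names=DEFAULT_NAMES):
    """Return {command: [resolved paths]} for executables found in path_dirs."""
    found = {}
    for name in names:
        targets = []
        for directory in path_dirs:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                # Follow symlinks so two links to one binary count once.
                real = os.path.realpath(candidate)
                if real not in targets:
                    targets.append(real)
        if targets:
            found[name] = targets
    return found


def duplicated(found):
    """Keep only commands that resolve to more than one distinct binary."""
    return {name: paths for name, paths in found.items() if len(paths) > 1}
```

Calling `duplicated(find_slurm_commands(os.environ["PATH"].split(os.pathsep)))` on a login or compute node should return an empty dict once only one installation is left on the search path.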
Closing per the last response.