| Summary: | Jobs stuck at CG state & also Unable to run interactive jobs even after slurm update | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | RCC SysAdmin <operator> |
| Component: | Scheduling | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | operator |
| Version: | 23.02.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=16004 | ||
| Site: | University of Chicago | Alineos Sites: | --- |
| SNIC sites: | --- | Linux Distro: | CentOS |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm controller logs | ||
Description
RCC SysAdmin
2023-03-03 12:23:02 MST
Created attachment 29136
slurm controller logs

Did you upgrade directly from 18.08, and were there jobs in the queue or jobs left running?

No direct upgrade. We upgraded through the intermediate versions to reach the target version: 18.08 > 19.05 > 20.02 > 20.11 > 21.08 > 22.05 > 23.02. None of the previously running jobs could be recovered, so they were resubmitted.

Please attach slurm.conf and related files, as I assume they have changed since the last ticket.

(In reply to RCC SysAdmin from comment #0)
> (2) After upgrade another issue encountered that interactive jobs throwing
> authentication errors while batch jobs working fine. Both methods were
> working fine in previous version 18.08.
>
> Example:
> $ sinteractive -p broadwl --account=pi-mstephens
> Submitted batch job 26162436
> Access denied: user pcarbo (uid=69822) has no active jobs on this node.
> Authentication failed.

Is `LaunchParameters=use_interactive_step` set in slurm.conf?

Is `sinteractive` from this project?
> https://github.com/sdsc/sinteractive

Do normal interactive jobs work?
> salloc -X11 -p broadwl --account=pi-mstephens uptime

> (3) Slurm controller reporting "slurm_unpack_received_msg" in slurmctld
> logs. Not sure to what these error messages are related.
> [2023-03-03T13:14:55.252] error: unpack_header: protocol_version 8448 not supported

These errors are created when an 18.08 binary attempts to communicate with the controllers. During the upgrade, were the old binaries removed?

Reducing ticket severity while waiting for responses to questions in comment #5.

Hi Team,

We were able to resolve these issues:

(1) Cleared old jobs and asked users to resubmit their jobs; no more CG errors.
(2) sinteractive fixed by copying pam_slurm.so and pam_slurm_adopt.so to /usr/lib64/security.
(3) "protocol_version 8448 not supported" resolved by unlinking all previous versions of slurm from existing slurm executables.

Thank you for your support. Please proceed with closing this case.

Regards,
RCC HPC Admins

Closing per the last response.
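
For anyone checking whether stale binaries are still contacting the controller after a similar upgrade, the following is a minimal sketch that counts the "protocol_version ... not supported" messages in slurmctld log lines. It is not part of Slurm. The function names `count_unsupported_versions` and `describe` are hypothetical, the regular expression is an assumption based on the single log line quoted in this ticket, and the only version pairing it knows (8448 for 18.08) is the one stated above.

```python
import re
from collections import Counter

# Assumed message format, taken from the one log line quoted in this ticket:
# [2023-03-03T13:14:55.252] error: unpack_header: protocol_version 8448 not supported
PATTERN = re.compile(r"unpack_header: protocol_version (\d+) not supported")

# Only the 8448 -> 18.08 pairing is stated in this ticket; add others only
# after confirming them for your site.
KNOWN_RELEASES = {8448: "18.08"}


def count_unsupported_versions(lines):
    """Count rejected protocol versions found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def describe(counts):
    """Return one summary string per rejected protocol version."""
    summary = []
    for version, hits in sorted(counts.items()):
        release = KNOWN_RELEASES.get(version, "unknown release")
        summary.append(f"protocol_version {version} ({release}): {hits} message(s)")
    return summary


if __name__ == "__main__":
    sample = [
        # Line quoted in this ticket.
        "[2023-03-03T13:14:55.252] error: unpack_header: protocol_version 8448 not supported",
        # Hypothetical unrelated line, included only to show that it is ignored.
        "[2023-03-03T13:14:56.000] some other controller message",
    ]
    for row in describe(count_unsupported_versions(sample)):
        print(row)
```

In practice the lines would come from the site's slurmctld log file, for example by passing an open file handle to `count_unsupported_versions`. A count that keeps growing after the upgrade would suggest that an old client or daemon binary is still in use somewhere.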