Ticket 10274

Summary: Invalid job credential after upgrading to 20.11
Product: Slurm Reporter: lhuang
Component: slurmd Assignee: Tim McMullan <mcmullan>
Status: RESOLVED CANNOTREPRODUCE QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: cblack, nate
Version: 20.11.0   
Hardware: Linux   
OS: Linux   
Site: NY Genome

Description lhuang 2020-11-23 16:48:08 MST
All our slurmd nodes are showing this when a job starts.

[2020-11-23T18:46:02.253] error: Credential signature check: Credential data size mismatch
[2020-11-23T18:46:02.254] error: _convert_job_mem: slurm_cred_verify failed: Invalid job credential


I've restarted the slurmd, slurmctld, and munge services on the compute nodes and the slurmctld host. We only upgraded slurmctld to 20.11; the rest of the compute nodes are still on 19.05.3.

The jobs still seem to run and complete.
Comment 2 Tim McMullan 2020-11-24 10:18:20 MST
Is this happening to just some specific job/jobs?  And if so, would you be able to attach something like the "scontrol show job" output for it?

This looks like the credential is expiring before the job finishes launching. Would you please try adding "AuthInfo=cred_expire=600" to your slurm.conf file to see if it improves?  Daemons will need to be restarted for this to be applied.
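As a sketch, the change Tim suggests would look like this in slurm.conf (cred_expire is in seconds; 600 here is the value from his suggestion, up from the 120-second default):

```
# slurm.conf
# Extend the job credential lifetime so it doesn't expire before launch completes
AuthInfo=cred_expire=600
```

After adding the line, the daemons need a restart for it to take effect, e.g. `systemctl restart slurmctld` on the controller and `systemctl restart slurmd` on each compute node (assuming systemd-managed daemons).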

We've made some changes in 20.11 to better catch this kind of error: when it occurs, this suggestion should now appear in the slurmd logs (and the job should fail).

Thanks!
--Tim
Comment 3 lhuang 2020-11-30 12:55:56 MST
This was occurring on all jobs. After we upgraded all the compute nodes to match 20.11, we no longer see the errors.