| Summary: | Invalid job credential after upgrading to 20.11 | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | lhuang |
| Component: | slurmd | Assignee: | Tim McMullan <mcmullan> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | cblack, nate |
| Version: | 20.11.0 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NY Genome | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
lhuang
2020-11-23 16:48:08 MST

Comment (Tim McMullan):
Is this happening to just some specific job or jobs? If so, would you be able to attach something like the "scontrol show job" output for one of them? This looks like the credential is expiring before the job finishes launching. Please try adding "AuthInfo=cred_expire=600" to your slurm.conf file to see if it improves things; the daemons will need to be restarted for the change to take effect. We made some changes in 20.11 to better catch this kind of error, so the suggestion should also appear in the slurmd logs (and the job should fail). Thanks! --Tim

Comment (lhuang):
This was occurring on all jobs. After we upgraded all the compute nodes to the same 20.11 version, we no longer see the errors.
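The workaround Tim suggests can be sketched as a slurm.conf fragment; the 600-second value comes from the comment above, and the restart commands are illustrative (unit names assume a typical systemd-based Slurm installation):

```
# slurm.conf — extend the job credential lifetime to 600 seconds
# (the default cred_expire is shorter; credentials expiring before job
#  launch completes produces "Invalid job credential" errors)
AuthInfo=cred_expire=600
```

After editing, restart the daemons so the new setting is read, e.g. `systemctl restart slurmctld` on the controller and `systemctl restart slurmd` on each compute node. Note that in this case the root cause turned out to be a version mismatch: once the compute nodes were upgraded to match 20.11, the errors stopped without needing this setting.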