| Summary: | Job credential expired | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | John Thompson <jthompson> |
| Component: | slurmctld | Assignee: | Ben Glines <ben.glines> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | | |
| Priority: | --- | | |
| Version: | 21.08.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Albert Einstein | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
John Thompson
2022-07-28 09:38:04 MDT

So I searched your site, and I see that the clocks are not exactly in sync among all the nodes in the cluster. What's the maximum skew allowed between the submit host and the execution host?

Ben Glines

Please verify the following (a quick way to script checks for items 1 and 2 is sketched after this list):

1. Time is in sync across the cluster.
2. munged is running on those nodes and is responding.
3. Consider increasing the munge threads.
> https://slurm.schedmd.com/high_throughput.html#munge_config
4. You may want to consider increasing TCPTimeout to 30 seconds.
> https://slurm.schedmd.com/slurm.conf.html#OPT_TCPTimeout
5. Have you checked the nodes with these failures to see if they are under heavy load or their disks are being heavily used?
> vmstat 1 30
6. Please send us the logs from a few of the nodes during that time, along with the controller logs.
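For items 1 and 2, a loop like the following can be run from the controller or a submit host. This is a minimal sketch: the hostnames in `NODES` are placeholders for your own nodes, and it assumes passwordless SSH plus the standard `munge`/`unmunge` client tools installed everywhere.

```bash
#!/usr/bin/env bash
# Minimal sketch for checklist items 1 and 2: report per-node clock skew
# and verify that munged is running and responding.
# NODES is a placeholder; substitute your own hostnames, or derive the
# list from Slurm, e.g.: NODES=$(sinfo -h -N -o '%N' | sort -u)
NODES="node01 node02 node03"

for node in $NODES; do
    # Compare the remote clock to the local one. The difference includes
    # the ssh round trip, so treat sub-second offsets as noise; anything
    # approaching the ~300 s munge credential TTL is a real problem.
    local_t=$(date +%s.%N)
    remote_t=$(ssh "$node" date +%s.%N)
    echo "$node clock offset: $(echo "$remote_t - $local_t" | bc) s"

    # Encode a credential here and decode it on the remote node. This is
    # the standard munge round-trip test; it fails if munged is down or
    # the keys/clocks don't line up.
    if munge -n | ssh "$node" unmunge > /dev/null 2>&1; then
        echo "$node munge: OK"
    else
        echo "$node munge: FAILED"
    fi
done
```

Items 3 and 4 are configuration changes: `TCPTimeout=30` in slurm.conf, and starting munged with more worker threads (e.g. `munged --num-threads 10`, per the high-throughput guide linked above).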
Ben Glines

(In reply to John Thompson from comment #2)
> So I searched your site, and I see that the clocks are not exactly in sync
> among all the nodes in the cluster. What's the maximum skew allowed between
> the submit host and the execution host?

The default time-to-live for munge credentials is typically 300 seconds (unless it has been modified). Any clock skew larger than that (as well as other possible problems) could result in the issues you are seeing.

John Thompson

I've got a bad clock. I've done a hack to fix the times on the machines, and I can submit jobs now. I still have work to do, but we can submit jobs to Slurm now. Please close the case.

Ben Glines

Okay, glad you were able to at least figure out the source. Let us know if you have any further questions. Closing this now.
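A closing note for anyone who lands on this ticket with the same symptom: the reporter's "hack" isn't described, but the durable fix is normally a running time-sync daemon on every node. A minimal sketch using chrony (an assumption; any NTP client that keeps skew well under the ~300-second munge TTL works):

```bash
# On each node (exact package/service names vary by distro):
systemctl enable --now chronyd   # keep the clock synced from now on
chronyc makestep                 # step the clock immediately rather than slewing
chronyc tracking                 # verify sync status and current offset
```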