We have intermittent failures when submitting jobs. The error message looks like this:

[johthompso@jthompson ~]$ srun -p unlimited --pty bash
srun: error: Task launch for StepId=355431.0 failed on node cpu-731: Job credential expired
srun: error: Application launch failed: Job credential expired
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
[johthompso@jthompson ~]$ sinfo --version
slurm 21.08.8-2

We've seen this happen with many users on many machines, after which the error has disappeared.
So I searched your site, and see that clocks are not exact among all nodes in the cluster. What's the maximum skew allowed between submit host and execution host?
Please verify the following.

1. Time is in sync across the cluster.
2. Munge is running on those nodes and is responding.
3. Consider increasing the munge threads.
   > https://slurm.schedmd.com/high_throughput.html#munge_config
4. You may want to consider increasing TCPTimeout to 30 seconds.
   > https://slurm.schedmd.com/slurm.conf.html#OPT_TCPTimeout
5. Have you checked the nodes with these failures to see if they are under heavy load or their disk is being heavily used?
   > vmstat 1 30
6. Please send us the logs from a few of the nodes during that time and the controller logs.
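For the time-sync item, one rough way to spot an out-of-sync node is to compare each node's reported epoch time against a reference host's time taken at the same moment. The sketch below assumes those readings have already been collected (for example by running `date +%s` on each node). The function name, the node `cpu-732`, the timestamps, and the 5-second threshold are all illustrative assumptions, not Slurm or MUNGE settings.

```python
def find_skewed_nodes(readings, max_skew_seconds=5.0):
    """Return {node: skew} for nodes whose clock differs from the reference.

    readings maps a node name to (node_epoch, reference_epoch): the time the
    node reported and the reference host's time when the reading was taken.
    Positive skew means the node's clock is ahead of the reference.
    """
    skewed = {}
    for node, (node_epoch, reference_epoch) in readings.items():
        skew = node_epoch - reference_epoch
        if abs(skew) > max_skew_seconds:
            skewed[node] = skew
    return skewed


# Hypothetical readings: (time reported by node, reference time at collection).
readings = {
    "cpu-731": (1_700_000_400.0, 1_700_000_000.0),  # node clock 400 s ahead
    "cpu-732": (1_700_000_001.0, 1_700_000_000.5),  # node clock 0.5 s ahead
}
print(find_skewed_nodes(readings))
```

Readings taken over a slow connection include network delay, so a small threshold may produce false positives; this is only a first-pass check.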
(In reply to John Thompson from comment #2)
> So I searched your site, and see that clocks are not exact among all nodes
> in the cluster. What's the maximum skew allowed between submit host and
> execution host?

The default time-to-live for munge credentials is typically 300 seconds (unless it has been modified). Any clock skew greater than this (as well as other possible problems) could result in the issues you are seeing.
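To illustrate why skew beyond the time-to-live matters, here is a simplified model of the expiry check. It is a sketch under stated assumptions, not MUNGE's actual implementation: the function names are hypothetical, and only the 300-second default comes from the reply above.

```python
DEFAULT_TTL_SECONDS = 300  # default credential time-to-live cited in the reply


def credential_expired(encode_time, decode_time, ttl=DEFAULT_TTL_SECONDS):
    """Simplified model: the credential counts as expired when the decoding
    host's clock reads more than ttl seconds after the encoding timestamp."""
    return decode_time - encode_time > ttl


def apparent_age(real_delay, decoder_skew):
    """Age the decoding host perceives: the real delay plus how far its clock
    runs ahead of the encoding host's clock (negative if it runs behind)."""
    return real_delay + decoder_skew


# Credential decoded 2 s after creation on a host whose clock is 400 s ahead.
print(credential_expired(encode_time=0, decode_time=apparent_age(2, 400)))  # True

# Same 2 s delay with the clocks only 1 s apart.
print(credential_expired(encode_time=0, decode_time=apparent_age(2, 1)))  # False
```

In this model a freshly created credential looks expired as soon as the decoding host's clock runs more than 300 seconds ahead. The sketch does not cover a decoding clock that runs behind the encoder.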
I've got a bad clock. I've done a hack to fix times on the machines, and I can submit jobs now. I still have work to do, but we can submit jobs to slurm now. Please close the case.
Okay, glad you were able to at least figure out the source. Let us know if you have any further questions. Closing this now.