| Summary: | Job credential expired | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | John Thompson <jthompson> |
| Component: | slurmctld | Assignee: | Ben Glines <ben.glines> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | | |
| Priority: | --- | | |
| Version: | 21.08.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Albert Einstein | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
John Thompson
2022-07-28 09:38:04 MDT

So I searched your site, and I see that the clocks are not exactly in sync among all the nodes in the cluster. What's the maximum skew allowed between the submit host and the execution host?

Ben Glines

Please verify the following (a quick way to script checks for items 1 and 2 is sketched after this list):

1. Time is in sync across the cluster.
2. munged is running on those nodes and is responding.
3. Consider increasing the munge threads.
> https://slurm.schedmd.com/high_throughput.html#munge_config
4. You may want to consider increasing TCPTimeout to 30 seconds.
> https://slurm.schedmd.com/slurm.conf.html#OPT_TCPTimeout
5. Have you checked the nodes with these failures to see if they are under heavy load or their disks are being heavily used?
> vmstat 1 30
6. Please send us the logs from a few of the nodes during that time, along with the controller logs.
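For items 1 and 2, a loop like the following can be run from the controller or a submit host. This is a minimal sketch: the hostnames in `NODES` are placeholders for your own nodes, and it assumes passwordless SSH plus the standard `munge`/`unmunge` client tools installed everywhere.

```bash
#!/usr/bin/env bash
# Minimal sketch for checklist items 1 and 2: report per-node clock skew
# and verify that munged is running and responding.
# NODES is a placeholder; substitute your own hostnames, or derive the
# list from Slurm, e.g.: NODES=$(sinfo -h -N -o '%N' | sort -u)
NODES="node01 node02 node03"

for node in $NODES; do
    # Compare the remote clock to the local one. The difference includes
    # the ssh round trip, so treat sub-second offsets as noise; anything
    # approaching the ~300 s munge credential TTL is a real problem.
    local_t=$(date +%s.%N)
    remote_t=$(ssh "$node" date +%s.%N)
    echo "$node clock offset: $(echo "$remote_t - $local_t" | bc) s"

    # Encode a credential here and decode it on the remote node. This is
    # the standard munge round-trip test; it fails if munged is down or
    # the keys/clocks don't line up.
    if munge -n | ssh "$node" unmunge > /dev/null 2>&1; then
        echo "$node munge: OK"
    else
        echo "$node munge: FAILED"
    fi
done
```

Items 3 and 4 are configuration changes: `TCPTimeout=30` in slurm.conf, and starting munged with more worker threads (e.g. `munged --num-threads 10`, per the high-throughput guide linked above).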
Ben Glines

(In reply to John Thompson from comment #2)
> So I searched your site, and I see that the clocks are not exactly in sync
> among all the nodes in the cluster. What's the maximum skew allowed between
> the submit host and the execution host?

The default time-to-live for munge credentials is typically 300 seconds (unless it has been modified). Any clock skew larger than that (as well as other possible problems) could result in the issues you are seeing.

John Thompson

I've got a bad clock. I've done a hack to fix the times on the machines, and I can submit jobs now. I still have work to do, but we can submit jobs to Slurm now. Please close the case.

Ben Glines

Okay, glad you were able to at least figure out the source. Let us know if you have any further questions. Closing this now.
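A closing note for anyone who lands on this ticket with the same symptom: the reporter's "hack" isn't described, but the durable fix is normally a running time-sync daemon on every node. A minimal sketch using chrony (an assumption; any NTP client that keeps skew well under the ~300-second munge TTL works):

```bash
# On each node (exact package/service names vary by distro):
systemctl enable --now chronyd   # keep the clock synced from now on
chronyc makestep                 # step the clock immediately rather than slewing
chronyc tracking                 # verify sync status and current offset
```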