| Summary: | nodes down with expired credentials | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Nicholas Labello <nicholas.labello> |
| Component: | slurmd | Assignee: | Director of Support <support> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 21.08.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Pfizer | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | sdiag.log, slurm.conf, slurmctld-20220601.gz | | |
Description
Nicholas Labello, 2022-06-01 18:37:43 MDT
Comment 1 from Jason Booth:

> Munge decode failed: Expired credential

These errors can occur for a few different reasons:

1. Nodes are out of sync with their time. MUNGE requires the nodes' clocks to be in sync in order for credentials to be valid.
2. You could be experiencing timeouts. This could be due to how busy the cluster is. We can make adjustments if needed once we gather some additional information.
3. You may need to increase the munge thread count.

Please send us the slurm.conf, the full slurmctld.log covering this time period, and the full slurmd.log from one of the nodes. What I will be looking for are any errors that center around TCP timeouts and munge errors in the slurmctld.log. Please also send us the output of sdiag run 5 times, 1 minute apart.

Regarding the munge threads, you can make this change now by editing the service file, adding the additional threads, and reloading/restarting the service:
https://slurm.schedmd.com/high_throughput.html#munge_config

Comment 2 from Nicholas Labello:

Created attachment 25342 [details]
sdiag.log

Thanks Jason. Attached. Unfortunately the nodes have been rebooted, wiping the slurmd logs. I may have to wait for the next occurrence to collect them.

I noticed while collecting these logs that we do not have memory enforcement enabled. I think this must have been accidentally lost during a recent Bright upgrade, which included jumping 2 major versions of Slurm. Given that the node failures occurred while a very resource-hungry user job was running on them, I am wondering if it ran the nodes out of memory.

Based on our slurm.conf, is ConstrainRAMSpace=yes all we need to do to enable memory enforcement?
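The munge thread change pointed to in comment 1 can be sketched as a systemd drop-in. This is an illustrative sketch, not the site's actual config: the drop-in path is written under /tmp here so the example is harmless to run, and the thread count of 10 is just an example (`--num-threads` is the munged option the high-throughput guide refers to).

```shell
# Sketch: raise munged's thread count via a systemd drop-in.
# On a real system the drop-in lives in /etc/systemd/system/munge.service.d/;
# /tmp is used here only so the example runs without root.
mkdir -p /tmp/munge.service.d
cat > /tmp/munge.service.d/override.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/sbin/munged --num-threads=10
EOF
# On a real system, follow with:
#   systemctl daemon-reload && systemctl restart munge
```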
Created attachment 25343 [details]
slurm.conf
Created attachment 25344 [details]
slurmctld-20220601.gz
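As an aside, the sdiag series requested in comment 1 (5 samples, one minute apart) can be collected with a small script; the script and log paths below are illustrative.

```shell
# Sketch: sample sdiag 5 times, one minute apart, into a single log file.
# Paths are illustrative; run the generated script on the slurmctld host.
cat > /tmp/collect_sdiag.sh <<'EOF'
#!/bin/sh
for i in 1 2 3 4 5; do
    date     # timestamp each sample
    sdiag
    sleep 60
done > /tmp/sdiag.log 2>&1
EOF
chmod +x /tmp/collect_sdiag.sh
```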
Hi Nicholas,

(In reply to Nicholas Labello from comment #2)
> Thanks Jason. Attached. Unfortunately the nodes have been rebooted wiping
> slurmd logs. I may have to wait for the next occurrence to collect them.

Were those nodes out of sync with real time? That can cause this type of expired credential error.

> I noticed while collecting these logs that we do not have memory enforcement
> enabled. I think this must have been accidentally lost during a recent
> Bright upgrade which included jumping 2 major versions of Slurm. Given that
> the node failures occurred when a very resource-hungry user job was running
> on them I am wondering if it ran the nodes out of memory.

That's possible. Is there anything in the system logs that would indicate that?

> Based on our slurm.conf is ConstrainRAMSpace=yes all we need to do to enable
> memory enforcement?

I think so. You already have the task/cgroup plugin enabled.

Thanks,
-Michael

Hi Nicholas, any updates?

Thanks,
-Michael

Feel free to reopen if you need more assistance. Thanks!
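For reference, the memory enforcement discussed above is split across two files: the task/cgroup plugin is enabled in slurm.conf (already the case here, per Michael's reply), and the constraint itself goes in cgroup.conf. A minimal illustrative fragment, not this site's actual configuration:

```
# slurm.conf (already present on this cluster)
TaskPlugin=task/cgroup

# cgroup.conf (illustrative)
ConstrainRAMSpace=yes
# Optionally also constrain swap usage:
#ConstrainSwapSpace=yes
```

With ConstrainRAMSpace=yes, a job that exceeds its allocated memory is confined by the cgroup limit rather than being free to exhaust node memory, which matches the failure mode Nicholas suspected.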