| Summary: | Compute nodes flapping between responding and not responding | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Hjalti Sveinsson <hjalti.sveinsson> |
| Component: | slurmd | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | alex |
| Version: | 18.08.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | deCODE | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | RHEL |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 18.08.7 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Hjalti Sveinsson
2019-07-04 09:13:26 MDT
I have raised the importance of this issue because it is causing all jobs on these nodes to fail, and these nodes are really important: they run jobs on a partition whose nodes have a special setup (OS, packages, etc.).

Hi Hjalti, can you attach your config files and slurmctld.log? Do all the nodes in the cluster have the same configuration? Thanks.

Hi, I just changed the Ethernet adapter policy in Cisco UCS Manager on these nodes to the Linux policy, and they have now stopped showing up as not responding. I made the change at 10:10 this morning and have seen no problems so far. I will update this issue in an hour or so; if I see no further problems, I will consider the issue fixed, in which case it has nothing to do with Slurm.

OK, thanks for your feedback. Any further issues? Can we close this out? Thanks.

Hjalti, I'm lowering the severity of this. Please let us know if the policy change solved the problem and we can close this out. Thanks.

This issue can be closed, as it was fixed by the adapter policy change in Cisco UCS Manager. Thank you.
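For anyone who wants to confirm a fix like this over a longer window, here is a minimal sketch of how flapping could be counted from sampled node states. It relies on `sinfo` marking a node that is not responding with a `*` after its state (for example `idle*`). The helper names `is_responding` and `count_flaps` and the sample data are hypothetical and were not part of this ticket; how the samples are collected (for example by polling `sinfo` at intervals) is left out.

```python
from typing import Sequence


def is_responding(state: str) -> bool:
    """Return True unless the state carries the '*' (not responding) marker."""
    # Assumption: the state string comes from sinfo's node state column,
    # where '*' flags a node that is presently not responding.
    return "*" not in state


def count_flaps(states: Sequence[str]) -> int:
    """Count transitions between responding and not responding."""
    flags = [is_responding(s) for s in states]
    # Compare each sample with the next one and count the changes.
    return sum(1 for prev, cur in zip(flags, flags[1:]) if prev != cur)


# Hypothetical samples for one node, taken at regular intervals.
samples = ["idle", "idle*", "idle", "alloc*", "alloc", "alloc"]
print(count_flaps(samples))  # prints 4
```

A count that stays at zero over the observation window is consistent with the node no longer flapping.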