| Summary: | node fail with not responding | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Xing Huang <x.huang> |
| Component: | Other | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 22.05.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | WA St. Louis | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Xing Huang
2022-12-16 12:20:26 MST
Please attach your slurm.conf and the slurmd.log from that node.

While we wait on the slurm.conf, it is worth mentioning that the controller periodically connects to each slurmd over a TCP connection. If it cannot make a connection within the SlurmdTimeout period, the node running that slurmd is considered down (not responding).

You may want to consider increasing SlurmdTimeout. If your site uses a value of 300 seconds, consider a value of 500-600.

Normally, when we see these types of messages, the nodes are busy doing other work, such as copying large amounts of data to the site storage solution or running a CPU-intensive workflow.

(In reply to Jason Booth from comment #2)
> While we wait on the slurm.conf, it is worth mentioning that the controller
> periodically connects to each slurmd over a TCP connection. If it cannot
> make a connection within the SlurmdTimeout period, the node running that
> slurmd is considered down (not responding).
>
> You may want to consider increasing SlurmdTimeout. If your site uses a value
> of 300 seconds, consider a value of 500-600.
>
> Normally, when we see these types of messages, the nodes are busy doing other
> work, such as copying large amounts of data to the site storage solution or
> running a CPU-intensive workflow.

That is a good suggestion. At this point, there is no more information in slurmd.log than what I have shown you. I will first bump up this value in slurm.conf and see if this helps. Can I update you in a few days after I make the change?

> Can I update you in a few days after I make the change?
Yes, for now I will proceed to downgrade the severity.
I am closing this out. Please feel free to re-open it; however, I would consider the bump in SlurmdTimeout to be sufficient.
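For anyone who wants to rule out a plain network problem before or after raising SlurmdTimeout, a rough reachability check of the slurmd port from the controller host might look like the sketch below. It assumes the default SlurmdPort of 6818 and a helper name (`slurmd_reachable`) chosen purely for illustration. It only tests whether the port accepts a TCP connection; it does not speak Slurm's protocol and is not a reproduction of what slurmctld does internally.

```python
import socket

# Slurm's default SlurmdPort; change this if slurm.conf sets a different value.
SLURMD_PORT = 6818


def slurmd_reachable(host: str, port: int = SLURMD_PORT, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling `slurmd_reachable("node001")`, where `node001` is a placeholder hostname, returns `False` when the port is closed, filtered, or the node is too loaded to accept the connection in time. A `True` result while the node is still marked as not responding would point toward load on the node rather than basic connectivity.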