| Summary: | The node is down after scontrol reconfigure | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | NAOKI <shibata> |
| Component: | slurmctld | Assignee: | David Bigagli <david> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | da, maclach |
| Version: | 2.6.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | CRAY | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
NAOKI
2014-01-26 11:21:45 MST
I presume that you mean the slurmctld daemon is down, not the slurmd daemons. These messages:

# scontrol: error: slurm_receive_msg: Socket timed out on send/recv operation
# slurm_reconfigure error: Socket timed out on send/recv operation

indicate that your slurmctld is not responding. That can happen on a reconfiguration if you have a huge number of jobs, something seriously wrong in your configuration, or a very slow file system holding the state file. Another possibility is that your configuration is bad and slurmctld died. Please attach your slurmctld log file for this time period.

Still waiting for logs and configuration file.

If this happens again, please open a ticket with the configuration file and logs.
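When collecting the requested log excerpt, it can help to trim the slurmctld log to the minutes around the failed reconfigure. The following is only a rough sketch, not a Slurm tool: it assumes each log entry starts with a bracketed ISO-style timestamp such as `[2014-01-26T11:21:45]` (the format on your system may differ), and the function name `lines_in_window` and the sample lines are hypothetical placeholders, not real slurmctld output.

```python
import re
from datetime import datetime

# Assumed entry prefix: "[YYYY-MM-DDTHH:MM:SS]" with optional fractional seconds.
# Adjust the pattern if your slurmctld log uses a different timestamp format.
_STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?\]")


def lines_in_window(lines, start, end):
    """Yield log lines whose timestamp lies in [start, end].

    Lines without a timestamp (continuation lines) follow the decision
    made for the most recent timestamped line.
    """
    keep = False
    for line in lines:
        match = _STAMP.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
            keep = start <= stamp <= end
        if keep:
            yield line


if __name__ == "__main__":
    # Placeholder lines for illustration only; not real slurmctld messages.
    sample = [
        "[2014-01-26T11:00:00] placeholder entry before the window",
        "[2014-01-26T11:21:45] placeholder entry inside the window",
        "  continuation of the entry inside the window",
        "[2014-01-26T12:00:00] placeholder entry after the window",
    ]
    window_start = datetime(2014, 1, 26, 11, 15, 0)
    window_end = datetime(2014, 1, 26, 11, 30, 0)
    for entry in lines_in_window(sample, window_start, window_end):
        print(entry)
```

To use it on a real file, pass an open file object as `lines`; the path of the log is site-specific and is set by `SlurmctldLogFile` in slurm.conf.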