| Summary: | Node loses communication during job | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Steve Brasier <steveb> |
| Component: | slurmctld | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | ||
| Version: | 20.11.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmctld log | ||
Changes from watching squeue and sinfo:
[2020-01-23T12:01:42.067202]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 idle ohpc-compute-[0-1]
[2020-01-23T12:01:56.306523]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute runhello centos CF 0:02 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 mix# ohpc-compute-[2-3]
compute* up 60-00:00:0 2 mix ohpc-compute-[0-1]
[2020-01-23T12:08:16.573169]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute runhello centos CF 6:22 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 mix ohpc-compute-[0-3]
[2020-01-23T12:08:47.069845]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute runhello centos R 0:02 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 mix ohpc-compute-[0-3]
[2020-01-23T12:08:49.103116]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute runhello centos CG 0:02 3 ohpc-compute-[0-2]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 3 comp ohpc-compute-[0-2]
compute* up 60-00:00:0 1 idle ohpc-compute-3
[2020-01-23T12:08:51.131978]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 idle ohpc-compute-[0-3]
[2020-01-23T12:10:51.194768]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 idle ohpc-compute-[0-1]
[2020-01-23T12:15:27.822033]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos CF 0:01 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 mix ohpc-compute-[0-1]
[2020-01-23T12:15:58.336430]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos CF 0:32 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 mix# ohpc-compute-[2-3]
compute* up 60-00:00:0 2 mix ohpc-compute-[0-1]
[2020-01-23T12:21:39.730209]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos CF 6:13 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 mix ohpc-compute-[0-3]
[2020-01-23T12:21:45.834557]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos R 0:00 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 mix ohpc-compute-[0-3]
[2020-01-23T12:21:49.903407]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos CG 0:03 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 comp ohpc-compute-[0-3]
[2020-01-23T12:21:51.930630]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos CG 0:03 2 ohpc-compute-[2-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 comp ohpc-compute-[2-3]
compute* up 60-00:00:0 2 idle ohpc-compute-[0-1]
Node IPs:
ohpc-compute-2 10.0.0.82
ohpc-compute-3 10.0.0.98
ohpc-login 10.0.0.188
ohpc-compute-0 10.0.0.238
ohpc-compute-1 10.0.0.231

Sorry - should have added this: I've worked through the https://slurm.schedmd.com/troubleshoot.html page, most of which feels relevant. Interestingly, restarting slurmd on the affected nodes fixed the comms problem instantly, and slurmctld then reported those nodes as up.

Steve, SchedMD has a professional services support team that can help resolve these issues for you. However, before the support team can engage we need to put a support contract in place. Is Stack HPC willing to consider Slurm support for this project?

Thanks,
Jacob

Ah sorry, I didn't realise this was how you handled bugs. I'll discuss.
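One observation worth noting: the two timed-out addresses in the slurmctld log (10.0.0.75 and 10.0.0.91) do not match any of the IPs listed above, and a slurmd restart fixed communication instantly. That pattern is consistent with slurmctld retrying a NodeAddr cached from the cloud nodes' previous boot. For elastic setups, Slurm's cloud-scheduling documentation suggests having the ResumeProgram push each node's freshly assigned address via `scontrol update`. A minimal sketch, assuming a hypothetical provider-specific `lookup_ip` helper and that the script receives the resumed hostlist as its first argument:

```shell
#!/bin/bash
# Sketch of a ResumeProgram fragment for State=CLOUD nodes: after each
# node boots, tell slurmctld its current address so the controller does
# not keep trying an address cached from a previous boot.
for node in $(scontrol show hostnames "$1"); do
    ip=$(lookup_ip "$node")   # hypothetical cloud-provider lookup
    scontrol update NodeName="$node" NodeAddr="$ip" NodeHostname="$node"
done
```

This is an operational sketch, not a confirmed fix for this report; whether it applies depends on how the resume script in this prototype assigns addresses.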
Created attachment 12812 [details]
slurmctld log

I'm prototyping a Slurm cluster using the elastic computing functionality and running into a communications problem. What happens, in brief, is:

- Create a cluster with 2 normal nodes and 2 State=CLOUD nodes. The normal nodes are excluded from power saving.
- Run a 4-node job (Job 2 in the logs below). The nodes get created, the job shows as Running and finishes, and no errors occur.
- Wait for the cloud nodes to be powered down.
- Run the same 4-node job again (Job 3 in the logs below):
  - The nodes get created and slurmctld shows "Node ohpc-compute-2 now responding", and the same for node -3.
  - The job goes into the Running state.
  - slurmctld shows "_job_complete: JobId=3 WEXITSTATUS 0".
  - The REQUEST_TERMINATE_JOB message fails, and after retrying slurmctld reports "error: Nodes ohpc-compute-[2-3] not responding".

I have verified that during the "can't communicate" phase I can ping the compute nodes in question from the (combined) control/login node, using both hostnames and IPs. The only odd thing I can see is that the slurmctld log shows:

[2020-01-23T12:23:23.047] debug2: Error connecting slurm stream socket at 10.0.0.75:6818: Connection timed out
[2020-01-23T12:23:23.048] debug2: Error connecting slurm stream socket at 10.0.0.91:6818: Connection timed out

Neither of these is an IP associated with the compute or control/login nodes - is this expected, or should these be the IPs of the relevant compute nodes?

Attached are:
- Output from sinfo and squeue during the entire sequence, showing only changes.
- The portion of slurmctld.log covering both Job 2 (starts ~12:01) and Job 3 (starts ~12:15) at the highest logging level - I've stripped out the RPC calls due to sinfo/squeue, that's all.
- Node IPs.
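For context, the setup described above could look roughly like this in slurm.conf. This is a hedged sketch reconstructed from the report, not the reporter's actual configuration: the script paths, SuspendTime value, and resource parameters are assumptions; only the node names, partition name, and the 60-day time limit come from the output above.

```
# Static nodes, excluded from power saving
NodeName=ohpc-compute-[0-1] State=UNKNOWN
# Elastic nodes, created on demand
NodeName=ohpc-compute-[2-3] State=CLOUD

PartitionName=compute Nodes=ohpc-compute-[0-3] Default=YES MaxTime=60-00:00:00 State=UP

# Power-saving hooks (paths and timing are illustrative)
SuspendProgram=/path/to/suspend.sh
ResumeProgram=/path/to/resume.sh
SuspendExcNodes=ohpc-compute-[0-1]
SuspendTime=300
```

The `mix#` state and "now responding" messages in the logs correspond to the CLOUD nodes being powered up by the ResumeProgram before Jobs 2 and 3 start.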