Created attachment 12812 [details]
slurmctld log

I'm prototyping a Slurm cluster using the elastic computing functionality and running into a communications problem. What happens, in brief, is:

- Create a cluster with 2 normal nodes and 2 State=CLOUD nodes. The normal ones are excluded from power saving.
- Run a 4-node job (this is Job 2 in the logs below). The nodes get created, the job shows as Running and finishes, and no errors occur.
- Wait for the nodes to be powered down.
- Run the same 4-node job again (this is Job 3 in the logs below):
  - The nodes get created and slurmctld shows "Node ohpc-compute-2 now responding", and the same for node -3.
  - The job goes into the Running state.
  - slurmctld shows "_job_complete: JobId=3 WEXITSTATUS 0".
  - The REQUEST_TERMINATE_JOB message fails, and after retrying slurmctld reports "error: Nodes ohpc-compute-[2-3] not responding".

I have verified that during the "can't communicate" phase I can ping the compute nodes in question from the (combined) control/login node, using both their hostnames and IPs. The only odd thing I can see is that the slurmctld log shows:

[2020-01-23T12:23:23.047] debug2: Error connecting slurm stream socket at 10.0.0.75:6818: Connection timed out
[2020-01-23T12:23:23.048] debug2: Error connecting slurm stream socket at 10.0.0.91:6818: Connection timed out

Neither of these IPs is associated with the compute or control/login nodes. Is this expected, or should these be the IPs of the relevant compute nodes?

Attached are:

- Output from sinfo and squeue during the entire sequence, showing only changes.
- The portion of slurmctld.log covering both Job 2 (starts ~12:01) and Job 3 (starts ~12:15) at the highest logging level. I've stripped out the RPC calls generated by sinfo/squeue, that's all.
- Node IPs.
Changes from watching squeue and sinfo:

[2020-01-23T12:01:42.067202]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 idle ohpc-compute-[0-1]

[2020-01-23T12:01:56.306523]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute runhello centos CF 0:02 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 mix# ohpc-compute-[2-3]
compute* up 60-00:00:0 2 mix ohpc-compute-[0-1]

[2020-01-23T12:08:16.573169]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute runhello centos CF 6:22 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 mix ohpc-compute-[0-3]

[2020-01-23T12:08:47.069845]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute runhello centos R 0:02 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 mix ohpc-compute-[0-3]

[2020-01-23T12:08:49.103116]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute runhello centos CG 0:02 3 ohpc-compute-[0-2]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 3 comp ohpc-compute-[0-2]
compute* up 60-00:00:0 1 idle ohpc-compute-3

[2020-01-23T12:08:51.131978]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 idle ohpc-compute-[0-3]

[2020-01-23T12:10:51.194768]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 idle ohpc-compute-[0-1]

[2020-01-23T12:15:27.822033]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos CF 0:01 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 mix ohpc-compute-[0-1]

[2020-01-23T12:15:58.336430]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos CF 0:32 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 mix# ohpc-compute-[2-3]
compute* up 60-00:00:0 2 mix ohpc-compute-[0-1]

[2020-01-23T12:21:39.730209]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos CF 6:13 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 mix ohpc-compute-[0-3]

[2020-01-23T12:21:45.834557]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos R 0:00 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 mix ohpc-compute-[0-3]

[2020-01-23T12:21:49.903407]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos CG 0:03 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 comp ohpc-compute-[0-3]

[2020-01-23T12:21:51.930630]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos CG 0:03 2 ohpc-compute-[2-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 comp ohpc-compute-[2-3]
compute* up 60-00:00:0 2 idle ohpc-compute-[0-1]
IPs:
ohpc-compute-2 10.0.0.82
ohpc-compute-3 10.0.0.98
ohpc-login 10.0.0.188
ohpc-compute-0 10.0.0.238
ohpc-compute-1 10.0.0.231
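One way to check whether those unexpected 10.0.0.75/10.0.0.91 addresses are what slurmctld has cached for the cloud nodes is to compare the NodeAddr field in `scontrol show node` output against the nodes' current IPs above. A small sketch of that comparison; the scontrol sample text here is hypothetical (the field names NodeName/NodeAddr are real, the addresses are filled in only to illustrate the stale-address theory):

```python
import re

def node_addrs(scontrol_text):
    """Map NodeName -> NodeAddr from `scontrol show node` output.

    scontrol prints space-separated key=value pairs; each node record
    starts with NodeName, and NodeAddr is the address slurmctld dials.
    """
    addrs = {}
    current = None
    for key, val in re.findall(r"(\w+)=(\S+)", scontrol_text):
        if key == "NodeName":
            current = val
        elif key == "NodeAddr" and current:
            addrs[current] = val
    return addrs

# Hypothetical sample output: slurmctld still holding the addresses
# seen in the "Error connecting slurm stream socket" log lines.
sample = """\
NodeName=ohpc-compute-2 Arch=x86_64 NodeAddr=10.0.0.75 NodeHostName=ohpc-compute-2
NodeName=ohpc-compute-3 Arch=x86_64 NodeAddr=10.0.0.91 NodeHostName=ohpc-compute-3"""

cached = node_addrs(sample)
# Current IPs of the cloud nodes, from the list above.
actual = {"ohpc-compute-2": "10.0.0.82", "ohpc-compute-3": "10.0.0.98"}
stale = {n for n in cached if cached[n] != actual.get(n)}
print(sorted(stale))  # → ['ohpc-compute-2', 'ohpc-compute-3']
```

If the real `scontrol show node ohpc-compute-[2-3]` output shows NodeAddr values matching the timed-out sockets rather than the live IPs, that would explain why ping works (DNS is current) while slurmctld's connections time out (its cached address is not).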
Sorry, I should have added this: I've worked through the https://slurm.schedmd.com/troubleshoot.html page, most of which feels relevant. Interestingly, restarting slurmd on the affected nodes fixed the comms problem instantly, and slurmctld reported those nodes as up.
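That restarting slurmd fixes it instantly is consistent with slurmd's registration pushing the node's current address to slurmctld, i.e. the controller holding a stale address for the re-provisioned cloud nodes until then. If that theory holds, one documented option worth checking (an assumption about this setup, not a confirmed fix, and it requires that site DNS resolves the cloud node hostnames) is to have slurmctld resolve cloud node addresses via DNS:

```
# slurm.conf -- assumes DNS tracks the cloud nodes' addresses.
# cloud_dns makes slurmctld look cloud node addresses up in DNS
# rather than relying on the cached/registered address.
SlurmctldParameters=cloud_dns
```

Alternatively, the ResumeProgram can push the fresh address itself with `scontrol update NodeName=<node> NodeAddr=<ip> NodeHostname=<host>` once the instance is up.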
Steve,

SchedMD has a professional services support team that can help resolve these issues for you. However, before the support team can engage, we need to put a support contract in place. Is Stack HPC willing to consider Slurm support for this project?

Thanks,
Jacob
Ah, sorry, I didn't realise this was how you handled bugs. I'll discuss.