| Summary: | Node loses communication during job | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Steve Brasier <steveb> |
| Component: | slurmctld | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | ||
| Priority: | --- | ||
| Version: | 20.11.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmctld log | ||
Changes from watching squeue and sinfo:
[2020-01-23T12:01:42.067202]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 idle ohpc-compute-[0-1]
[2020-01-23T12:01:56.306523]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute runhello centos CF 0:02 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 mix# ohpc-compute-[2-3]
compute* up 60-00:00:0 2 mix ohpc-compute-[0-1]
[2020-01-23T12:08:16.573169]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute runhello centos CF 6:22 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 mix ohpc-compute-[0-3]
[2020-01-23T12:08:47.069845]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute runhello centos R 0:02 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 mix ohpc-compute-[0-3]
[2020-01-23T12:08:49.103116]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute runhello centos CG 0:02 3 ohpc-compute-[0-2]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 3 comp ohpc-compute-[0-2]
compute* up 60-00:00:0 1 idle ohpc-compute-3
[2020-01-23T12:08:51.131978]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 idle ohpc-compute-[0-3]
[2020-01-23T12:10:51.194768]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 idle ohpc-compute-[0-1]
[2020-01-23T12:15:27.822033]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos CF 0:01 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 mix ohpc-compute-[0-1]
[2020-01-23T12:15:58.336430]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos CF 0:32 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 mix# ohpc-compute-[2-3]
compute* up 60-00:00:0 2 mix ohpc-compute-[0-1]
[2020-01-23T12:21:39.730209]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos CF 6:13 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 mix ohpc-compute-[0-3]
[2020-01-23T12:21:45.834557]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos R 0:00 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 mix ohpc-compute-[0-3]
[2020-01-23T12:21:49.903407]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos CG 0:03 4 ohpc-compute-[0-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 4 comp ohpc-compute-[0-3]
[2020-01-23T12:21:51.930630]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute runhello centos CG 0:03 2 ohpc-compute-[2-3]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up 60-00:00:0 2 comp ohpc-compute-[2-3]
compute* up 60-00:00:0 2 idle ohpc-compute-[0-1]
Node IPs:
ohpc-compute-2 10.0.0.82
ohpc-compute-3 10.0.0.98
ohpc-login 10.0.0.188
ohpc-compute-0 10.0.0.238
ohpc-compute-1 10.0.0.231

Sorry - should have added this: I've worked through the https://slurm.schedmd.com/troubleshoot.html page, most of which feels relevant. Interestingly, restarting slurmd on the affected nodes fixed the comms problem instantly, and slurmctld then reported those nodes as up.

Steve, SchedMD has a professional services support team that can help resolve these issues for you. However, before the support team can engage we need to put a support contract in place. Is Stack HPC willing to consider Slurm support for this project?

Thanks,
Jacob

Ah sorry, I didn't realise this was how you handled bugs. I'll discuss.
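One observation worth noting: the two timed-out addresses in the slurmctld log (10.0.0.75 and 10.0.0.91) do not match any of the IPs listed above, and a slurmd restart fixed communication instantly. That pattern is consistent with slurmctld retrying a NodeAddr cached from the cloud nodes' previous boot. For elastic setups, Slurm's cloud-scheduling documentation suggests having the ResumeProgram push each node's freshly assigned address via `scontrol update`. A minimal sketch, assuming a hypothetical provider-specific `lookup_ip` helper and that the script receives the resumed hostlist as its first argument:

```shell
#!/bin/bash
# Sketch of a ResumeProgram fragment for State=CLOUD nodes: after each
# node boots, tell slurmctld its current address so the controller does
# not keep trying an address cached from a previous boot.
for node in $(scontrol show hostnames "$1"); do
    ip=$(lookup_ip "$node")   # hypothetical cloud-provider lookup
    scontrol update NodeName="$node" NodeAddr="$ip" NodeHostname="$node"
done
```

This is an operational sketch, not a confirmed fix for this report; whether it applies depends on how the resume script in this prototype assigns addresses.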
Created attachment 12812 [details]
slurmctld log

I'm prototyping a Slurm cluster using the elastic computing functionality and running into a communications problem. What happens, in brief, is:

- Create a cluster with 2 normal nodes and 2 State=CLOUD nodes. The normal nodes are excluded from power saving.
- Run a 4-node job (Job 2 in the logs below). The nodes get created, the job shows as Running and finishes, and no errors occur.
- Wait for the cloud nodes to be powered down.
- Run the same 4-node job again (Job 3 in the logs below):
  - The nodes get created and slurmctld shows "Node ohpc-compute-2 now responding", and the same for node -3.
  - The job goes into the Running state.
  - slurmctld shows "_job_complete: JobId=3 WEXITSTATUS 0".
  - The REQUEST_TERMINATE_JOB message fails, and after retrying slurmctld reports "error: Nodes ohpc-compute-[2-3] not responding".

I have verified that during the "can't communicate" phase I can ping the compute nodes in question from the (combined) control/login node, using both hostnames and IPs. The only odd thing I can see is that the slurmctld log shows:

[2020-01-23T12:23:23.047] debug2: Error connecting slurm stream socket at 10.0.0.75:6818: Connection timed out
[2020-01-23T12:23:23.048] debug2: Error connecting slurm stream socket at 10.0.0.91:6818: Connection timed out

Neither of these is an IP associated with the compute or control/login nodes - is this expected, or should these be the IPs of the relevant compute nodes?

Attached are:
- Output from sinfo and squeue during the entire sequence, showing only changes.
- The portion of slurmctld.log covering both Job 2 (starts ~12:01) and Job 3 (starts ~12:15) at the highest logging level - I've stripped out the RPC calls due to sinfo/squeue, that's all.
- Node IPs.
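For context, the setup described above could look roughly like this in slurm.conf. This is a hedged sketch reconstructed from the report, not the reporter's actual configuration: the script paths, SuspendTime value, and resource parameters are assumptions; only the node names, partition name, and the 60-day time limit come from the output above.

```
# Static nodes, excluded from power saving
NodeName=ohpc-compute-[0-1] State=UNKNOWN
# Elastic nodes, created on demand
NodeName=ohpc-compute-[2-3] State=CLOUD

PartitionName=compute Nodes=ohpc-compute-[0-3] Default=YES MaxTime=60-00:00:00 State=UP

# Power-saving hooks (paths and timing are illustrative)
SuspendProgram=/path/to/suspend.sh
ResumeProgram=/path/to/resume.sh
SuspendExcNodes=ohpc-compute-[0-1]
SuspendTime=300
```

The `mix#` state and "now responding" messages in the logs correspond to the CLOUD nodes being powered up by the ResumeProgram before Jobs 2 and 3 start.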