Ticket 8380 - Node loses communication during job
Summary: Node loses communication during job
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.11.x
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-01-23 07:27 MST by Steve Brasier
Modified: 2020-01-23 10:08 MST

See Also:
Site: -Other-


Attachments
slurmctld log (178.84 KB, text/plain)
2020-01-23 07:27 MST, Steve Brasier

Description Steve Brasier 2020-01-23 07:27:29 MST
Created attachment 12812
slurmctld log

I'm prototyping a Slurm cluster using the elastic computing functionality and am running into a communications problem.

What happens, in brief:
- Create a cluster with 2 normal nodes and 2 State=CLOUD nodes; the normal nodes are excluded from power saving.
- Run a 4-node job (this is Job 2 in the logs below). The cloud nodes get created, the job shows as Running, finishes, and no errors occur.
- Wait for the cloud nodes to be powered down.
- Run the same 4-node job again (this is Job 3 in the logs below):
  - The nodes get created and slurmctld shows "Node ohpc-compute-2 now responding" (and the same for node -3).
  - The job goes into the Running state.
  - slurmctld shows "_job_complete: JobId=3 WEXITSTATUS 0".
  - The REQUEST_TERMINATE_JOB message fails, and after retrying slurmctld reports "error: Nodes ohpc-compute-[2-3] not responding".
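For context, a minimal slurm.conf sketch of the kind of elastic-computing setup described above (node names and the 60-day time limit are taken from this report; the program paths, timings, and CPU counts are illustrative assumptions, not the actual configuration):

```
# Power saving / elastic computing sketch - illustrative only
SuspendProgram=/opt/slurm/bin/suspend.sh   # hypothetical path
ResumeProgram=/opt/slurm/bin/resume.sh     # hypothetical path
SuspendTime=120                            # assumed idle time before power-down
ResumeTimeout=600

# Static nodes, excluded from power saving:
SuspendExcNodes=ohpc-compute-[0-1]
NodeName=ohpc-compute-[0-1] State=UNKNOWN CPUs=1

# Elastic nodes, created/destroyed on demand:
NodeName=ohpc-compute-[2-3] State=CLOUD CPUs=1

PartitionName=compute Nodes=ohpc-compute-[0-3] Default=YES MaxTime=60-00:00:00 State=UP
```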

I have verified that during the "can't communicate" phase I can ping from the (combined) control/login node to the compute nodes in question, using both the hostnames and IPs.

The only odd thing I can see is that the slurmctld log shows:

[2020-01-23T12:23:23.047] debug2: Error connecting slurm stream socket at 10.0.0.75:6818: Connection timed out
[2020-01-23T12:23:23.048] debug2: Error connecting slurm stream socket at 10.0.0.91:6818: Connection timed out

Neither of these is an IP associated with the compute or control/login nodes - is this expected, or should these be the IPs of the relevant compute nodes?

Attached are:
- Output from sinfo and squeue during the entire sequence, showing only changes
- The portion of slurmctld.log covering both Job 2 (starts ~12:01) and Job 3 (starts ~12:15) at the highest logging level; the only things I've stripped out are the RPC calls generated by the sinfo/squeue polling.
- Node IPs.
Comment 1 Steve Brasier 2020-01-23 07:28:18 MST
Changes from watching squeue and sinfo:

[2020-01-23T12:01:42.067202]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      2   idle ohpc-compute-[0-1]

[2020-01-23T12:01:56.306523]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2   compute runhello   centos CF       0:02      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      2   mix# ohpc-compute-[2-3]
compute*     up 60-00:00:0      2    mix ohpc-compute-[0-1]

[2020-01-23T12:08:16.573169]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2   compute runhello   centos CF       6:22      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      4    mix ohpc-compute-[0-3]

[2020-01-23T12:08:47.069845]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2   compute runhello   centos  R       0:02      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      4    mix ohpc-compute-[0-3]

[2020-01-23T12:08:49.103116]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2   compute runhello   centos CG       0:02      3 ohpc-compute-[0-2]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      3   comp ohpc-compute-[0-2]
compute*     up 60-00:00:0      1   idle ohpc-compute-3

[2020-01-23T12:08:51.131978]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      4   idle ohpc-compute-[0-3]

[2020-01-23T12:10:51.194768]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      2   idle ohpc-compute-[0-1]

[2020-01-23T12:15:27.822033]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3   compute runhello   centos CF       0:01      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      2    mix ohpc-compute-[0-1]

[2020-01-23T12:15:58.336430]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3   compute runhello   centos CF       0:32      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      2   mix# ohpc-compute-[2-3]
compute*     up 60-00:00:0      2    mix ohpc-compute-[0-1]

[2020-01-23T12:21:39.730209]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3   compute runhello   centos CF       6:13      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      4    mix ohpc-compute-[0-3]

[2020-01-23T12:21:45.834557]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3   compute runhello   centos  R       0:00      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      4    mix ohpc-compute-[0-3]

[2020-01-23T12:21:49.903407]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3   compute runhello   centos CG       0:03      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      4   comp ohpc-compute-[0-3]

[2020-01-23T12:21:51.930630]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3   compute runhello   centos CG       0:03      2 ohpc-compute-[2-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      2   comp ohpc-compute-[2-3]
compute*     up 60-00:00:0      2   idle ohpc-compute-[0-1]
Comment 2 Steve Brasier 2020-01-23 07:30:29 MST
IPs:

ohpc-compute-2	10.0.0.82
ohpc-compute-3	10.0.0.98
ohpc-login	10.0.0.188
ohpc-compute-0	10.0.0.238
ohpc-compute-1	10.0.0.231
Comment 3 Steve Brasier 2020-01-23 08:20:46 MST
Sorry - I should have added this:

I've worked through the https://slurm.schedmd.com/troubleshoot.html page, most of which seems relevant here.

Interestingly, restarting slurmd on the affected nodes instantly fixed the communications problem, and slurmctld then reported those nodes as up.
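For anyone reproducing this, the recovery step was roughly the following (assuming slurmd runs as a systemd unit, which is not stated in the report; node names are from this report):

```
# On each affected compute node (hypothetical systemd setup):
sudo systemctl restart slurmd

# Then, on the control node, confirm slurmctld sees the node again
# and check which address it has recorded for it:
scontrol show node ohpc-compute-2 | grep -E 'State|NodeAddr'
```

Checking NodeAddr may also help with the earlier question about the unexpected 10.0.0.75/10.0.0.91 addresses, since it shows the address slurmctld is actually using for each node.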
Comment 4 Jacob Jenson 2020-01-23 08:30:42 MST
Steve,

SchedMD has a professional services support team that can help resolve these issues for you. However, before the support team can engage we need to put a support contract in place. Is Stack HPC willing to consider Slurm support for this project? 

Thanks,
Jacob
Comment 5 Steve Brasier 2020-01-23 10:08:00 MST
Ah, sorry - I didn't realise this was how you handled bugs. I'll discuss.