Ticket 8380 - Node loses communication during job
Summary: Node loses communication during job
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.11.x
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-01-23 07:27 MST by Steve Brasier
Modified: 2020-01-23 10:08 MST

See Also:
Site: -Other-


Attachments
slurmctld log (178.84 KB, text/plain)
2020-01-23 07:27 MST, Steve Brasier

Description Steve Brasier 2020-01-23 07:27:29 MST
Created attachment 12812
slurmctld log

I'm prototyping a Slurm cluster using the elastic computing functionality and am running into a communications problem.

What happens, in brief:
- Create a cluster with 2 normal nodes and 2 State=CLOUD nodes; the normal nodes are excluded from power saving.
- Run a 4-node job (this is Job 2 in the logs below). The cloud nodes get created, the job shows as Running, finishes, and no errors occur.
- Wait for the cloud nodes to be powered down.
- Run the same 4-node job again (this is Job 3 in the logs below):
  - The nodes get created and slurmctld shows "Node ohpc-compute-2 now responding" (and the same for node -3).
  - The job goes into the Running state.
  - slurmctld shows "_job_complete: JobId=3 WEXITSTATUS 0".
  - The REQUEST_TERMINATE_JOB message fails, and after retrying slurmctld reports "error: Nodes ohpc-compute-[2-3] not responding".
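For context, a minimal slurm.conf sketch of the kind of elastic-computing setup described above (node names and the 60-day time limit are taken from this report; the program paths, timings, and CPU counts are illustrative assumptions, not the actual configuration):

```
# Power saving / elastic computing sketch - illustrative only
SuspendProgram=/opt/slurm/bin/suspend.sh   # hypothetical path
ResumeProgram=/opt/slurm/bin/resume.sh     # hypothetical path
SuspendTime=120                            # assumed idle time before power-down
ResumeTimeout=600

# Static nodes, excluded from power saving:
SuspendExcNodes=ohpc-compute-[0-1]
NodeName=ohpc-compute-[0-1] State=UNKNOWN CPUs=1

# Elastic nodes, created/destroyed on demand:
NodeName=ohpc-compute-[2-3] State=CLOUD CPUs=1

PartitionName=compute Nodes=ohpc-compute-[0-3] Default=YES MaxTime=60-00:00:00 State=UP
```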

I have verified that during the "can't communicate" phase I can ping from the (combined) control/login node to the compute nodes in question, using both the hostnames and IPs.

The only odd thing I can see is that the slurmctld log shows:

[2020-01-23T12:23:23.047] debug2: Error connecting slurm stream socket at 10.0.0.75:6818: Connection timed out
[2020-01-23T12:23:23.048] debug2: Error connecting slurm stream socket at 10.0.0.91:6818: Connection timed out

Neither of these is an IP associated with the compute or control/login nodes - is this expected, or should these be the IPs of the relevant compute nodes?

Attached are:
- Output from sinfo and squeue during the entire sequence, showing only changes
- The portion of slurmctld.log covering both Job 2 (starts ~12:01) and Job 3 (starts ~12:15) at the highest logging level; the only things I've stripped out are the RPC calls generated by the sinfo/squeue polling.
- Node IPs.
Comment 1 Steve Brasier 2020-01-23 07:28:18 MST
Changes from watching squeue and sinfo:

[2020-01-23T12:01:42.067202]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      2   idle ohpc-compute-[0-1]

[2020-01-23T12:01:56.306523]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2   compute runhello   centos CF       0:02      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      2   mix# ohpc-compute-[2-3]
compute*     up 60-00:00:0      2    mix ohpc-compute-[0-1]

[2020-01-23T12:08:16.573169]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2   compute runhello   centos CF       6:22      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      4    mix ohpc-compute-[0-3]

[2020-01-23T12:08:47.069845]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2   compute runhello   centos  R       0:02      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      4    mix ohpc-compute-[0-3]

[2020-01-23T12:08:49.103116]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2   compute runhello   centos CG       0:02      3 ohpc-compute-[0-2]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      3   comp ohpc-compute-[0-2]
compute*     up 60-00:00:0      1   idle ohpc-compute-3

[2020-01-23T12:08:51.131978]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      4   idle ohpc-compute-[0-3]

[2020-01-23T12:10:51.194768]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      2   idle ohpc-compute-[0-1]

[2020-01-23T12:15:27.822033]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3   compute runhello   centos CF       0:01      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      2    mix ohpc-compute-[0-1]

[2020-01-23T12:15:58.336430]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3   compute runhello   centos CF       0:32      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      2   mix# ohpc-compute-[2-3]
compute*     up 60-00:00:0      2    mix ohpc-compute-[0-1]

[2020-01-23T12:21:39.730209]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3   compute runhello   centos CF       6:13      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      4    mix ohpc-compute-[0-3]

[2020-01-23T12:21:45.834557]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3   compute runhello   centos  R       0:00      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      4    mix ohpc-compute-[0-3]

[2020-01-23T12:21:49.903407]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3   compute runhello   centos CG       0:03      4 ohpc-compute-[0-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      4   comp ohpc-compute-[0-3]

[2020-01-23T12:21:51.930630]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3   compute runhello   centos CG       0:03      2 ohpc-compute-[2-3]
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up 60-00:00:0      2   comp ohpc-compute-[2-3]
compute*     up 60-00:00:0      2   idle ohpc-compute-[0-1]
Comment 2 Steve Brasier 2020-01-23 07:30:29 MST
IPs:

ohpc-compute-2	10.0.0.82
ohpc-compute-3	10.0.0.98
ohpc-login	10.0.0.188
ohpc-compute-0	10.0.0.238
ohpc-compute-1	10.0.0.231
Comment 3 Steve Brasier 2020-01-23 08:20:46 MST
Sorry - I should have added this:

I've worked through the https://slurm.schedmd.com/troubleshoot.html page, most of which seems relevant here.

Interestingly, restarting slurmd on the affected nodes instantly fixed the communications problem, and slurmctld then reported those nodes as up.
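For anyone reproducing this, the recovery step was roughly the following (assuming slurmd runs as a systemd unit, which is not stated in the report; node names are from this report):

```
# On each affected compute node (hypothetical systemd setup):
sudo systemctl restart slurmd

# Then, on the control node, confirm slurmctld sees the node again
# and check which address it has recorded for it:
scontrol show node ohpc-compute-2 | grep -E 'State|NodeAddr'
```

Checking NodeAddr may also help with the earlier question about the unexpected 10.0.0.75/10.0.0.91 addresses, since it shows the address slurmctld is actually using for each node.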
Comment 4 Jacob Jenson 2020-01-23 08:30:42 MST
Steve,

SchedMD has a professional services support team that can help resolve these issues for you. However, before the support team can engage we need to put a support contract in place. Is Stack HPC willing to consider Slurm support for this project? 

Thanks,
Jacob
Comment 5 Steve Brasier 2020-01-23 10:08:00 MST
Ah, sorry - I didn't realise this was how you handled bugs. I'll discuss.