Ticket 17084 - error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Summary: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 23.02.1
Hardware: Linux Other
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-06-29 12:13 MDT by Anoop Nair
Modified: 2023-07-13 10:20 MDT
CC List: 0 users

See Also:
Site: -Other-
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Anoop Nair 2023-06-29 12:13:36 MDT
Slurmd fails to communicate with the server.

Scenario:

Fresh installation of Ubuntu 20.04.6 LTS (1 master node, 2 compute nodes) with Slurm 23.02.1. Immediately after the installation, NCCL tests complete successfully. After a while, sbatch jobs start failing and the slurmd logs report the errors below. If we restart slurmd on all clients, jobs run without any error for a while.
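The exact job script is not attached; as a reference point, a minimal sketch of an sbatch script for this kind of NCCL test run might look like the following. The all_reduce_perf binary is the standard nccl-tests benchmark; the partition name matches the server log below, while the node, task, and GPU counts are assumptions.

#!/bin/bash
#SBATCH --job-name=nccl-test
#SBATCH --partition=compute      # partition name as seen in the server log
#SBATCH --nodes=2                # assumption: both compute nodes
#SBATCH --ntasks-per-node=8      # assumption: one task per GPU
#SBATCH --gpus-per-node=8        # assumption: 8 GPUs per node

# Standard nccl-tests all-reduce sweep from 8 bytes to 1 GB
srun ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1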

On Client

[2023-06-22T08:54:37.261] task/affinity: lllp_distribution: JobId=37 manual binding: mask_cpu,one_thread
[2023-06-22T08:54:56.441] [37.extern] done with job
[2023-06-22T08:54:58.727] [37.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2023-06-22T08:54:58.729] [37.0] done with job
[2023-06-22T08:55:00.339] launch task StepId=38.0 request from UID:1001 GID:1001 HOST:172.16.4.172 PORT:58216
[2023-06-22T08:55:00.339] task/affinity: lllp_distribution: JobId=38 manual binding: mask_cpu,one_thread
[2023-06-22T08:55:19.289] [38.extern] done with job
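To capture more context around these failures, slurmd logging on an affected client can be made more verbose. A sketch, assuming root access, the stock config path /etc/slurm/slurm.conf, and a systemd-managed slurmd service:

# Persistent: raise the daemon log level in slurm.conf, then restart slurmd
#   SlurmdDebug=debug2
sudo systemctl restart slurmd

# One-off: run slurmd in the foreground with extra verbosity on one node
sudo slurmd -D -vvv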


On Server

[2023-06-22T08:56:30.380] sched: Allocate JobId=42 NodeList=compute-permanent-node-[173,829] #CPUs=128 Partition=compute
[2023-06-22T08:56:50.572] _job_complete: JobId=42 WEXITSTATUS 3
[2023-06-22T08:56:50.572] _job_complete: JobId=42 done
[2023-06-22T08:56:53.285] sched: Allocate JobId=43 NodeList=compute-permanent-node-[173,829] #CPUs=128 Partition=compute
[2023-06-22T08:57:13.392] _job_complete: JobId=43 WEXITSTATUS 3
[2023-06-22T08:57:13.392] _job_complete: JobId=43 done

There is no communication issue between the server and the client.
Immediately after restarting slurmd on the clients, jobs run without any error for a while.
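For context, MESSAGE_TASK_EXIT is sent by the slurmstepd that ran the tasks back to the srun that launched the step, so "Connection refused" usually means nothing is listening on the srun side any more (for example, srun has already gone away), rather than a network-level fault. A quick check along those lines, as a sketch, run on the batch host (the first allocated node, where the job script's srun executes) while a step is active:

ss -tlnp | grep srun                       # is srun still holding its listening sockets?
scontrol show config | grep SrunPortRange  # is a restricted srun port range configured?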


Whenever slurmd fails to communicate with the server, the sbatch job reports the following error:

compute-permanent-node-173:445954:446043 [5] misc/shmutils.cc:43 NCCL WARN Call to open failed : No such file or directory

compute-permanent-node-173:445954:446043 [5] misc/shmutils.cc:70 NCCL WARN Error while attaching to shared memory segment /dev/shm/nccl-PZQhXJ (size 4194656)
compute-permanent-node-173: Test NCCL failure common.cu:958 'internal error / '
 .. compute-permanent-node-173 pid 445954: Test failure common.cu:842
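The NCCL warnings point at the shared-memory segment under /dev/shm named in the log. A quick sanity check on the affected node, as a sketch:

df -h /dev/shm          # is the tmpfs full or undersized?
ls -l /dev/shm/nccl-*   # stale NCCL segments left over from earlier jobs?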