Ticket 17084 - error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Summary: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 23.02.1
Hardware: Linux Other
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-06-29 12:13 MDT by Anoop Nair
Modified: 2023-07-13 10:20 MDT
CC List: 0 users

See Also:
Site: -Other-
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Anoop Nair 2023-06-29 12:13:36 MDT
Slurmd fails to communicate with the server.

Scenario:

Fresh installation of Ubuntu 20.04.6 LTS (1 master node, 2 compute nodes) with Slurm 23.02.1. Immediately after the installation, NCCL tests complete successfully. After a while, sbatch jobs start failing and the slurmd logs report the errors below. If we restart slurmd on all clients, jobs run without any error for a while.
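The exact job script is not attached; as a reference point, a minimal sketch of an sbatch script for this kind of NCCL test run might look like the following. The all_reduce_perf binary is the standard nccl-tests benchmark; the partition name matches the server log below, while the node, task, and GPU counts are assumptions.

#!/bin/bash
#SBATCH --job-name=nccl-test
#SBATCH --partition=compute      # partition name as seen in the server log
#SBATCH --nodes=2                # assumption: both compute nodes
#SBATCH --ntasks-per-node=8      # assumption: one task per GPU
#SBATCH --gpus-per-node=8        # assumption: 8 GPUs per node

# Standard nccl-tests all-reduce sweep from 8 bytes to 1 GB
srun ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1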

On Client

[2023-06-22T08:54:37.261] task/affinity: lllp_distribution: JobId=37 manual binding: mask_cpu,one_thread
[2023-06-22T08:54:56.441] [37.extern] done with job
[2023-06-22T08:54:58.727] [37.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2023-06-22T08:54:58.729] [37.0] done with job
[2023-06-22T08:55:00.339] launch task StepId=38.0 request from UID:1001 GID:1001 HOST:172.16.4.172 PORT:58216
[2023-06-22T08:55:00.339] task/affinity: lllp_distribution: JobId=38 manual binding: mask_cpu,one_thread
[2023-06-22T08:55:19.289] [38.extern] done with job
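To capture more context around these failures, slurmd logging on an affected client can be made more verbose. A sketch, assuming root access, the stock config path /etc/slurm/slurm.conf, and a systemd-managed slurmd service:

# Persistent: raise the daemon log level in slurm.conf, then restart slurmd
#   SlurmdDebug=debug2
sudo systemctl restart slurmd

# One-off: run slurmd in the foreground with extra verbosity on one node
sudo slurmd -D -vvv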


On Server

[2023-06-22T08:56:30.380] sched: Allocate JobId=42 NodeList=compute-permanent-node-[173,829] #CPUs=128 Partition=compute
[2023-06-22T08:56:50.572] _job_complete: JobId=42 WEXITSTATUS 3
[2023-06-22T08:56:50.572] _job_complete: JobId=42 done
[2023-06-22T08:56:53.285] sched: Allocate JobId=43 NodeList=compute-permanent-node-[173,829] #CPUs=128 Partition=compute
[2023-06-22T08:57:13.392] _job_complete: JobId=43 WEXITSTATUS 3
[2023-06-22T08:57:13.392] _job_complete: JobId=43 done

There is no communication issue between the server and the client.
Immediately after restarting slurmd on the clients, jobs run without any error for a while.
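For context, MESSAGE_TASK_EXIT is sent by the slurmstepd that ran the tasks back to the srun that launched the step, so "Connection refused" usually means nothing is listening on the srun side any more (for example, srun has already gone away), rather than a network-level fault. A quick check along those lines, as a sketch, run on the batch host (the first allocated node, where the job script's srun executes) while a step is active:

ss -tlnp | grep srun                       # is srun still holding its listening sockets?
scontrol show config | grep SrunPortRange  # is a restricted srun port range configured?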


Whenever slurmd fails to communicate with the server, the sbatch job reports the following error:

compute-permanent-node-173:445954:446043 [5] misc/shmutils.cc:43 NCCL WARN Call to open failed : No such file or directory

compute-permanent-node-173:445954:446043 [5] misc/shmutils.cc:70 NCCL WARN Error while attaching to shared memory segment /dev/shm/nccl-PZQhXJ (size 4194656)
compute-permanent-node-173: Test NCCL failure common.cu:958 'internal error / '
 .. compute-permanent-node-173 pid 445954: Test failure common.cu:842
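The NCCL warnings point at the shared-memory segment under /dev/shm named in the log. A quick sanity check on the affected node, as a sketch:

df -h /dev/shm          # is the tmpfs full or undersized?
ls -l /dev/shm/nccl-*   # stale NCCL segments left over from earlier jobs?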