Ticket 17084

Summary: error: Failed to send MESSAGE_TASK_EXIT: Connection refused
Product: Slurm Reporter: Anoop Nair <anoop.k.nair>
Component: slurmd    Assignee: Jacob Jenson <jacob>
Status: OPEN --- QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 23.02.1   
Hardware: Linux   
OS: Other   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Anoop Nair 2023-06-29 12:13:36 MDT
Slurmd fails to communicate with the server.

Scenario:

Fresh installation of Ubuntu 20.04.6 LTS (1 master node, 2 compute nodes) with Slurm 23.02.1. Immediately after the installation, NCCL tests complete successfully. After a while, sbatch jobs start failing and slurmd logs report the following errors. If we restart slurmd on all clients, jobs run without any error for a while.

On Client

[2023-06-22T08:54:37.261] task/affinity: lllp_distribution: JobId=37 manual binding: mask_cpu,one_thread
[2023-06-22T08:54:56.441] [37.extern] done with job
[2023-06-22T08:54:58.727] [37.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2023-06-22T08:54:58.729] [37.0] done with job
[2023-06-22T08:55:00.339] launch task StepId=38.0 request from UID:1001 GID:1001 HOST:172.16.4.172 PORT:58216
[2023-06-22T08:55:00.339] task/affinity: lllp_distribution: JobId=38 manual binding: mask_cpu,one_thread
[2023-06-22T08:55:19.289] [38.extern] done with job


On Server

[2023-06-22T08:56:30.380] sched: Allocate JobId=42 NodeList=compute-permanent-node-[173,829] #CPUs=128 Partition=compute
[2023-06-22T08:56:50.572] _job_complete: JobId=42 WEXITSTATUS 3
[2023-06-22T08:56:50.572] _job_complete: JobId=42 done
[2023-06-22T08:56:53.285] sched: Allocate JobId=43 NodeList=compute-permanent-node-[173,829] #CPUs=128 Partition=compute
[2023-06-22T08:57:13.392] _job_complete: JobId=43 WEXITSTATUS 3
[2023-06-22T08:57:13.392] _job_complete: JobId=43 done

There is no network communication issue between the server and the clients.
Immediately after restarting slurmd on the clients, jobs run without any error for a while.
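As a quick triage aid, one way to spot affected nodes before restarting slurmd is to count occurrences of the error in the slurmd log. A minimal sketch (the function name and the log path `/var/log/slurmd.log` are assumptions; adjust to your SlurmdLogFile setting):

```shell
#!/bin/sh
# Count how many times slurmd failed to deliver MESSAGE_TASK_EXIT
# in a given slurmd log file. A non-zero count on a node suggests
# it is in the bad state and needs a slurmd restart.
count_task_exit_errors() {
    logfile="$1"
    # grep -c prints the match count; '|| true' keeps the script
    # going when there are zero matches (grep exits 1 in that case).
    grep -c 'Failed to send MESSAGE_TASK_EXIT' "$logfile" || true
}

# Example usage (hypothetical paths/hosts):
#   count_task_exit_errors /var/log/slurmd.log
#   for h in compute-permanent-node-173 compute-permanent-node-829; do
#       ssh "$h" 'grep -c "Failed to send MESSAGE_TASK_EXIT" /var/log/slurmd.log; systemctl restart slurmd'
#   done
```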


Whenever slurmd fails to communicate with the server, the sbatch job reports the following errors:

compute-permanent-node-173:445954:446043 [5] misc/shmutils.cc:43 NCCL WARN Call to open failed : No such file or directory

compute-permanent-node-173:445954:446043 [5] misc/shmutils.cc:70 NCCL WARN Error while attaching to shared memory segment /dev/shm/nccl-PZQhXJ (size 4194656)
compute-permanent-node-173: Test NCCL failure common.cu:958 'internal error / '
 .. compute-permanent-node-173 pid 445954: Test failure common.cu:842
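The NCCL warnings point at a missing shared-memory segment under /dev/shm. A hedged sketch for checking whether stale or missing `nccl-*` segments correlate with the failures (the function name is an assumption; the `nccl-*` naming matches the segment in the log above):

```shell
#!/bin/sh
# List and count nccl-* shared-memory segments in a directory
# (normally /dev/shm). Stale segments left behind by killed jobs,
# or a full tmpfs, can cause the "Error while attaching to shared
# memory segment" NCCL warning seen in the job output.
check_shm() {
    dir="${1:-/dev/shm}"
    count=$(find "$dir" -maxdepth 1 -name 'nccl-*' 2>/dev/null | wc -l)
    # Trim whitespace that some wc implementations emit.
    count=$(echo "$count" | tr -d ' ')
    echo "$count stale NCCL segment(s) in $dir"
}

# Example usage (hypothetical): run on each compute node and also
# verify the tmpfs is not full with 'df -h /dev/shm'.
#   check_shm /dev/shm
```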