Slurmd fails to communicate with the server.

Scenario: Fresh installation of Ubuntu 20.04.6 LTS (1 master node, 2 compute nodes) with Slurm 23.02.1. Immediately after the installation, the NCCL tests complete successfully. After a while, sbatch jobs start failing and the slurmd logs report the errors below. If we restart slurmd on all clients, jobs run without any error for a while.

On Client:

[2023-06-22T08:54:37.261] task/affinity: lllp_distribution: JobId=37 manual binding: mask_cpu,one_thread
[2023-06-22T08:54:56.441] [37.extern] done with job
[2023-06-22T08:54:58.727] [37.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2023-06-22T08:54:58.729] [37.0] done with job
[2023-06-22T08:55:00.339] launch task StepId=38.0 request from UID:1001 GID:1001 HOST:172.16.4.172 PORT:58216
[2023-06-22T08:55:00.339] task/affinity: lllp_distribution: JobId=38 manual binding: mask_cpu,one_thread
[2023-06-22T08:55:19.289] [38.extern] done with job

On Server:

[2023-06-22T08:56:30.380] sched: Allocate JobId=42 NodeList=compute-permanent-node-[173,829] #CPUs=128 Partition=compute
[2023-06-22T08:56:50.572] _job_complete: JobId=42 WEXITSTATUS 3
[2023-06-22T08:56:50.572] _job_complete: JobId=42 done
[2023-06-22T08:56:53.285] sched: Allocate JobId=43 NodeList=compute-permanent-node-[173,829] #CPUs=128 Partition=compute
[2023-06-22T08:57:13.392] _job_complete: JobId=43 WEXITSTATUS 3
[2023-06-22T08:57:13.392] _job_complete: JobId=43 done

There is no communication issue between the server and the client. Immediately after restarting slurmd on the clients, jobs run without any error for a while.
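To pin down when the failures begin, the slurmd log can be filtered for its "error:" entries. The following is a minimal Python sketch, assuming the bracketed timestamp format shown in the client log above; `parse_line` and `error_entries` are hypothetical helper names for illustration, not part of Slurm.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Assumed line shape, taken from the client log above:
#   [timestamp] [optional step id] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?:\[(?P<step>[^\]]+)\]\s+)?(?P<msg>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    step: Optional[str]  # e.g. "37.0" or "37.extern"; None if absent
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one log line into its parts; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m.group("ts"), m.group("step"), m.group("msg"))


def error_entries(lines: Iterable[str]) -> Iterator[LogEntry]:
    """Yield only the entries whose message starts with 'error:'."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.message.startswith("error:"):
            yield entry


if __name__ == "__main__":
    # Sample lines copied from the client log in this post.
    sample = [
        "[2023-06-22T08:54:56.441] [37.extern] done with job",
        "[2023-06-22T08:54:58.727] [37.0] error: Failed to send "
        "MESSAGE_TASK_EXIT: Connection refused",
        "[2023-06-22T08:54:58.729] [37.0] done with job",
    ]
    for e in error_entries(sample):
        print(e.timestamp, e.step, e.message)
```

Running the same filter over the real log file (its path depends on the local slurm.conf) would show the timestamp of the first "Connection refused" entry, which can then be compared with the server log.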
Whenever slurmd fails to communicate with the server, the sbatch job reports the following error:

compute-permanent-node-173:445954:446043 [5] misc/shmutils.cc:43 NCCL WARN Call to open failed : No such file or directory
compute-permanent-node-173:445954:446043 [5] misc/shmutils.cc:70 NCCL WARN Error while attaching to shared memory segment /dev/shm/nccl-PZQhXJ (size 4194656)
compute-permanent-node-173: Test NCCL failure common.cu:958 'internal error / '
..
compute-permanent-node-173 pid 445954: Test failure common.cu:842