Created attachment 739 [details]
slurm.conf file

Hello, Slurm support team!

I'm Toru Matsuoka, an engineer at Cray Japan Inc. At a customer site that I am responsible for, RPC 1008 communication connection failures are occurring, and jobs have frequently been stopping as a result. What causes this problem? I have attached the slurm.conf file. The contents of slurmctld.log are shown below. The problem occurs on multiple nodes. Please advise me on this situation.

Best regards,
----------------------
Toru Matsuoka
Cray Japan Inc.
----------------------
This error typically indicates one of the following:
1) a network problem,
2) the slurmd daemon being down on one or more compute nodes, or
3) the munged daemon being down on one or more compute nodes.

If you are just starting the system and not all of the slurmd daemons are up yet, this error is expected. It is also expected if any of your compute nodes are down. Your configuration looks fine. Please look at the troubleshooting guide at the site below, especially the section on network problems.
http://slurm.schedmd.com/troubleshoot.html#network
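As a rough first check for causes 1) and 2), one could test whether the slurmd port on each compute node accepts TCP connections from the controller host. The following is only a minimal sketch, not part of Slurm: the function names are hypothetical, and it assumes slurmd listens on the default SlurmdPort of 6818 (check SlurmdPort in your slurm.conf). A successful connection shows only that the network path works and something is listening on that port; it does not verify munge authentication.

```python
import socket

# Assumption: slurmd uses the default SlurmdPort (6818).
# Check the SlurmdPort setting in slurm.conf before relying on this.
SLURMD_PORT = 6818


def port_reachable(host, port=SLURMD_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_nodes(hosts, port=SLURMD_PORT, timeout=3.0):
    """Return the hosts whose slurmd port could not be reached."""
    return [h for h in hosts if not port_reachable(h, port, timeout)]
```

For example, running `unreachable_nodes(["node001", "node002"])` on the controller host (the node names here are placeholders) would list the nodes worth looking at first.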
Were you able to identify the source of this problem?
Hello, Slurm support team!

Thank you for your continued support. The cause of this problem was a network connection problem. Slurm integration work was carried out at the customer site, and network access was temporarily lost as a side effect. The problem is now resolved. Please close this case.

Best regards,
----------------------
Toru Matsuoka
Cray Japan Inc.
----------------------
Network problem at customer site resolved.