Ticket 686

Summary: slurm rpc:1008 : Communication connection failure
Product: Slurm Reporter: toru matsuoka <tmatsuoka>
Component: slurmctld Assignee: David Bigagli <david>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: da
Version: 2.6.2   
Hardware: Linux   
OS: Linux   
Site: CRAY
Attachments: slurm.conf file

Description toru matsuoka 2014-04-08 21:38:15 MDT
Created attachment 739
slurm.conf file

Hello, Slurm support team!

I'm Toru Matsuoka, an engineer at Cray Japan Inc.

An "RPC 1008: Communication connection failure" error is occurring at a customer site I am responsible for, and jobs are frequently stopping as a result.

What is causing this problem?

I have attached the slurm.conf file.

The contents of slurmctld.log are indicated below. 

This problem occurs on multiple nodes.

Please advise me on this situation.


Best regards,
----------------------
Toru Matsuoka

Cray Japan Inc.
----------------------
Comment 1 Moe Jette 2014-04-09 03:14:09 MDT
This error typically indicates one of the following:
1) a network problem,
2) the slurmd daemon being down on one or more compute nodes, or
3) the munged daemon being down on one or more compute nodes.
If you are just starting the system and not all of the slurmd daemons are up, this error would be expected. It would also be expected if any of your compute nodes are down.

Your configuration looks fine.

Please look at the troubleshooting guide at the site below, especially the section on network problems.
http://slurm.schedmd.com/troubleshoot.html#network
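
As a rough illustration of how the first two causes above might be narrowed down from the controller host, here is a minimal Python sketch. It is not part of Slurm. It assumes the node list comes from the output of `sinfo -h -N -o "%N %T"` (where a trailing `*` on the state marks a node that is not responding) and that slurmd listens on port 6818, the usual default; the actual port is whatever SlurmdPort is set to in the site's slurm.conf. The function names `port_reachable` and `unresponsive_nodes` are hypothetical helpers chosen for this example. It does not check munged.

```python
import socket
from typing import List

# Assumption: default SlurmdPort. Use the value from the site's slurm.conf if it differs.
SLURMD_PORT = 6818


def port_reachable(host: str, port: int = SLURMD_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unresponsive_nodes(sinfo_output: str) -> List[str]:
    """List nodes that look down or unresponsive.

    Expects text in the form produced by: sinfo -h -N -o "%N %T"
    (one "nodename state" pair per line). A node is reported if its
    state starts with "down" or ends with "*" (not responding).
    """
    nodes: List[str] = []
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, state = parts
        if state.startswith("down") or state.endswith("*"):
            # With -N a node appears once per partition, so de-duplicate.
            if name not in nodes:
                nodes.append(name)
    return nodes
```

One way to use this would be to pass the captured sinfo text to `unresponsive_nodes` and then call `port_reachable` on each returned name. A node whose slurmd port cannot be reached points toward a network problem or a stopped slurmd; a node that is reachable but still flagged may warrant a look at munged on that node instead.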
Comment 2 Moe Jette 2014-04-15 08:47:54 MDT
Were you able to identify the source of this problem?
Comment 3 toru matsuoka 2014-04-15 17:42:03 MDT
Hello, Slurm support team!

Thank you for your continued support.

The cause of this trouble was a network connection problem.

Slurm integration work was being carried out at the customer site, and as a side effect network access was temporarily unavailable.

The problem is now resolved.

Please close this case.

Best regards,
----------------------
Toru Matsuoka

Cray Japan Inc.
----------------------
Comment 4 Moe Jette 2014-04-16 03:33:58 MDT
Network problem at customer site resolved.