Ticket 7357

Summary: Socket timed out on send/recv operation
Product: Slurm Reporter: IDRIS System Team <gensyshpe>
Component: Configuration Assignee: Albert Gil <albert.gil>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 2 - High Impact    
Priority: --- CC: alex
Version: 18.08.7   
Hardware: Linux   
OS: Linux   
Site: IDRIS Alineos Sites: ---

Description IDRIS System Team 2019-07-04 10:11:02 MDT
Hi,
We are installing Slurm 18.08.7 on a large configuration (around 1800 compute nodes).
Submitted jobs can either be some large jobs requesting 1500 nodes or a burst of small jobs.
Commands like squeue or sinfo frequently return the error "Socket timed out on send/recv operation", especially when large jobs are launched or terminated.
MessageTimeout has already been increased to 20; we understand that we could raise it as high as 90, but from the users' perspective, waiting more than 20 seconds for the output of squeue or sinfo can be a real problem.
We are also aware that tuning Slurm can be tricky, so could you please help us resolve this problem?
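
For readers who want to confirm which MessageTimeout a given configuration actually sets, here is a minimal Python sketch that reads the value from slurm.conf-style text. It assumes the option sits on its own `Key=Value` line and falls back to 10 seconds when it is unset (assumed here to be the Slurm default). The function name `get_message_timeout` is hypothetical and is not part of Slurm.

```python
def get_message_timeout(conf_text, default=10):
    """Return MessageTimeout (seconds) found in slurm.conf-style text.

    Assumption: the option appears alone on a line as Key=Value.
    Keys are matched case-insensitively; '#' starts a comment.
    """
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip().lower() == "messagetimeout":
            return int(value.strip())
    return default  # assumed default when the option is not set


sample = """
# illustrative fragment, not the site's real configuration
ClusterName=example
MessageTimeout=20   # raised from the default
"""
print(get_message_timeout(sample))
```

The same function could be pointed at the contents of the real slurm.conf to check the value before and after a change.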

Best regards,

Philippe  COLLINET
Comment 1 Alejandro Sanchez 2019-07-04 10:45:00 MDT
Hi Philippe,

Will you kindly attach:

- a copy of your config files
- slurmctld.log
- output of sdiag

We keep a copy of our customers' configurations in an internal repo, so we don't need to ask for config attachments on every single bug. Also, please remember to redact any password options, such as those in slurmdbd.conf.

Thanks
Comment 3 IDRIS System Team 2019-07-08 05:28:20 MDT
Hello,

We enabled debug mode and it helped us.
We found that address resolution on the nodes running Slurm commands was incorrect: the first address did not work, so each command had to wait for the second address.

Once this issue was corrected, we no longer see the "Socket timed out on send/recv operation" error.
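
For anyone hitting the same symptom, a minimal Python sketch like the one below could help spot slow or unexpected name resolution on a node. It times a single `socket.getaddrinfo` lookup and lists the returned addresses in order. The function name `time_resolution` is hypothetical, and port 6817 is assumed to be the slurmctld port in use. The sketch only measures the lookup itself, not the connection attempt to each address.

```python
import socket
import time


def time_resolution(hostname, port=6817):
    """Return (elapsed_seconds, addresses) for one getaddrinfo lookup.

    Assumption: port 6817 is the slurmctld port; change it if the
    site uses a different SlurmctldPort.
    """
    start = time.perf_counter()
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    elapsed = time.perf_counter() - start
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address, listed in the order clients try them.
    addresses = [info[4][0] for info in infos]
    return elapsed, addresses
```

Calling `time_resolution` with the controller's hostname on a login or compute node returns the lookup time and the ordered address list. A long lookup, or an unreachable address listed first, would be consistent with the delay described above.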

Thanks for your help. We can close the bug.

Best regards,

Philippe Collinet
Comment 4 Albert Gil 2019-07-08 05:33:20 MDT
Hi Philippe,

This is good news.
Thanks for the explanations.

I'm closing this ticket as infogiven (although it was you who gave the info to us ;-)

Please let us know if you need further support,
Albert