Ticket 7357 - Socket timed out on send/recv operation
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 18.08.7
Hardware: Linux Linux
Severity: 2 - High Impact
Assignee: Albert Gil
Reported: 2019-07-04 10:11 MDT by IDRIS System Team
Modified: 2019-07-08 05:33 MDT
Site: IDRIS


Description IDRIS System Team 2019-07-04 10:11:02 MDT
Hi,
We are installing Slurm 18.08.7 on a large configuration (around 1800 compute nodes).
Submitted jobs are either large jobs requesting 1500 nodes or bursts of small jobs.
Commands like squeue or sinfo frequently return this error: "Socket timed out on send/recv operation", especially when large jobs are launched or terminated.
MessageTimeout has already been increased to 20. We understand that we could raise this value up to 90, but from the users' perspective, waiting more than 20 seconds for the result of squeue or sinfo can be a real problem.
We are also aware that tuning Slurm can be tricky, so could you please help us deal with this problem?

Best regards,

Philippe COLLINET
Comment 1 Alejandro Sanchez 2019-07-04 10:45:00 MDT
Hi Philippe,

Will you kindly attach:

- a copy of your config files
- slurmctld.log
- output of sdiag

We keep a copy of our customers' configurations in an internal repo, so we don't need to ask for config attachments on every single bug. Also, please remember to redact any password options, e.g. in slurmdbd.conf.

Thanks
Comment 3 IDRIS System Team 2019-07-08 05:28:20 MDT
Hello,

We set up debug mode and it helped us.
We found that address resolution on the nodes running Slurm commands was incorrect, so each command had to wait for the second address.

Once this issue was corrected, we no longer saw the "Socket timed out on send/recv operation" error.

Thanks for your help. We can close the bug.

Best regards,

Philippe Collinet
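
A delay of the kind described above, where a command waits on a first address before a second one answers, can be checked from a node where squeue or sinfo runs. The following is a minimal sketch and not part of the original exchange: the `probe` helper is hypothetical, and the host name and port passed to it are assumptions to be replaced with the site's own controller address and SlurmctldPort. It times the name lookup and then a TCP connect to every address the lookup returns, so a slow or unreachable first address stands out.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Time name resolution for host, then a TCP connect to each resolved address.

    Returns (lookup_seconds, results), where results is a list of
    (address, status, connect_seconds) tuples in resolver order.
    """
    start = time.perf_counter()
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    lookup_seconds = time.perf_counter() - start

    results = []
    for family, socktype, proto, _canonname, sockaddr in infos:
        t0 = time.perf_counter()
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            status = "ok"
        except OSError as exc:
            # Covers timeouts, refused connections and unreachable networks.
            status = f"failed: {exc}"
        results.append((sockaddr[0], status, time.perf_counter() - t0))
    return lookup_seconds, results
```

Calling `probe` with the controller's host name and port and printing the returned tuples shows how long the lookup took and which resolved address, if any, fails or is slow to accept a connection. This only exercises the resolver and the TCP path; it does not speak the Slurm protocol.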
Comment 4 Albert Gil 2019-07-08 05:33:20 MDT
Hi Philippe,

This is good news.
Thanks for the explanation.

I'm closing this ticket as infogiven (although it was you who gave the info to us ;-)

Please let us know if you need further support,
Albert