Ticket 5313 - Identify source of socket timeout errors
Summary: Identify source of socket timeout errors
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.11.5
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-06-13 15:11 MDT by Davide Vanzo
Modified: 2018-06-28 09:08 MDT

See Also:
Site: Vanderbilt


Attachments
Configuration file (4.12 KB, text/plain), 2018-06-13 15:11 MDT, Davide Vanzo
Controller log (1.17 MB, text/x-log), 2018-06-13 15:12 MDT, Davide Vanzo
sdiag output (3.12 KB, text/plain), 2018-06-13 15:13 MDT, Davide Vanzo

Description Davide Vanzo 2018-06-13 15:11:30 MDT
Created attachment 7083
Configuration file

Hello guys,

To avoid the poor scheduling performance and responsiveness we experienced when Slurm was installed on top of GPFS, we have just deployed Slurm on a single controller with local SSDs, with HA managed externally via keepalived.

The strange thing is that, although the new cluster is still in testing with no more than 1000 single jobs in the queue, a significant number of the following messages appear in the slurmctld log:

error: slurm_receive_msgs: Socket timed out on send/recv operation

However, all Slurm commands are very responsive and there seem to be no side effects on overall scheduling performance.
How can we identify the source of the timeouts?

In the attached log you will see a number of errors caused by nodes running a different slurm.conf; those were due to some other issues we were experiencing on the cluster. The socket timeout errors persist even after we resolved those issues (after T15:00:00).

Thank you

Davide
Comment 1 Davide Vanzo 2018-06-13 15:12:39 MDT
Created attachment 7084
Controller log
Comment 2 Davide Vanzo 2018-06-13 15:13:04 MDT
Created attachment 7085
sdiag output
Comment 3 Tim Wickberg 2018-06-13 15:27:23 MDT
sdiag looks healthy.

I'm guessing that most or all of these messages are generated by slurmctld trying to contact nodes that aren't currently up or aren't running slurmd. Are all the nodes online at the moment?

Increasing the debug level would certainly help narrow this down - you're only at 'info' by default which unfortunately doesn't identify which process is throwing that warning. 'scontrol setdebuglevel debug' for at least a few minutes may give you something better to work off of.
Comment 4 Will French 2018-06-19 06:11:52 MDT
> Increasing the debug level would certainly help narrow this down - you're
> only at 'info' by default which unfortunately doesn't identify which process
> is throwing that warning. 'scontrol setdebuglevel debug' for at least a few
> minutes may give you something better to work off of.

It appears you are correct:

[2018-06-19T07:03:04.317] debug:  Spawning ping agent for cn361
[2018-06-19T07:03:04.317] debug:  Spawning registration agent for cn902,ddell0004,ng[667-668,681] 5 hosts
[2018-06-19T07:03:14.333] debug:  slurm_recv_timeout at 0 of 4, timeout
[2018-06-19T07:03:14.333] error: slurm_receive_msgs: Socket timed out on send/recv operation

Ideally we would always have all our nodes up but invariably we will have a handful that are down due to hardware issues of some sort. Should we expect these controller timeouts to impact the responsiveness of SLURM commands like squeue and sbatch? Any other side effects we should be aware of?
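The debug lines above also show how to attribute each timeout: the "Spawning ... agent" message immediately preceding it names the hosts the controller was waiting on. A minimal sketch of pulling those out, using the exact lines quoted above as sample data (the log file name is a placeholder; adjust to your SlurmctldLogFile):

```shell
# Sample slurmctld debug output (the lines from the comment above):
cat > slurmctld.log <<'EOF'
[2018-06-19T07:03:04.317] debug:  Spawning ping agent for cn361
[2018-06-19T07:03:04.317] debug:  Spawning registration agent for cn902,ddell0004,ng[667-668,681] 5 hosts
[2018-06-19T07:03:14.333] debug:  slurm_recv_timeout at 0 of 4, timeout
[2018-06-19T07:03:14.333] error: slurm_receive_msgs: Socket timed out on send/recv operation
EOF

# Show the agent-spawn lines immediately preceding each timeout,
# i.e. the hosts the controller was trying (and failing) to reach:
grep -B 2 'Socket timed out on send/recv' slurmctld.log \
    | grep 'Spawning .* agent'
```

Comparing the host list against `sinfo` output then tells you whether the timeouts line up with nodes that are already known to be down.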
Comment 5 Will French 2018-06-21 11:47:12 MDT
(In reply to Will French from comment #4)
> > Increasing the debug level would certainly help narrow this down - you're
> > only at 'info' by default which unfortunately doesn't identify which process
> > is throwing that warning. 'scontrol setdebuglevel debug' for at least a few
> > minutes may give you something better to work off of.
> 
> It appears you are correct:
> 
> [2018-06-19T07:03:04.317] debug:  Spawning ping agent for cn361
> [2018-06-19T07:03:04.317] debug:  Spawning registration agent for
> cn902,ddell0004,ng[667-668,681] 5 hosts
> [2018-06-19T07:03:14.333] debug:  slurm_recv_timeout at 0 of 4, timeout
> [2018-06-19T07:03:14.333] error: slurm_receive_msgs: Socket timed out on
> send/recv operation
> 
> Ideally we would always have all our nodes up but invariably we will have a
> handful that are down due to hardware issues of some sort. Should we expect
> these controller timeouts to impact the responsiveness of SLURM commands
> like squeue and sbatch? Any other side effects we should be aware of?

Hi there, just following up. We have been plagued by socket timeouts in our current environment and want to take as many proactive steps as possible to avoid these errors in our new environment before users begin using the system heavily. Thanks!
Comment 6 Tim Wickberg 2018-06-27 14:11:36 MDT
> > [2018-06-19T07:03:04.317] debug:  Spawning ping agent for cn361
> > [2018-06-19T07:03:04.317] debug:  Spawning registration agent for
> > cn902,ddell0004,ng[667-668,681] 5 hosts
> > [2018-06-19T07:03:14.333] debug:  slurm_recv_timeout at 0 of 4, timeout
> > [2018-06-19T07:03:14.333] error: slurm_receive_msgs: Socket timed out on
> > send/recv operation
> > 
> > Ideally we would always have all our nodes up but invariably we will have a
> > handful that are down due to hardware issues of some sort. Should we expect
> > these controller timeouts to impact the responsiveness of SLURM commands
> > like squeue and sbatch? Any other side effects we should be aware of?

No, these are generally harmless status messages. Slurm does periodically ping all elements of the cluster, and if they're not alive that's just a symptom of them already being down.

> Hi there, just following up. We have been plagued by socket timeouts in our
> current environment and want to take as many proactive steps as possible to
> avoid these errors in our new environment before users begin using the
> system heavily. Thanks!

You may want to look at adjusting MessageTimeout from the default 10 seconds up a bit if you've noticed intermittent communication issues, but the default is generally fine on most systems.
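For reference, that is a one-line slurm.conf change; the value below is illustrative, and the file must be updated on all hosts before reconfiguring or restarting the daemons:

```
# slurm.conf - raise the RPC timeout from the default 10 seconds
# (value is illustrative; tune to your network)
MessageTimeout=20
```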

Can you elaborate on the socket timeouts you've seen? Do you have examples, outside of this node ping mechanism, where Slurm seems to be impacted?
Comment 7 Will French 2018-06-28 08:59:00 MDT
> No, these are generally harmless status messages. Slurm does periodically
> ping all elements of the cluster, and if they're not alive that's just a
> symptom of them already being down.

Great - thanks for confirming.

  
> Can you elaborate on the socket timeouts you've seen? Do you have examples,
> outside of this node ping mechanism, where Slurm seems to be impacted?

In our current production environment (we are in the middle of transitioning our cluster to CentOS 7), socket timeouts occur somewhat frequently and we get a fair number of complaints about SLURM sluggishness. See https://bugs.schedmd.com/show_bug.cgi?id=3800 for background and the tickets referenced within.

The advice we give to our users is described here: https://www.vanderbilt.edu/accre/support/faq/#a-slurm-command-fails-with-a-socket-timeout-message-whats-the-problem

We also monitor large job submissions and open tickets with users who do silly things like submit 10k jobs without job arrays.
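For reference, a job array collapses those 10k submissions into a single array record on the controller. A minimal sketch, where job.sh and the input-file naming are hypothetical:

```shell
# Hypothetical per-task script: Slurm sets SLURM_ARRAY_TASK_ID for
# each element of the array, so one script handles every index.
cat > job.sh <<'EOF'
#!/bin/bash
echo "processing input_${SLURM_ARRAY_TASK_ID}.dat"
EOF
chmod +x job.sh

# One submission, one array record on the controller, instead of
# 10000 independent job records:
#   sbatch --array=1-10000 job.sh
```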

The biggest change we've made in our new CentOS 7 environment is moving SLURM (and its state files) to dedicated SSDs with our own HA via keepalived. Our tests so far indicate that this will make a big difference. We had a lot of issues with GPFS sluggishness leading to SLURM sluggishness, so we thought it best to decouple the two.

We also have a fair number of groups that do automated job monitoring and submissions, which can abuse the scheduler if not done right. We work with these groups to design sensible systems to avoid overloading the scheduler.
Comment 8 Tim Wickberg 2018-06-28 09:08:34 MDT
> In our current production environment (we are in the middle of transitioning
> our cluster to CentOS 7), socket timeouts occur somewhat frequently and we
> get a fair number of complaints about SLURM sluggishness. See
> https://bugs.schedmd.com/show_bug.cgi?id=3800 for background and the tickets
> referenced within.

Ahh... I remember that now.
 
> The advice we give to our users is described here:
> https://www.vanderbilt.edu/accre/support/faq/#a-slurm-command-fails-with-a-
> socket-timeout-message-whats-the-problem
> 
> We also monitor large job submissions and open tickets with users who do
> silly things like submit 10k jobs without job arrays.
> 
> The biggest change we've made in our new CentOS 7 environment is moving
> SLURM (and its state files) to dedicated SSDs with our own HA via
> keepalived. Our tests so far indicate that this will make a big difference.
> We had a lot of issues with GPFS sluggishness leading to SLURM sluggishness,
> so we thought it best to decouple the two.

Yes. That should hopefully clear this all up. Slurm's throughput is directly linked with latency to the StateSaveLocation, and GPFS can certainly cause issues when shared with the user filesystems.
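The relevant setting is StateSaveLocation; pointing it at low-latency local storage is what decouples controller throughput from the shared filesystem. A sketch (the path is illustrative, and the directory must still be reachable by, or replicated to, whatever the external HA mechanism fails over to):

```
# slurm.conf - keep controller state on fast local storage
# (path is illustrative)
StateSaveLocation=/var/spool/slurm/state
```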

> We also have a fair number of groups that do automated job monitoring and
> submissions, which can abuse the scheduler if not done right. We work with
> these groups to design sensible systems to avoid overloading the scheduler.

Sounds good.

I'm going to tag this resolved/infogiven for now. If you're still seeing issues - especially after the upgrade - please reopen and we can discuss further.

- Tim