Ticket 3554 - Performance issue and socket timeouts
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 16.05.8
Hardware: Linux
Severity: 2 - High Impact
Assignee: Tim Wickberg
Reported: 2017-03-08 08:45 MST by Davide Vanzo
Modified: 2017-05-03 10:38 MDT

Site: Vanderbilt


Attachments
Slurm configuration and controller log (851.55 KB, application/x-compressed-tar)
2017-03-08 08:45 MST, Davide Vanzo
sdiag output (8.74 KB, text/x-log)
2017-03-08 15:43 MST, Davide Vanzo
Controllers log with vestigial files (31.54 KB, application/x-compressed-tar)
2017-03-22 11:24 MDT, Davide Vanzo

Description Davide Vanzo 2017-03-08 08:45:21 MST
Created attachment 4171 [details]
Slurm configuration and controller log

Good morning,
we recently started hitting a performance bottleneck in Slurm when there are more than 10,000 jobs in the queue. That happens even if the jobs are submitted as job arrays. The effect is slow responsiveness of all Slurm commands, which causes some automated submission systems to fail because sbatch takes too long to execute.
From the slurmctld log we noticed a significant increase in pending RPC messages. A month ago we added max_rpc_cnt=16 to our configuration and that solved the pending RPC issue. However, although this restored the responsiveness of Slurm commands, sbatch submissions started failing with socket timeouts and users were still not happy. Even increasing the value to 48 did not seem to solve the problem.
As for resource utilization, we notice that the slurmctld and sched processes on the Slurm server are using no more than 110% out of the 2400% of CPU power available on the server. As for memory, the Slurm controller uses less than 0.5%.
We presume that 10k jobs in the queue should be manageable for Slurm without the significant functional impact we are experiencing.
What can we do to improve its performance?

Attached you can find our current configuration files and the log since yesterday. From the log you can see that before Mar 7 16:33:16 the only pending RPC warnings come from the backfill scheduler; that was with max_rpc_cnt=16. At 16:33 we removed max_rpc_cnt and sched started accumulating pending RPCs.
Please let me know if you need additional information.
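For reference, max_rpc_cnt is one of the comma-separated options passed via SchedulerParameters in slurm.conf. A minimal sketch of the line in question (the value is the one mentioned above; any other options on the real line are omitted):

```
# slurm.conf (sketch) -- once more than this many RPCs are queued,
# slurmctld defers some scheduling work to let the backlog drain
SchedulerParameters=max_rpc_cnt=16
```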

Best regards,

Davide Vanzo
Comment 1 Tim Wickberg 2017-03-08 15:33:47 MST
(In reply to Davide Vanzo from comment #0)
> Created attachment 4171 [details]
> Slurm configuration and controller log
> 
> Good morning,
> we recently started hitting some performance bottleneck in Slurm when there
> are more than 10,000 jobs in queue. That happens even if jobs are submitted
> as job arrays. The effect is a slow responsiveness of all Slurm commands and
> this causes some automated submission system to fail submission because the
> sbatch execution takes too long.
> From the slurmctld log we noticed a significant increase of pending RPC
> messages. A month ago we added max_rpc_cnt=16 to our configuration and that
> solved the pending RPC issue. However, although this solved Slurm commands
> responsiveness, sbatch submissions started failing because of socket timeout
> and users were still not happy. Even increasing the value to 48 did not seem
> to solve the problem.
> As for resources utilization, we notice that the slurmctld and the sched
> processes on the Slurm server are using no more than 110% out of 2400% of
> the CPU power available on the server. As for the memory, less than 0.5% of
> memory is used by the Slurm controller.
> We presume that 10k jobs in queue should not be a problem for Slurm to
> manage without significantly impacting its functionality as we are
> experiencing.
> What can we do to improve its performance?

How long do these jobs run for? Throughput, more than the size of the outstanding workload, tends to be the bigger issue, although with some tuning this shouldn't be insurmountable.

While slurmctld is heavily threaded, most of these threads are contending for only a few locks in the system, so single-thread performance is usually much more important than the number of cores available.

Is the machine virtualized by any chance? That can cause some significant issues unfortunately.

Info from 'sdiag' would help me get an idea of what the system is doing.

If someone is running 'squeue' in a while(true) loop somewhere that can also unfortunately have serious impacts on performance. You should be able to spot that in the sdiag output pretty easily though - look for an account with an unreasonable number of RPC calls.
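One way to spot such an account is to parse the per-user RPC counts out of sdiag output. A minimal sketch, assuming lines shaped like `user ( uid) count:N ave_time:N total_time:N` — the exact sdiag layout varies by Slurm version, and the sample text here is synthetic:

```python
import re

# Hypothetical excerpt of sdiag's per-user RPC statistics section;
# the real layout varies by Slurm version, so treat this as a sketch.
SAMPLE = """\
Remote Procedure Call statistics by user
    alice ( 1001) count:231287 ave_time:387 total_time:89508437
    bob ( 1002) count:1523 ave_time:290 total_time:441670
    carol ( 1003) count:88 ave_time:512 total_time:45056
"""

# Matches: username, "( uid)", then the count field.
LINE = re.compile(r"^\s*(\S+)\s+\(\s*(\d+)\)\s+count:(\d+)")

def rpc_counts_by_user(text):
    """Return [(user, count), ...] sorted by descending RPC count."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            rows.append((m.group(1), int(m.group(3))))
    return sorted(rows, key=lambda r: r[1], reverse=True)

if __name__ == "__main__":
    # The user at the top with a wildly disproportionate count is the
    # likely squeue-in-a-loop culprit.
    for user, count in rpc_counts_by_user(SAMPLE):
        print(f"{user}: {count}")
```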

> Attached you can find our current configuration files and the log since
> yesterday. From the log you can see that before Mar 7 16:33:16 the only
> pending RPC warnings are from the backfill. That was with max_rpc_cnt=16. At
> 16:33 we removed the max_rpc_cnt and sched started accumulating pending RPCs.
> Please let me know if you need additional information.
Comment 2 Davide Vanzo 2017-03-08 15:43:10 MST
Tim,
thank you for the clarification.

It is hard to say how long the jobs last since they come from multiple users. What I can say is that we have a maximum duration of 14 days per job. Is there a correlation between job duration and Slurm performance?

No, our controllers are bare-metal with 12 physical cores (x2 with HT) and 128GB of RAM.

Attached you can find the sdiag output.

Davide
Comment 3 Davide Vanzo 2017-03-08 15:43:49 MST
Created attachment 4177 [details]
sdiag output
Comment 5 Tim Wickberg 2017-03-09 14:24:34 MST
I'm not seeing an immediate cause, sdiag looks relatively clean.

A few more questions for you:

- What filesystem is StateSaveLocation kept on? Are there any known performance quirks with it?

- Do you need the 'rack' feature? One thing that could considerably improve scheduler/backfill performance is reducing the number of different "types" of nodes. If you can collapse the Node config into fewer lines, the controller can take a lot of shortcuts when determining which nodes a job could run on. Although I don't think this accounts for a significant part of the performance issue.

- Have you always been using DenyAccounts like that, or would it be possible to invert some of these into AllowAccounts settings instead? Again, I don't think it's a huge performance hit.

- Do you have a plan to move to 17.02 at some point in the near future? There's a series of different performance fixes that have gone in to that; again, nothing huge, but a lot of small tweaks all over the place.

I'm going to send this over to Alex to see if he notices anything else amiss in the logs you've provided so far.

I do have some recommendations around tuning the backfill scheduler, but want to make sure we check on a few other things at play with your setup before asking you to make any changes.
Comment 6 Alejandro Sanchez 2017-03-10 04:26:01 MST
Davide, adding an extra question to Tim's above. Between March 7 and 8, I count 7872 logged messages like this for different nodes:

Mar  8 09:36:09 slurmsched1 slurmctld[10966]: error: find_node_record: lookup failure for vmp518

Most probably this is not related to the performance issue, but it got my attention. Have you made any recent changes to node configs or the network that could cause this?
Comment 7 Alejandro Sanchez 2017-03-10 04:41:45 MST
Some more minor tuning coming to my mind:

- MinJobAge is set to 300; lowering it to 150 might help slurmctld by keeping fewer jobs in the active database at a time.

- #BatchStartTimeout=10 is commented out, but the default is 10 anyway. Perhaps increasing it would help the automated submission systems not fail so quickly, letting a submission be accepted if it waits a bit longer, though this does not impact performance.
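Both settings live in slurm.conf; a sketch of the suggested change (MinJobAge=150 is from this ticket, the BatchStartTimeout value is illustrative, and the rest of the config is omitted):

```
# slurm.conf (sketch) -- tuning discussed above
MinJobAge=150         # was 300; purge completed jobs from memory sooner
BatchStartTimeout=30  # example value; default is 10 seconds
```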
Comment 9 Davide Vanzo 2017-03-14 10:29:48 MDT
Tim and Alex,
sorry for the late answer, but I have been away for the last few days.

Our state file is located on our GPFS filesystem and we recently had some slowdowns. We will consider moving it into a more reliable network storage filesystem.

We introduced the rack feature to better manage nodes when we need to drain a whole rack. We would like to leave reverting this as a last resort if all other attempts to improve Slurm performance fail.

We transitioned to AllowAccounts and DenyAccounts that way following your suggestions on how to create a debug partition in our system based on partition-based associations. Here is the ticket: https://bugs.schedmd.com/show_bug.cgi?id=2895

No, we do not currently have any plan to transition to Slurm 17. We will probably start testing it soon, but we still have to figure out what is upsetting our current version.

The lookup failure error messages are something we have never been able to figure out. The odd thing is that some of the hosts (e.g. vmpsXX, ginko, rokasgate2, ipoda1) are gateways and have never been added as nodes to Slurm, so we do not understand why Slurm is trying to find them. Do you have any idea what could cause this?

I changed MinJobAge to 150. I will let it run for some time to see how this affects Slurm's performance before changing anything else.

Davide
Comment 10 Davide Vanzo 2017-03-22 11:23:40 MDT
A quick update on the situation.

Neither changing MinJobAge to 150 nor increasing BatchStartTimeout to 30 seconds improved the performance of the scheduler. We still see a lot of pending RPCs, and users keep observing timeouts with sbatch and other commands.

We also noticed that slurmctld started acting a little strangely, with the secondary controller taking over even though the primary was still active, and generating vestigial state files. I suspect this is also connected to the performance issue. Attached you can find an excerpt of the two logs.

Davide
Comment 11 Davide Vanzo 2017-03-22 11:24:29 MDT
Created attachment 4241 [details]
Controllers log with vestigial files
Comment 12 Davide Vanzo 2017-03-22 11:34:16 MDT
Escalating to high impact since the frequent controller instability generates held jobs because of missing cached script files.

Davide
Comment 13 Tim Wickberg 2017-03-22 11:51:30 MDT
(In reply to Davide Vanzo from comment #12)
> Escalating to high impact since the frequent controller instability
> generates held jobs because of missing cached script file.
> 
> Davide

The backup assuming control is leading to many of the issues identified in the logs. I'd suggest disabling it for the time being to avoid further issues - it looks like the system went into a "split-brain" scenario where both controllers were attempting to schedule and manage jobs simultaneously, which is obviously a huge problem.

If the StateSaveLocation is on a GPFS filesystem that has been having performance issues, that is likely the root cause of your current problems. I would recommend moving it to a local path on the primary in the meantime, and disabling the backup for now.

While jobs are being submitted or started, files in that location must be created and then read back promptly. Unfortunately, due to internal mutex locks protecting certain data structures, almost all other threads within the slurmctld process will wait on that I/O to complete, so an underperforming filesystem can lead to significant performance impacts like those I believe you're currently seeing.
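A sketch of the interim change suggested above, assuming a local path such as /var/spool/slurmctld (the path is illustrative, not from this ticket):

```
# slurm.conf (sketch) -- interim single-controller setup
StateSaveLocation=/var/spool/slurmctld   # local disk on the primary, not GPFS
#BackupController=...                    # backup commented out for now
```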
Comment 14 Davide Vanzo 2017-03-22 12:42:44 MDT
Tim,
I have stabilized the system by setting max_rpc_cnt=160. Everything is back to good responsiveness and the two controllers are coexisting peacefully.

What would be the optimal storage configuration for StateSaveLocation, given that it needs to be shared by all the controllers? Currently our GPFS is stable, so I can't see how that could be the problem.

Davide
Comment 15 Tim Wickberg 2017-03-22 13:54:25 MDT
(In reply to Davide Vanzo from comment #14)
> Tim,
> I have stabilized the system by setting max_rpc_cnt=160. Everything is back
> to good responsiveness and the two controllers are coexisting peacefully.

That makes a certain amount of sense - it lets additional work pile up while servicing slower I/O operations, rather than inadvertently rejecting it.

You may still see issues if, e.g., a large number of jobs are submitted simultaneously from multiple users.

> What would be the optimal storage configuration to save StateSaveLocation?
> Currently our GPFS is stable so I can't see how that could be the problem.
> Aside from that what would be the optimal storage configuration to save
> StateSaveLocation since it needs to be shared by all the controllers?

This is something I'm working on better documenting; we don't have a set of specific design recommendations at present, but I believe that would be useful as sites plan for new installations. There are a number of assumptions implicit in Slurm's HA mode that are not well discussed at present.

If your GPFS filesystem is prone to occasional overloaded conditions then that will have an impact on Slurm's throughput. Each batch job needs to write out (then read back) two files corresponding to the job's batch script and environment, and these operations currently will block most other processing within slurmctld. (This relates to a separate issue I'm looking at addressing before the 17.11 release.)

If these file create / read requests aren't processed quickly then this will lead to problems. The higher the throughput of your cluster the greater the impact of FS performance will be. The IOPS load itself should not be that high - but the metadata performance (for file creation/open) is a significant concern, and tends to be the main bottleneck in GPFS or Lustre. An NFS mount from some other system should be sufficient, if it can be sourced from a stable host within the cluster.

A few additional caveats: it's best if the shared storage is accessed over the same network that the slurmctlds use to communicate with the cluster. Otherwise you can end up in a scenario where the communication network has disappeared, leaving both sides believing they are the master and fighting for control of the state files ("split brain"), which could lead to corruption.
Comment 16 Davide Vanzo 2017-03-22 16:08:54 MDT
Tim,
although limiting the RPCs helped, we would still like to improve the overall performance wherever possible. So please let us know if you have any other suggestions.

As for the state files path, only the controllers need to have access to it, correct? Are the read and write operations equally distributed?

Davide
Comment 17 Tim Wickberg 2017-03-22 18:07:23 MDT
(In reply to Davide Vanzo from comment #16)
> Tim,
> although limiting the RPC helped, we would still like to improve the overall
> performance wherever possible. So, please, let us know if you have any other
> suggestion.

There may be some other (unrelated) tuning available in the backfill scheduler. I'll get back to you tomorrow on that.

> As for the state files path, only the controllers need to have access to it,
> correct? Are the read and write operations equally distributed?

Correct.

Read vs. write is relatively balanced; an "ideal" job has a 1:1 ratio - write, read, then delete the script and environment files. Job arrays write once, then read once per array task. The other state files tend to be written more than read.

As mentioned previously, the I/O load itself shouldn't be too high - it scales directly with job throughput on the system - but it is unfortunately highly sensitive to latency spikes; it's been a while since I've run GPFS or Lustre, but I recall this can be an issue (especially with metadata operations) in almost any parallel filesystem.
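A rough way to compare candidate filesystems for StateSaveLocation is to time the create/fsync/read/unlink cycle described above. This is not a Slurm tool, just a toy microbenchmark sketch; the payload size and iteration count are arbitrary:

```python
import os
import tempfile
import time

def metadata_cycle_seconds(directory, n=100, payload=b"x" * 4096):
    """Time n create/write/fsync/read/unlink cycles in `directory`.

    Returns average seconds per cycle. Loosely mimics the per-job
    script and environment file traffic described above, at toy scale.
    """
    start = time.perf_counter()
    for i in range(n):
        path = os.path.join(directory, f"probe_{i}")
        with open(path, "wb") as f:       # create + write (like a batch script)
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())          # force it through to stable storage
        with open(path, "rb") as f:       # read back (like job launch)
            f.read()
        os.unlink(path)                   # delete (like job completion)
    return (time.perf_counter() - start) / n

if __name__ == "__main__":
    # Point this at a directory on each candidate filesystem and compare;
    # metadata-heavy parallel filesystems often show the largest spikes.
    with tempfile.TemporaryDirectory() as d:
        print(f"{metadata_cycle_seconds(d):.6f} s/cycle")
```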

- Tim
Comment 18 Tim Wickberg 2017-04-18 21:45:04 MDT
Apologies for not responding sooner. I've been tracking down some related issues with the failover mechanisms, and we should have a fix in place for that coming up soon. Bug 3692 is tracking that.

I'm not seeing any major issues with your configuration at the moment. I will note that the 17.02 release should improve performance slightly, and may help mitigate some of the symptoms you'd seen in the past.

- Tim
Comment 19 Will French 2017-04-19 09:51:08 MDT
(In reply to Tim Wickberg from comment #18)
> Apologies for not responding sooner. I've been tracking down some related
> issues with the failover mechanisms, and we should have a fix in place for
> that coming up soon. Bug 3692 is tracking that.
> 
> I'm not seeing any major issues with your configuration at the moment. I
> will note that the 17.02 release should improve performance slightly, and
> may help mitigate some of the symptoms you'd seen in the past.
> 
> - Tim

Thanks, Tim. Davide is out of the office for the next few weeks. The overall load on our cluster has been a bit lower this week so no issues to report. I will update our other ticket (Bug 3692) with more details.

Just an FYI that we are transitioning to CentOS 7 this summer and will stand up new instances of the Slurm primary and backup controllers and slurmdbd, each on new dedicated hardware. We will install and run the latest version of 17.02 in the new environment.
Comment 20 Tim Wickberg 2017-05-03 10:38:41 MDT
Marking this as resolved/infogiven. Bug 3692 continues to track issues around HA failover.

- Tim