Ticket 8185 - Merlin6 Slurm Cluster: configuration assistance and recommendations
Summary: Merlin6 Slurm Cluster: configuration assistance and recommendations
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 18.08.8
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Ben Roberts
 
Reported: 2019-12-05 02:05 MST by Marc Caubet Serrabou
Modified: 2019-12-18 08:42 MST

See Also:
Site: Paul Scherrer


Attachments
Slurm configuration files for the Merlin6 cluster (104.21 KB, application/gzip)
2019-12-05 02:05 MST, Marc Caubet Serrabou

Description Marc Caubet Serrabou 2019-12-05 02:05:52 MST
Created attachment 12481 [details]
Slurm configuration files for the Merlin6 cluster

Dear SchedMD Support,

I am responsible for one of the main Slurm clusters at PSI and we have 
recently requested SchedMD Support. I would like to start the first 
round by attaching our configuration files and getting some advice on 
improvements and configuration assistance. This is a federated setup 
with two clusters: merlin5 (old setup) and merlin6 (new setup):

  * 'merlin5' has been running for a few years already; it has a very
    old configuration and runs on very old hardware, with hyper-
    threading disabled and CPU as the consumable resource.
  * 'merlin6' is the new cluster, with a new configuration using 
    Memory and Core as consumable resources. It contains CPU and GPU 
    machines. Hyper-threading is enabled on the CPU nodes, while it 
    is disabled on the GPU nodes. This is the new official production 
    cluster, where:
    * Basically, all users should have the same fairshare rights, 
      independent of the organizational unit they belong to.
      * Exception: a few special groups have the following requirements
        based on their sponsoring of some cluster nodes:
        * on a number of nodes (job slots) equivalent to their 
          sponsorship, they should always get the highest priority and 
          should not have to wait longer than 1h for their jobs to 
          start.
        * This requirement will come in the near future and we will 
          need to adapt our configuration accordingly.
    * The cluster has a number of nodes featuring GPUs. The fairshare 
      for the GPU-nodes and for the normal CPU-nodes should be 
      accounted for separately, i.e. somebody having used the CPU 
      resources should not get a fairshare penalty based on that when 
      submitting a job for the GPU-nodes.

I attach the main files: 
  - slurm.conf for merlin6
  - slurm.prolog (slurmd) for merlin6
  - slurm.epilog (slurmd) for merlin6
  - slurmctld.prolog for merlin6 (currently just returns 'exit 0': 
    enabled for further development)
  - gres.conf for merlin6
  - slurmdbd.conf for merlin6
  - accounts.[parsable2|standard].txt ('sacctmgr show assoc' output)
    - Currently only the 'merlin' account is used. 'merlin-gpu' was 
      created to isolate GPU fairshare from CPU, but never went into 
      real production.
    - 'meg' account can be ignored.
  - qos.[parsable2|standard].txt (output for 'sacctmgr show qos')
  - myhwloc.[output|.tar.bz2|xml] (output for 
    'hwloc-gather-topology /tmp/myhwloc' for one of the CPU nodes)

Main items:
  - Server configuration:
    - 2 slurmdbd in active/passive mode  - merlin-slurmctld0[1,2]
    - 2 slurmctld in active/passive mode - merlin-slurmctld0[1,2]
    - 1 single MariaDB instance, dedicated machine - merlin-slurmdb01
  - Hyper-threading enabled:
    - submitted jobs should use physical cores by default and only use 
      2 threads per core if explicitly requested by the user.
    - Different jobs must not land on the same physical core.
    - Different jobs must not land on the same logical CPU (core 
      thread).
  - CPU affinity and cgroups enabled
  - Need for running interactive and X11-based jobs
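  - For illustration of the hyper-threading defaults above, users 
    could opt in per job (the job script name is just an example):
    - 'sbatch --hint=nomultithread job.sh' (one task thread per 
      physical core, the default we want)
    - 'sbatch --hint=multithread job.sh' (explicitly use both 
      hardware threads of each core)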

Pending items:
  - Upgrading to Slurm 19.05.4 with PMIx support
    - Repository is ready and version has been tested at PSI.
    - Upgrade will be performed in the upcoming weeks.
  - Problems with very short jobs ("cgroup cannot allocate memory" 
    errors). How can we prevent that in an efficient way?
  - Limit GPU resources by 'maximum allowed GRES resources' instead of 
    by 'maximum allowed nodes'.
  - Is a QoS 'gpu-Xn' a good way of limiting GPU resources?
  - ELK integration:
    - 'elasticsearch' plugin not working with latest ELK.
  - Allow specific jobs from specific users to always run in the 
    Cluster (preemption? reservations?). We would need advices on this.
  - Problems with hybrid jobs (OpenMPI+OpenMP): tasks are not correctly
    assigned to cores (hwloc issue?):
    - The problem is mild with 'srun', while it is severe with OpenMPI 
      'mpirun'.
      - Code compiled with gcc/openmpi suffers from it.
      - Code compiled with intel/impi usually runs smoothly with 
        'srun'.
  - Introducing NHC:
    - Auto recovery when a node is rebooted.
    - Detect configuration issues and drain node when detected (check 
      also possible from prolog).
  - BeeOND (BeeGFS On Demand) for generating shared scratch disks on 
    demand.
    - Needs to be done at the prolog step.

For the pending items where we would need help, would it be better 
to open a separate ticket for each? I would not open them all at 
once, as I first want to investigate some of them (NHC, BeeOND) a 
bit more myself.

Thanks a lot and best regards,
Marc
Comment 2 Ben Roberts 2019-12-06 16:09:48 MST
Hi Marc,

Thanks for the detailed description of your environment and situation, that does make it easier to provide suggestions.  I'll try to offer suggestions for the problems you're facing.

You state that you would like to have different fairshare calculations for gpu vs non-gpu jobs.  Slurm does allow you to create different user/account combinations, known as associations, that are unique entities.  If you had users request a different account for gpu jobs then this would make sure that use of one account didn't affect the fairshare calculations for the other account.
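
For example, reusing the 'merlin-gpu' account you already created (the user name below is a placeholder):
sacctmgr add user user1 account=merlin-gpu
Users would then submit GPU jobs with something like 'sbatch --account=merlin-gpu ...', keeping that usage out of the 'merlin' association's fairshare.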

You also mention that you have problems allocating memory for very short jobs.  I was going to suggest you use MemSpecLimit, but I see that you've already got that defined.  Are you seeing anything else in the logs related to these errors, like OOM errors?  Are there other jobs on the nodes when you see this issue?

You are able to limit the number of gpu resources each job can use.  There are a few ways to limit it, but it sounds like you probably want to limit the number of gpus per node a job can request.  Assuming this is correct you can use the MaxTRESPerNode attribute for a job.  Here's an example of how to set this:
sacctmgr modify user where user=user1 account=account1 set maxtrespernode=gres/gpu=2

You could also set this on an account or on a QOS.  There are other MaxTRES attributes that might work better for your use case that are described in the sacctmgr documentation:
https://slurm.schedmd.com/sacctmgr.html

I'm not sure what you're referring to with 'gpu-Xn'.  Can you elaborate on that, if the MaxTRESPerNode doesn't accomplish what you want?

Can you also elaborate on what is wrong with the elasticsearch plugin with ELK?  This may be an issue that warrants its own ticket since it sounds like it might be a bug rather than a configuration issue.  I'm happy to look at it and if it does look like it should be in a separate ticket I'll let you know.

There are ways to make sure that certain users can always run jobs.  Do they need to have nodes available immediately or is this the group you mentioned that needs jobs to start within an hour?  If you want to use preemption, do you have requirements on the types of jobs that can be preempted?  Would any job be eligible or would you want users to choose to run jobs where they could potentially be preempted?  

I would need some more details about the OpenMPI issue you're talking about.  Which version of OpenMPI are you using?  Does Slurm build correctly with the software?  You mention that it is less of an issue when you use srun, but it still happens?  Does that depend on the number of jobs on a node or is it a random occurrence with a similar setup?

When you mark a node as down with a health check script you can have it come back up automatically with the ReturnToService parameter.  The default (0) requires an admin to resume the node, but you can have it come back up only if it were marked down for being non-responsive (1) or have it come back up regardless of the reason (2).  You can have the node health check script run as a prolog if you need it to check each time a job is about to start.  You will just have to account for the possibility that there are other jobs currently on the node.  
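
As a sketch of the relevant slurm.conf entries (the paths are hypothetical examples, not taken from your attached configs):
ReturnToService=2
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300
Prolog=/etc/slurm/slurm.prolog
With this, a node marked down comes back on its own once it responds again, and a non-zero exit from the prolog drains the node before the job starts.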

When you have a chance to answer some of these questions I'm happy to continue to work with you on getting your system configured the way you want.

Thanks,
Ben
Comment 3 Marc Caubet Serrabou 2019-12-11 06:49:55 MST
Hi Ben,

thanks a lot for your answer and details. You can find my
answers below.

> Thanks for the detailed description of your environment and situation, that
> does make it easier to provide suggestions.  I'll try to offer suggestions for
> the problems you're facing.
>
> You state that you would like to have different fairshare calculations for gpu
> vs non-gpu jobs.  Slurm does allow you to create different user/account
> combinations, known as associations, that are unique entities.  If you had
> users request a different account for gpu jobs then this would make sure that
> use of one account didn't affect the fairshare calculations for the other
> account.

Ok, thanks. Hence, having 2 different accounts was a good approach. 
I was wondering if there was an alternative, but to me this is the 
best solution (easy and direct). GPU usage should be independent of 
CPU accounting, so it makes sense to use a separate account for it.

> You also mention that you have problems allocating memory for very short jobs.
> I was going to suggest you use MemSpecLimit, but I see that you've already got
> that defined.  Are you seeing anything else in the logs related to these
> errors, like OOM errors?  Are there other jobs on the nodes when you see this
> issue?

This problem is seen with single-core jobs which finish very fast
(hundreds of jobs running between 0 and 5 seconds). I tried to
reproduce it with very short jobs that are not single-core (occupying
a full node or a few nodes), and I was not able to, so it only
applies to single-core jobs with very short runs. Regarding your
question, as far as I saw, there were no OOM kills.

I am forcing users to pack jobs into bigger ones. Some of these jobs
were packed already, and since then we see this problem less often.
However, I wonder if there is an option to prevent such short jobs:
'a priori' I cannot stop users from running this way, and it would be
great to implement something to prevent it. I was thinking of adding
something like a sleep command in the epilog for very short jobs
(depending on how long they ran), but that is really a terrible
workaround :).

> You are able to limit the number of gpu resources each job can use.  There are
> a few options of ways to limit it, but it sounds like you probably want to
> limit the number of gpus per node a job can request.  Assuming this is correct
> you can use the MaxTRESPerNode attribute for a job.  Here's an example of how
> to set this:
> sacctmgr modify user where user=user1 account=account1 set
> maxtrespernode=gres/gpu=2
>
> You could also set this on an account or on a QOS.  There are other MaxTRES
> attributes that might work better for your use case that are described in the
> sacctmgr documentation:
> https://slurm.schedmd.com/sacctmgr.html
>
> I'm not sure what you're referring to with 'gpu-Xn'.  Can you elaborate on
> that, if the MaxTRESPerNode doesn't accomplish what you want?

Whenever possible, we would like to limit the overall number of GPUs
a user can use, and not the number of GPUs per node.

I already tried the command above in the past, but unfortunately
it is not useful for us: we have jobs using a single GPU, others
using 2, 4 or 8 GPUs, and we would like to limit the number of GPUs
per user. The closest workaround I found was limiting the maximum
number of nodes, but since we have (and will have) machines with
different numbers of GPUs per node, that does not fit at all; it
would be nice to limit the number of GPUs a user can use (we have
a small cluster).

> Can you also elaborate on what is wrong with the elasticsearch plugin with ELK?
>  This may be an issue that warrants it's own ticket since it sounds like it
> might be a bug rather than a configuration issue.  I'm happy to look at it and
> if it does look like it should be in a separate ticket I'll let you know.

For that I would need to give you detailed information which I
currently don't have: we used a test instance which does not exist
anymore, so I will need to reproduce the issue again. We did some
tests in September, and it looks like "_doc" was expected in the
index. But we will not start to implement ELK until the end of
Q1 2020, so maybe it would be better to skip that for the moment,
at least until I have the proper infrastructure for reproducing it
again. About this, I would just like to know whether the newer
ElasticSearch versions (v7 and/or v8) have been tested with Slurm
(to me it looks like it worked on previous releases, but not on
the newest ones, due to this "_doc" issue).

> There are ways to make sure that certain users can always run jobs.  Do they
> need to have nodes available immediately or is this the group you mentioned
> that needs jobs to start within an hour?  If you want to use preemption, do you
> have requirements on the types of jobs that can be preempted?  Would any job be
> eligible or would you want users to choose to run jobs where they could
> potentially be preempted?
>
> I would need some more details about the OpenMPI issue you're talking about.
> Which version of OpenMPI are you using?  Does Slurm build correctly with the
> software?  You mention that it is less of an issue when you use srun, but it
> still happens?  Does that depend on the number of jobs on a node or is it a
> random occurrence with a similar setup?

I was digging a bit more, and OpenMPI was compiled with its internal
hwloc library, while Slurm was compiled with the one provided by the
system. Users can use multiple OpenMPI versions from a central
software repository, and it looks like all MPI versions were compiled
in the same wrong way.

This week I recompiled OpenMPI with the same hwloc version I used when
compiling Slurm, and now it works. In addition, of course, I need to
set OMP_PROC_BIND to 'true'. However, I also have to set this with
'srun'. With 'srun' I expected that assigning tasks to different cores
would be the default behaviour, but it looks like it is not. I wonder
if there is a parameter in sbatch/srun which can replace OMP_PROC_BIND
when using 'srun' (we would like to force users to always use 'srun'
instead of 'mpirun').

> When you mark a node as down with a health check script you can have it come
> back up automatically with the ReturnToService parameter.  The default (0)
> requires an admin to resume the node, but you can have it come back up only if
> it were marked down for being non-responsive (1) or have it come back up
> regardless of the reason (2).  You can have the node health check script run as
> a prolog if you need it to check each time a job is about to start.  You will
> just have to account for the possibility that there are other jobs currently on
> the node.

Ok cool, thanks for the hint. I think "ReturnToService=2" + running
NHC in the prolog would work for us. I will try that. If there are
other jobs running on the node and NHC fails, forcing the prolog to
fail would just DRAIN the node, which would be ok.

> When you have a chance to answer some of these questions I'm happy to continue
> to work with you on getting your system configured the way you want.


Thanks a lot,

Marc

_________________________________________________________
Paul Scherrer Institut
High Performance Computing & Emerging Technologies
Marc Caubet Serrabou
Building/Room: OHSA/014
Forschungsstrasse, 111
5232 Villigen PSI
Switzerland

Telephone: +41 56 310 46 67
E-Mail: marc.caubet@psi.ch
Comment 4 Ben Roberts 2019-12-11 16:08:10 MST
Hi Marc,

> Ok thanks Hence, having 2 different accounts was a good approach. I was
> wondering if there was an alternative way but to me this is the best
> solution (easy and direct). GPU usage should be independent from CPU
> accounting, so then it makes sense to use a different account for it.

Correct, having 2 accounts is the approach I would recommend.  


> This problem is seen on single core based jobs, which finish very fast
> (hundreds of jobs running between 0 and 5 seconds). I tried to reproduce
> that with non single core very short jobs (occupying the full node or
> few nodes), and I was not able to reproduce the problem, so it only
> applies to single core based jobs with very short runs. Regarding to
> your question, as far as I saw, no OOM kills.
>
> I am forcing users to pack jobs into bigger ones. Some of these jobs
> were packed already, and since then we see this problem less often.
> However, I wonder if there is an option to prevent such short jobs,
> because 'a priori' I can not prevent users running in that way, and
> would be great to implement something for preventing that. I was
> thinking by adding something like a sleep command in the epilog
> for very short jobs (depending on how long they ran), but this is
> really a terrible workaround :).

For this I would recommend trying the CompleteWait parameter.  It should provide a window of time for existing jobs to finish cleaning up before new jobs are started on the node (without putting a sleep in the epilog script).  Since you're talking about submitting a large number of small jobs it might be worth going over the recommendations we have in the high throughput guide:
https://slurm.schedmd.com/high_throughput.html
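
For illustration, the relevant slurm.conf entries might look like this (the values are placeholders to tune, not recommendations for your exact workload):
CompleteWait=32
SchedulerParameters=batch_sched_delay=10,defer
CompleteWait delays scheduling new jobs onto a node while previous jobs are still completing; the SchedulerParameters shown are among the knobs discussed in the high throughput guide.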


> Whenever possible, we would like to limit the overall number of GPUs
> a user can use, and not the number of GPUs per node.
>
> The command above I already tried it in the past, but unfortunately
> this is not useful at all for us: we have jobs using a single GPU,
> some other using 2, 4 or 8 GPUs, and we would like to limit #GPUs per
> user. The closest workaround I found was to limit max number of nodes,
> but since we have (and will have) machines with different number of
> GPUs per node, it does not fit at all, and would be nice to limit the
> number of GPUs a user can use (we have a small cluster).

There is a parameter you can set on a QOS to limit a TRES on a per user basis, MaxTRESPerUser.  You can define a default QOS (DefaultQOS with sacctmgr) for your users so that this limit is enforced without requiring the users to specify the QOS with the limit when they submit a job.  
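
As an illustrative sketch (the QOS name 'gpu8', the limit of 8 GPUs, and the user name are made-up examples):
sacctmgr add qos gpu8 set maxtresperuser=gres/gpu=8
sacctmgr modify user where user=user1 set qos+=gpu8 defaultqos=gpu8
This caps the total GPUs a user can hold at once, regardless of how the GPUs are spread across nodes.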


> For that I would need to give you detailed information which I
> currently don't have: we used a test instance which is not exiting
> anymore so I will need to reproduce the issue again. We did some tests
> on September and looks like "_doc" was expected in the index. But we
> will not start to implement ELK until the end of Q1 2020, so maybe
> would be better to skip that for the moment, at least until I have the
> proper infrastructure for reproducing that again. About this, I just
> would like to know whether the newer ElasticSearch versions (v7 and/or
> v8) have been tested or not with Slurm (to me looks like it worked on
> previous releases, but not in the newests, due to this "_doc" issue).

I don't see any reports of there being issues with ElasticSearch, but I don't know for sure that newer versions of ElasticSearch were tested.  I'm working on getting a newer version set up just to run a basic functionality test.  I'll let you know what I find.  


> I was digging a bit more, and OpenMPI was compiled with the internal
> hwloc library, while Slurm was compiled with the one provided by the
> system. Users can use multiple OpenMPI versions from a central sotware
> repository, and it looks like all MPI versions were compiled in a
> similar wrong way.
>
> This week I recompiled OpenMPI with the same hwloc version I used when
> compiling Slurm, and now it works. In addition, of course I need to set
> OMP_PROC_BIND to 'true'. However, I have also to set this with 'srun'.
> With 'srun' I expected that assigning tasks to different cores would
> be the default behaviour, but looks like is not. I wonder if there is
> a parameter in sbatch/srun which can replace OMP_PROC_BIND when using
> 'srun' (we would like to force users to always use 'srun' instead of
> 'mpirun').

There is a flag for srun called "--cpu-bind" that will allow you specify if you want to bind to sockets, cores, etc.  You can also adjust the TaskPluginParam in your slurm.conf for this binding to happen by default.  Right now it looks like you have your TaskPluginParam set to "Sched".  You can change this parameter to specify things like sockets or cores, or you can use the "Autobind" option for it to have a fallback option if one of the other methods isn't matched.  
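
For illustration (the application name is hypothetical), the explicit form would be:
srun --cpu-bind=cores ./hybrid_app
and the default could be changed in slurm.conf with, for example:
TaskPluginParam=Cores
so that users get core binding without having to pass the flag themselves.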


> Ok cool, thanks for the hint. I think "ReturnToService=2" + "running NHC
> in prolog" would work for us. I will try that. If I have other jobs
> running in the node, if NHC fails and prolog is forced to fail it would
> just DRAIN the node, which would be ok.

Sounds good.  

Thanks,
Ben
Comment 6 Ben Roberts 2019-12-17 16:05:37 MST
Hi Marc,

Thanks for your patience while I looked into whether there is an issue using ElasticSearch 7.  I was able to verify with a test environment running ElasticSearch 7 that data is being recorded and reported correctly.  If you have more details about the problem you ran into when you have an environment set up again we'll be happy to look into it further with you.  Let me know if you have any additional questions about the information I sent in my previous response.

Thanks,
Ben
Comment 7 Marc Caubet Serrabou 2019-12-18 07:21:45 MST
Dear Ben,


Thanks a lot for checking. We were running v7.3.2, but currently we
cannot test it. Knowing that it should work with this release, we
will try again at the end of Q1 2020 and open a ticket if we see any
problem.

Thanks a lot for your help and for checking that.


Marc

Comment 8 Ben Roberts 2019-12-18 08:42:50 MST
Thanks Marc, I'll close this ticket then and wait to see how things look when you are able to test it again.  

Ben