| Summary: | Merlin6 Slurm Cluster: configuration assistance and recommendations | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Marc Caubet Serrabou <marc.caubet> |
| Component: | Configuration | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | derek.feichtinger |
| Version: | 18.08.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Paul Scherrer | | |
| Attachments: | Slurm configuration files for the Merlin6 cluster | | |
Hi Marc,

Thanks for the detailed description of your environment and situation; that does make it easier to provide suggestions. I'll try to offer suggestions for the problems you're facing.

You state that you would like to have different fairshare calculations for gpu vs non-gpu jobs. Slurm does allow you to create different user/account combinations, known as associations, that are unique entities. If you had users request a different account for gpu jobs, this would make sure that use of one account didn't affect the fairshare calculations for the other account.

You also mention that you have problems allocating memory for very short jobs. I was going to suggest you use MemSpecLimit, but I see that you've already got that defined. Are you seeing anything else in the logs related to these errors, like OOM errors? Are there other jobs on the nodes when you see this issue?

You are able to limit the number of gpu resources each job can use. There are a few ways to limit it, but it sounds like you probably want to limit the number of gpus per node a job can request. Assuming this is correct, you can use the MaxTRESPerNode attribute for a job. Here's an example of how to set this:

    sacctmgr modify user where user=user1 account=account1 set maxtrespernode=gres/gpu=2

You could also set this on an account or on a QOS. There are other MaxTRES attributes that might work better for your use case; they are described in the sacctmgr documentation: https://slurm.schedmd.com/sacctmgr.html

I'm not sure what you're referring to with 'gpu-Xn'. Can you elaborate on that, if MaxTRESPerNode doesn't accomplish what you want?

Can you also elaborate on what is wrong with the elasticsearch plugin with ELK? This may be an issue that warrants its own ticket, since it sounds like it might be a bug rather than a configuration issue. I'm happy to look at it, and if it does look like it should be in a separate ticket I'll let you know.
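As an editorial aside: the two-account approach discussed above could be set up roughly as follows. This is a sketch only — the account names `merlin` and `merlin-gpu` appear in the attached configuration, but `user1` and the sbatch options are illustrative placeholders.

```
# Sketch: a second account so GPU usage is accounted (and fairshare
# charged) separately from CPU usage. 'merlin-gpu' is the account name
# from the attached config; 'user1' is a placeholder.
sacctmgr add account merlin-gpu Description="merlin6 GPU jobs"
sacctmgr add user user1 Account=merlin-gpu

# Users then select the GPU association explicitly at submission time:
#   sbatch --account=merlin-gpu --gres=gpu:2 job.sh
```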
There are ways to make sure that certain users can always run jobs. Do they need to have nodes available immediately, or is this the group you mentioned that needs jobs to start within an hour? If you want to use preemption, do you have requirements on the types of jobs that can be preempted? Would any job be eligible, or would you want users to choose to run jobs where they could potentially be preempted?

I would need some more details about the OpenMPI issue you're talking about. Which version of OpenMPI are you using? Does Slurm build correctly with the software? You mention that it is less of an issue when you use srun, but it still happens? Does that depend on the number of jobs on a node, or is it a random occurrence with a similar setup?

When you mark a node as down with a health check script, you can have it come back up automatically with the ReturnToService parameter. The default (0) requires an admin to resume the node, but you can have it come back up only if it were marked down for being non-responsive (1), or have it come back up regardless of the reason (2). You can have the node health check script run as a prolog if you need it to check each time a job is about to start. You will just have to account for the possibility that there are other jobs currently on the node.

When you have a chance to answer some of these questions, I'm happy to continue to work with you on getting your system configured the way you want.

Thanks,
Ben

Hi Ben,

thanks a lot for your answer and details. You can find my answers below.

> Thanks for the detailed description of your environment and situation, that
> does make it easier to provide suggestions. I'll try to offer suggestions for
> the problems you're facing.
>
> You state that you would like to have different fairshare calculations for gpu
> vs non-gpu jobs. Slurm does allow you to create different user/account
> combinations, known as associations, that are unique entities.
> If you had
> users request a different account for gpu jobs then this would make sure that
> use of one account didn't affect the fairshare calculations for the other
> account.

Ok, thanks. Hence, having 2 different accounts was a good approach. I was wondering if there was an alternative way, but to me this is the best solution (easy and direct). GPU usage should be independent from CPU accounting, so it makes sense to use a different account for it.

> You also mention that you have problems allocating memory for very short jobs.
> I was going to suggest you use MemSpecLimit, but I see that you've already got
> that defined. Are you seeing anything else in the logs related to these
> errors, like OOM errors? Are there other jobs on the nodes when you see this
> issue?

This problem is seen on single-core jobs which finish very fast (hundreds of jobs running between 0 and 5 seconds). I tried to reproduce it with non-single-core very short jobs (occupying the full node or a few nodes) and was not able to, so it only applies to single-core jobs with very short runs. Regarding your question, as far as I saw, no OOM kills.

I am forcing users to pack jobs into bigger ones. Some of these jobs were packed already, and since then we see this problem less often. However, I wonder if there is an option to prevent such short jobs, because a priori I cannot stop users from running jobs in that way, and it would be great to implement something to prevent it. I was thinking of adding something like a sleep command in the epilog for very short jobs (depending on how long they ran), but this is really a terrible workaround :).

> You are able to limit the number of gpu resources each job can use. There are
> a few options of ways to limit it, but it sounds like you probably want to
> limit the number of gpus per node a job can request. Assuming this is correct
> you can use the MaxTRESPerNode attribute for a job.
> Here's an example of how to set this:
>
>     sacctmgr modify user where user=user1 account=account1 set maxtrespernode=gres/gpu=2
>
> You could also set this on an account or on a QOS. There are other MaxTRES
> attributes that might work better for your use case that are described in the
> sacctmgr documentation:
> https://slurm.schedmd.com/sacctmgr.html
>
> I'm not sure what you're referring to with 'gpu-Xn'. Can you elaborate on
> that, if the MaxTRESPerNode doesn't accomplish what you want?

Whenever possible, we would like to limit the overall number of GPUs a user can use, not the number of GPUs per node. I already tried the command above in the past, but unfortunately it is not useful for us: we have jobs using a single GPU, others using 2, 4 or 8 GPUs, and we would like to limit the number of GPUs per user. The closest workaround I found was to limit the maximum number of nodes, but since we have (and will have) machines with different numbers of GPUs per node, it does not fit at all. It would be nice to limit the number of GPUs a user can use (we have a small cluster).

> Can you also elaborate on what is wrong with the elasticsearch plugin with ELK?
> This may be an issue that warrants its own ticket since it sounds like it
> might be a bug rather than a configuration issue. I'm happy to look at it and
> if it does look like it should be in a separate ticket I'll let you know.

For that I would need to give you detailed information which I currently don't have: we used a test instance which does not exist anymore, so I will need to reproduce the issue again. We did some tests in September, and it looks like "_doc" was expected in the index. But we will not start to implement ELK until the end of Q1 2020, so maybe it would be better to skip that for the moment, at least until I have the proper infrastructure for reproducing it again.
About this, I would just like to know whether the newer ElasticSearch versions (v7 and/or v8) have been tested with Slurm (to me it looks like it worked on previous releases, but not on the newest ones, due to this "_doc" issue).

> There are ways to make sure that certain users can always run jobs. Do they
> need to have nodes available immediately or is this the group you mentioned
> that needs jobs to start within an hour? If you want to use preemption, do you
> have requirements on the types of jobs that can be preempted? Would any job be
> eligible or would you want users to choose to run jobs where they could
> potentially be preempted?
>
> I would need some more details about the OpenMPI issue you're talking about.
> Which version of OpenMPI are you using? Does Slurm build correctly with the
> software? You mention that it is less of an issue when you use srun, but it
> still happens? Does that depend on the number of jobs on a node or is it a
> random occurrence with a similar setup?

I was digging a bit more, and OpenMPI was compiled with its internal hwloc library, while Slurm was compiled with the one provided by the system. Users can use multiple OpenMPI versions from a central software repository, and it looks like all MPI versions were compiled in a similarly wrong way.

This week I recompiled OpenMPI with the same hwloc version I used when compiling Slurm, and now it works. In addition, of course, I need to set OMP_PROC_BIND to 'true'. However, I also have to set this with 'srun'. With 'srun' I expected that assigning tasks to different cores would be the default behaviour, but it looks like it is not. I wonder if there is a parameter in sbatch/srun which can replace OMP_PROC_BIND when using 'srun' (we would like to force users to always use 'srun' instead of 'mpirun').

> When you mark a node as down with a health check script you can have it come
> back up automatically with the ReturnToService parameter.
> The default (0)
> requires an admin to resume the node, but you can have it come back up only if
> it were marked down for being non-responsive (1) or have it come back up
> regardless of the reason (2). You can have the node health check script run as
> a prolog if you need it to check each time a job is about to start. You will
> just have to account for the possibility that there are other jobs currently on
> the node.

Ok cool, thanks for the hint. I think "ReturnToService=2" + running NHC in the prolog would work for us. I will try that. If I have other jobs running on the node, and NHC fails and the prolog is forced to fail, it would just DRAIN the node, which would be ok.

> When you have a chance to answer some of these questions I'm happy to continue
> to work with you on getting your system configured the way you want.

Thanks a lot,
Marc

_________________________________________________________
Paul Scherrer Institut
High Performance Computing & Emerging Technologies
Marc Caubet Serrabou
Building/Room: OHSA/014
Forschungsstrasse, 111
5232 Villigen PSI
Switzerland
Telephone: +41 56 310 46 67
E-Mail: marc.caubet@psi.ch
Hi Marc,

> Ok thanks Hence, having 2 different accounts was a good approach. I was
> wondering if there was an alternative way but to me this is the best
> solution (easy and direct). GPU usage should be independent from CPU
> accounting, so then it makes sense to use a different account for it.

Correct, having 2 accounts is the approach I would recommend.

> This problem is seen on single core based jobs, which finish very fast
> (hundreds of jobs running between 0 and 5 seconds). I tried to reproduce
> that with non single core very short jobs (occupying the full node or
> few nodes), and I was not able to reproduce the problem, so it only
> applies to single core based jobs with very short runs. Regarding to
> your question, as far as I saw, no OOM kills.
>
> I am forcing users to pack jobs into bigger ones.
> Some of these jobs
> were packed already, and since then we see this problem less often.
> However, I wonder if there is an option to prevent such short jobs,
> because 'a priori' I can not prevent users running in that way, and
> would be great to implement something for preventing that. I was
> thinking by adding something like a sleep command in the epilog
> for very short jobs (depending on how long they ran), but this is
> really a terrible workaround :).

For this I would recommend trying the CompleteWait parameter. It should provide a window of time for existing jobs to finish cleaning up before new jobs are started on the node (without putting a sleep in the epilog script). Since you're talking about submitting a large number of small jobs, it might be worth going over the recommendations we have in the high throughput guide:
https://slurm.schedmd.com/high_throughput.html

> Whenever possible, we would like to limit the overall number of GPUs
> a user can use, and not the number of GPUs per node.
>
> The command above I already tried it in the past, but unfortunately
> this is not useful at all for us: we have jobs using a single GPU,
> some other using 2, 4 or 8 GPUs, and we would like to limit #GPUs per
> user. The closest workaround I found was to limit max number of nodes,
> but since we have (and will have) machines with different number of
> GPUs per node, it does not fit at all, and would be nice to limit the
> number of GPUs a user can use (we have a small cluster).

There is a parameter you can set on a QOS to limit a TRES on a per-user basis, MaxTRESPerUser. You can define a default QOS (DefaultQOS with sacctmgr) for your users so that this limit is enforced without requiring the users to specify the QOS with the limit when they submit a job.

> For that I would need to give you detailed information which I
> currently don't have: we used a test instance which is not exiting
> anymore so I will need to reproduce the issue again.
> We did some tests
> on September and looks like "_doc" was expected in the index. But we
> will not start to implement ELK until the end of Q1 2020, so maybe
> would be better to skip that for the moment, at least until I have the
> proper infrastructure for reproducing that again. About this, I just
> would like to know whether the newer ElasticSearch versions (v7 and/or
> v8) have been tested or not with Slurm (to me looks like it worked on
> previous releases, but not in the newests, due to this "_doc" issue).

I don't see any reports of there being issues with ElasticSearch, but I don't know for sure that newer versions of ElasticSearch were tested. I'm working on getting a newer version set up just to run a basic functionality test. I'll let you know what I find.

> I was digging a bit more, and OpenMPI was compiled with the internal
> hwloc library, while Slurm was compiled with the one provided by the
> system. Users can use multiple OpenMPI versions from a central software
> repository, and it looks like all MPI versions were compiled in a
> similar wrong way.
>
> This week I recompiled OpenMPI with the same hwloc version I used when
> compiling Slurm, and now it works. In addition, of course I need to set
> OMP_PROC_BIND to 'true'. However, I have also to set this with 'srun'.
> With 'srun' I expected that assigning tasks to different cores would
> be the default behaviour, but looks like is not. I wonder if there is
> a parameter in sbatch/srun which can replace OMP_PROC_BIND when using
> 'srun' (we would like to force users to always use 'srun' instead of
> 'mpirun').

There is a flag for srun called "--cpu-bind" that will allow you to specify whether you want to bind to sockets, cores, etc. You can also adjust the TaskPluginParam in your slurm.conf for this binding to happen by default. Right now it looks like you have your TaskPluginParam set to "Sched".
You can change this parameter to specify things like sockets or cores, or you can use the "Autobind" option to have a fallback if one of the other methods isn't matched.

> Ok cool, thanks for the hint. I think "ReturnToService=2" + "running NHC
> in prolog" would work for us. I will try that. If I have other jobs
> running in the node, if NHC fails and prolog is forced to fail it would
> just DRAIN the node, which would be ok.

Sounds good.

Thanks,
Ben

Hi Marc,

Thanks for your patience while I looked into whether there is an issue using ElasticSearch 7. I was able to verify with a test environment running ElasticSearch 7 that data is being recorded and reported correctly. If you have more details about the problem you ran into when you have an environment set up again, we'll be happy to look into it further with you.

Let me know if you have any additional questions about the information I sent in my previous response.

Thanks,
Ben

Dear Ben,

thanks a lot for checking. We were running v7.3.2, but currently we cannot test it. Knowing that this should work with this release, we will try at the end of Q1 2020 and open a ticket if we see any problem.

Thanks a lot for your help and for checking that.
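As an editorial aside: Ben's MaxTRESPerUser suggestion earlier in the thread might be applied along these lines. This is a sketch only — the QOS name `gpu-user-limit` and the limit of 8 GPUs are illustrative placeholders, not values from the ticket; `merlin-gpu` is the account name from the attached configuration.

```
# Sketch: cap the total number of GPUs a single user may allocate,
# cluster-wide, via a QOS limit. QOS name and the limit of 8 GPUs
# are placeholders.
sacctmgr add qos gpu-user-limit MaxTRESPerUser=gres/gpu=8

# Make it the default QOS for the GPU account so users do not need
# to request it explicitly:
sacctmgr modify account merlin-gpu set QOS=gpu-user-limit DefaultQOS=gpu-user-limit
```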
Marc

Thanks Marc, I'll close this ticket then and wait to see how things look when you are able to test it again.

Ben
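As an editorial aside: the srun binding discussed in the thread above might look as follows. This is a sketch only — the task counts, application name, and the autobind value are illustrative assumptions, not values from the ticket.

```
# Sketch: per-job core binding with srun for a hybrid MPI+OpenMP job.
# Task/CPU counts and the application name are placeholders.
export OMP_PROC_BIND=true
srun --ntasks=8 --cpus-per-task=4 --cpu-bind=cores ./hybrid_app

# Alternatively, a cluster-wide default fallback in slurm.conf
# (illustrative; replaces the current TaskPluginParam=Sched):
#   TaskPluginParam=autobind=cores
```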
Created attachment 12481 [details]
Slurm configuration files for the Merlin6 cluster

Dear SchedMD Support,

I am responsible for one of the main Slurm clusters at PSI, and we have recently requested SchedMD support. I would like to start the first round by attaching our configuration files and getting some advice for improvement and configuration assistance.

This is a federated setup with two clusters, merlin5 (old setup) and merlin6 (new setup):

* 'merlin5' has been running for a few years already; it has a very old configuration and runs on very old hardware, with hyper-threading disabled and CPU as the consumable resource.
* 'merlin6' is the new cluster, with a new configuration using Memory and Core as consumable resources. It contains CPU and GPU machines. Hyper-threading is enabled on the CPU nodes and disabled on the GPU nodes. This is the official new production cluster, where:
  * Basically, all users should have the same fairshare rights, independent of the organizational unit they belong to.
  * Exception: a few special groups have the following requirements, based on their sponsoring of some cluster nodes:
    * On a number of nodes (job slots) equivalent to their sponsoring, they should always get the highest priority, and they should not have to wait longer than 1h for jobs to start.
    * This requirement will come in the near future, and we will need to adapt our configuration accordingly.
  * The cluster has a number of nodes featuring GPUs. Fairshare for the GPU nodes and for the normal CPU nodes should be accounted separately, i.e. somebody who has used the CPU resources should not get a fairshare penalty based on that when submitting a job to the GPU nodes.
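For reference, the merlin6 scheduling model described above (Core and Memory as consumable resources, CPU affinity and cgroups enabled) is typically expressed with slurm.conf lines like the following. This is a sketch; the authoritative values are in the attached slurm.conf, which may differ.

```
# Illustrative slurm.conf fragment; see the attached slurm.conf for
# the real values used on merlin6.
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory   # Core + Memory as consumable resources
TaskPlugin=task/affinity,task/cgroup  # CPU affinity and cgroup enforcement
ProctrackType=proctrack/cgroup
```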
I attach the main files:

- slurm.conf for merlin6
- slurm.prolog (slurmd) for merlin6
- slurm.epilog (slurmd) for merlin6
- slurmctld.prolog for merlin6 (currently just returns 'exit 0'; enabled for further development)
- gres.conf for merlin6
- slurmdbd.conf for merlin6
- accounts.[parsable2|standard].txt ('sacctmgr show assoc' output)
  - Currently only the 'merlin' account is used. 'merlin-gpu' was created for isolating its fairshare from CPU, but never went into real production.
  - The 'meg' account can be ignored.
- qos.[parsable2|standard].txt (output of 'sacctmgr show qos')
- myhwloc.[output|.tar.bz2|xml] (output of 'hwloc-gather-topology /tmp/myhwloc' for one of the CPU nodes)

Main items:

- Server configuration:
  - 2 slurmdbd in active/passive mode - merlin-slurmctld0[1,2]
  - 2 slurmctld in active/passive mode - merlin-slurmctld0[1,2]
  - 1 single MariaDB instance on a dedicated machine - merlin-slurmdb01
- Hyper-threading enabled:
  - Submitted jobs should use physical cores by default, and only use 2 threads per core if explicitly requested by the user.
  - Different jobs cannot land on the same physical core.
  - Different jobs cannot land on the same logical CPU (core thread).
- CPU affinity and cgroups enabled.
- Need for running interactive and X11-based jobs.

Pending items:

- Upgrading to Slurm 19.05.4 with PMIx support:
  - The repository is ready and the version has been tested at PSI.
  - The upgrade will be performed in the upcoming weeks.
- Problems with very short jobs ("cgroup cannot allocate memory" errors). How can we prevent that in an efficient way?
- Limit GPU resources by 'maximum allowed GRES resources' instead of 'maximum allowed nodes'.
  - Is a QoS 'gpu-Xn' a good way of limiting GPU resources?
- ELK integration:
  - The 'elasticsearch' plugin is not working with the latest ELK.
- Allow specific jobs from specific users to always run in the cluster (preemption? reservations?). We would need advice on this.
- Problems with hybrid jobs (OpenMPI+OpenMP): tasks are not correctly assigned to cores (hwloc issue?):
  - The problem is softer with 'srun', while it is severe with OpenMPI 'mpirun'.
  - Code compiled with gcc/openmpi suffers from it.
  - Code compiled with intel/impi usually runs smoothly with 'srun'.
- Introducing NHC:
  - Auto-recovery when a node is rebooted.
  - Detect configuration issues and drain the node when detected (a check also possible from the prolog).
- BeeOND (BeeGFS On Demand) for generating shared scratch disks on demand:
  - Needs to be done at the prolog step.

For some of the pending items where we would need some help, would it be better to open a single ticket for each? I would not open them all at once, as I first want to investigate some of them (NHC, BeeOND) a bit more by myself.

Thanks a lot and best regards,
Marc
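The NHC and short-job items above map onto a handful of slurm.conf parameters discussed elsewhere in this ticket. The following is a sketch only; the NHC path, check interval, and CompleteWait value are illustrative assumptions, not values from the attached configuration.

```
# Illustrative slurm.conf fragment for the pending items above.
# Paths and numeric values are placeholders.
ReturnToService=2               # downed nodes return automatically when healthy
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300         # run NHC periodically (seconds)
Prolog=/etc/slurm/slurm.prolog  # the prolog can also invoke NHC per job
CompleteWait=32                 # give very short jobs time to finish cleanup
```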