Ticket 7569 - Slurm configurations with gpu and cpu nodes
Summary: Slurm configurations with gpu and cpu nodes
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 18.08.7
Hardware: Linux
Severity: 6 - No support contract
Assignee: Director of Support
 
Reported: 2019-08-13 13:44 MDT by Wei Feinstein
Modified: 2019-11-25 16:52 MST

See Also:
Site: LBNL - Lawrence Berkeley National Laboratory



Description Wei Feinstein 2019-08-13 13:44:32 MDT
Need help in configuring the following set of nodes:

I have the following setup -
slurm.conf (key settings)
 
GresTypes=gpu

SelectType=select/cons_res
SelectTypeParameters=CR_CPU_MEMORY
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup
...
NodeName=n0[000-023].abc[0]  NodeAddr=10.0.0.[11-34] CPUs=24  Sockets=2 CoresPerSocket=12  Feature=abc_cpu  Weight=1  # abc cpu nodes
NodeName=n0[024-027].abc[0]  NodeAddr=10.0.0.[35-38] CPUs=16  Sockets=2 CoresPerSocket=8   Feature=abc_gpu  Gres=gpu:3  Weight=3  # abc gpu nodes

PartitionName=abc     Nodes=n00[00-27].abc[0]   Oversubscribe=Yes    DefMemPerNode=512000


gres.conf  - 
Nodename=n0024.abc[0]  Name=gpu Type=titan Count=3
Nodename=n0025.abc[0]  Name=gpu Type=titan Count=3
Nodename=n0026.abc[0]  Name=gpu Type=titan Count=3
Nodename=n0027.abc[0]  Name=gpu Type=titan Count=3

Srun tests:

 
srun --nodes=20  --pty --account=scs --partition=abc  --qos=normal /bin/bash 
---> 20 CPU only nodes were allocated to my job

srun --gres=gpu:3 -n15 --nodes=3  --pty --account=scs --partition=abc  --qos=normal  /bin/bash
--->  3 GPU nodes were allocated to my job

srun --nodes=20 --constraint="[abc_gpu*3&abc_cpu*17]"  --pty --account=scs --partition=abc  --qos=normal  /bin/bash
----> 3 GPU nodes and 17 CPU nodes were allocated to my job

This is a shared cluster with only 28 nodes and a few users. We don't want jobs to run exclusively on a node; we want to cap the resources used per node and control the GPU:CPU ratio. What option can I provide in the last request so that a set number of tasks can run per node without taking the entire node? I also want to be able to select the GPU-to-CPU ratio on the GPU nodes.
 

Thanks

Jackie
Comment 1 Michael Hinton 2019-08-13 17:20:05 MDT
Hi Jackie,

(In reply to Jacqueline Scoggins from comment #0)
> Shared resources and a cluster with only 28 nodes and a few users. Don't
> want to run exclusively on the node but want to run a set number of
> resources per node and per GPU:CPU.
Your select type parameters indicate that you are scheduling CPUs, not entire nodes, so you are good there.

What’s your use case for Oversubscribe=Yes?

I’m not sure what you mean by "run a set number of resources per node and per GPU:CPU." Are you trying to reserve a certain number of CPUs for each GPU, so that CPU-only jobs don’t starve other GPU jobs of CPUs? If so, you can create two overlapping partitions with the same nodes, call one “cpus” and the other “gpus,” and then set MaxCPUsPerNode for the cpus partition to make sure CPU-only jobs don’t take all the CPUs on the node. See https://slurm.schedmd.com/slurm.conf.html#OPT_MaxCPUsPerNode. It’s similar to how you are using features, but with partitions instead.
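As a sketch of that layout (partition names and the CPU cap below are made-up examples for this 24/16-CPU node mix, not a tested configuration):

```
# Two partitions over the same nodes; CPU-only jobs go to "cpus".
# MaxCPUsPerNode caps CPU-only jobs so some CPUs stay free for GPU
# jobs on the 16-CPU gpu nodes (numbers are illustrative only).
PartitionName=cpus  Nodes=n00[00-27].abc[0]  MaxCPUsPerNode=11
PartitionName=gpus  Nodes=n00[24-27].abc[0]  Default=NO
```

GPU jobs would then submit with -p gpus and always find at least a few CPUs free on each gpu node.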

> What option can I provide in the last
> request to allow for a certain number of tasks per node to be able to run
> without taking the entire node?
The default is one task per node, so the entire node shouldn’t be taken. You can adjust the # of tasks per node by adjusting --ntasks-per-node, or by specifying -n and -N, so you get (n/N) tasks per node. You can then look at the SLURM_TASKS_PER_NODE env var in the task to see what tasks per node actually is.
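A quick sketch of that in practice (account/partition names are the ones from this ticket; this obviously requires a running Slurm cluster, so it is illustrative only):

```shell
# 15 tasks over 3 nodes: Slurm plans roughly n/N = 5 tasks per node.
srun -n15 -N3 --partition=abc --account=scs --pty /bin/bash

# Inside the allocation, check the layout Slurm actually chose:
echo "$SLURM_TASKS_PER_NODE"   # a value like "5(x3)" when tasks divide evenly
```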

> I want to also be able to select gpu to cpus
> on the gpu nodes?
Slurm 19.05 introduced --cpus-per-gpu. But in 18.08, I don’t believe there is a good way to do this. You might be able to do something similar using a job submit plugin and altering gres and cpu counts of job submissions.

Feel free to elaborate if I am not understanding you correctly.

Thanks!
-Michael
Comment 2 Wei Feinstein 2019-08-13 19:04:41 MDT
See below

On Tue, Aug 13, 2019 at 4:20 PM <bugs@schedmd.com> wrote:

> *Comment #1 on bug 7569 from Michael Hinton <hinton@schedmd.com>*
>
> Hi Jackie,
>
> (In reply to Jacqueline Scoggins from comment #0)
> > Shared resources and a cluster with only 28 nodes and a few users. Don't
> > want to run exclusively on the node but want to run a set number of
> > resources per node and per GPU:CPU.
> Your select type parameters indicate that you are scheduling CPUs, not entire
> nodes, so you are good there.
>
>
I found that if I did not specify memory, the job was taking up the entire
node. So I had to add --mem=xxx to avoid taking all of the node, because I
have DefMemPerNode set in the partition. Should users have to request the
amount of memory they want for their job?

> What’s your use case for Oversubscribe=Yes?
>
They want to have the nodes shared by all users. Idle resources can
become available to users' jobs even if another user is working on the node.

> I’m not sure what you mean by "run a set number of resources per node and per
> GPU:CPU." Are you trying to reserve a certain number of CPUs for each GPU, so
> that CPU-only jobs don’t starve other GPU jobs of CPUs? If so, you can create
> two overlapping partitions with the same nodes, call one “cpus” and the other
> “gpus,” and then set MaxCPUsPerNode for the cpus partition to make sure
> CPU-only jobs don’t take all the CPUs on the node. See https://slurm.schedmd.com/slurm.conf.html#OPT_MaxCPUsPerNode. It’s similar to
> how you are using features, but with partitions instead.
>
But they want the ability to run across all of the nodes and/or a mix.
Creating 2 overlapping partitions will do this, but will it confuse a user
who sees resources in the partition taken that were not requested from within
the partition? I don't want anything configured that confuses users or
generates emails to us about a misunderstanding.

> What option can I provide in the last
> > request to allow for a certain number of tasks per node to be able to run
> > without taking the entire node?
> The default is one task per node, so the entire node shouldn’t be taken. You
> can adjust the # of tasks per node by adjusting --ntasks-per-node, or by
> specifying -n and -N, so you get (n/N) tasks per node. You can then look at the
> SLURM_TASKS_PER_NODE env var in the task to see what tasks per node actually
> is.
>
I tried using --ntasks-per-node together with
--constraint="[abc_gpu*3&abc_cpu*17]" and it complained:
"Requested node configuration is not available. Unable to allocate
resources." This is the type of control I want to have. Say a user needs
a set of CPU nodes and a set of GPU nodes, and wants the GPU nodes to
have only a 1 GPU to 5 CPU ratio. If they were only requesting GPUs it
would be easy to say -n5 --gres=gpu:1, but since we are asking for a mix
of node types, adding -n or --gres will not work for the CPU-type nodes. If
I have one partition, how can I request the specifics per node type
(GPU or CPU)?

> > I want to also be able to select gpu to cpus
> > on the gpu nodes?
> Slurm 19.05 introduced --cpus-per-gpu. But in 18.08, I don’t believe there is a
> good way to do this. You might be able to do something similar using a job
> submit plugin and altering gres and cpu counts of job submissions.
>
Oh bummer.  I thought I would be able to do this in 18.08.

> Feel free to elaborate if I am not understanding you correctly.
>
> Thanks!
> -Michael
Comment 3 Michael Hinton 2019-11-25 12:02:04 MST
Hi Jackie, sorry for the super long delay.

(In reply to Jacqueline Scoggins from comment #2)
> I found that if I did not specify mem that was taking up the entire node.
> So I had to add --mem=xxx to avoid taking all of the node because I have
> DefMemPerNode set in the Partition.   Should users have to request the
> amount of memory they want for their job?
DefMemPerNode/DefMemPerCPU should supply the default for a job so a user doesn’t need to specify --mem. I would use DefMemPerCPU, since you want to schedule CPUs, not entire nodes, and DefMemPerCPU scales with the CPU request.
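For example (the 8000 MB figure is purely an illustration, not a recommendation), the partition line could swap the per-node default for a per-CPU one:

```
# Per-CPU memory default scales with the size of the CPU request:
PartitionName=abc  Nodes=n00[00-27].abc[0]  DefMemPerCPU=8000
```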

> > What’s your use case for Oversubscribe=Yes?
> They want to have the nodes shared by all users.  Idle resources can
> become available to users jobs even if another user is working on the node.
Then I’m not certain you want Oversubscribe=Yes, especially since I learned that you have only 3 users on that cluster. Nodes are already shareable among multiple users by default, as long as you are using the select/cons_res or select/cons_tres plugins and don't set ExclusiveUser or OverSubscribe=EXCLUSIVE.

What Oversubscribe=Yes does is allow users to share *CPUs*, not nodes. This is a big difference. When sharing CPUs, different jobs can run on the *same* CPU and rely on the OS to timeslice the programs. This drastically reduces performance (because the OS has to context switch between the jobs on the CPU), taking the “HP” out of “HPC.” Usually you would only want to Oversubscribe CPUs if you are trying to do high-throughput computing.

By removing Oversubscribe=Yes, users will still share nodes, but get exclusive access to each CPU they get from the node. My recommendation is to get rid of OverSubscribe and simplify your life unless you have a use case for it. :)

If you are adamant on using Oversubscribe=Yes, then at least set MaxMemPerCPU on the partition with the following equation:

    MaxMemPerCPU = (RealMemory of node / # of CPUs) / (Max # of jobs per CPU)

Then change DefMemPerCPU on the partition to be <= MaxMemPerCPU. Make sure to also do:

    Oversubscribe=Yes:<Max # of jobs per CPU>

By default, Oversubscribe=Yes means Oversubscribe=Yes:4, which means up to 4 jobs can share the same CPU.
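Plugging made-up numbers into the formula above (a hypothetical node with 192000 MB of RealMemory and 24 CPUs, allowing up to 4 jobs per CPU), it works out like this:

```shell
# Hypothetical node: 192000 MB RealMemory, 24 CPUs, OverSubscribe=Yes:4
REALMEM=192000
CPUS=24
JOBS_PER_CPU=4

MEM_PER_CPU=$((REALMEM / CPUS))               # memory behind each CPU
MAXMEMPERCPU=$((MEM_PER_CPU / JOBS_PER_CPU))  # split across jobs sharing the CPU

echo "MaxMemPerCPU=$MAXMEMPERCPU"
```

So this example partition would carry MaxMemPerCPU=2000, with DefMemPerCPU set at or below it.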

I’ll look into your other questions and get back to you soon. You brought up some good points that I’ll have to research a bit.

Thanks,
Michael
Comment 4 Michael Hinton 2019-11-25 14:23:50 MST
(In reply to Michael Hinton from comment #3)
> Usually you would only want to
> Oversubscribe CPUs if you are trying to do high-throughput computing.
Correction: high-throughput computing has nothing to do with it. The only good use case we can think of for this is if you had a super long-running, high priority job that takes up all resources, but wanted to periodically fit in some smaller, lower-priority jobs.

Also, you would only want Oversubscribe if you also set up Slurm's Gang Scheduling + Suspend features (and the jobs were able to handle SIGSTOP). Gang Scheduling + Suspend will give Slurm the context switching responsibility instead of the OS, which will reduce the performance impact of Oversubscribe. But it's still slower than no Oversubscribe.
Comment 5 Wei Feinstein 2019-11-25 16:05:09 MST
I have already taken care of this problem and fixed the issue we had.
My solution was as follows:

1. For shared resources, I set DefMemPerCPU to a portion of the memory on the
node. Users request the memory they want per CPU if they want to use more
than the default. The resources are shared among the users because it's a
shared system with minimal users.
2. They have a mix of GPU and CPU nodes and we are using one partition.
The GPUs are configured in the gres.conf file and the limit is set to 3 GPUs
per node. And we have asked the users to make the following request for
their jobs -
If you need 2 GPUs you can do --gres=gpu:2  --cpus-per-task=2 -n 10

If you would like to take the entire node use --exclusive on command line
or in the batch script.
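As a batch-script sketch of that request (the program name is a placeholder; account/QOS options are omitted since they are site-specific):

```shell
#!/bin/bash
#SBATCH --partition=abc
#SBATCH --gres=gpu:2          # 2 of the 3 GPUs on a gpu node
#SBATCH --cpus-per-task=2
#SBATCH --ntasks=10
# (use --exclusive instead if the whole node is needed)

srun ./my_gpu_program         # placeholder for the real application
```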


I think we have things set up as intended.

Thanks for your help. This ticket can be closed.


Jackie


On Mon, Nov 25, 2019 at 1:24 PM <bugs@schedmd.com> wrote:

> *Comment #4 on bug 7569 from Michael Hinton <hinton@schedmd.com>*
>
> (In reply to Michael Hinton from comment #3)
> > Usually you would only want to
> > Oversubscribe CPUs if you are trying to do high-throughput computing.
> Correction: high-throughput computing has nothing to do with it. The only good
> use case we can think of for this is if you had a super long-running, high
> priority job that takes up all resources, but wanted to periodically fit in
> some smaller, lower-priority jobs.
>
> Also, you would only want Oversubscribe if you also set up Slurm's Gang
> Scheduling + Suspend features (and the jobs were able to handle SIGSTOP). Gang
> Scheduling + Suspend will give Slurm the context switching responsibility
> instead of the OS, which will reduce the performance impact of Oversubscribe.
> But it's still slower than no Oversubscribe.
Comment 6 Michael Hinton 2019-11-25 16:52:06 MST
Excellent! Closing out ticket. Feel free to reopen or open a new bug if you have any follow up questions.

-Michael