Ticket 10652

Summary: How to pack jobs so that each uses only some of the GPUs in a node?
Product: Slurm
Reporter: Alexis <Alexis.Espinosa>
Component: Scheduling
Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
CC: kevin.buckley
Version: 20.02.5
Hardware: Linux
OS: Linux
Site: Pawsey
Linux Distro: CentOS
Machine Name: topaz

Description Alexis 2021-01-19 00:09:22 MST
We have nodes with 4 GPUs each. We want to pack 8 jobs onto two nodes, with each subjob using only 1 GPU; basically, we want to run 4 jobs within each node, each using 1 GPU.

So far I have succeeded using this script:

```
#!/bin/bash --login

#SBATCH --partition=nvlinkq-dev
#SBATCH --nodes=2 #to request 2 nodes
#SBATCH --gres=gpu:4 #to request 4 gpus per node 
#SBATCH --ntasks=8 #to request 8 tasks
#SBATCH --ntasks-per-node=4 #to request 4 tasks per node
#SBATCH --time=00:10:00
#SBATCH --account=pawsey0001
#SBATCH --export=NONE

module load cascadelake
module rm gcc
module load gcc/9.2.0

# Launch 8 instances. 4 on each node, each one using 1 gpu

srun -u -N 1 -n 1 --mem=0 --gres=gpu:1 --exclusive ./laplace_mp0 4000 >log.step0 2>&1 &
srun -u -N 1 -n 1 --mem=0 --gres=gpu:1 --exclusive ./laplace_mp1 4000 >log.step1 2>&1 &
srun -u -N 1 -n 1 --mem=0 --gres=gpu:1 --exclusive ./laplace_mp2 4000 >log.step2 2>&1 &
srun -u -N 1 -n 1 --mem=0 --gres=gpu:1 --exclusive ./laplace_mp3 4000 >log.step3 2>&1 &
srun -u -N 1 -n 1 --mem=0 --gres=gpu:1 --exclusive ./laplace_mp4 4000 >log.step4 2>&1 &
srun -u -N 1 -n 1 --mem=0 --gres=gpu:1 --exclusive ./laplace_mp5 4000 >log.step5 2>&1 &
srun -u -N 1 -n 1 --mem=0 --gres=gpu:1 --exclusive ./laplace_mp6 4000 >log.step6 2>&1 &
srun -u -N 1 -n 1 --mem=0 --gres=gpu:1 --exclusive ./laplace_mp7 4000 >log.step7 2>&1 &
wait
```
This one above works fine.


But after reading your documentation, I thought I would be able to use the `--gpu-bind=single` option for this like:

```
srun --gpu-bind=single ./wrapper
```
instead of the 8 srun lines in the working example above. The `wrapper` file would, of course, invoke the right executable based on $SLURM_PROCID.

But my attempt to use --gpu-bind gives me an error:

```
srun: error: Invalid --gpu-bind argument: gpu:single
```

So, 3 questions here:

1) Is there a way of packing the jobs (maybe with --gpu-bind or --cpus-per-gpu or other) which I can use as an alternative to the first example above?
2) If yes in 1), then how should I define the `wrapper` file?
3) If no in 1), then can you give an example beyond what is written in the documentation for the use of --gpu-bind and --cpus-per-gpu options?

Thanks a lot,
Alexis
Comment 1 Michael Hinton 2021-01-19 15:08:37 MST
--gpu-bind=single is new to 20.11, so that's why it's not working for you in 20.02. Also, --gpu-bind=single has a simplified placement algorithm that doesn't work as well with multiple sockets, so it still might not fit your needs, depending on your CPU topology and CPU-GPU affinity layout.

1) Yes. As an alternative to --gpu-bind=single, you could use the map_gpu or mask_gpu options for --gpu-bind, and that should let you precisely match which GPU(s) you want to be accessible to which task.

2) I'm not quite sure what you are trying to do with the wrapper file. What is wrong with how your batch script is set up? I would just add `--gpu-bind=...` to each of those sruns.

When using --gpu-bind, note that it simply sets CUDA_VISIBLE_DEVICES (you could even skip --gpu-bind and set CUDA_VISIBLE_DEVICES yourself for each task to get similar results). So that is the environment variable to check inside the task when verifying that --gpu-bind is working as expected.
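As a sketch of the manual alternative (placed inside the wrapper; the modulo mapping is my assumption, not anything Slurm prescribes):

```shell
# Emulate one-GPU-per-task binding by hand, on a 4-GPU node.
# SLURM_PROCID is the task rank that srun exports to each task;
# ranks 0-3 land on node 0, ranks 4-7 on node 1, so rank % 4
# gives each task a distinct local GPU on its node.
export CUDA_VISIBLE_DEVICES=$(( SLURM_PROCID % 4 ))
echo "task ${SLURM_PROCID} uses GPU ${CUDA_VISIBLE_DEVICES}"
```

This produces the same per-task visibility that --gpu-bind=map_gpu:0,1,2,3 would, just set explicitly in the shell.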
Comment 2 Alexis 2021-01-19 21:27:02 MST
Thanks a lot Michael.


Your suggestion:
```
#SBATCH --gpu-bind=map_gpu:0,1,2,3
.
.
.
srun ./wrapper.sh
```
is working great!
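For reference, the full batch script this implies might look like the following (a sketch only, reusing the partition, account, and executables from the first example):

```shell
#!/bin/bash --login

#SBATCH --partition=nvlinkq-dev
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --gpu-bind=map_gpu:0,1,2,3
#SBATCH --time=00:10:00
#SBATCH --account=pawsey0001
#SBATCH --export=NONE

# One srun replaces the 8 backgrounded sruns; each task gets one GPU
# via map_gpu and picks its executable from SLURM_PROCID in the wrapper.
srun ./wrapper.sh
```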

Now, I have a further question for future use:
1) What if I want to set two gpus per task? Can I still use `map_gpu` or any other slurm option?

I tried this but failed:
```
#SBATCH --gpu-bind=map_gpu:0-1,2-3
.
.
.
srun ./wrapper.sh
```

2) Or should I then come back to the native use of `CUDA_VISIBLE_DEVICES=0,1` settings?

Thank you very much,
Alexis
Comment 3 Michael Hinton 2021-01-20 10:49:53 MST
Glad it's working great!

(In reply to Alexis from comment #2)
> Now, I have a further question for future use:
> 1) What if I want to set two gpus per task? Can I still use `map_gpu` or any
> other slurm option?
> 
> I tried this but failed:
> ```
> #SBATCH --gpu-bind=map_gpu:0-1,2-3
> .
> .
> .
> srun ./wrapper.sh
> ```
> 
> 2) Or should I then come back to the native use of
> `CUDA_VISIBLE_DEVICES=0,1` settings?
So what you are looking for is mask_gpu. map_gpu only supports one GPU per task, as a convenience. If you use mask_gpu, however, you can get multiple GPUs per task, as long as you set the mask to cover multiple bits.

So to get task 0 to use GPUs 0-1, and task 1 to use GPUs 2-3, do something like this:

#SBATCH --gpu-bind=mask_gpu:0x3,0xC

That creates binary masks 0011 (0x3) and 1100 (0xC) (assuming 4 total GPUs).
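One way to build such a hex mask from a list of GPU indices (a plain shell sketch, not a Slurm feature):

```shell
# Build a mask_gpu-style hex mask covering the given GPU indices:
# set bit g for each GPU index g, then print the result in hex.
gpu_mask() {
  local mask=0 g
  for g in "$@"; do
    mask=$(( mask | (1 << g) ))
  done
  printf '0x%X' "$mask"
}

gpu_mask 0 1; echo   # GPUs 0-1 -> 0x3
gpu_mask 2 3; echo   # GPUs 2-3 -> 0xC
```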

Let me know if that works for you.

Thanks!
-Michael
Comment 4 Alexis 2021-01-21 00:02:06 MST
Thanks a lot Michael,

Yes, that is working perfectly. Just a suggestion here:

1) could you add the option to use a binary mask? I think binary masks (1100,0011) are much clearer than hexadecimal masks.

And I have a final question related to all this (final I think):

2) When following the path of defining the CUDA_VISIBLE_DEVICES, I need to use a variable that tells me how many GPUs are originally available (allocated) per node. I would like to use SLURM_GPUS_PER_NODE, but that variable is not set because I did not set it with:
```
#SBATCH --gpus-per-node=4
```
in the header.
What I'm using to assign the gpus per node to the job is:
```
#SBATCH --gres=gpu:4
```
but then, what is the variable that holds that number? How can I query it in the wrapper?

Thanks a lot,
Alexis
Comment 5 Michael Hinton 2021-01-21 09:24:44 MST
(In reply to Alexis from comment #4)
> Thanks a lot Michael,
> 
> Yes, that is working perfectly. Just a suggestion here:
> 
> 1) could you add the chance to use a binary mask? I think, it is much more
> clear to use the binary masks: 1100,0011 than the use of the hexadecimal
> numbered masks.
It could make sense to have mask_gpu accept a binary mask, especially since most nodes have 4 or fewer GPUs. But that would require a separate enhancement request ticket, and it likely wouldn't get looked at unless it was sponsored, since we already have a lot on our plates. We are also open to contributions, if your team wanted to develop this.
 
> And I have a final question related to all this (final I think):
> 
> 2) When following the path of defining the CUDA_VISIBLE_DEVICES, I need to
> use a variable that tells me how many GPUs are originally available
> (allocated) per node. I would like to use SLURM_GPUS_PER_NODE, but that
> variable is not set because I did not set it with:
> ```
> #SBATCH --gpus-per-node=4
> ```
> in the header.
> What I'm using to assign the gpus per node to the job is:
> ```
> #SBATCH --gres=gpu:4
> ```
> but then, What is the variable that keeps that number? How can I query that
> number in the wrapper?
Why do you need to query it? You already know it's guaranteed to be 4 GPUs. Just pass that number into your wrapper script as an argument or through an env var.

However, if you want to know what GPU IDs you are allocated, take a look at SLURM_JOB_GPUS, SLURM_STEP_GPUS, and CUDA_VISIBLE_DEVICES or GPU_DEVICE_ORDINAL env vars. The first two should tell you which global GPU IDs are used for the job or step, respectively, while the last two (which are the same) tell you the local GPU IDs within the current cgroup.
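For example, the per-node GPU count could be recovered inside the wrapper by counting the IDs in SLURM_JOB_GPUS (a sketch; assumes SLURM_JOB_GPUS is set, which it is under a --gres=gpu allocation):

```shell
# Count the allocated GPUs by splitting the comma-separated
# global GPU IDs that Slurm places in SLURM_JOB_GPUS (e.g. "0,1,2,3").
IFS=',' read -r -a gpu_ids <<< "${SLURM_JOB_GPUS:-}"
ngpus=${#gpu_ids[@]}
echo "GPUs allocated on this node: ${ngpus}"
```

With --gres=gpu:4 this would report 4, without needing --gpus-per-node in the header.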

Thanks!
-Michael
Comment 6 Alexis 2021-01-27 16:47:11 MST
Thank you very much Michael,

You have been very helpful!

I think we are done with this.
Comment 7 Michael Hinton 2021-01-27 16:59:29 MST
(In reply to Alexis from comment #6)
> Thank you very much Michael,
> 
> You have been very helpful!
> 
> I think we are done with this.
Excellent! Glad I could help :) I'll go ahead and close this out, then.

-Michael