Ticket 12901 - Unable to allocate MPS jobs beyond a single GPU
Summary: Unable to allocate MPS jobs beyond a single GPU
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 21.08.4
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Albert Gil
 
Reported: 2021-11-22 17:32 MST by Michael Robbert
Modified: 2021-12-10 09:29 MST

Site: Colorado School of Mines


Attachments
slurm.conf (4.12 KB, text/plain), 2021-11-22 17:32 MST, Michael Robbert
cgroup.conf (239 bytes, text/x-matlab), 2021-11-22 17:33 MST, Michael Robbert
gres.conf (202 bytes, text/plain), 2021-11-22 17:33 MST, Michael Robbert

Description Michael Robbert 2021-11-22 17:32:53 MST
Created attachment 22367 [details]
slurm.conf

We are implementing NVIDIA MPS and in my initial testing I have found that I can submit multiple jobs that will all use a single GPU, but as soon as the MPS count goes above 100% of a node any additional job will get queued with a status of Resources. 
An example:
All of our nodes have 4 GPUs and I've set an MPS count of 400 on each node. I can submit one job with "salloc --nodelist=g005 -pgpu --gres=mps:100" and any other job that I try to send to that node will be queued. I then submitted a job with "salloc --nodelist=g005 -pgpu --gres=mps:50" and was able to get 2 of those running, but a third job got stuck in the queue.

Config files attached.
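(For readers without the attachments, a minimal sketch of the kind of configuration being described; device paths and the node name here are assumptions, not taken from the attached files. Slurm distributes a single `Name=mps Count=N` line evenly across the node's GPUs, so 400 on a 4-GPU node is 100 per GPU.)

```
# gres.conf (sketch): 4 GPUs, each also shareable via MPS
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3
Name=mps Count=400

# slurm.conf (sketch): matching node definition
GresTypes=gpu,mps
NodeName=g005 Gres=gpu:4,mps:400
```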
Comment 1 Michael Robbert 2021-11-22 17:33:20 MST
Created attachment 22368 [details]
cgroup.conf
Comment 2 Michael Robbert 2021-11-22 17:33:36 MST
Created attachment 22369 [details]
gres.conf
Comment 4 Michael Robbert 2021-11-23 16:21:56 MST
This may warrant a separate ticket, but a related issue I just found while working on our Prolog script is that the variable SLURM_JOB_GPUS doesn't appear to be available in the Prolog. I believe that I need it in order to start up per-GPU nvidia-cuda-mps-control instances.
Comment 7 Albert Gil 2021-11-24 07:41:51 MST
Hi Michael,

> This may warrant a separate ticket, but a related issue I just found while
> working on our Prolog script is that the variable SLURM_JOB_GPUS doesn't
> appear to be available to in the Prolog. I believe that I need that in order
> to start up per GPU nvidia-cuda-mps-control instances.

Are you using 21.08?
This was a known issue in older versions, but it was fixed for 21.08.
See this in the NEWS file:
- https://github.com/SchedMD/slurm/blob/slurm-21-08-4-1/NEWS#L242
Or this commit:
- https://github.com/SchedMD/slurm/commit/f6e8d16dd57a45b52e8027fd01ebbf7fce6b9676


> All of our nodes have 4 GPUs and I've set an MPS count of 400 on each node.
> I can submit one job with "salloc --nodelist=g005 -pgpu --gres=mps:100" and
> any other job that I try to send to that node with be queued. I then
> submitted a job with "salloc --nodelist=g005 -pgpu --gres=mps:50" and was
> able to get 2 of those running, but a third job got stuck in the queue. 

I've been able to reproduce this behavior.
Let me look deeper and I'll come back to you.

Regards,
Albert
Comment 9 Michael Robbert 2021-11-24 08:03:14 MST
(In reply to Albert Gil from comment #7)
> Hi Michael,
> 
> > This may warrant a separate ticket, but a related issue I just found while
> > working on our Prolog script is that the variable SLURM_JOB_GPUS doesn't
> > appear to be available to in the Prolog. I believe that I need that in order
> > to start up per GPU nvidia-cuda-mps-control instances.
> 
> Are you using 21.08?
> This was a known issue in older versions but is was fixed for 21.08.
> See this in the NEWS file:
> - https://github.com/SchedMD/slurm/blob/slurm-21-08-4-1/NEWS#L242
> Or this commit:
> - https://github.com/SchedMD/slurm/commit/f6e8d16dd57a45b52e8027fd01ebbf7fce6b9676
> 
> 
> > All of our nodes have 4 GPUs and I've set an MPS count of 400 on each node.
> > I can submit one job with "salloc --nodelist=g005 -pgpu --gres=mps:100" and
> > any other job that I try to send to that node with be queued. I then
> > submitted a job with "salloc --nodelist=g005 -pgpu --gres=mps:50" and was
> > able to get 2 of those running, but a third job got stuck in the queue. 
> 
> I've been able to reproduce this behavior.
> Let me look deeper and I'll come back to you.
> 
> Regards,
> Albert

Yes, we are currently running 21.08.4. It looks like that commit only added SLURM_JOB_GPUS for gres=gpu, but I also need it for gres=mps.

Mike
Comment 10 Albert Gil 2021-11-24 10:04:26 MST
Hi Mike,

> > > All of our nodes have 4 GPUs and I've set an MPS count of 400 on each node.
> > > I can submit one job with "salloc --nodelist=g005 -pgpu --gres=mps:100" and
> > > any other job that I try to send to that node with be queued. I then
> > > submitted a job with "salloc --nodelist=g005 -pgpu --gres=mps:50" and was
> > > able to get 2 of those running, but a third job got stuck in the queue. 
> > 
> > I've been able to reproduce this behavior.
> > Let me look deeper and I'll come back to you.

It turns out that this is a known limitation: only one GPU will be usable in MPS mode per node, at a given time.
You can have multiple GPUs configured with MPS and any of them can be allocated with --gres=mps, but only 1 at a time. The others can still be allocated with --gres=gpu.


> Yes, we are currently running 21.08.4. It looks like that commit only added
> SLURM_JOB_GPUS for gres=gpu, but I also need it for gres=mps.

It seems that you are right.
I'll work on it and I'll come back to you.

Regards,
Albert
Comment 12 Michael Robbert 2021-11-24 10:45:54 MST
(In reply to Albert Gil from comment #10)
> Hi Mike,
> 
> > > > All of our nodes have 4 GPUs and I've set an MPS count of 400 on each node.
> > > > I can submit one job with "salloc --nodelist=g005 -pgpu --gres=mps:100" and
> > > > any other job that I try to send to that node with be queued. I then
> > > > submitted a job with "salloc --nodelist=g005 -pgpu --gres=mps:50" and was
> > > > able to get 2 of those running, but a third job got stuck in the queue. 
> > > 
> > > I've been able to reproduce this behavior.
> > > Let me look deeper and I'll come back to you.
> 
> It turns out that this is a known limitation: only one GPU will be usable in
> MPS mode per node, at a given time.
> You can have multiple GPUs configured with MPS and any of them can be
> allocated with --gres=mps, but only 1 at a time. The others can still be
> allocated with --gres=gpu.
> 

I assume you mean this is a limitation of Slurm. Nvidia has a section of their documentation that says MPS will work with multiple GPUs per node. https://docs.nvidia.com/deploy/mps/index.html#topic_3_3_4
Even the Slurm documentation seems to indicate that multiple GPUs can be used. Why else would it make sense to have the MPS count distributed among all the GPUs?

> 
> > Yes, we are currently running 21.08.4. It looks like that commit only added
> > SLURM_JOB_GPUS for gres=gpu, but I also need it for gres=mps.
> 
> It seems that you are right.
> I'll work on it and I'll come back to you.

If we really can't use more than one GPU per node at a given time with MPS then maybe we don't need to use SLURM_JOB_GPUS. My intention was to start one MPS control daemon for each GPU that got allocated with MPS and I needed that variable to keep track of which GPU was being used in each job. 

> 
> Regards,
> Albert

I feel that restricting MPS to a single GPU is a major limitation in the implementation and would really appreciate if it could be treated as a bug and fixed in the next bug fix release.

Thanks,
Mike
Comment 17 Albert Gil 2021-11-25 09:09:00 MST
Hi Mike,

> > It turns out that this is a known limitation: only one GPU will be usable in
> > MPS mode per node, at a given time.
> > You can have multiple GPUs configured with MPS and any of them can be
> > allocated with --gres=mps, but only 1 at a time. The others can still be
> > allocated with --gres=gpu.
> 
> I assume you mean this is a limitation of Slurm.

Yes.
In bug 7834 comment 6 you can see that initially "the restriction where MPS can only run on one GPU at a time is due to the fact that you have to statically allocate GPUs to the MPS server beforehand, and can't change that on the fly."

Nvidia's CVE-2018-6260 also had some impact on the initial development of MPS support in Slurm.

> Even the Slurm documentation seems to indicate that multiple GPUs can be
> used. Why else would it make sense to have the MPS count be distributed
> among all the GPUs? 

Because any of those GPUs can be used as the single MPS GPU.
One at a time, but any of them, depending on which one is available.
That is, if a node has 4 GPUs all configured with MPS and the first one is allocated by a --gres=gpu request, a new job requesting a GPU via --gres=mps will allocate another MPS-configured GPU that is available.
If only 1 GPU is configured with MPS and it's allocated by a --gres=gpu request, then a newer --gres=mps request will wait in the queue.

> If we really can't use more than one GPU per node at a given time with MPS
> then maybe we don't need to use SLURM_JOB_GPUS. My intention was to start
> one MPS control daemon for each GPU that got allocated with MPS and I needed
> that variable to keep track of which GPU was being used in each job. 

Agree.
There is only one GPU to select, so SLURM_JOB_GPUS isn't really needed, and that's why it's not there.

> I feel that restricting MPS to a single GPU is a major limitation in the
> implementation and would really appreciate if it could be treated as a bug
> and fixed in the next bug fix release.

Yes, we also think that it's a major known limitation.
And I would say that this is one of the reasons why NVidia sponsored new Slurm development focusing on MIG technology instead of MPS in 21.08 (see bug 10970).

So, although this is not a final statement and it may change, right now everything looks like MIG is going to replace MPS, so most probably this known limitation will remain.

If you are really interested in such an enhancement, I will discuss it further internally to see if we may be interested at this point, but our most recent efforts have been moved from MPS to MIG.

Regards,
Albert
Comment 20 Michael Robbert 2021-12-02 12:27:57 MST
If SchedMD can't implement multi-GPU support in Slurm's gres:mps implementation then I will need to work on a prolog that reads the job comments so that I can start an MPS daemon for each GPU that is allocated to a job, similar to the code that was referenced in the other ticket ( https://github.com/mknoxnv/ubuntu-slurm/blob/master/prolog.d/prolog-mps-per-gpu ). That code starts an MPS server for all the GPUs, but I would like to start one only for the GPUs that are using MPS. For that I think I will need access to SLURM_JOB_GPUS, which is documented to be available but doesn't appear to be there. Is that fix in progress?

Regarding MIG as the way forward: our initial request from the customer was to use MIG, but when they found out that it couldn't be dynamically turned on and off, that became prohibitive for their needs. They have some small workloads that benefit from splitting GPUs, but other large workloads that benefit from using more than a single GPU per job. Until we have a way to switch back and forth, MIG won't work for us.

Mike
Comment 21 Michael Robbert 2021-12-02 15:28:28 MST
I did some more testing and found that SLURM_JOB_GPUS is available to the prolog when gres=gpu which is what I'm using for my workaround. 

A couple of questions to see if I can find a better way to do a couple of things:
1. Is there a better way for us to have the user request MPS per GPU than putting it in a comment and using the following in the Prolog:
scontrol show job $SLURM_JOBID | grep Comment | grep -i mps-per-gpu

I read in the documentation that scontrol shouldn't be used in Prolog scripts.

2. Is there a way to set variables in the job's environment from the prolog script? I know how to do it in a task prolog and that should work, but I'd prefer to not have to use 2 separate scripts if possible. 

Thanks,
Mike
Comment 22 Albert Gil 2021-12-03 08:51:10 MST
Hi Mike,

I'm responding between lines:

> If SchedMD can't implement multi-GPU support into Slurm's gres:mps
> implementation

Just to be clearer: I think that we at SchedMD are now putting more effort into MIG than into MPS, partly because NVidia is requesting/sponsoring us to work more in that direction, but maybe we can evaluate this particular case in more depth and convert it into a request for enhancement (RFE).

Let me discuss this internally a bit more and come back with a clearer statement from SchedMD in that regard.

> then we I will need to work on a prolog that reads the job
> comments so that I can start an MPS daemon for each GPU that is allocated to
> a job. Similar to the code that was referenced in the other ticket (
> https://github.com/mknoxnv/ubuntu-slurm/blob/master/prolog.d/prolog-mps-per-
> gpu ). That code starts an MPS server for all the GPUs, but I would like to
> only start one for the GPUs that are using MPS.

Sounds interesting.

> For that I think I will need
> access to SLURM_JOB_GPUS which is documented to be available, but doesn't
> appear to be there. Is that fix in progress?

As we only support 1 GPU with MPS, SLURM_JOB_GPUS is not available; it would always be just the one.

> I did some more testing and found that SLURM_JOB_GPUS is available to the
> prolog when gres=gpu which is what I'm using for my workaround. 

Yes, it's available for gres=gpu, not for gres=mps.
Note that both options are mutually exclusive.

> A couple of questions to see if I can find a better way to do a couple of
> things:
> 1. Is there a better way for us to have the user provide their request to
> use mps per GPU than put it in a comment and use the following in the Prolog:
> scontrol show job $SLURM_JOBID | grep Comment | grep -i mps-per-gpu

I would suggest creating your own GRES.
For example you can create GresTypes=gpumps, and you can use the Count logic of any GRES.
You can even account it as any other TRES.

> I read in the documentation that scontrol shouldn't be used in Prolog
> scripts.

Yes, calling commands in Prolog/Epilog is discouraged; we recommend using environment variables as much as possible.
But right now none of the available variables contains the GRES info, so you will need to keep using scontrol.
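(To illustrate the workaround being discussed, a hedged Prolog fragment: the helper name and the sample line are hypothetical, and in a real Prolog the input would come from `scontrol show job $SLURM_JOBID` rather than a canned string.)

```shell
#!/bin/sh
# Succeeds if "scontrol show job" style output on stdin contains a
# Comment field requesting per-GPU MPS.
wants_mps_per_gpu() {
    grep 'Comment=' | grep -qi 'mps-per-gpu'
}

# Simulated scontrol output, for illustration only:
sample='   Comment=mps-per-gpu UserId=mrobbert(1000)'
if printf '%s\n' "$sample" | wants_mps_per_gpu; then
    echo "would start per-GPU MPS daemons here"
fi
```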

> 2. Is there a way to set variables in the job's environment from the prolog
> script?

No, that's not possible.

> I know how to do it in a task prolog and that should work,

Yes, that's possible.

> but I'd
> prefer to not have to use 2 separate scripts if possible. 

Maybe you can use SLURM_SCRIPT_CONTEXT, which is available in all prologs/epilogs and has different values depending on the context that launched the script, for example prolog_slurmd or prolog_task. This allows you to have one script with shared code and context-specific branches.
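(A minimal sketch of that single-script pattern: the context values prolog_slurmd and prolog_task are from the comment above, while the actions in each branch are placeholders, not a working MPS setup.)

```shell
#!/bin/sh
# One script shared by the node prolog and the task prolog, branching
# on the context Slurm reports in SLURM_SCRIPT_CONTEXT.
run_prolog() {
    case "${SLURM_SCRIPT_CONTEXT:-unknown}" in
        prolog_slurmd)
            # Node-level setup, e.g. starting MPS control daemons
            echo "node prolog: would start nvidia-cuda-mps-control here" ;;
        prolog_task)
            # TaskProlog can export variables to the job via stdout
            echo "export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-${SLURM_JOB_ID:-0}" ;;
        *)
            echo "unknown context" >&2 ;;
    esac
}

SLURM_SCRIPT_CONTEXT=prolog_slurmd
run_prolog   # prints "node prolog: would start nvidia-cuda-mps-control here"
```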

> Regarding MIG as the way forward. Our initial request from the customer was
> to use MIG, but when they found out that they couldn't have it dynamically
> turned on and off that become prohibitive to their needs. They have some
> small workloads that benefit from splitting GPUs, but other large workloads
> that benefit from using more than a single GPU per job. Until we have a way
> to switch back and forth MIG won't work for us.

That's a good feedback.
I'll add it in our internal discussion.

Thanks!
Albert
Comment 26 Albert Gil 2021-12-10 09:29:30 MST
Hi Mike,

> > Regarding MIG as the way forward. Our initial request from the customer was
> > to use MIG, but when they found out that they couldn't have it dynamically
> > turned on and off that become prohibitive to their needs. They have some
> > small workloads that benefit from splitting GPUs, but other large workloads
> > that benefit from using more than a single GPU per job. Until we have a way
> > to switch back and forth MIG won't work for us.

>
> Just to be clearer, I think that we at SchedMD are now putting more efforts
> on MIG than on MPS, also because NVidia is requesting/sponsoring us also to
> work more on that direction, but maybe we can evaluate this particular case
> more in deep and maybe we can convert it into a request for enhancement
> (RFE).
> 
> Let me discuss this internally a bit more to comeback with a clearer
> statement from SchedMD on that regard.

We've discussed this internally, and although we understand that MPS could more or less work for your use case, we think that your use case, and most MPS use cases, should also be supported by MIG.
Note that MIG has also some dynamic support, but Slurm doesn't support it yet.

Although we are still doing some development that may touch some of the MPS code in upcoming releases, at this point we intend to focus much more on MIG development.
The dynamic part is not yet in the sponsored development roadmap for MIG, though.

I'm closing this ticket as infogiven, but please don't hesitate to reopen it if you need further support.

Regards,
Albert