Ticket 3915 - Gating GPU Memory
Summary: Gating GPU Memory
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 17.11.x
Hardware: Linux
Severity: 5 - Enhancement
Assignee: Unassigned Developer
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-06-21 08:12 MDT by Paul Edmon
Modified: 2022-04-22 11:16 MDT

See Also:
Site: Harvard University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Paul Edmon 2017-06-21 08:12:51 MDT
This is a bit broader than just GPUs, but we are getting into situations where GPUs can address and see each other's memory.  Slurm, as far as I am aware, can't currently limit this access the way it can with CPU memory.  Thus a user running on one GPU can exhaust the memory on another GPU and crash a different user's job running there.  In environments with a mix of jobs using different resources, this is a problem.

It would therefore be a nice feature to be able to gate not only CPU memory but GPU memory as well.

-Paul Edmon-
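
As a rough illustration of the problem, here is a minimal C sketch against the CUDA runtime API (assuming a single visible NVIDIA GPU; the 1 GiB allocation size is arbitrary).  cudaMemGetInfo() reports the free/total pool of the whole card, so every process sharing the GPU sees, and can consume, the same memory; nothing scopes the usage to a job the way the cgroup memory controller does for host RAM.

    /* Sketch: why one job's GPU allocations affect every other job on the card. */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        size_t free_b = 0, total_b = 0;

        if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) {
            fprintf(stderr, "cudaMemGetInfo failed\n");
            return 1;
        }
        /* free_b already reflects allocations made by *other* processes on
         * this GPU; there is no per-job accounting behind it. */
        printf("GPU memory: %zu MiB free of %zu MiB total\n",
               free_b / (1024 * 1024), total_b / (1024 * 1024));

        /* Grabbing a large chunk here directly reduces what a different
         * user's job on the same card can still allocate. */
        void *blk = NULL;
        if (cudaMalloc(&blk, (size_t)1 << 30) == cudaSuccess) { /* 1 GiB */
            printf("took 1 GiB that no other job on this GPU can use\n");
            cudaFree(blk);
        }
        return 0;
    }
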
Comment 1 Tim Wickberg 2017-06-21 09:01:31 MDT
Looks like an interesting idea, but AFAICT there is no API for us to build such support off of at present.

The Linux cgroup system only lets us block access to the device files - there is no equivalent of the cgroup memory controller tailored for the GPU. At a quick glance, I don't see any obvious equivalent through the nvidia-smi command or their other tools.

If you're aware of something that could enforce this and give us a viable approach, please update the bug; otherwise this is likely to go unresolved.

- Tim
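
For context on what the kernel does offer, a minimal C sketch of the cgroup v1 devices controller rule that this kind of device constraining relies on (the cgroup path "job_demo" and the 195:1 major:minor pair for /dev/nvidia1 are illustrative assumptions).  The only available knob is denying the device file outright; there is no per-GPU counterpart to memory.limit_in_bytes.

    /* Sketch: deny /dev/nvidia1 to every task in an existing cgroup.
     * Assumes the v1 devices controller is mounted at /sys/fs/cgroup/devices. */
    #include <stdio.h>

    int main(void)
    {
        const char *deny = "/sys/fs/cgroup/devices/job_demo/devices.deny";
        FILE *f = fopen(deny, "w");

        if (!f) {
            perror("open devices.deny");
            return 1;
        }
        /* "c 195:1 rwm" blocks read/write/mknod on the character device
         * 195:1 (/dev/nvidia1) -- all or nothing, no memory granularity. */
        if (fprintf(f, "c 195:1 rwm\n") < 0)
            perror("write deny rule");
        fclose(f);
        return 0;
    }
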
Comment 2 Paul Edmon 2017-06-21 09:05:50 MDT
Yeah, sadly I'm not aware of any method for this either.  About the only solution I have would be to gate access to the full GPU card or make GPU jobs use the full node.

-Paul Edmon-
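
A minimal sketch of that whole-card / whole-node workaround, assuming TaskPlugin=task/cgroup is enabled and gres.conf already defines the node's GPUs (the option names are stock Slurm; everything else is illustrative):

    # cgroup.conf -- revoke access to GPU device files the job was not allocated
    ConstrainDevices=yes

    # Job script directives -- take the whole node so no other job can land on
    # the GPUs this job leaves idle
    #SBATCH --gres=gpu:1
    #SBATCH --exclusive

This gates at the granularity of a device file or a whole node, not GPU memory, which is exactly the limitation this ticket asks to lift.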

