This is a bit broader than just GPUs, but we are getting into situations where GPUs can address and see each other's memory. Slurm, as far as I am aware, can't currently limit this access the way it can for CPUs. Thus a user running on one GPU could exhaust the memory on another GPU and crash a different user's job running there. In environments with a mix of jobs that use different resources, this is a bit of a problem. It would be a nice feature to be able to gate not only CPU memory but also GPU memory. -Paul Edmon-
Looks like an interesting idea, but AFAICT there is no API for us to build such support off of at present. The Linux cgroup system only lets us block access to the device files; there is no equivalent of the cgroup memory controller tailored for the GPU. At a quick glance, I don't see any obvious equivalent through the nvidia-smi command or their other tools. If you're aware of something that would enforce this and give us a viable approach, please update the bug; otherwise this is likely to go unresolved. - Tim
Yeah, sadly I'm not aware of any method for this either. About the only solution I have would be to gate access to the full GPU card or make GPU jobs use the full node. -Paul Edmon-
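For anyone landing on this bug: as the comments above say, there is no enforcement API, but per-GPU memory use can at least be observed. Below is a minimal Python sketch that parses the CSV output of `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits` and flags GPUs that are nearly full. The function names and the 90% threshold are assumptions made for illustration; nothing here is part of Slurm, and it only reports usage, it does not limit it.

```python
import csv
import io
import subprocess

# Query command; values are reported in MiB when "nounits" is requested.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows into a list of dicts (values in MiB)."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text)):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        try:
            index, used, total = (int(f.strip()) for f in fields)
        except ValueError:
            continue  # skip rows with non-numeric values such as "[N/A]"
        rows.append({"index": index, "used_mib": used, "total_mib": total})
    return rows


def over_threshold(rows, fraction=0.9):
    """Return indices of GPUs whose used memory exceeds fraction of total."""
    return [r["index"] for r in rows if r["used_mib"] > fraction * r["total_mib"]]


def query_gpus():
    """Run nvidia-smi on this node and return the parsed rows."""
    out = subprocess.run(QUERY, check=True, capture_output=True, text=True).stdout
    return parse_gpu_memory(out)
```

On a GPU node, `over_threshold(query_gpus())` would list the nearly full devices. A site could run something like this from a health check to spot a job filling a GPU it was not allocated, but that is detection after the fact, not the gating requested in this bug.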