Ticket 6538 - salloc does not set CUDA_VISIBLE_DEVICES
Summary: salloc does not set CUDA_VISIBLE_DEVICES
Status: RESOLVED DUPLICATE of ticket 6411
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 18.08.5
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Albert Gil
 
Reported: 2019-02-19 07:49 MST by Michael DiDomenico
Modified: 2019-03-05 11:32 MST

Site: IDACCR


Description Michael DiDomenico 2019-02-19 07:49:22 MST
This is probably by design, but maybe there's something I can do. When using

sallocdefaultcommand="srun -n1 -N1 --mem-per-cpu=0 --cpu-bind=no --gres=gpu:0 --pty --preserve-env $SHELL"

CUDA_VISIBLE_DEVICES is not set in the environment.

If I run 'srun -n 1 env | grep CUDA' then it is.

However, the crux of my problem stems from this: I have a series of users who are trying to issue an salloc command, ssh over to the node (always a single node), and then issue non-srun/non-MPI commands, i.e. running things like Python, TensorFlow, etc.

If we ran exclusive node access then this probably wouldn't be an issue, but I'd like to use shared node access.

Without CUDA_VISIBLE_DEVICES within the salloc, two users can effectively stomp on each other, because they'll both grab the same GPUs.
Comment 3 Albert Gil 2019-02-22 05:30:00 MST
Hi Michael,

> This is probably by design, but maybe there's something I can do, when using 
> 
> sallocdefaultcommand="srun -n1 -N1 --mem-per-cpu=0 --cpu-bind=no
> --gres=gpu:0 --pty --preserve-env $SHELL"
> 
> CUDA_VISIBLE_DEVICES is not set in the environment
>
> if i run 'srun -n 1 env |grep CUDA' then it is.

Yes, it is currently by design, as explained in bug 6022 comment 5.
But it is also true that we may be changing it in bug 6412. That work is still in progress.


> however, the crux of my problem stems from that, I have a series of users
> that are trying to issue an salloc command, ssh over to the node (always a
> single node)

Are you using pam_slurm_adopt (https://slurm.schedmd.com/pam_slurm_adopt.html) to control these ssh sessions?

> and then issue non-srun/non-mpi commands.  Ie running things
> like python, tensorflow, etc

Running TensorFlow without srun doesn't look like a good idea to me.
Why not use "srun tensorflow" instead of "ssh tensorflow"?

> if we ran exclusive node access then this probably wouldn't be an issue, but
> i'd like to use shared node access.
> 
> without CUDA_VISIBLE within the salloc, two users can effectively stomp on
> each other because they'll both grab the same gpu's.

Ignoring the ssh that you mentioned before, if two users do your default salloc into the same node, and run TensorFlow *without* srun, then you are right: both can grab the same GPU. This is what bug 6412 is trying to improve.

But please note that if a user does ssh to a node (without pam_slurm_adopt), then Slurm cannot control the environment variables that session will have at all.
If you are using pam_slurm_adopt, then we could consider whether it should handle CUDA_VISIBLE_DEVICES the same way it already handles DISPLAY, but I'm not sure.

Anyway, the alternatives that you have are:

  Remove --gres=gpu:0 from your SallocDefaultCommand

This will set CUDA_VISIBLE_DEVICES in the salloc shell, and running TensorFlow without srun or ssh will work.
But please note that if users then try srun+tensorflow asking again for a GPU, it won't work, because the GPUs have already been consumed by the shell spawned by the srun in the SallocDefaultCommand.
And also note that it won't work either if users keep doing ssh+tensorflow.

  Set CUDA_VISIBLE_DEVICES in the SSH server

Using SetEnv in sshd_config, you can force CUDA_VISIBLE_DEVICES to NONE for all ssh connections.
This will force users to use srun+tensorflow to actually get access to GPUs, which I strongly recommend over ssh+tensorflow.
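A minimal sketch of that second option (treat this as an illustration: SetEnv in sshd_config requires a reasonably recent OpenSSH, and NONE is simply a value that matches no CUDA device index):

```
# /etc/ssh/sshd_config (fragment)
# Force a value that matches no device for every ssh login, so GPU
# programs started over plain ssh see no GPUs at all.
SetEnv CUDA_VISIBLE_DEVICES=NONE
```

With this in place, only processes launched through srun (where Slurm exports the real device list) can see the allocated GPUs.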


I would recommend the second idea: training (or forcing) users to use srun as much as possible, and ssh as little as possible, in a Slurm cluster.

Does that make sense to you?


Finally, please note that CUDA_VISIBLE_DEVICES is not really a strong restriction/binding, because any user can manually set it to whatever they want:
https://devblogs.nvidia.com/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/


Regards,
Albert
Comment 8 Albert Gil 2019-02-26 14:09:12 MST
Hi Michael,
Regarding this comment:

> Finally, please note the CUDA_VISIBLE_DEVICES is not really a strong
> restriction/binding because any user can manually set it to whatever they
> want:
> https://devblogs.nvidia.com/cuda-pro-tip-control-gpu-visibility-
> cuda_visible_devices/

To make it clearer:
CUDA_VISIBLE_DEVICES is not a safe way to constrain jobs to a GPU.
To really constrain them, Slurm uses the cgroup devices subsystem to allow or deny each job/step/task access to specific devices.
To enable it, you have to use the task/cgroup plugin, specify ConstrainDevices=yes in cgroup.conf, and set the actual "File" (i.e. device) of each GRES (i.e. GPU) in gres.conf.
More info on setting it up:

https://slurm.schedmd.com/gres.html
https://slurm.schedmd.com/cgroup.conf.html
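The pieces fit together roughly like this (a sketch only — the node name, GPU count, and device paths are assumptions for illustration):

```
# slurm.conf (fragment)
TaskPlugin=task/cgroup

# cgroup.conf (fragment)
ConstrainDevices=yes

# gres.conf (fragment) -- File= must name the real device nodes
NodeName=compute1 Name=gpu File=/dev/nvidia0
NodeName=compute1 Name=gpu File=/dev/nvidia1
```

With this, a job that did not request a GPU is denied access to /dev/nvidia* at the cgroup level, regardless of what CUDA_VISIBLE_DEVICES says.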


Also note that if you combine ConstrainDevices with pam_slurm_adopt, you will also have a safe/real constraint on the GPU devices for ssh connections.
That is, once you have a job allocated on a node and you ssh to it, the job's device constraints will also be applied to the ssh session (see bug 4122 comment 14 for more info).

If you are interested in using pam_slurm_adopt:
https://slurm.schedmd.com/pam_slurm_adopt.html


Finally, please note that in bug 6411 we are indeed working to enhance pam_slurm_adopt to also set the environment variables of a normal job allocation for ssh logins, and CUDA_VISIBLE_DEVICES is one of them.
It's work in progress and we are still discussing what the best move is regarding ssh and environment variables, but if you are interested, please feel free to add yourself to CC to stay updated.

Hope that helps,
Albert
Comment 9 Michael DiDomenico 2019-02-27 06:33:05 MST
I get what you're saying about the safety of the allocated GPU and collisions with other users, but I think that's a separate conversation.

The crux of my point is really that the variable isn't set unless you issue a second Slurm command.  salloc already allocated the GPU, so it should know which GPU was allocated.  I don't feel like I should *have* to run srun <command> within my allocation in order to activate it.

You're completely right that this is a soft fence to keep users away from each other, but I'm okay with that for now.
Comment 10 Albert Gil 2019-02-27 11:49:14 MST
Hi Michael,

> I get what you're saying about the safeness of the allocated gpu and collisions
> with other users, but i think that's a separate conversation.
> you're completely right in that this is a soft fence to keep users away from
> each other, but i'm okay with that for now.

Ok, then let's forget about ConstrainDevices for now.

> the crux of my point is really that the variable isn't set unless you issue
> a second slurm command.  salloc already allocated the gpu so it should know
> which gpu was allocated.  i don't feel like i should *have* to run srun
> <command> within my allocation in order to activate it.

Let me put it step-by-step to make it clearer:

1) Running an empty salloc (without any default command) requesting --gres=gpu:
  - We are on the access node, not on a compute node.
  - Here CUDA_VISIBLE_DEVICES is not set, because it is set per compute node.
  - No resources are consumed.
  - From here we have several options:
    a) srun with --gres=gpu:0
       - We'll have access to a node without consuming the GPU resources.
       - Currently, in 18.08, CUDA_VISIBLE_DEVICES is not set, but bug 6412 will most probably change this so that Slurm sets CUDA_VISIBLE_DEVICES to NoDevFiles.
    b) srun with --gres=gpu:1
       - We'll have access to a node, consuming the resources.
       - CUDA_VISIBLE_DEVICES will be properly set.
    c) ssh without pam_slurm_adopt:
       - Slurm cannot control any environment variable.
    d) ssh with pam_slurm_adopt
       - We'll have access to a node without consuming the resources.
       - Currently, in 18.08, CUDA_VISIBLE_DEVICES is not set, but bug 6411 will most probably change this so that Slurm sets CUDA_VISIBLE_DEVICES to the right devices.

If I understand correctly, what you want is this last case (1->d), once/if bug 6411 adds CUDA_VISIBLE_DEVICES to pam_slurm_adopt.
If I'm right, then I think that we should close this bug as a duplicate of it.
Am I right?

Albert
Comment 11 Michael DiDomenico 2019-02-27 12:26:43 MST
(In reply to Albert Gil from comment #10)
> Let me put it step-by-step to make it clearer:
> 
> 1) Running an empty salloc (without any default command) requesting a
> --gres=gpu
>   - We are in the access node, not in the computing node.
>   - Here the CUDA_VISIBLE_DEVICES is not set, because it is
> computing-node-wise.
>   - No resources are consumed
>   - Here we have several options:
>     a) srun with --gres=gpu:0
>        - We'll have access to a node without consuming the GPU resources
>        - Currently, in 18.08, CUDA_VISIBLE_DEVICES is not set, but most
> probably bug 6412 will change this and Slurm will set CUDA_VISIBLE_DEVICES
> to NoDevFiles.
>     b) srun with --gres=gpu:1
>        - We'll have access to a node consuming the resources
>        - CUDA_VISIBLE_DEVICES will be properly set
>     c) ssh without pam_slurm_adopt:
>        - Slurm cannot control any environment variable
>     d) ssh with pam_slurm_adopt
>        - We'll have access to a node without consuming the resources
>        - Currently, in 18.08, CUDA_VISIBLE_DEVICES is not set, but most
> probably bug 6411 will change this and Slurm will set CUDA_VISIBLE_DEVICES
> to the right devices.
> 
> If I understand correctly what you want is this last one case (1->d) once/if
> bug 6411 adds the CUDA_VISIBLE_DEVICES to pam_slurm_adopt.
> If I'm right, then I think that we should close this bug as a duplicated of
> it.
> Am I right?

Close; I think the 'ssh' and 'empty salloc' parts are what's throwing the conversation off.  Here's a typical keyboard session for a user:

login1$ salloc -n1 --gres=gpu:1
-- so the user asked for a single core on a node with 1 GPU
-- SallocDefaultCommand is enabled and will run this:
-- srun -n1 <other options> --gres=gpu:0 --pty /bin/bash
-- so now the user has been granted access to their compute node
compute1$ 
-- let's check the env variable
compute1$ echo $CUDA_VISIBLE_DEVICES
<unset>
-- at this point a user is going to kick off a program
compute1$ ./<mygpuprogram>
-- everyone who does this on a single node is then going to get the first GPU and collide
-- however, if the user does
compute1$ srun env | grep CUDA
CUDA_VISIBLE_DEVICES=0
compute1$ srun ./<mygpuprogram>
-- then things work fine, because CUDA_VISIBLE_DEVICES is set to 0
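One interim guard against that collision (a sketch of my own, not a Slurm feature; the function name is made up) is to give users a wrapper that refuses to launch a GPU program when CUDA_VISIBLE_DEVICES is unset, so the bad case fails loudly instead of silently landing on GPU 0:

```shell
# require_gpu: run a command only if Slurm has exported CUDA_VISIBLE_DEVICES,
# which in 18.08 only happens inside an srun step that consumed the GPU.
require_gpu() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "CUDA_VISIBLE_DEVICES is unset; launch this via 'srun ...'" >&2
        return 1
    fi
    "$@"
}
```

With this in users' shell init, running `require_gpu ./<mygpuprogram>` straight from the salloc shell errors out immediately, while the same call inside an srun step (where the variable is set) goes through.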
Comment 12 Albert Gil 2019-02-28 04:50:08 MST
Hi Michael,

> close, i think the 'ssh' and 'empty salloc' part is what's throwing the
> conversation off.  here's a typical keyboard session for a user
> 
> login1$ salloc -n1 --gres=gpu:1
> -- so the user asked for a single core on a node with 1 gpu
> -- sallocdefaultcommand is enabled and will run this
> -- srun -n1 <other options> --gres=gpu:0 --pty /bin/bash
> -- so now the user has been granted access to their compute node

Please note that here you are already in scenario 1a.
That means that these three scenarios are equivalent:

Scenario A)
- SallocDefaultCommand set to "srun [options] --gres=gpu:0 --pty /bin/bash"
- run "salloc --gres=gpu:1"

Scenario B)
- SallocDefaultCommand not set
- run "salloc --gres=gpu:1"
- run "srun [options] --gres=gpu:0 --pty /bin/bash"

Scenario C)
- SallocDefaultCommand set to whatever
- run "salloc --gres=gpu:1 srun [options] --gres=gpu:0 --pty /bin/bash"

All the above scenarios are equivalent to 1a) as explained in comment 10.

So, following any of the above ways, the GPU is not consumed and the user shouldn't have access to any GPU, because they all specify --gres=gpu:0. Is that what you want?

If it is, to enforce it properly we have to set up ConstrainDevices.
Or, to just discourage it without enforcing it, we have to wait for bug 6412, which will set CUDA_VISIBLE_DEVICES to NoDevFiles.

But I think that that's not what you want, right?
If I understand correctly, what you want is "to have access to the right GPUs directly from salloc, without consuming them, but not to all the GPUs". Right?
That is, being in scenario (1d) but:
- with the improvement to be done in bug 6411 that sets CUDA_VISIBLE_DEVICES to allow access to the right GPUs
- accessing the nodes with a single salloc command with the right SallocDefaultCommand.

Am I closer here?

If I am, I think that this is currently not possible.
The closest way I can think of to reach the state you want is to follow the (1d) path and wait for bug 6411 to be committed.
Anyway, let me discuss it further internally to see if there are other options, or whether we should make some enhancement here.


Albert
Comment 13 Michael DiDomenico 2019-02-28 07:56:30 MST
(In reply to Albert Gil from comment #12)
> Please note that here you are already in the scenario 1a.
> That means that these three scenarios are equivalent:
> 
> Scenario A)
> - SallocDefaultCommand set to "srun [options] --gres=gpu:0 --pty /bin/bash"
> - run "salloc --gres=gpu:1"
> 
> Scenario B)
> - SallocDefaultCommand not set
> - run "salloc --gres=gpu:1"
> - run "srun [options] --gres=gpu:0 --pty /bin/bash"
> 
> Scenario C)
> - SallocDefaultCommand set to whatever
> - run "salloc --gres=gpu:1 srun [options] --gres=gpu:0 --pty /bin/bash"
> 
> All the above scenarios are equivalent to the 1a) explained in comment 10.

Yes, I agree all of those scenarios are equivalent.  I guess, ultimately, what I'm asking for is a way to substitute the --gres=gpu:0 in the SallocDefaultCommand with the --gres=gpu:1 of the original salloc.

This would allow both scenarios: if someone just wants a node but no GPUs, the default behavior exists; but if someone asks for a node and a GPU, they get it, and the CUDA env variable would be set.

Or, to expand: instead of the SallocDefaultCommand being preset with a bunch of parameters like --cpu-bind=0 --mem-bind=0 --gres=gpu:0, is there a way to carry those over from the original salloc?

I.e.

instead of

sallocdefaultcommand="srun -n1 -N1 --cpu-bind=0 --mem-bind=0 --gres=gpu:0 --pty /bin/bash"

could be

sallocdefaultcommand="srun $SALLOCPARAMS --pty /bin/bash"

so a user could do

salloc -n1 -N1 --gres=gpu:1 

and have those passed to the SallocDefaultCommand.
Comment 15 Albert Gil 2019-03-01 02:20:40 MST
Hi Michael,

When we srun inside an salloc session, by default all salloc parameters are passed to srun (through environment variables set by salloc), unless you specifically override them in the srun command.
That means that setting SallocDefaultCommand to just "srun --pty $SHELL", or using the srun commands in the following examples, will behave as you described:

$ salloc --gres=gpu:1 srun --pty $SHELL
$ env | grep "CUDA"
CUDA_VISIBLE_DEVICES=0

$ salloc --gres=gpu:2 srun --pty $SHELL
$ env | grep "CUDA"
CUDA_VISIBLE_DEVICES=0,1

BUT, please note that:
- Until bug 6412 is fixed/committed, if the salloc command doesn't ask for --gres=gpu, or even if it asks for --gres=gpu:0, CUDA_VISIBLE_DEVICES is not set at all. That is:

  $ salloc srun --pty $SHELL
  $ env | grep "CUDA"
  $
  $ salloc --gres=gpu:0 srun --pty $SHELL
  $ env | grep "CUDA"
  $

- If salloc asks for multiple tasks or nodes, you won't be able to use them, because the srun command is consuming them but you only have access to one shell.

My recommendation here would be to keep SallocDefaultCommand "safe" by not consuming resources, and train users to manually add "srun --pty $SHELL" as their salloc command, like the above examples (it overrides the default, of course), in the cases where they only ask for one node and one task and want direct access to the allocated GPUs from the shell.
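That training could be packaged as a small shell function in users' login files (a sketch only; the wrapper name is made up, and it assumes the salloc/srun behavior described above):

```shell
# gpu_shell: hypothetical convenience wrapper for interactive GPU sessions.
# "gpu_shell --gres=gpu:1" runs: salloc --gres=gpu:1 srun --pty $SHELL
# The explicit srun overrides SallocDefaultCommand and inherits the GPU
# request from salloc, so CUDA_VISIBLE_DEVICES is set inside the shell.
gpu_shell() {
    salloc "$@" srun --pty "$SHELL"
}
```

A user would then type `gpu_shell -n1 --gres=gpu:1` instead of remembering to append the srun command themselves.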

Would that work for you?

In fact, I would rather recommend salloc with a local shell (salloc [options] $SHELL) and then ssh to the assigned nodes using pam_slurm_adopt (it grants access without consuming resources, and bug 6411 will pass the CUDA_VISIBLE_DEVICES variable). This will allow users to ssh to different nodes and actually run srun commands with multiple tasks if they allocated those resources.

Finally, let me add that we have discussed some generic "noconsume" option for the srun command to allow oversubscription (when possible), but that isn't likely to happen in 19.05.

Regards,
Albert
Comment 16 Michael DiDomenico 2019-03-01 08:05:58 MST
I'll wait for 6411 and 6412 to land, then retest and see if that does what I'm after.
Comment 22 Albert Gil 2019-03-05 11:32:55 MST
Michael,

> I'll wait for 6411 and 6412 to land and reset and see if that does what i'm
> after.

If it's OK with you, I'm closing this bug as a duplicate of ticket 6411.

Albert

*** This ticket has been marked as a duplicate of ticket 6411 ***