Ticket 11653 - GPU environmental variables not set by salloc
Summary: GPU environmental variables not set by salloc
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 20.11.5
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Scott Hilton
Reported: 2021-05-19 09:58 MDT by Spencer Bliven
Modified: 2021-05-27 14:15 MDT
Site: Paul Scherrer
Version Fixed: 20.11.8 21.08.0pre1
Description Spencer Bliven 2021-05-19 09:58:00 MDT
When using salloc to allocate GPUs, several important environment variables are not set in the resulting shell. They are set only when a command is run through `srun`:

    ~ $ salloc --gpus=2
    salloc: Granted job allocation 2356
    salloc: Waiting for resource configuration
    salloc: Nodes merlin-g-100 are ready for job
    (base) bash-4.2$ env | egrep 'GPU|CUDA'
    SLURM_GPUS=2
    (base) bash-4.2$ srun env | egrep 'GPU|CUDA'
    SLURM_GPUS=2
    SLURM_STEP_GPUS=2,3
    CUDA_VISIBLE_DEVICES=2,3
    GPU_DEVICE_ORDINAL=2,3
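
The difference between the two environments can also be computed rather than read off by eye. The following is a minimal sketch, not part of Slurm: the helper names `gpu_vars` and `missing_in_shell` are hypothetical, and the two sample strings are copied from the transcript above. It applies the same `GPU|CUDA` filter as the `egrep` call and lists the variables that appear only inside the `srun` step.

```python
import re

# Same filter as `egrep 'GPU|CUDA'` in the transcript above.
GPU_PATTERN = re.compile(r"GPU|CUDA")


def gpu_vars(env_text):
    """Parse `env` output, keeping only NAME=VALUE lines that match GPU_PATTERN."""
    result = {}
    for line in env_text.splitlines():
        line = line.strip()
        if "=" not in line or not GPU_PATTERN.search(line):
            continue
        name, _, value = line.partition("=")
        result[name] = value
    return result


def missing_in_shell(shell_env_text, step_env_text):
    """Return the sorted names set in the srun step but absent from the salloc shell."""
    shell = gpu_vars(shell_env_text)
    step = gpu_vars(step_env_text)
    return sorted(set(step) - set(shell))


# Sample data copied from the transcript above (hypothetical constant names).
SHELL_ENV = "SLURM_GPUS=2\n"
STEP_ENV = """SLURM_GPUS=2
SLURM_STEP_GPUS=2,3
CUDA_VISIBLE_DEVICES=2,3
GPU_DEVICE_ORDINAL=2,3
"""

if __name__ == "__main__":
    print(missing_in_shell(SHELL_ENV, STEP_ENV))
```

On a live allocation the two strings would come from `env` and `srun env` respectively; how they are captured is left open here.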

The missing CUDA_VISIBLE_DEVICES is particularly problematic: without it, software such as TensorFlow uses every GPU on the node rather than only the allocated ones, leading to oversubscription.

    $ python -c 'import tensorflow as tf; print(len(tf.config.experimental.list_physical_devices("GPU")))' 2>/dev/null
    8
    $ srun python -c 'import tensorflow as tf; print(len(tf.config.experimental.list_physical_devices("GPU")))' 2>/dev/null
    2
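
Until a fixed release is installed, a script could guard against this case before importing a GPU framework. The following is a rough sketch under stated assumptions, not an official recommendation: `gpu_restriction_problem` is a hypothetical helper, and it assumes that the presence of `SLURM_GPUS` (as seen in the salloc shell above) is a good enough signal that GPUs were requested.

```python
import os


def gpu_restriction_problem(env=None):
    """Return a warning string if GPUs look allocated but unrestricted, else None.

    Assumption: SLURM_GPUS being set means the job requested GPUs, and a
    missing CUDA_VISIBLE_DEVICES means CUDA software will see every GPU.
    """
    env = os.environ if env is None else env
    if "SLURM_GPUS" not in env:
        return None  # no sign of a GPU request, nothing to check
    if "CUDA_VISIBLE_DEVICES" in env:
        return None  # visibility is already restricted (even if set to empty)
    return (
        "SLURM_GPUS=%s is set but CUDA_VISIBLE_DEVICES is not; "
        "run this command through srun to restrict GPU visibility"
        % env["SLURM_GPUS"]
    )


if __name__ == "__main__":
    problem = gpu_restriction_problem()
    if problem is not None:
        raise SystemExit(problem)
```

Run directly in the salloc shell shown above, this would exit with the warning; run through `srun`, where CUDA_VISIBLE_DEVICES is set, it would exit silently.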
Comment 7 Scott Hilton 2021-05-25 11:25:00 MDT
Spencer,

Thanks for pointing this out. I am working on a bug fix now.

-Scott
Comment 9 Scott Hilton 2021-05-26 15:21:56 MDT
Spencer,

The fix has been added to the GitHub repo with commit id 33f9459da4. It should be included in the release of 20.11.8.

-Scott
Comment 10 Spencer Bliven 2021-05-27 14:15:44 MDT
Thanks! I'm glad it was a simple fix.