Ticket 11653

Summary: GPU environmental variables not set by salloc
Product: Slurm
Reporter: Spencer Bliven <spencer.bliven>
Component: User Commands
Assignee: Scott Hilton <scott>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
CC: cinek, spencer.bliven
Version: 20.11.5
Hardware: Linux
OS: Linux
Site: Paul Scherrer
Version Fixed: 20.11.8, 21.08.0pre1

Description Spencer Bliven 2021-05-19 09:58:00 MDT
When using salloc to allocate GPUs, several important environment variables are not set in the shell that salloc starts. They are only set if the subprocess is run using `srun`:

    ~ $ salloc --gpus=2
    salloc: Granted job allocation 2356
    salloc: Waiting for resource configuration
    salloc: Nodes merlin-g-100 are ready for job
    (base) bash-4.2$ env | egrep 'GPU|CUDA'
    SLURM_GPUS=2
    (base) bash-4.2$ srun env | egrep 'GPU|CUDA'
    SLURM_GPUS=2
    SLURM_STEP_GPUS=2,3
    CUDA_VISIBLE_DEVICES=2,3
    GPU_DEVICE_ORDINAL=2,3
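
The mismatch in the transcript above (SLURM_GPUS is set, CUDA_VISIBLE_DEVICES is not) can be detected from inside the allocation shell. The following is a minimal sketch, not part of Slurm; the function name `gpu_env_ok` is hypothetical, and it assumes that SLURM_GPUS being set means GPUs were requested with `--gpus`, as in the transcript.

```shell
# Hypothetical helper (not a Slurm command): detect the state reported in this
# ticket, where GPUs were requested but CUDA_VISIBLE_DEVICES was never exported.
gpu_env_ok() {
    if [ -n "${SLURM_GPUS:-}" ] && [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo "warning: SLURM_GPUS=${SLURM_GPUS} but CUDA_VISIBLE_DEVICES is unset" >&2
        return 1
    fi
    # Either no GPUs were requested, or the device list is present.
    return 0
}
```

Calling `gpu_env_ok || exit 1` at the top of a job script would stop the script before a GPU framework can see every device on the node.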

The missing CUDA_VISIBLE_DEVICES is particularly problematic: without it, software like TensorFlow will use all GPUs on the node, leading to oversubscription.

    $ python -c 'import tensorflow as tf; print(len(tf.config.experimental.list_physical_devices("GPU")))' 2>/dev/null
    8
    $ srun python -c 'import tensorflow as tf; print(len(tf.config.experimental.list_physical_devices("GPU")))' 2>/dev/null
    2
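
Since the job step started by `srun` does receive the GPU variables, one possible workaround on affected versions is to copy them from a step into the allocation shell. This is only a sketch under that assumption and was not proposed in the ticket; the function name `import_step_gpu_env` and the temporary variables are hypothetical.

```shell
# Hypothetical workaround sketch: copy the GPU variables that srun sets for a
# job step into the current (salloc) shell.
import_step_gpu_env() {
    # Capture the environment of a single-task job step.
    _step_env=$(srun --ntasks=1 env) || return 1
    # Read the captured lines in the current shell so the exports persist.
    while IFS= read -r _line; do
        case "$_line" in
            CUDA_VISIBLE_DEVICES=*|GPU_DEVICE_ORDINAL=*)
                export "$_line" ;;
        esac
    done <<EOF
$_step_env
EOF
    unset _step_env _line
}
```

After `import_step_gpu_env`, the TensorFlow check above would be expected to report only the allocated devices even without `srun`. Per the comments below, releases containing the fix should make this unnecessary.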

Comment 7 Scott Hilton 2021-05-25 11:25:00 MDT
Spencer,

Thanks for pointing this out. I am working on a bug fix now.

-Scott

Comment 9 Scott Hilton 2021-05-26 15:21:56 MDT
Spencer,

The fix has been added to the GitHub repo with commit id 33f9459da4. It should be included in the release of 20.11.8.

-Scott

Comment 10 Spencer Bliven 2021-05-27 14:15:44 MDT
Thanks! I'm glad it was a simple fix.