11653 – GPU environmental variables not set by salloc

Status:	RESOLVED FIXED

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	User Commands
Version:	20.11.5
Hardware:	Linux Linux

Severity:	4 - Minor Issue
Assignee:	Scott Hilton

Reported:	2021-05-19 09:58 MDT by Spencer Bliven
Modified:	2021-05-27 14:15 MDT
Site:	Paul Scherrer
Version Fixed:	20.11.8 21.08.0pre1

Description Spencer Bliven 2021-05-19 09:58:00 MDT

When using salloc to allocate GPUs, several important environment variables are not set in the resulting shell. They are set only when a command is launched through `srun`:

    ~ $ salloc --gpus=2
    salloc: Granted job allocation 2356
    salloc: Waiting for resource configuration
    salloc: Nodes merlin-g-100 are ready for job
    (base) bash-4.2$ env | egrep 'GPU|CUDA'
    SLURM_GPUS=2
    (base) bash-4.2$ srun env | egrep 'GPU|CUDA'
    SLURM_GPUS=2
    SLURM_STEP_GPUS=2,3
    CUDA_VISIBLE_DEVICES=2,3
    GPU_DEVICE_ORDINAL=2,3
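
The difference between the two listings above can also be checked from a script. The following is a minimal sketch, not part of Slurm: the function name `missing_gpu_vars` is hypothetical, and the list of variables is simply the set that showed up under `srun` in this ticket.

```python
import os

# Variables that appeared in the `srun env` output in this ticket but not in
# the plain salloc shell. Assumption: this list is copied from the ticket and
# is not an exhaustive list of what Slurm may set.
STEP_GPU_VARS = ("SLURM_STEP_GPUS", "CUDA_VISIBLE_DEVICES", "GPU_DEVICE_ORDINAL")


def missing_gpu_vars(env=None):
    """Return the GPU variables from STEP_GPU_VARS that are absent from env.

    env is any mapping of variable names to values; it defaults to the
    environment of the current process.
    """
    if env is None:
        env = os.environ
    return [name for name in STEP_GPU_VARS if name not in env]
```

Called without arguments inside the salloc shell on an affected version, it would be expected to report all three names; run through `srun`, it should return an empty list.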

The missing CUDA_VISIBLE_DEVICES is particularly problematic: without it, software such as TensorFlow uses every GPU on the node rather than only the allocated ones, which leads to oversubscription.

    $ python -c 'import tensorflow as tf; print(len(tf.config.experimental.list_physical_devices("GPU")))' 2>/dev/null
    8
    $ srun python -c 'import tensorflow as tf; print(len(tf.config.experimental.list_physical_devices("GPU")))' 2>/dev/null
    2
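
Until a fixed release is installed, a job script could refuse to start when it detects this situation. The sketch below is one possible guard, under the assumption that a set SLURM_GPUS together with an unset CUDA_VISIBLE_DEVICES indicates the problem described here; the function name `require_gpu_binding` is hypothetical and would be called before importing a GPU framework.

```python
import os


def require_gpu_binding(env=None):
    """Fail fast if GPUs were requested from Slurm but no binding is visible.

    Returns the list of device IDs from CUDA_VISIBLE_DEVICES, or None when
    SLURM_GPUS is not set (assumed to mean: not inside a GPU allocation).
    Raises RuntimeError when SLURM_GPUS is set but CUDA_VISIBLE_DEVICES is not.
    """
    if env is None:
        env = os.environ
    if "SLURM_GPUS" not in env:
        return None
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        raise RuntimeError(
            "SLURM_GPUS is set but CUDA_VISIBLE_DEVICES is not; "
            "launch this command through srun to get the GPU binding"
        )
    # Split the comma-separated ID list, dropping empty entries.
    return [dev.strip() for dev in visible.split(",") if dev.strip()]
```

With the environment shown in the salloc shell above, the guard raises; with the environment shown under `srun`, it returns the two allocated device IDs.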

Comment 7 Scott Hilton 2021-05-25 11:25:00 MDT

Spencer,

Thanks for pointing this out. I am working on a bug fix now.

-Scott

Comment 9 Scott Hilton 2021-05-26 15:21:56 MDT

Spencer,

The fix has been added to the GitHub repo with commit id 33f9459da4. It should be included in the release of 20.11.8.

-Scott

Comment 10 Spencer Bliven 2021-05-27 14:15:44 MDT

Thanks! I'm glad it was a simple fix.