When using salloc to allocate GPUs, several important environment variables are not set. They are only set if the subprocess is run using `srun`:

    ~ $ salloc --gpus=2
    salloc: Granted job allocation 2356
    salloc: Waiting for resource configuration
    salloc: Nodes merlin-g-100 are ready for job
    (base) bash-4.2$ env | egrep 'GPU|CUDA'
    SLURM_GPUS=2
    (base) bash-4.2$ srun env | egrep 'GPU|CUDA'
    SLURM_GPUS=2
    SLURM_STEP_GPUS=2,3
    CUDA_VISIBLE_DEVICES=2,3
    GPU_DEVICE_ORDINAL=2,3

The missing CUDA_VISIBLE_DEVICES is particularly problematic: without it, software like tensorflow will use all GPUs on the node rather than only the allocated ones, leading to oversubscription.

    $ python -c 'import tensorflow as tf; print(len(tf.config.experimental.list_physical_devices("GPU")))' 2>/dev/null
    8
    $ srun python -c 'import tensorflow as tf; print(len(tf.config.experimental.list_physical_devices("GPU")))' 2>/dev/null
    2
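
Until a fix is available, a script could guard against this situation itself. The following is only a minimal sketch, not part of Slurm: the function name `missing_gpu_binding` is hypothetical, and it assumes that SLURM_GPUS being set (as in the output above for `--gpus=2`) is a sufficient sign of a GPU allocation. Allocations requested through other options may not set that variable.

```python
import os
import sys


def missing_gpu_binding(env=None):
    """Return True if a Slurm GPU allocation is visible in the environment
    but CUDA_VISIBLE_DEVICES is not set.

    Hypothetical helper: it only inspects SLURM_GPUS, which is an assumption
    based on the `salloc --gpus=2` output in this report.
    """
    env = os.environ if env is None else env
    in_gpu_allocation = bool(env.get("SLURM_GPUS"))
    return in_gpu_allocation and "CUDA_VISIBLE_DEVICES" not in env


if __name__ == "__main__":
    if missing_gpu_binding():
        # Warn before importing a GPU framework that would grab every device.
        print(
            "warning: GPUs were allocated but CUDA_VISIBLE_DEVICES is unset; "
            "consider launching this command through srun",
            file=sys.stderr,
        )
```

Calling the check before importing tensorflow gives the user a chance to rerun under `srun` instead of silently using every GPU on the node.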
Spencer, Thanks for pointing this out. I am working on a bug fix now. -Scott
Spencer, The fix has been added to the GitHub repo with commit id 33f9459da4. It should be included in the release of 20.11.8. -Scott
Thanks! I'm glad it was a simple fix.