| Summary: | GPU environmental variables not set by salloc | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Spencer Bliven <spencer.bliven> |
| Component: | User Commands | Assignee: | Scott Hilton <scott> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | cinek, spencer.bliven |
| Version: | 20.11.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Paul Scherrer | Alineos Sites: | --- |
| Version Fixed: | 20.11.8 21.08.0pre1 | Target Release: | --- |
Spencer, Thanks for pointing this out. I am working on a bug fix now. -Scott

Spencer, The fix has been added to the GitHub repo with commit id 33f9459da4. It should be included in the release of 20.11.8. -Scott

Thanks! I'm glad it was a simple fix.
When using salloc to allocate GPUs, several important environment variables are not set in the shell that salloc starts. They are only set if the subprocess is run using `srun`:

    ~ $ salloc --gpus=2
    salloc: Granted job allocation 2356
    salloc: Waiting for resource configuration
    salloc: Nodes merlin-g-100 are ready for job
    (base) bash-4.2$ env | egrep 'GPU|CUDA'
    SLURM_GPUS=2
    (base) bash-4.2$ srun env | egrep 'GPU|CUDA'
    SLURM_GPUS=2
    SLURM_STEP_GPUS=2,3
    CUDA_VISIBLE_DEVICES=2,3
    GPU_DEVICE_ORDINAL=2,3

The missing CUDA_VISIBLE_DEVICES is particularly problematic, since without it software like TensorFlow will use all GPUs on the node (8 here, rather than the 2 allocated), leading to oversubscription:

    $ python -c 'import tensorflow as tf; print(len(tf.config.experimental.list_physical_devices("GPU")))' 2>/dev/null
    8
    $ srun python -c 'import tensorflow as tf; print(len(tf.config.experimental.list_physical_devices("GPU")))' 2>/dev/null
    2
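On an affected version, one possible workaround is to copy the GPU-related variables from a job step into the salloc shell's process before starting GPU software. The following is only a minimal sketch and not the fix that went into Slurm. The helper names (`parse_gpu_env`, `missing_in_shell`, `import_step_gpu_env`) are hypothetical. It assumes that `srun env` inside the allocation reports the same devices that later work should use, and that the step runs a single task. Unlike the `egrep` above, the filter matches on variable names only.

```python
import os
import re
import subprocess

# Hypothetical helpers; the pattern mirrors the egrep filter used in the report,
# but is applied to the variable name only.
GPU_VAR_PATTERN = re.compile(r"GPU|CUDA")


def parse_gpu_env(env_output: str) -> dict:
    """Parse `env`-style output, keeping variables whose name matches GPU|CUDA."""
    result = {}
    for line in env_output.splitlines():
        name, sep, value = line.partition("=")
        if sep and GPU_VAR_PATTERN.search(name):
            result[name] = value
    return result


def missing_in_shell(shell_env: dict, step_env: dict) -> dict:
    """Return the variables present in the job step but absent from the shell."""
    return {k: v for k, v in step_env.items() if k not in shell_env}


def import_step_gpu_env() -> dict:
    """Run `srun env` in the current allocation and copy missing GPU variables
    into os.environ. Assumes a single-task step, so each variable appears once."""
    out = subprocess.run(
        ["srun", "env"], capture_output=True, text=True, check=True
    ).stdout
    missing = missing_in_shell(dict(os.environ), parse_gpu_env(out))
    os.environ.update(missing)
    return missing
```

Calling `import_step_gpu_env()` at the top of a script, before importing TensorFlow, would set `CUDA_VISIBLE_DEVICES` for that process only. It does not change the parent shell's environment.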