Hello,

We are a new Slurm customer; our site is not yet included in the dropdown list. We are the Blue Brain Project (BBP) at École Polytechnique Fédérale de Lausanne (EPFL).

We have a user who wishes to allocate a GPU via salloc and then, from within the allocation, run further code via srun that uses the GPU. The following results in srun hanging:

[morrice@bbpv1 ~]$ salloc --account=proj16 -p interactive -C volta --gres=gpu:1
[morrice@r2i3n0 ~]$ echo $CUDA_VISIBLE_DEVICES
0
[morrice@r2i3n0 ~]$ srun hostname
srun: Job step creation temporarily disabled, retrying

If I run the following (from the same allocation), it works:

[morrice@r2i3n0 ~]$ srun --gres=none hostname
r2i3n0

For reference, the default salloc command is:

[morrice@r2i3n0 ~]$ grep Sallo /etc/slurm/slurm.conf
SallocDefaultCommand="/usr/bin/srun -n1 -N1 --propagate=ALL --pty --preserve-env --mem-per-cpu=0 --mpi=pmi2 $SHELL -l"

We are running slurm-17.02.9-1.el7.x86_64.

Am I misunderstanding something here? Is it possible to access GPU resources via srun from within an salloc allocation?
I believe that should work. I need to look into this more. Can you also upload your slurm.conf and gres.conf so I can try to reproduce what you're seeing?
Created attachment 7969 [details] gres.conf
Created attachment 7970 [details] slurm.conf
Thanks Marshall - I've added our slurm.conf and gres.conf as attachments to this ticket.
I can reproduce what you're seeing, and I believe I have the solution. It's essentially a duplicate of bug 5543.

Add --gres=none to your SallocDefaultCommand. salloc jobs that want GRES can still override it per step. It works as expected for me. So, with your revised SallocDefaultCommand:

SallocDefaultCommand="/usr/bin/srun -n1 -N1 --propagate=ALL --pty --preserve-env --mem-per-cpu=0 --mpi=pmi2 --gres=none $SHELL -l"

marshall@voyager:~/slurm/18.08/voyager$ salloc --gres=gpu:1
salloc: Granted job allocation 15075
salloc: Waiting for resource configuration
salloc: Nodes v1 are ready for job
marshall@voyager:~/slurm/18.08/voyager$ env|grep -i cuda
marshall@voyager:~/slurm/18.08/voyager$ srun env|grep -i cuda
CUDA_VISIBLE_DEVICES=0
marshall@voyager:~/slurm/18.08/voyager$ srun --gres=gpu:none env|grep -i cuda
srun: error: Unable to create step for job 15075: Invalid Trackable RESource (TRES) specification
marshall@voyager:~/slurm/18.08/voyager$ srun --gres=gpu:0 env|grep -i cuda
marshall@voyager:~/slurm/18.08/voyager$ srun --gres=none env|grep -i cuda

Can you verify that this fixes it for you? Assuming it does, we might want to consider modifying our documentation to recommend adding --gres=none to SallocDefaultCommand when GRES are used.
Hello Marshall,

Thank you for the information. Your suggestion has helped me reach a solution. In our case, our users would like to have CUDA_VISIBLE_DEVICES available via salloc AND srun. My solution is to:

- add --gres=gpu:0 to the default salloc command
- add logic to slurm.prolog to populate a file /tmp/${SLURM_JOB_USER}_${SLURM_JOB_ID}_CUDA with $SLURM_JOB_GPUS (if the variable is not null)
- add logic to slurm.epilog to remove the above file
- add logic to slurm.taskprolog to set CUDA_VISIBLE_DEVICES from the contents of /tmp/${USER}_${SLURM_JOB_ID}_CUDA

The end result is something similar to the following:

[morrice@bbpv1 ~]$ salloc --account proj14 -p interactive -C volta --gres=gpu:1
salloc: Granted job allocation 182476
salloc: Waiting for resource configuration
salloc: Nodes r2i3n0 are ready for job
[morrice@r2i3n0 ~]$ printenv |grep CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=1
[morrice@r2i3n0 ~]$ srun printenv |grep CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=1
[morrice@r2i3n0 ~]$

Thank you for your assistance; you may close this ticket.
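For anyone following along, the prolog/epilog/taskprolog logic described above could be sketched roughly as follows. This is a sketch under the file-naming convention from the comment, not the site's actual scripts; the guards and quoting are my assumptions. Note that TaskProlog communicates with the task by printing "export NAME=value" lines to stdout.

```shell
#!/bin/bash
# Sketch of the three script fragments described above (assumptions, not
# the site's exact scripts). Each section would live in its own file.

# --- slurm.prolog (runs on the node at job start) ---
# Record the job's GPU list, if any, in a per-user, per-job file.
gpu_file="/tmp/${SLURM_JOB_USER}_${SLURM_JOB_ID}_CUDA"
if [ -n "$SLURM_JOB_GPUS" ]; then
    echo "$SLURM_JOB_GPUS" > "$gpu_file"
fi

# --- slurm.epilog (runs at job end) ---
# Clean up the file created by the prolog.
rm -f "/tmp/${SLURM_JOB_USER}_${SLURM_JOB_ID}_CUDA"

# --- slurm.taskprolog (runs before each task) ---
# Lines printed as "export NAME=value" are injected into the task's
# environment by Slurm, so every srun step sees CUDA_VISIBLE_DEVICES.
task_gpu_file="/tmp/${USER}_${SLURM_JOB_ID}_CUDA"
if [ -r "$task_gpu_file" ]; then
    echo "export CUDA_VISIBLE_DEVICES=$(cat "$task_gpu_file")"
fi
```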
Great. You may also want to check whether you're constraining devices in cgroup.conf, and decide whether that's something you want to do or not. Closing as resolved/infogiven.
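For reference, device constraining is controlled by the ConstrainDevices option in cgroup.conf (it takes effect when TaskPlugin=task/cgroup is set in slurm.conf). The fragment below is illustrative, not a recommendation for this site:

```
# cgroup.conf (illustrative values, not a site recommendation)
CgroupAutomount=yes
ConstrainDevices=yes    # restrict each job to the devices it was allocated
```

With ConstrainDevices=yes, a taskprolog that exports CUDA_VISIBLE_DEVICES for GPUs outside the step's cgroup would still not grant access to those devices.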