We are using cons_tres to manage GPU allocation. We would also like to use gres to specify per-node features (specifically available scratch and video memory). These are defined in gres.conf:

    NodeName=merlin-g-001 Name=gpu Type=GTX1080 File=/dev/nvidia[0,1] Cores=0-15
    NodeName=merlin-g-001 Name=scratch Count=512

And slurm.conf:

    GresTypes=gpu,scratch
    AccountingStorageTRES=gres/scratch,gres/gpu,gres/gpu:GTX1080,gres/gpu:GTX1080Ti,ic/ofed
    NodeName=merlin-g-001 Weight=1 CPUs=16 RealMemory=128000 MemSpecLimit=25600 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 Gres=gpu:GTX1080:2,scratch:512 State=UNKNOWN

However, Slurm is unable to allocate resources when both --gpus and --gres are specified. Example:

    $ sbatch --partition=gpu --gres=scratch:500 --gpus=1 --wrap 'sleep 10'
    sbatch: error: Batch job submission failed: Requested node configuration is not available

Oddly, it does find a configuration if the requested gres count is less than or equal to the number of GPUs:

    $ sbatch --partition=gpu --gres=scratch:2,gpu:2 --gpus=2 --wrap 'sleep 10'
    Submitted batch job 9551

It also works if we specify the number of GPUs using --gres instead of --gpus (resulting in a per-node constraint as opposed to a per-job constraint):

    $ sbatch --partition=gpu --gres=scratch:500,gpu:1 --wrap 'sleep 10'
    Submitted batch job 9552

Is this a known issue with how --gpus interacts with --gres, or is there an error in our configuration?

Part of the motivation for this is related to implementing #9346 using gres instead of features. Since features are binary, specifying the video memory would require creating a feature for each class of graphics card (8G_GPU, 11G_GPU, etc.). It would be much cleaner to specify this as an integer (--gres=gpumem:11G).
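For reference, the form that does submit successfully can be wrapped in a small helper. This is only a sketch of the workaround, not a fix: the function name build_sbatch_cmd is hypothetical, the partition and wrapped command are copied from the examples above, and it requests GPUs through --gres, so the GPU count becomes a per-node constraint rather than the per-job constraint that --gpus gives.

```shell
#!/bin/sh
# Sketch of the workaround: request GPUs via --gres instead of --gpus.
# build_sbatch_cmd is a hypothetical helper name, not part of Slurm.
# $1 = scratch count, $2 = GPUs per node
build_sbatch_cmd() {
    # Print the command instead of running it, so it can be reviewed first.
    printf 'sbatch --partition=gpu --gres=scratch:%s,gpu:%s --wrap %s\n' \
        "$1" "$2" "'sleep 10'"
}

build_sbatch_cmd 500 1
```

Piping the printed line to a shell (or pasting it) submits the job in the same way as the third example above.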
Created attachment 17358: slurm.conf
Created attachment 17359: gres.conf
Hi Spencer,

I am able to reproduce this and will see what I find.

Thanks,
-Michael
Hi Spencer,

We found the issue and are currently reviewing a patch that will fix it. Thanks for the report!

Unfortunately, the fix will only land in 20.11, since that is the latest release and this isn't a security bug. Once it does land, though, I can attach a version of the patch for 20.02 in Bugzilla that you can apply yourself. Or, you can simply cherry-pick the 20.11 commits directly from GitHub, as I believe they should also apply cleanly on top of 20.02.

Thanks,
-Michael
Great, thanks for the fix!
(In reply to Spencer Bliven from comment #9)
> Great, thanks for the fix!

This has been fixed with commit 383808e7 and will be included in 20.11.3. See https://github.com/SchedMD/slurm/commit/383808e724baa71be58da54c2ab299b8a54d2bd4.

Thanks!
-Michael
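For sites staying on 20.02, the cherry-pick route mentioned earlier might look roughly like the sketch below. It is untested against a real tree and rests on several assumptions: that the 20.02 branch in the GitHub repository is named slurm-20.02, that this single commit is the whole fix (the earlier comment spoke of "commits", so there may be companions), and that it applies without conflicts. The function name backport_steps is hypothetical, and it only prints the git commands rather than running them.

```shell
#!/bin/sh
# Sketch: list the steps for backporting the fix onto a 20.02 checkout.
COMMIT=383808e724baa71be58da54c2ab299b8a54d2bd4   # commit named in this thread
BRANCH=slurm-20.02                                # ASSUMED branch name

# backport_steps is a hypothetical helper; it prints commands for review.
backport_steps() {
    echo "git clone https://github.com/SchedMD/slurm.git"
    echo "cd slurm"
    echo "git checkout $BRANCH"
    # -x records the original commit hash in the new commit message.
    echo "git cherry-pick -x $COMMIT"
}

backport_steps
```

If the cherry-pick reports conflicts, waiting for the 20.02 patch attached to this ticket is probably the safer option.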
Marking this as fixed in 20.11.3 and closing out.