We have told SLURM to track not only the GPU count for jobs but also the GPU subtype, such as v100 or v100-32g. I have attached our configs to illustrate this. I submitted two jobs, both landing on the same GPU node, which has two v100 GPUs. The first job's AllocTRES has only gres/gpu=1, while the second job correctly has the subtype tracked as gres/gpu:v100=#. This is with 20.02.4, but that version is not an option in the bug tracker.

Output:

$ sbatch --gpus=1 hostname.sbatch
Submitted batch job 11928
$ sbatch --gpus=1 --exclusive hostname.sbatch
Submitted batch job 11929
$ sacct -j 11928 --noheader --allocations --parsable --format=jobid,nodelist,reqtres,alloctres
11928|p0253|billing=1,cpu=1,gres/gpu=1,mem=4556M,node=1|billing=1,cpu=1,gres/gpu=1,node=1|
$ sacct -j 11929 --noheader --allocations --parsable --format=jobid,nodelist,reqtres,alloctres
11929|p0253|billing=1,cpu=1,gres/gpu=1,mem=4556M,node=1|billing=40,cpu=40,gres/gpu:v100=2,gres/gpu=2,node=1|
Created attachment 15358: gres.conf
Created attachment 15359: slurm.conf
Maybe this is expected behavior, now that I have read the AccountingStorageTRES documentation more closely. I suppose the exclusive job only got the subtype because all of the node's GRES were assigned to it. If I removed "gres/gpu" from AccountingStorageTRES and kept only the entries with a subtype, would that ensure a job like my example is assigned the subtype in accounting? The docs make it seem like there are some issues with that.
Trey,

This is a known issue, and I happen to be working on a fix for the 20.11 release. This is just an accounting issue; the GPUs are being allocated to the jobs just fine. Until then, if you really want to track GRES types, users would need to specify the type they are asking for in every request, like this:

>sbatch --gres=gpu:v100:1 hostname.sbatch

Thanks,
-Scott
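To check whether a given job actually had a GPU subtype recorded, the AllocTRES string from sacct can be inspected. The following is a minimal sketch, not part of Slurm: the helper name has_typed_gpu is hypothetical, and the two sample strings are the AllocTRES values copied from the sacct output for jobs 11928 and 11929 in the original report.

```shell
#!/usr/bin/env bash
# has_typed_gpu TRES_STRING  (hypothetical helper, not a Slurm command)
# Returns 0 if the comma-separated TRES string contains a typed GPU entry
# such as gres/gpu:v100=2, and 1 if only an untyped gres/gpu=N (or no GPU
# entry at all) is present.
has_typed_gpu() {
    local tres=$1 entry
    local IFS=','              # split the TRES string on commas
    for entry in $tres; do
        case $entry in
            gres/gpu:*=*) return 0 ;;   # e.g. gres/gpu:v100=2
        esac
    done
    return 1
}

# AllocTRES values copied from the sacct output in the original report.
untyped='billing=1,cpu=1,gres/gpu=1,node=1'                    # job 11928
typed='billing=40,cpu=40,gres/gpu:v100=2,gres/gpu=2,node=1'    # job 11929

for tres in "$untyped" "$typed"; do
    if has_typed_gpu "$tres"; then
        echo "GPU subtype recorded: $tres"
    else
        echo "no GPU subtype recorded: $tres"
    fi
done
```

On a live cluster the string could presumably be fetched with something like sacct -j JOBID --noheader --allocations --parsable2 --format=alloctres and passed to the helper; that step is left out of the sketch because it depends on the site's accounting setup.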
Trey,

As an update, this is unlikely to be finished in 20.11, but it is being actively worked on. It may take a while because of the refactoring needed to address this issue. This was also brought up in ticket 8024, so I am marking this as a duplicate. If you have a question about this, feel free to reopen this ticket.

--Scott

*** This ticket has been marked as a duplicate of ticket 8024 ***