Ticket 9543

Summary: GPU accounting for AllocTRES is missing GPU subtype for some jobs
Product: Slurm
Reporter: Trey Dockendorf <tdockendorf>
Component: Accounting
Assignee: Scott Hilton <scott>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
Version: 20.02.4
Hardware: Linux
OS: Linux
Site: Ohio State OSC
Attachments: gres.conf
slurm.conf

Description Trey Dockendorf 2020-08-07 14:40:31 MDT
We have configured SLURM to track not only the GPU count for jobs but also the GPU subtype, such as v100 or v100-32g.  I have attached our configs to illustrate this.  I submitted two jobs, both landing on the same GPU node, which has two v100 GPUs.  The first job's AllocTRES has only gres/gpu=1, while the second job's correctly includes the subtype as gres/gpu:v100=#.

This is with 20.02.4, but that's not an option in the bug tracker.

Output:

$ sbatch --gpus=1 hostname.sbatch 
Submitted batch job 11928
$ sbatch --gpus=1 --exclusive hostname.sbatch 
Submitted batch job 11929

$ sacct -j 11928 --noheader --allocations --parsable --format=jobid,nodelist,reqtres,alloctres
11928|p0253|billing=1,cpu=1,gres/gpu=1,mem=4556M,node=1|billing=1,cpu=1,gres/gpu=1,node=1|
$ sacct -j 11929 --noheader --allocations --parsable --format=jobid,nodelist,reqtres,alloctres
11929|p0253|billing=1,cpu=1,gres/gpu=1,mem=4556M,node=1|billing=40,cpu=40,gres/gpu:v100=2,gres/gpu=2,node=1|
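
To make the difference between the two AllocTRES values easier to spot across many jobs, the check can be sketched as a small shell function. This is only an illustrative sketch: the function name alloctres_gpu_kind is hypothetical, and it assumes the AllocTRES field is the comma-separated key=value list shown in the sacct output above.

```shell
#!/bin/sh
# Hypothetical helper: classify an AllocTRES string by how its GPUs are recorded.
#   typed   -> contains a gres/gpu:<type>=N entry
#   untyped -> contains only the generic gres/gpu=N entry
#   none    -> no GPU entry at all
alloctres_gpu_kind() {
    # Wrap the string in commas so every entry starts with ",".
    case ",$1," in
        *,gres/gpu:*) echo typed ;;
        *,gres/gpu=*) echo untyped ;;
        *)            echo none ;;
    esac
}

# The two AllocTRES values from jobs 11928 and 11929 above.
alloctres_gpu_kind "billing=1,cpu=1,gres/gpu=1,node=1"                        # prints "untyped"
alloctres_gpu_kind "billing=40,cpu=40,gres/gpu:v100=2,gres/gpu=2,node=1"      # prints "typed"
```

In practice the argument would come from the alloctres column of sacct rather than a literal string.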
Comment 1 Trey Dockendorf 2020-08-07 14:40:59 MDT
Created attachment 15358 [details]
gres.conf
Comment 2 Trey Dockendorf 2020-08-07 14:41:15 MDT
Created attachment 15359 [details]
slurm.conf
Comment 3 Trey Dockendorf 2020-08-07 14:45:39 MDT
Maybe this is expected behavior, after reading the AccountingStorageTRES documentation more closely.  I suppose the exclusive job only got the subtype because all GRES got assigned to the job.  If I removed "gres/gpu" from AccountingStorageTRES and kept only the items with a subtype, would that ensure a job like my example is assigned the subtype in accounting? The docs make it seem like there are some issues with that.
Comment 4 Scott Hilton 2020-08-07 17:00:42 MDT
Trey, 

This is a known issue, and I happen to be working on a fix for the 20.11 release.

This is only an accounting issue; the GPUs are being allocated to the jobs just fine.

Until then, if you really want to track GRES types, users would need to specify the type they are asking for in every request, like this:
>sbatch --gres=gpu:v100:1 hostname.sbatch

Thanks,

-Scott
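
As an illustration of the workaround above, a site could check a GRES request string for a type before submission. The sketch below is not part of Slurm: the function name gres_spec_has_type is hypothetical, and it assumes the request follows the name[:type][:count] form with a plain integer count, as in gpu:v100:1.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if a GRES spec names a type.
# Assumes the form name[:type][:count] with a purely numeric count.
gres_spec_has_type() {
    case "$1" in
        *:*:*)      return 0 ;;   # name:type:count, e.g. gpu:v100:1
        *:*[!0-9]*) return 0 ;;   # name:type, second field is not a plain number
        *)          return 1 ;;   # name or name:count, e.g. gpu or gpu:1
    esac
}

# Report which example requests would be tracked with a subtype.
for spec in gpu:v100:1 gpu:v100 gpu:1 gpu; do
    if gres_spec_has_type "$spec"; then
        echo "$spec: type given"
    else
        echo "$spec: no type"
    fi
done
```

A wrapper around sbatch could use such a check to warn users whose requests would be recorded without a subtype.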
Comment 5 Scott Hilton 2020-10-05 16:28:21 MDT
Trey,

As an update, this is unlikely to be finished in 20.11, but it is being actively worked on. It may take a while due to the refactoring needed to address this issue.

This was also brought up in ticket 8024, so I am marking this as a duplicate. If you personally have a question about this, feel free to reopen this ticket.

--Scott

*** This ticket has been marked as a duplicate of ticket 8024 ***