Ticket 9543 - GPU accounting for AllocTRES is missing GPU subtype for some jobs
Summary: GPU accounting for AllocTRES is missing GPU subtype for some jobs
Status: RESOLVED DUPLICATE of ticket 8024
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 20.02.4
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Scott Hilton
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-08-07 14:40 MDT by Trey Dockendorf
Modified: 2020-10-05 16:28 MDT (History)
0 users

See Also:
Site: Ohio State OSC
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
gres.conf (1.23 KB, text/plain)
2020-08-07 14:40 MDT, Trey Dockendorf
slurm.conf (151.61 KB, text/plain)
2020-08-07 14:41 MDT, Trey Dockendorf

Description Trey Dockendorf 2020-08-07 14:40:31 MDT
We have configured Slurm to track not only the GPU count for jobs but also the GPU subtype, such as v100 or v100-32g. I have attached our configs to illustrate this. I submitted two jobs, both landing on the same GPU node that has two v100 GPUs. The first job's AllocTRES only has gres/gpu=1, while the second job correctly has the subtype tracked as gres/gpu:v100=#.
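For context, subtype tracking is enabled via AccountingStorageTRES in slurm.conf; the attached slurm.conf has our actual list, but a minimal sketch of the relevant line (values illustrative) would be:

```conf
# Sketch only - see the attached slurm.conf for the real configuration.
# Track the generic GPU count plus the per-subtype TRES mentioned above.
AccountingStorageTRES=gres/gpu,gres/gpu:v100,gres/gpu:v100-32g
```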

This is with 20.02.4, but that version is not an option in the bug tracker's version field.

Output:

$ sbatch --gpus=1 hostname.sbatch 
Submitted batch job 11928
$ sbatch --gpus=1 --exclusive hostname.sbatch 
Submitted batch job 11929

$ sacct -j 11928 --noheader --allocations --parsable --format=jobid,nodelist,reqtres,alloctres
11928|p0253|billing=1,cpu=1,gres/gpu=1,mem=4556M,node=1|billing=1,cpu=1,gres/gpu=1,node=1|
$ sacct -j 11929 --noheader --allocations --parsable --format=jobid,nodelist,reqtres,alloctres
11929|p0253|billing=1,cpu=1,gres/gpu=1,mem=4556M,node=1|billing=40,cpu=40,gres/gpu:v100=2,gres/gpu=2,node=1|
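To make the difference concrete, here is a small sketch (an illustrative helper, not part of Slurm) that parses AllocTRES strings like those in the sacct output above and flags jobs that were allocated GPUs without a typed gres/gpu:&lt;type&gt; entry:

```python
def tres_map(tres):
    """Parse a TRES string like 'cpu=1,gres/gpu=1' into a dict."""
    return dict(kv.split("=", 1) for kv in tres.split(",") if kv)

def missing_gpu_subtype(alloc_tres):
    """True if GPUs were allocated but no gres/gpu:<type>=N entry exists."""
    m = tres_map(alloc_tres)
    has_gpu = "gres/gpu" in m
    has_typed = any(k.startswith("gres/gpu:") for k in m)
    return has_gpu and not has_typed

# The two AllocTRES strings from the sacct output above:
job_11928 = "billing=1,cpu=1,gres/gpu=1,node=1"
job_11929 = "billing=40,cpu=40,gres/gpu:v100=2,gres/gpu=2,node=1"

print(missing_gpu_subtype(job_11928))  # job 11928 lacks the subtype
print(missing_gpu_subtype(job_11929))  # job 11929 has gres/gpu:v100
```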
Comment 1 Trey Dockendorf 2020-08-07 14:40:59 MDT
Created attachment 15358 [details]
gres.conf
Comment 2 Trey Dockendorf 2020-08-07 14:41:15 MDT
Created attachment 15359 [details]
slurm.conf
Comment 3 Trey Dockendorf 2020-08-07 14:45:39 MDT
Maybe this is expected behavior after reading the AccountingStorageTRES documentation more closely. I suppose the exclusive job only got the subtype because all GRES on the node got assigned to the job. If I removed "gres/gpu" from AccountingStorageTRES and kept only the entries with subtypes, would that ensure a job like my example gets the subtype in accounting? The docs make it seem like there are some issues with that.
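In other words, the question is whether a typed-only list like the following (hypothetical, not our current config) would force the subtype into accounting:

```conf
# Hypothetical: drop the untyped gres/gpu and keep only typed entries.
AccountingStorageTRES=gres/gpu:v100,gres/gpu:v100-32g
```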
Comment 4 Scott Hilton 2020-08-07 17:00:42 MDT
Trey, 

This is a known issue, and I happen to be currently working on a fix for the 20.11 release.

This is just an accounting issue; the GPUs are being allocated to the jobs just fine.

Until then, if you really want to track GRES types, users would need to specify the type they are requesting on every submission, like this:
>sbatch --gres=gpu:v100:1 hostname.sbatch

Thanks,

-Scott
Comment 5 Scott Hilton 2020-10-05 16:28:21 MDT
Trey,

As an update, this is unlikely to be finished in 20.11, but it is being actively worked on. It may take a while due to the refactoring needed to address this issue.

This was also brought up in ticket 8024, so I am marking this as a duplicate. If you have any further questions about this, feel free to reopen this ticket.

--Scott

*** This ticket has been marked as a duplicate of ticket 8024 ***