Ticket 9275 - Unable to determine GPU allocations per node from accounting records
Summary: Unable to determine GPU allocations per node from accounting records
Status: RESOLVED DUPLICATE of ticket 8024
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 20.02.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Scott Hilton
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-06-25 01:16 MDT by Greg Wickham
Modified: 2020-06-25 11:24 MDT

See Also:
Site: KAUST
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
sacct -P -j 11021559 (4.40 KB, text/csv)
2020-06-25 01:16 MDT, Greg Wickham

Description Greg Wickham 2020-06-25 01:16:34 MDT
Created attachment 14774
sacct -P -j 11021559

With the recent enhancements to GPU requests for sbatch / srun, it's no longer possible to determine from the accounting records which nodes were allocated a specific GPU count.

For example, the command:

   $ srun -G 100 -t 1 --pty bash -l

will result in 100 GPUs being allocated, each with one CPU core.

In one of our tests the GPUs were allocated across 19 nodes; however, the information about the GPU allocations in the accounting records is incorrect:

$ sacct -P -j 11021559 --format=AllocTRES
AllocTRES
billing=19,cpu=19,gres/gpu=100,mem=38G,node=19
billing=19,cpu=19,gres/gpu=100,mem=38G,node=19
cpu=19,gres/gpu:gtx1080ti=100,gres/gpu=100,mem=0,node=19

(I've attached the full sacct output for the job as a CSV file.)
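For reference, each AllocTRES value is a comma-separated list of name=count pairs, which is why per-node detail is lost: every record carries only cluster-wide totals. A minimal Python sketch (not part of Slurm; the helper name is mine) that parses one record into a dict:

```python
# Minimal sketch: parse a Slurm TRES string ("name=count,...") into a dict.
# Counts with a unit suffix (e.g. "38G" for memory) are kept as strings.
def parse_tres(tres):
    out = {}
    for field in tres.split(","):
        name, _, value = field.partition("=")
        out[name] = int(value) if value.isdigit() else value
    return out

record = parse_tres("cpu=19,gres/gpu:gtx1080ti=100,gres/gpu=100,mem=0,node=19")
print(record["gres/gpu"])  # aggregate GPU count across all nodes -> 100
print(record["node"])      # node count only; no per-node GPU breakdown -> 19
```

As the parse makes plain, the record ties 100 GPUs to 19 nodes without saying how many each node contributed.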

One of the records indicates gres/gpu:gtx1080ti=100, yet we only have 64 x 1080ti in our cluster.

It would be desirable to have individual allocations for each node listed.


   -greg
Comment 3 Scott Hilton 2020-06-25 11:24:57 MDT
Greg, 

This is the same issue as ticket 8024; see comment 13 there.

We are aware of the issue and plan on fixing it eventually. It has proven tricky so we cannot guarantee when it will be fixed. We will let you know when we have a fix available.

While this bug is still open, you could look at the NodeList field to determine which nodes (and hence which GPUs) were actually allocated.
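The NodeList workaround can be scripted. NodeList comes back as a compressed hostlist (something like "gpu[001-019]"; the node names here are hypothetical), and a rough Python sketch can expand the simple single-bracket form; "scontrol show hostnames" is the authoritative tool for the general case:

```python
import re

# Rough sketch: expand a simple compressed Slurm hostlist such as
# "gpu[001-003,007]" into individual node names. Handles one bracket
# group with comma-separated ranges; real hostlists can be more
# complex, in which case use "scontrol show hostnames" instead.
def expand_hostlist(expr):
    m = re.fullmatch(r"([^\[]+)\[([^\]]+)\]", expr)
    if not m:
        return [expr]  # plain single hostname
    prefix, body = m.groups()
    names = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # preserve zero-padding of the node index
            names.extend(f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1))
        else:
            names.append(prefix + part)
    return names

print(expand_hostlist("gpu[001-003,007]"))
# ['gpu001', 'gpu002', 'gpu003', 'gpu007']
```

Cross-referencing the expanded node names against each node's configured GRES count gives an upper bound on the GPUs allocated per node, though not the exact split this ticket asks for.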

Good luck,

Scott

*** This ticket has been marked as a duplicate of ticket 8024 ***