Ticket 9159 - understanding sacct output of job data wrt GRES (and an anomaly)
Summary: understanding sacct output of job data wrt GRES (and an anomaly)
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 19.05.5
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Scott Hilton
Reported: 2020-06-03 20:04 MDT by Greg Wickham
Modified: 2020-06-22 19:58 MDT

Site: KAUST

Attachments
Job 10342997 data (5.09 KB, text/plain)
2020-06-11 16:46 MDT, Greg Wickham

Description Greg Wickham 2020-06-03 20:04:10 MDT
This came up while working on https://bugs.schedmd.com/show_bug.cgi?id=8952 with Scott, who requested that it be opened as a new case.

For job 10342997

$ sacct -P -j 10342997 --format jobidraw,start,end,nodelist,reqtres,reqgres,alloctres,allocgres
JobIDRaw|Start|End|NodeList|ReqTRES|ReqGRES|AllocTRES|AllocGRES
10342997|2020-04-21T18:33:34|2020-04-21T18:33:50|gpu504-37,gpu510-17|billing=20,cpu=20,gres/gpu=4,mem=120G,node=1|gpu:8|billing=20,cpu=20,gres/gpu=8,mem=120G,node=2|gpu:8
10342997.batch|2020-04-21T18:33:34|2020-04-21T18:33:50|gpu504-37||gpu:8|cpu=10,mem=60G,node=1|gpu:8
10342997.extern|2020-04-21T18:33:34|2020-04-21T18:33:51|gpu504-37,gpu510-17||gpu:8|billing=20,cpu=20,gres/gpu=8,mem=120G,node=2|gpu:8
10342997.0|2020-04-21T18:33:36|2020-04-21T18:33:52|gpu510-17||gpu:8|cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=6G,node=1|gpu:8

$ sinfo -n gpu510-17,gpu504-37 -o "%n,%G"
HOSTNAMES,GRES
gpu504-37,gpu:gtx1080ti:8(S:0-1)
gpu510-17,gpu:rtx2080ti:8(S:0-1)

Q1: Step "10342997.0" reports "gres/gpu:gtx1080ti=4" on node gpu510-17, but that node has "rtx2080ti" GPUs.

Q2: For 10342997.0, AllocTRES shows 4 GPUs while AllocGRES shows 8 GPUs.

Q3: 10342997.batch and 10342997.0 ran at the same time on different nodes. What is the total number of GPUs allocated?
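
The Q1 mismatch can be checked mechanically. The following is a minimal Python sketch, not part of the original report: it assumes the sinfo and sacct text shown above has been captured as strings, that each node advertises a single typed GPU entry, and that NodeList is a plain comma-separated list (real Slurm output may use bracketed hostlist expressions, which this does not expand). The function names are hypothetical.

```python
# Sample data copied from the ticket above.
SINFO = """HOSTNAMES,GRES
gpu504-37,gpu:gtx1080ti:8(S:0-1)
gpu510-17,gpu:rtx2080ti:8(S:0-1)
"""

# step id -> (NodeList, AllocTRES), taken from the sacct row for 10342997.0
STEPS = {
    "10342997.0": (
        "gpu510-17",
        "cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=6G,node=1",
    ),
}


def node_gpu_types(sinfo_text):
    """Map hostname -> GPU type, e.g. 'gpu510-17' -> 'rtx2080ti'."""
    types = {}
    for line in sinfo_text.strip().splitlines()[1:]:  # skip the header line
        host, gres = line.split(",", 1)
        fields = gres.split(":")
        # Assumption: one entry of the form 'gpu:<type>:<count>...' per node.
        if len(fields) >= 3 and fields[0] == "gpu":
            types[host] = fields[1]
    return types


def typed_gpus(tres):
    """Return GPU type names from TRES keys like 'gres/gpu:gtx1080ti=4'."""
    found = []
    for item in tres.split(","):
        key = item.partition("=")[0]
        if key.startswith("gres/gpu:"):
            found.append(key.split(":", 1)[1])
    return found


def mismatches(steps, sinfo_text):
    """List (step, node, recorded_type, node_type) where the types differ."""
    types = node_gpu_types(sinfo_text)
    bad = []
    for step, (nodelist, tres) in steps.items():
        for node in nodelist.split(","):
            for gpu_type in typed_gpus(tres):
                if types.get(node) != gpu_type:
                    bad.append((step, node, gpu_type, types.get(node)))
    return bad


print(mismatches(STEPS, SINFO))
```

With the data from this ticket, the single step is flagged because the accounting record names gtx1080ti while sinfo reports rtx2080ti for gpu510-17.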
Comment 3 Scott Hilton 2020-06-04 16:28:08 MDT
Greg, 

1): This looks like the same issue as in 8024. There hasn't been any update on that one yet.

2): This one is interesting. It is also odd that ReqTRES asks for 1 node but 2 nodes were allocated. I'll look into it further and get back to you.

3): According to AllocTRES, the batch step isn't allocated any GPUs. But as in Q2, that disagrees with AllocGRES. I'll have to get back to you.

-Scott
Comment 4 Scott Hilton 2020-06-11 11:28:12 MDT
Greg,

Do you know how this job was created? i.e. (sbatch ...) Is this just an issue with this one job, or are you getting similar discrepancies with many jobs? If so, are there any patterns that lead to it?

Could you also send me the output of "sacct -P -j 10342997 --format ALL"?

Thanks,

Scott
Comment 5 Greg Wickham 2020-06-11 16:46:59 MDT
Created attachment 14644
Job 10342997 data


Output of "sacct -P -j 10342997 --format ALL" attached.
Comment 6 Greg Wickham 2020-06-11 16:49:44 MDT
(In reply to Scott Hilton from comment #4)
> Greg,
> 
> Do you know how this job was created? i.e. (sbatch ...) Is this just an
> issue with this one job or are you getting similar discrepancies with many
> jobs? If so are there any patterns that lead to it?

We only know what Slurm records. The anomaly was only discovered while reviewing the accounting records when creating this case.

   -Greg
Comment 8 Scott Hilton 2020-06-17 15:28:35 MDT
Greg, 

For your second question, trust AllocTRES.

AllocGRES (and ReqGRES) just print the job-level value, not the individual step values.

In other words, as currently implemented, each step row (.extern, .batch, .0) will always show the same value as the top (job) row.

I think this is confusing and will look into changing it.

-Scott
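
To illustrate the point about trusting AllocTRES, here is a minimal Python sketch (not from the ticket) that reads the pipe-delimited sacct output shown in the description, trimmed to three columns, and extracts the per-row GPU count from the gres/gpu entry of AllocTRES. The function names are hypothetical, and a missing gres/gpu entry is assumed to mean zero GPUs.

```python
# Rows copied from the ticket's sacct output, trimmed to three columns.
SACCT = """JobIDRaw|AllocTRES|AllocGRES
10342997|billing=20,cpu=20,gres/gpu=8,mem=120G,node=2|gpu:8
10342997.batch|cpu=10,mem=60G,node=1|gpu:8
10342997.extern|billing=20,cpu=20,gres/gpu=8,mem=120G,node=2|gpu:8
10342997.0|cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=6G,node=1|gpu:8
"""


def parse_tres(tres):
    """Turn 'cpu=1,gres/gpu=4' into {'cpu': '1', 'gres/gpu': '4'}."""
    out = {}
    for item in tres.split(","):
        if item:
            key, _, value = item.partition("=")
            out[key] = value
    return out


def gpus_per_row(sacct_text):
    """Map JobIDRaw -> GPU count taken from AllocTRES (0 if absent)."""
    lines = sacct_text.strip().splitlines()
    header = lines[0].split("|")
    result = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split("|")))
        tres = parse_tres(row["AllocTRES"])
        # Assumption: no 'gres/gpu' key means no GPUs were allocated.
        result[row["JobIDRaw"]] = int(tres.get("gres/gpu", 0))
    return result


for job_id, gpus in gpus_per_row(SACCT).items():
    print(job_id, gpus)
```

On this data the AllocTRES-derived counts differ per row (8 for the job and .extern, 0 for .batch, 4 for .0), whereas the AllocGRES column reads gpu:8 on every row.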
Comment 11 Scott Hilton 2020-06-18 14:00:07 MDT
Greg,

We are now planning to remove the AllocGRES and ReqGRES options in the Slurm 20.11 release.

Just focus on the output of ReqTRES and AllocTRES.

Does this answer all your questions?

-Scott
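
When working only from ReqTRES and AllocTRES, the two strings can be compared field by field. The sketch below is an illustration under assumptions, not something from the ticket: it uses the job-level values from the description and a hypothetical helper, tres_diff, that reports every key whose requested and allocated values differ.

```python
# Job-level values for 10342997, copied from the sacct output in the description.
REQ_TRES = "billing=20,cpu=20,gres/gpu=4,mem=120G,node=1"
ALLOC_TRES = "billing=20,cpu=20,gres/gpu=8,mem=120G,node=2"


def to_dict(tres):
    """Turn 'cpu=20,node=1' into {'cpu': '20', 'node': '1'}."""
    out = {}
    for item in tres.split(","):
        if item:
            key, _, value = item.partition("=")
            out[key] = value
    return out


def tres_diff(req, alloc):
    """Return {key: (requested, allocated)} for keys whose values differ.

    A key present on only one side is reported with None on the other.
    Values are compared as strings, so '120G' and '122880M' would be
    reported as different even if they denote the same amount.
    """
    r, a = to_dict(req), to_dict(alloc)
    return {
        key: (r.get(key), a.get(key))
        for key in sorted(set(r) | set(a))
        if r.get(key) != a.get(key)
    }


print(tres_diff(REQ_TRES, ALLOC_TRES))
```

For this job the differing keys are gres/gpu (4 requested, 8 allocated) and node (1 requested, 2 allocated), which matches the discrepancy noted in comment 3.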
Comment 13 Scott Hilton 2020-06-22 09:44:07 MDT
I am going to go ahead and close this bug as info given.
Comment 14 Greg Wickham 2020-06-22 19:58:01 MDT
Thanks Scott.