This came up while working on https://bugs.schedmd.com/show_bug.cgi?id=8952 with Scott; he requested that it be created as a new case.

For job 10342997:

$ sacct -P -j 10342997 --format jobidraw,start,end,nodelist,reqtres,reqgres,alloctres,allocgres
JobIDRaw|Start|End|NodeList|ReqTRES|ReqGRES|AllocTRES|AllocGRES
10342997|2020-04-21T18:33:34|2020-04-21T18:33:50|gpu504-37,gpu510-17|billing=20,cpu=20,gres/gpu=4,mem=120G,node=1|gpu:8|billing=20,cpu=20,gres/gpu=8,mem=120G,node=2|gpu:8
10342997.batch|2020-04-21T18:33:34|2020-04-21T18:33:50|gpu504-37||gpu:8|cpu=10,mem=60G,node=1|gpu:8
10342997.extern|2020-04-21T18:33:34|2020-04-21T18:33:51|gpu504-37,gpu510-17||gpu:8|billing=20,cpu=20,gres/gpu=8,mem=120G,node=2|gpu:8
10342997.0|2020-04-21T18:33:36|2020-04-21T18:33:52|gpu510-17||gpu:8|cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=6G,node=1|gpu:8

$ sinfo -n gpu510-17,gpu504-37 -o "%n,%G"
HOSTNAMES,GRES
gpu504-37,gpu:gtx1080ti:8(S:0-1)
gpu510-17,gpu:rtx2080ti:8(S:0-1)

Q1: Step 10342997.0 reports "gres/gpu:gtx1080ti=4" on node gpu510-17, but that node has "rtx2080ti" GPUs.

Q2: Step 10342997.0 shows 4 GPUs in AllocTRES but 8 GPUs in AllocGRES.

Q3: Steps 10342997.batch and 10342997.0 ran at the same time on different nodes. What is the total number of GPUs allocated?
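The mismatch described in Q1 can be checked mechanically. The following is a minimal Python sketch, not a Slurm tool: it takes the AllocTRES string and node list of step 10342997.0 and the sinfo output exactly as pasted above, and reports nodes whose GPU type does not appear among the typed GPUs recorded for the step. The helper names (node_gpu_types, step_gpu_types, type_mismatches) are hypothetical, and the parsing assumes each node advertises a single typed GRES entry of the form gpu:<type>:<count>.

```python
import re

# Values copied from the sacct and sinfo output in this ticket (step 10342997.0).
STEP_NODELIST = "gpu510-17"
STEP_ALLOC_TRES = "cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=6G,node=1"
SINFO_OUTPUT = """\
HOSTNAMES,GRES
gpu504-37,gpu:gtx1080ti:8(S:0-1)
gpu510-17,gpu:rtx2080ti:8(S:0-1)
"""


def node_gpu_types(sinfo_text):
    """Map hostname -> GPU type from `sinfo -o "%n,%G"` output."""
    types = {}
    for line in sinfo_text.splitlines()[1:]:  # skip the header row
        host, _, gres = line.partition(",")
        # Assumption: one typed entry per node, e.g. gpu:rtx2080ti:8(S:0-1)
        match = re.match(r"gpu:([^:]+):\d+", gres)
        if match:
            types[host] = match.group(1)
    return types


def step_gpu_types(alloc_tres):
    """Return the set of typed GPU names in a TRES string."""
    names = (item.partition("=")[0] for item in alloc_tres.split(","))
    return {n.split(":", 1)[1] for n in names if n.startswith("gres/gpu:")}


def type_mismatches(nodelist, alloc_tres, sinfo_text):
    """List (host, node_type, step_types) where the node type is not in the step's types."""
    by_node = node_gpu_types(sinfo_text)
    recorded = step_gpu_types(alloc_tres)
    return [
        (host, by_node[host], sorted(recorded))
        for host in nodelist.split(",")
        if host in by_node and recorded and by_node[host] not in recorded
    ]


for host, node_type, recorded in type_mismatches(
    STEP_NODELIST, STEP_ALLOC_TRES, SINFO_OUTPUT
):
    print(f"{host}: node has {node_type}, step records {recorded}")
```

The node list is split on commas, which only works for an uncompressed list like the one shown here; a bracketed host range would need to be expanded first.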
Greg,

1) This looks like the same issue as in bug 8024. There hasn't been any update on that one either.

2) This one is interesting. It is also odd that ReqTRES asks for 1 node but 2 nodes were allocated. I'll look into it further and get back to you.

3) According to AllocTRES, the batch step isn't allocated any GPUs. But, as in Q2, that disagrees with AllocGRES. I'll have to get back to you.

-Scott
Greg,

Do you know how this job was created (i.e. sbatch ...)? Is this an issue with just this one job, or are you seeing similar discrepancies with many jobs? If so, are there any patterns that lead to it?

Could you also send me the output of

sacct -P -j 10342997 --format ALL

Thanks,
Scott
Created attachment 14644
Job 10342997 data

Output of "sacct -P -j 10342997 --format ALL" attached.
(In reply to Scott Hilton from comment #4)
> Greg,
>
> Do you know how this job was created? i.e. (sbatch ...) Is this just an
> issue with this one job or are you getting similar discrepancies with many
> jobs? If so are that any patterns that lead to it?

We only know what slurm records. The anomaly was only discovered while reviewing the accounting records when creating this case.

-Greg
Greg,

For your second question, trust AllocTRES. AllocGRES (and ReqGRES) just print the job-level value, not the individual step values. In other words, as currently written, every step row (.extern, .batch, .0) will always show the same value as the top (job) row. I think this is confusing and will look into changing it.

-Scott
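To see which rows are affected by this, the GPU count in AllocTRES can be compared with the one in AllocGRES for each row. The following is a minimal Python sketch using the values from the sacct output earlier in this ticket; the helper names (tres_gpu_count, gres_gpu_count, disagreements) are hypothetical, and a missing gres/gpu entry in AllocTRES is assumed to mean zero GPUs.

```python
# (row id, AllocTRES, AllocGRES) copied from the sacct output in this ticket.
ROWS = [
    ("10342997", "billing=20,cpu=20,gres/gpu=8,mem=120G,node=2", "gpu:8"),
    ("10342997.batch", "cpu=10,mem=60G,node=1", "gpu:8"),
    ("10342997.extern", "billing=20,cpu=20,gres/gpu=8,mem=120G,node=2", "gpu:8"),
    ("10342997.0", "cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=6G,node=1", "gpu:8"),
]


def tres_gpu_count(alloc_tres):
    """GPU count from the untyped gres/gpu entry of a TRES string (0 if absent)."""
    for item in alloc_tres.split(","):
        name, _, value = item.partition("=")
        if name == "gres/gpu":
            return int(value)
    return 0


def gres_gpu_count(alloc_gres):
    """GPU count from a GRES string such as "gpu:8" (count is the last field)."""
    total = 0
    for item in alloc_gres.split(","):
        parts = item.split(":")
        if parts[0] == "gpu" and parts[-1].isdigit():
            total += int(parts[-1])
    return total


def disagreements(rows):
    """Rows where AllocTRES and AllocGRES report different GPU counts."""
    return [
        (row_id, tres_gpu_count(tres), gres_gpu_count(gres))
        for row_id, tres, gres in rows
        if tres_gpu_count(tres) != gres_gpu_count(gres)
    ]


for row_id, from_tres, from_gres in disagreements(ROWS):
    print(f"{row_id}: AllocTRES says {from_tres} GPUs, AllocGRES says {from_gres}")
```

For this job, the rows that disagree are 10342997.batch and 10342997.0, which is consistent with AllocGRES repeating the job-level value on every step row.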
Greg,

We are now planning to remove the AllocGRES and ReqGRES options in the Slurm 20.11 release. Just focus on the output of ReqTRES and AllocTRES.

Does this answer all your questions?

-Scott
I am going to go ahead and close this bug as "info given".
Thanks Scott.