Ticket 8024 - AllocTRES of step doesn't match AllocTRES of job
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 19.05.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
Duplicates: 9275 9543 9932
 
Reported: 2019-10-30 07:41 MDT by Greg Wickham
Modified: 2021-07-15 13:03 MDT

Site: KAUST
Version Fixed: 21.08.0


Attachments
slurm.conf (4.30 KB, text/plain)
2019-10-30 23:06 MDT, Greg Wickham
Details
gres.conf (5.17 KB, text/plain)
2019-10-30 23:07 MDT, Greg Wickham
Details

Description Greg Wickham 2019-10-30 07:41:12 MDT
$ sacct -j 6647867 --format jobid,nodelist,alloctres%60
       JobID        NodeList                                                    AllocTRES 
------------ ---------------        ----------------------------------------------------- 
6647867            gpu208-02   billing=9,cpu=9,gres/gpu:v100=4,gres/gpu=4,mem=128G,node=1 
6647867.ext+       gpu208-02   billing=9,cpu=9,gres/gpu:v100=4,gres/gpu=4,mem=128G,node=1 
6647867.0          gpu208-02        cpu=9,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=128G,node=1 
6647867.1          gpu208-02        cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=128G,node=1 
6647867.2          gpu208-02        cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=128G,node=1 
6647867.3          gpu208-02        cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=128G,node=1 
6647867.4          gpu208-02        cpu=9,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=128G,node=1 
6647867.5          gpu208-02        cpu=9,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=128G,node=1 
6647867.6          gpu208-02        cpu=9,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=128G,node=1 


The GRES shown above that sacct reports as allocated to the steps doesn't match either the node (which only has gpu:v100:8) or the job itself.

$ scontrol show -o node=gpu208-02
NodeName=gpu208-02 Arch=x86_64 CoresPerSocket=24  CPUAlloc=27 CPUTot=48 CPULoad=14.91 AvailableFeatures=dragon,ibex2018,nolmem,cpu_intel_platinum_8260,intel,gpu,intel_gpu,local_200G,local_400G,local_500G,gpu_v100,v100,nossh ActiveFeatures=dragon,ibex2018,nolmem,cpu_intel_platinum_8260,intel,gpu,intel_gpu,local_200G,local_400G,local_500G,gpu_v100,v100,nossh Gres=gpu:v100:8 NodeAddr=gpu208-02 NodeHostName=gpu208-02  OS=Linux 3.10.0-957.12.1.el7.x86_64 #1 SMP Mon Apr 29 14:59:59 UTC 2019  RealMemory=763000 AllocMem=213840 FreeMem=751573 Sockets=2 Boards=1 State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=32000 Owner=N/A MCS_label=N/A Partitions=batch  BootTime=2019-10-30T15:15:59 SlurmdStartTime=2019-10-30T15:25:11 CfgTRES=cpu=48,mem=763000M,billing=48,gres/gpu=8,gres/gpu:v100=8 AllocTRES=cpu=27,mem=213840M,gres/gpu=3,gres/gpu:v100=3 CapWatts=n/a CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Comment 2 Michael Hinton 2019-10-30 14:46:20 MDT
Hi Greg,

Indeed, AllocTRES in `scontrol show nodes` can be messed up when there is more than one GPU type. Differing `Cores=` definitions for each of the GPUs in gres.conf can also increase the chances of this bug showing up. Although AllocTRES in `scontrol show nodes` is sometimes incorrect, I haven't seen any evidence yet that jobs get scheduled with incorrect resources, or that AllocTRES in sacct is wrong.

I actually have a fix pending review in a private internal bug for this very issue (bug 7827). I'll let you know when that lands.

In the meantime, could you give me your current slurm.conf and gres.conf, so I can reproduce the issue and confirm my patch solves it? I'm also assuming that sacct is showing the correct GRES request for the job; if not, I'll need that as well.

Thanks,
-Michael
Comment 3 Greg Wickham 2019-10-30 23:06:50 MDT
Created attachment 12182 [details]
slurm.conf
Comment 4 Greg Wickham 2019-10-30 23:07:36 MDT
Created attachment 12183 [details]
gres.conf
Comment 5 Greg Wickham 2019-10-30 23:22:25 MDT
Hi Michael,

We're not seeing what you describe - the output of 'scontrol show nodes' is correct.

The user requested 'v100' GPUs and a suitable node was chosen (gpu208-02).

The job steps ran on the selected node (with v100 GPUs which is correct).

The issue is the "step" accounting is showing incorrect AllocTRES - the node has v100 GPUs but the step AllocTRES is indicating 'gtx1080ti'.

FWIW, the user started with an 'salloc' and then used 'srun' to run steps within the allocation.

Files uploaded.

   -Greg
Comment 6 Michael Hinton 2019-11-01 09:43:31 MDT
(In reply to Greg Wickham from comment #5)
> We're not seeing what you describe - the output of 'scontrol show nodes' is
> correct.
Ok, thanks. I'll get to the bottom of this and get back to you.

-Michael
Comment 7 Michael Hinton 2019-11-04 12:31:02 MST
Sorry Greg, could you also give me the node definition for gpu208-02? I didn't realize it would be in a different file. I want to double check that there isn't anything fishy going on there. Thanks.
Comment 8 Michael Hinton 2019-11-04 12:40:11 MST
Also, if you know the commands used to reproduce the jobs with the issue, that would be helpful. Thanks.
Comment 9 Michael Hinton 2019-11-04 12:42:44 MST
And if you see any currently-running jobs with this problem, an `scontrol show job <job_id>` would be nice as well.
Comment 10 Greg Wickham 2019-11-04 13:23:19 MST
NodeName=gpu208-[02,06,10,14,18] Gres=gpu:v100:8 Feature=dragon,ibex2018,nolmem,cpu_intel_platinum_8260,intel,gpu,intel_gpu,local_200G,local_400G,local_500G,gpu_v100,v100,nossh RealMemory=763000 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 Weight=32000


I don't know the commands used. I believe the user did an "salloc" to obtain the resources and then used "srun" to use them.

I'll have a look tomorrow and see if I can find any active jobs exhibiting similar behaviour.
Comment 11 Michael Hinton 2019-11-04 14:42:27 MST
Good news: I was able to reproduce the bug!

*****************
Reproducer
*****************

slurm.conf:
************************************
NodeName=DEFAULT CPUs=48 Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 State=UNKNOWN RealMemory=7000
GresTypes=gpu
DebugFlags=gres
SlurmdParameters=config_overrides
AccountingStorageTRES=gres/gpu,gres/gpu:gtx1080ti,gres/gpu:p100,gres/gpu:p6000,gres/gpu:rtx2080ti,gres/gpu:tesla_k40m,gres/gpu:v100,cpu,node
NodeName=test1 NodeAddr=localhost Port=19052 Gres=gpu:v100:8


gres.conf:
************************************
Name=gpu Type=v100 File=/dev/tty[0-7]


Commands:
************************************
hintron@hintron:~/slurm/19.05/extra$ salloc --gres=gpu:v100:4
salloc: Granted job allocation 119

hintron@hintron:~/slurm/19.05/extra$ srun --gres=gpu:4 sleep 60 &                      
[1] 18663

hintron@hintron:~/slurm/19.05/extra$ sacct --format jobid,nodelist,alloctres%60 -j 119
      JobID        NodeList                                                    AllocTRES  
------------ ---------------        -----------------------------------------------------  
119                    test1   billing=1,cpu=1,gres/gpu:v100=4,gres/gpu=4,mem=150M,node=1  
119.0                  test1           cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=0,node=1  

hintron@hintron:~/slurm/19.05/extra$ scontrol show nodes  
NodeName=test1 Arch=x86_64 CoresPerSocket=24  
  CPUAlloc=1 CPUTot=48 CPULoad=0.65
  AvailableFeatures=(null)
  ActiveFeatures=(null)
  Gres=gpu:v100:8
  NodeAddr=localhost NodeHostName=test1 Port=19052 Version=19.05.3-2
  OS=Linux 5.3.0-19-generic #20-Ubuntu SMP Fri Oct 18 09:04:39 UTC 2019  
  RealMemory=7000 AllocMem=150 FreeMem=539 Sockets=2 Boards=1
  State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
  Partitions=debug  
  BootTime=2019-11-04T11:06:33 SlurmdStartTime=2019-11-04T12:49:55
  CfgTRES=cpu=48,mem=7000M,billing=48,gres/gpu=8,gres/gpu:v100=8
  AllocTRES=cpu=1,mem=150M,gres/gpu=4,gres/gpu:v100=4
  CapWatts=n/a
  CurrentWatts=0 AveWatts=0
  ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
 


However, specifying gpu:v100:4 explicitly for both salloc and srun makes it work as expected:
************************************

hintron@hintron:~/slurm/19.05/extra$ salloc --gres=gpu:v100:4
salloc: Granted job allocation 118

hintron@hintron:~/slurm/19.05/extra$ srun --gres=gpu:v100:4 sleep 60 &                      
[1] 18496

hintron@hintron:~/slurm/19.05/extra$ sacct --format jobid,nodelist,alloctres%60 -j 118
      JobID        NodeList                                                    AllocTRES  
------------ ---------------        -----------------------------------------------------  
118                    test1   billing=1,cpu=1,gres/gpu:v100=4,gres/gpu=4,mem=150M,node=1  
118.0                  test1                cpu=1,gres/gpu:v100=4,gres/gpu=4,mem=0,node=1  

hintron@hintron:~/slurm/19.05/extra$ scontrol show nodes                               
NodeName=test1 Arch=x86_64 CoresPerSocket=24  
  CPUAlloc=1 CPUTot=48 CPULoad=0.24
  AvailableFeatures=(null)
  ActiveFeatures=(null)
  Gres=gpu:v100:8
  NodeAddr=localhost NodeHostName=test1 Port=19052 Version=19.05.3-2
  OS=Linux 5.3.0-19-generic #20-Ubuntu SMP Fri Oct 18 09:04:39 UTC 2019  
  RealMemory=7000 AllocMem=150 FreeMem=586 Sockets=2 Boards=1
  State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
  Partitions=debug  
  BootTime=2019-11-04T11:06:33 SlurmdStartTime=2019-11-04T12:49:55
  CfgTRES=cpu=48,mem=7000M,billing=48,gres/gpu=8,gres/gpu:v100=8
  AllocTRES=cpu=1,mem=150M,gres/gpu=4,gres/gpu:v100=4
  CapWatts=n/a
  CurrentWatts=0 AveWatts=0
  ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


I did a few more tests, and here’s a summary of the GRES combinations for salloc and srun:
**********************************
salloc      | srun       | sacct
---------------------------------
gpu:4       | gpu:4      | gres/gpu:gtx1080ti=4
gpu:4       | gpu:v100:4 | Invalid GRES specification
gpu:v100:4  | gpu:4      | gres/gpu:gtx1080ti=4
gpu:v100:4  | gpu:v100:4 | gres/gpu:v100=4

So it looks like the current work-around is to instruct users to explicitly specify the GPU type (v100) for both salloc AND srun, where possible. 

I’ll investigate this some more and hopefully get back to you once I have a fix.

Thanks,
-Michael


P.S. In gres.conf, `Count` is completely redundant for GPUs, since the count is inferred from the `File` specification.
Comment 13 Michael Hinton 2019-11-04 17:30:27 MST
Hey Greg,

It looks like this is a known issue. From https://slurm.schedmd.com/slurm.conf.html#OPT_AccountingStorageTRES:

"If a job requests GPUs, but does not explicitly specify the GPU type, then its resource allocation will be accounted for as either "gres/gpu:tesla" or "gres/gpu:volta", although the accounting may not match the actual GPU type allocated to the job and the GPUs allocated to the job could be heterogeneous. In an environment containing various GPU types, use of a job_submit plugin may be desired in order to force jobs to explicitly specify some GPU type."

I still don't think this is ideal and hope to improve on this in 20.02 or later. But it looks like for now, using the job_submit plugin to alter any user's "srun --gres=gpu:4" to "srun --gres=gpu:v100:4" on nodes that only have a single type is your best bet to avoid this issue.
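
A minimal job_submit/lua rule along those lines might look like the following sketch. This is only an illustration, assuming a cluster where untyped GPU requests should default to v100; note that the job_desc field names vary across Slurm releases (e.g. `job_desc.gres` in 19.05 vs. `job_desc.tres_per_node` in later versions), so check the job_submit/lua API for your version:

```lua
-- Sketch of a job_submit/lua rule forcing a GPU type on untyped requests.
-- Assumes v100 is the only GPU type; adapt the field name to your release.
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.gres ~= nil then
        -- Rewrite an untyped "gpu:<N>" request to "gpu:v100:<N>"
        local count = string.match(job_desc.gres, "^gpu:(%d+)$")
        if count ~= nil then
            job_desc.gres = "gpu:v100:" .. count
        end
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```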

Another thing you can do that might help mitigate the issue is to add an additional 'meta' TRES GPU storage type to AccountingStorageTRES, like `gres/gpu:a_typeless`. Whatever it is named, it just needs to sort alphabetically before `gtx1080ti` (`a` comes before `g`). This is because when `srun --gres=gpu:4` is specified inside the allocation, since the type is null, Slurm currently just picks the first GPU type it finds in a sorted list taken from AccountingStorageTRES.

It's clunky, but at least you will know what's going on and that the step didn't specify a type (so it could be any of the GPUs within the allocation).
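
Concretely, that mitigation might look like this in slurm.conf (adapting the AccountingStorageTRES line from the reproducer above; the name `a_typeless` is an arbitrary placeholder, it just has to sort first):

```
# Sketch: same TRES list as before, plus a 'meta' GPU type that sorts
# before gtx1080ti so untyped step requests are accounted against it.
AccountingStorageTRES=gres/gpu,gres/gpu:a_typeless,gres/gpu:gtx1080ti,gres/gpu:p100,gres/gpu:p6000,gres/gpu:rtx2080ti,gres/gpu:tesla_k40m,gres/gpu:v100,cpu,node
```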

For example:

$ sacct --format jobid,nodelist,alloctres%60 -j 140
       JobID        NodeList                                                    AllocTRES 
------------ ---------------        ----------------------------------------------------- 
140                    test1   billing=1,cpu=1,gres/gpu:v100=4,gres/gpu=4,mem=150M,node=1 
140.0                  test1          cpu=1,gres/gpu:a_typeless=4,gres/gpu=4,mem=0,node=1 

Now this is less confusing, because Slurm doesn't tell us that we have gtx1080ti GPUs. In this case, we can infer that the 4 GPUs were v100s, since it's confined to the job's allocation of v100s.

For allocations with multiple GPU types, this could still mean that you won't know exactly which GPU types were on the step, but it at least won't be misleading.

Thanks,
-Michael
Comment 14 Greg Wickham 2019-11-04 22:08:55 MST
Hi Michael,

Thanks for the comprehensive response.

We currently have 5 different GPU types in use, so I doubt there is an "easy" way to fix this during job submission (the target host isn't known at submission time, so there's no way to generalize the GPU type).

We'll just have to deal with the step GPU TRES (probably) being incorrect.

Please close the ticket (unless you have something more to add).

cheers,

   -Greg
Comment 15 Michael Hinton 2019-11-05 13:55:21 MST
Ok Greg, thanks. I'll keep you posted if there are any new developments in this area.

-Michael
Comment 16 Scott Hilton 2020-06-25 11:24:57 MDT
*** Ticket 9275 has been marked as a duplicate of this ticket. ***
Comment 17 Scott Hilton 2020-10-05 16:22:00 MDT
*** Ticket 9932 has been marked as a duplicate of this ticket. ***
Comment 18 Scott Hilton 2020-10-05 16:28:21 MDT
*** Ticket 9543 has been marked as a duplicate of this ticket. ***
Comment 19 Ahmed Essam ElMazaty 2020-12-03 00:54:52 MST
Dear Michael,
Any estimation about when this issue will be solved?
We've upgraded recently to 20.11 and the issue is still there.
Thanks,
Ahmed
Comment 20 Michael Hinton 2020-12-03 11:24:41 MST
Hi Ahmed,

We are looking at a comprehensive fix to this problem for 21.08, but can't guarantee a timeline right now.

-Michael
Comment 21 Michael Hinton 2021-07-15 13:03:30 MDT
Hello Greg and Ahmed,

This issue should now be resolved in the upcoming 21.08 release in the following commits: https://github.com/SchedMD/slurm/compare/8aab06e4a120...0ff1668043e3.

Thanks!
-Michael