| Summary: | AllocTRES of step doesn't match AllocTRES of job | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Greg Wickham <greg.wickham> |
| Component: | Accounting | Assignee: | Director of Support <support> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | ahmed.mazaty, tdockendorf |
| Version: | 19.05.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=7827 https://bugs.schedmd.com/show_bug.cgi?id=8054 | | |
| Site: | KAUST | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 21.08.0 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | slurm.conf, gres.conf | | |
Hi Greg,

Indeed, AllocTRES in `scontrol show nodes` can be messed up when there is more than one GPU type. Differing `Cores=` definitions for each of the GPUs in gres.conf can also increase the chances of this bug showing up. Although AllocTRES in `scontrol show nodes` is sometimes incorrect, I haven't seen any evidence yet that jobs get scheduled with incorrect resources, or that AllocTRES in sacct is wrong.

I actually have a fix pending review in a private internal bug for this very issue (bug 7827). I'll let you know when that lands. In the meantime, could you give me your current slurm.conf and gres.conf, so I can reproduce your issue and make sure my patch solves it? I'm also assuming that sacct is showing the correct GRES request for the job, but if not, I'll need that as well.

Thanks,
-Michael

Created attachment 12182 [details]
slurm.conf
Created attachment 12183 [details]
gres.conf
Hi Michael,

We're not seeing what you describe - the output of 'scontrol show nodes' is correct. The user requested 'v100' GPUs and a suitable node was chosen (gpu208-02). The job steps ran on the selected node (with v100 GPUs, which is correct). The issue is that the "step" accounting is showing incorrect AllocTRES - the node has v100 GPUs but the step AllocTRES indicates 'gtx1080ti'.

FWIW, the user started with an 'salloc' and then used 'srun' to run steps within the allocation.

Files uploaded.

-Greg

(In reply to Greg Wickham from comment #5)
> We're not seeing what you describe - the output of 'scontrol show nodes' is
> correct.

Ok, thanks. I'll get to the bottom of this and get back to you.

-Michael

Sorry Greg, could you also give me the node definition for gpu208-02? I didn't realize it would be in a different file. I want to double-check that there isn't anything fishy going on there. Thanks.

Also, if you know the commands used to reproduce the jobs with the issue, that would be helpful. Thanks.

And if you see any currently-running jobs with this problem, an `scontrol show job <job_id>` would be nice as well.

NodeName=gpu208-[02,06,10,14,18] Gres=gpu:v100:8 Feature=dragon,ibex2018,nolmem,cpu_intel_platinum_8260,intel,gpu,intel_gpu,local_200G,local_400G,local_500G,gpu_v100,v100,nossh RealMemory=763000 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 Weight=32000

I don't know the commands used. I believe the user did an "salloc" to obtain the resources and then used "srun" to use them. I'll have a look tomorrow and see if I can find any active jobs exhibiting similar behaviour.

Good news: I was able to reproduce the bug!
*****************
Reproducer
*****************
slurm.conf:
************************************
NodeName=DEFAULT CPUs=48 Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 State=UNKNOWN RealMemory=7000
GresTypes=gpu
DebugFlags=gres
SlurmdParameters=config_overrides
AccountingStorageTRES=gres/gpu,gres/gpu:gtx1080ti,gres/gpu:p100,gres/gpu:p6000,gres/gpu:rtx2080ti,gres/gpu:tesla_k40m,gres/gpu:v100,cpu,node
NodeName=test1 NodeAddr=localhost Port=19052 Gres=gpu:v100:8
gres.conf:
************************************
Name=gpu Type=v100 File=/dev/tty[0-7]
Commands:
************************************
hintron@hintron:~/slurm/19.05/extra$ salloc --gres=gpu:v100:4
salloc: Granted job allocation 119
hintron@hintron:~/slurm/19.05/extra$ srun --gres=gpu:4 sleep 60 &
[1] 18663
hintron@hintron:~/slurm/19.05/extra$ sacct --format jobid,nodelist,alloctres%60 -j 119
JobID NodeList AllocTRES
------------ --------------- -----------------------------------------------------
119 test1 billing=1,cpu=1,gres/gpu:v100=4,gres/gpu=4,mem=150M,node=1
119.0 test1 cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=0,node=1
hintron@hintron:~/slurm/19.05/extra$ scontrol show nodes
NodeName=test1 Arch=x86_64 CoresPerSocket=24
CPUAlloc=1 CPUTot=48 CPULoad=0.65
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:v100:8
NodeAddr=localhost NodeHostName=test1 Port=19052 Version=19.05.3-2
OS=Linux 5.3.0-19-generic #20-Ubuntu SMP Fri Oct 18 09:04:39 UTC 2019
RealMemory=7000 AllocMem=150 FreeMem=539 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2019-11-04T11:06:33 SlurmdStartTime=2019-11-04T12:49:55
CfgTRES=cpu=48,mem=7000M,billing=48,gres/gpu=8,gres/gpu:v100=8
AllocTRES=cpu=1,mem=150M,gres/gpu=4,gres/gpu:v100=4
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
However, specifying gpu:v100:4 explicitly for both salloc and srun makes it work as expected:
************************************
hintron@hintron:~/slurm/19.05/extra$ salloc --gres=gpu:v100:4
salloc: Granted job allocation 118
hintron@hintron:~/slurm/19.05/extra$ srun --gres=gpu:v100:4 sleep 60 &
[1] 18496
hintron@hintron:~/slurm/19.05/extra$ sacct --format jobid,nodelist,alloctres%60 -j 118
JobID NodeList AllocTRES
------------ --------------- -----------------------------------------------------
118 test1 billing=1,cpu=1,gres/gpu:v100=4,gres/gpu=4,mem=150M,node=1
118.0 test1 cpu=1,gres/gpu:v100=4,gres/gpu=4,mem=0,node=1
hintron@hintron:~/slurm/19.05/extra$ scontrol show nodes
NodeName=test1 Arch=x86_64 CoresPerSocket=24
CPUAlloc=1 CPUTot=48 CPULoad=0.24
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:v100:8
NodeAddr=localhost NodeHostName=test1 Port=19052 Version=19.05.3-2
OS=Linux 5.3.0-19-generic #20-Ubuntu SMP Fri Oct 18 09:04:39 UTC 2019
RealMemory=7000 AllocMem=150 FreeMem=586 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=debug
BootTime=2019-11-04T11:06:33 SlurmdStartTime=2019-11-04T12:49:55
CfgTRES=cpu=48,mem=7000M,billing=48,gres/gpu=8,gres/gpu:v100=8
AllocTRES=cpu=1,mem=150M,gres/gpu=4,gres/gpu:v100=4
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
I did a few more tests, and here’s a summary of the GRES combinations for salloc and srun:
**********************************
salloc | srun | sacct
---------------------------------
gpu:4 | gpu:4 | gres/gpu:gtx1080ti=4
gpu:4 | gpu:v100:4 | Invalid GRES specification
gpu:v100:4 | gpu:4 | gres/gpu:gtx1080ti=4
gpu:v100:4 | gpu:v100:4 | gres/gpu:v100=4
So it looks like the current work-around is to instruct users to explicitly specify the GPU type (v100) for both salloc AND srun, where possible.
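The pattern in the table above can be sketched as a toy model (this is an illustration of the observed behaviour, NOT Slurm source code): when the step request omits the GPU type, the accounted type appears to be the first typed `gres/gpu:<type>` entry, alphabetically, from the configured AccountingStorageTRES list.

```python
# Toy model (NOT Slurm's actual implementation) of how an untyped step
# GRES request can be accounted as "gtx1080ti" on a v100-only node.

def accounted_gpu_type(requested_type, accounting_storage_tres):
    """Return the GPU type recorded in a step's AllocTRES."""
    if requested_type is not None:
        # Explicit type (e.g. --gres=gpu:v100:4) is accounted as-is.
        return requested_type
    # No type given (e.g. --gres=gpu:4): pick the first typed GPU TRES
    # from the sorted accounting list.
    gpu_types = sorted(
        t.split(":", 1)[1]
        for t in accounting_storage_tres
        if t.startswith("gres/gpu:")
    )
    return gpu_types[0] if gpu_types else None

# AccountingStorageTRES from the reproducer's slurm.conf, split on commas.
tres = ["gres/gpu", "gres/gpu:gtx1080ti", "gres/gpu:p100", "gres/gpu:p6000",
        "gres/gpu:rtx2080ti", "gres/gpu:tesla_k40m", "gres/gpu:v100",
        "cpu", "node"]

print(accounted_gpu_type(None, tres))    # gtx1080ti (misleading on a v100 node)
print(accounted_gpu_type("v100", tres))  # v100
```

This also explains the fourth row of the table: typing the request on both salloc and srun bypasses the alphabetical fallback entirely.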
I’ll investigate this some more and hopefully get back to you once I have a fix.
Thanks,
-Michael
P.S. In gres.conf, `Count` is completely redundant for GPUs, since the count is inferred from the `File` specification.
Hey Greg,

It looks like this is a known issue. From https://slurm.schedmd.com/slurm.conf.html#OPT_AccountingStorageTRES:

"If a job requests GPUs, but does not explicitly specify the GPU type, then its resource allocation will be accounted for as either "gres/gpu:tesla" or "gres/gpu:volta", although the accounting may not match the actual GPU type allocated to the job and the GPUs allocated to the job could be heterogeneous. In an environment containing various GPU types, use of a job_submit plugin may be desired in order to force jobs to explicitly specify some GPU type."

I still don't think this is ideal and hope to improve on this in 20.02 or later. But it looks like for now, using a job_submit plugin to alter any user's "srun --gres=gpu:4" to "srun --gres=gpu:v100:4" on nodes that only have a single type is your best bet to avoid this issue.

Another thing you can do that might help mitigate the issue is to add an additional 'meta' TRES GPU storage type to AccountingStorageTRES, like `gres/gpu:a_typeless`. Whatever it is named, it just needs to sort higher in the alphabet than `gtx1080ti` (`a` comes before `g`). This is because when `srun --gres=gpu:4` is specified inside the allocation, since the type is null, Slurm currently just picks the first GPU type it finds in a sorted list taken from AccountingStorageTRES. It's clunky, but at least you will know what's going on and that the step didn't specify a type (so it could be any of the GPUs within the allocation). For example:

$ sacct --format jobid,nodelist,alloctres%60 -j 140
JobID NodeList AllocTRES
------------ --------------- -----------------------------------------------------
140 test1 billing=1,cpu=1,gres/gpu:v100=4,gres/gpu=4,mem=150M,node=1
140.0 test1 cpu=1,gres/gpu:a_typeless=4,gres/gpu=4,mem=0,node=1

Now this is less confusing, because Slurm doesn't tell us that we have gtx1080ti GPUs. In this case, we can infer that the 4 GPUs were v100s, since the step is confined to the job's allocation of v100s. For allocations with multiple GPU types, this could still mean that you won't know exactly which GPU types were on the step, but it at least won't be misleading.

Thanks,
-Michael

Hi Michael,

Thanks for the comprehensive response. We currently have 5 different GPU types in use, so I doubt there is an "easy" way to fix this during job submission (the target host isn't available during job submission, so there's no way to generalize the GPU type). We'll just have to deal with the step GPU TRES (probably) being incorrect.

Please close the ticket (unless you have something more to add).

cheers, -Greg

Ok Greg, thanks. I'll keep you posted if there are any new developments in this area.

-Michael

*** Ticket 9275 has been marked as a duplicate of this ticket. ***

*** Ticket 9932 has been marked as a duplicate of this ticket. ***

*** Ticket 9543 has been marked as a duplicate of this ticket. ***

Dear Michael,

Any estimate of when this issue will be solved? We recently upgraded to 20.11 and the issue is still there.

Thanks, Ahmed

Hi Ahmed,

We are looking at a comprehensive fix to this problem for 21.08, but can't guarantee a timeline right now.

-Michael

Hello Greg and Ahmed,

This issue should now be resolved in the upcoming 21.08 release in the following commits: https://github.com/SchedMD/slurm/compare/8aab06e4a120...0ff1668043e3.

Thanks!
-Michael
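Based on Michael's suggestion, the mitigation would look something like this in slurm.conf (the name `a_typeless` is arbitrary; it only needs to sort alphabetically before the real GPU type names so it is picked first for untyped step requests):

```
AccountingStorageTRES=gres/gpu,gres/gpu:a_typeless,gres/gpu:gtx1080ti,gres/gpu:p100,gres/gpu:p6000,gres/gpu:rtx2080ti,gres/gpu:tesla_k40m,gres/gpu:v100,cpu,node
```

Untyped step requests would then be accounted as `gres/gpu:a_typeless`, which at least signals that the type is unknown rather than naming the wrong hardware.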
$ sacct -j 6647867 --format jobid,nodelist,alloctres%60
JobID NodeList AllocTRES
------------ --------------- -----------------------------------------------------
6647867 gpu208-02 billing=9,cpu=9,gres/gpu:v100=4,gres/gpu=4,mem=128G,node=1
6647867.ext+ gpu208-02 billing=9,cpu=9,gres/gpu:v100=4,gres/gpu=4,mem=128G,node=1
6647867.0 gpu208-02 cpu=9,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=128G,node=1
6647867.1 gpu208-02 cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=128G,node=1
6647867.2 gpu208-02 cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=128G,node=1
6647867.3 gpu208-02 cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=128G,node=1
6647867.4 gpu208-02 cpu=9,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=128G,node=1
6647867.5 gpu208-02 cpu=9,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=128G,node=1
6647867.6 gpu208-02 cpu=9,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=128G,node=1

The GRES (shown above) that was allocated to the steps according to sacct doesn't match either the node (which only has gpu:v100:8) or the job itself.

$ scontrol show -o node=gpu208-02
NodeName=gpu208-02 Arch=x86_64 CoresPerSocket=24 CPUAlloc=27 CPUTot=48 CPULoad=14.91 AvailableFeatures=dragon,ibex2018,nolmem,cpu_intel_platinum_8260,intel,gpu,intel_gpu,local_200G,local_400G,local_500G,gpu_v100,v100,nossh ActiveFeatures=dragon,ibex2018,nolmem,cpu_intel_platinum_8260,intel,gpu,intel_gpu,local_200G,local_400G,local_500G,gpu_v100,v100,nossh Gres=gpu:v100:8 NodeAddr=gpu208-02 NodeHostName=gpu208-02 OS=Linux 3.10.0-957.12.1.el7.x86_64 #1 SMP Mon Apr 29 14:59:59 UTC 2019 RealMemory=763000 AllocMem=213840 FreeMem=751573 Sockets=2 Boards=1 State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=32000 Owner=N/A MCS_label=N/A Partitions=batch BootTime=2019-10-30T15:15:59 SlurmdStartTime=2019-10-30T15:25:11 CfgTRES=cpu=48,mem=763000M,billing=48,gres/gpu=8,gres/gpu:v100=8 AllocTRES=cpu=27,mem=213840M,gres/gpu=3,gres/gpu:v100=3 CapWatts=n/a CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s