Ticket 13961 - odd resource allocation issue
Summary: odd resource allocation issue
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 21.08.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Scott Hilton
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-04-28 14:50 MDT by Todd Merritt
Modified: 2022-05-06 13:05 MDT

See Also:
Site: U of AZ


Attachments
Slurm config (5.26 KB, text/plain)
2022-05-02 06:41 MDT, Todd Merritt

Description Todd Merritt 2022-04-28 14:50:25 MDT
We had an odd resource allocation issue that seems to have circumvented our job limits. The job requested gpus=25. We used to have keplers and pascals, but now we only have pascal gpus in the system and there's a limit of 10 gpus on the partition. The job was allocated 25 pascal gpus though:

Job ID               : 690774              
State                : RUNNING
QOS                  : part_qos_standard   
Requested Resources  : billing=700,cpu=700,gres/gpu:kepler=25,mem=4200G,node=25
Allocated Resources  : billing=700,cpu=700,gres/gpu:pascal=25,mem=4200G,node=25

root@gilliam:~ #  sacctmgr --parsable2 show qos part_qos_standard format=GrpTRES
GrpTRES
gres/gpu:pascal=10

I'm not clear on why it set the gres/gpu type to kepler when no type was requested. The only reference to kepler in the config is:

AccountingStorageTRES=gres/gpu:volta,gres/gpu:pascal,gres/gpu:kepler

I plan to remove that entry from AccountingStorageTRES but thought I'd wait until after I got your feedback on the issue. Thanks!
Comment 1 Jason Booth 2022-04-28 15:04:44 MDT
Do the slurmds report a different GRES when you run the following?

> $ slurmd -G

How are you currently managing configurations throughout your cluster (NFS, configless, etc.)?
Comment 2 Todd Merritt 2022-04-29 06:38:36 MDT
Hi Jason,

We're running in a configless setup. Running that command on one of the nodes, I get no output:

[root@i0n1 ~]# slurmd -G
[root@i0n1 ~]#

Thanks,
Todd
Comment 3 Scott Hilton 2022-04-29 15:01:59 MDT
Can I get the output of:
>sacct -Pl -j 690774

Can I also get your slurm.conf?
Comment 4 Todd Merritt 2022-05-02 06:39:19 MDT
Sure thing.

# sacct -Pl -j 690774
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot
690774|690774|run_wav|standard||||||||||||||||||700|1-22:45:05|COMPLETED|0:0||Unknown|Unknown|Unknown|4200G|0|||||||||billing=700,cpu=700,gres/gpu:kepler=25,mem=4200G,node=25|billing=700,cpu=700,gres/gpu:pascal=25,mem=4200G,node=25|||||||||||||
690774.batch|690774.batch|batch||313487760K|i16n0|0|313487760K|28671772K|i16n0|0|28671772K|304|i16n0|0|304|53-15:00:06|i16n0|0|53-15:00:06|1|28|1-22:45:05|COMPLETED|0:0|11K|0|0|0||0|754.96M|i16n0|0|754.96M|194632.63M|i16n0|0|194632.63M||cpu=28,gres/gpu:pascal=1,mem=168G,node=1|cpu=53-15:00:06,energy=0,fs/disk=791635013,mem=28671772K,pages=304,vmem=313487760K|cpu=53-15:00:06,energy=0,fs/disk=791635013,mem=28671772K,pages=304,vmem=313487760K|cpu=i16n0,energy=i16n0,fs/disk=i16n0,mem=i16n0,pages=i16n0,vmem=i16n0|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=53-15:00:06,energy=0,fs/disk=791635013,mem=28671772K,pages=304,vmem=313487760K|cpu=i16n0,energy=i16n0,fs/disk=i16n0,mem=i16n0,pages=i16n0,vmem=i16n0|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=53-15:00:06,energy=0,fs/disk=791635013,mem=28671772K,pages=304,vmem=313487760K|energy=0,fs/disk=204087109074|energy=i16n0,fs/disk=i16n0|fs/disk=0|energy=0,fs/disk=204087109074|energy=0,fs/disk=204087109074
690774.extern|690774.extern|extern||146524K|i16n20|20|146524K|1036K|i16n15|15|999915|0|i16n20|20|0|00:00:00|i16n20|20|00:00:00|25|700|1-22:45:10|COMPLETED|0:0|59.83G|0|0|0||0|0.00M|i16n20|20|0.00M|0|i16n20|20|0||billing=700,cpu=700,gres/gpu:pascal=25,mem=4200G,node=25|cpu=00:00:00,energy=0,fs/disk=2012,mem=999915,pages=0,vmem=146524K|cpu=00:00:00,energy=0,fs/disk=2012,mem=1036K,pages=0,vmem=146524K|cpu=i16n20,energy=i16n20,fs/disk=i16n20,mem=i16n15,pages=i16n20,vmem=i16n20|cpu=20,fs/disk=20,mem=15,pages=20,vmem=20|cpu=00:00:00,energy=0,fs/disk=2012,mem=836K,pages=0,vmem=146524K|cpu=i16n20,energy=i16n20,fs/disk=i16n20,mem=i16n20,pages=i16n20,vmem=i16n20|cpu=20,fs/disk=20,mem=20,pages=20,vmem=20|cpu=00:00:00,energy=0,fs/disk=50300,mem=24412K,pages=0,vmem=3663100K|energy=0,fs/disk=0|energy=i16n20,fs/disk=i16n20|fs/disk=20|energy=0,fs/disk=0|energy=0,fs/disk=0
690774.0|690774.0|orted||313404468K|i16n14|13|319716945578|29211664K|i16n17|16|28029274K|349|i16n8|7|240|53-10:19:29|i16n23|22|53-17:25:06|24|672|1-22:44:59|COMPLETED|0:0|144K|Unknown|Unknown|Unknown||0|5.26M|i16n3|2|5.25M|23535.27M|i16n17|16|5888.86M||cpu=672,gres/gpu:pascal=24,mem=4032G,node=24|cpu=53-17:25:06,energy=0,fs/disk=5500585,mem=28029274K,pages=240,vmem=319716945578|cpu=53-22:56:25,energy=0,fs/disk=5518298,mem=29211664K,pages=349,vmem=313404468K|cpu=i16n3,energy=i16n15,fs/disk=i16n3,mem=i16n17,pages=i16n8,vmem=i16n14|cpu=2,fs/disk=2,mem=16,pages=7,vmem=13|cpu=53-10:19:29,energy=0,fs/disk=5422759,mem=26884524K,pages=130,vmem=311087400K|cpu=i16n23,energy=i16n15,fs/disk=i16n23,mem=i18n0,pages=i16n16,vmem=i18n0|cpu=22,fs/disk=22,mem=23,pages=15,vmem=23|cpu=1289-10:02:34,energy=0,fs/disk=132014055,mem=672702576K,pages=5776,vmem=7493365912K|energy=0,fs/disk=24678517825|energy=i16n15,fs/disk=i16n17|fs/disk=16|energy=0,fs/disk=6174912701|energy=0,fs/disk=148197904828
Comment 5 Todd Merritt 2022-05-02 06:41:22 MDT
Created attachment 24758 [details]
Slurm config
Comment 6 Scott Hilton 2022-05-02 11:08:29 MDT
Todd,

Are you able to reproduce this? Do you know what the submit line looked like, or which sbatch options were used?

Is the job, or a job like it, still in the system? If so, send me the output of:
>scontrol show job <jobid>

-Scott
Comment 7 Todd Merritt 2022-05-02 12:35:09 MDT
Hi Scott, yes, this is reproducible:

#!/bin/bash
#SBATCH --account=hpcteam
#SBATCH --partition=standard
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1
#SBATCH --nodes=20
#SBATCH --ntasks=20

sleep 100
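
For reference, the script was presumably submitted as below (the name gpu-fail.sh comes from the scontrol output that follows):

>sbatch gpu-fail.sh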

(ocelote) tmerritt@junonia:~/ocelote-test $ scontrol show job 721123
JobId=721123 JobName=gpu-fail.sh
   UserId=tmerritt(7862) GroupId=hpcteam(30001) MCS_label=N/A
   Priority=5100 Nice=0 Account=hpcteam QOS=part_qos_standard
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2022-05-02T10:25:50 EligibleTime=2022-05-02T10:25:50
   AccrueTime=2022-05-02T10:25:50
   StartTime=2022-05-02T19:41:08 EndTime=2022-05-02T20:41:08 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-05-02T11:34:24 Scheduler=Main
   Partition=standard AllocNode:Sid=junonia:12185
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=i16n[0,2-10,12,16,19,21,23],i18n[10,18-20,23]
   NumNodes=20-20 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=20,mem=120G,node=20,billing=20,gres/gpu:kepler=20
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=6G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/u11/tmerritt/ocelote-test/gpu-fail.sh
   WorkDir=/home/u11/tmerritt/ocelote-test
   StdErr=/home/u11/tmerritt/ocelote-test/slurm-721123.out
   StdIn=/dev/null
   StdOut=/home/u11/tmerritt/ocelote-test/slurm-721123.out
   Power=
   TresPerNode=gres:gpu:1
   
The job hasn't started yet, though, due to some other large jobs using all of our GPUs.
Comment 8 Scott Hilton 2022-05-02 13:28:45 MDT
Todd,

The requested TRES says gpu:kepler even though, under the hood, Slurm is just requesting any type of GPU. Because the non-typed gres/gpu doesn't exist in your AccountingStorageTRES, accounting falls back to the first GPU type alphabetically (kepler). This is so you can still see that some GPUs were requested.

To avoid this ambiguity, you could add gres/gpu to AccountingStorageTRES:
>AccountingStorageTRES=gres/gpu:volta,gres/gpu:pascal,gres/gpu:kepler,gres/gpu

Or, if you only want a placeholder to show up when no type was specified, you could add something like gres/gpu:anytype instead. That would become the default because it sorts first alphabetically.
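
For example, the full entry might then look like this (a sketch; gres/gpu:anytype is an illustrative placeholder, not a type defined on any node):
>AccountingStorageTRES=gres/gpu:anytype,gres/gpu:volta,gres/gpu:pascal,gres/gpu:kepler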

The actual issue of the limits being ignored looks like a bug to me. I am able to reproduce it and will look into it.

-Scott
Comment 9 Todd Merritt 2022-05-02 13:37:16 MDT
Thanks Scott. I was reviewing the documentation after opening this ticket and saw that I was missing the generic gpu resource, but I thought I'd hold off on fiddling with the configuration until the ticket was resolved. I'll go make that update now; it will probably take care of this in the short term. Thanks!
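
Once the generic gres/gpu is tracked, a quick way to confirm how an untyped request gets accounted would be something like (a sketch; <jobid> is whatever test job you submit):
>sacct -j <jobid> --parsable2 -o JobID,ReqTRES,AllocTRES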
Comment 11 Scott Hilton 2022-05-05 15:58:27 MDT
Todd,

The limits issue is due to a design limitation. See:
https://slurm.schedmd.com/resource_limits.html#gres_limits

This is specific to limits with gres subtypes. Setting the limit to gres/gpu=10 rather than gres/gpu:pascal=10 would work just fine.
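
For reference, moving the existing limit to the generic TRES could look something like this (a sketch, using the QOS name from this ticket; requires Slurm admin privileges):
>sacctmgr modify qos part_qos_standard set GrpTRES=gres/gpu=10
>sacctmgr --parsable2 show qos part_qos_standard format=GrpTRES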

-Scott
Comment 12 Todd Merritt 2022-05-06 06:49:57 MDT
Thanks Scott,

I guess that should be fine now. It would have been a problem for us in the past, because we had PIs buy in with a particular type of GPU, and we wanted to make sure users stayed on the type they were entitled to. It sounds like this would be a difficult thing to change, but I just wanted to point out that there is a legitimate use case for applying the limit at the more granular, typed level.

I'll go ahead and update our limits, though, to apply the restriction at the parent gres/gpu type.

Todd
Comment 13 Scott Hilton 2022-05-06 13:05:52 MDT
Glad I could help. Thanks for letting me know about your setup. If you have more questions about this issue in the future, let us know.

-Scott