| Summary: | partition node assignment | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Anthony DelSorbo <anthony.delsorbo> |
| Component: | Configuration | Assignee: | Gavin D. Howard <gavin> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | aradeva, jbooth |
| Version: | 19.05.3 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NOAA | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | NESCC | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Anthony DelSorbo
2020-02-07 11:30:23 MST
Tony,

There is an option that might work: the `Weight` option for nodes. This is what the docs say about it:

> Weight
>
> The priority of the node for scheduling purposes. All things being equal,
> jobs will be allocated the nodes with the lowest weight which satisfies their
> requirements. For example, a heterogeneous collection of nodes might be
> placed into a single partition for greater system utilization, responsiveness
> and capability. It would be preferable to allocate smaller memory nodes
> rather than larger memory nodes if either will satisfy a job's requirements.
> The units of weight are arbitrary, but larger weights should be assigned to
> nodes with more processors, memory, disk space, higher processor speed, etc.
> Note that if a job allocation request can not be satisfied using the nodes
> with the lowest weight, the set of nodes with the next lowest weight is added
> to the set of nodes under consideration for use (repeat as needed for higher
> weight values). If you absolutely want to minimize the number of higher
> weight nodes allocated to a job (at a cost of higher scheduling overhead),
> give each node a distinct Weight value and they will be added to the pool of
> nodes being considered for scheduling individually. The default value is 1.

The key part is this:

> All things being equal, jobs will be allocated the nodes with the lowest
> weight which satisfies their requirements.

This means that your last-resort nodes should have their weight set *higher* than the weight of all other nodes. This will cause them to be considered last.

I hope that all makes sense. Please send any further questions you have, and I will do my best to answer them.

Gavin,

Thanks for the reply. I had written a reply of my own to yours last week, but it turns out I didn't click on "Save Changes."
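As a concrete illustration of the `Weight` advice above, a minimal `slurm.conf` sketch (the node names follow this ticket's cluster, but the weight values are illustrative assumptions, not values prescribed in the ticket):

```
# Standard compute nodes keep the default Weight=1, so the scheduler
# prefers them whenever they can satisfy a job's requirements.
NodeName=h1c[01-52] CPUs=40 CoresPerSocket=20 RealMemory=95000

# Last-resort large-memory nodes get a higher weight; the scheduler only
# adds them to the candidate pool after the Weight=1 set cannot satisfy
# the request.
NodeName=h[32-36]m[01-52] CPUs=40 CoresPerSocket=20 RealMemory=384000 Weight=2
```

Per the quoted docs, giving each node a distinct weight would minimize the number of high-weight nodes allocated, at the cost of extra scheduling overhead; a single shared higher weight, as sketched here, is the simpler form.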
Unfortunately, I'm now in maintenance mode, so I'll forgo those original follow-up questions, as I have an urgent question I need to address today.

The nodes I'm adding have more memory than the current compute nodes. I am adding those nodes to both the default partition (hera) and the big-memory partition (bigmem). But today I specify:

hera: DefMemPerCpu=2300
bigmem: DefMemPerCpu=9500

So, what's the best practice here? Should I move this setting to the NodeName definition line, such as:

NodeName=h[32-36]m[01-52] CPUs=40 CoresPerSocket=20 RealMemory=384000 Weight=2 DefMemPerCpu=9500

and remove the settings from the partitions? Or is there a better way? What are the ramifications of the proper solution?

I hope you don't mind, but I've increased this ticket's priority for the time being.

Best,

Tony.

Tony,

It sounds like your increase in the severity is justified. I will get right on looking into your new question.

Tony,

Do the nodes you are adding have more memory than any other nodes? Or do they have less memory than, say, the rest of the nodes in bigmem?

(In reply to Gavin D. Howard from comment #4)
> Tony,
>
> Do the nodes you are adding have more memory than any other nodes? Or do
> they have less memory than, say, the rest of the nodes in bigmem?
They have the same amount as the bigmem nodes, and clearly more than the nodes in the hera partition:

NodeName=h1c[01-52],h[2-5]c[01-56],h6c[01-57],h8c[01-56],h9c[01-54],h10c[01-57],h[11-12]c[01-56],h[13,14]c[01-52],h[15-24]c[01-56],h[25]c[01-52] CPUs=40 CoresPerSocket=20 RealMemory=95000
NodeName=h1m[01-04],h13m[01-04] CPUs=40 CoresPerSocket=20 RealMemory=384000
NodeName=h[32-36]m[01-52] CPUs=40 CoresPerSocket=20 RealMemory=384000 Weight=2

and here are the partition defs:

PartitionName=hera \
  Nodes=h1c[01-52],h[2-5]c[01-56],h6c[01-57],h8c[01-56],h9c[01-54],h10c[01-57],h[11-12]c[01-56],h[13,14]c[01-52],h[15-24]c[01-56],h[25]c[01-52],h[32-36]m[01-52] \  # All compute nodes parallel only
  Default=yes \
  DefMemPerCpu=2300 \
  AllowQos=windfall,batch,debug,novel,urgent,admin

PartitionName=bigmem Nodes=h1m[01-04],h13m[01-04],h[32-36]m[01-52] \
  TRESBillingWeights=cpu=1 \
  DefMemPerCpu=9500 \
  AllowQos=windfall,batch,debug,urgent,admin

Tony,

With that info, I don't think you need to do anything special, as long as you set the weight of the nodes as I suggested in comment #1.

The reason is that DefMemPerCPU is meant to prevent oversubscription. It should be set to the lowest amount of memory per CPU for any node in the partition to prevent such oversubscription. (Slurm will, in fact, allow a job to use more memory if it's available, but if some of the memory is swap, that could cause paging.)

However, in this case, since the nodes you are worried about have at least as much memory as any other node in the cluster, you don't have to worry about them being oversubscribed if, for example, a job is submitted to bigmem with its DefMemPerCPU setting, because that DefMemPerCPU setting should not be higher than the amount of memory those nodes have available.

Does this all make sense?

(In reply to Gavin D. Howard from comment #6)
...
> Does this all make sense?

Sure - that makes sense - thanks, Gavin. However, I'm trying to anticipate the customer's questions.
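Since DefMemPerCpu only supplies a default, a user who needs more than the partition default can request memory explicitly at submission time. A hedged sketch using standard sbatch options (the script name `job.sh` is hypothetical):

```
# Job in the hera partition overriding the 2300 MB-per-CPU default:
sbatch --partition=hera --ntasks=40 --mem-per-cpu=9500 job.sh

# Alternatively, request total memory per node instead
# (--mem and --mem-per-cpu are mutually exclusive):
sbatch --partition=hera --ntasks=40 --mem=380000 job.sh
```

Either way, the explicit request replaces the partition's DefMemPerCpu for that job; the default only applies when the job specifies no memory at all.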
So please bear with me as I pick your collective brains on this issue.

If a user submits a job to the hera partition and doesn't specify a mem-per-CPU amount, they will get the default of 2300. So, if the larger nodes get assigned to that job, the memory there will also be 2300 per CPU.

So the question is: suppose the user purposely wants a mix of both node types. Can they specify one memory amount on one set of nodes and a different amount on the other set? If so, what is the syntax for making that happen?

Tony,

As far as I know, there is no way to specify different amounts of memory for different nodes. I did a bit of research, and it seems that doing so could complicate scheduling.

Tony,

I have a bit better information for you.

If a user wants to have two different node types, they can submit a heterogeneous job, as explained on this page: https://slurm.schedmd.com/heterogeneous_jobs.html .

This should allow them to do what they want, but it is complicated.

(In reply to Gavin D. Howard from comment #9)
> Tony,
>
> I have a bit better information for you.
>
> If a user wants to have two different node types, they can submit a
> heterogeneous job, as explained in this page:
> https://slurm.schedmd.com/heterogeneous_jobs.html .
>
> This should allow them to do what they want, but it is complicated.

Gavin, thanks. Our extended experience with heterogeneous jobs was not very positive - we had to abandon it, at least until it matures. It just seems that it would be much better to be able to say "give me this type of node with these memory requirements, and this other type of node with these other memory requirements." And while the nodes may be heterogeneous, the job shouldn't be treated so differently.

Tony,

I am sorry that I could not give you better news about that.

Is there anything else I can do for you?

(In reply to Gavin D. Howard from comment #11)
> Tony,
>
> I am sorry that I could not give you better news about that.
>
> Is there anything else I can do for you?
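For reference, the heterogeneous-jobs page linked above describes a batch-script form along these lines. This is an untested sketch, not a script from this ticket; in the 19.05 documentation the component separator directive is `packjob` (renamed `hetjob` in later releases), and the memory values are taken from this ticket's partition defaults:

```
#!/bin/bash
# First component: a few tasks with big-memory requirements
#SBATCH --ntasks=4 --mem-per-cpu=9500
#SBATCH packjob
# Second component: many tasks with standard memory requirements
#SBATCH --ntasks=36 --mem-per-cpu=2300

# Launch a step spanning both components (19.05 option name;
# later releases call this --het-group):
srun --pack-group=0,1 ./app
```

Each component gets its own resource allocation, which is what allows different memory amounts per node type; as noted in the ticket, though, this adds real operational complexity.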
Well, now that you asked, Gavin... the customer asked if we can set a time limit on the nodes (other than the partition limit). That is, say I add these nodes to an existing partition. Normally, the partition permits jobs to run up to 8 hours. But since these nodes have other requirements, the customer wants to limit these nodes to only jobs that can fit in 4 hours. So, could I define the nodes as:

NodeName=h[32-36]m[01-52] CPUs=40 CoresPerSocket=20 RealMemory=384000 Weight=2 features=bigmem MaxTime=0-04:00

Given that these nodes may be in multiple partitions, will Slurm be "smart enough" to permit jobs shorter than four hours to use these nodes and those longer than four hours to use other nodes in that partition? Or will the partition MaxTime take precedence over the node MaxTime?

Thanks,

Tony.

Tony,

As far as I can tell from the docs and source, there is no `MaxTime` option for node definitions.

Tony,

Assuming there is nothing else you need on this bug, I am going to close it. If you have more questions related to this bug, feel free to reopen.

(In reply to Gavin D. Howard from comment #14)
> Tony,
>
> Assuming there is nothing else you need on this bug, I am going to close it.
> If you have more questions related to this bug, feel free to reopen.

Gavin,

Sorry about leaving you hanging. It's been a real busy couple of weeks. No problem here. Please close and I'll re-open if I need to.

Best,

Tony.
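Since there is no per-node MaxTime, one pattern sometimes used elsewhere (my own suggestion, not something proposed in this ticket) is an overlapping partition containing only the restricted nodes, with the tighter limit set at the partition level. A hedged sketch, with a hypothetical partition name:

```
# Hypothetical overlapping partition: same nodes as the big-memory set,
# but with a 4-hour cap. MaxTime is a partition-level option.
PartitionName=hera_short Nodes=h[32-36]m[01-52] MaxTime=04:00:00 \
  DefMemPerCpu=9500 AllowQos=windfall,batch,debug,urgent,admin
```

The important caveat: the 4-hour cap only applies to jobs submitted to this partition. Slurm will not automatically route short jobs to these nodes or steer long jobs away from them within the original partition, so this does not fully answer the "smart enough" question above.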