Ticket 11051

Summary: Request for general tuning assistance
Product: Slurm
Reporter: Jurij Pečar <jurij.pecar>
Component: Configuration
Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
Version: 20.11.7
Hardware: Linux
OS: Linux
Site: EMBL
Attachments: Part of last year usage report
slurm.conf
allocation/usage graph
logs and command outputs
second batch of logs
third batch of logs

Description Jurij Pečar 2021-03-10 05:55:58 MST
Created attachment 18334 [details]
Part of last year usage report
Comment 1 Jurij Pečar 2021-03-10 05:58:11 MST
Created attachment 18335 [details]
slurm.conf
Comment 2 Jurij Pečar 2021-03-10 06:10:35 MST
Hi, 

I'd like your comments on our config and whether there's anything we can improve. Attached are an overview of our jobs from last year and our config.

We're a typical HTC case with only two or three MPI apps in use. My aim is to optimize for user happiness, meaning the shortest possible pending times.

What I do now is switch between daily and nightly QoS policies. This is what I currently define:
# sacctmgr show qos -P
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES
normal|100|00:00:00|low,lowest||cluster|||1.000000||||||||||||||cpu=3026,mem=19624G||5000|
lowest|1|00:00:00|||cancel|||1.000000|||||||||||||||10000|250000|
low|10|00:00:00|||requeue|||1.000000|||||||||||||||10000|250000|
high|1000|00:00:00|low,lowest||cluster|||1.000000|||||||||||cpu=128||||||
highest|10000|00:00:00|low,lowest||cluster|||1.000000|||||||||||cpu=64||||||

Default QoS is "normal"; its cpu/mem GrpTRES is set to 30% of available resources from 6am-6pm and to 80% from 6pm-6am.
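For reference, a minimal sketch of how that day/night switch could be driven from cron; the 6am numbers are the ones in the table above, while the 6pm (80%) numbers here are illustrative, not our exact values:

```shell
# Sketch of the cron-driven QoS switch described above.
# 06:00 - daytime limits (~30% of the cluster):
sacctmgr -i modify qos normal set GrpTRES=cpu=3026,mem=19624G
# 18:00 - nighttime limits (~80% of the cluster, illustrative numbers):
sacctmgr -i modify qos normal set GrpTRES=cpu=8070,mem=52331G
```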

Of course this only works for jobs shorter than about one day. Still, every now and then someone comes along and dumps thousands of jobs into the queue asking for one week, clogging the cluster. For now I handle such cases individually by educating these users, which is usually effective. The problem is that turnover at EMBL is high by design and I get new users every few weeks ...

Additionally I'd like to know if you have any good ideas about 
a) fairshare in our situation
b) raising priority based on past job efficiency
c) helping my users make better memory request estimates

I'm measuring efficiency per job with seff, and on the whole cluster by comparing what all running jobs have allocated with what ganglia shows as actually in use. The third attachment is a plot of this throughout the year, and it's clear that there is tons of room for improvement.

Thanks for suggestions.
Comment 3 Jurij Pečar 2021-03-10 06:10:58 MST
Created attachment 18337 [details]
allocation/usage graph
Comment 4 Ben Roberts 2021-03-10 11:45:40 MST
Hi Jurij,

The thing that comes to mind for your problem with users submitting thousands of long jobs is to put a maximum wall time on the normal QOS.  If you want to allow long jobs, you could create a separate QOS for them and (assuming you want to limit their number) place a MaxJobs or MaxSubmit limit on it to encourage users to use a shorter wall time for their regular work.  This lets them pick between a lot of short jobs or a few long ones, which will hopefully lead to users educating themselves when they see their options.
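As a rough sketch of what I mean (the QOS name and the limit values here are just examples, adjust to taste):

```shell
# Cap the normal QOS at one day of wall time:
sacctmgr -i modify qos normal set MaxWall=1-00:00:00
# Create a separate QOS for long jobs, limited per user:
sacctmgr -i add qos long
sacctmgr -i modify qos long set MaxWall=7-00:00:00 MaxJobsPU=50 MaxSubmitPU=100
```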

It looks like you already have Fairshare at least partially configured:
PriorityWeightFairshare=500

The weight you give Fairshare is relatively low compared to the other priority factors, so you may not be seeing a big difference in job scheduling because of it.  It can be hard to tell how the different priority factors affect jobs if you don't know where to look.  The 'sprio' command shows a breakdown of the values of the different priority factors that contribute to a job's overall priority.  This should make it easier to decide how to adjust the PriorityWeight* values for the different factors.  If you have a specific scenario you're trying to address, please let me know and I can offer some advice on how you might handle it.

There isn't a way to raise priority based on past efficiency, but you can encourage users to request just enough by billing them for the maximum of the resources their job requests.  You can do this by defining a relationship between the resources with the TRESBillingWeights parameter and setting PriorityFlags=MAX_TRES.  As a simple example, on a node with 4 CPUs and 16 GB of RAM, a user who requested 1 CPU and 12 GB of RAM would be using 75% of the node's resources and would be charged for that, rather than just for 1 CPU (or 25% of the node).  For there to be real consequences for requesting more resources than necessary, you would want Fairshare to have a greater impact on a job's overall priority.  You can read more about TRESBillingWeights here:
https://slurm.schedmd.com/slurm.conf.html#OPT_TRESBillingWeights
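To make the arithmetic concrete, here's a small sketch of that MAX_TRES calculation; the weights (CPU=0.25, Mem=0.0625 per GB) are assumptions chosen so that a full 4-CPU/16 GB node bills as 1.0:

```shell
# Hypothetical MAX_TRES billing for the example above: 1 CPU + 12 GB
# requested on a 4-CPU / 16 GB node.
cpus_requested=1
mem_gb_requested=12
billing=$(awk -v c="$cpus_requested" -v m="$mem_gb_requested" 'BEGIN {
    cpu_bill = c * 0.25      # CPU weight: 1 of 4 CPUs  -> 0.25 of the node
    mem_bill = m * 0.0625    # Mem weight: 12 of 16 GB  -> 0.75 of the node
    # MAX_TRES charges whichever resource dominates:
    if (cpu_bill > mem_bill) print cpu_bill; else print mem_bill
}')
echo "$billing"
```

With plain proportional CPU billing the same job would bill only 0.25; MAX_TRES charges for the dominant resource instead.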

Please let me know if you have any questions about this.

Thanks,
Ben
Comment 5 Ben Roberts 2021-04-08 08:54:53 MDT
Hi Jurij,

Do you have any additional questions about tuning your scheduler?  Let me know if there's anything else we can do to help or if this ticket is ok to close.

Thanks,
Ben
Comment 6 Jurij Pečar 2021-04-08 20:55:44 MDT
Sure, here's one.

As you can see, we use a single gpu partition with different GPU types, where users select the type they need via features. Every now and then a job blocks the queue waiting for a particular GPU type that is in high demand (reason "Resources"), causing other jobs at lower priority to pend with reason "Priority", even though the GPUs they want are available.

I need to lower the priority of the blocking job manually to allow the pending jobs to start immediately.

We have a similar story with fat mem nodes sharing the same partition with regular compute nodes.

Is there a way around this short of having a separate homogeneous partition for each hardware type?
Comment 7 Ben Roberts 2021-04-09 08:42:06 MDT
It sounds like the backfill scheduler isn't able to get to the lower priority jobs to schedule them on available resources.  This would especially be a problem if you have a large number of jobs queued, so that the backfill scheduler can't evaluate the entire job queue in a single iteration.  We have a parameter called 'bf_continue' that you can set in SchedulerParameters.  It allows the backfill scheduler to evaluate as many jobs as it can in a single iteration, but keep track of where it stops so it can pick up from there in the next iteration.  This way there won't be jobs that are perpetually left un-evaluated.  Let me know if this helps with the behavior you're seeing.
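In slurm.conf that would look something like this (a sketch; keep whatever other parameters you already have on the line, this just adds the flag):

```shell
# slurm.conf - bf_continue lets backfill resume where it left off
SchedulerParameters=bf_continue
```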

Thanks,
Ben
Comment 8 Ben Roberts 2021-05-11 08:33:49 MDT
Hi Jurij,

I wanted to check with you to see if you've been able to try setting the bf_continue parameter and whether it's helped with the behavior you were seeing.  Let me know if you still need help with this ticket.

Thanks,
Ben
Comment 9 Jurij Pečar 2021-05-17 02:46:08 MDT
Yes, I've added bf_continue to SchedulerParameters and I'm still observing this behavior. We mix nodes with different memory sizes in the same general htc partition, and right now I have a job pending with Resources, asking for an amount of memory only the few highmem nodes can provide. Meanwhile 1500+ small jobs are pending with Priority, although nodes that could run them are idling.

Anything else I can do about this? Or is splitting the partition into homogeneous hw configs the only option?
Comment 10 Ben Roberts 2021-05-17 09:58:13 MDT
I'm sorry to hear that this is still an issue for you.  From what I understand of your requirements, we should be able to make this work without putting the different types of hardware in separate partitions.  Do you see this on a regular basis on your cluster?  If so, I would like you to gather the output of the following commands the next time the cluster is in this state:
squeue
scontrol show jobs
scontrol show nodes
sdiag

If you could identify a couple of jobs that you think should be able to start, that would be helpful.  I would also like to see some additional logging from the time you collect the command output.  If you could enable 'backfill' related logging for several minutes and then turn it off again when you're done, that would be helpful.  You can enable and disable it like this:
scontrol setdebugflags +backfill
scontrol setdebugflags -backfill


Thanks,
Ben
Comment 11 Jurij Pečar 2021-05-17 10:16:54 MDT
Created attachment 19517 [details]
logs and command outputs
Comment 12 Jurij Pečar 2021-05-17 10:31:10 MDT
The way our budgeting works, we end up renewing about 20% of the cluster each year, so it is a challenge to keep hardware uniform ... We are now mostly on AMD Rome, but with different machines, different configs and memory sizes. SmerNN-N are our primary nodes with 256G of memory, then we have sb0[45]-NN with 512G of memory and sm-epyc-0[1-5] with 2T of memory.

I have fond memories of the PBS default routing queue that greatly simplified things for end users. It would be great if Slurm came up with something like that. Until then, I'd like to hear your suggestions on how to approach such a state, so that jobs destined for specific nodes don't hold back jobs that could otherwise run.

In the tarball I just attached, you can see two jobs at the top of the htc partition asking for 1T of memory and blocking everything else. See for example the jobs by user fabreges, like 17766971. He's asking for 1 core and 8G of memory, and yet Slurm is keeping some of the smer nodes powered down. There's even job 17766971, a tiny 1c/1G 5-minute thing that could surely squeeze in somewhere without causing any major issues or delays for other jobs.
Comment 13 Ben Roberts 2021-05-17 15:09:18 MDT
Thank you for sending that information.  I've been looking through it and I have an idea of why you aren't seeing more jobs being backfilled.  It looks like the backfill scheduler is only able to evaluate around 15 jobs per iteration.  With thousands of jobs queued (as you had when you gathered this info) it would take a long time for all the jobs to be evaluated.  A couple of things are contributing to the fact that it only evaluates 15 jobs at a time.  One is that node sb01-13 is repeatedly contacting the controller because its hardware doesn't match how the node is configured in slurm.conf.  It repeatedly generates errors that look like this:
[2021-05-17T18:11:35.535] error: Node sb01-13 has low socket*core*thread count (24 < 48)
[2021-05-17T18:11:35.535] error: Node sb01-13 has low cpu count (24 < 48)
[2021-05-17T18:11:35.535] error: Node sb01-13 has low real_memory size (95150 < 191895)
[2021-05-17T18:11:35.535] error: _slurm_rpc_node_registration node=sb01-13: Invalid argument

It looks like that node is defined alongside a bunch of similar nodes, so it is probably supposed to have the hardware you have defined.  There might be some sort of hardware problem causing it not to report all the CPUs and memory, though.  Can I have you run a few commands on that node to confirm that theory?
slurmd -C
lscpu
free


The other part of the equation for limited backfill is that you have 'max_rpc_cnt=16' defined in SchedulerParameters.  This defines how many remote procedure calls can be queued before the scheduler stops scheduling to handle those requests.  With this set to 16 and node sb01-13 sending a steady stream of RPCs, the backfill scheduler stops shortly after it starts.

To help with the problem you're facing, I think there are two things you need to do.  First, make sure node sb01-13 isn't generating errors about the CPUs and memory not matching the configuration.  Second, increase the value of max_rpc_cnt.  Even once the node stops generating these errors, I would recommend increasing that value to give the scheduler more time before it has to stop to handle requests.  I would probably start with a value around 64 and see how that affects scheduling, balanced against responsiveness to user commands.  Settings like this take some adjusting to meet the needs of each site, so you can adjust it up or down after seeing how the change affects things.

I'll wait to see the information about that node, but feel free to let me know if you have any questions about this.

Thanks,
Ben
Comment 14 Jurij Pečar 2021-05-18 08:24:26 MDT
Good catch on the max_rpc_cnt. Not sure where I found the suggestion to set it like this ... I've now raised it to 64 and will monitor the situation. Is there anything in particular I should grep the logs for to see how this change affected the scheduler?

Yes, sb01-13 has one fried CPU and is waiting for its replacement. There's another node with a disabled DIMM and another one that refuses to power on. How do these unhealthy nodes affect the scheduler? I would imagine a dead node makes Slurm try to ping it every now and then? The node with "Low RealMemory" is in a known unhealthy state, so it should not cause many issues for Slurm? And the node with low memory and low CPU count should not either, apart from spamming the logs. Anyway, for now I've stopped slurmd on sb01-13.
Comment 15 Ben Roberts 2021-05-18 09:10:08 MDT
To get an accurate view of how this change affects backfill scheduling in the logs, you would have to keep the debug flag enabled, which I don't recommend long-term.  The way I would monitor the effects of the change is by looking at the 'sdiag' output.  It has a section titled "Backfilling stats" showing how many jobs are being evaluated each cycle.  The most relevant values for what we're looking at are "Last depth cycle" and "Depth Mean".  You would want to see those numbers increase dramatically from being in the teens, like the Depth Mean was in the output you sent.
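For example, something like this run periodically would let you trend those values without any debug logging (a sketch; the exact label text may vary between versions):

```shell
# Pull the backfill depth figures out of the sdiag output
sdiag | grep -E 'Last depth cycle|Depth Mean'
```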

If nodes with hardware problems are down, there shouldn't be any real impact on scheduling.  The problem is when the nodes are still up and running slurmd and repeatedly trying to contact the controller.  One option is to re-define the node's hardware temporarily until the bad processor has been replaced.  The other is to keep it down until the actual hardware matches the rest of the nodes.

Thanks,
Ben
Comment 16 Jurij Pečar 2021-05-18 15:11:25 MDT
Ok, after increasing max_rpc_cnt to 64, I saw the backfill depth cycle and depth mean increase to over 1800. Then in the afternoon we again got a couple of thousand jobs submitted and things slowed down. I bumped max_rpc_cnt again to 128, and two hours later I still see a Depth Mean of only 100 at a queue length of over 3500.

What should I look for to understand why that is?
Comment 17 Ben Roberts 2021-05-18 16:07:18 MDT
Hi Jurij,

I'm glad to hear that with those changes the depth cycle and mean were getting up to around 1800; that sounds more like what I would expect.  However, if a large influx of jobs causes things to slow down and the controller isn't able to evaluate as many jobs, then something else is probably going on.  It's possible there are delays from the controller writing state to the StateSaveLocation.  Can you let me know if the StateSaveLocation is on a shared file system?  There could also be other network related delays causing a slowdown.  If you can send the slurmctld logs, I'll see if anything there gives a clue about why the backfill scheduler isn't processing more jobs.

Thanks,
Ben
Comment 18 Jurij Pečar 2021-05-19 01:16:51 MDT
Yes, StateSaveLocation is on a NetApp, mounted on the two VMs that run slurmctld. I don't see any iowait on these two VMs. Btw, they're configured with 8 cores and 8GB of memory.
Our storage team told me that this share is configured for maximum IOPS, so it should be pretty snappy already. Anyway, I asked them to temporarily disable any QoS on it to rule out this possibility.

Otherwise, the current state of the cluster is 46% allocated CPUs and 94% allocated memory, with sdiag reporting that backfill manages to reach about 80 jobs deep into the queue.

My current plan is to not switch to the "nightly QoS" at 6pm, keeping the daily limits in place so that the current memory hog doesn't consume all the memory again overnight.
Comment 19 Jurij Pečar 2021-05-19 01:17:34 MDT
Created attachment 19558 [details]
second batch of logs
Comment 20 Ben Roberts 2021-05-19 10:31:32 MDT
Thank you for collecting and sending that information.  I didn't see anything as obvious in the logs this time that points to a problem, but I do have a couple of ideas about contributing factors.  One is the overall memory usage on the system: as you mention, memory is 94% allocated.  A lot of the jobs in the queue request a large amount of memory or are in an account that has met the MaxMemoryPerAccount limit.  But that isn't all the jobs; there are still plenty that don't request much memory and should be able to start on available resources.

The logs show that backfilling happens periodically, with some cycles starting quite a few jobs while others only start one or two.  I was hoping the backfill debug flag would be enough to show what was happening, but I think we'll need additional logging to see why more jobs aren't being started/evaluated each time.  I assume sdiag shows the backfill scheduler processing more jobs when there aren't a lot of jobs in the queue, is that right?  Can I have you enable debug3 logging the next time the cluster is in this state?  I would like to have the backfill flag still included with this, so you can enable both by running the following commands:
scontrol setdebug debug3
scontrol setdebugflags +backfill

I would also like the output of the other commands you've collected previously during this time.  You can disable the debug logging again after a few minutes by running:
scontrol setdebug info
scontrol setdebugflags -backfill


Regarding the scheduler parameters you have set, I don't see evidence that max_rpc_cnt was being hit and causing the backfill scheduler to stop prematurely.  Since that's the case, I would recommend setting it back to 64 for now, unless we find something in the logs indicating that's too low.  I also see that several users hold the majority of the pending jobs in the queue.  Right now you have 'bf_max_job_user=1000', which means the scheduler will look at the first 1000 jobs from a user before moving on to the next user.  You may consider lowering that to around 250 to increase the likelihood that jobs from all users are considered in each backfill cycle.

To summarize, I would recommend making the changes to bf_max_job_user and max_rpc_cnt, and gathering logs with 'debug3' and 'backfill' enabled the next time you have a large influx of jobs that results in the backfill scheduler processing fewer jobs than normal.

Thanks,
Ben
Comment 21 Jurij Pečar 2021-05-19 11:24:05 MDT
I've talked with the user consuming the most memory about shrinking her batches, so it might be a while before we hit this state again. Will keep an eye on it and collect the info, now that I know how.
Comment 22 Jurij Pečar 2021-05-19 13:45:10 MDT
Created attachment 19566 [details]
third batch of logs
Comment 23 Jurij Pečar 2021-05-19 13:46:47 MDT
Ok, this happened earlier than expected. Another few highmem jobs are blocking the queue, with 6k+ jobs queued and backfill only reaching 60 jobs deep into the queue.
This is now with bf_max_job_user=250 and max_rpc_cnt=64.

Curious about what you'll find.
Comment 24 Jurij Pečar 2021-05-19 13:48:11 MDT
I might also add that the cluster's CPUs are 56% allocated and memory is 62% allocated, so there should be room somewhere to accommodate more jobs ...
Comment 25 Ben Roberts 2021-05-20 10:49:57 MDT
Thanks for collecting that information one more time and sending it my way.  These logs show a lot more information and provide some clues about why more jobs aren't being evaluated each backfill cycle.  The thing that stands out most is that some jobs take a long time to be evaluated for backfill, up to 2 seconds each.

[2021-05-19T21:36:58.890] debug2: backfill: entering _try_sched for JobId=17856726.
...
[2021-05-19T21:37:00.944] JobId=17856726 to start at 2021-05-20T04:01:10, end at 2021-05-20T12:01:00 on nodes sb04-04 in partition htc


Looking into what makes the evaluation of one of these jobs take so long, I can see a function (part_data_build_row_bitmaps) that consistently takes several milliseconds.  That may not sound like much, but when it happens thousands of times it adds up.  Here are some example log entries, with an added line break to show the delay between part_data_build_row_bitmaps and the next entry.

[2021-05-19T21:36:58.946] debug2: select/cons_res: _will_run_test, JobId=17832600: overlap=1
[2021-05-19T21:36:58.946] debug3: select/cons_res: job_res_rm_job: JobId=17832600 action 0
[2021-05-19T21:36:58.946] debug3: select/cons_res: job_res_rm_job: removed JobId=17832600 from part htc row 0
[2021-05-19T21:36:58.946] debug3: select/cons_res: part_data_build_row_bitmaps reshuffling 1846 jobs

[2021-05-19T21:36:58.950] debug2: select/cons_res: _will_run_test, JobId=17832604: overlap=1
[2021-05-19T21:36:58.950] debug3: select/cons_res: job_res_rm_job: JobId=17832604 action 0
[2021-05-19T21:36:58.950] debug3: select/cons_res: job_res_rm_job: removed JobId=17832604 from part htc row 0
[2021-05-19T21:36:58.950] debug3: select/cons_res: part_data_build_row_bitmaps reshuffling 1845 jobs


I looked into this and found that this delay was reported before, and a fix was added in 20.11 to optimize this function.  The details are in bug 9365.  I see that you are still on 20.02, so upgrading to 20.11 should reduce this delay in evaluating jobs for backfill.  I know upgrading may not be something you can do immediately.  What would the time frame look like at your site for going to 20.11?

In the meantime, there are a few other suggestions I can make to try to optimize the backfill cycle.  For your scheduler parameters I would recommend adding 'bf_running_job_reserve', which keeps the backfill scheduler from trying to evaluate nodes that are fully occupied.  I would also recommend reducing bf_window to 20160, or 14 days.  Right now it is set to 20 days, and the longest wall time you allow is 14 days, in your 'htc' partition.  Finally, I would recommend increasing bf_resolution to 120 seconds, from the current default of 60.  This essentially tells the scheduler to consider job placement in 2-minute blocks rather than 1-minute blocks; it can evaluate things more quickly, but placement may not be as tight as with smaller time blocks.

In summary I would add the following to your SchedulerParameters:
bf_running_job_reserve
bf_window=20160
bf_resolution=120
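Put together with the earlier recommendations, the SchedulerParameters line might look something like this (a sketch; merge it with whatever else you already have set there):

```shell
# slurm.conf - combined backfill tuning sketch
SchedulerParameters=bf_continue,bf_max_job_user=250,max_rpc_cnt=64,bf_running_job_reserve,bf_window=20160,bf_resolution=120
```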

These parameters may help a little, but I do think that the biggest improvement for what you are seeing will come from an upgrade to 20.11 where the optimization for part_data_build_row_bitmaps will take effect.  Please let me know if you have any questions/concerns about this.

Thanks,
Ben
Comment 26 Jurij Pečar 2021-05-26 02:07:01 MDT
We already got a bit of an unplanned upgrade of slurmdbd when EPEL introduced 20.11 Slurm packages into their repository ... I didn't dare to stop that one, but I prevented the rest of the cluster from jumping onto it.

Yesterday I upgraded everything to 20.11.7 in a controlled fashion and it appears to run as expected. I also applied the settings you recommended and will monitor the situation. So far the job load is reasonable and everything is running smoothly. I'll let you know if I see pending jobs piling up again.
Comment 27 Ben Roberts 2021-05-26 11:20:49 MDT
I'm sorry the EPEL upgrade caught you off guard, but letting the slurmdbd upgrade go to completion was the right approach if you didn't have a recent backup.  I'm glad to hear you were able to upgrade the rest of the way to 20.11 without incident.  I'll leave this ticket open for now while you monitor things.

Thanks,
Ben
Comment 28 Jurij Pečar 2021-06-01 07:53:27 MDT
Now with 20.11.7 I see backfill numbers like:

Last cycle: 26517416
Max cycle:  31253070
Mean cycle: 10000832
Last depth cycle: 1860
Last depth cycle (try sched): 282
Depth Mean: 380
Depth Mean (try depth): 139
Last queue length: 4197

Which look good enough for me to move on to my next question.

I'm collecting allocation numbers straight from Slurm and actual usage from ganglia, for the number of CPUs and for memory. I do take into account that ganglia counts SMT threads as CPUs.

What I usually see is cpu usage hovering between 50-80% of cpu allocation and memory usage around 20-60% of memory allocation. I'd like to improve these numbers.

I know the proper fix is to improve software quality, but this is bioinformatics; I even see sed and awk being used in these "high performance" pipelines. So rewriting this software is a rather futile approach.

I've already taken a look at the oversubscription features Slurm offers and I think there's not much more I can do here; please let me know if there is. I see that oversubscription works quite well for a single user: allocations approach 200% when a single user submits thousands of poorly behaving jobs into a mostly empty cluster. Is there a way to achieve the same without the user boundary?

The other issue is memory. Slurm looks only at resident memory (the RES field in top) (or is this because we use cgroups?), but many of our codes dealing with large datasets use mmap() and end up with most of their memory under VIRT. This sometimes leads to swapping even when jobs are not reaching their RES limits. I have a hard time teaching users how to estimate their jobs' memory requirements, which turns into "let's just ask for a bit more to be sure", and in the end I'm looking at 88% memory allocation and 35% memory usage, like right now.

How do you suggest to handle such situation?

I'm thinking that I/we should teach Slurm to somehow also account for VIRT, and that it should make scheduling decisions based more on actual usage and not just on what people ask for. Is there a way to achieve this?
Comment 29 Ben Roberts 2021-06-03 09:10:05 MDT
Hi Jurij,

I'm glad to hear that backfill is looking better after upgrading to 20.11.7.  

You can oversubscribe CPUs on the cluster for more than just a single user.  In your slurm.conf you can set the OverSubscribe option on individual partitions; it can be configured to disallow, allow or force oversubscription.  As you're aware, there is a tradeoff in sharing CPUs, so I would recommend starting with 'OverSubscribe=FORCE:2' and seeing how things look.  You may also want to consider whether certain partitions are better candidates for oversubscription than others.
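As a sketch (the partition and node names here are placeholders for your own):

```shell
# slurm.conf - allow up to 2 jobs to share each CPU in this partition
PartitionName=htc Nodes=smer[01-64] OverSubscribe=FORCE:2 State=UP
```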

You also bring up memory, which can't be oversubscribed the way CPUs can.  There isn't a way to automatically adjust the amount of memory allocated to a job based on past usage, but there are things you can do to encourage users to request what they need and no more.  You can set a default amount of memory per CPU (DefMemPerCPU) and a maximum amount of memory per CPU (MaxMemPerCPU); this lets you prevent a user from requesting half the memory on a node with a single CPU, for example.  These can be set cluster-wide or per partition.  You can also use TRESBillingWeights to define how much the memory request impacts a job's reported usage; coupled with Fairshare, this encourages users to request only as much memory as they need, since requesting a lot of memory will impact their reported usage.  Finally, if you (as an admin) can accurately determine which types of jobs need certain amounts of memory, you can use a job submit filter to set a pre-determined amount of memory on a job when it is submitted.  I think this is the riskiest approach, as you will probably have special cases that a script won't handle correctly, but I wanted to include it as an option.
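For the memory limits, a per-partition sketch (the values and node names are assumptions; e.g. on a 64-core/256G node, 4096 MB per CPU matches the hardware ratio):

```shell
# slurm.conf - default and cap on memory per allocated CPU, in MB
PartitionName=htc Nodes=smer[01-64] DefMemPerCPU=2048 MaxMemPerCPU=4096 State=UP
```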

I hope this helps.  Let me know if you have any additional questions.

Thanks,
Ben
Comment 30 Ben Roberts 2021-07-02 11:08:21 MDT
Hi Jurij,

I wanted to follow up and see if you have any additional questions about this.  Let me know if there's anything else I can do to help.

Thanks,
Ben
Comment 31 Ben Roberts 2021-07-29 11:18:34 MDT
Hi Jurij,

I haven't heard any follow up questions so I'll close this ticket.  Let us know if there's anything else we can do to help.

Thanks,
Ben