Ticket 21446 - MaxCPUsPerNode and MaxMemPerCPU not being enforced
Summary: MaxCPUsPerNode and MaxMemPerCPU not being enforced
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 24.05.2
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Michael Norris
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2024-11-18 10:35 MST by Jeremy Sullivan
Modified: 2025-01-02 10:28 MST

See Also:
Site: Arc Research Institute
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (2.21 KB, text/plain)
2024-11-18 10:35 MST, Jeremy Sullivan
cgroup.conf (113 bytes, text/plain)
2024-11-18 10:36 MST, Jeremy Sullivan
slurmctld.log (8.89 KB, text/plain)
2024-11-18 12:04 MST, Jeremy Sullivan

Description Jeremy Sullivan 2024-11-18 10:35:41 MST
Created attachment 39735 [details]
slurm.conf

I have a partition with the following config, in particular MaxCPUsPerNode=32 and MaxMemPerCPU=10240.

PartitionName=gpu_batch
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=3-00:00:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=14-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=32 MaxCPUsPerSocket=UNLIMITED
   NodeSets=ALL
   Nodes=GPU[3694,7094,7130,7220],GPU3B62,GPU36DC,GPU70A0,GPU70DC,GPU71BA,GPU71E4,GPU71FM,GPU389E,GPU708E,GPU720E,GPU724A,GPU726E,GPUC[870,960,972],GPUCA6E,GPUCACE,GPUCB52,GPUCB7C,GPUCBB8
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=4608 TotalNodes=24 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerCPU=10240
   TRES=cpu=4608,mem=24756720M,node=24,billing=4608,gres/gpu=96

My expectation is that the maximum memory allocation on any node in this partition would be 320GB (32 CPUs * 10240 MB). However, the following job was able to successfully allocate 800GB:

$ sudo scontrol show job 53903
JobId=53903 JobName=1118_run_7b_config_baseline_nodp
   Priority=1 Nice=0 Account=hielab QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=04:00:39 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2024-11-17T12:05:22 EligibleTime=2024-11-17T12:05:22
   AccrueTime=2024-11-17T12:05:22
   StartTime=2024-11-18T05:22:17 EndTime=2024-11-19T05:22:17 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-11-18T05:22:17 Scheduler=Main
   Partition=gpu_batch AllocNode:Sid=GPU720E:229126
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=GPUC960
   BatchHost=GPUC960
   NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=4,mem=800G,node=1,billing=4,gres/gpu=1
   AllocTRES=cpu=32,mem=800G,node=1,billing=32,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=80 MinMemoryNode=800G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   MemPerTres=gpu:81920
   TresPerNode=gres/gpu:1
   TresPerTask=cpu=4

This is blocking access to the other GPUs/resources on the node. Are my settings incorrect or am I misunderstanding them? I have attached my full slurm.conf.
Comment 1 Jeremy Sullivan 2024-11-18 10:36:09 MST
Created attachment 39736 [details]
cgroup.conf
Comment 2 Ethan Simmons 2024-11-18 11:26:27 MST
Could you upload the logs for that job submission and the job submission itself (with timestamp)?
Comment 3 Jeremy Sullivan 2024-11-18 12:04:17 MST
Created attachment 39737 [details]
slurmctld.log

Sure, attached here. I do see an error due to the memory limit. The job seemed to pend for almost 24 hours until it was eventually allocated anyway.
Comment 4 Ethan Simmons 2024-11-19 10:56:31 MST
Do you have the srun/sbatch/salloc/etc. command that was run to submit the job? I want to see what the user typed to get that result. Also, to help correlate the job submission with the logs, please include the 
time the job submission command was run.
Comment 5 Jeremy Sullivan 2024-11-19 14:28:29 MST
Sure, the sbatch script has the following settings:

#!/bin/bash

#SBATCH --job-name=redacted
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --partition=gpu_batch
#SBATCH --cpus-per-task=4
#SBATCH --mem=800G
#SBATCH --time=1-00:00:00
#SBATCH --signal=B:SIGINT@300
#SBATCH --output=log/output.log
#SBATCH --open-mode=append

Obviously --mem=800G is why the job is requesting so much memory, but the question is why it is being accepted into the gpu_batch partition, which has MaxCPUsPerNode=32 and MaxMemPerCPU=10240?


I believe the command was run at SubmitTime=2024-11-17T12:05:22 as shown in the scontrol show job 53903 output.
Comment 6 Ethan Simmons 2024-11-22 16:23:43 MST
Just to give you a quick update: I've been able to reproduce the behavior that you're seeing. I'll be looking deeper in to if this is intended or not, and what we need to do about it going forward.
Comment 7 Jeremy Sullivan 2024-11-26 16:19:14 MST
OK, thank you. So you're confirming the limit is not behaving as intended?
Comment 8 Ethan Simmons 2024-12-02 14:43:12 MST
According to the documentation, the behavior we're currently seeing is correct. Since you have a fixed memory request and a mem-per-cpu limit, the job has to add CPUs to meet both of those requirements. Adding CPUs this way will go over the limits you've set, as stated here (link 1):

> NOTE: If a job specifies a memory per CPU limit that exceeds this system limit, that job's count of CPUs per task will try to 
> automatically increase. This may result in the job failing due to CPU count limits. This auto-adjustment feature is a best-effort one and 
> optimal assignment is not guaranteed due to the possibility of having heterogeneous configurations and multi-partition/qos jobs. If this 
> is a concern it is advised to use a job submit LUA plugin instead to enforce auto-adjustments to your specific needs. 
If you are regularly running into a situation where users are submitting these kinds of jobs, setting up a job submit plugin like the docs suggest can help you.
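To put numbers on the job above: --mem=800G is 819200 MB, and 819200 MB / 10240 MB (the MaxMemPerCPU limit) = 80 CPUs. That is where the MinCPUsNode=80 in your scontrol output comes from, and it is well above the partition's MaxCPUsPerNode=32.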
Does that answer your question?

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_MaxMemPerCPU
Comment 9 Jeremy Sullivan 2024-12-03 10:54:35 MST
I understand the auto-increase in CPUs to meet the requested memory; however, shouldn't the MaxCPUsPerNode partition setting prevent the job from being scheduled after it has auto-incremented the CPU count?

Should I be setting MaxMemPerNode instead of using MaxMemPerCPU?
Comment 11 Jeremy Sullivan 2024-12-12 17:20:52 MST
Hey, just following up on this. I am also seeing a ton of

Dec 12 16:19:45 arc-slurm slurmctld[416896]: slurmctld: debug:  JobId=66284: Setting job's pn_min_cpus to 60 due to memory limit

logs, which seem to be causing delays in the scheduler. So I am wondering if that could be resolved by switching to MaxMemPerNode?
Comment 12 Ethan Simmons 2024-12-13 09:23:59 MST
If you want to restrict user resources, make sure to set MaxMemPerNode as well as your other limits (MaxMemPerCPU). If your users are requesting --mem directly, there should be a restriction on that parameter 
(MaxMemPerNode). If you want a more fine-tuned restriction, set that restriction in a job submit plugin.
Does that answer your question?
Comment 13 Jeremy Sullivan 2024-12-13 11:53:38 MST
I think so, but it still seems like the max memory limit is not being properly enforced. For example, I have the following partition:

PartitionName=gpu_40gb Nodes=40gbnodes \
  DefaultTime=12:00:00 MaxTime=1-00:00:00 \
  MaxCPUsPerNode=160 MaxMemPerNode=819200 \
  DefMemPerGPU=102400 PriorityTier=11 \
  MaxNodes=1 \
  State=UP

I also have DefMemPerCPU=4096 in my slurm.conf.

My expectation is that jobs should not be able to allocate more than 819200M when using this partition.

If I run:
>$ salloc -p gpu_20gb --mem=900G
> salloc: error: Job submit/allocate failed: Memory required by task is not available

The job is correctly rejected.

However if I first allocate a job with 800G via

>$ salloc -p gpu_20gb --mem=800G
>salloc: Pending job allocation 66569
>salloc: job 66569 queued and waiting for resources
>salloc: job 66569 has been allocated resources
>salloc: Granted job allocation 66569
>salloc: Nodes GPUCACE are ready for job

I can kick off another job that consumes more than the ~19GB of remaining headroom, and thus the node goes over the MaxMemPerNode=819200 limit:

> $ salloc -p gpu_20gb --mem=100G
>salloc: Pending job allocation 66573
>salloc: job 66573 queued and waiting for resources
>salloc: job 66573 has been allocated resources
>salloc: Granted job allocation 66573
>salloc: Nodes GPUCACE are ready for job

scontrol output showing 900GB allocated on node:

$ scontrol show node GPUCACE
NodeName=GPUCACE Arch=x86_64 CoresPerSocket=48 
   CPUAlloc=4 CPUEfctv=184 CPUTot=192 CPULoad=32.11
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:1g.20gb:16(S:0-1)
   NodeAddr=GPUCACE NodeHostName=GPUCACE Version=24.05.2
   OS=Linux 5.15.0-125-generic #135-Ubuntu SMP Fri Sep 27 13:53:58 UTC 2024 
   RealMemory=1031530 AllocMem=921600 FreeMem=965778 Sockets=2 Boards=1
   CoreSpecCount=4 CPUSpecList=80-87 MemSpecLimit=20480
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu_20gb 
   BootTime=2024-12-11T14:44:01 SlurmdStartTime=2024-12-11T18:39:36
   LastBusyTime=2024-12-13T10:45:56 ResumeAfterTime=None
   CfgTRES=cpu=184,mem=1031530M,billing=184,gres/gpu=16,gres/gpu:1g.20gb=16
   AllocTRES=cpu=4,mem=900G

Am I missing something? Shouldn't the second job wait in the queue for resources?
Comment 14 Jeremy Sullivan 2024-12-16 10:29:14 MST
In fact it seems like --mem can circumvent the MaxCPUsPerNode restriction as well.

I have the following partition that should only allow 152 cores and 152 * 4GB = 608GB of RAM:

PartitionName=cpu Default=YES Nodes=h100nodes \
  QOS=nogpu MaxCPUsPerNode=152 MaxMemPerCPU=4096 \
  DefaultTime=12:00:00 MaxTime=5-00:00:00 \
  MaxNodes=1 State=UP

If I try to request more than 152 cores it is correctly rejected:

salloc -c 172 -p cpu
salloc: error: Job submit/allocate failed: Requested node configuration is not available

However, if I request more than 608GB of RAM, the request is accepted and I am allocated more than 152 cores:

$ salloc --mem 700G -p cpu
salloc: Pending job allocation 69267
salloc: job 69267 queued and waiting for resources
salloc: job 69267 has been allocated resources
salloc: Granted job allocation 69267
salloc: Nodes GPUC870 are ready for job

job config showing 176 cores allocated:

$ scontrol show job 69267
JobId=69267 JobName=interactive
   UserId=jeremys(10001) GroupId=arcinfra(10001) MCS_label=N/A
   Priority=1 Nice=0 Account=arcinfra QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:59 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2024-12-16T09:26:16 EligibleTime=2024-12-16T09:26:16
   AccrueTime=2024-12-16T09:26:16
   StartTime=2024-12-16T09:26:16 EndTime=2024-12-16T21:26:16 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-12-16T09:26:16 Scheduler=Main
   Partition=cpu AllocNode:Sid=GPU7220:3157243
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=GPUC870
   BatchHost=GPUC870
   NumNodes=1 NumCPUs=176 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=700G,node=1,billing=1
   AllocTRES=cpu=176,mem=700G,node=1,billing=176
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=175 MinMemoryNode=700G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/sh


This seems to be a bug/misconfiguration, can you please advise? I am trying to create GPU/CPU partitions on our nodes and I fear our CPU users can block access to the GPUs right now due to over-memory/core consumption.
Comment 15 Ethan Simmons 2024-12-17 15:28:32 MST
I believe there's been a misunderstanding with the parameter MaxMemPerNode. From the documentation (link 1):
> "Maximum real memory size available per allocated node in a job allocation in megabytes"
So this limit exists for a single job allocation. If you want to limit how much memory should be allocated to any node, set the RealMemory 
on that node. The jobs don't use up all of the node's resources, so they won't be pending on each other.

> In fact it seems like --mem can circumvent the MaxCPUsPerNode restriction as well.
We previously looked at this, which is why I recommended adding the MaxMemPerNode restriction. To figure out what those limits should be, 
compute the following:
  MemPerNode = CPUsPerNode * MemPerCPU
So if you want to have 152 CPUs at most, with 4 GB per CPU, 
  608 GB = 152 * 4 GB
Those are the limits you should set. That way users can't circumvent the CPU restriction by requesting too much memory. They can't 
circumvent the memory restriction by requesting too many CPUs. Etc. Does that make sense?
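In slurm.conf terms (these limits are specified in megabytes), that works out to MaxMemPerNode=622592 for a partition with MaxCPUsPerNode=152 and 4 GB (4096 MB) per CPU, since 152 * 4096 = 622592.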

> I fear our CPU users can block access to the GPUs right now due to over-memory/core consumption.
It sounds like you're worried about resource starvation. I'd recommend looking at using a QOS to limit resources, especially if you want 
these limits to apply across more than just a single job. See link 2 for more details.

Severity 2 is reserved for critical loss that regularly impacts users. I'm returning this to a severity 3, unless there's been a new 
development.

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_MaxMemPerNode
[2] https://slurm.schedmd.com/qos.html#limits
Comment 16 Jeremy Sullivan 2024-12-17 15:49:24 MST
> So this limit exists for a single job allocation. If you want to limit how much memory should be allocated to any node, set the RealMemory on that node.
> The jobs don't use up all of the node's resources, so they won't be pending on each other.

I would like to limit how much memory can be allocated to the partition as a whole on each node, not just to a single job or to the entire node. All of our nodes reside in both the cpu and gpu partitions. I am attempting to divide each node up into segments. For example, if a node has 1 TB of RAM (RealMemory=1TB), I would like the max memory for the "cpu" partition to be 600GB and the max memory for the "gpu" partition to be 400GB.

> Those are the limits you should set. That way users can't circumvent the CPU restriction by requesting too much memory. They can't circumvent the memory restriction by requesting too many CPUs. Etc. Does that make sense?

I thought I was already accomplishing this by setting MaxCPUsPerNode and MaxMemPerCPU on the partition, but it seems --mem can circumvent this. And if MaxMemPerNode only restricts a single job allocation, then that would not be sufficient. 


I am open to using a QOS to accomplish this, would MaxTRESPerNode be the correct value to set?
Comment 17 Jeremy Sullivan 2024-12-18 13:17:29 MST
Also, if I add a QOS to a "live" partition, will that cause downtime for existing/running jobs?
Comment 18 Jeremy Sullivan 2024-12-18 17:14:46 MST
>MaxTRESPerNode
>The maximum number of TRES each node in a job allocation can use.

So I'm concerned this would only limit the max memory for each job and multiple jobs would be able to exceed the 600GB partition limit.
Comment 19 Ethan Simmons 2024-12-19 14:56:49 MST
So it sounds like the big problem is resource starvation for your GPU users. You want the GPU users to be able to use resources, but you're running into the situation where the CPU users have everything 
allocated already. The GPU users might not need to be using as much as the CPU users, but they still need guaranteed access to resources. Does that sound correct?

There are a few solutions we can go with. Feel free to ask for more details on any of them, or any clarifying questions that come up.

- Have dedicated nodes for each group of users. If your hardware is split between GPU nodes and non-GPU nodes, put all GPU nodes in their own partition. This will be dedicated resources for the GPU users that 
can't be taken up by the CPU users. If your hardware setup is such that every node has a GPU, or there's an improper ratio of GPU nodes to non-GPU nodes, this might not be what you're looking for. See link 1 for 
configuring this.

- Use a QOS to tweak how many resources each group has access to. If you want a specific ratio of resources for each partition, setting a QOS for the partition will allow you to do that in a more flexible way 
than the previous solution (link 2). Setting a limit on how many jobs can be running under an account/user and setting limits on individual jobs gives you an upper limit for each account/user. For example, if a 
job is allowed to use at most 100 CPUs and each user is allowed to have 10 jobs, that's 1,000 CPUs at most for each user. If you have 50,000 CPUs in your cluster, each user has 1/50th of the cluster at most. See 
links 3 and 4 for more details, and the rough sketch after this list.

- If you set up restricted GPU cores, certain cores on the node will be only used in GPU jobs. This means CPU users can't starve out a node's cores completely. You will have to update your gres.conf to include 
these cores. See link 5 for details here.

- If you have preemption setup or wish to have it setup, GPU jobs could be configured to preempt CPU jobs. This isn't how every site wants jobs to be configured, so consider reading our documentation on it 
before committing (link 6).

- Increase the priority of GPU jobs. This will lead to GPU jobs waiting in the queue for a shorter time. They'll still have the same resources as other jobs, but they'll get allocated faster. A QOS can work 
great for this. See link 7 for details.

These solutions can be used together or on their own, depending on what your specific needs are. Each solution gives you control over one aspect of how resources are divided/restricted.
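As a rough sketch of the partition QOS idea from the second item above (the QOS name and limit values are only illustrative, and limit enforcement has to be enabled via your AccountingStorageEnforce setting for any of this to take effect):

$ sacctmgr add qos cpu_part MaxTRESPerJob=cpu=100 MaxJobsPerUser=10

and then attach it to the partition in slurm.conf:

PartitionName=cpu ... QOS=cpu_part ...

Every job submitted to that partition is then measured against the QOS limits. Since your cpu partition already carries QOS=nogpu, you could also just add the limits to that existing QOS instead of creating a new one.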

> So I'm concerned this would only limit the max memory for each job and multiple jobs would be able to exceed the 600GB partition limit.
We're going to be looking into the documentation for the parameters initially discussed. It does seem the wording could improve on our end. If you have multiple jobs, MaxTRESPerNode can feel like it is being 
violated. Each job individually is measured against that limit, not combined with other jobs. Two different jobs combined could then have more resources allocated than you might expect, but it wouldn't be a bug. 
Thank you for pointing out this weakness of our docs.

I know this was a lot so please let me know what is confusing or needs more clarification.


[1] https://slurm.schedmd.com/slurm.conf.html#SECTION_PARTITION-CONFIGURATION
[2] https://slurm.schedmd.com/qos.html#partition
[3] https://slurm.schedmd.com/sacctmgr.html#SECTION_SPECIFICATIONS-FOR-QOS
[4] https://slurm.schedmd.com/resource_limits.html#qos
[5] https://slurm.schedmd.com/slurm.conf.html#OPT_RestrictedCoresPerGPU
[6] https://slurm.schedmd.com/preempt.html
[7] https://slurm.schedmd.com/qos.html#priority
Comment 20 Jeremy Sullivan 2024-12-19 15:33:31 MST
> The GPU users might not need to be using as much as the CPU users, but they still need guaranteed access to resources. Does that sound correct?

Yes that is correct, I want the GPU partitions CPU and memory to be untouchable by non-GPU users/jobs.

> Have dedicated nodes for each group of users.

We only have "GPU nodes" in our cluster, but they have a significant amount of CPU/RAM that goes unutilized by the GPU workloads. We want to make these idle resources available.

> Use a QOS to tweak how many resources each group has access to.

To clarify there are both CPU-only and GPU users in all of our slurm groups. We don't have a static set of "CPU-only" and "GPU-only" users/groups. Again I am totally open to using QOS, but I don't think we can apply it just to specific groups. 

> For example, if a job is allowed to use at most 100 CPUs and each user is allowed to have 10 jobs, that's 1,000 CPUs at most for each user.

But couldn't multiple jobs (from different users/groups) still consume more than X CPU/Mem on a certain node, thus blocking GPU access?

> If you set up restricted GPU cores, certain cores on the node will be only used in GPU jobs.

This is interesting, though it seems a bit heavy-handed. I'm open to it, but we would still be vulnerable to --mem overallocation, correct? 

> If you have preemption setup or wish to have it setup, GPU jobs could be configured to preempt CPU jobs

This would be a pretty extreme/dire solution for us.

We really would just like to "split each node in ~half", with GPUs on one side and the excess CPU/RAM on the other. MaxCPUsPerNode and MaxMemPerCPU on a partition seem like a perfect way to do that. I am honestly still having a hard time understanding why it makes sense for --mem to be able to violate the MaxMemPerCPU constraint, and why this isn't a bug.
Comment 21 Jeremy Sullivan 2024-12-19 15:54:41 MST
In fact this section of the docs describes our exact use case: https://slurm.schedmd.com/slurm.conf.html#OPT_MaxCPUsPerNode

> Maximum number of CPUs on any node available to all jobs from this partition. This can be especially useful to schedule GPUs. For example a node can be associated with two Slurm partitions (e.g. "cpu" and "gpu") and the partition/queue "cpu" could be limited to only a subset of the node's CPUs, ensuring that one or more CPUs would be available to jobs in the "gpu" partition/queue.

Also, this limit applies "to all jobs from this partition", not just a single job allocation. This is exactly what we need, just enforced for memory as well. 

Also I noticed in this section: https://slurm.schedmd.com/slurm.conf.html#OPT_MaxMemPerCPU_1

> MaxMemPerCPU and MaxMemPerNode are mutually exclusive.

So setting both as recommended above isn't actually possible?
Comment 22 Ethan Simmons 2024-12-20 10:13:28 MST
> We don't have a static set of "CPU-only" and "GPU-only" users/groups
You mentioned there were different partitions for CPU and GPU users. You can set a different QOS to each partition, and those resource restrictions will apply to all jobs on that partition.

> But couldn't multiple jobs (from different users/groups) still consume more than X CPU/Mem on a certain node, thus blocking GPU access?
Yes. A QOS limits what a group uses overall; it doesn't split nodes in half. Used together with GPU restricted cores, though, you can effectively split nodes in half, including memory. For example:
- Nodes have 10 CPUs, 10GB of memory, and 1 GPU
- Cores 0 and 1 are restricted to just GPU use
If you wanted memory allocated to GPU users that the CPU users can't touch, set a similar limit on the CPU partition:
- CPU users can get at most 1GB of memory per cpu
Now CPU users can't request any more memory than 8GB per node (they only have access to 8 cores), leaving 2GB for GPU users. Even if 2 different CPU jobs land on that node, the first 2 cores are restricted. All 
CPU jobs combined will be able to access 8 cores per node, and therefore only 8GB of memory per node. Of course, this needs to be adjusted for your specific hardware and the split you want.
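To sketch that toy example in slurm.conf terms (node and partition names are illustrative, the GPU still needs its matching gres.conf entry, and RestrictedCoresPerGPU is a newer node-level parameter, so check that your Slurm release supports it before relying on it):

NodeName=toynode CPUs=10 RealMemory=10240 Gres=gpu:1 RestrictedCoresPerGPU=2 State=UNKNOWN
PartitionName=gpu Nodes=toynode State=UP
PartitionName=cpu Nodes=toynode MaxMemPerCPU=1024 State=UP

That is where the 8GB ceiling for CPU jobs comes from: they can only ever touch 8 of the 10 cores, and each of those cores can carry at most 1024 MB, so 8 * 1024 MB = 8192 MB per node, leaving 2GB (plus the two restricted cores) for the gpu partition.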

You can skip over preemption, it sounds like it doesn't get you the desired 'split nodes in half'.

> MaxCPUsPerNode
You can get a similar result by having GPU restricted cores. This is a set of cores that can only be used by GPU jobs on that node. It is a newer feature that is meant to fix this exact problem. You can save 
memory by setting a cap on memory per cpu.

> So setting both as recommended above isn't actually possible?
You won't need to set MaxMemPerNode if you set restricted GPU cores and put a cap on memory per CPU.

Let me know if there are questions here.
Comment 23 Jeremy Sullivan 2024-12-20 11:55:23 MST
> You mentioned there were different partitions for CPU and GPU users. You can set a different QOS to each partition, and those resource restrictions will apply to all jobs on that partition.

That is great, but what is the specific QOS option I can use to limit the "maximum amount of RAM on any node available to all jobs from this partition", similar to MaxCPUsPerNode?

> You can get a similar result by having GPU restricted cores. This is a set of cores that can only be used by GPU jobs on that node. It is a newer feature that is meant to fix this exact problem. You can save memory by setting a cap on memory per cpu.

OK, but this is what I have been trying to say: there is no effective way to set a cap on memory per CPU. I have MaxCPUsPerNode and MaxMemPerCPU set on my partition, but that limit is easily violated just by passing in a --mem value greater than MaxCPUsPerNode * MaxMemPerCPU. I don't see how swapping in GPU restricted cores instead of MaxCPUsPerNode will remedy the memory limit situation.

Please see the example below again that illustrates the violation. Again, I feel this is a bug and the MaxCPUsPerNode * MaxMemPerCPU memory limit should be enforced.

--------------- violation example ---------------------
I have the following partition that should only allow 152 cores and 152 * 4GB = 608GB of RAM:

PartitionName=cpu Default=YES Nodes=h100nodes \
  QOS=nogpu MaxCPUsPerNode=152 MaxMemPerCPU=4096 \
  DefaultTime=12:00:00 MaxTime=5-00:00:00 \
  MaxNodes=1 State=UP

If I try to request more than 152 cores it is correctly rejected:

salloc -c 172 -p cpu
salloc: error: Job submit/allocate failed: Requested node configuration is not available

However, if I request more than 608GB of RAM, the request is accepted and I am allocated more than 152 cores and more than 608GB of RAM:

$ salloc --mem 700G -p cpu
salloc: Pending job allocation 69267
salloc: job 69267 queued and waiting for resources
salloc: job 69267 has been allocated resources
salloc: Granted job allocation 69267
salloc: Nodes GPUC870 are ready for job

job config showing 176 cores allocated:

$ scontrol show job 69267
JobId=69267 JobName=interactive
   UserId=jeremys(10001) GroupId=arcinfra(10001) MCS_label=N/A
   Priority=1 Nice=0 Account=arcinfra QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:59 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2024-12-16T09:26:16 EligibleTime=2024-12-16T09:26:16
   AccrueTime=2024-12-16T09:26:16
   StartTime=2024-12-16T09:26:16 EndTime=2024-12-16T21:26:16 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-12-16T09:26:16 Scheduler=Main
   Partition=cpu AllocNode:Sid=GPU7220:3157243
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=GPUC870
   BatchHost=GPUC870
   NumNodes=1 NumCPUs=176 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=700G,node=1,billing=1
   AllocTRES=cpu=176,mem=700G,node=1,billing=176
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=175 MinMemoryNode=700G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/sh
-------------------------------------------

This shows that passing in --mem allows the job to violate both the MaxCPUsPerNode=152 and MaxMemPerCPU=4096 limits by being allocated 176 cores and 700G of RAM.


I still don't see how those violations are not a bug?
Comment 24 Ethan Simmons 2024-12-26 16:59:17 MST
Thank you for the details; that has helped me see the same behavior as you. My apologies for the miscommunication around us seeing different 
behavior. I believe we're both working with the same setup now.

MaxCPUsPerNode isn't being honored here because of --mem. I couldn't find a different set of arguments that could allocate more CPUs, so 
it's only showing up with the --mem flag.

Looking over the documentation, the intended behavior isn't very clear here.
I'm looking at links 1-3. Several sources state that --mem-per-cpu and --mem are mutually exclusive, but here we're submitting jobs with 
MaxMemPerCPU and --mem. The docs state that a higher --mem value will increase the CPU count, but it's not clear in the newest docs why this 
can go over partition limits. The --mem argument is meant to be used only on select/linear clusters, while --mem-per-cpu is meant as a 
select/cons_tres parameter only. It could be that this combination wasn't considered and was missed in the code, or that it isn't 
supported but that isn't made clear in the documentation. Either way, something does need to change. Thank you for pushing this; I'll figure out whether 
this is a bug or a gap in our documentation.

As a way for you to prevent issues in your current configuration, set up a job submit plugin (link 4) that denies jobs with --mem or removes 
the --mem flag, whichever suits your users best. Then specify MaxMemPerCPU and GPU restricted cores. This will guarantee some amount of 
memory for GPU jobs.
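For reference, here is a minimal job_submit.lua sketch of the "deny --mem" idea. Treat it as an illustration rather than a drop-in script: the partition name is a placeholder, and the way the memory request is encoded (the MEM_PER_CPU flag bit on pn_min_memory) should be checked against the job_submit/lua documentation for your release.

function slurm_job_submit(job_desc, part_list, submit_uid)
   -- Only police the shared CPU partition (placeholder name).
   if job_desc.partition == "cpu" then
      -- pn_min_memory carries the memory request. Values below the
      -- MEM_PER_CPU flag bit (2^63) came from --mem (per node);
      -- values at or above it came from --mem-per-cpu or no request at all.
      local MEM_PER_CPU_FLAG = 2.0^63
      local mem = job_desc.pn_min_memory
      if mem ~= nil and mem < MEM_PER_CPU_FLAG then
         slurm.log_user("please request memory with --mem-per-cpu instead of --mem in the cpu partition")
         return slurm.ERROR
      end
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end

This would be enabled with JobSubmitPlugins=lua in slurm.conf, with the script saved as job_submit.lua in the same directory as slurm.conf.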

[1] https://slurm.schedmd.com/srun.html#OPT_cpus-per-task
[2] https://slurm.schedmd.com/srun.html#OPT_mem
[3] https://slurm.schedmd.com/srun.html#OPT_mem-per-cpu
[4] https://slurm.schedmd.com/job_submit_plugins.html