There is an issue with the mem-per-cpu argument and how it is used to calculate the total memory required to run a job. The memory limits and how they are enforced are incorrect. I am not certain where the bug is, but I have an extensive list of tests below which may be illuminating.

Our partitions use DefMemPerCPU and MaxMemPerCPU to reserve CPUs based on memory requests and to cap the maximum amount of memory that can be allocated per CPU. An example of one of our partitions is:

PartitionName=work AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=nid00[1008-1323]
PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=80896 TotalNodes=316 SelectTypeParameters=NONE
JobDefaults=(null) DefMemPerCPU=920 MaxMemPerCPU=1840
TRESBillingWeights=CPU=1

The total amount of memory available for nodes in this partition should be 230GB (or 235520M), as seen by:

salloc --mem=235520 -p work
salloc: Granted job allocation 62908
salloc: Waiting for resource configuration

salloc --mem=235521 -p work
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 62909 has been revoked.

(I will drop the -p work from the following examples.) If I request

salloc --ntasks-per-node=1 --threads-per-core=1 --cpus-per-task=1 --mem-per-cpu=920

scontrol show jobid reports

JobId=62915 JobName=interactive
UserId=pelahi(22063) GroupId=pelahi(22063) MCS_label=pawsey0001
Priority=47 Nice=0 Account=pawsey0001 QOS=exhausted
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:05 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=11:47:23 EligibleTime=11:47:23 AccrueTime=Unknown
StartTime=11:47:23 EndTime=11:47:28 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=11:47:23 Scheduler=Main
Partition=work AllocNode:Sid=setonix-02-can:164427
ReqNodeList=(null) ExcNodeList=(null)
NodeList=nid001211 BatchHost=nid001211
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=2,mem=920M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=920M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null) WorkDir=/home/pelahi Power=

Here the critical item is that the memory requested is 920M: since I am asking for 1 task, 1 cpu and threads-per-core=1, the memory reservation is 920M, despite Slurm reserving 2 CPUs (reserving a physical core always reserves both of its 2 virtual cores). If I direct the reservation to nodes which allow 2 threads per core (all nodes have this available):

salloc --ntasks-per-node=1 --threads-per-core=2 --cpus-per-task=1 --mem-per-cpu=920

I get

NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:2
TRES=cpu=2,mem=1840M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=920M MinTmpDiskNode=0

Now the total mem is double the value it should be, based on my interpretation of what --threads-per-core should do at the salloc stage: "Restrict node selection to nodes with at least the specified number of threads per core. In task layout, use the specified maximum number of threads per core.
NOTE: "Threads" refers to the number of processing units on each core rather than the number of application tasks to be launched per core." Why does it double the memory request? I can even add `--ntask-per-core=1` and it should reserve 920. for all variations of `ntasks-per-node=1` and `cpus-per-task` if `mem-per-cpu` is provided, and `threads-per-core=1` the resulting memory reservations is simple number of cores on a node * the value provided, so long as it can fit on the node: salloc --ntasks-per-node=1 --threads-per-core=1 --cpus-per-task=128 --ntasks-per-core=1 --mem-per-cpu=920 NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=128 ReqB:S:C:T=0:0:*:1 TRES=cpu=256,mem=115G,node=1,billing=256 Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=* MinCPUsNode=128 MinMemoryCPU=920M MinTmpDiskNode=0 salloc --ntasks-per-node=1 --threads-per-core=1 --cpus-per-task=128 --ntasks-per-core=1 --mem-per-cpu=1840 NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=128 ReqB:S:C:T=0:0:*:1 TRES=cpu=256,mem=230G,node=1,billing=256 Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=* MinCPUsNode=128 MinMemoryCPU=1840M MinTmpDiskNode=0 Changing the `--threads-per-core` does not affect this: salloc --ntasks-per-node=1 --threads-per-core=2 --cpus-per-task=256 --ntasks-per-core=1 --mem-per-cpu=920 NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=256 ReqB:S:C:T=0:0:*:2 TRES=cpu=256,mem=230G,node=1,billing=256 Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=* MinCPUsNode=256 MinMemoryCPU=920M MinTmpDiskNode=0 salloc --ntasks-per-node=1 --threads-per-core=1 --cpus-per-task=128 --ntasks-per-core=1 --mem-per-cpu=1840 NumNodes=1 NumCPUs=128 NumTasks=1 CPUs/Task=128 ReqB:S:C:T=0:0:*:2 TRES=cpu=128,mem=230G,node=1,billing=128 Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=* MinCPUsNode=128 MinMemoryCPU=1840M MinTmpDiskNode=0 (Note that the last of the above reports the incorrect number of cpus reserved due to the memory requested and the DefMemPerCPU, which should reserve and bill for 256) If one requests memory that would exceed the 230GB given the number of --cpus-per-task, the request should fail but does not iff threads-per-core=2. For threads-per-core=1, the following is seen salloc --ntasks-per-node=1 --threads-per-core=1 --cpus-per-task=128 --ntasks-per-core=1 --mem=230G NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=128 ReqB:S:C:T=0:0:*:1 TRES=cpu=256,mem=230G,node=1,billing=256 Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=* MinCPUsNode=128 MinMemoryNode=230G MinTmpDiskNode=0 salloc --ntasks-per-node=1 --threads-per-core=1 --cpus-per-task=128 --ntasks-per-core=1 --mem=237G salloc: error: Job submit/allocate failed: Requested node configuration is not available salloc --ntasks-per-node=1 --threads-per-core=1 --cpus-per-task=128 --ntasks-per-core=1 --mem-per-cpu=1850 salloc: error: Job submit/allocate failed: Requested node configuration is not available Both failed requests exceed the 230G in the configuration. 
Yet I can exceed this memory limit by setting threads-per-core=2:

salloc --ntasks-per-node=1 --threads-per-core=2 --cpus-per-task=128 --ntasks-per-core=1 --mem-per-cpu=1850
NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=256 ReqB:S:C:T=0:0:*:2
TRES=cpu=256,mem=236800M,node=1,billing=256
Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=*
MinCPUsNode=128 MinMemoryCPU=925M MinTmpDiskNode=0

salloc --ntasks-per-node=1 --threads-per-core=2 --cpus-per-task=128 --ntasks-per-core=1 --mem=239G
NumNodes=1 NumCPUs=128 NumTasks=1 CPUs/Task=128 ReqB:S:C:T=0:0:*:2
TRES=cpu=128,mem=239G,node=1,billing=128
Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=*
MinCPUsNode=134 MinMemoryNode=239G MinTmpDiskNode=0

salloc --ntasks-per-node=1 --threads-per-core=2 --cpus-per-task=256 --ntasks-per-core=1 --mem=239G
NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=256 ReqB:S:C:T=0:0:*:2
TRES=cpu=256,mem=239G,node=1,billing=256
Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=*
MinCPUsNode=256 MinMemoryNode=239G MinTmpDiskNode=0

There is a bug here when threads-per-core is not 1.

Additionally, the total memory calculated when providing mem-per-cpu is incorrect for MPI allocations with more than one socket's worth of MPI tasks. Consider the following request:

salloc --ntasks-per-node=64 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1840

scontrol reports

NumNodes=1 NumCPUs=128 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=128,mem=115G,node=1,billing=128
Socks/Node=* NtasksPerN:B:S:C=64:0:*:1 CoreSpec=*
MinCPUsNode=64 MinMemoryCPU=1840M MinTmpDiskNode=0

yet seff reports

Cores per node: 128
Memory Efficiency: 0.00% of 230.00 GB

This is double the memory. This could be a bug in the seff tool, yet we observe the following behaviour as we go above one socket's worth of physical cores:

salloc --ntasks-per-node=65 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1840
NumNodes=1 NumCPUs=130 NumTasks=65 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=130,mem=119600M,node=1,billing=130
Socks/Node=* NtasksPerN:B:S:C=65:0:*:1 CoreSpec=*
MinCPUsNode=65 MinMemoryCPU=1840M MinTmpDiskNode=0

salloc --ntasks-per-node=66 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1840
NumNodes=1 NumCPUs=132 NumTasks=66 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=132,mem=121440M,node=1,billing=132
Socks/Node=* NtasksPerN:B:S:C=66:0:*:1 CoreSpec=*
MinCPUsNode=66 MinMemoryCPU=1840M MinTmpDiskNode=0

salloc --ntasks-per-node=67 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1840
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 63482 has been revoked.

The --ntasks-per-node=67 request should be acceptable, since it only asks for an additional 1840M of memory, yet it fails. If we look at what seff reports for the 65- and 66-task requests, we see the following:

- for 65: Cores per node: 130, Memory Efficiency: 0.00% of 233.59 GB
- for 66: Cores per node: 132, Memory Efficiency: 0.00% of 237.19 GB

So double the value of what is reported by scontrol. And these values, if correct, exceed the 230GB limit that should be enforced.
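To make the doubling I am describing explicit (again my own arithmetic, in MB, assuming the total is task count times --mem-per-cpu):

echo $(( 64 * 1840 ))       # 117760 MB ~ 115 GB, what scontrol reports
echo $(( 64 * 1840 * 2 ))   # 235520 MB = 230 GB, what seff reports
echo $(( 65 * 1840 * 2 ))   # 239200 MB ~ 233.59 GB, seff for the 65-task job
echo $(( 67 * 1840 ))       # 123280 MB, well under 235520 MB, yet the 67-task request is rejected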
By playing with the request, I find I am able to allocate a job with 67 tasks and --mem-per-cpu=1828:

salloc --ntasks-per-node=67 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1828
NumNodes=1 NumCPUs=134 NumTasks=67 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=134,mem=122476M,node=1,billing=134
Socks/Node=* NtasksPerN:B:S:C=67:0:*:1 CoreSpec=*

As I increase the number of tasks, the amount of memory I can request via --mem-per-cpu also decreases. The result of these requests seems to indicate that there is a hidden limit of roughly 240GB:

salloc --ntasks-per-node=68 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1801
NumNodes=1 NumCPUs=136 NumTasks=68 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=136,mem=122468M,node=1,billing=136
Socks/Node=* NtasksPerN:B:S:C=68:0:*:1 CoreSpec=*
MinCPUsNode=68 MinMemoryCPU=1801M MinTmpDiskNode=0

salloc --ntasks-per-node=128 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=957
NumNodes=1 NumCPUs=256 NumTasks=128 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=256,mem=122496M,node=1,billing=256
Socks/Node=* NtasksPerN:B:S:C=128:0:*:1 CoreSpec=*
MinCPUsNode=128 MinMemoryCPU=957M MinTmpDiskNode=0

Assuming a hidden factor of 2, these give total requests of 244952, 244936 and 244992, something close to 245000. This number is nowhere to be found in our config. Even on other partitions with different amounts of memory, similar behaviour is seen, though the limit up to which Slurm will still allocate a job is different and likewise not based on any particular config value. For instance, the highmem partition has the following config: DefMemPerCPU=3950 MaxMemPerCPU=7900, so it should have a maximum of 1011200M, but I am hitting an apparent limit (with a hidden factor of 2) of 1019904, which is just slightly more than what should be enforced. Why is this happening?!

The confusing memory reservation also depends on threads-per-core and ntasks-per-core. Further examples of my confusion, where I alter threads-per-core and ntasks-per-core and check the resulting memory request through scontrol, are:

1) salloc --ntasks-per-node=67 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1828
NumNodes=1 NumCPUs=134 NumTasks=67 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=134,mem=122476M,node=1,billing=134
Socks/Node=* NtasksPerN:B:S:C=67:0:*:1 CoreSpec=*
MinCPUsNode=67 MinMemoryCPU=1828M MinTmpDiskNode=0

2) salloc --ntasks-per-node=67 --threads-per-core=2 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1828
NumNodes=1 NumCPUs=134 NumTasks=67 CPUs/Task=1 ReqB:S:C:T=0:0:*:2
TRES=cpu=134,mem=244952M,node=1,billing=134
Socks/Node=* NtasksPerN:B:S:C=67:0:*:1 CoreSpec=*
MinCPUsNode=67 MinMemoryCPU=1828M MinTmpDiskNode=0

3) salloc --ntasks-per-node=67 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=2 --mem-per-cpu=1828
NumNodes=1 NumCPUs=134 NumTasks=67 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=134,mem=122476M,node=1,billing=134
Socks/Node=* NtasksPerN:B:S:C=67:0:*:2 CoreSpec=*
MinCPUsNode=67 MinMemoryCPU=1828M MinTmpDiskNode=0

4) salloc --ntasks-per-node=67 --threads-per-core=2 --cpus-per-task=1 --ntasks-per-core=2 --mem-per-cpu=1828
NumNodes=1 NumCPUs=68 NumTasks=67 CPUs/Task=1 ReqB:S:C:T=0:0:*:2
TRES=cpu=68,mem=124304M,node=1,billing=68
Socks/Node=* NtasksPerN:B:S:C=67:0:*:2 CoreSpec=*
MinCPUsNode=67 MinMemoryCPU=1828M MinTmpDiskNode=0

These examples are confusing. What is the formula used to calculate the memory if one provides --mem-per-cpu? What is the "cpu" in mem-per-cpu?
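My own attempt at back-calculating those four cases (all in MB; the multipliers below are simply whatever reproduces the reported totals, not a documented formula):

echo $(( 67 * 1828 ))       # 122476, matches 1) and 3)
echo $(( 67 * 2 * 1828 ))   # 244952, matches 2), i.e. doubled when --threads-per-core=2
echo $(( 68 * 1828 ))       # 124304, matches 4), i.e. the 68 allocated CPUs times --mem-per-cpu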
Based on results from 1) and 4), the memory reserved should correspond to just the number of execution threads running. 3) indicates that --ntasks-per-core has no impact on the memory required. 2) is the outlier and is incorrect for --threads-per-core=2.

The odd calculation of memory also impacts hybrid resource requests:

1) salloc --ntasks-per-node=34 --threads-per-core=1 --cpus-per-task=2 --ntasks-per-core=1 --mem-per-cpu=1828
NumNodes=1 NumCPUs=136 NumTasks=34 CPUs/Task=2 ReqB:S:C:T=0:0:*:1
TRES=cpu=136,mem=124304M,node=1,billing=136
Socks/Node=* NtasksPerN:B:S:C=34:0:*:1 CoreSpec=*
MinCPUsNode=68 MinMemoryCPU=1828M MinTmpDiskNode=0

2) salloc --ntasks-per-node=34 --threads-per-core=2 --cpus-per-task=2 --ntasks-per-core=1 --mem-per-cpu=1828
NumNodes=1 NumCPUs=68 NumTasks=34 CPUs/Task=2 ReqB:S:C:T=0:0:*:2
TRES=cpu=68,mem=124304M,node=1,billing=68
Socks/Node=* NtasksPerN:B:S:C=34:0:*:1 CoreSpec=*
MinCPUsNode=68 MinMemoryCPU=1828M MinTmpDiskNode=0

3) salloc --ntasks-per-node=34 --threads-per-core=2 --cpus-per-task=3 --ntasks-per-core=1 --mem-per-cpu=1828
salloc: Pending job allocation 63585
salloc: job 63585 queued and waiting for resources
HANGS INDEFINITELY

It is odd that 3) hangs indefinitely, given that the physical cores required are 34 * 2 = 68 and the memory should be 34 * 3 * 1828 = 186456M. If I reduce the memory requirement, I see that in fact the memory being requested is 34 * 4 * 1828, which cannot be satisfied by any node. Yet it is not rejected. Why is it not rejected?

To show that it is using 34 * 4 and not 34 * 3 as the multiplier:

1) salloc --ntasks-per-node=34 --threads-per-core=2 --cpus-per-task=2 --ntasks-per-core=1 --mem-per-cpu=920
NumNodes=1 NumCPUs=68 NumTasks=34 CPUs/Task=2 ReqB:S:C:T=0:0:*:2
TRES=cpu=68,mem=62560M,node=1,billing=68
Socks/Node=* NtasksPerN:B:S:C=34:0:*:1 CoreSpec=*
MinCPUsNode=68 MinMemoryCPU=920M MinTmpDiskNode=0

2) salloc --ntasks-per-node=34 --threads-per-core=2 --cpus-per-task=3 --ntasks-per-core=1 --mem-per-cpu=920
NumNodes=1 NumCPUs=136 NumTasks=34 CPUs/Task=3 ReqB:S:C:T=0:0:*:2
TRES=cpu=136,mem=125120M,node=1,billing=136
Socks/Node=* NtasksPerN:B:S:C=34:0:*:1 CoreSpec=*
MinCPUsNode=102 MinMemoryCPU=920M MinTmpDiskNode=0

Despite running 34 tasks with 3 cpus per task, the memory reservation is based on NumCPUs, which must be an even number because virtual cores are reserved in pairs (there are 2 per physical core). Yet this should NOT impact the total memory reservation. Again, there is a bug.

If you require further tests, please let me know. Do you know where this bug is arising? And is there a timeline for fixing it?

Cheers, Pascal
Pascal, Sorry for the delay in reply, but since the initial comment is really long it took me some time to read it properly. I'll try to formulate and answer some questions arising from your message; for the parts with commands and outputs I'm going to split it into a few smaller bugs. (Technically I'll be the reporter there, but I'll add you in CC so you'll be notified about the updates there.) It may end up with finding the same root cause, but at this point we'd like to split those so we have the possibility of different resolutions. I see the doubt coming from the impact of --threads-per-core on allocation/step memory, given its current documentation. It's in fact an issue we already discussed in Bug 13879, where a documentation fix is in review. I also see a somewhat hidden question: 'Why doesn't --ntasks-per-core=1 (or other parameters) affect the total memory calculation?' The way I can paraphrase my understanding is that "CPU" (in --mem-per-cpu) refers to allocated CPUs, not to the number of tasks started (Marshall already did a very comprehensive discussion of that in Bug 13879 - he comments on DefMemPerCPU and MaxMemPerCPU too). The physical unit understood as a CPU by Slurm is configurable: you can set the total number of CPUs on a node to the total number of hyper-threads, cores or sockets. You should get email notifications about the other bugs/discussion threads opened. I'll keep this one as a "master ticket" for those. cheers, Marcin
Dear sender, Thank you for your message. I am out of the office until July 11 and will have limited email access while I am away. I will respond to your email when I return. Best, Martijn
Pascal, Before I start opening separate threads, could you please share your slurm.conf so we can focus on the exact case? That way I'll be able to attach it where needed without bothering you with the same request everywhere. cheers, Marcin
Pascal, Could you please take a look at my last comment? I want to comment on and analyse the behaviour in relation to your configuration, not in general. cheers, Marcin
Hi Marcin, Sorry for the late reply, I was away on holidays. I will attach the slurm config, but just to give you a heads up, it has been updated a bit based on replies by SchedMD related to the MCS plugin, amongst others. I haven't run it through my gamut of tests.
Created attachment 25765 [details] slurm config
Pascal, Could you please add the node/partition definitions too? I'm mostly interested in the relation between CPUs/Sockets/CoresPerSocket/ThreadsPerCore. cheers, Marcin
Do you have any SLURM_*, SALLOC_* or SBATCH_* variables set in the environment while testing? cheers, Marcin PS. Sorry for the second email; I just realised from the config that you're on a Cray, and I know that such variables are quite common there.
Could you please take a look at the case? cheers, Marcin
I just wanted to point out that in Bug 13879 we have just merged a related documentation change. The commit is 0754895c970[1]. In the meantime - just a kind reminder about the information requested in previous comments. cheers, Marcin [1]https://github.com/SchedMD/slurm/commit/0754895c970d943570ba6303312c2dc32ec5a44e
Hi Marcin, Sorry for the late reply regarding the environment variables. The only ones that are set are:
SLURM_TIME_FORMAT=relative
SBATCH_EXPORT=NONE
Cheers, Pascal
>Could you please add node/partition definition too? I'm mostly interested in relation between CPUs/Sockets/CoresPerSocket/ThreadsPerCore.
PartitionName=work AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=YES QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nid00[1008-1323] PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=80896 TotalNodes=316 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=920 MaxMemPerCPU=1840 TRESBillingWeights=CPU=1

PartitionName=long AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=4-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nid00[1316-1323] PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=2048 TotalNodes=8 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=920 MaxMemPerCPU=1840 TRESBillingWeights=CPU=1

PartitionName=copy AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=YES MaxCPUsPerNode=UNLIMITED Nodes=dm0[1-8] PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=512 TotalNodes=8 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=1850 MaxMemPerCPU=3700 TRESBillingWeights=CPU=0

PartitionName=askaprt AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nid00[1324-1503] PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=46080 TotalNodes=180 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=920 MaxMemPerCPU=1840 TRESBillingWeights=CPU=1

PartitionName=debug AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=4 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nid00[1000-1007] PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=2048 TotalNodes=8 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=920 MaxMemPerCPU=1840 TRESBillingWeights=CPU=1

PartitionName=highmem AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nid00[1504-1511] PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=2048 TotalNodes=8 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=3950 MaxMemPerCPU=7900 TRESBillingWeights=CPU=1
A typical node for the work, debug, askaprt and long partitions:

NodeName=nid001008 Arch=x86_64 CoresPerSocket=64
CPUAlloc=256 CPUTot=256 CPULoad=61.15
AvailableFeatures=AMD_EPYC_7763
ActiveFeatures=AMD_EPYC_7763
Gres=(null)
NodeAddr=nid001008-nmn NodeHostName=nid001008 Version=21.08.8-2
OS=Linux 5.3.18-24.75_10.0.189-cray_shasta_c #1 SMP Sun Sep 26 14:27:04 UTC 2021 (0388af5)
RealMemory=245000 AllocMem=235520 FreeMem=229461 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=work
BootTime=17 Jun 12:28 SlurmdStartTime=5 Jul 10:41
LastBusyTime=04:26:02
CfgTRES=cpu=256,mem=245000M,billing=256
AllocTRES=cpu=256,mem=230G
CapWatts=n/a
CurrentWatts=0 AveWatts=0

A typical node for highmem:

NodeName=nid001504 Arch=x86_64 CoresPerSocket=64
CPUAlloc=0 CPUTot=256 CPULoad=0.00
AvailableFeatures=AMD_EPYC_7763
ActiveFeatures=AMD_EPYC_7763
Gres=(null)
NodeAddr=nid001504-nmn NodeHostName=nid001504 Version=21.08.8-2
OS=Linux 5.3.18-24.75_10.0.189-cray_shasta_c #1 SMP Sun Sep 26 14:27:04 UTC 2021 (0388af5)
RealMemory=1020000 AllocMem=0 FreeMem=1017774 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=highmem
BootTime=17 Jun 12:29 SlurmdStartTime=5 Jul 10:41
LastBusyTime=Ystday 12:50
CfgTRES=cpu=256,mem=1020000M,billing=256
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
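(As an aside, RealMemory=245000 on the work nodes does line up with the apparent ~245000M ceiling I kept hitting earlier, if the hidden factor of 2 is real; a quick check of two of the products I reported:)

echo $(( 67 * 1828 * 2 ))   # 244952 MB, just under RealMemory=245000
echo $(( 128 * 957 * 2 ))   # 244992 MB, likewise just under 245000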
Pascal, Trying to address the main source of confusion - besides what we clarified in the mentioned documentation change. I see your question:

>These examples are confusing. What is the formula used to calculate the mem if one provides --mem-per-cpu? What is the "cpu" in mem-per-cpu?

In your case you have configured the number of CPUs on the node to be #Sockets * #CoresPerSocket * #ThreadsPerCore:

>NodeName=nid001008 Arch=x86_64 CoresPerSocket=64
> CPUAlloc=0 CPUTot=256 CPULoad=0.00
> RealMemory=245000 AllocMem=235520 FreeMem=229461 Sockets=2 Boards=1
> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A

With ThreadsPerCore=2, CoresPerSocket=64 and Sockets=2 you get CPUs=256. This number of CPUs is calculated by default if you don't specify it; however, it's also possible to configure the Slurm NodeName=... line like:

>NodeName=DEFAULT Sockets=4 CoresPerSocket=4 ThreadsPerCore=2 CPUs=16 RealMemory=15000

As you can see, in the above line the total number of CPUs is equal to the total number of cores (not threads), which defines the meaning of a CPU:

# srun --mem-per-cpu=10 /bin/bash -c 'scontrol show job $SLURM_JOB_ID' | egrep '(TRES|Core)'
TRES=cpu=1,mem=10M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*

This job is allocated a whole core; however, the core is accounted as a single CPU, and the meaning of --mem-per-cpu is consistent with that definition. In your configuration, Slurm interprets hardware hyper-threads as CPUs; however, we never assign the same core to two job allocations. It is possible, though, to run two job steps within the same allocation on dedicated hyperthreads. I'll split the remaining questions into separate bugs as mentioned before. Please let me know if the explanation above helps or raises additional questions. cheers, Marcin
Thanks for the info Marcin, and again sorry for the delay in replying. So --mem-per-cpu has a fluid definition of "cpu". Not something I think is ideal. Is it possible to request a --mem-per-thread option that would not have a fluid definition? Cheers, Pascal
>So --mem-per-cpu has a fluid definition of "cpu".[...]
It's as fluid as the definition of "CPU" in Slurm: depending on your configuration, the number of CPUs a job is accounted for may be equal to hyper-threads or cores.
>Is it possible to request a --mem-per-thread option that would not have a fluid definition?
I can check with our senior developers whether such an enhancement request is something we may be interested in; however, as an enhancement it will require sponsorship. Is that something you may be interested in? cheers, Marcin
I am on holiday from 05-August-2022 till 21-August-2022
Pascal, We're having an internal discussion about the potential enhancement. I'll let you know once we establish something. cheers, Marcin
Pascal, We did an initial code check for the potential development of --mem-per-core/--mem-per-thread. As a feature request it will require sponsorship and can be done in Slurm 23.11 at the earliest. Are you interested? cheers, Marcin
Pascal, I'm closing the case as "information given". Please feel free to reopen if you're interested in sponsoring the feature development. cheers, Marcin