There is an issue with the mem-per-cpu argument and how it is used to calculate the total memory required to run a job. The memory limits and how they are enforced are incorrect. I am not certain where the bug is, but I have an extensive list of tests below which may be illuminating.

Our partitions use DefMemPerCPU and MaxMemPerCPU to reserve CPUs based on memory requests and to cap the maximum amount of memory that can be allocated per CPU. An example of one of our partitions is:

PartitionName=work AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=nid00[1008-1323]
PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=80896 TotalNodes=316 SelectTypeParameters=NONE
JobDefaults=(null) DefMemPerCPU=920 MaxMemPerCPU=1840
TRESBillingWeights=CPU=1

The total amount of memory available for nodes in this partition should be 230GB (or 235520M), as seen by:

salloc --mem=235520 -p work
salloc: Granted job allocation 62908
salloc: Waiting for resource configuration

salloc --mem=235521 -p work
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 62909 has been revoked.

(I will drop the -p work from the following examples.) If I request

salloc --ntasks-per-node=1 --threads-per-core=1 --cpus-per-task=1 --mem-per-cpu=920

scontrol show jobid reports

JobId=62915 JobName=interactive
UserId=pelahi(22063) GroupId=pelahi(22063) MCS_label=pawsey0001
Priority=47 Nice=0 Account=pawsey0001 QOS=exhausted
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:05 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=11:47:23 EligibleTime=11:47:23 AccrueTime=Unknown
StartTime=11:47:23 EndTime=11:47:28 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=11:47:23 Scheduler=Main
Partition=work AllocNode:Sid=setonix-02-can:164427
ReqNodeList=(null) ExcNodeList=(null)
NodeList=nid001211 BatchHost=nid001211
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=2,mem=920M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=920M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null) WorkDir=/home/pelahi Power=

Here the critical item is that the memory requested is 920M: since I am asking for 1 task, 1 cpu and threads-per-core=1, the memory reservation is 920M, despite Slurm reserving 2 CPUs (reserving a physical core always reserves both of its 2 virtual cores). If I direct the reservation to nodes which allow 2 threads per core (all nodes have this available):

salloc --ntasks-per-node=1 --threads-per-core=2 --cpus-per-task=1 --mem-per-cpu=920

I get

NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:2
TRES=cpu=2,mem=1840M,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=920M MinTmpDiskNode=0

Now the total mem is double the value it should be, based on my interpretation of what --threads-per-core should do at the salloc stage: "Restrict node selection to nodes with at least the specified number of threads per core. In task layout, use the specified maximum number of threads per core.
NOTE: "Threads" refers to the number of processing units on each core rather than the number of application tasks to be launched per core." Why does it double the memory request? I can even add `--ntask-per-core=1` and it should reserve 920. for all variations of `ntasks-per-node=1` and `cpus-per-task` if `mem-per-cpu` is provided, and `threads-per-core=1` the resulting memory reservations is simple number of cores on a node * the value provided, so long as it can fit on the node: salloc --ntasks-per-node=1 --threads-per-core=1 --cpus-per-task=128 --ntasks-per-core=1 --mem-per-cpu=920 NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=128 ReqB:S:C:T=0:0:*:1 TRES=cpu=256,mem=115G,node=1,billing=256 Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=* MinCPUsNode=128 MinMemoryCPU=920M MinTmpDiskNode=0 salloc --ntasks-per-node=1 --threads-per-core=1 --cpus-per-task=128 --ntasks-per-core=1 --mem-per-cpu=1840 NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=128 ReqB:S:C:T=0:0:*:1 TRES=cpu=256,mem=230G,node=1,billing=256 Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=* MinCPUsNode=128 MinMemoryCPU=1840M MinTmpDiskNode=0 Changing the `--threads-per-core` does not affect this: salloc --ntasks-per-node=1 --threads-per-core=2 --cpus-per-task=256 --ntasks-per-core=1 --mem-per-cpu=920 NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=256 ReqB:S:C:T=0:0:*:2 TRES=cpu=256,mem=230G,node=1,billing=256 Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=* MinCPUsNode=256 MinMemoryCPU=920M MinTmpDiskNode=0 salloc --ntasks-per-node=1 --threads-per-core=1 --cpus-per-task=128 --ntasks-per-core=1 --mem-per-cpu=1840 NumNodes=1 NumCPUs=128 NumTasks=1 CPUs/Task=128 ReqB:S:C:T=0:0:*:2 TRES=cpu=128,mem=230G,node=1,billing=128 Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=* MinCPUsNode=128 MinMemoryCPU=1840M MinTmpDiskNode=0 (Note that the last of the above reports the incorrect number of cpus reserved due to the memory requested and the DefMemPerCPU, which should reserve and bill for 256) If one requests memory that would exceed the 230GB given the number of --cpus-per-task, the request should fail but does not iff threads-per-core=2. For threads-per-core=1, the following is seen salloc --ntasks-per-node=1 --threads-per-core=1 --cpus-per-task=128 --ntasks-per-core=1 --mem=230G NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=128 ReqB:S:C:T=0:0:*:1 TRES=cpu=256,mem=230G,node=1,billing=256 Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=* MinCPUsNode=128 MinMemoryNode=230G MinTmpDiskNode=0 salloc --ntasks-per-node=1 --threads-per-core=1 --cpus-per-task=128 --ntasks-per-core=1 --mem=237G salloc: error: Job submit/allocate failed: Requested node configuration is not available salloc --ntasks-per-node=1 --threads-per-core=1 --cpus-per-task=128 --ntasks-per-core=1 --mem-per-cpu=1850 salloc: error: Job submit/allocate failed: Requested node configuration is not available Both failed requests exceed the 230G in the configuration. 
Yet I can exceed this memory limit by setting threads-per-core=2:

salloc --ntasks-per-node=1 --threads-per-core=2 --cpus-per-task=128 --ntasks-per-core=1 --mem-per-cpu=1850
NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=256 ReqB:S:C:T=0:0:*:2
TRES=cpu=256,mem=236800M,node=1,billing=256
Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=*
MinCPUsNode=128 MinMemoryCPU=925M MinTmpDiskNode=0

salloc --ntasks-per-node=1 --threads-per-core=2 --cpus-per-task=128 --ntasks-per-core=1 --mem=239G
NumNodes=1 NumCPUs=128 NumTasks=1 CPUs/Task=128 ReqB:S:C:T=0:0:*:2
TRES=cpu=128,mem=239G,node=1,billing=128
Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=*
MinCPUsNode=134 MinMemoryNode=239G MinTmpDiskNode=0

salloc --ntasks-per-node=1 --threads-per-core=2 --cpus-per-task=256 --ntasks-per-core=1 --mem=239G
NumNodes=1 NumCPUs=256 NumTasks=1 CPUs/Task=256 ReqB:S:C:T=0:0:*:2
TRES=cpu=256,mem=239G,node=1,billing=256
Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=*
MinCPUsNode=256 MinMemoryNode=239G MinTmpDiskNode=0

There is a bug here when threads-per-core is not 1.

Additionally, the total memory calculated when providing mem-per-cpu is incorrect for MPI allocations with more than one socket's worth of MPI tasks. Consider the following request:

salloc --ntasks-per-node=64 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1840

scontrol reports

NumNodes=1 NumCPUs=128 NumTasks=64 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=128,mem=115G,node=1,billing=128
Socks/Node=* NtasksPerN:B:S:C=64:0:*:1 CoreSpec=*
MinCPUsNode=64 MinMemoryCPU=1840M MinTmpDiskNode=0

yet seff reports

Cores per node: 128
Memory Efficiency: 0.00% of 230.00 GB

This is double the memory. This could be a bug in the seff tool, yet we observe the following behaviour as we go above one socket's worth of physical cores:

salloc --ntasks-per-node=65 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1840
NumNodes=1 NumCPUs=130 NumTasks=65 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=130,mem=119600M,node=1,billing=130
Socks/Node=* NtasksPerN:B:S:C=65:0:*:1 CoreSpec=*
MinCPUsNode=65 MinMemoryCPU=1840M MinTmpDiskNode=0

salloc --ntasks-per-node=66 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1840
NumNodes=1 NumCPUs=132 NumTasks=66 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=132,mem=121440M,node=1,billing=132
Socks/Node=* NtasksPerN:B:S:C=66:0:*:1 CoreSpec=*
MinCPUsNode=66 MinMemoryCPU=1840M MinTmpDiskNode=0

salloc --ntasks-per-node=67 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1840
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 63482 has been revoked.

The --ntasks-per-node=67 request should be acceptable, since it only asks for an additional 1840M of memory, yet it fails. If we look at what seff reports for the 65- and 66-task requests, we see the following:

- for 65: Cores per node: 130, Memory Efficiency: 0.00% of 233.59 GB
- for 66: Cores per node: 132, Memory Efficiency: 0.00% of 237.19 GB

So double the value of what is reported by scontrol. And these values, if correct, exceed the 230GB limit that should be enforced.
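To make the doubling I am describing explicit (again my own arithmetic, in MB, assuming the total is task count times --mem-per-cpu):

echo $(( 64 * 1840 ))       # 117760 MB ~ 115 GB, what scontrol reports
echo $(( 64 * 1840 * 2 ))   # 235520 MB = 230 GB, what seff reports
echo $(( 65 * 1840 * 2 ))   # 239200 MB ~ 233.59 GB, seff for the 65-task job
echo $(( 67 * 1840 ))       # 123280 MB, well under 235520 MB, yet the 67-task request is rejected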
By playing with the request, I find I am able to allocate a job with 67 tasks and --mem-per-cpu=1828:

salloc --ntasks-per-node=67 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1828
NumNodes=1 NumCPUs=134 NumTasks=67 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=134,mem=122476M,node=1,billing=134
Socks/Node=* NtasksPerN:B:S:C=67:0:*:1 CoreSpec=*

As I increase the number of tasks, the amount of memory I can request via --mem-per-cpu also decreases. The result of these requests seems to indicate that there is a hidden limit of roughly 240GB:

salloc --ntasks-per-node=68 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1801
NumNodes=1 NumCPUs=136 NumTasks=68 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=136,mem=122468M,node=1,billing=136
Socks/Node=* NtasksPerN:B:S:C=68:0:*:1 CoreSpec=*
MinCPUsNode=68 MinMemoryCPU=1801M MinTmpDiskNode=0

salloc --ntasks-per-node=128 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=957
NumNodes=1 NumCPUs=256 NumTasks=128 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=256,mem=122496M,node=1,billing=256
Socks/Node=* NtasksPerN:B:S:C=128:0:*:1 CoreSpec=*
MinCPUsNode=128 MinMemoryCPU=957M MinTmpDiskNode=0

Assuming a hidden factor of 2, these give total requests of 244952, 244936 and 244992, something close to 245000. This number is nowhere to be found in our config. Even on other partitions with different amounts of memory, similar behaviour is seen, though the limit up to which Slurm will still allocate a job is different and likewise not based on any particular config value. For instance, the highmem partition has the following config: DefMemPerCPU=3950 MaxMemPerCPU=7900, so it should have a maximum of 1011200M, but I am hitting an apparent limit (with a hidden factor of 2) of 1019904, which is just slightly more than what should be enforced. Why is this happening?!

The confusing memory reservation also depends on threads-per-core and ntasks-per-core. Further examples of my confusion, where I alter threads-per-core and ntasks-per-core and check the resulting memory request through scontrol, are:

1) salloc --ntasks-per-node=67 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1828
NumNodes=1 NumCPUs=134 NumTasks=67 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=134,mem=122476M,node=1,billing=134
Socks/Node=* NtasksPerN:B:S:C=67:0:*:1 CoreSpec=*
MinCPUsNode=67 MinMemoryCPU=1828M MinTmpDiskNode=0

2) salloc --ntasks-per-node=67 --threads-per-core=2 --cpus-per-task=1 --ntasks-per-core=1 --mem-per-cpu=1828
NumNodes=1 NumCPUs=134 NumTasks=67 CPUs/Task=1 ReqB:S:C:T=0:0:*:2
TRES=cpu=134,mem=244952M,node=1,billing=134
Socks/Node=* NtasksPerN:B:S:C=67:0:*:1 CoreSpec=*
MinCPUsNode=67 MinMemoryCPU=1828M MinTmpDiskNode=0

3) salloc --ntasks-per-node=67 --threads-per-core=1 --cpus-per-task=1 --ntasks-per-core=2 --mem-per-cpu=1828
NumNodes=1 NumCPUs=134 NumTasks=67 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
TRES=cpu=134,mem=122476M,node=1,billing=134
Socks/Node=* NtasksPerN:B:S:C=67:0:*:2 CoreSpec=*
MinCPUsNode=67 MinMemoryCPU=1828M MinTmpDiskNode=0

4) salloc --ntasks-per-node=67 --threads-per-core=2 --cpus-per-task=1 --ntasks-per-core=2 --mem-per-cpu=1828
NumNodes=1 NumCPUs=68 NumTasks=67 CPUs/Task=1 ReqB:S:C:T=0:0:*:2
TRES=cpu=68,mem=124304M,node=1,billing=68
Socks/Node=* NtasksPerN:B:S:C=67:0:*:2 CoreSpec=*
MinCPUsNode=67 MinMemoryCPU=1828M MinTmpDiskNode=0

These examples are confusing. What is the formula used to calculate the memory if one provides --mem-per-cpu? What is the "cpu" in mem-per-cpu?
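My own attempt at back-calculating those four cases (all in MB; the multipliers below are simply whatever reproduces the reported totals, not a documented formula):

echo $(( 67 * 1828 ))       # 122476, matches 1) and 3)
echo $(( 67 * 2 * 1828 ))   # 244952, matches 2), i.e. doubled when --threads-per-core=2
echo $(( 68 * 1828 ))       # 124304, matches 4), i.e. the 68 allocated CPUs times --mem-per-cpu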
Based on results from 1) and 4), the memory reserved should correspond to just the number of execution threads running. 3) indicates that --ntasks-per-core has no impact on the memory required. 2) is the outlier and is incorrect for --threads-per-core=2.

The odd calculation of memory also impacts hybrid resource requests:

1) salloc --ntasks-per-node=34 --threads-per-core=1 --cpus-per-task=2 --ntasks-per-core=1 --mem-per-cpu=1828
NumNodes=1 NumCPUs=136 NumTasks=34 CPUs/Task=2 ReqB:S:C:T=0:0:*:1
TRES=cpu=136,mem=124304M,node=1,billing=136
Socks/Node=* NtasksPerN:B:S:C=34:0:*:1 CoreSpec=*
MinCPUsNode=68 MinMemoryCPU=1828M MinTmpDiskNode=0

2) salloc --ntasks-per-node=34 --threads-per-core=2 --cpus-per-task=2 --ntasks-per-core=1 --mem-per-cpu=1828
NumNodes=1 NumCPUs=68 NumTasks=34 CPUs/Task=2 ReqB:S:C:T=0:0:*:2
TRES=cpu=68,mem=124304M,node=1,billing=68
Socks/Node=* NtasksPerN:B:S:C=34:0:*:1 CoreSpec=*
MinCPUsNode=68 MinMemoryCPU=1828M MinTmpDiskNode=0

3) salloc --ntasks-per-node=34 --threads-per-core=2 --cpus-per-task=3 --ntasks-per-core=1 --mem-per-cpu=1828
salloc: Pending job allocation 63585
salloc: job 63585 queued and waiting for resources
HANGS INDEFINITELY

It is odd that 3) hangs indefinitely, given that the physical cores required are 34 * 2 = 68 and the memory should be 34 * 3 * 1828 = 186456M. If I reduce the memory requirement, I see that in fact the memory being requested is 34 * 4 * 1828, which cannot be satisfied by any node. Yet it is not rejected. Why is it not rejected?

To show that it is using 34 * 4 and not 34 * 3 as the multiplier:

1) salloc --ntasks-per-node=34 --threads-per-core=2 --cpus-per-task=2 --ntasks-per-core=1 --mem-per-cpu=920
NumNodes=1 NumCPUs=68 NumTasks=34 CPUs/Task=2 ReqB:S:C:T=0:0:*:2
TRES=cpu=68,mem=62560M,node=1,billing=68
Socks/Node=* NtasksPerN:B:S:C=34:0:*:1 CoreSpec=*
MinCPUsNode=68 MinMemoryCPU=920M MinTmpDiskNode=0

2) salloc --ntasks-per-node=34 --threads-per-core=2 --cpus-per-task=3 --ntasks-per-core=1 --mem-per-cpu=920
NumNodes=1 NumCPUs=136 NumTasks=34 CPUs/Task=3 ReqB:S:C:T=0:0:*:2
TRES=cpu=136,mem=125120M,node=1,billing=136
Socks/Node=* NtasksPerN:B:S:C=34:0:*:1 CoreSpec=*
MinCPUsNode=102 MinMemoryCPU=920M MinTmpDiskNode=0

Despite running 34 tasks with 3 cpus per task, the memory reservation is based on NumCPUs, which must be an even number because virtual cores are reserved in pairs (there are 2 per physical core). Yet this should NOT impact the total memory reservation. Again, there is a bug.

If you require further tests, please let me know. Do you know where this bug is arising? And is there a timeline for fixing it?

Cheers, Pascal
Pascal, Sorry for the delay in reply, but since the initial comment is really long it took me some time to read it properly. I'll try to formulate and answer some questions arising from your message; for the parts with commands and outputs I'm going to split it into a few smaller bugs. (Technically I'll be the reporter there, but I'll add you in CC so you'll be notified about the updates there.) It may end up with finding the same root cause, but at this point we'd like to split those so we have the possibility of different resolutions. I see the doubt coming from the impact of --threads-per-core on allocation/step memory, given its current documentation. It's in fact an issue we already discussed in Bug 13879, where a documentation fix is in review. I also see a somewhat hidden question: 'Why doesn't --ntasks-per-core=1 (or other parameters) affect the total memory calculation?' The way I can paraphrase my understanding is that "CPU" (in --mem-per-cpu) refers to allocated CPUs, not to the number of tasks started (Marshall already did a very comprehensive discussion of that in Bug 13879 - he comments on DefMemPerCPU and MaxMemPerCPU too). The physical unit understood as a CPU by Slurm is configurable: you can set the total number of CPUs on a node to the total number of hyper-threads, cores or sockets. You should get email notifications about the other bugs/discussion threads opened. I'll keep this one as a "master ticket" for those. cheers, Marcin
Dear sender, Thank you for your message. I am out of the office until July 11 and will have limited email access while I am away. I will respond to your email when I return. Best, Martijn
Pascal, Before I start opening separate threads, could you please share your slurm.conf so we can focus on the exact case? That way I'll be able to attach it where needed without bothering you with the same request everywhere. cheers, Marcin
Pascal, Could you please take a look at my last comment? I want to comment on and analyse the behaviour in relation to your configuration, not in general. cheers, Marcin
Hi Marcin, Sorry for the late reply, I was away on holidays. I will attach the slurm config, but just to give you a heads up, it has been updated a bit based on replies by SchedMD related to the MCS plugin, amongst others. I haven't run it through my gamut of tests.
Created attachment 25765 [details] slurm config
Pascal, Could you please add the node/partition definitions too? I'm mostly interested in the relation between CPUs/Sockets/CoresPerSocket/ThreadsPerCore. cheers, Marcin
Do you have any SLURM_*, SALLOC_* or SBATCH_* variables set in the environment while testing? cheers, Marcin PS. Sorry for the second email; I just realised from the config that you're on a Cray, and I know that such variables are quite common there.
Could you please take a look at the case? cheers, Marcin
I just wanted to point out that in Bug 13879 we have just merged a related documentation change. The commit is 0754895c970[1]. In the meantime - just a kind reminder about the information requested in previous comments. cheers, Marcin [1]https://github.com/SchedMD/slurm/commit/0754895c970d943570ba6303312c2dc32ec5a44e
Hi Marcin, Sorry for the late reply regarding the environment variables. The only ones that are set are:
SLURM_TIME_FORMAT=relative
SBATCH_EXPORT=NONE
Cheers, Pascal
>Could you please add node/partition definition too? I'm mostly interested in relation between CPUs/Sockets/CoresPerSocket/ThreadsPerCore.
PartitionName=work AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=YES QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nid00[1008-1323] PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=80896 TotalNodes=316 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=920 MaxMemPerCPU=1840 TRESBillingWeights=CPU=1

PartitionName=long AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=4-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nid00[1316-1323] PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=2048 TotalNodes=8 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=920 MaxMemPerCPU=1840 TRESBillingWeights=CPU=1

PartitionName=copy AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=YES MaxCPUsPerNode=UNLIMITED Nodes=dm0[1-8] PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=512 TotalNodes=8 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=1850 MaxMemPerCPU=3700 TRESBillingWeights=CPU=0

PartitionName=askaprt AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nid00[1324-1503] PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=46080 TotalNodes=180 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=920 MaxMemPerCPU=1840 TRESBillingWeights=CPU=1

PartitionName=debug AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=4 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nid00[1000-1007] PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=2048 TotalNodes=8 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=920 MaxMemPerCPU=1840 TRESBillingWeights=CPU=1

PartitionName=highmem AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nid00[1504-1511] PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=2048 TotalNodes=8 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=3950 MaxMemPerCPU=7900 TRESBillingWeights=CPU=1
A typical node for the work, debug, askaprt and long partitions:

NodeName=nid001008 Arch=x86_64 CoresPerSocket=64
CPUAlloc=256 CPUTot=256 CPULoad=61.15
AvailableFeatures=AMD_EPYC_7763
ActiveFeatures=AMD_EPYC_7763
Gres=(null)
NodeAddr=nid001008-nmn NodeHostName=nid001008 Version=21.08.8-2
OS=Linux 5.3.18-24.75_10.0.189-cray_shasta_c #1 SMP Sun Sep 26 14:27:04 UTC 2021 (0388af5)
RealMemory=245000 AllocMem=235520 FreeMem=229461 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=work
BootTime=17 Jun 12:28 SlurmdStartTime=5 Jul 10:41
LastBusyTime=04:26:02
CfgTRES=cpu=256,mem=245000M,billing=256
AllocTRES=cpu=256,mem=230G
CapWatts=n/a
CurrentWatts=0 AveWatts=0

A typical node for highmem:

NodeName=nid001504 Arch=x86_64 CoresPerSocket=64
CPUAlloc=0 CPUTot=256 CPULoad=0.00
AvailableFeatures=AMD_EPYC_7763
ActiveFeatures=AMD_EPYC_7763
Gres=(null)
NodeAddr=nid001504-nmn NodeHostName=nid001504 Version=21.08.8-2
OS=Linux 5.3.18-24.75_10.0.189-cray_shasta_c #1 SMP Sun Sep 26 14:27:04 UTC 2021 (0388af5)
RealMemory=1020000 AllocMem=0 FreeMem=1017774 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=highmem
BootTime=17 Jun 12:29 SlurmdStartTime=5 Jul 10:41
LastBusyTime=Ystday 12:50
CfgTRES=cpu=256,mem=1020000M,billing=256
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
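(As an aside, RealMemory=245000 on the work nodes does line up with the apparent ~245000M ceiling I kept hitting earlier, if the hidden factor of 2 is real; a quick check of two of the products I reported:)

echo $(( 67 * 1828 * 2 ))   # 244952 MB, just under RealMemory=245000
echo $(( 128 * 957 * 2 ))   # 244992 MB, likewise just under 245000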
Pascal, Trying to address the main source of confusion - besides what we clarified in the mentioned documentation change. I see your question:

>These examples are confusing. What is the formula used to calculate the mem if one provides --mem-per-cpu? What is the "cpu" in mem-per-cpu?

In your case you have configured the number of CPUs on the node to be #Sockets * #CoresPerSocket * #ThreadsPerCore:

>NodeName=nid001008 Arch=x86_64 CoresPerSocket=64
> CPUAlloc=0 CPUTot=256 CPULoad=0.00
> RealMemory=245000 AllocMem=235520 FreeMem=229461 Sockets=2 Boards=1
> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A

With ThreadsPerCore=2, CoresPerSocket=64 and Sockets=2 you get CPUs=256. This number of CPUs is calculated by default if you don't specify it; however, it's also possible to configure the Slurm NodeName=... line like:

>NodeName=DEFAULT Sockets=4 CoresPerSocket=4 ThreadsPerCore=2 CPUs=16 RealMemory=15000

As you can see, in the above line the total number of CPUs is equal to the total number of cores (not threads), which defines the meaning of a CPU:

# srun --mem-per-cpu=10 /bin/bash -c 'scontrol show job $SLURM_JOB_ID' | egrep '(TRES|Core)'
TRES=cpu=1,mem=10M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*

This job is allocated a whole core; however, the core is accounted as a single CPU, and the meaning of --mem-per-cpu is consistent with that definition. In your configuration, Slurm interprets hardware hyper-threads as CPUs; however, we never assign the same core to two job allocations. It is possible, though, to run two job steps within the same allocation on dedicated hyperthreads. I'll split the remaining questions into separate bugs as mentioned before. Please let me know if the explanation above helps or raises additional questions. cheers, Marcin
Thanks for the info Marcin, and again sorry for the delay in replying. So --mem-per-cpu has a fluid definition of "cpu". Not something I think is ideal. Is it possible to request a --mem-per-thread option that would not have a fluid definition? Cheers, Pascal
>So --mem-per-cpu has a fluid definition of "cpu".[...]
It's as fluid as the definition of "CPU" in Slurm: depending on your configuration, the number of CPUs a job is accounted for may be equal to hyper-threads or cores.
>Is it possible to request a --mem-per-thread option that would not have a fluid definition?
I can check with our senior developers whether such an enhancement request is something we may be interested in; however, as an enhancement it will require sponsorship. Is that something you may be interested in? cheers, Marcin
I am on holiday from 05-August-2022 till 21-August-2022
Pascal, We're having an internal discussion about the potential enhancement. I'll let you know once we establish something. cheers, Marcin
Pascal, We did an initial code check for the potential development of --mem-per-core/--mem-per-thread. As a feature request it will require sponsorship and can be done in Slurm 23.11 at the earliest. Are you interested? cheers, Marcin
Pascal, I'm closing the case as "information given". Please feel free to reopen if you're interested in sponsoring the feature development. cheers, Marcin