| Summary: | mem-per-cpu and mem calculation limit enforcement incorrect | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Pascal <pascal.elahi> |
| Component: | Accounting | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | bas.vandervlies, marshall, scott, tim |
| Version: | 21.08.6 | | |
| Hardware: | Cray Shasta | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=14625 | | |
| Site: | Pawsey | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm config | | |
|
Description (Pascal, 2022-06-24 01:14:37 MDT)
Pascal,

Sorry for the delay in replying; since the initial comment is really long, it took me some time to read it properly. I'll try to formulate and answer some questions arising from your message. For the parts with commands and outputs I'm going to split it into a few smaller bugs (technically I'll be the reporter there, but I'll add you in CC so you'll be notified about updates). It may end up with us finding the same root cause, but at this point we'd like to split those so we have the possibility of different resolutions.

I see the doubt coming from the impact of --threads-per-core on allocation/step memory given its current documentation. It's in fact an issue we are already discussing in Bug 13879, where a documentation fix is in review. I also see a somewhat hidden question: "Why doesn't --ntasks-per-core=1 (or another such parameter) affect the total memory calculation?" The way I can paraphrase my understanding is that "CPU" (in --mem-per-cpu) refers to allocated CPUs, not to the number of tasks started (Marshall already gave a very comprehensive discussion of that in Bug 13879; he comments on DefMemPerCPU and MaxMemPerCPU too). The physical unit understood as a CPU by Slurm is configurable: you can set the total number of CPUs on a node to the total number of hyperthreads, cores, or sockets.

You should get email notifications about the other bugs/discussion threads opened. I'll keep this one as a "master ticket" for those.

cheers,
Marcin

Pascal,

Before I start opening separate threads, could you please share your slurm.conf, so we can focus on the exact case and I'll be able to attach it where needed without bothering you with the same request everywhere.

cheers,
Marcin

Pascal,

Could you please take a look at my last comment?
I want to comment on and analyse the behavior in relation to your configuration, not in general.

cheers,
Marcin

Hi Marcin,

Sorry for the late reply, I was away on holidays. I will attach the slurm config, but just to give you a heads up: this has been updated a bit based on replies by SchedMD related to the mcs plugin, amongst others. I haven't run it through my gamut of tests.

Created attachment 25765 [details]
slurm config
Pascal,

Could you please add the node/partition definitions too? I'm mostly interested in the relation between CPUs/Sockets/CoresPerSocket/ThreadsPerCore.

cheers,
Marcin

Do you have any SLURM_*, SALLOC_*, or SBATCH_* variables set in the environment while testing?

cheers,
Marcin

PS. Sorry for the second email, I just understood from the config that you're on Cray, and I know that it's quite common there.

Could you please take a look at the case?

cheers,
Marcin

I just wanted to point out that in Bug 13879 we just merged a related documentation change. The commit is 0754895c970[1]. In the meantime, just a kind reminder about the information requested in the previous comments.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/0754895c970d943570ba6303312c2dc32ec5a44e

Hi Marcin,

Sorry for the late reply regarding the environment variables. The only ones that are set are:

SLURM_TIME_FORMAT=relative
SBATCH_EXPORT=NONE

Cheers,
Pascal

>Could you please add the node/partition definitions too? I'm mostly interested in the relation between CPUs/Sockets/CoresPerSocket/ThreadsPerCore.
>cheers,
>Marcin
PartitionName=work AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=YES QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nid00[1008-1323] PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=80896 TotalNodes=316 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=920 MaxMemPerCPU=1840 TRESBillingWeights=CPU=1

PartitionName=long AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=1 MaxTime=4-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nid00[1316-1323] PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=2048 TotalNodes=8 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=920 MaxMemPerCPU=1840 TRESBillingWeights=CPU=1

PartitionName=copy AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=YES MaxCPUsPerNode=UNLIMITED Nodes=dm0[1-8] PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=512 TotalNodes=8 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=1850 MaxMemPerCPU=3700 TRESBillingWeights=CPU=0

PartitionName=askaprt AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nid00[1324-1503] PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=46080 TotalNodes=180 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=920 MaxMemPerCPU=1840 TRESBillingWeights=CPU=1

PartitionName=debug AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=4 MaxTime=01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nid00[1000-1007] PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=2048 TotalNodes=8 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=920 MaxMemPerCPU=1840 TRESBillingWeights=CPU=1

PartitionName=highmem AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=nid00[1504-1511] PriorityJobFactor=0 PriorityTier=0 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1 OverTimeLimit=NONE PreemptMode=OFF State=UP TotalCPUs=2048 TotalNodes=8 SelectTypeParameters=NONE JobDefaults=(null) DefMemPerCPU=3950 MaxMemPerCPU=7900 TRESBillingWeights=CPU=1

A typical node for the work, debug, askaprt, and long partitions:

NodeName=nid001008 Arch=x86_64 CoresPerSocket=64 CPUAlloc=256 CPUTot=256 CPULoad=61.15 AvailableFeatures=AMD_EPYC_7763 ActiveFeatures=AMD_EPYC_7763 Gres=(null) NodeAddr=nid001008-nmn NodeHostName=nid001008 Version=21.08.8-2 OS=Linux 5.3.18-24.75_10.0.189-cray_shasta_c #1 SMP Sun Sep 26 14:27:04 UTC 2021 (0388af5) RealMemory=245000 AllocMem=235520 FreeMem=229461 Sockets=2 Boards=1 State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A Partitions=work BootTime=17 Jun 12:28 SlurmdStartTime=5 Jul 10:41 LastBusyTime=04:26:02 CfgTRES=cpu=256,mem=245000M,billing=256 AllocTRES=cpu=256,mem=230G CapWatts=n/a CurrentWatts=0 AveWatts=0

A typical node for the highmem partition:

NodeName=nid001504
Arch=x86_64 CoresPerSocket=64 CPUAlloc=0 CPUTot=256 CPULoad=0.00 AvailableFeatures=AMD_EPYC_7763 ActiveFeatures=AMD_EPYC_7763 Gres=(null) NodeAddr=nid001504-nmn NodeHostName=nid001504 Version=21.08.8-2 OS=Linux 5.3.18-24.75_10.0.189-cray_shasta_c #1 SMP Sun Sep 26 14:27:04 UTC 2021 (0388af5) RealMemory=1020000 AllocMem=0 FreeMem=1017774 Sockets=2 Boards=1 State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A Partitions=highmem BootTime=17 Jun 12:29 SlurmdStartTime=5 Jul 10:41 LastBusyTime=Ystday 12:50 CfgTRES=cpu=256,mem=1020000M,billing=256 AllocTRES= CapWatts=n/a CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Pascal,

Trying to find the main confusion, beyond what we clarified in the mentioned documentation change, I see your question:

>These examples are confusing. What is the formula used to calculate the mem if one provides --mem-per-cpu? What is the "cpu" in mem-per-cpu?

In your case you have configured the number of CPUs on the node to be #Sockets * #CoresPerSocket * #ThreadsPerCore:

>NodeName=nid001008 Arch=x86_64 CoresPerSocket=64
> CPUAlloc=0 CPUTot=256 CPULoad=0.00
> RealMemory=245000 AllocMem=235520 FreeMem=229461 Sockets=2 Boards=1
> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A

With ThreadsPerCore=2, CoresPerSocket=64, and Sockets=2, CPUs=256. This number of CPUs is calculated by default if you don't specify it; however, it's also possible to configure the Slurm NodeName=
line like:

>NodeName=DEFAULT Sockets=4 CoresPerSocket=4 ThreadsPerCore=2 CPUs=16 RealMemory=15000

As you see, in the above line the total number of CPUs is equal to the total number of cores (not threads), which defines the meaning of a CPU:

# srun --mem-per-cpu=10 /bin/bash -c 'scontrol show job $SLURM_JOB_ID' | egrep '(TRES|Core)'
   TRES=cpu=1,mem=10M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*

This job is allocated a whole core; however, the core is accounted as a single CPU, and the meaning of --mem-per-cpu is consistent with that definition. So in your configuration Slurm interprets physical hyperthreads as CPUs; however, we never assign the same core to two job allocations. It is possible, though, to run two job steps within the same allocation on dedicated hyperthreads.

I'll split the remaining questions into separate bugs as mentioned before. Please let me know if the explanation above helps or raises additional questions.

cheers,
Marcin

Thanks for the info Marcin, and again sorry for the delay in replying. So --mem-per-cpu has a fluid definition of "cpu". Not something I think is ideal. Is it possible to request a --mem-per-thread option that would not have a fluid definition?

Cheers,
Pascal

>So --mem-per-cpu has a fluid definition of "cpu".[...]

It's as fluid as the definition of "CPU" in Slurm: depending on your configuration, the number of CPUs a job is accounted for may be equal to hyperthreads or to cores.

>Is it possible to request a --mem-per-thread option that would not have a fluid definition?

I can check with our senior developers whether such an enhancement request is something we may be interested in; however, as an enhancement this will require sponsorship. Is that something you may be interested in?

cheers,
Marcin

Pascal,

We're having an internal discussion about the eventual enhancement. I'll let you know once we establish something.
cheers,
Marcin

Pascal,

We did an initial code check for the potential development of --mem-per-core/--mem-per-thread. As a feature request it will require sponsorship and can be done in Slurm 23.11 at the earliest. Are you interested?

cheers,
Marcin

Pascal,

I'm closing the case as information given. Please feel free to reopen if you're interested in sponsoring the feature development.

cheers,
Marcin
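As an illustrative footnote to the thread, the arithmetic Marcin describes can be sketched in a few lines of Python. This is not Slurm source code, just the rule as explained above: a job's memory is --mem-per-cpu times the number of allocated CPUs, where "CPU" is whatever the node configuration counts as one (hyperthreads when CPUs = Sockets * CoresPerSocket * ThreadsPerCore, cores when threads are not counted). The figures come from nid001008 above.

```python
def cpus_per_node(sockets, cores_per_socket, threads_per_core, count_threads=True):
    """Number of schedulable CPUs Slurm sees on the node.

    With count_threads=True every hyperthread is a CPU (the default node
    calculation); with count_threads=False only whole cores are CPUs, as in
    the NodeName=DEFAULT ... CPUs=16 example above.
    """
    cpus = sockets * cores_per_socket
    if count_threads:
        cpus *= threads_per_core
    return cpus

def job_memory_mb(mem_per_cpu_mb, allocated_cpus):
    """Total memory charged to the job: per-CPU request times allocated CPUs."""
    return mem_per_cpu_mb * allocated_cpus

# nid001008: Sockets=2, CoresPerSocket=64, ThreadsPerCore=2 -> CPUTot=256
threads_as_cpus = cpus_per_node(2, 64, 2, count_threads=True)   # 256
cores_as_cpus = cpus_per_node(2, 64, 2, count_threads=False)    # 128

# A whole-node job at the work partition's DefMemPerCPU=920 MB:
print(job_memory_mb(920, threads_as_cpus))  # 235520, matching AllocMem=235520 above
```

Note that 920 MB * 256 CPUs reproduces exactly the AllocMem=235520 shown in the nid001008 output, which is the consistency Marcin's explanation predicts.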
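The DefMemPerCPU/MaxMemPerCPU values in the partition definitions above interact with --mem-per-cpu according to the rule documented in slurm.conf(5): a request exceeding MaxMemPerCPU is not rejected; instead, Slurm increases the job's CPUs per task until the per-CPU share fits under the cap. A hedged sketch of that rule (not Slurm source; function name is illustrative):

```python
import math

def enforce_max_mem_per_cpu(mem_per_cpu_mb, cpus_per_task, max_mem_per_cpu_mb):
    """Apply the documented MaxMemPerCPU rule; returns (cpus_per_task, mem_per_cpu_mb).

    If the requested per-CPU memory exceeds the partition cap, grow the CPU
    count so the same total memory divides into shares within the limit.
    """
    total_mb = mem_per_cpu_mb * cpus_per_task
    if mem_per_cpu_mb > max_mem_per_cpu_mb:
        cpus_per_task = math.ceil(total_mb / max_mem_per_cpu_mb)
        mem_per_cpu_mb = total_mb // cpus_per_task
    return cpus_per_task, mem_per_cpu_mb

# On the work partition (MaxMemPerCPU=1840): asking for 3680 MB per CPU
# doubles the CPUs per task instead of failing the request.
print(enforce_max_mem_per_cpu(3680, 1, 1840))  # (2, 1840)

# A request at or below the cap (e.g. DefMemPerCPU=920) is left unchanged.
print(enforce_max_mem_per_cpu(920, 1, 1840))   # (1, 920)
```

This is why, on the work/debug/askaprt/long partitions, memory-heavy jobs end up billed for extra CPUs rather than being denied outright.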