Ticket 13879

Summary:	Unexpected amount of memory when using DefMemPerCPU and nomultithread
Product:	Slurm	Reporter:	IDRIS System Team <gensyshpe>
Component:	Scheduling	Assignee:	Marshall Garey <marshall>
Status:	RESOLVED FIXED	QA Contact:
Severity:	3 - Medium Impact
Priority:	---	CC:	cinek, remi.lacroix
Version:	21.08.6
Hardware:	Linux
OS:	Linux
See Also:	https://bugs.schedmd.com/show_bug.cgi?id=13194 https://bugs.schedmd.com/show_bug.cgi?id=13913 https://bugs.schedmd.com/show_bug.cgi?id=9724 https://bugs.schedmd.com/show_bug.cgi?id=11148 https://bugs.schedmd.com/show_bug.cgi?id=11868 https://bugs.schedmd.com/show_bug.cgi?id=11719 https://bugs.schedmd.com/show_bug.cgi?id=12319 https://bugs.schedmd.com/show_bug.cgi?id=12824 https://bugs.schedmd.com/show_bug.cgi?id=5290
Site:	GENCI - Grand Equipement National de Calcul Intensif	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	22.05.3 23.02.0pre1
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---

Description IDRIS System Team 2022-04-20 09:51:49 MDT

Hi!

We migrated from 20.02.6 (with some patches from schedmd) to 21.08.6 and we now have an issue on job memory allocation when using DefMemPerCPU and nomultithread. A job submitted with "--hint=nomultithread" has half as much memory as expected.

Configuration:

NodeName=r1i0n0 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=191648

DefMemPerCPU            = 2048
SelectType              = select/cons_tres
SelectTypeParameters    = CR_CORE_MEMORY,CR_CORE_DEFAULT_DIST_BLOCK

In the example below, the job should have 4 NumCPUs x 2G DefMemPerCPU = 8G memory.

JobId=1284195 JobName=nomultithread
   UserId=user01(300271) GroupId=group01(300225) MCS_label=N/A
   Priority=270723 Nice=0 Account=abc QOS=qos_cpu-dev
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:10 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2022-04-20T17:13:21 EligibleTime=2022-04-20T17:13:21
   AccrueTime=2022-04-20T17:13:21
   StartTime=2022-04-20T17:13:28 EndTime=2022-04-20T17:15:28 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-04-20T17:13:28 Scheduler=Main
   Partition=cpu_p1 AllocNode:Sid=idradmsys-ib0:2344400
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=r1i0n0
   BatchHost=r1i0n0
   NumNodes=1 NumCPUs=4 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0::1
   TRES=cpu=4,mem=4G,node=1,billing=2
   Socks/Node= NtasksPerN:B:S:C=2:0:: CoreSpec=*
   MinCPUsNode=2 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/user01/cpu_p1.nomultithread.job
   WorkDir=/home/user01
   StdErr=/home/user01/nomultithread.o1284195
   StdIn=/dev/null
   StdOut=/home/user01/nomultithread.o1284195
   Power=

The same job without "-hint=nomultithread" has the expected amount of memory (2 NumCPUs x 2G DefMemPerCPU = 4G):

JobId=1284148 JobName=multithread
   UserId=user01(300271) GroupId=group01(300225) MCS_label=N/A
   Priority=270720 Nice=0 Account=abc QOS=qos_cpu-dev
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:10 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2022-04-20T17:11:15 EligibleTime=2022-04-20T17:11:15
   AccrueTime=2022-04-20T17:11:15
   StartTime=2022-04-20T17:11:17 EndTime=2022-04-20T17:13:17 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-04-20T17:11:17 Scheduler=Main
   Partition=cpu_p1 AllocNode:Sid=idradmsys-ib0:2344400
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=r1i0n0
   BatchHost=r1i0n0
   NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0::
   TRES=cpu=2,mem=4G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=2:0:: CoreSpec=*
   MinCPUsNode=2 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/user01/cpu_p1.multithread.job
   WorkDir=/home/user01
   StdErr=/home/user01/multithread.o1284148
   StdIn=/dev/null
   StdOut=/home/user01/multithread.o1284148
   Power=

Thanks for you help!

Comment 1 Marshall Garey 2022-04-20 09:55:36 MDT

This is expected because --hint=nomultithread means "only use one thread per core" so the number of CPUs allocated to a job or step is halved (assuming 2 threads per core) when using --hint=nomultithread.

This bug in previous versions of Slurm allowed steps to be allocated twice the amount of memory as the job (assuming 2 threads per core). This meant that memory could be oversubscribed on nodes.

So, 21.08 fixed this bug.





Here are more details:

In 21.08, --mem-per-cpu (or DefMemPerCPU) is calculated to only count the CPU threads that were requested by the step. If the step requests --hint=nomultithread, that means it requests 1 thread per core, and --mem-per-cpu will be calculated accordingly.

Let’s consider a node with 8 cores and 2 threads per core. For the following job:

salloc --mem-per-cpu=100 --exclusive --hint=nomultithread

With --hint=nomultithread, mem per cpu is multiplied by 8, not 16. The job will be allocated 800 MB of memory.

Without --hint=nomultithread, mem per cpu is multiplied by 16, so the job is allocated 1600 MB of memory.

Examples:

With nomultithread (only one thread per core):

$ salloc --mem-per-cpu=100 --exclusive --hint=nomultithread
salloc: Granted job allocation 18332
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
$ srun -n1 -c8 hostname
voyager
$ srun -n1 -c16 hostname
srun: error: Unable to create step for job 18332: More processors requested than permitted

$ sacct -j 18332 -o jobid,reqtres,alloctres -p  
JobID|ReqTRES|AllocTRES|
18332|billing=1,cpu=1,mem=100M,node=1|billing=16,cpu=16,mem=800M,node=1|
18332.interactive||cpu=16,mem=800M,node=1|
18332.0||cpu=16,mem=800M,node=1|

Please Note: Although mem per cpu is multiplied by 8, Slurm still allocates all 16 CPUs to prevent other jobs from using the other threads on those cores. Notice that the steps can only use one thread per core, so they only have access to 8 CPUs.


Without nomultithread (two threads per core):

$ salloc --mem-per-cpu=100 --exclusive
salloc: Granted job allocation 18334
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
$ srun -n1 -c8 --exact hostname
voyager
$ srun -n1 -c16 --exact hostname  
voyager
$ sacct -j 18334 -o jobid,reqtres,alloctres -p  
JobID|ReqTRES|AllocTRES|
18334|billing=1,cpu=1,mem=100M,node=1|billing=16,cpu=16,mem=1600M,node=1|
18334.interactive||cpu=16,mem=1600M,node=1|
18334.0||cpu=8,mem=800M,node=1|
18334.1||cpu=16,mem=1600M,node=1|

Comment 3 Marshall Garey 2022-04-20 09:59:52 MDT

To address your specific examples:

In your "nomultithread" job, the job allocation had 4 CPUs: 2 cores, 2 threads per core. However, because of --hint=nomultithread, the job and steps are actually limited to just 2 CPUs (1 thread on each core). Slurm allocated the entire cores to the jobs because Slurm (if it is aware of cores and threads per core) will not allow multiple jobs to share a single core even if they're on different CPU threads in the core (this would be bad for performance).

In your "multithread" job, the job allocation had 2 CPUs: both threads on a single core. The job is allowed to use both threads on the core.

Comment 4 IDRIS System Team 2022-04-21 03:49:28 MDT

Thank for this explanation, although we do not fully understand how the memory could be oversubscribed.

We think it's weird that for the same number of allocated CPUs the allocated memory is different with or without multi-threading, especially when using DefMemPerCPU. But we can live with it ;)

What we are trying to do is to allocate the memory proportionally to the allocated CPUs without any specification by users. How can we do that? Could you give us some hints?

Comment 8 Marshall Garey 2022-04-22 19:11:52 MDT

> Thank for this explanation, although we do not fully understand how the
> memory could be oversubscribed.

It turns out that oversubscription of step memory was *not* the reason that this change in behavior was made, although it was a bug after this change in behavior.



> We think it's weird that for the same number of allocated CPUs the allocated
> memory is different with or without multi-threading, especially when using
> DefMemPerCPU. But we can live with it ;)

I agree that it is weird, and we didn't document this change.


I've been researching this. There actually a lot of history behind these changes, as you can see from all the bugs that I added to "see also."

I will update you on Monday with more detailed responses to your questions.


> What we are trying to do is to allocate the memory proportionally to the
> allocated CPUs without any specification by users. How can we do that? Could
> you give us some hints?

Valid concerns, and I'm looking into this. I'll hopefully be able to give you more details on Monday.

Comment 10 IDRIS System Team 2022-04-27 01:53:40 MDT

Hi!

Do you have any news?

We also noticed that even the manpages seem to describe the behavior we expect.

  --mem-per-cpu
    Minimum memory required per allocated CPU. [...]
    The default value is DefMemPerCPU [...]
    If resources are allocated by core, socket, or whole nodes, then the number of CPUs allocated to a job may be higher than the task count and the value of --mem-per-cpu should be adjusted accordingly.

Comment 12 Marshall Garey 2022-04-27 15:32:15 MDT

Hi,

I'm discussing this issue with my colleagues. Once we have made some progress I will let you know and I will explain a little bit of the history.

Comment 17 Marshall Garey 2022-05-02 10:37:23 MDT

Hi!

Thanks for your patience.

Summary:
* We're going to keep the current behavior. I'll explain the history and our reasoning for keeping the current behavior.
* A solution for you: You could use a job_submit plugin or a cli_filter plugin. However, we generally don't recommend using DefMemPerCPU = RealMemory / NumCpus because the memory tends to not be utilized by the applications.

Can you let me know if my explanations make sense and if my recommendations will work for you?



In my explanations, I'm using --threads-per-core. But --hint=nomultithread is the same as --threads-per-core=1.

The original change in behavior started in commit 49a7d7f9fb. Here were the problems:
(1) --exclusive and non-exclusive jobs were inconsistent when calculating --mem-per-cpu and --threads-per-core: non-exclusive allocations would use hyperthreads in the calculation; exclusive allocations would not. We wanted to make this consistent.
(2) Jobs requesting --mem-per-cpu and --threads-per-core=1 could get starved - Slurm thought the job could fit on one node when just looking at the job request, so Slurm tried to schedule the job on one node. Then when taking into account the extra (unused but still allocated) threads for the mem per CPU calculation the job could not fit onto one node, and Slurm couldn't schedule the job. Several sites hit this job starvation issue.
(3) For 20.11 a site sponsored making --threads-per-core actually affect job and step allocations instead of just be a minimum resource request. (I don't have any written evidence for this last point, but I was told this by a coworker:) Apparently they also thought it was confusing that when a job requested --threads-per-core=1 --mem=per-cpu=100 the job would be allocated 100*<num threads per core> instead of just 100 MB of memory.

Anyway, we added commit 49a7d7f9fb to fix all of these things. There have been several follow-up commits that have fixed various bugs to continue to make the behavior consistent. One of those bugs was that steps could get allocated more memory than the job.

We certainly could have not changed the behavior and found ways to fix those issues using the old and documented behavior. However, we have had the current behavior for two releases (20.11 and 21.08). In addition, we have reached feature freeze for 22.05 (scheduled to be released this month), so we wouldn't be able to change this behavior for 22.05, either. So if we went back to the old behavior, the soonest we could do it is for 23.02.

If we make that change in 23.02, then we will certainly get some sites submitting tickets about the change in behavior, like you (and another site) have submitted tickets about the change in behavior in 20.11.

At this point, we'd like to avoid changing the behavior yet again in 23.02, especially since it will have been the behavior for 3 releases. So we are keeping the current behavior.


We definitely failed to update our documentation on the changed behavior in 20.11. I'll make sure to get that done as part of this bug.



The good new is that I have recommendations for you:

(1) We usually do not recommend setting DefMemPerCPU=RealMemory / NumCpus. I've been told that this will often result in wasted memory (applications get more memory than they need).

We often recommend setting DefMemPerCPU rather low to force users to learn to explicitly request memory.

Also, if nodes are not shared (--exclusive is used), then a user can request --mem=0 to get all the memory on the nodes.

(Or --mem-per-cpu=0, or set DefMemPer{CPU|Node}=0 in slurm.conf, or just remove DefMemPer{CPU|Node} since zero is the default value).


(2) However, we also understand that we do not determine site policies. We understand that this is something that you may want to continue to do; maybe it works well enough for your site. Our recommendation in this case is to use a job_submit plugin to adjust the --mem-per-cpu in the job; or use a cli_filter plugin to adjust the --mem-per-cpu in the job or step.

I personally recommend a cli_filter plugin since it affects job and step allocations (all salloc/sbatch/srun), instead of just job allocations. A cli_filter plugin is not secure since it runs on the client; if you want to enforce that users never request --mem or --mem-per-cpu, then you can use a job_submit plugin. You can also set MaxMemPerCPU/MaxMemPerNode in slurm.conf as a way to enforce user requests stay under those configured values.

I'll give an example of a cli_filter plugin.

https://slurm.schedmd.com/slurm.conf.html#OPT_CliFilterPlugins
https://slurm.schedmd.com/cli_filter_plugins.html

Lua plugins are quick and easy to write and do not require re-compiling Slurm. This is probably what you want to use.

Personally, since I'm already frequently modifying the Slurm source code, I find the easiest way to do a quick test is to just modify the cli_filter/none C plugin and then recompile Slurm - see the following file in the Slurm source:

src/plugins/cli_filter/none/cli_filter_none.c

You can also always modify another one of the existing C plugins or write your own C plugin.

Regardless, you'll need to configure it in slurm.conf:

CliFilterPlugins=cli_filter/lua
or if you choose to modify the "none" C plugin:
CliFilterPlugins=cli_filter/none

You would insert logic like this in the funct on cli_filter_p_pre_submit():

If srun uses --hint=nomultithread (or --threads-per-core=1), then set --mem-per-cpu=twice the amount of DefMemPerCPU. This is only useful if all your nodes have 2 threads per core.

Here's an example in C.

extern int cli_filter_p_pre_submit(slurm_opt_t *opt, int offset)
{
    if ((opt->threads_per_core == 1) ||
        (xstrstr(opt->hint, "nomultithread"))) {
        info("DEBUGGING: requested threads per core=1 or nomultithread; adjusting mem per cpu");
        /* Here I am assuming that DefMemPerCPU=100 in slurm.conf */
        opt->mem_per_cpu = 100 * 2;
    }
    return SLURM_SUCCESS;
}



Here's an example in action:

marshall@curiosity:~/slurm/21.08/install/c1$ salloc --threads-per-core=1
salloc: cli_filter/none: cli_filter_p_pre_submit: DEBUGGING: requested threads per core=1 or nomultithread; adjusting mem per cpu
salloc: Granted job allocation 94
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
srun: cli_filter/none: cli_filter_p_pre_submit: DEBUGGING: requested threads per core=1 or nomultithread; adjusting mem per cpu
marshall@curiosity:~/slurm/21.08/install/c1$ srun hostname
srun: cli_filter/none: cli_filter_p_pre_submit: DEBUGGING: requested threads per core=1 or nomultithread; adjusting mem per cpu
curiosity
marshall@curiosity:~/slurm/21.08/install/c1$ sacct -j $SLURM_JOB_ID -ojobid,alloctres -p
JobID|AllocTRES|
94|billing=2,cpu=2,mem=200M,node=1|
94.interactive|cpu=2,mem=200M,node=1|
94.0|cpu=2,mem=200M,node=1|


Without this cli_filter plugin:

marshall@curiosity:~/slurm/21.08/install/c1$ salloc --threads-per-core=1
salloc: Granted job allocation 93
salloc: Waiting for resource configuration
salloc: Nodes n1-1 are ready for job
marshall@curiosity:~/slurm/21.08/install/c1$ srun hostname
curiosity
marshall@curiosity:~/slurm/21.08/install/c1$ sacct -j $SLURM_JOB_ID -ojobid,alloctres -p
JobID|AllocTRES|
93|billing=2,cpu=2,mem=100M,node=1|
93.interactive|cpu=2,mem=100M,node=1|
93.0|cpu=2,mem=100M,node=1|

Comment 18 Marshall Garey 2022-05-16 09:39:18 MDT

Hi,

I'm just checking in. Did you have any questions about my response? Do you have any questions about a job_submit or cli_filter plugin?

Comment 19 IDRIS System Team 2022-05-17 09:43:33 MDT

Thanks for your explanations and your recommendations!

We modified our job_submit plugin to manage the memory quickly after we discovered the issue because it was impacting our users. We thought these modifications would be temporary but after your comment they will become permanent ;)

We still be confused by DefMemPerCPU and MaxMemPerCPU parameters because they don't seem to use the same logic (DefMemPerCPU is based on requested CPU and MaxMemPerCPU on allocated CPU). So we have some trouble to set a correct value to MaxMemPerCPU on our partitions. For now we have removed the parameter.

Comment 20 Marshall Garey 2022-05-17 09:45:25 MDT

Okay. Thanks also for pointing out the differences in MaxMemPerCPU and DefMemPerCPU. I think we have another bug where we are fixing an issue with MaxMemPerCPU, and I will point this out to my colleague who is working on it.


I'll keep this ticket open to track improving the documentation.

Comment 32 Marshall Garey 2022-07-26 11:12:35 MDT

We have updated the documentation in commit 0754895c970d943. This will show up live on the website when 22.05.3 is released (we don't have a release date yet).

I'm closing this as resolved/fixed. Please let us know if you have any more questions. Also please let us know if you think the documentation is unclear.


https://github.com/schedMD/slurm/commit/0754895c970d943