Ticket 14625 - Clarify memory impact on "hybrid" resource allocation
Summary: Clarify memory impact on "hybrid" resource allocation
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 23.02.x
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-07-27 05:22 MDT by Marcin Stolarek
Modified: 2022-11-21 02:25 MST

See Also:
Site: Pawsey
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Marcin Stolarek 2022-07-27 05:22:05 MDT
Nodes defined as:
>NodeName=nid001008 Arch=x86_64 CoresPerSocket=64
>   CPUAlloc=0 CPUTot=256 CPULoad=0.00
>   RealMemory=245000 AllocMem=235520 FreeMem=229461 Sockets=2 Boards=1
>   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A

>SelectType              = select/cons_tres
>SelectTypeParameters    = CR_CORE_MEMORY,OTHER_CONS_RES,CR_ONE_TASK_PER_CORE
>DefMemPerNode           = UNLIMITED


From Pascal:

>1) salloc --ntasks-per-node=34 --threads-per-core=1 --cpus-per-task=2 --ntasks-per-core=1 --mem-per-cpu=1828
>   NumNodes=1 NumCPUs=136 NumTasks=34 CPUs/Task=2 ReqB:S:C:T=0:0:*:1
>   TRES=cpu=136,mem=124304M,node=1,billing=136
>   Socks/Node=* NtasksPerN:B:S:C=34:0:*:1 CoreSpec=*
>   MinCPUsNode=68 MinMemoryCPU=1828M MinTmpDiskNode=0


>2) salloc --ntasks-per-node=34 --threads-per-core=2 --cpus-per-task=2 --ntasks-per-core=1 --mem-per-cpu=1828
>   NumNodes=1 NumCPUs=68 NumTasks=34 CPUs/Task=2 ReqB:S:C:T=0:0:*:2
>   TRES=cpu=68,mem=124304M,node=1,billing=68
>   Socks/Node=* NtasksPerN:B:S:C=34:0:*:1 CoreSpec=*
>   MinCPUsNode=68 MinMemoryCPU=1828M MinTmpDiskNode=0


>3) salloc --ntasks-per-node=34 --threads-per-core=2 --cpus-per-task=3 --ntasks-per-core=1 --mem-per-cpu=1828
>salloc: Pending job allocation 63585
>salloc: job 63585 queued and waiting for resources 
>HANGS INDEFINITELY 

It is odd that 3) hangs indefinitely: the physical CPUs required are 34*2 = 68, and the memory should be 34 * 3 * 1828 ≈ 186,456 M, which fits in RealMemory=245000. If I reduce the memory requirement, I see that the memory actually being requested is 34 * 4 * 1828 = 248,608 M, which cannot be satisfied by any node. Yet the job is not rejected.
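The arithmetic above can be sketched in Python. This is a hypothetical model of the rounding behavior described in this ticket, not Slurm source; the helper name `cpus_per_task_alloc` is my own.

```python
import math

# Sketch (assumption based on this ticket's discussion): with CR_CORE_MEMORY,
# Slurm allocates whole cores, so an odd --cpus-per-task on 2-thread cores is
# rounded up to an even number of CPUs per task for the memory calculation.
def cpus_per_task_alloc(cpus_per_task, threads_per_core):
    """CPUs charged per task: full cores times hardware threads per core."""
    cores = math.ceil(cpus_per_task / threads_per_core)
    return cores * threads_per_core

ntasks, mem_per_cpu, threads_per_core = 34, 1828, 2
cpt = cpus_per_task_alloc(3, threads_per_core)  # 3 CPUs round up to 4 (2 cores)
total_mem = ntasks * cpt * mem_per_cpu          # 34 * 4 * 1828 = 248608 MB
print(cpt, total_mem, total_mem > 245000)       # exceeds RealMemory=245000
```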

Why is it not rejected?
Comment 1 Marcin Stolarek 2022-07-27 06:08:34 MDT
Let's discuss the case of ThreadsPerCore=4, which is more generic than ThreadsPerCore=2, on a configuration like:
>NodeName=DEFAULT Sockets=4 CoresPerSocket=4 ThreadsPerCore=4 RealMemory=15000 

1. Just using --mem-per-cpu:
>#  srun --mem-per-cpu=10 /bin/bash -c 'scontrol show job $SLURM_JOB_ID | grep TRES'
>   TRES=cpu=4,mem=40M,node=1,billing=4
This is expected: since a CPU is by default a hardware thread and only whole cores are allocated, we get the whole core and end up with 40M (4 ThreadsPerCore * 10M).

2. Add --cpus-per-task into consideration:
>#  srun --mem-per-cpu=10 --cpus-per-task=1  /bin/bash -c 'scontrol show job -d $SLURM_JOB_ID | grep TRES' | uniq -c
>      4    TRES=cpu=4,mem=40M,node=1,billing=4
this results in 4 tasks being launched, since we've got 4 CPUs — the number of tasks was adjusted automatically. Similarly:
>#  srun --mem-per-cpu=10 --cpus-per-task=2  /bin/bash -c 'scontrol show job -d $SLURM_JOB_ID | grep TRES' | uniq -c
>      2    TRES=cpu=4,mem=40M,node=1,billing=4
requesting two CPUs per task on 4 CPUs (a single core), we get 2 tasks.
>#  srun --mem-per-cpu=10 --cpus-per-task=3  /bin/bash -c 'scontrol show job -d $SLURM_JOB_ID | grep TRES' | uniq -c
>      1    TRES=cpu=4,mem=40M,node=1,billing=4
when 3 CPUs are requested we get only one task, since we can't start more on 4 CPUs. When we request --cpus-per-task=5 (or 6, 7, 8) we get 8 CPUs = 2 cores allocated:
>#  srun --mem-per-cpu=10 --cpus-per-task=5  /bin/bash -c 'scontrol show job -d $SLURM_JOB_ID | grep TRES' | uniq -c
>      1    TRES=cpu=8,mem=80M,node=1,billing=8
>#  srun --mem-per-cpu=10 --cpus-per-task=6  /bin/bash -c 'scontrol show job -d $SLURM_JOB_ID | grep Nodes=tes' | uniq -c
>      1      Nodes=test01 CPU_IDs=0-7 Mem=80 GRES=
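The full-core rounding in the examples above can be modeled with a short sketch. This is an illustrative assumption drawn from the TRES output shown here, not Slurm source code:

```python
import math

# Model (assumption): only whole cores are allocated, and --mem-per-cpu
# multiplies every hardware thread of those cores.
THREADS_PER_CORE = 4  # NodeName=DEFAULT ThreadsPerCore=4

def allocation(cpus_per_task, mem_per_cpu):
    cores = math.ceil(cpus_per_task / THREADS_PER_CORE)
    cpus = cores * THREADS_PER_CORE
    return cpus, cpus * mem_per_cpu  # (allocated CPUs, memory in MB)

for cpt in (1, 2, 3, 4, 5, 8):
    print(cpt, allocation(cpt, 10))
# cpt 1..4 -> (4, 40M); cpt 5..8 -> (8, 80M), matching the TRES lines above
```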


3. Add --threads-per-core to the specification (limiting the number of threads used per core):
>#  srun --mem-per-cpu=10 --cpus-per-task=2 --threads-per-core=1  /bin/bash -c 'scontrol show job -d $SLURM_JOB_ID | grep TRES' | uniq -c
>      1    TRES=cpu=8,mem=20M,node=1,billing=8
>#  srun --mem-per-cpu=10 --cpus-per-task=2 --threads-per-core=1  /bin/bash -c 'scontrol show job -d $SLURM_JOB_ID | grep Nodes=tes' | uniq -c
>      1      Nodes=test01 CPU_IDs=0-7 Mem=20 GRES=

In the previous case we got 2 tasks on a single core and only 4 CPUs were allocated; now we get 1 task, but because of the 2-CPUs-per-task request and the requirement of 1 thread per core, two cores were allocated. The main reason we decided to limit the memory allocation to 20MB (2 CPUs in use) is that, on a hybrid cluster, such a request will then start the same number of tasks on nodes with 1 ThreadPerCore as on nodes with 4 ThreadsPerCore, and we shouldn't allocate more memory just because a different node was selected. The memory allocation here is calculated reliably regardless of the node type used.
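That rule can also be sketched: whole cores are still allocated, but memory is charged only for the usable CPUs. This is my reading of the explanation above, not Slurm source, and `alloc_with_limit` is a hypothetical helper name:

```python
import math

# Sketch of the described rule: --threads-per-core caps usable threads per
# core; full cores are allocated, but memory is charged only for the usable
# CPUs, so the charge is identical on nodes with 1 or 4 HW threads per core.
HW_THREADS = 4  # hardware ThreadsPerCore on this test node

def alloc_with_limit(cpus_per_task, threads_per_core_limit, mem_per_cpu):
    cores = math.ceil(cpus_per_task / threads_per_core_limit)
    cpus_allocated = cores * HW_THREADS            # whole cores, all threads
    usable = cores * threads_per_core_limit        # only these count for memory
    return cpus_allocated, min(usable, cpus_allocated) * mem_per_cpu

print(alloc_with_limit(2, 1, 10))  # matches TRES cpu=8,mem=20M above
```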

Let me know if that helps, and please ask further questions. I believe it's better to discuss than to send more very complicated examples, since every option added makes the case grow exponentially in complexity.

cheers,
Marcin
Comment 2 Marcin Stolarek 2022-08-02 05:48:42 MDT
Pascal,

Is the explanation clear for you? Does it raise any additional questions?

cheers,
Marcin
Comment 3 Pascal 2022-08-05 02:52:39 MDT
Hi Marcin, thanks for the info. I've been busy with other stuff, but if I have any further questions, I'll try to post them soon.
Cheers,
Pascal
Comment 4 Marcin Stolarek 2022-08-22 05:09:16 MDT
Pascal, Did you have some time to take a look at the case?
Comment 5 Pascal 2022-08-23 01:15:10 MDT
Hi Marcin,
So I have looked over the ticket, and I think the underlying issue is that --mem-per-cpu is too fluid. I know you mentioned it before, but consider the simple examples you show, and cases like:

srun -p debug --ntasks=1 --mem-per-cpu=128 --threads-per-core=1 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID '
   NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=3 ReqB:S:C:T=0:0:*:1
   TRES=cpu=6,mem=384M,node=1,billing=6

srun -p debug --ntasks=1 --mem-per-cpu=128 --threads-per-core=2 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID'
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=3 ReqB:S:C:T=0:0:*:2
   TRES=cpu=4,mem=512M,node=1,billing=4

Someone reading the request would assume that I am only asking for 3 execution threads, and so the memory should always be 384M, whether it is spread over 6 or 4 virtual cores.
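The two TRES results above can be reproduced with the same full-core model on a 2-thread-per-core node. This is an illustrative sketch of the observed behavior (the `tres` helper is hypothetical, not Slurm code):

```python
import math

HW_THREADS = 2  # assumption: 2 hardware threads per core on this node

def tres(cpus_per_task, tpc_limit, mem_per_cpu):
    cores = math.ceil(cpus_per_task / tpc_limit)
    cpus = cores * HW_THREADS                 # whole cores allocated
    usable = min(cores * tpc_limit, cpus)     # CPUs charged for memory
    return cpus, usable * mem_per_cpu

print(tres(3, 1, 128))  # --threads-per-core=1: cpu=6,  mem=384M
print(tres(3, 2, 128))  # --threads-per-core=2: cpu=4,  mem=512M
```

This makes the complaint concrete: the same 3-thread request is charged 384M or 512M depending only on the threads-per-core setting.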

The other issue highlighted here is that I can run something like 
srun -p debug --nodes=1 --ntasks=34 --mem-per-cpu=1850 --threads-per-core=1 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID'

and given the configuration settings, this should be fine: it should reserve 3*2*34 = 204 virtual cores (102 physical cores) and 102*1850 = 188700 M of memory. However, the request fails with
srun: error: Unable to allocate resources: Requested node configuration is not available

If I use --threads-per-core=2, and it then reserves round_up_to_nearest_even(3*2*34) = 204 virtual cores, it is fine and gives

srun -p debug --nodes=1 --ntasks=34 --mem-per-cpu=1850 --threads-per-core=2 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID'

   NumNodes=1 NumCPUs=204 NumTasks=34 CPUs/Task=6 ReqB:S:C:T=0:0:*:2
   TRES=cpu=204,mem=188700M,node=1,billing=204


Even weirder, now that I have retested it, is that --mem-per-cpu requests < 1850 seem to hang indefinitely. Or at least, values in the range [1802,1840] just sit indefinitely. Why? It does suggest underlying bugs.

I will test other systems with the same version of Slurm.

Cheers,
Pascal
Comment 6 Marcin Stolarek 2022-10-14 00:37:58 MDT
Pascal,

Sorry for the long delay. I wanted to keep this in sync with Bug 14397, where we had a longer internal discussion and had to spend some time on code analysis to answer the potential RFE question.

>Someone reading the request would assume that I am only asking for 3 execution threads and so memory should always be 384, whether or not it is spread on 6 or 4 virtual cores. 
At the allocation stage Slurm will only assign full cores to jobs, so if you have 2 hyper-threads (HT) per core, requesting 3 CPUs per task with a limit of 2 threads per core requires an allocation of 4 CPUs (2 full cores).

If you limit the use of HT per core to 1, then 3 cores, which in your configuration means 6 CPUs, are required to fulfill the allocation spec.

Since you're testing with an allocating srun, those values are automatically inherited by the running step; however, it doesn't have to be like that. For instance, one can specify resources for the allocation with sbatch/salloc and then use only a subset of them in the steps running inside the allocation.

I see that it may be confusing, but if we decided to change it, probably even more users of the option would get results they don't expect.

A job specification may contain many parameters, and when those aren't direct, the "interpretation" may always differ. In those cases (not only related to --mem-per-cpu) we always advise users to specify directly what they need. For instance, if the user knows the amount of memory required per node, it's best to state it directly using --mem.

> [...]reserve 3*2*34 virtual cores = 204 virtual cores (102 physical cores)
Since you've specified --threads-per-core=1, only one HT per core may be allocated, which requires 102 physical cores on the node; and since only full cores are allocated, you get 204 CPUs (and you're accounted for that amount when you look at sacct).
When you let the job use 2 threads per core, the allocation is smaller in terms of the number of CPUs given to the user. The user isn't accounted for additional CPUs, and the number of CPUs considered for the memory calculation is smaller.


>Even weirder for me now having retested it is that mem-per-cpu requests < 1850 seem to hang indefinitely. Or least there are values between [1802,1840] just sit indefinitely. Why? It does suggest underlying bugs. 
I'd need to see the node definitions to comment on that.

You can use the `--test-only`[1] option of srun to get an estimate on when the job can start.

cheers,
Marcin
[1]https://slurm.schedmd.com/srun.html#OPT_test-only
Comment 7 Marcin Stolarek 2022-11-18 11:43:50 MST
Pascal,

Any update from your side?

cheers,
Marcin
Comment 8 Pascal 2022-11-21 01:14:33 MST
(In reply to Marcin Stolarek from comment #7)
> Pascal,
> 
> Any update from your side?
> 
> cheers,
> Marcin

Hi Marcin, 
So we've migrated to 22.05.2 on Pawsey systems (and will eventually need to update to 22.05.6 based on the bug fixes) and have also updated hwloc so that Slurm is now L3-cache aware, in that we can use l3_cache_as_socket.

I have been busy getting Setonix's software stack up and running but I will quickly note that the computation of memory and billing still seems confusing to me. 


Sometimes I understand the underlying calculation. Example:
$ srun -p debug --nodes=1 --ntasks=34 --mem-per-cpu=50 --threads-per-core=1 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID '

will report 

   NumNodes=1 NumCPUs=204 NumTasks=34 CPUs/Task=3 ReqB:S:C:T=0:0:*:1
   TRES=cpu=204,mem=5100M,node=1,billing=204


$ srun -p debug --nodes=1 --ntasks=34 --mem-per-cpu=50 --threads-per-core=2 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID'

reports 
   NumNodes=1 NumCPUs=136 NumTasks=34 CPUs/Task=3 ReqB:S:C:T=0:0:*:2
   TRES=cpu=136,mem=6800M,node=1,billing=136


So it does look like the second case is reserving 4 virtual CPUs per MPI task, even though 3*34 = 102 is an even number; hence the 136 CPUs instead of the 102 CPUs it should be reserving. In the former case the total memory is 102 * 50 = 5100. The latter asks for more total memory because, even though the request needs only 102 CPUs, it rounds the CPUs per task up to a multiple of 2 and gets 136 * 50 = 6800.
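Both whole-job results above follow from the same per-task rounding. The sketch below is a hypothetical model of the behavior described in this comment, not Slurm code:

```python
import math

HW_THREADS = 2  # assumption: 2 hardware threads per core
NTASKS = 34

def job_tres(cpus_per_task, tpc_limit, mem_per_cpu):
    cores = math.ceil(cpus_per_task / tpc_limit)   # cores per task
    cpus = cores * HW_THREADS * NTASKS             # CPUs allocated (whole cores)
    usable = min(cores * tpc_limit, cores * HW_THREADS) * NTASKS
    return cpus, usable * mem_per_cpu              # (NumCPUs, total memory MB)

print(job_tres(3, 1, 50))  # --threads-per-core=1: cpu=204, mem=5100M
print(job_tres(3, 2, 50))  # --threads-per-core=2: cpu=136, mem=6800M
```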

The issue here is that the calculated memory request appears, at first glance, to be at odds with the CPU reservation.

Now, updating the memory request while keeping the CPU requests the same leads to more confusion (and the mem-per-cpu bug cropping up):

$ srun -p debug --nodes=1 --ntasks=34 --mem-per-cpu=1850 --threads-per-core=2 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID'

reports 
   NumNodes=1 NumCPUs=204 NumTasks=34 CPUs/Task=6 ReqB:S:C:T=0:0:*:2
   TRES=cpu=204,mem=188700M,node=1,billing=204
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=3 MinMemoryCPU=925M MinTmpDiskNode=0

$ srun -p debug --nodes=1 --ntasks=34 --mem-per-cpu=1850 --threads-per-core=1 --cpus-per-task=3 /bin/bash -c 'scontrol show job $SLURM_JOB_ID'

fails, but the memory should have been calculated as 1850*102 = 188700.

Likely I'll wait until we get the fully patched version to see how it behaves, and continue to ask users to specify --mem when possible.

Happy to close the ticket and reopen it if necessary once we have the latest version. 

Cheers,
Pascal
Comment 9 Marcin Stolarek 2022-11-21 02:25:15 MST
Pascal,

I agree that one may want to differentiate the meaning of --mem-per-cpu between the allocation and the step, especially when thinking about it in terms of DefMemPerCPU set in slurm.conf.

That's the reason why we considered the potential enhancement from Bug 14397 (development of something like --mem-per-core / DefMemPerCore) as a possible way to go.

I'll mark the ticket as resolved now; as agreed, please reopen if you have any questions.

cheers,
Marcin