Ticket 6950 - Slurm only honors DefMemPerCPU of the first partition when submitting to more than one partition
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 18.08.6
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-05-02 06:37 MDT by Tim Ehlers
Modified: 2019-10-29 01:50 MDT

See Also:
Site: GWDG
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 19.05.0rc2, 20.02.0pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (2.74 KB, text/plain)
2019-05-08 06:22 MDT, Tim Ehlers
Details
partitions.conf (1.76 KB, text/plain)
2019-05-08 06:23 MDT, Tim Ehlers
Details
nodes.conf (2.29 KB, text/plain)
2019-05-08 06:31 MDT, Tim Ehlers
Details

Description Tim Ehlers 2019-05-02 06:37:51 MDT
Dear Slurm Team,

we have a problem with DefMemPerCPU across different partitions. For example, we have 2 partitions, "medium-fmz" and "medium-fas":

gwdu105:3 14:17:38 ~ # scontrol show partition medium-fas
PartitionName=medium-fas
   AllowGroups=ALL AllowAccounts=ALL AllowQos=normal,long,short
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=dmp[011-082]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=1728 TotalNodes=72 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=5333 MaxMemPerNode=UNLIMITED

gwdu105:3 14:20:12 ~ # scontrol show partition medium-fmz
PartitionName=medium-fmz
   AllowGroups=ALL AllowAccounts=ALL AllowQos=normal,long,short
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=gwdd[001-168],gwdd[173-176]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=3440 TotalNodes=172 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=3200 MaxMemPerNode=UNLIMITED


The nodes have different amounts of memory and different numbers of cores. We default to the memory per CPU of each node class: for "medium-fas" this is 5333, for "medium-fmz" it is 3200. This works when submitting to one of these partitions:
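For reference, these per-partition defaults are declared along the following lines in a partitions.conf (a sketch based on the scontrol output below; other options omitted):

```
PartitionName=medium-fas Nodes=dmp[011-082] DefMemPerCPU=5333 MaxTime=2-00:00:00
PartitionName=medium-fmz Nodes=gwdd[001-168],gwdd[173-176] DefMemPerCPU=3200 MaxTime=2-00:00:00
```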

tehlers@gwdu101:..users/tehlers> srun --pty -p medium-fas -t 05:00 -n 1 -N 1 --qos=short bash
tehlers@dmp014:..users/tehlers> scontrol show job $SLURM_JOB_ID | grep MinMemoryCPU
   MinCPUsNode=1 MinMemoryCPU=5333M MinTmpDiskNode=0
tehlers@dmp014:..users/tehlers> cat /sys/fs/cgroup/memory/slurm/uid_$UID/job_$SLURM_JOB_ID/memory.limit_in_bytes
5592055808

tehlers@gwdu101:..users/tehlers> srun --pty -p medium-fmz -t 05:00 -n 1 -N 1 --qos=short bash
tehlers@gwdd106:..users/tehlers> scontrol show job $SLURM_JOB_ID | grep MinMemoryCPU
   MinCPUsNode=1 MinMemoryCPU=3200M MinTmpDiskNode=0
tehlers@gwdd106:..users/tehlers> cat /sys/fs/cgroup/memory/slurm/uid_$UID/job_$SLURM_JOB_ID/memory.limit_in_bytes
3355443200
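In both cases the cgroup limit is exactly DefMemPerCPU interpreted as MiB. A quick check of the arithmetic (hypothetical helper, single-CPU job):

```python
# DefMemPerCPU is given in MiB; for a 1-CPU job the cgroup
# memory.limit_in_bytes should be DefMemPerCPU * 1024 * 1024.
MIB = 1024 * 1024

def cgroup_limit_bytes(def_mem_per_cpu_mib, cpus=1):
    """Expected memory.limit_in_bytes for a job using this default."""
    return def_mem_per_cpu_mib * cpus * MIB

print(cgroup_limit_bytes(5333))  # medium-fas -> 5592055808
print(cgroup_limit_bytes(3200))  # medium-fmz -> 3355443200
```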

When I submit to both partitions, Slurm uses whichever partition has free nodes first. To be able to steer this, however, I submit to both partitions but request a specific node, "dmp014", from the "medium-fas" partition:

tehlers@gwdu101:..users/tehlers> srun --pty -w dmp014 -p medium-fmz,medium-fas -t 05:00 -n 1 -N 1 --qos=short bash
tehlers@dmp014:..users/tehlers> cat /sys/fs/cgroup/memory/slurm/uid_$UID/job_$SLURM_JOB_ID/memory.limit_in_bytes
3355443200
tehlers@dmp014:..users/tehlers> scontrol show job $SLURM_JOB_ID | grep MinMemoryCPU
   MinCPUsNode=1 MinMemoryCPU=3200M MinTmpDiskNode=0


As you can see, the allocated amount of memory is *not* from "medium-fas", but from "medium-fmz"! We already found out that this depends on the order of the partitions in the request string:

tehlers@gwdu101:..users/tehlers> srun --pty -w dmp014 -p medium-fas,medium-fmz -t 05:00 -n 1 -N 1 --qos=short bash
tehlers@dmp014:..users/tehlers> scontrol show job $SLURM_JOB_ID | grep MinMemoryCPU
   MinCPUsNode=1 MinMemoryCPU=5333M MinTmpDiskNode=0
tehlers@dmp014:..users/tehlers> cat /sys/fs/cgroup/memory/slurm/uid_$UID/job_$SLURM_JOB_ID/memory.limit_in_bytes
5592055808


This seems to be a bug: DefMemPerCPU is not updated after the actual partition is chosen when the job starts. Slurm always uses the definition from the first partition in the submitted list, and this is never corrected later.
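The observed behavior can be modeled with a small sketch (hypothetical Python, not Slurm's actual code): the default is resolved once from the head of the partition list instead of from the partition that ends up holding the allocated node:

```python
# Illustrative model of the observed behavior; names and data
# structures are hypothetical, not Slurm internals.
PARTITIONS = {
    "medium-fas": {"nodes": {"dmp014"}, "def_mem_per_cpu": 5333},
    "medium-fmz": {"nodes": {"gwdd106"}, "def_mem_per_cpu": 3200},
}

def buggy_default_mem(partition_list, chosen_node):
    # Bug: the default comes from the first partition in the request
    # string, regardless of which partition the node belongs to.
    first = partition_list[0]
    return PARTITIONS[first]["def_mem_per_cpu"]

def fixed_default_mem(partition_list, chosen_node):
    # Expected: resolve the default from the partition that actually
    # contains the allocated node.
    for name in partition_list:
        if chosen_node in PARTITIONS[name]["nodes"]:
            return PARTITIONS[name]["def_mem_per_cpu"]
    raise ValueError("node not in any requested partition")

# dmp014 lives in medium-fas, but with "-p medium-fmz,medium-fas"
# the buggy path applies medium-fmz's default:
print(buggy_default_mem(["medium-fmz", "medium-fas"], "dmp014"))  # 3200
print(fixed_default_mem(["medium-fmz", "medium-fas"], "dmp014"))  # 5333
```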

Is this a known issue? Could this be fixed in 19.05?

Thanks

Tim Ehlers
Comment 5 Alejandro Sanchez 2019-05-08 06:06:02 MDT
Hi Tim,

I can reproduce this in 19.05 as well.

There's historically been a lot of discussion[1] around this problem (I guess that's why Kilian added himself to CC since he was involved in some of them).

Can you attach your current slurm.conf? I'm interested in a few options like EnforcePartLimits.

[1] Some related commits and bugs:
17.11.7 https://github.com/SchedMD/slurm/commit/bf4cb0b1b0

17.11.8 https://github.com/SchedMD/slurm/commit/f07f53fc13

17.11.8 https://github.com/SchedMD/slurm/commit/d52d8f4f0c
Comment 6 Tim Ehlers 2019-05-08 06:22:41 MDT
Created attachment 10153 [details]
slurm.conf
Comment 7 Tim Ehlers 2019-05-08 06:23:18 MDT
Created attachment 10154 [details]
partitions.conf
Comment 8 Tim Ehlers 2019-05-08 06:31:07 MDT
Created attachment 10155 [details]
nodes.conf

Sure, appended (3 files).

To explain: we use "dummy" partitions like "medium" that users are supposed to submit to, and we check for these partitions in "job_submit.lua". If a job is submitted to "medium", the submit string is changed to "-p medium-fas,medium-fmz", as you advised us in the course in Goettingen.

If we now can't honor DefMemPerCPU per partition, the whole mechanism would be kind of useless... :(
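For context, the routing described above behaves roughly like this (a Python model for illustration only; the real hook is written in Lua as job_submit.lua, and the mapping shown is assumed from the description):

```python
# Model of the job_submit.lua routing: jobs submitted to a "dummy"
# partition are rewritten to a comma-separated list of real partitions.
ROUTING = {
    "medium": "medium-fas,medium-fmz",
}

def route_partition(requested):
    """Return the partition string the scheduler should actually see."""
    return ROUTING.get(requested, requested)

print(route_partition("medium"))      # rewritten to "medium-fas,medium-fmz"
print(route_partition("medium-fas"))  # direct submissions pass through
```

With the reported bug, every job routed this way gets the DefMemPerCPU of whichever real partition happens to be listed first, which defeats the point of per-partition defaults.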

Best
Comment 10 Alejandro Sanchez 2019-05-10 03:56:40 MDT
Hi. Just as an update I've triggered the review process for a patch for 19.05.
Comment 11 Tim Ehlers 2019-05-10 03:58:02 MDT
(In reply to Alejandro Sanchez from comment #10)
> Hi. Just as an update I've triggered the review process for a patch for
> 19.05.

Thanks!
Comment 17 Alejandro Sanchez 2019-05-22 04:09:23 MDT
Tim,

this has been fixed in the following commit, available since Slurm 19.05.0rc2:

https://github.com/SchedMD/slurm/commit/8a1e5a5250b3ce469c