Hello, I am trying to set RealMemory on my nodes to a bit lower than what is actually available, but I can't get Slurm to pick up the change. Is there any other option that I need to change, or am I missing something here?

[ghpc1 root@nb001 ~]# grep -i realmem /etc/slurm/slurm.conf
NodeName=nc[001-291] CoresPerSocket=28 RealMemory=254850 Sockets=2 ThreadsPerCore=1
[ghpc1 root@nb001 ~]# ssh nc291 free -m | grep -i Mem
Mem:         257850       11072      245499         144        1277      245013
[ghpc1 root@nb001 ~]# sinfo -lNe | grep -i 291
nc291    1    defq*    idle   56   2:28:1  257850   762723      1   (null) none

I have tried bouncing slurmctld and slurmd, but I can't get the RealMemory setting to stick. The sockets/cores settings update just fine. Here is my slurm.conf:

# cat /etc/slurm/slurm.conf
#
# See the slurm.conf man page for more information.
#
ClusterName=SLURM_CLUSTER
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave
SlurmdSpoolDir=/cm/local/apps/slurm/var/spool
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/cgroup
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=2
#MaxJobCount=
MaxJobCount=2000000
MaxArraySize=500000
#PlugStackConfig=
PlugStackConfig=/etc/slurm/plugstack.conf
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#SrunProlog=
#SrunEpilog=
TaskProlog=/cm/local/apps/slurm/var/prologs/user_prolog.sh
#TaskEpilog=
TaskPlugin=task/cgroup,task/affinity
#TrackWCKey=no
TreeWidth=18
#TmpFs=
#UsePAM=
PrologFlags=contain
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
MessageTimeout=30
#
# SCHEDULING
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
SchedulerParameters=max_rpc_cnt=20,sched_interval=10,bf_interval=30,bf_window=20160,kill_invalid_depend
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=100000
PriorityWeightAge=10000
PriorityWeightPartition=0
PriorityWeightJobSize=1000
PriorityWeightQOS=100000
PriorityMaxAge=7-0
PriorityFlags=FAIR_TREE
#Default Memory
DefMemPerCPU=4096
#
# LOGGING
SlurmctldDebug=4
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=4
SlurmdLogFile=/var/log/slurmd
DebugFlags=ElasticSearch
#JobCompType=jobcomp/filetxt
#JobCompLoc=/cm/local/apps/slurm/var/spool/job_comp.log
JobCompType=jobcomp/elasticsearch
JobCompLoc=http://elasticsearch.marathon.mesos.ghpc1.sc1.roche.com:9200
#JobCompLoc=http://elasticsearch.marathon.mesos.hpct1.sc1.roche.com:9200
#
# PROFILING
AcctGatherProfileType=acct_gather_profile/influxdb
#AcctGatherInfinibandType=acct_gather_infiniband/ofed
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=task=15
AccountingStorageEnforce=qos,limits,associations
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStorageTRES=gres/gpu
# AccountingStorageLoc=slurm_acct_db
# AccountingStoragePass=SLURMDBD_USERPASS
##
## Job Submit plugins
###
#
#JobSubmitPlugins=lua
##
## Reboot nodes
###
#
RebootProgram="/usr/bin/logger -p user.crit 'Slurm rebooting Node!!' && /bin/echo 1 > /proc/sys/kernel/sysrq && /bin/echo b > /proc/sysrq-trigger"
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
# Scheduler
SchedulerType=sched/backfill
# Master nodes
ControlMachine=nb001
ControlAddr=nb001
BackupController=nb002
BackupAddr=nb002
AccountingStorageHost=nb001
# Nodes
NodeName=ni003
NodeName=nc[001-291] CoresPerSocket=28 RealMemory=254850 Sockets=2 ThreadsPerCore=1
NodeName=nh[001-006] CoresPerSocket=44 Sockets=2 ThreadsPerCore=1
# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=nc[001-291]
PartitionName=himem Default=NO MinNodes=1 AllowGroups=ALL DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=23000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=nh[001-006]
# Generic resources types
GresTypes=gpu,mic
# Epilog/Prolog parameters
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Fast Schedule option
FastSchedule=0
# Power Saving
SuspendTime=-1 # this disables power saving
SuspendTimeout=30
ResumeTimeout=60
SuspendProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweroff
ResumeProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweron
# END AUTOGENERATED SECTION -- DO NOT REMOVE
Looks like this was because FastSchedule was set to 0 instead of 1. After changing this setting we are now good. Feel free to close this request.

Regards,
-Simran
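For anyone hitting the same thing: with FastSchedule=0, slurmctld schedules against the resources each slurmd actually detects and reports at registration, so the RealMemory value in slurm.conf is ignored in favor of the node's real memory. With FastSchedule=1 the slurm.conf values are authoritative. A minimal sketch of the change (same node line as in the config above):

```
# /etc/slurm/slurm.conf
FastSchedule=1   # trust slurm.conf values instead of slurmd-reported hardware
NodeName=nc[001-291] CoresPerSocket=28 RealMemory=254850 Sockets=2 ThreadsPerCore=1
```

After editing, `scontrol reconfigure` pushes the change out, and `sinfo -lNe | grep nc291` should then show 254850 in the MEMORY column.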
Yep, that would do it. You might also want to look at the MemSpecLimit option as an alternative approach, if you have a reason to use FastSchedule=0. Marking resolved/infogiven.

- Tim
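To expand on that: MemSpecLimit reserves a fixed amount of memory per node for system and daemon use, so jobs can only allocate RealMemory minus the reservation. That gives the same headroom without overriding the detected memory. A sketch, where the 3000 MB reservation is purely an illustrative figure (enforcement relies on the cgroup task plugin, which this config already loads):

```
# /etc/slurm/slurm.conf -- reserve ~3 GB per node for the OS (example value)
NodeName=nc[001-291] CoresPerSocket=28 Sockets=2 ThreadsPerCore=1 RealMemory=257850 MemSpecLimit=3000
```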