Ticket 2275 - scontrol show node: AllocMem does not report usage
Summary: scontrol show node: AllocMem does not report usage
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 14.11.6
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-12-22 00:37 MST by Jamal Uddin
Modified: 2016-01-21 08:23 MST

See Also:
Site: DANA
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Jamal Uddin 2015-12-22 00:37:19 MST
When I issue the command "scontrol show nodes" correct CPU usage is reported, however, memory usage is shown as zero (AllocMem=0).
Comment 1 David Bigagli 2015-12-22 00:46:42 MST
AllocMem indicates the memory allocated by Slurm on the node based
on the running jobs. You also need your SelectTypeParameters in
slurm.conf to include memory, for example: SelectTypeParameters=CR_Core_Memory

Then, when you submit a job asking for 100MB, you will see it allocated
on the node.
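For reference, the relevant slurm.conf lines would look something like the following sketch (the node name and memory value are placeholders for illustration, not taken from this ticket):

```
# Track both cores and memory as consumable resources
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
# RealMemory must be set for memory tracking to be meaningful
NodeName=node001 RealMemory=24000
```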

$ sbatch -o /dev/null --mem=100 sleepme 3600
Submitted batch job 11564
david@prometeo ~/slurm/work $ scontrol show node
NodeName=prometeo Arch=x86_64 CoresPerSocket=4
   CPUAlloc=2 CPUErr=0 CPUTot=8 CPULoad=0.08 Features=(null)
   Gres=gpu:4
   NodeAddr=prometeo NodeHostName=prometeo Version=14.11
   OS=Linux RealMemory=24000 AllocMem=100 Sockets=1 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1
   BootTime=2015-11-16T14:43:00 SlurmdStartTime=2015-12-22T15:46:57
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Can you please append your slurm.conf so we can examine it?

David
Comment 2 Jamal Uddin 2015-12-22 01:37:35 MST
Hi David,

Below is the content of my slurm.conf file. Slurm does not accept SelectTypeParameters=CR_Core_Memory in slurm.conf; when I set it, I get the following error:
slurm_load_node error: Unable to contact slurm controller (connect failure)

Thanks,
Jamal


ClusterName=XXXXXXXXXXXXXXX
SlurmUser=slurm
SlurmctldPort=XXXX
SlurmdPort=XXXX
AuthType=auth/munge
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
ProctrackType=proctrack/cgroup
CacheGroups=0
ReturnToService=2
TaskPlugin=task/cgroup
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
SchedulerType=sched/builtin
ControlMachine=XXXXXXXXXXXXXX
ControlAddr=XXXXXXXXXXXXXX
AccountingStorageHost=XXXXXXXXXXX
NodeName=XXXXXXX[001-500]
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO State=UP Nodes=XXXXXXX[001-500]
GresTypes=gpu,mic
PrologSlurmctld=/cmd/scripts/prolog-prejob
Prolog=/cmd/scripts/prolog
Epilog=/cmd/scripts/epilog
FastSchedule=0
SuspendTime=-1 # this disables power saving
SuspendTimeout=30
ResumeTimeout=60
SuspendProgram=/apps/cluster-tools/wlm/scripts/slurmpoweroff
ResumeProgram=/apps/cluster-tools/wlm/scripts/slurmpoweron




Comment 3 David Bigagli 2015-12-22 22:17:00 MST
Hi Jamal,
        if you cannot connect to your controller, it means either the controller is down
or you are using a different port number than the one specified in
slurm.conf (the port slurmctld is bound to). Could you please verify that
slurm.conf is the same on all nodes and that slurmctld is running?
To check the port bound by slurmctld, use the lsof command:
lsof -p <slurmctld pid>
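As a quick reachability check from a client node, a small sketch like the following could also help (the host and port here are assumptions; use the ControlMachine/ControlAddr and SlurmctldPort values from your slurm.conf):

```python
import socket

def slurmctld_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    host and port are placeholders; read them from your slurm.conf
    (ControlMachine / ControlAddr and SlurmctldPort).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unresolvable
        return False
```

If this returns False on a compute node while slurmctld is running on the controller, the port number or host name in that node's slurm.conf likely differs from the one slurmctld is bound to.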

David
Comment 4 Jamal Uddin 2015-12-23 00:13:57 MST
David,

The controller and ports are okay, and all nodes have the same slurm.conf file. I can run jobs using Slurm. The controller problem appeared when I used a certain combination of "SelectType" and "SelectTypeParameters" in slurm.conf. I have not been able to see any change in AllocMem yet.

Below is the output from the "scontrol show node" command. As you can see, CPULoad=14.36 while AllocMem=0.

> scontrol show node xxxxxx003
NodeName=xxxxxx003 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=24 CPUErr=0 CPUTot=24 CPULoad=14.36 Features=(null)
   Gres=(null)
   NodeAddr=xxxxxx003 NodeHostName=xxxxxx003 Version=14.11
   OS=Linux RealMemory=129090 AllocMem=0 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=922771 Weight=1
   BootTime=2015-12-17T09:26:29 SlurmdStartTime=2015-12-17T09:33:21
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Thanks,
Jamal

Comment 5 David Bigagli 2015-12-27 21:03:06 MST
Hi,
   can you show me which configuration is giving you problems?
You need to have SelectType=select/cons_res and SelectTypeParameters
that includes memory, for example CR_CPU_Memory. When you change
SelectTypeParameters, you then have to restart the slurmctld and the slurmds.

David
Comment 6 Tim Wickberg 2016-01-04 07:21:01 MST
Can you elaborate on the problems you see when setting the SelectType? Please note that you will have to restart slurmctld, as well as slurmd on the nodes, for those changes to take effect.

Also note that the AllocMem line does not display the node's current real memory usage; it displays the memory currently allocated through Slurm to jobs running on that node, and does not reflect the load on the machine itself.
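One hypothetical way to see that distinction at a glance is to parse the KEY=VALUE pairs out of the scontrol output and compare AllocMem against CPULoad directly (a quick sketch for illustration, not part of Slurm):

```python
def parse_scontrol_node(output):
    """Parse whitespace-separated KEY=VALUE tokens from
    `scontrol show node` output into a dict of strings."""
    fields = {}
    for token in output.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not KEY=VALUE pairs
            fields[key] = value
    return fields
```

With the node record shown in comment 4, this would yield fields["CPULoad"] == "14.36" while fields["AllocMem"] == "0": the node is busy, but no memory has been allocated through Slurm because memory is not a tracked resource in this configuration.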

- Tim
Comment 7 Moe Jette 2016-01-21 08:23:54 MST
I presume that you did not have SelectTypeParameters configured to track memory. If that is not the case, please re-open the ticket and provide your configuration file and logs.