When I issue the command "scontrol show nodes" correct CPU usage is reported, however, memory usage is shown as zero (AllocMem=0).
The AllocMem field indicates the memory allocated by Slurm on the node based on the running jobs. You also need your SelectTypeParameters in slurm.conf configured to use memory, for example:

SelectTypeParameters=CR_Core_Memory

Then when you submit a job asking for 100MB you will see it allocated on the node:

$ sbatch -o /dev/null --mem=100 sleepme 3600
Submitted batch job 11564

david@prometeo ~/slurm/work
$ scontrol show node
NodeName=prometeo Arch=x86_64 CoresPerSocket=4
   CPUAlloc=2 CPUErr=0 CPUTot=8 CPULoad=0.08 Features=(null) Gres=gpu:4
   NodeAddr=prometeo NodeHostName=prometeo Version=14.11
   OS=Linux RealMemory=24000 AllocMem=100 Sockets=1 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1
   BootTime=2015-11-16T14:43:00 SlurmdStartTime=2015-12-22T15:46:57
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Can you please attach your slurm.conf so we can examine it?

David
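To script a check like the one above, the AllocMem value can be pulled out of the `scontrol show node` output. This is a hedged sketch: the sample line is inlined so the snippet runs anywhere, and on a real cluster you would pipe the actual command (`scontrol show node <name>`) instead.

```shell
# Sketch: extract the AllocMem value (in MB) from scontrol-style output.
# The sample string below stands in for: scontrol show node <name>
OUT="OS=Linux RealMemory=24000 AllocMem=100 Sockets=1 Boards=1"
ALLOC=$(echo "$OUT" | grep -o 'AllocMem=[0-9]*' | cut -d= -f2)
echo "AllocMem is ${ALLOC} MB"
```

On a live system, an AllocMem that stays at 0 while jobs run is the symptom discussed in this ticket: memory is not part of the selection criteria.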
Hi David,

Below is the content of my slurm.conf file. Slurm does not accept SelectTypeParameters=CR_Core_Memory in slurm.conf; I get the following error:

slurm_load_node error: Unable to contact slurm controller (connect failure)

Thanks,
Jamal

ClusterName=XXXXXXXXXXXXXXX
SlurmUser=slurm
SlurmctldPort=XXXX
SlurmdPort=XXXX
AuthType=auth/munge
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
ProctrackType=proctrack/cgroup
CacheGroups=0
ReturnToService=2
TaskPlugin=task/cgroup
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
SchedulerType=sched/builtin
ControlMachine=XXXXXXXXXXXXXX
ControlAddr=XXXXXXXXXXXXXX
AccountingStorageHost=XXXXXXXXXXX
NodeName=XXXXXXX[001-500]
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO State=UP Nodes=XXXXXXX[001-500]
GresTypes=gpu,mic
PrologSlurmctld=/cmd/scripts/prolog-prejob
Prolog=/cmd/scripts/prolog
Epilog=/cmd/scripts/epilog
FastSchedule=0
SuspendTime=-1   # this disables power saving
SuspendTimeout=30
ResumeTimeout=60
SuspendProgram=/apps/cluster-tools/wlm/scripts/slurmpoweroff
ResumeProgram=/apps/cluster-tools/wlm/scripts/slurmpoweron
Hi Jamal, if you cannot connect to your controller it means the controller is down, or you are using a different port number than the one specified in slurm.conf (the one slurmctld is bound to). Could you please verify that slurm.conf is the same on all nodes and that your slurmctld is running? To check the port bound by slurmctld, use the lsof command:

lsof -p <slurmctld pid>

David
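The configured port can also be read straight out of slurm.conf and compared against what the daemon is bound to. A hedged sketch follows; a stand-in config file is created so the example is self-contained, and on a real controller you would point CONF at the actual slurm.conf (commonly /etc/slurm/slurm.conf) instead.

```shell
# Sketch: read SlurmctldPort from a slurm.conf-style file.
# A temporary stand-in config is used here so the snippet runs anywhere.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
# trimmed example config
SlurmctldPort=6817
SlurmdPort=6818
EOF
PORT=$(awk -F= '/^SlurmctldPort/ {print $2}' "$CONF")
echo "slurm.conf says slurmctld listens on port $PORT"
# On a live controller, confirm the daemon is actually bound to that port:
#   lsof -p "$(pidof slurmctld)" | grep LISTEN
rm -f "$CONF"
```

If the port from the config and the port the daemon is listening on disagree, clients will report exactly the "Unable to contact slurm controller (connect failure)" error quoted above.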
David,

The controller and ports are okay, and all nodes have the same slurm.conf file. I can run jobs using Slurm. The controller problem appeared when I used a certain combination of "SelectType" and "SelectTypeParameters" in slurm.conf. I have not been able to get any change in AllocMem yet. Below is the output from the "scontrol show nodes" command; as you can see, CPULoad=14.36 while AllocMem=0.

> scontrol show node xxxxxx003
NodeName=xxxxxx003 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=24 CPUErr=0 CPUTot=24 CPULoad=14.36 Features=(null) Gres=(null)
   NodeAddr=xxxxxx003 NodeHostName=xxxxxx003 Version=14.11
   OS=Linux RealMemory=129090 AllocMem=0 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=922771 Weight=1
   BootTime=2015-12-17T09:26:29 SlurmdStartTime=2015-12-17T09:33:21
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Thanks,
Jamal
Hi, can you show me which configuration is giving you problems? You need to have SelectType=select/cons_res, and SelectTypeParameters must include memory, for example CR_CPU_Memory. When you change SelectTypeParameters you then have to restart the slurmctld and the slurmds. David
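Putting this together, the relevant slurm.conf lines would look something like the following sketch. The values are illustrative, not taken from Jamal's (redacted) configuration:

```
# Memory must be part of the selection criteria for AllocMem to be tracked.
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# Nodes must also advertise memory for it to be schedulable, e.g.:
# NodeName=node[001-500] RealMemory=129090
```

As noted above, after changing these settings, restart slurmctld on the controller and slurmd on every compute node.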
Can you elaborate on the problems you have when setting the SelectType? Please note that you will have to restart slurmctld as well as slurmd on the nodes for those changes to take effect. Also note that the AllocMem line does not display the current real memory usage of the node; it displays the memory currently allocated through Slurm to jobs running on that node, and does not reflect the load on the machine itself. - Tim
I presume that you did not have SelectTypeParameters configured to track memory. If that is not the case, please re-open the ticket and provide your configuration file and logs.