Ticket 10161

Summary: NCPUS/NumCPUs shows 2 even when using --cpus-per-task=1 with sbatch
Product: Slurm
Reporter: George Hwa <george.hwa>
Component: Scheduling
Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
Priority: ---
Version: 19.05.0
Hardware: Linux
OS: Linux
Site: KLA-Tencor RAPID

Description George Hwa 2020-11-05 11:49:54 MST
I submitted a simple job with the following command

    sbatch --cpus-per-task=1 sleeper.sh 

sacct shows

(sonic_tf23) [ghwa@rocks7fe fcv_3.6-bkmA_newBC]$ sbatch --cpus-per-task=1 sleeper.sh 
Submitted batch job 16222579
(sonic_tf23) [ghwa@rocks7fe fcv_3.6-bkmA_newBC]$ sacct -o ReqCPUS,ReqTRES,ReqGRES,ReqMem,ReqCPUFreq -j 16222579
               JobID    Elapsed      NCPUS   NTasks    AllocGRES      State            JobName      User  Timelimit   NNodes            NodeList               Start                 End  MaxVMSize     MaxRSS  ReqCPUS    ReqTRES      ReqGRES     ReqMem ReqCPUFreq 
-------------------- ---------- ---------- -------- ------------ ---------- ------------------ --------- ---------- -------- ------------------- ------------------- ------------------- ---------- ---------- -------- ---------- ------------ ---------- ---------- 
            16222579   00:00:12          2                          RUNNING         sleeper.sh      ghwa   01:00:00        1    compute-gpu-12-7 2020-11-05T10:46:25             Unknown                              1 billing=1+                   512Mc    Unknown 


and scontrol show job shows:


(sonic_tf23) [ghwa@rocks7fe fcv_3.6-bkmA_newBC]$ scontrol show job 16222579
JobId=16222579 JobName=sleeper.sh
   UserId=ghwa(5001) GroupId=sonic(21063) MCS_label=N/A
   Priority=3594828827 Nice=0 Account=local QOS=normal WCKey=*default
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:01:39 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2020-11-05T10:46:24 EligibleTime=2020-11-05T10:46:24
   AccrueTime=2020-11-05T10:46:24
   StartTime=2020-11-05T10:46:25 EndTime=2020-11-05T11:46:25 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-11-05T10:46:25
   Partition=snq2 AllocNode:Sid=rocks7fe:8814
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-gpu-12-7
   BatchHost=compute-gpu-12-7
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=1G,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=512M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/gsshare/users/ghwa/sonic3FCV/fcv_3.6-bkmA_newBC/sleeper.sh
   WorkDir=/gsshare/users/ghwa/sonic3FCV/fcv_3.6-bkmA_newBC
   StdErr=/gsshare/users/ghwa/sonic3FCV/fcv_3.6-bkmA_newBC/slurm-16222579.out
   StdIn=/dev/null
   StdOut=/gsshare/users/ghwa/sonic3FCV/fcv_3.6-bkmA_newBC/slurm-16222579.out
   Power=

(sonic_tf23) [ghwa@rocks7fe fcv_3.6-bkmA_newBC]$ cat sleeper.sh 
#!/bin/bash
sleep 600



My question is: is Slurm really allocating 2 CPUs for my job?
Comment 1 George Hwa 2020-11-05 11:51:01 MST
(sonic_tf23) [ghwa@rocks7fe fcv_3.6-bkmA_newBC]$ scontrol show config
Configuration data as of 2020-11-05T10:50:08
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos,safe,wckeys
AccountingStorageHost   = rocks7fe
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2020-10-06T18:36:13
BurstBufferType         = (null)
CheckpointType          = checkpoint/none
ClusterName             = luminizer6
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand
CryptoType              = crypto/munge
DebugFlags              = Backfill,BackfillMap,CPU_Bind,Gres,NO_CONF_HASH,Priority,Steps
DefMemPerNode           = UNLIMITED
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = NO
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FastSchedule            = 1
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = gpu,scn,sln,swn,plx
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 30 sec
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = NoOverMemoryKill
JobCheckpointDir        = /var/spool/slurm.checkpoint
JobCompHost             = rocks7fe
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 60 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Layouts                 = 
Licenses                = (null)
LicensesUsed            = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 150000
MaxJobCount             = 1000000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MemLimitEnforce         = Yes
MessageTimeout          = 10 sec
MinJobAge               = 600 sec
MpiDefault              = none
MpiParams               = (null)
MsgAggregationParams    = (null)
NEXT_JOB_ID             = 16222580
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = /etc/slurm/plugstack.conf
PowerParameters         = (null)
PowerPlugin             = 
PreemptMode             = OFF
PreemptType             = preempt/none
PriorityParameters      = (null)
PriorityType            = priority/basic
PrivateData             = none
ProctrackType           = proctrack/linuxproc
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = (null)
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = /etc/slurm/resumehost.sh
ResumeRate              = 4 nodes/min
ResumeTimeout           = 450 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 2
RoutePlugin             = route/default
SallocDefaultCommand    = (null)
SbcastParameters        = (null)
SchedulerParameters     = bf_max_job_test=1500,bf_interval=10,MessageTimeout=30,max_rpc_cnt=1000,sched_interval=20,default_queue_depth=1500
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
SlurmUser               = root(0)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = rocks7fe(10.2.1.1)
SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 300 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 18.08.0
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurm.state
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = /etc/slurm/suspendhost.sh
SuspendRate             = 4 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 45 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/none
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /state/partition1
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = Yes
TreeWidth               = 50
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 110 percent
WaitTime                = 60 sec
X11Parameters           = (null)

Slurmctld(primary) at rocks7fe is UP
(sonic_tf23) [ghwa@rocks7fe fcv_3.6-bkmA_newBC]$
Comment 2 Michael Hinton 2020-11-05 12:24:48 MST
Hi George,

(In reply to George Hwa from comment #0)
> My question is: is Slurm really allocating 2 CPUs for my job?
Yes. Since you have CR_Core_Memory in your SelectTypeParameters, even if you request only 1 CPU, you are allocated the entire core (2 threads/core, thus 2 CPUs). From https://slurm.schedmd.com/slurm.conf.html#OPT_CR_Core_Memory:

"CR_Core_Memory
Cores and memory are consumable resources. On nodes with hyper-threads, each thread is counted as a CPU to satisfy a job's resource requirement, but multiple jobs are not allocated threads on the same core. The count of CPUs allocated to a job may be rounded up to account for every CPU on an allocated core."

Here's an example of this, from https://slurm.schedmd.com/srun.html#OPT_cpus-per-task:

"For example `srun -c2 --threads-per-core=1 prog` may allocate two cores for the job, but if each of those cores contains two threads, the job allocation will include four CPUs."

Thanks,
-Michael
Comment 4 George Hwa 2020-11-06 08:40:54 MST
Michael,

Got it.
We changed SelectTypeParameters to CR_CPU_Memory, and now the job gets 1 CPU.
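For reference, the change amounts to something like this in slurm.conf (a sketch of the relevant lines only; changing SelectTypeParameters generally requires restarting slurmctld and the slurmd daemons, not just a reconfigure):

    # slurm.conf (sketch; relevant lines only)
    SelectType=select/cons_res
    SelectTypeParameters=CR_CPU_Memory   # was CR_Core_Memory; allocate per hardware thread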


Thanks
George
Comment 5 Michael Hinton 2020-11-06 10:23:28 MST
Ok, great. Just note that we usually recommend CR_Core_Memory, because there can be performance and security concerns when separate jobs are allowed to run on the same core.
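If you stay on CR_Core_Memory, the detailed job view can confirm that concurrent jobs hold distinct cores (a sketch; the job ID is illustrative):

    # -d adds a per-node line with the exact CPU IDs held by the job;
    # under CR_Core_Memory both threads of each allocated core appear:
    scontrol show job -d 16222579 | grep CPU_IDs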
Comment 6 George Hwa 2020-11-06 18:31:49 MST
Going a bit deeper on this topic:

so if all my jobs are single-CPU tasks, Slurm would still only schedule as many jobs per node as there are cores, not as many as there are CPUs, right?
Comment 7 Michael Hinton 2020-11-09 10:24:14 MST
(In reply to George Hwa from comment #6)
> so if all my jobs are single-CPU tasks, Slurm would still only schedule as
> many jobs per node as there are cores, not as many as there are CPUs, right?
If CR_Core_Memory is specified, then yes, it will schedule up to the number of cores. If CR_CPU_Memory is specified, it will schedule up to the number of CPUs.
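As a worked example (assuming a hypothetical node with 2 sockets x 8 cores x 2 threads = 32 CPUs):

    # CR_Core_Memory: at most 16 concurrent single-CPU jobs
    #                 (each is rounded up to a whole core, NumCPUs=2)
    # CR_CPU_Memory:  up to 32 concurrent single-CPU jobs
    #                 (each consumes a single hardware thread)
    sinfo -N -o "%N %c %X:%Y:%Z"   # CPUs and sockets:cores:threads per node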