Ticket 12056

Summary: Reservations and cpus-per-task
Product: Slurm
Reporter: lhuang
Component: reservations
Assignee: Oriol Vilarrubi <jvilarru>
Status: RESOLVED INVALID
Severity: 4 - Minor Issue
Version: 20.11.0
Hardware: Linux
OS: Linux
Site: NY Genome

Description lhuang 2021-07-16 08:11:40 MDT
We've created a reservation with one node. The node only has 20 CPUs, but we found that we can request --cpus-per-task=20 multiple times with srun and all of the jobs start. Is this the expected behavior?

ReservationName=rescomp_4 StartTime=2021-07-16T10:03:00 EndTime=2021-07-16T20:03:00 Duration=10:00:00
   Nodes=pe2cc2-068 NodeCnt=1 CoreCnt=20 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=20
   Users=(null) Groups=(null) Accounts=rescomp Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)


[lhuang@pe2cc2-068 ~]$ squeue -u lhuang
                    JOBID PARTITION NAME                                                  USER ST       TIME TIME_LIMIT  NODES     CPUS MIN_MEMORY      QOS PRIORITY NODELIST(REASON)
                 14372647       dev bash                                                lhuang  R       2:34   10:00:00      1       20      1000M  rescomp   766290 pe2cc2-068
                 14372640       dev bash                                                lhuang  R       7:03   10:00:00      1       20      1000M  rescomp   766290 pe2cc2-068
                 14372639       dev bash                                                lhuang  R       7:12   10:00:00      1       20      1000M  rescomp   766290 pe2cc2-068


[lhuang@pe2cc2-068 ~]$ scontrol show node pe2cc2-068
NodeName=pe2cc2-068 Arch=x86_64 CpuBind=cores CoresPerSocket=10 
   CPUAlloc=20 CPUTot=20 CPULoad=0.01
   AvailableFeatures=v2
   ActiveFeatures=v2
   Gres=(null)
   NodeAddr=pe2cc2-068 NodeHostName=pe2cc2-068 Version=20.11.0
   OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 
   RealMemory=230000 AllocMem=60000 FreeMem=248713 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=3 Owner=N/A MCS_label=N/A
   Partitions=dev 
   BootTime=2021-07-13T13:36:40 SlurmdStartTime=2021-07-13T13:37:31
   CfgTRES=cpu=20,mem=230000M,billing=20
   AllocTRES=cpu=20,mem=60000M
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)
Comment 1 Oriol Vilarrubi 2021-07-16 08:47:32 MDT
Hello,

No, this is not the expected behavior. Could you please send me the output of the following commands:

scontrol show config
scontrol show partitions
scontrol show nodes
sacctmgr show qos rescomp


That will let me check whether something in the configuration is causing this.
Comment 2 lhuang 2021-07-16 08:59:42 MDT
I found the issue. That test node has OverSubscribe=FORCE:4 enabled, which is why we were able to request more resources than were available.
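For reference, OverSubscribe=FORCE:4 allows each allocatable resource (here, each core) in the partition to be shared by up to four jobs at once, so a partition definition along these lines would produce the behavior seen above (a hypothetical slurm.conf fragment; the site's actual configuration was not posted):

```
PartitionName=dev Nodes=pe2cc2-068 OverSubscribe=FORCE:4 State=UP
```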

Closing it out. Thanks.