Ticket 12056

Summary: Reservations and cpus-per-task
Product: Slurm
Reporter: lhuang
Component: reservations
Assignee: Oriol Vilarrubi <jvilarru>
Status: RESOLVED INVALID
Severity: 4 - Minor Issue
Version: 20.11.0
Hardware: Linux
OS: Linux
Site: NY Genome

Description lhuang 2021-07-16 08:11:40 MDT
We've created a reservation with one node. The node only has 20 CPUs, but we found that we can request --cpus-per-task=20 multiple times with srun and all of the jobs start. Is this the expected behavior?

ReservationName=rescomp_4 StartTime=2021-07-16T10:03:00 EndTime=2021-07-16T20:03:00 Duration=10:00:00
   Nodes=pe2cc2-068 NodeCnt=1 CoreCnt=20 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=20
   Users=(null) Groups=(null) Accounts=rescomp Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)


[lhuang@pe2cc2-068 ~]$ squeue -u lhuang
                    JOBID PARTITION NAME                                                  USER ST       TIME TIME_LIMIT  NODES     CPUS MIN_MEMORY      QOS PRIORITY NODELIST(REASON)
                 14372647       dev bash                                                lhuang  R       2:34   10:00:00      1       20      1000M  rescomp   766290 pe2cc2-068
                 14372640       dev bash                                                lhuang  R       7:03   10:00:00      1       20      1000M  rescomp   766290 pe2cc2-068
                 14372639       dev bash                                                lhuang  R       7:12   10:00:00      1       20      1000M  rescomp   766290 pe2cc2-068


[lhuang@pe2cc2-068 ~]$ scontrol show node pe2cc2-068
NodeName=pe2cc2-068 Arch=x86_64 CpuBind=cores CoresPerSocket=10 
   CPUAlloc=20 CPUTot=20 CPULoad=0.01
   AvailableFeatures=v2
   ActiveFeatures=v2
   Gres=(null)
   NodeAddr=pe2cc2-068 NodeHostName=pe2cc2-068 Version=20.11.0
   OS=Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 
   RealMemory=230000 AllocMem=60000 FreeMem=248713 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=3 Owner=N/A MCS_label=N/A
   Partitions=dev 
   BootTime=2021-07-13T13:36:40 SlurmdStartTime=2021-07-13T13:37:31
   CfgTRES=cpu=20,mem=230000M,billing=20
   AllocTRES=cpu=20,mem=60000M
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)
Comment 1 Oriol Vilarrubi 2021-07-16 08:47:32 MDT
Hello,

No, this is not the expected behavior. Could you please send me the output of the following commands:

scontrol show config
scontrol show partitions
scontrol show nodes
sacctmgr show qos rescomp


That will let me check whether something in the configuration is causing this.
Comment 2 lhuang 2021-07-16 08:59:42 MDT
I found the issue. That test node has OverSubscribe=FORCE:4 enabled, which is why we were able to request more resources than were available.
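For reference, OverSubscribe=FORCE:4 allows each allocatable resource (here, each core) in the partition to be shared by up to four jobs at once, so a partition definition along these lines would produce the behavior seen above (a hypothetical slurm.conf fragment; the site's actual configuration was not posted):

```
PartitionName=dev Nodes=pe2cc2-068 OverSubscribe=FORCE:4 State=UP
```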

Closing it out. Thanks.