Ticket 3803

Summary: Heterogeneous allocation randomness
Product: Slurm
Reporter: Amit Kumar <ahkumar>
Component: Scheduling
Assignee: Danny Auble <da>
Status: RESOLVED FIXED
Severity: 2 - High Impact
Priority: ---
CC: ahkumar
Version: 16.05.8
Hardware: Linux
OS: Linux
Site: SMU
Version Fixed: 16.05.08
Attachments: slurm.conf

Description Amit Kumar 2017-05-12 10:54:36 MDT
Dear SchedMD,

We are requesting three different types of nodes from a partition that includes all the nodes in our cluster.

The node types we have are Broadwell nodes (36 CPU cores), KNL nodes (64 cores), and P100 nodes (36 CPU cores and 1 P100).

Here is a sample allocation request that explicitly requests a set of 6 nodes, with the task count (-n) equal to their combined core count. The result is that I am seeing random idle nodes allocated outside of what I had requested:

# salloc -J hybrid -n 272 --exclusive -p defq -w b001,b002,p035,p036,k003,k015 -x admin[01-03]
salloc: Pending job allocation 11534
salloc: job 11534 queued and waiting for resources
salloc: job 11534 has been allocated resources
salloc: Granted job allocation 11534


~]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             11534      defq   hybrid     root  R       0:05     10 b[001-002],k[003,015],login[01-04],p[035-036]


If you notice above, I got allocations on all the nodes I requested, although not all CPU cores on them. Instead I was randomly given a few cores on login[01-04]. I think this is odd.

If I specify the -N flag, then I get BadConstraints as the reason and the allocation remains pending:

JobId=11535 JobName=hybrid
   UserId=root(0) GroupId=root(0) MCS_label=N/A
   Priority=0 Nice=0 Account=root QOS=normal
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2017-05-12T11:03:04 EligibleTime=2017-05-12T11:03:04
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=defq AllocNode:Sid=cm1:12429
   ReqNodeList=b[001-002],k[003,015],p[035-036] ExcNodeList=admin[01-03]
   NodeList=(null)
   NumNodes=6-6 NumCPUs=272 NumTasks=272 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=272,mem=557056,node=6
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/root
   Power=

 
How can I get around this? Any help here is greatly appreciated.

Thank you,
Amit
Comment 1 Danny Auble 2017-05-12 11:05:00 MDT
Amit, could you send your slurm.conf?  

At first glance it is hard to tell if there is an error or not.
Comment 2 Amit Kumar 2017-05-12 11:10:14 MDT
Created attachment 4547 [details]
slurm.conf

Conf file attached
Comment 6 Danny Auble 2017-05-12 11:59:29 MDT
Amit, when I emulate your system I can see it grabbing an extra node, just like you do. But I always see all the nodes being completely allocated (never partially).

I believe your problem lies with

DefMemPerCPU=2048

You will notice your KNL nodes don't have that much memory per CPU (each has 256 CPUs, as we count threads as CPUs). So by default you have 96000/2048 = 46 CPUs allocatable on your KNLs (which is why the job has to grab other nodes).

On your other nodes you get 256000/2048 = 125, which is more than the number of CPUs you have.
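The arithmetic above can be sketched as follows (values are from this ticket's config; the 46 is the integer floor of 96000/2048):

```shell
# How many CPUs' worth of memory DefMemPerCPU=2048 leaves allocatable per node
def_mem_per_cpu=2048   # MB, from slurm.conf
knl_mem=96000          # MB of RealMemory on a KNL node
other_mem=256000       # MB of RealMemory on the Broadwell/P100 nodes

echo $(( knl_mem / def_mem_per_cpu ))    # 46  -> fewer than the 256 CPUs (threads) on a KNL
echo $(( other_mem / def_mem_per_cpu ))  # 125 -> more than the 36 CPUs on those nodes
```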

I always get the correct result when I add --mem=0, as that allocates all the memory on the node.
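Applied to the original request, the workaround is the same salloc line with --mem=0 added (a sketch only, not verified on a live cluster; node and partition names are taken from this ticket):

```shell
# Same request as in the description, but --mem=0 asks for all memory on each
# node, so DefMemPerCPU no longer caps the allocatable CPUs per node.
salloc -J hybrid -n 272 --exclusive --mem=0 -p defq \
       -w b001,b002,p035,p036,k003,k015 -x admin[01-03]
```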

Does this make sense?

Let me know if you need anything else on this.
Comment 7 Amit Kumar 2017-05-12 12:57:19 MDT
Danny, believe me, while we were setting up the configuration file I made a note that I needed to change DefMemPerCPU because it was going to bite me at some point. Thank you for pointing this out!! This has resolved my mistake!!

Regards,
Amit
Comment 8 Danny Auble 2017-05-12 14:06:58 MDT
No problem Amit, glad it was an easy one.

In the future, could you mark bugs with the severities laid out in https://schedmd.com/support.php?

A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system.

I am not sure I would classify this bug as a sev 2. It helps with our SLAs. I do understand it was confusing and annoying, though. You will find we are fairly fast to respond no matter the severity ;).

Thanks!