Ticket 3803

Summary: Heterogeneous allocation randomness
Product: Slurm
Reporter: Amit Kumar <ahkumar>
Component: Scheduling
Assignee: Danny Auble <da>
Status: RESOLVED FIXED
Severity: 2 - High Impact
Priority: ---
CC: ahkumar
Version: 16.05.8
Hardware: Linux
OS: Linux
Site: SMU
Version Fixed: 16.05.08
Attachments: slurm.conf

Description Amit Kumar 2017-05-12 10:54:36 MDT
Dear SchedMD,

We are requesting three different types of nodes from a partition that includes all the nodes in our cluster.

The node types we have are Broadwell nodes (36 CPU cores), KNL nodes (64 cores), and P100 nodes (36 CPU cores and 1 P100).

Here is a sample allocation request that explicitly requests a set of 6 nodes, with the task count (-n) equal to their combined core count. The result is that I am seeing random idle nodes allocated outside of what I had requested:

# salloc -J hybrid -n 272 --exclusive -p defq -w b001,b002,p035,p036,k003,k015 -x admin[01-03]
salloc: Pending job allocation 11534
salloc: job 11534 queued and waiting for resources
salloc: job 11534 has been allocated resources
salloc: Granted job allocation 11534


~]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             11534      defq   hybrid     root  R       0:05     10 b[001-002],k[003,015],login[01-04],p[035-036]


If you notice above, I got allocations on all the nodes I requested, although not all CPU cores on them. Instead I was randomly given a few cores on login[01-04]. I think this is odd.

If I specify the -N flag, then I get BadConstraints as the reason and the allocation remains pending:

JobId=11535 JobName=hybrid
   UserId=root(0) GroupId=root(0) MCS_label=N/A
   Priority=0 Nice=0 Account=root QOS=normal
   JobState=PENDING Reason=BadConstraints Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2017-05-12T11:03:04 EligibleTime=2017-05-12T11:03:04
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=defq AllocNode:Sid=cm1:12429
   ReqNodeList=b[001-002],k[003,015],p[035-036] ExcNodeList=admin[01-03]
   NodeList=(null)
   NumNodes=6-6 NumCPUs=272 NumTasks=272 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=272,mem=557056,node=6
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/root
   Power=

 
How can I get around this? Any help here is greatly appreciated.

Thank you,
Amit
Comment 1 Danny Auble 2017-05-12 11:05:00 MDT
Amit, could you send your slurm.conf?  

At first glance it is hard to tell if there is an error or not.
Comment 2 Amit Kumar 2017-05-12 11:10:14 MDT
Created attachment 4547 [details]
slurm.conf

Conf file attached
Comment 6 Danny Auble 2017-05-12 11:59:29 MDT
Amit, when I emulate your system I can see it grabbing an extra node, just like you do. But I always see all the nodes being completely allocated (never partially).

I believe your problem lies with

DefMemPerCPU=2048

You will notice your KNL nodes don't have that much memory per CPU (each has 256 CPUs, as we count threads as CPUs). So by default you have 96000/2048 = 46 CPUs allocatable on your KNLs (which is why the job has to grab other nodes).

On your other nodes you get 256000/2048 = 125, which is more than the number of CPUs you have.
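The arithmetic above can be sketched as follows (values are from this ticket's config; the 46 is the integer floor of 96000/2048):

```shell
# How many CPUs' worth of memory DefMemPerCPU=2048 leaves allocatable per node
def_mem_per_cpu=2048   # MB, from slurm.conf
knl_mem=96000          # MB of RealMemory on a KNL node
other_mem=256000       # MB of RealMemory on the Broadwell/P100 nodes

echo $(( knl_mem / def_mem_per_cpu ))    # 46  -> fewer than the 256 CPUs (threads) on a KNL
echo $(( other_mem / def_mem_per_cpu ))  # 125 -> more than the 36 CPUs on those nodes
```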

I always get the correct result when I add --mem=0, as that allocates all the memory on the node.
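Applied to the original request, the workaround is the same salloc line with --mem=0 added (a sketch only, not verified on a live cluster; node and partition names are taken from this ticket):

```shell
# Same request as in the description, but --mem=0 asks for all memory on each
# node, so DefMemPerCPU no longer caps the allocatable CPUs per node.
salloc -J hybrid -n 272 --exclusive --mem=0 -p defq \
       -w b001,b002,p035,p036,k003,k015 -x admin[01-03]
```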

Does this make sense?

Let me know if you need anything else on this.
Comment 7 Amit Kumar 2017-05-12 12:57:19 MDT
Danny, believe me, while we were setting up the configuration file I made a note that I needed to change DefMemPerCPU because it was going to bite me at some point. Thank you for pointing this out!! This has resolved my mistake!!

Regards,
Amit
Comment 8 Danny Auble 2017-05-12 14:06:58 MDT
No problem Amit, glad it was an easy one.

In the future, could you mark bugs with the severities laid out in https://schedmd.com/support.php?

A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system.

I am not sure I would classify this bug as a sev 2. It helps with our SLAs. I do understand it was confusing and annoying, though. You will find we are fairly fast to respond no matter the severity ;).

Thanks!