Ticket 8957

Summary: "matching or" constraints don't work as expected
Product: Slurm Reporter: Ahmed Essam ElMazaty <ahmed.mazaty>
Component: Scheduling    Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: alex, bart, cinek, fordste5
Version: 19.05.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=16259
Site: KAUST
Version Fixed: 20.02.5

Description Ahmed Essam ElMazaty 2020-04-29 01:05:43 MDT
Hello,
As stated in the documentation:
"If only one of a set of possible options should be used for all allocated nodes, then use the OR operator and enclose the options within square brackets."
So the allocated node must have one of the constraints inside the square brackets.
However, I see a different behaviour when using this: sometimes I am allocated a node which doesn't have any of the requested constraints.

An example:
[mazatyae@cn509-02-l ~]$ srun -t 1 --pty -n 40 -N 1  --constraint="[cpu_intel_e5_2680_v2|cpu_intel_e5_2670_v2]" bash -l
srun: job 10446359 queued and waiting for resources

# scontrol show job 10446359
JobId=10446359 JobName=bash
   UserId=mazatyae(167627) GroupId=g-mazatyae(1167627) MCS_label=N/A
   Priority=479 Nice=0 Account=default QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2020-04-29T09:54:11 EligibleTime=2020-04-29T09:54:11
   AccrueTime=2020-04-29T09:54:11
   StartTime=2020-04-29T10:44:14 EndTime=2020-04-29T10:45:14 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-04-29T09:54:46
   Partition=batch AllocNode:Sid=cn509-02-l:111903
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=cn605-13-r
   NumNodes=1-1 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=40,mem=80G,node=1,billing=40
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=[cpu_intel_e5_2680_v2|cpu_intel_e5_2670_v2]&nolmem&nogpu DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/home/mazatyae
   Power=


However, node "cn605-13-r" doesn't have any of the requested constraints:
[mazatyae@cn509-02-l ~]$ scontrol show node cn605-13-r
NodeName=cn605-13-r Arch=x86_64 CoresPerSocket=20 
   CPUAlloc=40 CPUTot=40 CPULoad=40.51
   AvailableFeatures=dragon,cpu_intel_gold_6148,skylake,intel,ibex2018,nogpu,nolmem,local_200G,local_400G,local_500G
   ActiveFeatures=dragon,cpu_intel_gold_6148,skylake,intel,ibex2018,nogpu,nolmem,local_200G,local_400G,local_500G
   Gres=(null)
   NodeAddr=cn605-13-r NodeHostName=cn605-13-r Version=19.05.5
   OS=Linux 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019 
   RealMemory=375618 AllocMem=368640 FreeMem=52792 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
   Partitions=batch 
   BootTime=2020-03-21T12:11:10 SlurmdStartTime=2020-03-21T12:29:05
   CfgTRES=cpu=40,mem=375618M,billing=40
   AllocTRES=cpu=40,mem=360G
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


This doesn't happen to me when I use the normal "or" without the square brackets.
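
For reference, a minimal sketch of the two forms, using the same site-specific feature names as above (what the documentation describes, not what I currently observe):

# Plain OR: each allocated node may satisfy a different one of the listed features
srun -N 2 -t 1 --constraint="cpu_intel_e5_2680_v2|cpu_intel_e5_2670_v2" hostname

# Matching OR: every allocated node must satisfy the same single feature from the list
srun -N 2 -t 1 --constraint="[cpu_intel_e5_2680_v2|cpu_intel_e5_2670_v2]" hostname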
Best regards,
Ahmed
Comment 1 Dominik Bartkiewicz 2020-04-29 07:54:52 MDT
Hi

I can reproduce this. I'll let you know when I have a fix.

Dominik
Comment 8 Dominik Bartkiewicz 2020-09-03 07:07:25 MDT
Hi

We have finally committed a fix for this issue:
https://github.com/SchedMD/slurm/commit/bb97ad45
It is in the 20.02 branch and will be included in the 20.02.5 release.
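
For anyone verifying on 20.02.5, the original reproducer can be re-checked roughly like this (a sketch; <jobid> and <node> are placeholders for the newly submitted job and the node it is scheduled on):

srun -t 1 -n 40 -N 1 --constraint="[cpu_intel_e5_2680_v2|cpu_intel_e5_2670_v2]" hostname
# the job record should keep the bracketed expression and show the chosen node
scontrol show job <jobid> | grep -E 'Features|SchedNodeList|NodeList'
# the node's ActiveFeatures should now contain one of the requested features
scontrol show node <node> | grep ActiveFeatures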

Sorry that this took so long.
Closing as resolved/fixed, please reopen if needed.

Dominik