Ticket 5134

Summary: EnforcePartLimits Failure
Product: Slurm
Reporter: Paul Edmon <pedmon>
Component: Scheduling
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED DUPLICATE
Severity: 4 - Minor Issue
CC: bart
Version: 17.11.5
Hardware: Linux
OS: Linux
Site: Harvard University

Description Paul Edmon 2018-05-06 14:42:54 MDT
According to this description:

EnforcePartLimits
    If set to "ALL" then jobs which exceed a partition's size and/or time limits will be rejected at submission time. If a job is submitted to multiple partitions, the job must satisfy the limits on all the requested partitions. If set to "NO" then the job will be accepted and remain queued until the partition limits are altered (Time and Node Limits). If set to "ANY" or "YES" a job must satisfy the limits of at least one of the requested partitions to be submitted. The default value is "NO". NOTE: If set, then a job's QOS can not be used to exceed partition limits. NOTE: The partition limits being considered are its configured MaxMemPerCPU, MaxMemPerNode, MinNodes, MaxNodes, MaxTime, AllocNodes, AllowAccounts, AllowGroups, AllowQOS, and QOS usage threshold.
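For context, the setting described above amounts to a single line in slurm.conf (a minimal sketch of the configuration, not our full file):

```
# slurm.conf (fragment): reject, at submission time, any job that exceeds
# the limits of every partition it requests
EnforcePartLimits=ALL
```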

We have it set to ALL but then this job didn't get rejected:

[root@holyitc01 ~]# scontrol show job 43397066
JobId=43397066 JobName=gpc_allsky
   UserId=sgossage(559017) GroupId=conroy_lab(403048) MCS_label=N/A
   Priority=4184315 Nice=0 Account=conroy_lab QOS=normal
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A
   SubmitTime=2018-05-05T23:46:03 EligibleTime=2018-05-05T23:46:03
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-05-06T16:35:59
   Partition=conroy-intel,shared AllocNode:Sid=rcnx01:4118
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=375G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=24000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/regal/conroy_lab/sgossage/gaia/clusters/jobscripts/gpc_allsky_S.sh
   WorkDir=/n/regal/conroy_lab/sgossage/gaia
   StdErr=/n/regal/conroy_lab/sgossage/gaia/logs/gpc_allsky_43397066_4294967294.err
   StdIn=/dev/null
   StdOut=/n/regal/conroy_lab/sgossage/gaia/logs/gpc_allsky_43397066_4294967294.out
   Power=

None of the nodes in either of those partitions has 375G available on a single host, so this job should have been rejected at submission time.
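For reference, the 375G in the job's TRES line is just the per-CPU memory request multiplied by the CPU count, as a quick sanity check shows (variable names here are illustrative, not Slurm API):

```python
# Derive the job's total memory request from the scontrol output above.
min_mem_per_cpu_mb = 24000   # MinMemoryCPU=24000M
num_cpus = 16                # NumCPUs=16

total_mb = min_mem_per_cpu_mb * num_cpus
total_gb = total_mb / 1024
print(total_gb)  # 375.0, matching TRES=cpu=16,mem=375G,node=1
```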
Comment 1 Dominik Bartkiewicz 2018-05-07 03:16:32 MDT
Hi

Thank you for the report. This is a duplicate of bug 4960, which has already been fixed.
The fix will be included in 17.11.6, which should be released in a few days.

Dominik

*** This ticket has been marked as a duplicate of ticket 4960 ***