| Summary: | EnforcePartLimits Failure | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | bart |
| Version: | 17.11.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Harvard University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Hi,

Thank you for the report. This is a duplicate of bug 4960, which has already been fixed. The fix will be included in 17.11.6, which should be released in a few days.

Dominik

*** This ticket has been marked as a duplicate of ticket 4960 ***
According to this description:

> EnforcePartLimits
> If set to "ALL", then jobs which exceed a partition's size and/or time limits will be rejected at submission time. If the job is submitted to multiple partitions, the job must satisfy the limits on all the requested partitions. If set to "NO", then the job will be accepted and will remain queued until the partition limits are altered (Time and Node Limits). If set to "ANY" or "YES", a job must satisfy the limits of any of the requested partitions to be submitted. The default value is "NO".
> NOTE: If set, then a job's QOS can not be used to exceed partition limits.
> NOTE: The partition limits being considered are its configured MaxMemPerCPU, MaxMemPerNode, MinNodes, MaxNodes, MaxTime, AllocNodes, AllowAccounts, AllowGroups, AllowQOS, and QOS usage threshold.

We have it set to ALL, but this job did not get rejected:

```
[root@holyitc01 ~]# scontrol show job 43397066
JobId=43397066 JobName=gpc_allsky
   UserId=sgossage(559017) GroupId=conroy_lab(403048) MCS_label=N/A
   Priority=4184315 Nice=0 Account=conroy_lab QOS=normal
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A
   SubmitTime=2018-05-05T23:46:03 EligibleTime=2018-05-05T23:46:03
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-05-06T16:35:59
   Partition=conroy-intel,shared AllocNode:Sid=rcnx01:4118
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=375G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=24000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/regal/conroy_lab/sgossage/gaia/clusters/jobscripts/gpc_allsky_S.sh
   WorkDir=/n/regal/conroy_lab/sgossage/gaia
   StdErr=/n/regal/conroy_lab/sgossage/gaia/logs/gpc_allsky_43397066_4294967294.err
   StdIn=/dev/null
   StdOut=/n/regal/conroy_lab/sgossage/gaia/logs/gpc_allsky_43397066_4294967294.out
   Power=
```

None of the nodes in either of those partitions has 375G available on a single host, so this job should have been rejected at submission time.
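For anyone reading along, the documented ALL/ANY/NO semantics can be sketched as a toy model (this is not Slurm's actual code, and the per-partition memory limits below are assumed values for illustration only, since the real `conroy-intel` and `shared` limits are not shown in the ticket):

```python
# Toy model of the documented EnforcePartLimits semantics.
# "ALL": the job must satisfy the limits of every requested partition.
# "ANY"/"YES": satisfying any one requested partition is enough.
# "NO": the job is accepted and stays queued regardless of limits.

def within_limits(job, part):
    """Check one job against one partition's configured limits
    (only memory-per-node and wall time in this sketch)."""
    return (job["mem_per_node_gb"] <= part["max_mem_per_node_gb"]
            and job["time_hours"] <= part["max_time_hours"])

def accept_at_submit(job, partitions, enforce="ALL"):
    checks = [within_limits(job, p) for p in partitions]
    if enforce == "ALL":
        return all(checks)
    if enforce in ("ANY", "YES"):
        return any(checks)
    return True  # "NO": accept and leave queued

# The job from this ticket: mem=375G on a single node, 3-hour limit.
job = {"mem_per_node_gb": 375, "time_hours": 3}

# Assumed limits for the two requested partitions (hypothetical values).
parts = [
    {"max_mem_per_node_gb": 256, "max_time_hours": 168},  # conroy-intel (assumed)
    {"max_mem_per_node_gb": 256, "max_time_hours": 168},  # shared (assumed)
]

print(accept_at_submit(job, parts, "ALL"))  # False: should be rejected at submit
```

Under these assumed limits, no requested partition can host 375G on one node, so with `EnforcePartLimits=ALL` (and also with `ANY`) the submission should fail; the ticket reports that it did not, which is the bug fixed as a duplicate of 4960.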