Ticket 5087

Summary: Reservation is affecting jobs requesting too much memory
Product: Slurm
Reporter: Davide Vanzo <davide.vanzo>
Component: Scheduling
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED DUPLICATE
Severity: 4 - Minor Issue
Priority: ---
CC: bart
Version: 17.11.5
Hardware: Linux
OS: Linux
Site: Vanderbilt
Attachments: Slurm configuration
cgroup configuration

Description Davide Vanzo 2018-04-20 09:32:50 MDT
Created attachment 6664 [details]
Slurm configuration

Hello guys,

I am hitting very strange scheduling behavior when a reservation is present that affects jobs outside the reservation. Our configuration files are attached, and here are the specifications of the reservation (a sketch of how such a reservation could be created follows the listing):

ReservationName=restest StartTime=2018-04-20T10:15:53 EndTime=2018-04-30T10:15:53 Duration=10-00:00:00
   Nodes=cn355 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=production Flags=
   NodeName=cn355 CoreIDs=0
   TRES=cpu=2
   Users=resuser Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
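
For reference, a reservation with these characteristics could be created roughly as follows. This is a hedged sketch using standard scontrol options; the exact command used at the site is not part of the ticket.

# Create a one-core reservation for user "resuser" on node cn355 in the
# production partition, lasting ten days (values copied from the listing above).
$ scontrol create reservation ReservationName=restest \
      StartTime=2018-04-20T10:15:53 Duration=10-00:00:00 \
      Nodes=cn355 CoreCnt=1 Users=resuser PartitionName=production

# Confirm what was created.
$ scontrol show reservation restest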

Now if I submit a job as a user other than "resuser" in any partition, requesting an amount of memory per node that is smaller than RealMemory but larger than the memory available to jobs (i.e. RealMemory - MemSpecLimit), the job remains pending with Reason=Reservation (job record below; a way to inspect the node's memory figures is sketched after it):

JobId=854671 JobName=sh
   UserId=vanzod(389801) GroupId=accre(36014) MCS_label=N/A
   Priority=99813 Nice=0 Account=accre QOS=normal
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2018-04-20T10:27:17 EligibleTime=2018-04-20T10:27:17
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-04-20T10:27:18
   Partition=debug AllocNode:Sid=gw1230:36827
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=123605M,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=7 MinMemoryNode=123605M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/gpfs22/home/vanzod
   Power=
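
To see where that memory threshold sits, the node record can be inspected directly. A minimal sketch, assuming MemSpecLimit is configured for cn355 and reported in its node record on this Slurm version:

# Memory usable by jobs is RealMemory - MemSpecLimit; a request above that
# value but below RealMemory (here --mem=123605) triggers the behavior.
$ scontrol show node cn355 | grep -E 'RealMemory|MemSpecLimit'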

If I delete the reservation, the job gets rejected as expected:

$ salloc --partition=debug --ntasks=1 --mem=123605
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 854675 has been revoked.
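
Putting the two cases side by side, a minimal reproduction could look like the following. This is a sketch, not the exact commands from the ticket; sbatch --wrap is used for the pending case so the shell is not blocked, and the memory value is the one shown above.

# Case 1: reservation active -- the oversized request is left pending.
$ sbatch --partition=debug --ntasks=1 --mem=123605 --wrap='sleep 600'
$ squeue --name=wrap -o '%i %T %r'       # expected: PENDING  Reservation

# Case 2: reservation removed -- the same request is rejected immediately.
$ scontrol delete ReservationName=restest
$ salloc --partition=debug --ntasks=1 --mem=123605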


Davide
Comment 1 Davide Vanzo 2018-04-20 09:33:20 MDT
Created attachment 6665 [details]
cgroup configuration
Comment 2 Dominik Bartkiewicz 2018-04-23 02:27:26 MDT
Hi,

Thanks for your report.
We are aware of this bug; the full fix will be included in 17.11.6.

Dominik

*** This ticket has been marked as a duplicate of ticket 4960 ***