| Summary: | Reservation is affecting jobs requesting too much memory | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Davide Vanzo <davide.vanzo> |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | bart |
| Version: | 17.11.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Vanderbilt | Slinky Site: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Attachments: | Slurm configuration, cgroup configuration | ||
Created attachment 6665
cgroup configuration

Hi,

Thanks for your report. We know this bug. The full fix will be in 17.11.6.

Dominik

*** This ticket has been marked as a duplicate of ticket 4960 ***
Created attachment 6664
Slurm configuration

Hello guys,

I am hitting a very strange scheduling behavior: when a reservation is present, it affects jobs outside the reservation. Attached you can find our configuration files, and here are the specifications of the reservation:

    ReservationName=restest StartTime=2018-04-20T10:15:53 EndTime=2018-04-30T10:15:53 Duration=10-00:00:00
       Nodes=cn355 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=production Flags=
         NodeName=cn355 CoreIDs=0
       TRES=cpu=2
       Users=resuser Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

Now if I submit a job as a user other than "resuser" in any partition, requesting an amount of memory per node that is smaller than RealMemory but bigger than the memory available to jobs (i.e. RealMemory - MemSpecLimit), the job remains pending with reason "Reservation":

    JobId=854671 JobName=sh
       UserId=vanzod(389801) GroupId=accre(36014) MCS_label=N/A
       Priority=99813 Nice=0 Account=accre QOS=normal
       JobState=PENDING Reason=Reservation Dependency=(null)
       Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
       RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
       SubmitTime=2018-04-20T10:27:17 EligibleTime=2018-04-20T10:27:17
       StartTime=Unknown EndTime=Unknown Deadline=N/A
       PreemptTime=None SuspendTime=None SecsPreSuspend=0
       LastSchedEval=2018-04-20T10:27:18
       Partition=debug AllocNode:Sid=gw1230:36827
       ReqNodeList=(null) ExcNodeList=(null)
       NodeList=(null)
       NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
       TRES=cpu=1,mem=123605M,node=1
       Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
       MinCPUsNode=7 MinMemoryNode=123605M MinTmpDiskNode=0
       Features=(null) DelayBoot=00:00:00
       Gres=(null) Reservation=(null)
       OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
       Command=(null)
       WorkDir=/gpfs22/home/vanzod
       Power=

If I delete the reservation, the job gets rejected as expected:

    $ salloc --partition=debug --ntasks=1 --mem=123605
    salloc: error: Job submit/allocate failed: Requested node configuration is not available
    salloc: Job allocation 854675 has been revoked.

Davide
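
For anyone trying to reproduce this, the memory window described in the report can be sketched as a small check. This is only an illustration of the arithmetic, not Slurm's scheduling logic. The function names are made up for this example, and the `RealMemory` and `MemSpecLimit` values below are hypothetical placeholders, since the real ones live in the attached configuration and are not quoted in the ticket. Only the request size (123605 MB) comes from the report.

```python
def usable_memory_mb(real_memory_mb: int, mem_spec_limit_mb: int) -> int:
    """Memory left for jobs once MemSpecLimit is set aside on the node."""
    return real_memory_mb - mem_spec_limit_mb


def classify_request(requested_mb: int, real_memory_mb: int, mem_spec_limit_mb: int) -> str:
    """Place a per-node memory request relative to the window from the report.

    "fits"                    -> request <= RealMemory - MemSpecLimit
    "affected-window"         -> RealMemory - MemSpecLimit < request < RealMemory
    "at-or-above-real-memory" -> request >= RealMemory
    """
    usable = usable_memory_mb(real_memory_mb, mem_spec_limit_mb)
    if requested_mb <= usable:
        return "fits"
    if requested_mb < real_memory_mb:
        return "affected-window"
    return "at-or-above-real-memory"


# HYPOTHETICAL node values, chosen only so the example is concrete.
REAL_MEMORY_MB = 128000
MEM_SPEC_LIMIT_MB = 8000

# 123605 is the --mem value used in the report.
print(classify_request(123605, REAL_MEMORY_MB, MEM_SPEC_LIMIT_MB))
```

A request that lands in the "affected-window" range is the case that stayed pending with `Reason=Reservation` while the reservation existed, and was rejected outright once it was deleted.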