Hi,

We recently switched from 17.02 to 17.11.8, and our preemption setup, which had worked with preempt/qos, is no longer preempting jobs. We also tried partition priority, and that doesn't work either. Is there a known issue with 17.11.8? Our slurm.conf is included below.

Thanks,
Ian

#
# See the slurm.conf man page for more information.
#
ClusterName=SLURM_CLUSTER
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6800-6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave
SlurmdSpoolDir=/cm/local/apps/slurm/var/spool
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/cgroup
PrologFlags=Contain
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFs=
#UsePAM=
#RebootProgram=/sbin/reboot
RebootProgram=/cm/shared/apps/fi/bin/fi-reboot
JobRequeue=0
#EnforcePartLimits=ALL
# Try to work around slurm bug #5452
EnforcePartLimits=ANY
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
#
# SCHEDULING
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd
#JobCompType=jobcomp/filetxt
#JobCompLoc=/cm/local/apps/slurm/var/spool/job_comp.log
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherType=jobacct_gather/cgroup
#JobAcctGatherFrequency=30
JobAcctGatherParams=NoOverMemoryKill
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStorageEnforce=limits,qos
PreemptType=preempt/partition_prio
PreemptMode=CHECKPOINT
# AccountingStorageLoc=slurm_acct_db
# AccountingStoragePass=SLURMDBD_USERPASS
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
# Scheduler
SchedulerType=sched/backfill
# Master nodes
ControlMachine=ironbcm1
ControlAddr=ironbcm1
BackupController=ironbcm2
BackupAddr=ironbcm2
AccountingStorageHost=ironbcm1
# Nodes
NodeName=workermem00 CoresPerSocket=12 RealMemory=1500000 Sockets=4
NodeName=worker[0016-0047] CoresPerSocket=14 RealMemory=256000 Sockets=2 Feature=ib
NodeName=workergpu[00-02] CoresPerSocket=14 RealMemory=384000 Sockets=2 Gres=gpu:2 Feature=k40
NodeName=worker[1000-1239] CoresPerSocket=14 RealMemory=512000 Sockets=2 Feature=opa,broadwell
NodeName=workergpu[03-07] CoresPerSocket=14 RealMemory=512000 Sockets=2 Gres=gpu:2 Feature=p100
NodeName=workergpu[13-18] CoresPerSocket=18 RealMemory=768000 Sockets=2 Gres=gpu:4 Feature=v100,skylake
NodeName=workergpu[08-12] CoresPerSocket=20 RealMemory=384000 Sockets=2 Gres=gpu:2 Feature=v100,skylake
NodeName=worker[3000-3119] CoresPerSocket=20 RealMemory=768000 Sockets=2 Feature=skylake,opa
NodeName=worker[2001-2119] CoresPerSocket=20 RealMemory=768000 Sockets=2 ThreadsPerCore=1 Feature=skylake
NodeName=worker[0000-0015] CoresPerSocket=22 RealMemory=384000 Sockets=2 Feature=ib
# Partitions
PartitionName=scc Default=NO MinNodes=1 AllowGroups=scc PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=gen,scc LLN=NO QoS=scc ExclusiveUser=NO OverSubscribe=EXCLUSIVE OverTimeLimit=0 State=UP Nodes=worker[1000-1239,3000-3119]
PartitionName=ccb Default=NO MinNodes=1 DefaultTime=7-00:00:00 MaxTime=7-00:00:00 AllowGroups=ccb PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=gen,ccb LLN=NO QoS=ccb ExclusiveUser=NO OverSubscribe=EXCLUSIVE OverTimeLimit=0 State=UP Nodes=worker[1000-1239,3000-3119]
PartitionName=gen Default=YES MinNodes=1 DefaultTime=1-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=gen,inter LLN=NO QoS=inter ExclusiveUser=NO OverSubscribe=EXCLUSIVE OverTimeLimit=0 State=UP Nodes=worker[1000-1239,3000-3119]
PartitionName=cca Default=NO MinNodes=1 DefaultTime=7-00:00:00 MaxTime=7-00:00:00 AllowGroups=cca PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=gen,cca LLN=NO QoS=cca ExclusiveUser=NO OverSubscribe=EXCLUSIVE OverTimeLimit=0 State=UP Nodes=worker[1000-1239,3000-3119]
PartitionName=ccq Default=NO MinNodes=1 DefaultTime=7-00:00:00 AllowGroups=ccq PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=gen,ccq LLN=NO QoS=ccq ExclusiveUser=NO OverSubscribe=EXCLUSIVE OverTimeLimit=0 State=UP Nodes=worker[1000-1239,3000-3119]
PartitionName=preempt Default=NO MinNodes=1 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=0 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=CANCEL ReqResv=NO AllowAccounts=ALL AllowQos=preempt LLN=NO QoS=preempt ExclusiveUser=NO OverSubscribe=EXCLUSIVE OverTimeLimit=0 State=UP Nodes=worker[1000-1239,3000-3119]
PartitionName=ib Default=NO MinNodes=1 DefaultTime=7-00:00:00 MaxTime=7-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=gen,ib LLN=NO QoS=ib ExclusiveUser=NO OverSubscribe=EXCLUSIVE OverTimeLimit=0 State=UP Nodes=worker[0000-0047]
PartitionName=gpu Default=NO MinNodes=1 DefaultTime=7-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=18000 AllowAccounts=ALL AllowQos=gen,gpu LLN=NO QoS=gpu ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=workergpu[00-18]
PartitionName=mem Default=NO MinNodes=1 DefaultTime=7-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=gen,mem LLN=NO QoS=mem ExclusiveUser=NO OverSubscribe=EXCLUSIVE OverTimeLimit=0 State=UP Nodes=workermem00
PartitionName=bnl Default=NO MinNodes=1 DefaultTime=10-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=gen,bnl LLN=NO QoS=bnl ExclusiveUser=NO OverSubscribe=EXCLUSIVE OverTimeLimit=0 State=UP Nodes=worker[2001-2119]
PartitionName=bnlx Default=NO MinNodes=1 DefaultTime=1-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=18000 AllowAccounts=ALL AllowQos=gen,bnlx LLN=NO QoS=bnlx ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=worker[2001-2119]
PartitionName=info Default=NO MinNodes=1 DefaultTime=7-00:00:00 AllowGroups=genedata PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO QoS=infor ExclusiveUser=NO OverSubscribe=EXCLUSIVE OverTimeLimit=0 State=UP Nodes=worker[1000-1239,3000-3119]
PartitionName=ccm Default=NO MinNodes=1 DefaultTime=7-00:00:00 MaxTime=7-00:00:00 AllowGroups=ccm PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=gen,ccm LLN=NO QoS=ccm ExclusiveUser=NO OverSubscribe=EXCLUSIVE OverTimeLimit=0 State=UP Nodes=worker[1000-1239,3000-3119]
# Generic resources types
GresTypes=gpu,mic
# Epilog/Prolog parameters
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Fast Schedule option
FastSchedule=1
# Power Saving
SuspendTime=-1 # this disables power saving
SuspendTimeout=60
ResumeTimeout=300
SuspendProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweroff
ResumeProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweron
# END AUTOGENERATED SECTION -- DO NOT REMOVE
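For reference, the behavior can be exercised with something like the following (a sketch; job shapes and sleep times are illustrative). On an otherwise full set of nodes, holding a node through the preempt partition (PriorityTier=0, PreemptMode=CANCEL) and then submitting to a PriorityTier=1 partition such as gen should cancel the preempt job, but on our 17.11.8 install no preemption occurs:

  sbatch -p preempt -N1 --wrap "sleep 3600"
  sbatch -p gen -N1 --wrap "sleep 3600"
  # Watch job states; the preempt job should be cancelled to make room for the gen job.
  squeue -o "%.10i %.10P %.10T %R"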
Hi Ian,

I am tracking down the commit that fixes this, but I can tell you that this is caused by OverSubscribe=EXCLUSIVE and is not an issue in later versions such as 17.11.12. You could correct this issue by upgrading to 17.11.12.

-Jason
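To illustrate the mechanism in question: with PreemptType=preempt/partition_prio, jobs running in a partition with a lower PriorityTier can be preempted by jobs submitted to a higher-PriorityTier partition that shares the same nodes. A minimal sketch (hypothetical partition names and node list, not this cluster's actual layout):

  PreemptType=preempt/partition_prio
  PreemptMode=CANCEL
  # Jobs in "low" (PriorityTier=0) are cancelled to make room for jobs in "high" (PriorityTier=1).
  PartitionName=high Nodes=worker[1000-1003] PriorityTier=1 PreemptMode=OFF
  PartitionName=low Nodes=worker[1000-1003] PriorityTier=0 PreemptMode=CANCEL

The regression described above is that, on 17.11.8, this stops working when the partitions also set OverSubscribe=EXCLUSIVE.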
We use Bright, so we will struggle to upgrade until we move to the next version.

Ian
Hi Ian,

I am marking this issue as resolved, but I wanted to mention one additional piece of information.

> We use Bright, so we will struggle to upgrade until we move to the next version.

Bright can re-roll their RPMs with a later version of Slurm. They will respond that they test on specific versions, so they cannot guarantee that updating to a later version will not cause issues with their integration; however, a move between minor versions, such as 17.11.8 to 17.11.12, should not cause a problem.

-Jason
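Once the re-rolled RPMs are in place, the running version and the effective preemption settings can be confirmed with standard commands (a quick sanity check; output wording varies slightly between releases):

  sinfo --version
  scontrol show config | grep -i preempt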