Coming from bug 9024 (which is itself a duplicate of bug 7248).

Reproducer:

slurm.conf:

# Nodes
NodeName=DEFAULT RealMemory=4000 Sockets=1 CoresPerSocket=2 ThreadsPerCore=2 \
    State=UNKNOWN Weight=1
NodeName=d1_[1-9] NodeAddr=localhost Port=56101-56109
NodeName=d1_10 NodeAddr=localhost Port=56110 RealMemory=1000

# Partitions
EnforcePartLimits=any
PartitionName=debug Nodes=ALL Default=YES Qos=normal
PartitionName=bigmem Nodes=d1_9
PartitionName=smallmem Nodes=d1_10 MaxMemPerNode=1000
# MaxMemPerNode is optional in partition bigmem - it doesn't make a difference
# to reproduce the bug

Submit a job to fill the cluster. Example:

sbatch -N<number of nodes> --exclusive --wrap="sleep 1000"

Submit a multi-partition job to both bigmem and smallmem that requests more
memory per node than any node in smallmem has:

$ sbatch -N1 -Dtmp --mem=2000 -p smallmem,bigmem --wrap="srun whereami"
Submitted batch job 1653
$ squeue
  JOBID PARTITION  NAME     USER ST   TIME NODES NODELIST(REASON)
   1653 smallmem,  wrap marshall PD   0:00     1 (Resources)
   1649     debug  wrap marshall  R   9:26    11 d1_[1-11]

When the backfill scheduler runs, the job's reason changes to "MaxMemPerLimit":

[2020-05-19T16:58:42.385] backfill: beginning
[2020-05-19T16:58:42.385] =========================================
[2020-05-19T16:58:42.386] Begin:2020-05-19T16:58:42 End:2020-05-20T16:58:42 Nodes:d1_[1-11]
[2020-05-19T16:58:42.386] =========================================
[2020-05-19T16:58:42.386] backfill test for JobId=1653 Prio=5175 Partition=bigmem
[2020-05-19T16:58:42.386] Test JobId=1653 at 2020-05-19T16:58:42 on d1_9
[2020-05-19T16:58:42.387] JobId=1653 to start at 2020-05-24T16:47:02, end at 2020-05-29T16:47:00 on nodes d1_9 in partition bigmem
[2020-05-19T16:58:42.387] backfill: reached end of job queue
[2020-05-19T16:58:42.387] backfill: completed testing 1(1) jobs, usec=2040

$ squeue
  JOBID PARTITION  NAME     USER ST   TIME NODES NODELIST(REASON)
   1653 smallmem,  wrap marshall PD   0:00     1 (MaxMemPerLimit)
   1649     debug  wrap marshall  R  11:49    11 d1_[1-11]

When the main scheduler runs, the job's reason goes back to "Resources":

sched: [2020-05-19T16:59:16.850] Running job scheduler
sched: [2020-05-19T16:59:16.850] JobId=1653. State=PENDING. Reason=Resources. Priority=5275. Partition=smallmem,bigmem.

$ squeue
  JOBID PARTITION  NAME     USER ST   TIME NODES NODELIST(REASON)
   1653 smallmem,  wrap marshall PD   0:00     1 (Resources)
   1649     debug  wrap marshall  R  12:18    11 d1_[1-11]
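For convenience, here is a sketch that automates the reproducer. The node
count (10, matching d1_[1-9] plus d1_10 in the example slurm.conf) and the
substitution of "sleep" for "srun whereami" are assumptions; adjust for
your test cluster:

#!/bin/bash
# Step 1: fill the cluster with an exclusive job so the multi-partition
# job has to pend (node count is an assumption from the example config).
sbatch -N10 --exclusive --wrap="sleep 1000"

# Step 2: submit a multi-partition job whose --mem request (2000 MB)
# exceeds the memory of every node in smallmem (1000 MB).
sbatch -N1 --mem=2000 -p smallmem,bigmem --wrap="sleep 1000"

# Step 3: watch the pending job's reason; it should alternate between
# Resources (main scheduler) and MaxMemPerLimit (backfill scheduler).
watch -n 5 'squeue -o "%.10i %.12P %.8u %.2t %.10M %.20r"'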
Paul, I'm adding you to CC on this bug, as I mentioned in bug 9024. If you don't want to follow it, feel free to remove yourself from CC. This is the bug where we're tracking the job's reason flip-flopping between MaxMemPerLimit and Resources for a multi-partition job submission where the job can't run in one of the partitions because its memory-per-node request is larger than any node in that partition has. See comment 0 for a description and reproducer.
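If it helps while reproducing, a loop like the following logs each reason
change with a timestamp so it can be matched against the sched/backfill log
lines in comment 0 (JOBID=1653 is the example job from the reproducer, an
assumption):

# Poll the pending job's reason and print only when it changes.
JOBID=1653
prev=""
while true; do
    r=$(squeue -h -j "$JOBID" -o "%r")
    if [ "$r" != "$prev" ]; then
        echo "$(date +%T) reason=$r"
        prev="$r"
    fi
    sleep 1
done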
Updating this bug to reflect the preferred approach to resolving this and similar issues around multi-partition job submissions. At this point we do not have a plan to tackle this, and unfortunately will not in the 20.11 timeframe.