Ticket 9085 - Split reason and other partition-specific values into separate array/List in job_record_t
Summary: Split reason and other partition-specific values into separate array/List in job_record_t
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.02.2
Hardware: Linux
OS: Linux
Severity: 5 - Enhancement
Assignee: Unassigned Developer
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-05-19 17:02 MDT by Marshall Garey
Modified: 2020-07-22 14:12 MDT
CC: 2 users

See Also:
Site: SchedMD
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Marshall Garey 2020-05-19 17:02:44 MDT
Coming from bug 9024 (which is itself a duplicate of bug 7248).

Reproducer:

slurm.conf:

# Nodes
NodeName=DEFAULT RealMemory=4000 Sockets=1 CoresPerSocket=2 ThreadsPerCore=2 \
	 State=UNKNOWN Weight=1
NodeName=d1_[1-9] NodeAddr=localhost Port=56101-56109
NodeName=d1_10 NodeAddr=localhost Port=56110 RealMemory=1000


# Partitions
EnforcePartLimits=any
PartitionName=debug Nodes=ALL Default=YES Qos=normal
PartitionName=bigmem Nodes=d1_9
PartitionName=smallmem Nodes=d1_10 MaxMemPerNode=1000
# MaxMemPerNode is optional in partition bigmem - it makes no difference
# for reproducing the bug


Submit a job to fill the cluster. Example:
sbatch -N<number of nodes> --exclusive --wrap="sleep 1000"

Submit a multi-partition job to both bigmem and smallmem that requests more memory per node than any node has in smallmem:

$ sbatch -N1 -Dtmp --mem=2000 -p smallmem,bigmem --wrap="srun whereami"                                                                                                  
Submitted batch job 1653

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
              1653 smallmem,     wrap marshall PD       0:00      1 (Resources) 
              1649     debug     wrap marshall  R       9:26     11 d1_[1-11]


When the backfill scheduler runs, the job's reason goes to "MaxMemPerLimit":

[2020-05-19T16:58:42.385] backfill: beginning
[2020-05-19T16:58:42.385] =========================================
[2020-05-19T16:58:42.386] Begin:2020-05-19T16:58:42 End:2020-05-20T16:58:42 Nodes:d1_[1-11]
[2020-05-19T16:58:42.386] =========================================
[2020-05-19T16:58:42.386] backfill test for JobId=1653 Prio=5175 Partition=bigmem
[2020-05-19T16:58:42.386] Test JobId=1653 at 2020-05-19T16:58:42 on d1_9
[2020-05-19T16:58:42.387] JobId=1653 to start at 2020-05-24T16:47:02, end at 2020-05-29T16:47:00 on nodes d1_9 in partition bigmem
[2020-05-19T16:58:42.387] backfill: reached end of job queue
[2020-05-19T16:58:42.387] backfill: completed testing 1(1) jobs, usec=2040

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
              1653 smallmem,     wrap marshall PD       0:00      1 (MaxMemPerLimit) 
              1649     debug     wrap marshall  R      11:49     11 d1_[1-11]


When the main scheduler runs, the job's reason goes to "Resources":

sched: [2020-05-19T16:59:16.850] Running job scheduler
sched: [2020-05-19T16:59:16.850] JobId=1653. State=PENDING. Reason=Resources. Priority=5275. Partition=smallmem,bigmem.

$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
              1653 smallmem,     wrap marshall PD       0:00      1 (Resources) 
              1649     debug     wrap marshall  R      12:18     11 d1_[1-11]
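
The flip happens because the pending reason lives in a single field that is shared across all of the job's requested partitions (the value this ticket proposes splitting into a per-partition array/List), so whichever partition test ran most recently overwrites it. Below is a minimal, hypothetical C model of that last-writer-wins behavior - it is not Slurm source, every struct, enum, and function name in it is invented for illustration, and the enum labels only echo the reason strings shown above.

#include <stdio.h>

/* Hypothetical model only -- not Slurm source. One shared reason field
 * on the job record means the last partition test to run decides what
 * squeue reports. */

enum reason { WAIT_RESOURCES, WAIT_PN_MEM_LIMIT };

struct part { const char *name; int max_mem_per_node; int busy; };
struct job  { int id; int req_mem_per_node; enum reason state_reason; };

static const char *reason_str(enum reason r)
{
    return (r == WAIT_PN_MEM_LIMIT) ? "MaxMemPerLimit" : "Resources";
}

/* One scheduling pass: test the job against each partition in turn,
 * overwriting the single shared reason field each time. */
static void sched_pass(struct job *job, const struct part *parts, int nparts)
{
    for (int i = 0; i < nparts; i++) {
        if (parts[i].max_mem_per_node &&
            job->req_mem_per_node > parts[i].max_mem_per_node)
            job->state_reason = WAIT_PN_MEM_LIMIT;
        else if (parts[i].busy)
            job->state_reason = WAIT_RESOURCES;
    }
}

int main(void)
{
    struct job job = { 1653, 2000, WAIT_RESOURCES };
    const struct part pass_a[] = {
        { "smallmem", 1000, 1 },  /* 2000 MB > MaxMemPerNode=1000 */
        { "bigmem",   0,    1 },  /* no per-node memory cap, nodes busy */
    };
    const struct part pass_b[] = {
        { "bigmem",   0,    1 },
        { "smallmem", 1000, 1 },
    };

    sched_pass(&job, pass_a, 2);
    printf("smallmem tested first, bigmem last: %s\n",
           reason_str(job.state_reason));

    sched_pass(&job, pass_b, 2);
    printf("bigmem tested first, smallmem last: %s\n",
           reason_str(job.state_reason));
    return 0;
}

With the order of the partition tests reversed between the two passes, the same job record reports Resources after one pass and MaxMemPerLimit after the other, which mirrors the squeue output above.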
Comment 1 Marshall Garey 2020-05-19 17:04:47 MDT
Paul, I'm adding you to the CC on this bug, as I mentioned in bug 9024. If you don't want to follow it, feel free to remove yourself from the CC. This is the bug where we're tracking the job's reason flipping between MaxMemPerLimit and Resources for a multi-partition job submission where the job can't run in one of the partitions because its memory-per-node request is larger than any node in that partition. See comment 0 for a description and reproducer.
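
For comparison, here is a hedged sketch of the direction the ticket title suggests: keep the pending reason per requested partition so each partition's value survives independently and could be reported side by side. Every name in it (part_pend_state, job_record_sketch, set_part_reason) is invented for the sketch and does not reflect the actual job_record_t layout.

#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of the enhancement named in the ticket title:
 * one pending-state entry per requested partition instead of a single
 * shared reason field. */

struct part_pend_state {
    const char *part_name;   /* partition this entry applies to */
    const char *reason;      /* why the job is pending in that partition */
};

struct job_record_sketch {
    int job_id;
    int npart;
    struct part_pend_state *part_state;  /* one entry per partition */
};

static void set_part_reason(struct job_record_sketch *job,
                            const char *part, const char *reason)
{
    for (int i = 0; i < job->npart; i++)
        if (!strcmp(job->part_state[i].part_name, part))
            job->part_state[i].reason = reason;
}

int main(void)
{
    struct part_pend_state states[] = {
        { "smallmem", "None" },
        { "bigmem",   "None" },
    };
    struct job_record_sketch job = { 1653, 2, states };

    /* Each scheduler pass records a verdict only for the partition it
     * actually tested; the other partition's reason is untouched. */
    set_part_reason(&job, "smallmem", "MaxMemPerLimit");
    set_part_reason(&job, "bigmem", "Resources");

    for (int i = 0; i < job.npart; i++)
        printf("JobId=%d Partition=%s Reason=%s\n", job.job_id,
               job.part_state[i].part_name,
               job.part_state[i].reason);
    return 0;
}

With per-partition storage, the backfill and main schedulers would each update only the entry for the partition they tested, instead of taking turns clobbering one shared value.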
Comment 11 Tim Wickberg 2020-07-14 12:07:13 MDT
Updating this ticket to reflect the preferred approach to resolving this and similar issues around multi-partition job submissions.

At this point we do not have a plan to tackle this, and unfortunately will not in the 20.11 timeframe.