| Summary: | free cores in high priority partition are not considered for scheduling in lower priority partitions | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Bartosz Kostrzewa <bartosz_kostrzewa> |
| Component: | Scheduling | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | | |
| Priority: | --- | | |
| Version: | - Unsupported Older Versions | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | Debian |
| Machine Name: | QBIG | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Bartosz Kostrzewa
2019-06-21 02:29:55 MDT
I guess this is somewhat of a duplicate of https://bugs.schedmd.com/show_bug.cgi?id=3881. In case anybody stumbles upon this kind of issue and is looking for a workaround, here is what we have done for now, after several months of wrangling with conflicting reservations and lots of manual prioritization:
1) Combined all partitions into one.
2) Set up QoSes which mirror the partition priorities that we had before:
```
$ sacctmgr show qos format=name,priority
      Name   Priority
---------- ----------
    normal          0
    pascal      15000
    kepler       5000
     devel     100000
```
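For reference, a QoS hierarchy like the one listed above can be created along these lines (a sketch: it assumes a default "normal" QoS already exists, and `-i` merely suppresses sacctmgr's interactive confirmation):

```shell
# Mirror the former partition priorities as QoS priorities.
# "normal" exists by default, so only its priority is set explicitly.
sacctmgr -i modify qos normal set priority=0
sacctmgr -i add qos kepler
sacctmgr -i modify qos kepler set priority=5000
sacctmgr -i add qos pascal
sacctmgr -i modify qos pascal set priority=15000
sacctmgr -i add qos devel
sacctmgr -i modify qos devel set priority=100000
```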
3) Set up a job_submit.lua script which checks for GRES and assigns the matching QoS. The exception is "devel", which grants a very high priority and is not modified by the Lua script:
```lua
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- jobs which don't request any gres or any qos are given the "normal" qos
    -- note that jobs which already have the "devel" qos just go straight through
    if (job_desc.qos == nil) and (job_desc.gres == nil) then
        job_desc.qos = "normal"
    -- if a gres is set but no qos, we set the appropriate qos
    elseif (job_desc.qos == nil) and not (job_desc.gres == nil) then
        -- when the job requests GPU resources, we apply either the pascal
        -- or the kepler QoS
        if not (string.find(job_desc.gres, "gpu") == nil) then
            -- the user has requested one of the low priority partitions,
            -- let's make sure it's the right one
            if not (string.find(job_desc.gres, "pascal") == nil) then
                slurm.log_info("slurm_job_submit: setting QoS to pascal")
                job_desc.qos = "pascal"
            else
                slurm.log_info("slurm_job_submit: setting QoS to kepler")
                job_desc.qos = "kepler"
            end
        end
    -- in this branch, the qos is set, so we check that the matching gres is also set
    elseif not (job_desc.qos == nil) then
        -- pascal qos requires pascal gres
        if not (string.find(job_desc.qos, "pascal") == nil) then
            if (job_desc.gres == nil) then
                slurm.log_info("slurm_job_submit: Job rejected, pascal QoS selected but no GRES set!")
                slurm.log_user("slurm_job_submit: Job rejected, pascal QoS selected but no GRES set!")
                return slurm.ERROR
            elseif (string.find(job_desc.gres, "pascal") == nil) then
                slurm.log_info("slurm_job_submit: Job rejected, pascal QoS selected but invalid GRES set!")
                slurm.log_user("slurm_job_submit: Job rejected, pascal QoS selected but invalid GRES set!")
                return slurm.ERROR
            end
        end
        -- kepler qos requires kepler gres
        if not (string.find(job_desc.qos, "kepler") == nil) then
            if (job_desc.gres == nil) then
                slurm.log_info("slurm_job_submit: Job rejected, kepler QoS selected but no GRES set!")
                slurm.log_user("slurm_job_submit: Job rejected, kepler QoS selected but no GRES set!")
                return slurm.ERROR
            elseif (string.find(job_desc.gres, "kepler") == nil) then
                slurm.log_info("slurm_job_submit: Job rejected, kepler QoS selected but invalid GRES set!")
                slurm.log_user("slurm_job_submit: Job rejected, kepler QoS selected but invalid GRES set!")
                return slurm.ERROR
            end
        end
    end
    -- all jobs are placed into the batch partition
    job_desc.partition = "batch"
    return slurm.SUCCESS
end
```
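With this script in place, submissions might look like the following (a sketch: the job script names and the GRES labels `gpu:pascal` / `gpu:kepler` are assumptions inferred from the QoS names above, not taken from the site's actual gres.conf):

```shell
# Pure CPU job: no --gres and no --qos, so job_submit.lua assigns the "normal" QoS
sbatch --ntasks=8 cpu_job.sh

# GPU job on a Pascal node: the gres string contains "pascal", so the
# "pascal" QoS is applied automatically
sbatch --gres=gpu:pascal:4 gpu_job.sh

# Inconsistent request: "pascal" QoS with a Kepler GRES is rejected at submit time
sbatch --qos=pascal --gres=gpu:kepler:1 gpu_job.sh

# In every case the job lands in the single "batch" partition
```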
It doesn't really solve the problem, since free resources on the GPU nodes are still not considered for jobs which don't request GPU resources, but it does prevent the inverse problem, in which pure CPU jobs would completely block the GPU nodes and Slurm would never even attempt to free a node properly.

I wanted to add that while the workaround has helped to balance GPU and CPU jobs on our heterogeneous large nodes, the problem persists when jobs of different sizes request GPUs. In the above list, lnode13 and lnode15 both have 8 P100 GPUs, while lnode14 has 4. Due to algorithmic constraints, we have a particular problem which is best run on all three nodes using 4 GPUs per node. Of course, this leaves 4 GPUs free on each of lnode13 and lnode15. One would think that by scheduling jobs using up to 4 GPUs (respecting all other resource limits) one would be able to use the idle resources. However, this is not so: lnode[13-15] are never even considered for scheduling of these smaller jobs, as explicitly seen in slurmctld logs with high debug levels for the backfill scheduler.

From looking through the way Slurm selects nodes for scheduling, I roughly understand why the design is as it is. Keeping the nodes in the list of candidates would result in many futile scheduling iterations for all the large jobs that won't fit on a node where a large job is already running. It is also clear that these smaller jobs could potentially take over the nodes required for running the larger jobs, and that removing these nodes from the candidate list early is a possible remedy against this. Logically, the situation is clear: the scheduler needs to keep 4 GPUs per node free, and other jobs should be free to run on the remaining resources. Algorithmically this seems difficult to implement in a general, dynamic fashion. However, it seems to me that this is still a major design flaw which significantly impacts small or very dense installations...