| Summary: | free cores in high priority partition are not considered for scheduling in lower priority partitions | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Bartosz Kostrzewa <bartosz_kostrzewa> |
| Component: | Scheduling | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | | |
| Priority: | --- | | |
| Version: | - Unsupported Older Versions | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | Debian |
| Machine Name: | QBIG | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Bartosz Kostrzewa
2019-06-21 02:29:55 MDT
I guess this is somewhat of a duplicate of https://bugs.schedmd.com/show_bug.cgi?id=3881. In case anybody stumbles upon this kind of issue and is looking for a workaround, here is what we have done for now, after several months of wrangling with conflicting reservations and lots of manual prioritization:
1) Combined all partitions into one.
2) Set up QoSes which mirror the partition priorities that we had before:
```
$ sacctmgr show qos format=name,priority
      Name   Priority
---------- ----------
    normal          0
    pascal      15000
    kepler       5000
     devel     100000
```
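For reference, a QoS hierarchy like the one listed above can be created along these lines (a sketch: it assumes a default "normal" QoS already exists, and `-i` merely suppresses sacctmgr's interactive confirmation):

```shell
# Mirror the former partition priorities as QoS priorities.
# "normal" exists by default, so only its priority is set explicitly.
sacctmgr -i modify qos normal set priority=0
sacctmgr -i add qos kepler
sacctmgr -i modify qos kepler set priority=5000
sacctmgr -i add qos pascal
sacctmgr -i modify qos pascal set priority=15000
sacctmgr -i add qos devel
sacctmgr -i modify qos devel set priority=100000
```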
3) Set up a job_submit.lua script which checks for GRES and assigns the matching QoS. The exception is "devel", which grants a very high priority and is not modified by the Lua script:
```lua
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- jobs which don't request any gres or any qos are given the "normal" qos
    -- note that jobs which already have the "devel" qos just go straight through
    if (job_desc.qos == nil) and (job_desc.gres == nil) then
        job_desc.qos = "normal"
    -- if a gres is set but no qos, we set the appropriate qos
    elseif (job_desc.qos == nil) and not (job_desc.gres == nil) then
        -- when the job requests GPU resources, we apply either the pascal
        -- or the kepler QoS
        if not (string.find(job_desc.gres, "gpu") == nil) then
            -- the user has requested one of the low priority partitions,
            -- let's make sure it's the right one
            if not (string.find(job_desc.gres, "pascal") == nil) then
                slurm.log_info("slurm_job_submit: setting QoS to pascal")
                job_desc.qos = "pascal"
            else
                slurm.log_info("slurm_job_submit: setting QoS to kepler")
                job_desc.qos = "kepler"
            end
        end
    -- in this branch, the qos is set, so we check that the matching gres is also set
    elseif not (job_desc.qos == nil) then
        -- pascal qos requires pascal gres
        if not (string.find(job_desc.qos, "pascal") == nil) then
            if (job_desc.gres == nil) then
                slurm.log_info("slurm_job_submit: Job rejected, pascal QoS selected but no GRES set!")
                slurm.log_user("slurm_job_submit: Job rejected, pascal QoS selected but no GRES set!")
                return slurm.ERROR
            elseif (string.find(job_desc.gres, "pascal") == nil) then
                slurm.log_info("slurm_job_submit: Job rejected, pascal QoS selected but invalid GRES set!")
                slurm.log_user("slurm_job_submit: Job rejected, pascal QoS selected but invalid GRES set!")
                return slurm.ERROR
            end
        end
        -- kepler qos requires kepler gres
        if not (string.find(job_desc.qos, "kepler") == nil) then
            if (job_desc.gres == nil) then
                slurm.log_info("slurm_job_submit: Job rejected, kepler QoS selected but no GRES set!")
                slurm.log_user("slurm_job_submit: Job rejected, kepler QoS selected but no GRES set!")
                return slurm.ERROR
            elseif (string.find(job_desc.gres, "kepler") == nil) then
                slurm.log_info("slurm_job_submit: Job rejected, kepler QoS selected but invalid GRES set!")
                slurm.log_user("slurm_job_submit: Job rejected, kepler QoS selected but invalid GRES set!")
                return slurm.ERROR
            end
        end
    end
    -- all jobs are placed into the batch partition
    job_desc.partition = "batch"
    return slurm.SUCCESS
end
```
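With this script in place, submissions might look like the following (a sketch: the job script names and the GRES labels `gpu:pascal` / `gpu:kepler` are assumptions inferred from the QoS names above, not taken from the site's actual gres.conf):

```shell
# Pure CPU job: no --gres and no --qos, so job_submit.lua assigns the "normal" QoS
sbatch --ntasks=8 cpu_job.sh

# GPU job on a Pascal node: the gres string contains "pascal", so the
# "pascal" QoS is applied automatically
sbatch --gres=gpu:pascal:4 gpu_job.sh

# Inconsistent request: "pascal" QoS with a Kepler GRES is rejected at submit time
sbatch --qos=pascal --gres=gpu:kepler:1 gpu_job.sh

# In every case the job lands in the single "batch" partition
```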
It doesn't really solve the problem, since free resources on the GPU nodes are still not considered for jobs which don't request GPU resources, but it does prevent the inverse problem, in which pure CPU jobs would completely block the GPU nodes and Slurm would never even attempt to free a node properly.

I wanted to add that while the workaround has helped to balance GPU and CPU jobs on our heterogeneous large nodes, the problem persists when jobs of different sizes request GPUs. In the above list, lnode13 and lnode15 both have 8 P100 GPUs, while lnode14 has 4. Due to algorithmic constraints, we have a particular problem which is best run on all three nodes using 4 GPUs per node. Of course, this leaves 4 GPUs free on each of lnode13 and lnode15. One would think that by scheduling jobs using up to 4 GPUs (respecting all other resource limits) one would be able to use the idle resources. However, this is not so: lnode[13-15] are never even considered for scheduling of these smaller jobs, as explicitly seen in slurmctld logs with high debug levels for the backfill scheduler.

From looking through the way Slurm selects nodes for scheduling, I roughly understand why the design is as it is. Keeping the nodes in the list of candidates would result in many futile scheduling iterations for all the large jobs that won't fit on a node where a large job is already running. It is also clear that these smaller jobs could potentially take over the nodes required for running the larger jobs, and that removing these nodes from the candidate list early is a possible remedy against this. Logically, the situation is clear: the scheduler needs to keep 4 GPUs per node free, and other jobs should be free to run on the remaining resources. Algorithmically this seems difficult to implement in a general, dynamic fashion. However, it seems to me that this is still a major design flaw which significantly impacts small or very dense installations...