Ticket 7248

Summary: Job never runnable in partition
Product: Slurm    Reporter: Kilian Cavalotti <kilian>
Component: Scheduling    Assignee: Marcin Stolarek <cinek>
Status: RESOLVED WONTFIX
Severity: 4 - Minor Issue
CC: pedmon, tim
Version: 18.08.7   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=5452
https://bugs.schedmd.com/show_bug.cgi?id=9024
https://bugs.schedmd.com/show_bug.cgi?id=9085
Site: Stanford
Machine Name: Sherlock

Description Kilian Cavalotti 2019-06-14 16:33:14 MDT
Hi SchedMD!

We have users who insist on blasting their jobs to a maximum number of partitions, no matter the resources they request or the partitions' characteristics (like submitting GPU jobs to a list of 5 partitions of which only one has GPUs), which invariably results in a lot of:

slurmctld[...]: _pick_best_nodes: JobId=44559985 never runnable in partition xxx
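
For illustration, a submission along these lines (the partition names are just an example; assume only "gpu" has GPU nodes) is the kind of thing that triggers it:

$ sbatch -p gpu,normal,bigmem,dev,owners --gres=gpu:1 job.sh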

They don't really care in the end, because in the list of partitions they specify, there's at least one that can satisfy their resource requirements, so the scheduler accepts the job submission, and eventually runs it.

But it fills the logs with those "never runnable" messages, it clogs the output of squeue with irrelevant lines, and it very probably adds some unnecessary load to the scheduler.

Is there a way to reject those jobs at submission time? Like EnforcePartLimits=ALL, but for resources instead of limits? 

Or maybe drop the partitions the job can never run in from the partition list?

Thanks!

Cheers,
-- 
Kilian
Comment 4 Marcin Stolarek 2019-06-19 09:53:35 MDT
Kilian,

I was able to reproduce it. I'm checking the code to find out a better way of handling it.

cheers,
Marcin
Comment 5 Kilian Cavalotti 2019-06-19 09:54:37 MDT
Hi!

I'm out of the office and will return on July 1st.
For urgent matters, please contact srcc-support@stanford.edu

Cheers,
--
Kilian
Comment 7 Kilian Cavalotti 2019-07-01 10:41:50 MDT
(In reply to Marcin Stolarek from comment #4)
> Kilian,
> 
> I was able to reproduce it. I'm checking the code to find out a better way
> of handling it.

Great news, thanks! Please let me know if you find anything.

Cheers,
-- 
Kilian
Comment 12 Marcin Stolarek 2019-07-22 06:50:37 MDT
Kilian, 

As you know, EnforcePartLimits only has an effect on partition limits (i.e. MaxMemPerCPU, MaxMemPerNode, MinNodes, etc.). Fulfilling your request would be either a new feature or a change in behavior, which means we could potentially address it in version 20.02, if we conclude that it's valuable in general.
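
For reference, these are the kinds of per-partition limits in slurm.conf that EnforcePartLimits currently acts on (the node and limit values below are purely illustrative):

EnforcePartLimits=ALL
PartitionName=normal Nodes=node[001-100] MaxMemPerCPU=4000 MinNodes=1 MaxNodes=8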

To address your issue directly: with a local patch like the one below, you can get EnforcePartLimits=ALL to also reject multi-partition jobs whose node configuration requirements can never be satisfied:
>diff --git a/src/slurmctld/job_mgr.c b/src/slurmctld/job_mgr.c
>index 72afca6e41..15fda360df 100644
>--- a/src/slurmctld/job_mgr.c
>+++ b/src/slurmctld/job_mgr.c
>@@ -4781,7 +4781,8 @@ static int _select_nodes_parts(struct job_record *job_ptr, bool test_only,
> 				    (part_limits_rc == WAIT_PART_DOWN))
> 					rc = ESLURM_PARTITION_DOWN;
> 			}
>-			if ((rc == ESLURM_REQUESTED_PART_CONFIG_UNAVAILABLE) &&
>+			if (((rc == ESLURM_REQUESTED_PART_CONFIG_UNAVAILABLE) ||
>+			     (rc == ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE))&&
Is maintaining a local patch something you can handle? Obviously, this patch will make Slurm's behavior diverge from the documentation.

In the first message you mentioned "[...]it clogs the output of squeue with irrelevant lines". I'm not sure what you mean by that; I don't see how this issue could affect squeue output.

cheers,
Marcin
Comment 13 Kilian Cavalotti 2019-07-22 10:18:26 MDT
Hi Marcin, 

(In reply to Marcin Stolarek from comment #12)
> As you know, EnforcePartLimits only has an effect on partition limits (i.e.
> MaxMemPerCPU, MaxMemPerNode, MinNodes, etc.). Fulfilling your request would
> be either a new feature or a change in behavior, which means we could
> potentially address it in version 20.02, if we conclude that it's valuable
> in general.

Sounds like a reasonable approach. Some sort of IgnorePartUnrunnable option could be useful to us.

> To address your issue directly: with a local patch like the one below, you
> can get EnforcePartLimits=ALL to also reject multi-partition jobs whose node
> configuration requirements can never be satisfied:
> >diff --git a/src/slurmctld/job_mgr.c b/src/slurmctld/job_mgr.c
> >index 72afca6e41..15fda360df 100644
> >--- a/src/slurmctld/job_mgr.c
> >+++ b/src/slurmctld/job_mgr.c
> >@@ -4781,7 +4781,8 @@ static int _select_nodes_parts(struct job_record *job_ptr, bool test_only,
> > 				    (part_limits_rc == WAIT_PART_DOWN))
> > 					rc = ESLURM_PARTITION_DOWN;
> > 			}
> >-			if ((rc == ESLURM_REQUESTED_PART_CONFIG_UNAVAILABLE) &&
> >+			if (((rc == ESLURM_REQUESTED_PART_CONFIG_UNAVAILABLE) ||
> >+			     (rc == ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE))&&
> Is maintaining a local patch something you can handle? Obviously, this
> patch will make Slurm's behavior diverge from the documentation.

Thanks for the patch, we'll see if that's something we want to carry over, because of course, local modifications of the code base are not ideal.

Do you think that's something that could be managed in the job submit Lua plugin? Is the ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE information available there?

> In the first message you mentioned "[...]it clogs the output of squeue with
> irrelevant lines". I'm not sure what you mean by that; I don't see how this
> issue could affect squeue output.

Well, now that you mention it, I'm not too sure what I meant to write either. :)
I think it may have been about the output of sprio rather than squeue, as sprio will display one line per partition a job has been submitted to.

For instance:
slurmctld[119480]: _pick_best_nodes: JobId=46257903_1(46257911) never runnable in partition normal
slurmctld[119480]: _pick_best_nodes: JobId=46257903_1(46257911) never runnable in partition khavari

and yet:
# sprio -j 46257911
          JOBID PARTITION   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS                 TRES
       46257911 khavari        87798      26266      26349        620      10000          0   cpu=3000,mem=21562
       46257911 normal         55495      26266      26349        620        100          0     cpu=150,mem=2009
       46257911 owners         53355      26266      26349        620          1          0       cpu=10,mem=110


So I guess the bottom line is: since that job can never run in those two partitions, carrying those unusable partitions along and testing them over and over again is probably just a waste of scheduler resources, and a source of confusion for users and sysadmins alike.

I understand that a change of behavior needs a major version to be introduced. Local code modifications are an option, but not our preferred approach, so if there were a way to achieve that result and drop those unrunnable partitions in the job submit plugin, that would be ideal.

Thanks!
-- 
Kilian
Comment 14 Marcin Stolarek 2019-07-23 03:56:57 MDT
Kilian, 

I fully understand your concern about the local patch, and I agree that it's probably not something you should do if it's not a major issue for you. I shared it just in case you'd like to go down that path.

We have clients with sophisticated job_submit plugins; some of them even ended up with plugins that always return a job with only one partition. Unfortunately, there is no way to get the ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE errno from the plugin.
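
Just to sketch the idea (this is only a rough illustration, not something we ship: the gpu_partitions table is a hand-maintained placeholder, and the exact job_desc field names depend on the Slurm version - in 18.08 the GRES request should be visible as job_desc.gres), a job_submit.lua could drop the non-GPU partitions from a GPU job's partition list like this:

-- job_submit.lua sketch: drop partitions without GPUs from GPU jobs.
-- NOTE: gpu_partitions is a hand-maintained placeholder; it has to be kept
-- in sync with the real cluster configuration manually.
local gpu_partitions = { gpu = true }

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- In 18.08 a GRES request shows up as a string like "gpu:2";
    -- the field is nil when the job does not request any GRES.
    if job_desc.gres == nil or not string.find(job_desc.gres, "gpu") then
        return slurm.SUCCESS
    end
    -- job_desc.partition is a comma-separated list, or nil if unset.
    if job_desc.partition == nil then
        return slurm.SUCCESS
    end
    local kept = {}
    for part in string.gmatch(job_desc.partition, "[^,]+") do
        if gpu_partitions[part] then
            table.insert(kept, part)
        end
    end
    -- Only rewrite the list if at least one GPU-capable partition remains.
    if #kept > 0 then
        job_desc.partition = table.concat(kept, ",")
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end

The obvious downside is that the plugin has to duplicate knowledge about which partitions have which resources, knowledge the scheduler already has.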

cheers,
Marcin
Comment 27 Marcin Stolarek 2020-01-31 04:25:20 MST
Kilian,

We had a long discussion and code analysis on this case, which ended up with the conclusion that we're going to keep the current EnforcePartLimits behavior unchanged.

To summarize, the potential solutions for you are:
- Educate users.
- Filter out partitions with a job submit plugin.
- Locally maintain a patch that downgrades the log level for those messages.

We can also consider adding a parameter, like SchedulerParameters=drop_bad_partition, that would remove partitions in which a job could never run at submit time. However, this has to be considered a request for enhancement; would you be interested in sponsoring it?

cheers,
Marcin
Comment 28 Kilian Cavalotti 2020-01-31 08:56:54 MST
Hi Marcin, 

(In reply to Marcin Stolarek from comment #27)
> We had a long discussion and code analysis on this case, which ended up with
> the conclusion that we're going to keep the current EnforcePartLimits
> behavior unchanged.
> 
> To summarize, the potential solutions for you are:
> - Educate users.
> - Filter out partitions with a job submit plugin.
> - Locally maintain a patch that downgrades the log level for those messages.

Thanks for the update!

As you can imagine, option 1 is what we're doing right now. But you know users, it's a Sisyphean task.

Option 2 is not really feasible, I think, because it would mean describing the partition characteristics in the job submit plugin too, and checking whether a job's requirements can be matched, which seems redundant since that's exactly what the scheduler already does. Unless there's a quick way to get that information (whether a given partition can satisfy a given job's requirements) in the job submit plugin?

Option 3 would just hide the effect but not really solve the problem, which is that the scheduler wastes resources considering a useless partition each cycle.


> We can also consider adding a parameter, like
> SchedulerParameters=drop_bad_partition, that would remove partitions in
> which a job could never run at submit time.

That would be the perfect solution, indeed! It seems like the right thing to do, from an efficiency perspective for the scheduler.

> However, this has to be considered a request for enhancement; would you be
> interested in sponsoring it?

If you mean putting my full moral support behind the development of that feature, you got it! :) 

Now, if we're talking financial sponsorship, I'll ask, but I don't think this can go very far. The issue doesn't actually prevent anything from working, really; it's just that scheduling would be more efficient if it were fixed. But I don't think we have a budget to improve efficiency, unfortunately. :)

Thanks!
-- 
Kilian
Comment 34 Marshall Garey 2020-05-13 16:47:28 MDT
Ignore the last email - I posted on the wrong bug. Sorry about that.
Comment 36 Marshall Garey 2020-05-19 17:07:28 MDT
*** Ticket 9024 has been marked as a duplicate of this ticket. ***