| Summary: | Job never runnable in partition |
|---|---|
| Product: | Slurm |
| Reporter: | Kilian Cavalotti <kilian> |
| Component: | Scheduling |
| Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED WONTFIX |
| Severity: | 4 - Minor Issue |
| Priority: | --- |
| CC: | pedmon, tim |
| Version: | 18.08.7 |
| Hardware: | Linux |
| OS: | Linux |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5452, https://bugs.schedmd.com/show_bug.cgi?id=9024, https://bugs.schedmd.com/show_bug.cgi?id=9085 |
| Site: | Stanford |
| Machine Name: | Sherlock |
Description

Kilian Cavalotti, 2019-06-14 16:33:14 MDT:
Marcin Stolarek (comment #4):

Kilian,

I was able to reproduce it. I'm checking the code to find out a better way of handling it.

cheers,
Marcin

Kilian Cavalotti (auto-reply):

Hi! I'm out of the office and will return on July 1st. For urgent matters, please contact srcc-support@stanford.edu

Cheers,
--
Kilian

Kilian Cavalotti:

(In reply to Marcin Stolarek from comment #4)
> Kilian,
>
> I was able to reproduce it. I'm checking the code to find out a better way
> of handling it.

Great news, thanks! Please let me know if you find anything.

Cheers,
--
Kilian

Marcin Stolarek (comment #12):

Kilian,
As you know, EnforcePartLimits only has an effect on partition limits (i.e. MaxMemPerCPU, MaxMemPerNode, MinNodes, etc.). Fulfilling your request would be either a new feature or a change in behavior, which means we could potentially address it in version 20.02, if we conclude that it's valuable in general.

To address your issue directly: with a local patch like the one below, you can make EnforcePartLimits=ALL also reject multi-partition jobs whose node-configuration requirements can never be satisfied:
>diff --git a/src/slurmctld/job_mgr.c b/src/slurmctld/job_mgr.c
>index 72afca6e41..15fda360df 100644
>--- a/src/slurmctld/job_mgr.c
>+++ b/src/slurmctld/job_mgr.c
>@@ -4781,7 +4781,8 @@ static int _select_nodes_parts(struct job_record *job_ptr, bool test_only,
> 		    (part_limits_rc == WAIT_PART_DOWN))
> 			rc = ESLURM_PARTITION_DOWN;
> 	}
>-	if ((rc == ESLURM_REQUESTED_PART_CONFIG_UNAVAILABLE) &&
>+	if (((rc == ESLURM_REQUESTED_PART_CONFIG_UNAVAILABLE) ||
>+	     (rc == ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE)) &&
Is maintaining a local patch something you can handle? Obviously, this patch would make Slurm's behavior diverge from the documentation.

In your first message you mentioned that "[...] it clogs the output of squeue with irrelevant lines". I'm not sure what you mean by that; I don't see how this issue could have an impact on squeue output.
cheers,
Marcin
Kilian Cavalotti:

Hi Marcin,

(In reply to Marcin Stolarek from comment #12)
> As you know, EnforcePartLimits only has an effect on partition limits (i.e.
> MaxMemPerCPU, MaxMemPerNode, MinNodes, etc.). Fulfilling your request would
> be either a new feature or a change in behavior, which means we could
> potentially address it in version 20.02, if we conclude that it's valuable
> in general.

Sounds like a reasonable approach. Some sort of IgnorePartUnrunnable option could be useful to us.

> To address your issue directly: with a local patch like the one below, you
> can make EnforcePartLimits=ALL also reject multi-partition jobs whose
> node-configuration requirements can never be satisfied:
> >diff --git a/src/slurmctld/job_mgr.c b/src/slurmctld/job_mgr.c
> >index 72afca6e41..15fda360df 100644
> >--- a/src/slurmctld/job_mgr.c
> >+++ b/src/slurmctld/job_mgr.c
> >@@ -4781,7 +4781,8 @@ static int _select_nodes_parts(struct job_record *job_ptr, bool test_only,
> > 		    (part_limits_rc == WAIT_PART_DOWN))
> > 			rc = ESLURM_PARTITION_DOWN;
> > 	}
> >-	if ((rc == ESLURM_REQUESTED_PART_CONFIG_UNAVAILABLE) &&
> >+	if (((rc == ESLURM_REQUESTED_PART_CONFIG_UNAVAILABLE) ||
> >+	     (rc == ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE)) &&
> Is maintaining a local patch something you can handle? Obviously, this
> patch would make Slurm's behavior diverge from the documentation.

Thanks for the patch, we'll see if that's something we want to carry over, because of course, local modifications of the code base are not ideal. Do you think that's something that could be managed in the job submit Lua plugin? Is the ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE information available there?

> In your first message you mentioned that "[...] it clogs the output of
> squeue with irrelevant lines". I'm not sure what you mean by that; I don't
> see how this issue could have an impact on squeue output.

Well, now that you mention it, I'm not too sure what I meant to write either. :) I think it may have been about the output of sprio rather than squeue, as sprio displays one line per partition a job has been submitted to. For instance:

    slurmctld[119480]: _pick_best_nodes: JobId=46257903_1(46257911) never runnable in partition normal
    slurmctld[119480]: _pick_best_nodes: JobId=46257903_1(46257911) never runnable in partition khavari

and yet:

    # sprio -j 46257911
       JOBID PARTITION  PRIORITY       AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS                TRES
    46257911 khavari       87798     26266      26349      620      10000    0  cpu=3000,mem=21562
    46257911 normal        55495     26266      26349      620        100    0    cpu=150,mem=2009
    46257911 owners        53355     26266      26349      620          1    0      cpu=10,mem=110

So I guess the bottom line is: because that job will never be able to run in those two partitions, carrying those unusable partitions over and testing them again and again is probably just a waste of resources, and a source of confusion for users and sysadmins alike.

I understand that a change of behavior needs a major version to be introduced. Local code modifications are an option, but not our preferred approach, so if there were a way to achieve that result and drop those unrunnable partitions in the job submit plugin, that would be ideal.

Thanks!
--
Kilian
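For illustration, a minimal job_submit/lua sketch of the "restrict the job to a single partition" approach mentioned in the next comment might look like the following. It assumes Slurm's standard job_submit/lua interface (slurm_job_submit / slurm_job_modify, job_desc.partition, slurm.log_user, slurm.SUCCESS); the keep-only-the-first-listed-partition policy is purely an illustrative assumption, and, as noted below, the plugin cannot see the ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE result, so it cannot tell whether a partition is actually runnable for the job.

```lua
-- job_submit.lua: minimal sketch, assuming the standard job_submit/lua interface.
-- The policy shown (keep only the first listed partition) is illustrative only,
-- not a recommendation for any particular site.

function slurm_job_submit(job_desc, part_list, submit_uid)
	-- job_desc.partition is a comma-separated partition list,
	-- e.g. "owners,normal,khavari", or nil if no partition was requested.
	if job_desc.partition ~= nil then
		local first = string.match(job_desc.partition, "^([^,]+)")
		if first ~= nil and first ~= job_desc.partition then
			-- Tell the user what was changed, then trim the list.
			slurm.log_user(string.format(
				"job_submit.lua: restricting job to partition %s (requested: %s)",
				first, job_desc.partition))
			job_desc.partition = first
		end
	end
	return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
	return slurm.SUCCESS
end
```

Such a policy sidesteps rather than answers the "never runnable" question: it simply prevents multi-partition jobs from existing in the first place.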
Marcin Stolarek:

Kilian,

I fully understand your concern about the local patch and I agree that it's probably not something you should do if it's not a major issue for you; I shared it just in case you'd like to go down this path. We have clients with sophisticated job_submit plugins; some of them even ended up with job_submit plugins that always return a job with only one partition. Unfortunately, there is no way to get the ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE errno from the plugin.

cheers,
Marcin

Marcin Stolarek (comment #27):

Kilian,

We had a long discussion and code analysis on this case, which ended up with the conclusion that we're going to keep the current EnforcePartLimits behavior unchanged.

To summarize, the potential solutions for you are:
- Educate users.
- Filter out partitions with a job submit plugin.
- Locally maintain a patch that downgrades the log level for the messages.

We could also consider adding a parameter, like SchedulerParameters=drop_bad_partition, that would, at submit time, remove from the job any partition it could never run in. However, this would have to be considered a request for enhancement; would you be interested in sponsoring it?

cheers,
Marcin

Kilian Cavalotti:

Hi Marcin,

(In reply to Marcin Stolarek from comment #27)
> We had a long discussion and code analysis on this case, which ended up
> with the conclusion that we're going to keep the current EnforcePartLimits
> behavior unchanged.
>
> To summarize, the potential solutions for you are:
> - Educate users.
> - Filter out partitions with a job submit plugin.
> - Locally maintain a patch that downgrades the log level for the messages.

Thanks for the update!

As you can imagine, option 1 is what we're doing right now. But you know users, it's a sisyphean task.

Option 2 is not really feasible, I think, because it would mean describing the partition characteristics in the job submit plugin too, and checking whether a job's requirements can be matched, which seems redundant as it's exactly what the scheduler does already. Unless there's a quick way to get that information (whether a given partition can satisfy a given job's requirements) in the job submit plugin?

Option 3 would just hide the effect but not really solve the problem, which is that the scheduler wastes resources considering a useless partition each cycle.

> We could also consider adding a parameter, like
> SchedulerParameters=drop_bad_partition, that would, at submit time, remove
> from the job any partition it could never run in.

That would be the perfect solution, indeed! It seems like the right thing to do, from an efficiency perspective for the scheduler.

> However, this would have to be considered a request for enhancement; would
> you be interested in sponsoring it?

If you mean putting my full moral support behind the development of that feature, you got it! :) Now, if we're talking financial sponsorship, I'll ask, but I don't think this can go very far. The issue is actually not preventing anything from working, really; it's just that scheduling would be more efficient if it were fixed. But I don't think we have a budget to improve efficiency, unfortunately. :)

Thanks!
--
Kilian

Ignore the last email - I posted on the wrong bug. Sorry about that.

*** Ticket 9024 has been marked as a duplicate of this ticket. ***