| Summary: | Backfill not working as desired | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Moe Jette <jette> |
| Component: | Scheduling | Assignee: | Danny Auble <da> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | sts |
| Version: | 17.02.1 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Harvard Medical School | | |
| Attachments: | swap backfill user and partition counter tests | | |
**Comment 1 (Danny Auble):**

Moe, wouldn't this change create the reverse problem?

Would keeping the code in the same order and subtracting the already-added job from bf_part_jobs[j] fix the problem without causing the reverse issue?

We could probably set up a pointer to the chosen bf_part_jobs entry, as below. What do you think?
```diff
diff --git a/src/plugins/sched/backfill/backfill.c b/src/plugins/sched/backfill/backfill.c
index 5e5103b..de713c4 100644
--- a/src/plugins/sched/backfill/backfill.c
+++ b/src/plugins/sched/backfill/backfill.c
@@ -899,7 +899,8 @@ static int _attempt_backfill(void)
 	int rc = 0;
 	int job_test_count = 0, test_time_count = 0, pend_time;
 	uint32_t *uid = NULL, nuser = 0, bf_parts = 0;
-	uint32_t *bf_part_jobs = NULL, *bf_part_resv = NULL;
+	uint32_t *bf_part_jobs = NULL, *bf_part_resv = NULL,
+		*curr_bf_part_jobs = NULL;
 	uint16_t *njobs = NULL;
 	bool already_counted;
 	uint32_t reject_array_job_id = 0;
@@ -1220,6 +1221,7 @@ next_task:
 					job_ptr->job_id);
 				continue;
 			}
+			curr_bf_part_jobs = &bf_part_jobs[j];
 		}
 		if (max_backfill_job_per_user) {
 			for (j = 0; j < nuser; j++) {
@@ -1259,6 +1261,8 @@ next_task:
 					max_backfill_job_per_user,
 					job_ptr->user_id,
 					job_ptr->job_id);
+				if (curr_bf_part_jobs)
+					(*curr_bf_part_jobs)--;
 				continue;
 			}
 		}
```
**Comment 2 (Moe Jette), in reply to comment 1:**

> Moe, wouldn't this change create the reverse problem?

It shouldn't in practice, assuming the partition limit is much higher than the per-user limit and there are more users than partitions. The worst case is that the per-user limits would prevent scanning all of the partitions. What HMS really wants is a per-user per-partition limit, which they now have a patch for (that is what I worked on for part of this afternoon).

HMS has a really odd workload, with a handful of users who each submit several thousand jobs at a time. I'm hoping they at least move to job arrays, but that will take some time.

> Would keeping the code in the same order and subtracting the already-added job from bf_part_jobs[j] fix the problem and not cause the reverse issue?
>
> We could probably set up a pointer to the chosen bf_part_jobs entry, as below. What do you think?

What probably makes more sense is to avoid incrementing _any_ of the counters (per-user, per-partition, or per-user-partition) until after all of the tests have passed, rather than incrementing the counters in one place and decrementing them later on failure. Perhaps this should all go into v17.11 and let them use the patch for now. I think that I'll work on that tonight.

**Comment 3 (Danny Auble), in reply to comment 2:**

> HMS has a really odd workload, with a handful of users who each submit several thousand jobs at a time.

Why not just use a partition QOS and set MaxJobsPerUser? That will do that exact thing.

> What probably makes more sense is to avoid incrementing _any_ of the counters (per-user, per-partition, or per-user-partition) until after all of the tests have passed.

I like that idea even better. I would like to avoid this patch going into the code, though. Hopefully a partition QOS will solve their issues.

**Comment 4 (Danny Auble):**

Drat, there is no MaxJobsPerUser, only MaxTRESPerUser. Sorry for that bad intel. I would still like to wait for the non-increment patch instead of switching the order here.

**Comment 5 (Moe Jette), in reply to comment 4:**

I think this can wait, as HMS has a working patch for now.

Looking at the logic, it seems that some of the checks in the backfill scheduler need to be moved around for job arrays. Notably, acct_policy_job_runnable_pre_select() needs to be moved under the "next_task" label.

I couldn't get a flight out tonight given the travel time required. I'll be leaving on Thursday morning and probably not coming into the office. Please try to get the two patches from bug 3650 into the next release. They are both very simple patches that are in use at HMS now.

**Comment 6 (Danny Auble):**

I again stand corrected: QOS does have MaxJobsPer[User|Account]. It just wasn't documented (it is now, commit a371c74e2377cb). I would suggest using that.

**Comment 7 (Moe Jette), in reply to comment 6:**

HMS has been advised of this solution.
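The partition-QOS route discussed above (the MaxJobsPerUser QOS limit, documented in commit a371c74e2377cb) might be set up roughly as follows. This is a sketch only: the QOS name `bf_limit` and partition name `shared` are invented for illustration, and the exact option spelling should be checked against the sacctmgr man page for the Slurm version in use:

```shell
# Create a QOS capping each user at 30 jobs (hypothetical names).
sacctmgr add qos bf_limit set MaxJobsPerUser=30

# Attach it to the partition in slurm.conf, e.g.:
#   PartitionName=shared Nodes=... QOS=bf_limit
```

With the QOS bound to the partition, the limit applies per user within that partition, which is close to the per-user per-partition behavior HMS wanted.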
Closing ticket.

*** Ticket 3669 has been marked as a duplicate of this ticket. ***
**Original report (Moe Jette):**

Created attachment 4298 [details]: swap backfill user and partition counter tests

Code review, please. Here's the problem: the backfill max-jobs-per-partition limit is checked, and its counter incremented, before the max-jobs-per-user check. In this case, max jobs per partition is 500 and max jobs per user is 30. The top 500 jobs in a partition all belong to the same user, so one user consumes all 500 of the partition's slots and blocks every other user from running in that partition. This change simply reverses the order of the tests, so one user can't consume more than 30 of the 500 available job slots in a partition.