Ticket 2634

Summary: 15.08.10 incompatible with bf_min_prio_reserve patch
Product: Slurm Reporter: Doug Jacobsen <dmjacobsen>
Component: slurmctldAssignee: Moe Jette <jette>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 15.08.10   
Hardware: Cray XC   
OS: Linux   
Site: NERSC Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: bf_min_prio_reserve patch
updated patch

Description Doug Jacobsen 2016-04-13 07:14:24 MDT
Created attachment 2998 [details]
bf_min_prio_reserve patch

Hello,

I just tried to upgrade slurm on cori to 15.08.10 but it crashed as soon as I brought up the "regular" partition owing to some incompatibilities between the bf_min_prio_reserve patch we have and some modifications made in 15.08.10 reverting nearby logic.


In particular the segfault occurred in backfill.c:

                job_ptr->part_ptr = part_ptr;
                job_ptr->priority = job_queue_rec->priority;

when job_queue_rec is accessed after it is xfreed() earlier.

I both really need the bf_min_prio_reserve and I need to be able to upgrade 15.08.  Would it be possible that we can get an update of this patch?  or better, get bf_min_prio_reserve mainlined into 15.08 HEAD?

I'm attaching the mildly hacked up patch I'm using on top of 15.08.10.

I'm in a bit of a jam on cori now, because we are up, but I can't redeploy or get some of the other patches unless I revert changes to my build system...


Thank you for your help with this,
Doug
Comment 1 Moe Jette 2016-04-13 07:29:10 MDT
I've run into this exact problem before. Patch is making the change in the wrong place in with new code. I'll need to build you a new patch...
Comment 2 Moe Jette 2016-04-13 07:41:09 MDT
Created attachment 2999 [details]
updated patch

The line you identified is moved a bit higher in this patch.
I'm testing now, but I'm pretty sure that's all we need to change.
Comment 3 Moe Jette 2016-04-13 07:41:54 MDT
The original bug/patch are associated with bug 2565

https://bugs.schedmd.com/show_bug.cgi?id=2565
Comment 4 Doug Jacobsen 2016-04-13 07:49:27 MDT
I see -- I'm sorry for the trouble on this;  I really appreciate your help!

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________


On Wed, Apr 13, 2016 at 1:29 PM, <bugs@schedmd.com> wrote:

> Moe Jette <jette@schedmd.com> changed bug 2634
> <https://bugs.schedmd.com/show_bug.cgi?id=2634>
> What Removed Added
> CC   jette@schedmd.com
>
> *Comment # 1 <https://bugs.schedmd.com/show_bug.cgi?id=2634#c1> on bug
> 2634 <https://bugs.schedmd.com/show_bug.cgi?id=2634> from Moe Jette
> <jette@schedmd.com> *
>
> I've run into this exact problem before. Patch is making the change in the
> wrong place in with new code. I'll need to build you a new patch...
>
> ------------------------------
> You are receiving this mail because:
>
>    - You reported the bug.
>
>
Comment 5 Doug Jacobsen 2016-04-13 08:04:19 MDT
At least with the SLES version of patch the above doesn't go into 15.08.10:

+ patch -p1
patching file doc/man/man5/slurm.conf.5
patching file slurm/slurm.h.in
patching file src/plugins/sched/backfill/backfill.c
Hunk #4 succeeded at 1080 with fuzz 2 (offset 103 lines).
Hunk #5 FAILED at 1150.
Hunk #6 succeeded at 1382 (offset 4 lines).
Hunk #7 succeeded at 1539 (offset 4 lines).
1 out of 7 hunks FAILED -- saving rejects to file
src/plugins/sched/backfill/backfill.c.rej
patching file src/plugins/select/cons_res/select_cons_res.c
patching file src/plugins/select/linear/select_linear.c
patching file src/plugins/select/serial/select_serial.c


This segment (the rej file):

*************** static int _attempt_backfill(void)
*** 1136,1141 ****
                    !acct_policy_job_runnable_pre_select(job_ptr))
                        continue;

                orig_start_time = job_ptr->start_time;
                orig_time_limit = job_ptr->time_limit;
                xfree(job_queue_rec);
--- 1150,1166 ----
                    !acct_policy_job_runnable_pre_select(job_ptr))
                        continue;

+               job_no_reserve = 0;
+               if (bf_min_prio_reserve &&
+                   (job_ptr->priority < bf_min_prio_reserve)) {
+                       job_no_reserve = TEST_NOW_ONLY;
+               } else if (bf_min_age_reserve &&
job_ptr->details->begin_time) {
+                       pend_time = difftime(time(NULL),
+                                            job_ptr->details->begin_time);
+                       if (pend_time < bf_min_age_reserve)
+                               job_no_reserve = TEST_NOW_ONLY;
+               }
+
                orig_start_time = job_ptr->start_time;
                orig_time_limit = job_ptr->time_limit;
                xfree(job_queue_rec);



Thanks again,
Doug

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________


On Wed, Apr 13, 2016 at 1:41 PM, <bugs@schedmd.com> wrote:

> Moe Jette <jette@schedmd.com> changed bug 2634
> <https://bugs.schedmd.com/show_bug.cgi?id=2634>
> What Removed Added
> Assignee support@schedmd.com jette@schedmd.com
>
> *Comment # 3 <https://bugs.schedmd.com/show_bug.cgi?id=2634#c3> on bug
> 2634 <https://bugs.schedmd.com/show_bug.cgi?id=2634> from Moe Jette
> <jette@schedmd.com> *
>
> The original bug/patch are associated with bug 2565 <https://bugs.schedmd.com/show_bug.cgi?id=2565>
> https://bugs.schedmd.com/show_bug.cgi?id=2565
>
> ------------------------------
> You are receiving this mail because:
>
>    - You reported the bug.
>
>
Comment 6 Moe Jette 2016-04-13 08:15:40 MDT
The important bit is to just make sure the job's priority is set before the job_queue_rec data structure where it comes from is freed.

@@ -984,6 +997,7 @@ static int _attempt_backfill(void)
 
 		part_ptr = job_queue_rec->part_ptr;
 		job_ptr->part_ptr = part_ptr;
+		job_ptr->priority = job_queue_rec->priority;
 		if (job_ptr->state_reason == FAIL_ACCOUNT) {
 			slurmdb_assoc_rec_t assoc_rec;
 			memset(&assoc_rec, 0, sizeof(slurmdb_assoc_rec_t));
Comment 7 Doug Jacobsen 2016-04-13 09:03:18 MDT
Hello,

I took your patch, applied it, it was still fuzzing that line out to the
wrong place.  I manually repaired it, and rediffed with more context lines:

diff -rpN -U 10 a/ b/

this seems to be applying more correctly.  I'll let you know the results
once I get some tests done.

-Doug

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________


On Wed, Apr 13, 2016 at 2:15 PM, <bugs@schedmd.com> wrote:

> *Comment # 6 <https://bugs.schedmd.com/show_bug.cgi?id=2634#c6> on bug
> 2634 <https://bugs.schedmd.com/show_bug.cgi?id=2634> from Moe Jette
> <jette@schedmd.com> *
>
> The important bit is to just make sure the job's priority is set before the
> job_queue_rec data structure where it comes from is freed.
>
> @@ -984,6 +997,7 @@ static int _attempt_backfill(void)
>
>                 part_ptr = job_queue_rec->part_ptr;
>                 job_ptr->part_ptr = part_ptr;
> +               job_ptr->priority = job_queue_rec->priority;
>                 if (job_ptr->state_reason == FAIL_ACCOUNT) {
>                         slurmdb_assoc_rec_t assoc_rec;
>                         memset(&assoc_rec, 0, sizeof(slurmdb_assoc_rec_t));
>
> ------------------------------
> You are receiving this mail because:
>
>    - You reported the bug.
>
>
Comment 8 Doug Jacobsen 2016-04-14 00:58:46 MDT
*** Ticket 2635 has been marked as a duplicate of this ticket. ***
Comment 9 Moe Jette 2016-04-14 03:33:43 MDT
Doug, Can I close this ticket now?
Comment 10 Moe Jette 2016-04-15 06:01:54 MDT
Sorry about the patch problems.

I hope that you'll be able to move to version 16.05 soon to avoid patching.