Created attachment 2998 [details] bf_min_prio_reserve patch Hello, I just tried to upgrade slurm on cori to 15.08.10 but it crashed as soon as I brought up the "regular" partition owing to some incompatibilities between the bf_min_prio_reserve patch we have and some modifications made in 15.08.10 reverting nearby logic. In particular the segfault occurred in backfill.c: job_ptr->part_ptr = part_ptr; job_ptr->priority = job_queue_rec->priority; when job_queue_rec is accessed after it is xfreed() earlier. I both really need the bf_min_prio_reserve and I need to be able to upgrade 15.08. Would it be possible that we can get an update of this patch? or better, get bf_min_prio_reserve mainlined into 15.08 HEAD? I'm attaching the mildly hacked up patch I'm using on top of 15.08.10. I'm in a bit of a jam on cori now, because we are up, but I can't redeploy or get some of the other patches unless I revert changes to my build system... Thank you for your help with this, Doug
I've run into this exact problem before. Patch is making the change in the wrong place in with new code. I'll need to build you a new patch...
Created attachment 2999 [details] updated patch The line you identified is moved a bit higher in this patch. I'm testing now, but I'm pretty sure that's all we need to change.
The original bug/patch are associated with bug 2565 https://bugs.schedmd.com/show_bug.cgi?id=2565
I see -- I'm sorry for the trouble on this; I really appreciate your help! ---- Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer National Energy Research Scientific Computing Center <http://www.nersc.gov> dmjacobsen@lbl.gov ------------- __o ---------- _ '\<,_ ----------(_)/ (_)__________________________ On Wed, Apr 13, 2016 at 1:29 PM, <bugs@schedmd.com> wrote: > Moe Jette <jette@schedmd.com> changed bug 2634 > <https://bugs.schedmd.com/show_bug.cgi?id=2634> > What Removed Added > CC jette@schedmd.com > > *Comment # 1 <https://bugs.schedmd.com/show_bug.cgi?id=2634#c1> on bug > 2634 <https://bugs.schedmd.com/show_bug.cgi?id=2634> from Moe Jette > <jette@schedmd.com> * > > I've run into this exact problem before. Patch is making the change in the > wrong place in with new code. I'll need to build you a new patch... > > ------------------------------ > You are receiving this mail because: > > - You reported the bug. > >
At least with the SLES version of patch the above doesn't go into 15.08.10: + patch -p1 patching file doc/man/man5/slurm.conf.5 patching file slurm/slurm.h.in patching file src/plugins/sched/backfill/backfill.c Hunk #4 succeeded at 1080 with fuzz 2 (offset 103 lines). Hunk #5 FAILED at 1150. Hunk #6 succeeded at 1382 (offset 4 lines). Hunk #7 succeeded at 1539 (offset 4 lines). 1 out of 7 hunks FAILED -- saving rejects to file src/plugins/sched/backfill/backfill.c.rej patching file src/plugins/select/cons_res/select_cons_res.c patching file src/plugins/select/linear/select_linear.c patching file src/plugins/select/serial/select_serial.c This segment (the rej file): *************** static int _attempt_backfill(void) *** 1136,1141 **** !acct_policy_job_runnable_pre_select(job_ptr)) continue; orig_start_time = job_ptr->start_time; orig_time_limit = job_ptr->time_limit; xfree(job_queue_rec); --- 1150,1166 ---- !acct_policy_job_runnable_pre_select(job_ptr)) continue; + job_no_reserve = 0; + if (bf_min_prio_reserve && + (job_ptr->priority < bf_min_prio_reserve)) { + job_no_reserve = TEST_NOW_ONLY; + } else if (bf_min_age_reserve && job_ptr->details->begin_time) { + pend_time = difftime(time(NULL), + job_ptr->details->begin_time); + if (pend_time < bf_min_age_reserve) + job_no_reserve = TEST_NOW_ONLY; + } + orig_start_time = job_ptr->start_time; orig_time_limit = job_ptr->time_limit; xfree(job_queue_rec); Thanks again, Doug ---- Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer National Energy Research Scientific Computing Center <http://www.nersc.gov> dmjacobsen@lbl.gov ------------- __o ---------- _ '\<,_ ----------(_)/ (_)__________________________ On Wed, Apr 13, 2016 at 1:41 PM, <bugs@schedmd.com> wrote: > Moe Jette <jette@schedmd.com> changed bug 2634 > <https://bugs.schedmd.com/show_bug.cgi?id=2634> > What Removed Added > Assignee support@schedmd.com jette@schedmd.com > > *Comment # 3 <https://bugs.schedmd.com/show_bug.cgi?id=2634#c3> on bug > 2634 <https://bugs.schedmd.com/show_bug.cgi?id=2634> from Moe Jette > <jette@schedmd.com> * > > The original bug/patch are associated with bug 2565 <https://bugs.schedmd.com/show_bug.cgi?id=2565> > https://bugs.schedmd.com/show_bug.cgi?id=2565 > > ------------------------------ > You are receiving this mail because: > > - You reported the bug. > >
The important bit is to just make sure the job's priority is set before the job_queue_rec data structure where it comes from is freed. @@ -984,6 +997,7 @@ static int _attempt_backfill(void) part_ptr = job_queue_rec->part_ptr; job_ptr->part_ptr = part_ptr; + job_ptr->priority = job_queue_rec->priority; if (job_ptr->state_reason == FAIL_ACCOUNT) { slurmdb_assoc_rec_t assoc_rec; memset(&assoc_rec, 0, sizeof(slurmdb_assoc_rec_t));
Hello, I took your patch, applied it, it was still fuzzing that line out to the wrong place. I manually repaired it, and rediffed with more context lines: diff -rpN -U 10 a/ b/ this seems to be applying more correctly. I'll let you know the results once I get some tests done. -Doug ---- Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer National Energy Research Scientific Computing Center <http://www.nersc.gov> dmjacobsen@lbl.gov ------------- __o ---------- _ '\<,_ ----------(_)/ (_)__________________________ On Wed, Apr 13, 2016 at 2:15 PM, <bugs@schedmd.com> wrote: > *Comment # 6 <https://bugs.schedmd.com/show_bug.cgi?id=2634#c6> on bug > 2634 <https://bugs.schedmd.com/show_bug.cgi?id=2634> from Moe Jette > <jette@schedmd.com> * > > The important bit is to just make sure the job's priority is set before the > job_queue_rec data structure where it comes from is freed. > > @@ -984,6 +997,7 @@ static int _attempt_backfill(void) > > part_ptr = job_queue_rec->part_ptr; > job_ptr->part_ptr = part_ptr; > + job_ptr->priority = job_queue_rec->priority; > if (job_ptr->state_reason == FAIL_ACCOUNT) { > slurmdb_assoc_rec_t assoc_rec; > memset(&assoc_rec, 0, sizeof(slurmdb_assoc_rec_t)); > > ------------------------------ > You are receiving this mail because: > > - You reported the bug. > >
*** Ticket 2635 has been marked as a duplicate of this ticket. ***
Doug, Can I close this ticket now?
Sorry about the patch problems. I hope that you'll be able to move to version 16.05 soon to avoid patching.