| Summary: | 15.08.10 incompatible with bf_min_prio_reserve patch | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | slurmctld | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 15.08.10 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | NERSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: |
bf_min_prio_reserve patch
updated patch |
||
I've run into this exact problem before. Patch is making the change in the wrong place in with new code. I'll need to build you a new patch... Created attachment 2999 [details]
updated patch
The line you identified is moved a bit higher in this patch.
I'm testing now, but I'm pretty sure that's all we need to change.
The original bug/patch are associated with bug 2565 https://bugs.schedmd.com/show_bug.cgi?id=2565 I see -- I'm sorry for the trouble on this; I really appreciate your help! ---- Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer National Energy Research Scientific Computing Center <http://www.nersc.gov> dmjacobsen@lbl.gov ------------- __o ---------- _ '\<,_ ----------(_)/ (_)__________________________ On Wed, Apr 13, 2016 at 1:29 PM, <bugs@schedmd.com> wrote: > Moe Jette <jette@schedmd.com> changed bug 2634 > <https://bugs.schedmd.com/show_bug.cgi?id=2634> > What Removed Added > CC jette@schedmd.com > > *Comment # 1 <https://bugs.schedmd.com/show_bug.cgi?id=2634#c1> on bug > 2634 <https://bugs.schedmd.com/show_bug.cgi?id=2634> from Moe Jette > <jette@schedmd.com> * > > I've run into this exact problem before. Patch is making the change in the > wrong place in with new code. I'll need to build you a new patch... > > ------------------------------ > You are receiving this mail because: > > - You reported the bug. > > At least with the SLES version of patch the above doesn't go into 15.08.10:
+ patch -p1
patching file doc/man/man5/slurm.conf.5
patching file slurm/slurm.h.in
patching file src/plugins/sched/backfill/backfill.c
Hunk #4 succeeded at 1080 with fuzz 2 (offset 103 lines).
Hunk #5 FAILED at 1150.
Hunk #6 succeeded at 1382 (offset 4 lines).
Hunk #7 succeeded at 1539 (offset 4 lines).
1 out of 7 hunks FAILED -- saving rejects to file
src/plugins/sched/backfill/backfill.c.rej
patching file src/plugins/select/cons_res/select_cons_res.c
patching file src/plugins/select/linear/select_linear.c
patching file src/plugins/select/serial/select_serial.c
This segment (the rej file):
*************** static int _attempt_backfill(void)
*** 1136,1141 ****
!acct_policy_job_runnable_pre_select(job_ptr))
continue;
orig_start_time = job_ptr->start_time;
orig_time_limit = job_ptr->time_limit;
xfree(job_queue_rec);
--- 1150,1166 ----
!acct_policy_job_runnable_pre_select(job_ptr))
continue;
+ job_no_reserve = 0;
+ if (bf_min_prio_reserve &&
+ (job_ptr->priority < bf_min_prio_reserve)) {
+ job_no_reserve = TEST_NOW_ONLY;
+ } else if (bf_min_age_reserve &&
job_ptr->details->begin_time) {
+ pend_time = difftime(time(NULL),
+ job_ptr->details->begin_time);
+ if (pend_time < bf_min_age_reserve)
+ job_no_reserve = TEST_NOW_ONLY;
+ }
+
orig_start_time = job_ptr->start_time;
orig_time_limit = job_ptr->time_limit;
xfree(job_queue_rec);
Thanks again,
Doug
----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov
------------- __o
---------- _ '\<,_
----------(_)/ (_)__________________________
On Wed, Apr 13, 2016 at 1:41 PM, <bugs@schedmd.com> wrote:
> Moe Jette <jette@schedmd.com> changed bug 2634
> <https://bugs.schedmd.com/show_bug.cgi?id=2634>
> What Removed Added
> Assignee support@schedmd.com jette@schedmd.com
>
> *Comment # 3 <https://bugs.schedmd.com/show_bug.cgi?id=2634#c3> on bug
> 2634 <https://bugs.schedmd.com/show_bug.cgi?id=2634> from Moe Jette
> <jette@schedmd.com> *
>
> The original bug/patch are associated with bug 2565 <https://bugs.schedmd.com/show_bug.cgi?id=2565>
> https://bugs.schedmd.com/show_bug.cgi?id=2565
>
> ------------------------------
> You are receiving this mail because:
>
> - You reported the bug.
>
>
The important bit is to just make sure the job's priority is set before the job_queue_rec data structure where it comes from is freed.
@@ -984,6 +997,7 @@ static int _attempt_backfill(void)
part_ptr = job_queue_rec->part_ptr;
job_ptr->part_ptr = part_ptr;
+ job_ptr->priority = job_queue_rec->priority;
if (job_ptr->state_reason == FAIL_ACCOUNT) {
slurmdb_assoc_rec_t assoc_rec;
memset(&assoc_rec, 0, sizeof(slurmdb_assoc_rec_t));
Hello, I took your patch, applied it, it was still fuzzing that line out to the wrong place. I manually repaired it, and rediffed with more context lines: diff -rpN -U 10 a/ b/ this seems to be applying more correctly. I'll let you know the results once I get some tests done. -Doug ---- Doug Jacobsen, Ph.D. NERSC Computer Systems Engineer National Energy Research Scientific Computing Center <http://www.nersc.gov> dmjacobsen@lbl.gov ------------- __o ---------- _ '\<,_ ----------(_)/ (_)__________________________ On Wed, Apr 13, 2016 at 2:15 PM, <bugs@schedmd.com> wrote: > *Comment # 6 <https://bugs.schedmd.com/show_bug.cgi?id=2634#c6> on bug > 2634 <https://bugs.schedmd.com/show_bug.cgi?id=2634> from Moe Jette > <jette@schedmd.com> * > > The important bit is to just make sure the job's priority is set before the > job_queue_rec data structure where it comes from is freed. > > @@ -984,6 +997,7 @@ static int _attempt_backfill(void) > > part_ptr = job_queue_rec->part_ptr; > job_ptr->part_ptr = part_ptr; > + job_ptr->priority = job_queue_rec->priority; > if (job_ptr->state_reason == FAIL_ACCOUNT) { > slurmdb_assoc_rec_t assoc_rec; > memset(&assoc_rec, 0, sizeof(slurmdb_assoc_rec_t)); > > ------------------------------ > You are receiving this mail because: > > - You reported the bug. > > *** Ticket 2635 has been marked as a duplicate of this ticket. *** Doug, Can I close this ticket now? Sorry about the patch problems. I hope that you'll be able to move to version 16.05 soon to avoid patching. |
Created attachment 2998 [details] bf_min_prio_reserve patch Hello, I just tried to upgrade slurm on cori to 15.08.10 but it crashed as soon as I brought up the "regular" partition owing to some incompatibilities between the bf_min_prio_reserve patch we have and some modifications made in 15.08.10 reverting nearby logic. In particular the segfault occurred in backfill.c: job_ptr->part_ptr = part_ptr; job_ptr->priority = job_queue_rec->priority; when job_queue_rec is accessed after it is xfreed() earlier. I both really need the bf_min_prio_reserve and I need to be able to upgrade 15.08. Would it be possible that we can get an update of this patch? or better, get bf_min_prio_reserve mainlined into 15.08 HEAD? I'm attaching the mildly hacked up patch I'm using on top of 15.08.10. I'm in a bit of a jam on cori now, because we are up, but I can't redeploy or get some of the other patches unless I revert changes to my build system... Thank you for your help with this, Doug