| Summary: | slurmctld abort during backfill | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | slurmctld | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | jsrinivasan |
| Version: | 16.05.4 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | NERSC | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | 16.05.6 | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | thread apply all bt output, slurmctld log, Patch to prevent abort | | |
Description
Doug Jacobsen
2016-10-17 15:42:37 MDT
Created attachment 3594 [details]
thread apply all bt output
Created attachment 3595 [details]
slurmctld log
Danny Auble:
Doug, do you know if there were any memory issues on the node? I am wondering if we ran out.

The assert happened on

    bit_or(bitmap, orig_map);

where orig_map was NULL.

    orig_map = bit_copy(bitmap);

where bitmap=0x7f902c35eb00. The relevant part of bit_copy:

    new = bit_alloc(newsize_bits);
    if (new)
        memcpy(&new[BITSTR_OVERHEAD], &b[BITSTR_OVERHEAD], len);

If bit_alloc had returned NULL here, though, I would expect a nice OOM message in the log, as that is what the code is supposed to do inside bit_alloc. As that doesn't exist, I am not sure how it came to be NULL. The other variables look good, or I would expect memory corruption.

Could you send 'bt full' on the thread provided in comment 0? It appears there are a lot of threads waiting, presumably because backfill had just given up the locks and we are not going through them. Nothing that appears too concerning.

(gdb) bt full
#0 0x00007f90ebe20875 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007f90ebe21e51 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007f90ebe19740 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#3 0x00000000005131c8 in bit_overlap (b1=0x7f902c35eb00, b2=0x0) at
bitstring.c:808
count = 0
bit = 0
bit_cnt = 32656
word_size = 32
__PRETTY_FUNCTION__ = "bit_overlap"
#4 0x00007f90e63a5652 in _will_run_test (job_ptr=0x7f8fdc513070,
bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2,
job_node_req=64000, preemptee_candidates=0x0,
preemptee_job_list=0x7f90e5f8ca70, exc_core_bitmap=0x0) at
select_cons_res.c:1882
first_job_ptr = 0x0
next_job_ptr = 0x0
overlap = 0
last_job_ptr = 0x0
rm_job_cnt = 0
time_window = 0
more_jobs = true
future_part = 0x7f902c4ad9e0
future_usage = 0x7f902cbbe850
tmp_job_ptr = 0x7f90a0388820
cr_job_list = 0x2cc8f30
job_iterator = 0x8c3d20
preemptee_iterator = 0x5804d651
orig_map = 0x7f902cb19280
action = 0
rc = -1
now = 1476737066
tmp_cr_type = 50
qos_preemptor = false
#5 0x00007f90e63a643d in select_p_job_test (job_ptr=0x7f8fdc513070,
bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2,
preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70,
exc_core_bitmap=0x0) at select_cons_res.c:2248
rc = 22
job_node_req = 64000
debug_cpu_bind = false
debug_check = true
#6 0x00007f90eb622f26 in other_job_test (job_ptr=0x7f8fdc513070,
bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2,
preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70,
exc_core_bitmap=0x0) at other_select.c:277
No locals.
#7 0x00007f90eb6210fa in select_p_job_test (job_ptr=0x7f8fdc513070,
bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2,
preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70,
exc_core_bitmap=0x0) at select_cray.c:1920
jobinfo = 0x7f8fdc60e6c0
#8 0x000000000052ef1f in select_g_job_test (job_ptr=0x7f8fdc513070,
bitmap=0x7f902c35eb00, min_nodes=2, max_nodes=2, req_nodes=2, mode=2,
preemptee_candidates=0x0, preemptee_job_list=0x7f90e5f8ca70,
exc_core_bitmap=0x0) at node_select.c:588
No locals.
#9 0x00007f90e5f91fd8 in _try_sched (job_ptr=0x7f8fdc513070,
avail_bitmap=0x7f90e5f8cd60, min_nodes=2, max_nodes=2, req_nodes=2,
exc_core_bitmap=0x0) at backfill.c:417
orig_shared = 0
now = 1476737066
str =
"\340\312\370\345\220\177\000\000\212\023Q\000\000\000\000\000\267\300\376W\000\000\000\000\333\356O\000\322\025\000\000\311\363d\000\000\000\000\000h\315\370\345\220\177\000\000\000\353\065,\220\177\000\000p0Q\177\000\000
\313\370\345\220\177\000\000\000\000\000\000\000\000\000\000
\313\370\345\220\177\000\000\300\216\f,\220\177\000\000\300\216\f,"
low_bitmap = 0x0
tmp_bitmap = 0x7f902ca3aed0
rc = 0
has_xor = false
feat_cnt = 0
detail_ptr = 0x7f8fdc34e660
preemptee_candidates = 0x0
preemptee_job_list = 0x0
feat_iter = 0x64f3c9
feat_ptr = 0x7f90e5f8ca70
__func__ = "_try_sched"
#10 0x00007f90e5f9521f in _attempt_backfill () at backfill.c:1479
tv1 = {tv_sec = 1476737066, tv_usec = 132024}
tv2 = {tv_sec = 1476737065, tv_usec = 432606}
tv_str = "usec=2001634\000\000\000\000\000\000\000"
delta_t = 2001634
filter_root = true
job_queue = 0x7f90e3e62840
job_queue_rec = 0x0
qos_ptr = 0x7f90e00054b0
bb = 1
i = -436679056
j = 33
node_space_recs = 44
mcs_select = 0
job_ptr = 0x7f8fdc513070
part_ptr = 0x7f902c48a420
bf_part_ptr = 0x0
end_time = 1476953400
end_reserve = 1477039800
deadline_time_limit = 0
time_limit = 1620
comp_time_limit = 1620
orig_time_limit = 1620
part_time_limit = 5760
min_nodes = 2
max_nodes = 2
req_nodes = 2
active_bitmap = 0x0
avail_bitmap = 0x7f902c35eb00
exc_core_bitmap = 0x0
resv_bitmap = 0x7f902c0c8ec0
now = 1476737065
sched_start = 1476737066
later_start = 1476870600
start_res = 1476856200
resv_end = 0
window_end = 1477082641
orig_sched_start = 1476737041
orig_start_time = 0
node_space = 0x7f90382f6020
bf_time1 = {tv_sec = 1476737041, tv_usec = 177590}
bf_time2 = {tv_sec = 1476736861, tv_usec = 163944}
rc = 0
job_test_count = 1
test_time_count = 0
pend_time = 423794
uid = 0x7f902c6099e0
nuser = 47
bf_parts = 0
bf_part_jobs = 0x0
njobs = 0x7f902cbbe040
already_counted = true
reject_array_job_id = 0
reject_array_part = 0x0
job_start_cnt = 0
start_time = 1476942600
config_update = 1476687614
part_update = 1476736860
start_tv = {tv_sec = 1476737066, tv_usec = 132024}
test_array_job_id = 2239675
test_array_count = 16
acct_max_nodes = 4294967295
wait_reason = 44
job_no_reserve = 0
resv_overlap = false
save_share_res = 3 '\003'
save_whole_node = 0 '\000'
test_fini = -1
__func__ = "_attempt_backfill"
bf_max_user_msg = true
#11 0x00007f90e5f92fbe in backfill_agent (args=0x0) at backfill.c:764
now = 1476737041
wait_time = 180
all_locks = {config = READ_LOCK, job = WRITE_LOCK, node =
WRITE_LOCK, partition = READ_LOCK}
load_config = false
short_sleep = false
last_backfill_time = 1476736861
__func__ = "backfill_agent"
#12 0x00007f90ec171806 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#13 0x00007f90ebecc9bd in clone () from /lib64/libc.so.6
No symbol table info available.
#14 0x0000000000000000 in ?? ()
No symbol table info available.
On 10/17/16 3:41 PM, bugs@schedmd.com wrote:
> Comment # 3 <https://bugs.schedmd.com/show_bug.cgi?id=3185#c3> on bug 3185 from Danny Auble <da@schedmd.com>
Hello, I don't see any evidence of the OOM killer running on the Slurm controller node. Just as an additional data point.

-Doug

Danny Auble:
Ah, I was looking at the wrong bitmap operation.

    overlap = bit_overlap(bitmap, tmp_job_ptr->node_bitmap);

Could you send

    print *tmp_job_ptr

from frame 3?

Doug Jacobsen:
See comment 0, I think I posted it there.

Danny Auble:
I see it Doug, sorry for missing it beforehand. Could you also print something like

    print *(select_jobinfo_t *)(job_ptr->select_jobinfo->data)

Hopefully in that frame it will give you something, though I am not holding my breath on that print; I am guessing you will get an incomplete type.

I am fairly sure the job is in a cleaning state (NHC is running). I can easily see in the log that it printed all those "has zero end_time" messages (fixed in 16.05.5). My guess is that in the requeue of this job the node_bitmap was freed, and the job shouldn't be in this check. We are discussing how to fix this correctly. I wouldn't expect this to happen very often, but please let us know if it happens again. I'll let you know when we have a fix for this. I don't think we need much more from you now.

Created attachment 3597 [details]
Patch to prevent abort
Doug,
It's going to take some time to fully understand how the job got into this state and caused the abort, but this patch will at least prevent this abort from happening again. I'd like to propose that you use this patch for now and we'll study the logs and code more on Tuesday. I'm hoping that we can provide you with something better then.
I now understand what happened, and the patch that you have is probably the best solution, although I'll explain this better in the commit. Here are some key bits of the log with comments.

Job allocated resources:

    [2016-10-17T13:43:38.897] sched: Allocate JobID=2366862_75(2366942) NodeList=nid06105 #CPUs=2 Partition=shared

Job launch fails, apparently due to some problem with nid06105:

    [2016-10-17T13:43:41.368] job_complete: JobID=2366862_75(2366942) State=0x1 NodeCnt=1 WEXITSTATUS 0
    [2016-10-17T13:43:41.368] job_complete: requeue JobID=2366862_75(2366942) State=0x8000 NodeCnt=1 per user/system request
    [2016-10-17T13:43:41.368] job_complete: JobID=2366862_75(2366942) State=0x8000 NodeCnt=1 done

Job gets requeued, clearing end_time and node_bitmap (as desired):

    [2016-10-17T13:44:22.824] Requeuing JobID=2366862_75(2366942) State=0x0 NodeCnt=0

Many messages of this sort are generated by the backfill scheduler. These are due to NHC still running and will go away in Slurm version 16.05.5:

    [2016-10-17T13:44:23.447] error: Job 2366942 has zero end_time

Job is cancelled, setting its end_time to the current time:

    [2016-10-17T13:44:25.800] _job_signal: of pending JobID=2366862_75(2366942) State=0x4 NodeCnt=0 successful

It's the job cancel that puts the job into this state:
- NHC is running
- end_time is non-zero
- node_bitmap is NULL

The logic aborts with this combination. Adding the check for a NULL node_bitmap fixes this. The patch that is attached is logically equivalent to what I have committed to Slurm version 16.05.6:
https://github.com/SchedMD/slurm/commit/222746a65f7bd60e26330d8ed40248e54169c2d5