Description
ARC Admins
2020-03-10 10:45:40 MDT
Can you upload a backtrace from the core dump? Also, do you still have the band-aid patch from bug 8360 applied? I haven't resolved that one yet and I'm wondering if it is causing this crash.

(In reply to Marshall Garey from comment #1)
> Can you upload a backtrace from the core dump?

(gdb) bt
#0  0x00002adfc01e2337 in raise () from /lib64/libc.so.6
#1  0x00002adfc01e3a28 in abort () from /lib64/libc.so.6
#2  0x00002adfc01db156 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00002adfc01db202 in __assert_fail () from /lib64/libc.so.6
#4  0x00002adfbfa04e5c in bit_alloc (nbits=0) at bitstring.c:166
#5  0x00002adfbfa245fa in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7fff487048f8, buffer=buffer@entry=0x430a840, job_id=3222565, protocol_version=protocol_version@entry=8448) at gres.c:6006
#6  0x0000000000460816 in _load_job_state (buffer=buffer@entry=0x430a840, protocol_version=<optimized out>) at job_mgr.c:1879
#7  0x00000000004634f5 in load_all_job_state () at job_mgr.c:1114
#8  0x000000000049f56c in read_slurm_conf (recover=<optimized out>, reconfig=reconfig@entry=false) at read_config.c:1334
#9  0x000000000042d699 in main (argc=<optimized out>, argv=<optimized out>) at controller.c:662
(gdb)

> Also, do you still have the band-aid patch from bug 8360 applied? I haven't
> resolved that one yet and I'm wondering if it is causing this crash.

Yes, that was resolved; we had a node that was removed but was still listed in slurm.conf.

(In reply to ARCTS Admins from comment #2)
> (In reply to Marshall Garey from comment #1)
> > Can you upload a backtrace from the core dump?
>
> (gdb) bt

Please call the following in gdb with the core:
> set pagination off
> set print pretty on
> f 5
> info locals
> info args

(In reply to Nate Rini from comment #3)
> Please call the following in gdb with the core:
> set pagination off
> set print pretty on
> f 5
> info locals
> info args

(gdb) set pagination off
(gdb) set print pretty on
(gdb) f 5
#5  0x00002adfbfa245fa in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7fff487048f8, buffer=buffer@entry=0x430a840, job_id=3222565, protocol_version=protocol_version@entry=8448) at gres.c:6006
6006	gres.c: No such file or directory.
(gdb) info locals
tmp_str = 0x4323480 "0x"
_size = 0
_tmp_uint32 = 3
i = 0
rc = 0
magic = 1133130964
plugin_id = 7696487
utmp32 = 1
rec_cnt = 0
has_more = 1 '\001'
gres_ptr = <optimized out>
gres_job_ptr = 0x4323380
__func__ = "gres_plugin_job_state_unpack"
(gdb) info args
gres_list = 0x7fff487048f8
buffer = 0x430a840
job_id = 3222565
protocol_version = 8448
(gdb)

Can you also upload the output of the following from gdb?

frame 5
p *gres_job_ptr

(In reply to Marshall Garey from comment #6)
> Can you also upload the output of the following from gdb?
>
> frame 5
> p *gres_job_ptr

(gdb) frame 5
#5  0x00002adfbfa245fa in gres_plugin_job_state_unpack (gres_list=gres_list@entry=0x7fff487048f8, buffer=buffer@entry=0x430a840, job_id=3222565, protocol_version=protocol_version@entry=8448) at gres.c:6006
6006	in gres.c
(gdb) p *gres_job_ptr
$1 = {
  gres_name = 0x0,
  type_id = 0,
  type_name = 0x0,
  flags = 0,
  cpus_per_gres = 0,
  gres_per_job = 0,
  gres_per_node = 0,
  gres_per_socket = 0,
  gres_per_task = 2,
  mem_per_gres = 4096,
  def_cpus_per_gres = 0,
  def_mem_per_gres = 0,
  total_node_cnt = 0,
  gres_bit_select = 0x0,
  gres_cnt_node_select = 0x0,
  total_gres = 0,
  node_cnt = 1,
  gres_bit_alloc = 0x4323460,
  gres_cnt_node_alloc = 0x4323440,
  gres_bit_step_alloc = 0x0,
  gres_cnt_step_alloc = 0x0
}
(gdb)

Please also call the following in gdb with the core:
> p *buffer
> p buffer->size
> p buffer->processed
> p buffer->mmaped
> p buffer->head
> x/32xb buffer->head
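An aside on the backtrace and locals above: frame #4 shows bit_alloc(nbits=0), and the locals (tmp_str = "0x", _size = 0) suggest a bitmap that was serialized with zero bits. A minimal C sketch of that failure mode, using simplified stand-ins rather than Slurm's real bitstring.c:

/* Simplified stand-in for Slurm's bitstring.c; illustrative only. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static uint64_t *bit_alloc(int64_t nbits)
{
	/* bitstring.c:166 in frame #4 is an assertion of this general
	 * shape: allocating a zero-bit bitmap is treated as a caller bug. */
	assert(nbits > 0);
	return calloc((nbits + 63) / 64, sizeof(uint64_t));
}

int main(void)
{
	/* info locals showed tmp_str = "0x": a bitmap serialized as a
	 * hex string with no digits, i.e. zero bits. */
	const char *tmp_str = "0x";
	int64_t nbits = (int64_t)(strlen(tmp_str) - 2) * 4;	/* 0 */
	uint64_t *bits = bit_alloc(nbits);	/* aborts, matching frames
						 * #0-#4 of the backtrace */
	free(bits);
	return 0;
}

Run as-is, this aborts the same way slurmctld did on startup.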
(In reply to Nate Rini from comment #9)
> Please also call the following in gdb with the core:
> p *buffer
> p buffer->size
> p buffer->processed
> p buffer->mmaped
> p buffer->head
> x/32xb buffer->head

(gdb) p buffer->size
$3 = 1324725
(gdb) p buffer->processed
$4 = 759
(gdb) p buffer->mmaped
$5 = true
(gdb) p buffer->head
$6 = 0x2adfc855f000 <Address 0x2adfc855f000 out of bounds>
(gdb) x/32xb buffer->head
0x2adfc855f000:	Cannot access memory at address 0x2adfc855f000
(gdb)

Created attachment 13324 [details]
workaround to get slurmctld running
Can you apply this patch and restart slurmctld? This is a workaround to make slurmctld start running while we investigate the cause. Also, can you save a copy of the (pre-patch) slurmctld binary, libraries, and coredump so we can keep asking for information about it?
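A note on the gdb output above: buffer->mmaped is true and buffer->head is unreadable because the job_state Buf is a file-backed mapping (the later logs show it loaded via create_mmap_buf), and file-backed private mappings are normally excluded from core dumps per /proc/<pid>/coredump_filter, so the core contains the struct but not the mapped file contents. A rough sketch of such a buffer; the field names mirror the members probed in gdb, but the layout is an assumption:

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Field names mirror the members probed in gdb; layout is a guess. */
typedef struct {
	char *head;		/* start of the packed state data       */
	uint32_t size;		/* total bytes available ($3 = 1324725) */
	uint32_t processed;	/* bytes already unpacked ($4 = 759)    */
	bool mmaped;		/* true: head is a file mapping ($5)    */
} buf_t;

static buf_t *create_mmap_buf(const char *path)
{
	struct stat st;
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return NULL;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return NULL;
	}
	buf_t *buf = calloc(1, sizeof(*buf));
	buf->size = (uint32_t)st.st_size;
	buf->head = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	buf->mmaped = (buf->head != MAP_FAILED);
	close(fd);	/* the mapping survives the close */
	return buf;
}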
Oh, and can you also keep a copy of your current state save location?

(In reply to Marshall Garey from comment #12)
> Oh, and can you also keep a copy of your current state save location?

Please attach a copy of your statesave directory when convenient.

Created attachment 13327 [details]
workaround v2
I missed something in the previous patch that I believe would still allow slurmctld to crash: the previous patch failed to unpack something that had been packed. I believe this one correctly handles a size 0 bitmap, which appears to be what was saved. Can you apply this one instead? There might still be a crash somewhere else because of the 0-size bitmap, but hopefully this at least lets slurmctld progress further.
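Marshall's comment describes two distinct hazards: skipping an unpack for a field that was packed (which desynchronizes every later read), and passing a stored size of 0 straight into bit_alloc() (which asserts). A toy sketch of both, assuming a pack/unpack scheme like Slurm's; this is not the actual attachment 13327 patch:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy byte stream playing the role of Slurm's Buf: data plus an offset
 * tracking how much has been consumed. */
typedef struct { uint8_t data[64]; uint32_t processed; } buf_t;

static void pack32(uint32_t v, buf_t *b)
{
	memcpy(b->data + b->processed, &v, sizeof(v));
	b->processed += sizeof(v);
}

static uint32_t unpack32(buf_t *b)
{
	uint32_t v;
	memcpy(&v, b->data + b->processed, sizeof(v));
	b->processed += sizeof(v);
	return v;
}

static uint32_t *bit_alloc(uint32_t nbits)
{
	assert(nbits > 0);	/* the assert that killed slurmctld */
	return calloc((nbits + 31) / 32, sizeof(uint32_t));
}

/* A bitmap is packed as (nbits, words...); nbits may legitimately be 0. */
static void pack_bitmap(const uint32_t *bits, uint32_t nbits, buf_t *b)
{
	pack32(nbits, b);
	for (uint32_t i = 0; i < (nbits + 31) / 32; i++)
		pack32(bits[i], b);
}

/* The workaround's idea: always consume nbits (or every later field is
 * read from the wrong offset), but skip bit_alloc() when it is 0 and
 * leave the bitmap NULL instead of tripping the assert. */
static uint32_t *unpack_bitmap(buf_t *b)
{
	uint32_t nbits = unpack32(b);
	if (nbits == 0)
		return NULL;
	uint32_t *bits = bit_alloc(nbits);
	for (uint32_t i = 0; i < (nbits + 31) / 32; i++)
		bits[i] = unpack32(b);
	return bits;
}

int main(void)
{
	buf_t b = { { 0 }, 0 };
	pack_bitmap(NULL, 0, &b);	/* 0-size bitmap, as in the state file */
	pack32(42, &b);			/* some later field */
	b.processed = 0;
	uint32_t *bm = unpack_bitmap(&b);	/* NULL, no abort... */
	assert(bm == NULL && unpack32(&b) == 42);	/* ...and no desync */
	return 0;
}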
(In reply to Marshall Garey from comment #14)
> Created attachment 13327 [details]
> workaround v2
>
> I missed something in the previous patch that I believe would still allow
> slurmctld to crash: the previous patch failed to unpack something that had
> been packed. I believe this one correctly handles a size 0 bitmap, which
> appears to be what was saved. Can you apply this one instead? There might
> still be a crash somewhere else because of the 0-size bitmap, but hopefully
> this at least lets slurmctld progress further.

This reply is three comments out of date, but I think it might be of some value:

I have a saved copy of the slurm.state directory from before the upgrade.

I applied the patch and ran slurmctld -D, which resulted in this error:

slurmctld: debug3: create_mmap_buf: loaded file `/var/spool/slurm.state/node_state` as Buf
slurmctld: debug3: Version string in node_state header is PROTOCOL_VERSION
slurmctld: Recovered state of 407 nodes
slurmctld: Down nodes: gl3269
slurmctld: debug3: create_mmap_buf: loaded file `/var/spool/slurm.state/job_state` as Buf
slurmctld: debug3: Version string in job_state header is PROTOCOL_VERSION
slurmctld: debug3: Job id in job_state header is 4793557
slurmctld: error: unpackmem_xmalloc: Buffer to be unpacked is too large (1936994115 > 100000000)
slurmctld: error: Incomplete job record
slurmctld: fatal: Incomplete job state save file, start with '-i' to ignore this

I then ran:

slurmctld -D -i

It seems to be up and staying up, but my queue is empty; I suspect that is the outcome of the -i.

(In reply to ARCTS Admins from comment #15)
> It seems to be up and staying up, but my queue is empty; I suspect that is
> the outcome of the -i.

Correct. Do you still want to try to recover those jobs, or would you rather lose them and stay online?

(In reply to Marshall Garey from comment #14)
> Created attachment 13327 [details]
> workaround v2
>
> I believe this one correctly handles a size 0 bitmap, which appears to be
> what was saved. Can you apply this one instead?

I will apply this patch; also attached is the /var/spool/slurm* backup from before the upgrade.

Created attachment 13328 [details]
Slurm state files
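The unpackmem_xmalloc error in the log above ("Buffer to be unpacked is too large") shows the desync guard doing its job: a length prefix read from the wrong offset decodes as garbage, and a hard cap rejects it rather than attempting a ~2 GB allocation. A hedged sketch of that style of helper; names and byte order are illustrative, not Slurm's pack.c:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Cap mirroring the 100000000 printed in the log message; the real
 * constant lives in Slurm's pack code. */
#define MAX_PACK_MEM_LEN 100000000u

/* Sketch of an unpackmem_xmalloc-style helper: read a 32-bit length
 * prefix, then the payload. When pack and unpack disagree on layout,
 * the "length" is really four bytes of unrelated data; 1936994115 is
 * 0x73742F43, four printable ASCII bytes, a telltale sign of text
 * being read as a length. The guard catches that case. */
static char *unpackmem_xmalloc(const uint8_t *stream, uint32_t *offset)
{
	uint32_t len;
	memcpy(&len, stream + *offset, sizeof(len));
	*offset += sizeof(len);
	if (len > MAX_PACK_MEM_LEN) {
		fprintf(stderr,
			"Buffer to be unpacked is too large (%u > %u)\n",
			len, MAX_PACK_MEM_LEN);
		return NULL;	/* caller then logs "Incomplete job record" */
	}
	char *out = malloc(len);
	if (out)
		memcpy(out, stream + *offset, len);
	*offset += len;
	return out;
}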
(In reply to Nate Rini from comment #16)
> Correct. Do you still want to try to recover those jobs, or would you
> rather lose them and stay online?

Yes, if possible we would like to restore the jobs that were in the queue prior to the update.

(In reply to ARCTS Admins from comment #18)
> Created attachment 13328 [details]
> Slurm state files

Please also attach your slurm.conf (& friends).

(In reply to ARCTS Admins from comment #19)
> Yes, if possible we would like to restore the jobs that were in the queue
> prior to the update.

You will need to do the following:
1. Stop slurmctld.
2. Apply the patch from comment #14.
3. Restore the state directory to its pre-upgrade contents.
4. Run slurmctld (without -i).

Please attach logs and a bt if it crashes.

Created attachment 13329 [details]
Slurm.conf and friends
(In reply to Nate Rini from comment #21)
> You will need to do the following:
> 1. Stop slurmctld.
> 2. Apply the patch from comment #14.
> 3. Restore the state directory to its pre-upgrade contents.
> 4. Run slurmctld (without -i).
>
> Please attach logs and a bt if it crashes.

I've applied the patch and followed the steps provided; slurmctld seems to be up, and the queue is showing jobs.

(In reply to ARCTS Admins from comment #24)
> I've applied the patch and followed the steps provided; slurmctld seems to
> be up, and the queue is showing jobs.

Great. Feel free to remove the patch when convenient, as it should only be needed to read the old 18.08 state.

Lowering ticket severity. We have your state and can recreate the issue locally.

(In reply to Nate Rini from comment #25)
> Great. Feel free to remove the patch when convenient, as it should only be
> needed to read the old 18.08 state.

Thanks. I will stop the slurmctld service, revert to the un-patched version, and report back.

(In reply to ARCTS Admins from comment #26)
> Thanks. I will stop the slurmctld service, revert to the un-patched
> version, and report back.

Running the un-patched 19.05.5 looks good; jobs are showing up in the queue.

It turns out this was already fixed in commit 6e94ef316. However, the fix is only in 20.02, since it changed a macro used in the protocol layer code and we didn't want to make that sort of change in a stable version of Slurm.

I also updated the documentation and the log message about slurmctld -i to warn that it will throw out any unrecoverable data from the StateSaveLocation. That's in commit cc07bc9341 in 20.02.1.

I'm closing this ticket as resolved/fixed in 20.02.0 per commit 6e94ef316.
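On the "macro used in the protocol layer code" mentioned in the resolution: Slurm's unpack paths are built from safe_unpack*-style macros that jump to an unpack_error label on any short read, so changing one fans out across every unpack function that expands it. A toy rendition of the pattern (illustrative, not commit 6e94ef316):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { const uint8_t *data; uint32_t size, processed; } buf_t;

static int unpack32(uint32_t *valp, buf_t *buf)
{
	if (buf->size - buf->processed < sizeof(*valp))
		return -1;			/* short read */
	memcpy(valp, buf->data + buf->processed, sizeof(*valp));
	buf->processed += sizeof(*valp);
	return 0;
}

/* Mimics the shape of Slurm's safe_unpack32: bail to the function-local
 * unpack_error label on failure. Every unpack function in the protocol
 * layer expands macros like this, which is why changing one was judged
 * too invasive for a stable branch. */
#define safe_unpack32(valp, buf)		\
do {						\
	if (unpack32((valp), (buf)))		\
		goto unpack_error;		\
} while (0)

static int load_record(buf_t *buf, uint32_t *job_id)
{
	safe_unpack32(job_id, buf);
	return 0;
unpack_error:
	fprintf(stderr, "Incomplete job record\n");
	return -1;
}

int main(void)
{
	uint8_t raw[2] = { 0 };			/* deliberately truncated */
	buf_t buf = { raw, sizeof(raw), 0 };
	uint32_t job_id;
	return load_record(&buf, &job_id);	/* hits unpack_error */
}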