On an internal Cray system with CLE 5.2UP04 and Slurm 16.05.6, slurmd segfaulted in _decompress_data_zlib:

(gdb) bt full
#0  0x00007ffff7235885 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007ffff7236e61 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007ffff727688f in __libc_message () from /lib64/libc.so.6
No symbol table info available.
#3  0x00007ffff727c098 in malloc_printerr () from /lib64/libc.so.6
No symbol table info available.
#4  0x00007ffff72810dc in free () from /lib64/libc.so.6
No symbol table info available.
#5  0x00007ffff7bcf004 in inflateEnd () from /lib64/libz.so.1
No symbol table info available.
#6  0x00000000005445c0 in _decompress_data_zlib (req=0x7f6940) at file_bcast.c:488
        ret = <optimized out>
        flush = 0
        have = 262144
        buf_in_offset = <optimized out>
        buf_out_offset = 0
        out_buf = 0x1210920 ""
        strm = {
          next_in = 0x10445fc ":\342\310`\306\302\273<5X\321\321\277\314b,\356\004\067Y\321iQ\205L\366)av\323_\350\361M\214\261\310\256\025\070\003\005\v\277\224\205H\352\216$T\331b\035\234\214g\037\355\372\205n\252\325X\350`\331\023]G\347\022\221cW\177\302\024\035\221\067\301\022\277{\256%>\271\207\223\375A,\331\b\033z\311\022\017\n-\376\006\304g\315e\n\361u\022<\342_@\a\307\272'\006\304'y\304/\231m\211o\030f\262\367\b3\004\342\277\267\304\203B\213\217&T\331\240\276R\373\317m\342\317\b\355[\017\022\332\337\362\210w\314\263\304\307}\344d_\020\346\032\304\337\267\304\203B\213\377\214Pq\035F:\331;]\213&:Y"...,
          avail_in = 262052,
          total_in = 34184,
          next_out = 0x7ffff7f49b40 "",
          avail_out = 262144,
          total_out = 262144,
          msg = 0x7ffff7bd7e47 "invalid literal/lengths set",
          state = 0x813b30,
          zalloc = 0x7ffff7bd3c50,
          zfree = 0x7ffff7bd3c40,
          opaque = 0x0,
          data_type = 5,
          adler = 3361490262,
          reserved = 0}
#7  0x000000000043cf30 in _rpc_file_bcast (msg=0x7f74f0) at req.c:3745
        rc = <optimized out>
        offset = 0
        inx = <optimized out>
        file_info = <optimized out>
        req = 0x7f6940
        key = {data = 0x7ffff7f89cc0, fd = 7680, file_size = 738734374921,
          fname = 0x7f3780 "/tmp/cti_daemonvPLZsE1.tar", gid = 12790,
          job_id = 1728173, last_update = 4867746, max_blocks = 8354584,
          received_blocks = 0, start_time = 8354596, uid = 24918}
#8  0x000000000043ed5d in slurmd_req (msg=0x3074) at req.c:416
        rc = <optimized out>
#9  0x0000000000431fd0 in _service_connection (arg=<optimized out>) at slurmd.c:606
        con = 0x7f75b0
        msg = 0x7f74f0
        rc = 0
#10 0x00007ffff7586806 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#11 0x00007ffff72e165d in clone () from /lib64/libc.so.6
No symbol table info available.
#12 0x0000000000000000 in ?? ()
No symbol table info available.

Here's the end of the slurmd log file:

[2016-11-29T14:29:09.962] [1728175.0] core_spec/cray: init
[2016-11-29T14:29:10.122] [1728175.0] (switch_cray.c: 656: switch_p_job_init) gres_cnt: 2072 0
[2016-11-29T14:29:10.200] [1728175.0] task/cgroup: /slurm/uid_24918/job_1728175: alloc=8192MB mem.limit=8192MB memsw.limit=unlimited
[2016-11-29T14:29:10.201] [1728175.0] task/cgroup: /slurm/uid_24918/job_1728175/step_0: alloc=8192MB mem.limit=8192MB memsw.limit=unlimited
[2016-11-29T14:29:10.202] [1728175.0] Created file /var/opt/cray/alps/spool/status1728175
[2016-11-29T14:29:10.265] [1728175.0] task_p_pre_launch: Using sched_affinity for tasks
[2016-11-29T14:29:10.827] sbcast req_uid=24918 job_id=1728174 fname=/tmp/cti_daemon6wRFT81.tar block_no=1
[2016-11-29T14:29:11.019] sbcast req_uid=24918 job_id=1728175 fname=/tmp/cti_daemonKhW4z21.tar block_no=1
[2016-11-29T14:29:11.036] sbcast req_uid=24918 job_id=1728172 fname=/tmp/cti_daemonwpl0Q01.tar block_no=1
[2016-11-29T14:29:11.377] sbcast req_uid=24918 job_id=1728173 fname=/tmp/cti_daemonvPLZsE1.tar block_no=1
[2016-11-29T14:29:11.892] error: sbcast: data decompression error for UID 24918, file /tmp/cti_daemonwpl0Q01.tar
[2016-11-29T14:29:13.552] error: sbcast: data decompression error for UID 24918, file /tmp/cti_daemon6wRFT81.tar
[2016-11-29T14:39:26.158] [1728172.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2016-11-29T14:39:26.164] [1728172.0] done with job
[2016-11-29T14:39:26.461] [1728173.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2016-11-29T14:39:26.464] [1728173.0] done with job
[2016-11-29T14:57:23.549] [1728175.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2016-11-29T14:57:23.554] [1728175.0] done with job
[2016-11-29T14:57:55.386] [1728174.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2016-11-29T14:57:55.391] [1728174.0] done with job

We have zlib version 1.2.7-0.12.3.
On another node, a slightly different trace:

(gdb) bt full
#0  0x00007ffff7bcf581 in ?? () from /lib64/libz.so.1
No symbol table info available.
#1  0x00007ffff7bcfcb7 in inflate () from /lib64/libz.so.1
No symbol table info available.
#2  0x00000000005445a0 in _decompress_data_zlib (req=0x7f76c0) at file_bcast.c:482
        ret = <optimized out>
        flush = 0
        have = 262144
        buf_in_offset = 151695
        buf_out_offset = 0
        out_buf = 0x16f4580 ""
        strm = {
          next_in = 0x14c7c0f "\325\321<\354\020 \270C\b\036\334Cp\v\020\334!P\234\224\342\356^\212\273\273{).\301)P\240X\261\006\b\024w(\356\016\205\020\276\366\235{{}3\277\247\271\256\376s3\335\347\234=\273\263\063\263\263\263>\302\337\316\326\366U^\370\221\006z<o\205O\256!\362N\377\307\217%\274y\210\216\353\226\200?\340\273\214\200\257\204\333xT\342\330\302kG\223\361\237\005\336\fn\343B\247\341kS\312H\310@\242j\205\070\302O\224\325\373\002\233\341\035\337h}[0\256\360\331\344U\272<\366\252\360\326>z\037g9\374\371\237\322\216\333\257\211\202?\236\247\343?\327\342\t\367\357\255\343\066\257\340\337\337\320\363\250Y|\341U\375\245"...,
          avail_in = 240353,
          total_in = 49678,
          next_out = 0x7ffff1285ab1 "",
          avail_out = 151695,
          total_out = 403575,
          msg = 0x7ffff7bd88b3 "incorrect data check",
          state = 0x0,
          zalloc = 0x7ffff7bd3c50,
          zfree = 0x7ffff7bd3c40,
          opaque = 0x0,
          data_type = 66,
          adler = 507598220,
          reserved = 0}
#3  0x000000000043cf30 in _rpc_file_bcast (msg=0x7f74f0) at req.c:3745
        rc = <optimized out>
        offset = 0
        inx = <optimized out>
        file_info = <optimized out>
        req = 0x7f76c0
        key = {data = 0x7ffff12aacc0, fd = 7680, file_size = 738734374921,
          fname = 0x7f7760 "/tmp/cti_daemonSeLKvj1.tar", gid = 12790,
          job_id = 1728176, last_update = 4867746, max_blocks = 8349480,
          received_blocks = 0, start_time = 8349492, uid = 24918}
#4  0x000000000043ed5d in slurmd_req (msg=0x799380 <strm.13045>) at req.c:416
        rc = <optimized out>
#5  0x0000000000431fd0 in _service_connection (arg=<optimized out>) at slurmd.c:606
        con = 0x7f7620
        msg = 0x7f74f0
        rc = 0
#6  0x00007ffff7586806 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#7  0x00007ffff72e165d in clone () from /lib64/libc.so.6
No symbol table info available.
#8  0x0000000000000000 in ?? ()
No symbol table info available.
Is this the same file both times?
It appears to be a temporary file, so the name is different each time. From a few different nodes:

[2016-11-30T08:06:30.199] error: sbcast: data decompression error for UID 4038, file /tmp/cti_daemonNdmZrt1.tar
[2016-11-30T08:06:31.486] error: sbcast: data decompression error for UID 4038, file /tmp/cti_daemonzQA9ci1.tar

This issue has taken down 79 nodes in a day.
Created attachment 3772 [details]
Fix for race within slurmd on zlib decompression.

I can see the issue now: multiple sbcast'd files arriving at the same node with zlib compression can corrupt each other's decompression streams, leading to the segfault. This affects only zlib; lz4 compression and no compression are unaffected.

The attached patch should correct this. Unfortunately you'll need to restart the slurmd's to pick up the fix. As an alternative temporary measure, if you built and installed the srun/sbcast commands *without* libz support, they'd fall back to no compression and avoid this issue entirely.

While not directly related to this issue, I'll note that lz4 is much faster than zlib, and I'd encourage you to look into deploying it on future installs. (And if the library were included in CLE by default, that'd be even better; I suspect NERSC has asked about this already.)

- Tim
I'm working on adding lz4 to CLE now; it's going through our legal process at the moment. It has unusual licensing, dual BSD and GPLv2. I'll ask the system administrator if he wants to try the patch.
(In reply to David Gloe from comment #5)
> I'm working on adding lz4 to CLE now. It's going through our legal process
> at the moment. It has strange licensing with both BSD and GPLv2.

Glad to hear that; it's much faster in our testing. On a local machine zlib is always slower than not compressing at all, whereas lz4 still shows a performance gain even over the loopback interface.

> I'll ask the system administrator if he wants to try the patch.

Without that patch, any slurmd receiving compressed sbcast files from separate jobs simultaneously is subject to the race condition. I haven't been able to trigger it locally, but the potential is there, and you've obviously hit it a number of times in your environment. Replacing the client commands with non-zlib-enabled variants is probably the quicker and safer route, but fixing the slurmd is the only long-term option.
Patch is in commit 8c5765c96027a1b, and will be in 16.05.7 when released. Let me know if there's anything else I can address on this; otherwise I'm marking this as fixed now.