Ticket 3299

Summary: slurmd segfault in _decompress_data_zlib
Product: Slurm Reporter: David Gloe <david.gloe>
Component: slurmd    Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED
Severity: 2 - High Impact
Version: 16.05.6   
Hardware: Linux   
OS: Linux   
Site: CRAY
Cray Sites: Cray Internal
Version Fixed: 16.05.7 17.02-pre4
Attachments: Fix for race within slurmd on zlib decompression.

Description David Gloe 2016-11-30 09:07:17 MST
On an internal Cray system with CLE 5.2UP04 and slurm 16.05.6, slurmd segfaulted in _decompress_data_zlib:

(gdb) bt full
#0  0x00007ffff7235885 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007ffff7236e61 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007ffff727688f in __libc_message () from /lib64/libc.so.6
No symbol table info available.
#3  0x00007ffff727c098 in malloc_printerr () from /lib64/libc.so.6
No symbol table info available.
#4  0x00007ffff72810dc in free () from /lib64/libc.so.6
No symbol table info available.
#5  0x00007ffff7bcf004 in inflateEnd () from /lib64/libz.so.1
No symbol table info available.
#6  0x00000000005445c0 in _decompress_data_zlib (req=0x7f6940) at file_bcast.c:488
        ret = <optimized out>
        flush = 0
        have = 262144
        buf_in_offset = <optimized out>
        buf_out_offset = 0
        out_buf = 0x1210920 ""
        strm = {
          next_in = 0x10445fc ":\342\310`\306\302\273<5X\321\321\277\314b,\356\004\067Y\321iQ\205L\366)av\323_\350\361M\214\261\310\256\025\070\003\005\v\277\224\205H\352\216$T\331b\035\234\214g\037\355\372\205n\252\325X\350`\331\023]G\347\022\221cW\177\302\024\035\221\067\301\022\277{\256%>\271\207\223\375A,\331\b\033z\311\022\017\n-\376\006\304g\315e\n\361u\022<\342_@\a\307\272'\006\304'y\304/\231m\211o\030f\262\367\b3\004\342\277\267\304\203B\213\217&T\331\240\276R\373\317m\342\317\b\355[\017\022\332\337\362\210w\314\263\304\307}\344d_\020\346\032\304\337\267\304\203B\213\377\214Pq\035F:\331;]\213&:Y"..., avail_in = 262052,
          total_in = 34184, next_out = 0x7ffff7f49b40 "", avail_out = 262144, total_out = 262144, msg = 0x7ffff7bd7e47 "invalid literal/lengths set", state = 0x813b30, zalloc = 0x7ffff7bd3c50,
          zfree = 0x7ffff7bd3c40, opaque = 0x0, data_type = 5, adler = 3361490262, reserved = 0}
#7  0x000000000043cf30 in _rpc_file_bcast (msg=0x7f74f0) at req.c:3745
        rc = <optimized out>
        offset = 0
        inx = <optimized out>
        file_info = <optimized out>
        req = 0x7f6940
        key = {data = 0x7ffff7f89cc0, fd = 7680, file_size = 738734374921, fname = 0x7f3780 "/tmp/cti_daemonvPLZsE1.tar", gid = 12790, job_id = 1728173, last_update = 4867746, max_blocks = 8354584,
          received_blocks = 0, start_time = 8354596, uid = 24918}
#8  0x000000000043ed5d in slurmd_req (msg=0x3074) at req.c:416
        rc = <optimized out>
#9  0x0000000000431fd0 in _service_connection (arg=<optimized out>) at slurmd.c:606
        con = 0x7f75b0
        msg = 0x7f74f0
        rc = 0
#10 0x00007ffff7586806 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#11 0x00007ffff72e165d in clone () from /lib64/libc.so.6
No symbol table info available.
#12 0x0000000000000000 in ?? ()
No symbol table info available.

Here's the end of the slurmd log file:
[2016-11-29T14:29:09.962] [1728175.0] core_spec/cray: init
[2016-11-29T14:29:10.122] [1728175.0] (switch_cray.c: 656: switch_p_job_init) gres_cnt: 2072 0
[2016-11-29T14:29:10.200] [1728175.0] task/cgroup: /slurm/uid_24918/job_1728175: alloc=8192MB mem.limit=8192MB memsw.limit=unlimited
[2016-11-29T14:29:10.201] [1728175.0] task/cgroup: /slurm/uid_24918/job_1728175/step_0: alloc=8192MB mem.limit=8192MB memsw.limit=unlimited
[2016-11-29T14:29:10.202] [1728175.0] Created file /var/opt/cray/alps/spool/status1728175
[2016-11-29T14:29:10.265] [1728175.0] task_p_pre_launch: Using sched_affinity for tasks
[2016-11-29T14:29:10.827] sbcast req_uid=24918 job_id=1728174 fname=/tmp/cti_daemon6wRFT81.tar block_no=1
[2016-11-29T14:29:11.019] sbcast req_uid=24918 job_id=1728175 fname=/tmp/cti_daemonKhW4z21.tar block_no=1
[2016-11-29T14:29:11.036] sbcast req_uid=24918 job_id=1728172 fname=/tmp/cti_daemonwpl0Q01.tar block_no=1
[2016-11-29T14:29:11.377] sbcast req_uid=24918 job_id=1728173 fname=/tmp/cti_daemonvPLZsE1.tar block_no=1
[2016-11-29T14:29:11.892] error: sbcast: data decompression error for UID 24918, file /tmp/cti_daemonwpl0Q01.tar
[2016-11-29T14:29:13.552] error: sbcast: data decompression error for UID 24918, file /tmp/cti_daemon6wRFT81.tar
[2016-11-29T14:39:26.158] [1728172.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2016-11-29T14:39:26.164] [1728172.0] done with job
[2016-11-29T14:39:26.461] [1728173.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2016-11-29T14:39:26.464] [1728173.0] done with job
[2016-11-29T14:57:23.549] [1728175.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2016-11-29T14:57:23.554] [1728175.0] done with job
[2016-11-29T14:57:55.386] [1728174.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2016-11-29T14:57:55.391] [1728174.0] done with job

We have zlib version 1.2.7-0.12.3.
Comment 1 David Gloe 2016-11-30 09:13:17 MST
On another node, a slightly different trace:
(gdb) bt full
#0  0x00007ffff7bcf581 in ?? () from /lib64/libz.so.1
No symbol table info available.
#1  0x00007ffff7bcfcb7 in inflate () from /lib64/libz.so.1
No symbol table info available.
#2  0x00000000005445a0 in _decompress_data_zlib (req=0x7f76c0) at file_bcast.c:482
        ret = <optimized out>
        flush = 0
        have = 262144
        buf_in_offset = 151695
        buf_out_offset = 0
        out_buf = 0x16f4580 ""
        strm = {
          next_in = 0x14c7c0f "\325\321<\354\020 \270C\b\036\334Cp\v\020\334!P\234\224\342\356^\212\273\273{).\301)P\240X\261\006\b\024w(\356\016\205\020\276\366\235{{}3\277\247\271\256\376s3\335\347\234=\273\263\063\263\263\263>\302\337\316\326\366U^\370\221\006z<o\205O\256!\362N\377\307\217%\274y\210\216\353\226\200?\340\273\214\200\257\204\333xT\342\330\302kG\223\361\237\005\336\fn\343B\247\341kS\312H\310@\242j\205\070\302O\224\325\373\002\233\341\035\337h}[0\256\360\331\344U\272<\366\252\360\326>z\037g9\374\371\237\322\216\333\257\211\202?\236\247\343?\327\342\t\367\357\255\343\066\257\340\337\337\320\363\250Y|\341U\375\245"...,
          avail_in = 240353, total_in = 49678, next_out = 0x7ffff1285ab1 "", avail_out = 151695, total_out = 403575, msg = 0x7ffff7bd88b3 "incorrect data check", state = 0x0, zalloc = 0x7ffff7bd3c50,
          zfree = 0x7ffff7bd3c40, opaque = 0x0, data_type = 66, adler = 507598220, reserved = 0}
#3  0x000000000043cf30 in _rpc_file_bcast (msg=0x7f74f0) at req.c:3745
        rc = <optimized out>
        offset = 0
        inx = <optimized out>
        file_info = <optimized out>
        req = 0x7f76c0
        key = {data = 0x7ffff12aacc0, fd = 7680, file_size = 738734374921, fname = 0x7f7760 "/tmp/cti_daemonSeLKvj1.tar", gid = 12790, job_id = 1728176, last_update = 4867746, max_blocks = 8349480,
          received_blocks = 0, start_time = 8349492, uid = 24918}
#4  0x000000000043ed5d in slurmd_req (msg=0x799380 <strm.13045>) at req.c:416
        rc = <optimized out>
#5  0x0000000000431fd0 in _service_connection (arg=<optimized out>) at slurmd.c:606
        con = 0x7f7620
        msg = 0x7f74f0
        rc = 0
#6  0x00007ffff7586806 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#7  0x00007ffff72e165d in clone () from /lib64/libc.so.6
No symbol table info available.
#8  0x0000000000000000 in ?? ()
No symbol table info available.
Comment 2 Tim Wickberg 2016-11-30 09:15:15 MST
Is this the same file both times?
Comment 3 David Gloe 2016-11-30 09:23:25 MST
It appears to be a temporary file, so the name is different each time. From a few different nodes:

[2016-11-30T08:06:30.199] error: sbcast: data decompression error for UID 4038, file /tmp/cti_daemonNdmZrt1.tar

[2016-11-30T08:06:31.486] error: sbcast: data decompression error for UID 4038, file /tmp/cti_daemonzQA9ci1.tar

This issue has taken down 79 nodes in a day.
Comment 4 Tim Wickberg 2016-11-30 10:20:04 MST
Created attachment 3772 [details]
Fix for race within slurmd on zlib decompression.

I can see the issue now: multiple files sbcast to the same node using zlib compression can corrupt each other's decompression streams, leading to the segfault. Only zlib is affected; lz4 compression and uncompressed transfers are fine.

The attached patch should correct this. Unfortunately, you'll need to restart the slurmds to pick up the fix. As a temporary alternative, if you build and install the srun/sbcast commands *without* libz support, they'll fall back to no compression and avoid this issue entirely.

While not directly related to this issue, I'll note that lz4 is much faster than zlib, and I'd encourage you to look into deploying it on future installs. (And if the library were included in CLE by default, that would be even better; I suspect NERSC has asked about this already.)

- Tim
Comment 5 David Gloe 2016-11-30 10:26:16 MST
I'm working on adding lz4 to CLE now; it's going through our legal process at the moment. The licensing is unusual: it's dual-licensed under BSD and GPLv2.

I'll ask the system administrator if he wants to try the patch.
Comment 6 Tim Wickberg 2016-11-30 10:34:35 MST
(In reply to David Gloe from comment #5)
> I'm working on adding lz4 to CLE now. It's going through our legal process
> at the moment. It has strange licensing with both BSD and GPLv2.

Glad to hear that; it's much faster in our testing. On a local machine, zlib is always slower than not compressing at all, whereas lz4 still shows a performance gain even over the loopback interface.

> I'll ask the system administrator if he wants to try the patch.

Without that patch, any slurmd receiving compressed sbcast files from separate jobs simultaneously is subject to that race condition. I haven't been able to trigger it locally, but the potential is there, and you've obviously hit it a number of times in your environment.

Replacing the client commands with non-zlib-enabled variants is probably the quicker and safer route, but fixing slurmd is the only long-term option.
Comment 7 Tim Wickberg 2016-11-30 10:51:10 MST
Patch is in commit 8c5765c96027a1b and will be in 16.05.7 when released. Let me know if there's anything else I can address on this; otherwise, marking this fixed now.