Frequent SIGSEGV with Fluent Bit 1.8.7 #4164

Closed
LionelCons opened this issue Oct 5, 2021 · 32 comments

@LionelCons
Contributor

After upgrading my machines to Fluent Bit 1.8.7, I see frequent crashes with SIGSEGV.

Weirdly, the stack traces vary quite a lot. Here are some examples.

[2021/10/04 06:07:53] [engine] caught signal (SIGSEGV)
#0  0x64163b            in  ???() at ???:0
#1  0x6293ad            in  ???() at ???:0
#2  0x5965be            in  lua_pushmsgpack() at plugins/filter_lua/lua.c:80
#3  0x596738            in  lua_pushmsgpack() at plugins/filter_lua/lua.c:110
#4  0x59774b            in  cb_lua_filter() at plugins/filter_lua/lua.c:544
#5  0x4b765b            in  flb_filter_do() at src/flb_filter.c:124
#6  0x4ea6d8            in  flb_input_chunk_append_raw() at src/flb_input_chunk.c:1035
#7  0x50f2f0            in  process_content() at plugins/in_tail/tail_file.c:476
#8  0x5113b4            in  flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1197
#9  0x50998f            in  in_tail_collect_event() at plugins/in_tail/tail.c:261
#10 0x50ab40            in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#11 0x4b70f9            in  flb_input_collector_fd() at src/flb_input.c:1107
#12 0x4c9486            in  flb_engine_handle_event() at src/flb_engine.c:411
#13 0x4c9486            in  flb_engine_start() at src/flb_engine.c:686
#14 0x4abc0f            in  flb_lib_worker() at src/flb_lib.c:627
#15 0x7f1a4a06cea4      in  ???() at ???:0
#16 0x7f1a492409fc      in  ???() at ???:0
#17 0xffffffffffffffff  in  ???() at ???:0
[2021/10/04 23:01:02] [engine] caught signal (SIGSEGV)
#0  0x5d7d31            in  mpack_load_u8() at lib/mpack-amalgamation-1.0/src/mpack/mpack.h:2029
#1  0x5ddb49            in  mpack_parse_tag() at lib/mpack-amalgamation-1.0/src/mpack/mpack.c:2630
#2  0x5de69d            in  mpack_read_tag() at lib/mpack-amalgamation-1.0/src/mpack/mpack.c:2979
#3  0x5de75e            in  mpack_discard() at lib/mpack-amalgamation-1.0/src/mpack/mpack.c:3028
#4  0x4e64b8            in  flb_mp_count() at src/flb_mp.c:44
#5  0x4b1f67            in  flb_pack_msgpack_to_json_format() at src/flb_pack.c:807
#6  0x549b8f            in  cb_http_flush() at plugins/out_http/http.c:321
#7  0x4b875b            in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:509
#8  0x7fbd25            in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9  0xffffffffffffffff  in  ???() at ???:0
[2021/10/05 13:08:01] [engine] caught signal (SIGSEGV)
#0  0x7d3cdb            in  cio_chunk_is_locked() at lib/chunkio/src/cio_chunk.c:343
#1  0x4e9c19            in  input_chunk_get() at src/flb_input_chunk.c:700
#2  0x4ea42d            in  flb_input_chunk_append_raw() at src/flb_input_chunk.c:976
#3  0x50f2f0            in  process_content() at plugins/in_tail/tail_file.c:476
#4  0x5113b4            in  flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1197
#5  0x50998f            in  in_tail_collect_event() at plugins/in_tail/tail.c:261
#6  0x50ab40            in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#7  0x4b70f9            in  flb_input_collector_fd() at src/flb_input.c:1107
#8  0x4c9486            in  flb_engine_handle_event() at src/flb_engine.c:411
#9  0x4c9486            in  flb_engine_start() at src/flb_engine.c:686
#10 0x4abc0f            in  flb_lib_worker() at src/flb_lib.c:627
#11 0x7f09fc8e3ea4      in  ???() at ???:0
#12 0x7f09fbab79fc      in  ???() at ???:0
#13 0xffffffffffffffff  in  ???() at ???:0
[2021/10/03 20:01:00] [engine] caught signal (SIGSEGV)
#0  0x7f60921960bb      in  ???() at ???:0
#1  0x7d7095            in  cio_file_write() at lib/chunkio/src/cio_file.c:824
#2  0x7d3917            in  cio_chunk_write() at lib/chunkio/src/cio_chunk.c:173
#3  0x4e86ae            in  flb_input_chunk_write() at src/flb_input_chunk.c:91
#4  0x4ea526            in  flb_input_chunk_append_raw() at src/flb_input_chunk.c:1005
#5  0x50f2f0            in  process_content() at plugins/in_tail/tail_file.c:476
#6  0x5113b4            in  flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1197
#7  0x50998f            in  in_tail_collect_event() at plugins/in_tail/tail.c:261
#8  0x50ab40            in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#9  0x4b70f9            in  flb_input_collector_fd() at src/flb_input.c:1107
#10 0x4c9486            in  flb_engine_handle_event() at src/flb_engine.c:411
#11 0x4c9486            in  flb_engine_start() at src/flb_engine.c:686
#12 0x4abc0f            in  flb_lib_worker() at src/flb_lib.c:627
#13 0x7f6092f6bea4      in  ???() at ???:0
#14 0x7f609213f9fc      in  ???() at ???:0
#15 0xffffffffffffffff  in  ???() at ???:0
@LionelCons
Contributor Author

I managed to capture a core dump.

Fluent Bit reported:

[2021/10/06 09:03:48] [engine] caught signal (SIGSEGV)
#0  0x475952            in  extent_arena_ind_get() at lib/jemalloc-5.2.1/include/jemalloc/internal/extent_inlines.h:40
#1  0x475952            in  extent_arena_get() at lib/jemalloc-5.2.1/include/jemalloc/internal/extent_inlines.h:49
#2  0x475952            in  je_large_dalloc() at lib/jemalloc-5.2.1/src/large.c:361
#3  0x447818            in  arena_dalloc_large() at lib/jemalloc-5.2.1/include/jemalloc/internal/arena_inlines_b.h:281
#4  0x447818            in  arena_dalloc() at lib/jemalloc-5.2.1/include/jemalloc/internal/arena_inlines_b.h:323
#5  0x447818            in  idalloctm() at lib/jemalloc-5.2.1/include/jemalloc/internal/jemalloc_internal_inlines_c.h:118
#6  0x447818            in  ifree() at lib/jemalloc-5.2.1/src/jemalloc.c:2589
#7  0x447818            in  je_free_default() at lib/jemalloc-5.2.1/src/jemalloc.c:2799
#8  0x50dac1            in  msgpack_sbuffer_destroy() at lib/msgpack-c/include/msgpack/sbuffer.h:41
#9  0x510ae4            in  flb_tail_file_remove() at plugins/in_tail/tail_file.c:1024
#10 0x510cdb            in  flb_tail_file_remove_all() at plugins/in_tail/tail_file.c:1070
#11 0x509dc2            in  in_tail_exit() at plugins/in_tail/tail.c:412
#12 0x4b60d9            in  flb_input_instance_exit() at src/flb_input.c:641
#13 0x4b6141            in  flb_input_exit_all() at src/flb_input.c:661
#14 0x4c98ce            in  flb_engine_shutdown() at src/flb_engine.c:800
#15 0x4c968e            in  flb_engine_start() at src/flb_engine.c:724
#16 0x4abc0f            in  flb_lib_worker() at src/flb_lib.c:627
#17 0x7fc3af71dea4      in  ???() at ???:0
#18 0x7fc3ae8f19fc      in  ???() at ???:0
#19 0xffffffffffffffff  in  ???() at ???:0

And gdb reports:

(gdb) bt
#0  0x00007fb2925e1387 in raise () from /lib64/libc.so.6
#1  0x00007fb2925e2a78 in abort () from /lib64/libc.so.6
#2  0x0000000000440e5a in flb_signal_handler (signal=11) at /builddir/build/BUILD/fluent-bit-1.8.7/src/fluent-bit.c:561
#3  <signal handler called>
#4  extent_arena_ind_get (extent=<optimized out>) at /builddir/build/BUILD/fluent-bit-1.8.7/lib/jemalloc-5.2.1/include/jemalloc/internal/extent_inlines.h:40
#5  extent_arena_get (extent=0x0) at /builddir/build/BUILD/fluent-bit-1.8.7/lib/jemalloc-5.2.1/include/jemalloc/internal/extent_inlines.h:49
#6  je_large_dalloc (tsdn=tsdn@entry=0x7fb28fffdb80, extent=0x0) at /builddir/build/BUILD/fluent-bit-1.8.7/lib/jemalloc-5.2.1/src/large.c:361
#7  0x0000000000447819 in arena_dalloc_large (slow_path=<optimized out>, szind=<optimized out>, tcache=<optimized out>, ptr=<optimized out>, 
    tsdn=<optimized out>) at /builddir/build/BUILD/fluent-bit-1.8.7/lib/jemalloc-5.2.1/include/jemalloc/internal/arena_inlines_b.h:281
#8  arena_dalloc (slow_path=<optimized out>, alloc_ctx=<optimized out>, tcache=<optimized out>, ptr=<optimized out>, tsdn=<optimized out>)
    at /builddir/build/BUILD/fluent-bit-1.8.7/lib/jemalloc-5.2.1/include/jemalloc/internal/arena_inlines_b.h:323
#9  idalloctm (slow_path=<optimized out>, is_internal=<optimized out>, alloc_ctx=<optimized out>, tcache=<optimized out>, ptr=<optimized out>, 
    tsdn=<optimized out>) at /builddir/build/BUILD/fluent-bit-1.8.7/lib/jemalloc-5.2.1/include/jemalloc/internal/jemalloc_internal_inlines_c.h:118
#10 ifree (slow_path=<optimized out>, tcache=<optimized out>, ptr=<optimized out>, tsd=<optimized out>)
    at /builddir/build/BUILD/fluent-bit-1.8.7/lib/jemalloc-5.2.1/src/jemalloc.c:2589
#11 je_free_default (ptr=<optimized out>) at /builddir/build/BUILD/fluent-bit-1.8.7/lib/jemalloc-5.2.1/src/jemalloc.c:2799
#12 0x000000000050dac2 in msgpack_sbuffer_destroy (sbuf=0x7fb28f22ef88) at /builddir/build/BUILD/fluent-bit-1.8.7/lib/msgpack-c/include/msgpack/sbuffer.h:41
#13 0x0000000000510ae5 in flb_tail_file_remove (file=0x7fb28f22ef00) at /builddir/build/BUILD/fluent-bit-1.8.7/plugins/in_tail/tail_file.c:1024
#14 0x0000000000510cdc in flb_tail_file_remove_all (ctx=0x7fb28f24c340) at /builddir/build/BUILD/fluent-bit-1.8.7/plugins/in_tail/tail_file.c:1070
#15 0x0000000000509dc3 in in_tail_exit (data=0x7fb28f24c340, config=0x7fb290411cc0) at /builddir/build/BUILD/fluent-bit-1.8.7/plugins/in_tail/tail.c:412
#16 0x00000000004b60da in flb_input_instance_exit (ins=0x7fb2904367c0, config=0x7fb290411cc0) at /builddir/build/BUILD/fluent-bit-1.8.7/src/flb_input.c:641
#17 0x00000000004b6142 in flb_input_exit_all (config=0x7fb290411cc0) at /builddir/build/BUILD/fluent-bit-1.8.7/src/flb_input.c:661
#18 0x00000000004c98cf in flb_engine_shutdown (config=0x7fb290411cc0) at /builddir/build/BUILD/fluent-bit-1.8.7/src/flb_engine.c:800
#19 0x00000000004c968f in flb_engine_start (config=0x7fb290411cc0) at /builddir/build/BUILD/fluent-bit-1.8.7/src/flb_engine.c:724
#20 0x00000000004abc10 in flb_lib_worker (data=0x7fb290410000) at /builddir/build/BUILD/fluent-bit-1.8.7/src/flb_lib.c:627
#21 0x00007fb2934d5ea5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007fb2926a99fd in clone () from /lib64/libc.so.6

@leonardo-albertovich
Collaborator

Hi Lionel, we'll get in touch with you soon to troubleshoot this. Thanks for the report!

@LionelCons
Contributor Author

LionelCons commented Oct 7, 2021

It seems that the problem is not specific to 1.8.7. After downgrading to 1.8.6 I got:

[2021/10/07 10:01:28] [engine] caught signal (SIGSEGV)
#0  0x7d323b            in  cio_chunk_is_locked() at lib/chunkio/src/cio_chunk.c:343
#1  0x4e9aa1            in  input_chunk_get() at src/flb_input_chunk.c:700
#2  0x4ea080            in  flb_input_chunk_append_raw() at src/flb_input_chunk.c:933
#3  0x50ef26            in  process_content() at plugins/in_tail/tail_file.c:476
#4  0x510fea            in  flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1197
#5  0x5095da            in  in_tail_collect_event() at plugins/in_tail/tail.c:261
#6  0x50a78b            in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#7  0x4b6fce            in  flb_input_collector_fd() at src/flb_input.c:1097
#8  0x4c932e            in  flb_engine_handle_event() at src/flb_engine.c:411
#9  0x4c932e            in  flb_engine_start() at src/flb_engine.c:686
#10 0x4abb6f            in  flb_lib_worker() at src/flb_lib.c:627
#11 0x7fa5aed15ea4      in  ???() at ???:0
#12 0x7fa5adee99fc      in  ???() at ???:0
#13 0xffffffffffffffff  in  ???() at ???:0

@ioquatix

We just experienced this in production after deploying an update that otherwise passed pre-production. It seems like load and/or message format might play a part in this. We are still investigating, but if anything obvious comes out of our investigation I'll post more details here.

@0x022b
Contributor

0x022b commented Oct 13, 2021

While I was testing different versions because of another bug, the earliest version in which I saw a SIGSEGV was 1.8.2, triggered after pressing Ctrl-C. Unfortunately, I only have the following small log snippet saved. This happened on Windows Server 2019.

[2021/10/13 09:35:43] [engine] caught signal (SIGINT)
[2021/10/13 09:35:43] [ info] [input] pausing tail.0
[2021/10/13 09:35:43] [ info] [input:tail:tail.0] flushing pending multiline data...
[2021/10/13 09:35:43] [debug] [task] created task=000001EF4F167D50 id=2 OK
[2021/10/13 09:35:43] [ warn] [engine] service will stop in 5 seconds
[2021/10/13 09:35:43] [debug] [upstream] connection #704 failed to redacted.blob.core.windows.net:443
[2021/10/13 09:35:43] [error] [output:azure_blob:azure_blob.0] cannot create upstream connection for container check
[2021/10/13 09:35:43] [debug] [out coro] cb_destroy coro_id=2
[2021/10/13 09:35:43] [debug] [retry] new retry created for task_id=2 attempts=1
[2021/10/13 09:35:43] [ warn] [engine] failed to flush chunk '3852-1634117743.568243100.flb', retry in 6 seconds: task_id=2, input=tail.0 > output=azure_blob.0 (out_id=0)
[2021/10/13 09:35:47] [debug] [upstream] connection #704 failed to redacted.blob.core.windows.net:443
[2021/10/13 09:35:47] [error] [output:azure_blob:azure_blob.0] cannot create upstream connection for container check
[2021/10/13 09:35:47] [debug] [out coro] cb_destroy coro_id=3
[2021/10/13 09:35:47] [debug] [task] task_id=0 reached retry-attempts limit 1/1
[2021/10/13 09:35:47] [ warn] [engine] chunk '3852-1634117732.537421900.flb' cannot be retried: task_id=0, input=tail.0 > output=azure_blob.0
[2021/10/13 09:35:47] [debug] [task] destroy task=000001EF4F167570 (task_id=0)
[2021/10/13 09:35:48] [ info] [engine] service stopped
[2021/10/13 09:35:48] [debug] [input:tail:tail.0] inode=281474976710698 removing file name D:\app\restart.log
[2021/10/13 09:35:48] [debug] [input:tail:tail.0] inode=281474976710697 removing file name D:\app\service.log
[2021/10/13 09:35:48] [debug] [input:tail:tail.0] inode=1970324836974642 removing file name D:\app\service.log
[2021/10/13 09:35:48] [debug] [task] destroy task=000001EF4F167690 (task_id=1)
[2021/10/13 09:35:48] [debug] [retry] task retry=000001EF4F2053F0, invalidated from the scheduler
[2021/10/13 09:35:48] [debug] [task] destroy task=000001EF4F167D50 (task_id=2)
[2021/10/13 09:35:48] [engine] caught signal (SIGSEGV)

If the log format has something to do with this, I might be able to provide a redacted log example if necessary.

@edsiper
Member

edsiper commented Oct 18, 2021

It looks like there is some corruption; that's why the stack traces might look different.

Do you have access to the specific file that was being processed when the crash occurred?

@ioquatix

We wondered whether some issue was causing memory exhaustion. In our case, we noticed that the logs were not ending up in Splunk, so we wonder whether they were being buffered and eventually exceeded the available memory.

@LionelCons
Contributor Author

Looking at a recent crash, I see 188 leftover files in the storage.path directory. This is a lot.

Here is the gdb backtrace from that machine:

(gdb) bt
#0  0x00007fc1728807c1 in do_lookup_x () from /lib64/ld-linux-x86-64.so.2
#1  0x00007fc17288109f in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
#2  0x00007fc172885dee in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#3  0x00007fc17288d9ea in _dl_runtime_resolve_xsave () from /lib64/ld-linux-x86-64.so.2
#4  0x0000000000440c24 in flb_signal_handler (signal=11) at /builddir/build/BUILD/fluent-bit-1.8.6/src/fluent-bit.c:530
#5  <signal handler called>
#6  0x00007fc1728807c1 in do_lookup_x () from /lib64/ld-linux-x86-64.so.2
#7  0x00007fc17288109f in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
#8  0x00007fc172885dee in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#9  0x00007fc17288d9ea in _dl_runtime_resolve_xsave () from /lib64/ld-linux-x86-64.so.2
#10 0x000000000067034a in lj_cf_os_time ()
#11 0x000000000063b305 in lj_BC_FUNCC ()
#12 0x0000000000629c12 in lua_pcall ()
#13 0x000000000059700b in cb_lua_filter (data=0x7fc172a95020, bytes=326, tag=0x7fc16e400930 "collectd", tag_len=8, out_buf=0x7fc16f1fce80, out_bytes=0x7fc16f1fce70, 
    f_ins=0x7fc16f643ac0, filter_context=0x7fc16e449480, config=0x7fc16f611700) at /builddir/build/BUILD/fluent-bit-1.8.6/plugins/filter_lua/lua.c:499
#14 0x00000000004b7531 in flb_filter_do (ic=0x7fc16e416000, data=0x7fc16d9ac000, bytes=192, tag=0x7fc16e4000f0 "collectd", tag_len=8, config=0x7fc16f611700)
    at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_filter.c:124
#15 0x00000000004ea32c in flb_input_chunk_append_raw (in=0x7fc16f635800, tag=0x7fc16e4000f0 "collectd", tag_len=8, buf=0x7fc16d9ac000, buf_size=192)
    at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_input_chunk.c:992
#16 0x000000000050cf8a in file_pending_flush (ctx=0x7fc16e44c380, file=0x7fc16e42e300, now=1634617160)
    at /builddir/build/BUILD/fluent-bit-1.8.6/plugins/in_tail/tail_multiline.c:518
#17 0x000000000050d135 in flb_tail_mult_pending_flush (ins=0x7fc16f635800, config=0x7fc16f611700, context=0x7fc16e44c380)
    at /builddir/build/BUILD/fluent-bit-1.8.6/plugins/in_tail/tail_multiline.c:568
#18 0x00000000004b6fcf in flb_input_collector_fd (fd=30, config=0x7fc16f611700) at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_input.c:1097
#19 0x00000000004c932f in flb_engine_handle_event (config=0x7fc16f611700, mask=1, fd=30) at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_engine.c:411
#20 flb_engine_start (config=0x7fc16f611700) at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_engine.c:686
#21 0x00000000004abb70 in flb_lib_worker (data=0x7fc16f610000) at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_lib.c:627
#22 0x00007fc172662ea5 in start_thread () from /lib64/libpthread.so.0
#23 0x00007fc1718369fd in clone () from /lib64/libc.so.6

We see collectd as the tag, so the file being processed was probably /var/log/collectd.log.

Looking at another recent crash, I see a different tag corresponding to a different file being processed.

@LionelCons
Contributor Author

In case it helps, here is the second backtrace:

(gdb) bt
#0  0x00007f61ff732387 in raise () from /lib64/libc.so.6
#1  0x00007f61ff733a78 in abort () from /lib64/libc.so.6
#2  0x0000000000440dba in flb_signal_handler (signal=11) at /builddir/build/BUILD/fluent-bit-1.8.6/src/fluent-bit.c:561
#3  <signal handler called>
#4  0x00000000007d323b in cio_chunk_is_locked (ch=0x0) at /builddir/build/BUILD/fluent-bit-1.8.6/lib/chunkio/src/cio_chunk.c:343
#5  0x00000000004e9aa2 in input_chunk_get (in=0x7f61fd6360c0, tag=0x7f61fc4001d0 "syslog.messages", tag_len=15, chunk_size=142, set_down=0x7f61fd1fcc0c)
    at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_input_chunk.c:700
#6  0x00000000004ea081 in flb_input_chunk_append_raw (in=0x7f61fd6360c0, tag=0x7f61fc4001d0 "syslog.messages", tag_len=15, buf=0x7f61fc418000, buf_size=142)
    at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_input_chunk.c:933
#7  0x000000000050ef27 in process_content (file=0x7f61fc482480, bytes=0x7f61fd1fce78) at /builddir/build/BUILD/fluent-bit-1.8.6/plugins/in_tail/tail_file.c:476
#8  0x0000000000510feb in flb_tail_file_chunk (file=0x7f61fc482480) at /builddir/build/BUILD/fluent-bit-1.8.6/plugins/in_tail/tail_file.c:1197
#9  0x00000000005095db in in_tail_collect_event (file=0x7f61fc482480, config=0x7f61fd611340) at /builddir/build/BUILD/fluent-bit-1.8.6/plugins/in_tail/tail.c:261
#10 0x000000000050a78c in tail_fs_event (ins=0x7f61fd6360c0, config=0x7f61fd611340, in_context=0x7f61fc45bc40)
    at /builddir/build/BUILD/fluent-bit-1.8.6/plugins/in_tail/tail_fs_inotify.c:268
#11 0x00000000004b6fcf in flb_input_collector_fd (fd=67, config=0x7f61fd611340) at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_input.c:1097
#12 0x00000000004c932f in flb_engine_handle_event (config=0x7f61fd611340, mask=1, fd=67) at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_engine.c:411
#13 flb_engine_start (config=0x7f61fd611340) at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_engine.c:686
#14 0x00000000004abb70 in flb_lib_worker (data=0x7f61fd610000) at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_lib.c:627
#15 0x00007f6200626ea5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f61ff7fa9fd in clone () from /lib64/libc.so.6

The syslog.messages tag corresponds to /var/log/messages. Note that Lua does not appear in this backtrace...

@LionelCons
Contributor Author

FWIW, here is another backtrace, almost identical to the previous one but with yet another file (this time /var/log/audit/audit.log).

Note that the chunk is NULL in cio_chunk_is_locked.

(gdb) bt
#0  0x00007f66c9df4387 in raise () from /lib64/libc.so.6
#1  0x00007f66c9df5a78 in abort () from /lib64/libc.so.6
#2  0x0000000000440dba in flb_signal_handler (signal=11) at /builddir/build/BUILD/fluent-bit-1.8.6/src/fluent-bit.c:561
#3  <signal handler called>
#4  0x00000000007d323b in cio_chunk_is_locked (ch=0x0) at /builddir/build/BUILD/fluent-bit-1.8.6/lib/chunkio/src/cio_chunk.c:343
#5  0x00000000004e9aa2 in input_chunk_get (in=0x7f66c7c35640, tag=0x7f66c6a55060 "audit", tag_len=5, chunk_size=869, set_down=0x7f66c77fcc0c) at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_input_chunk.c:700
#6  0x00000000004ea081 in flb_input_chunk_append_raw (in=0x7f66c7c35640, tag=0x7f66c6a55060 "audit", tag_len=5, buf=0x7f66c6a23000, buf_size=869) at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_input_chunk.c:933
#7  0x000000000050ef27 in process_content (file=0x7f66c6a2e180, bytes=0x7f66c77fce78) at /builddir/build/BUILD/fluent-bit-1.8.6/plugins/in_tail/tail_file.c:476
#8  0x0000000000510feb in flb_tail_file_chunk (file=0x7f66c6a2e180) at /builddir/build/BUILD/fluent-bit-1.8.6/plugins/in_tail/tail_file.c:1197
#9  0x00000000005095db in in_tail_collect_event (file=0x7f66c6a2e180, config=0x7f66c7c116c0) at /builddir/build/BUILD/fluent-bit-1.8.6/plugins/in_tail/tail.c:261
#10 0x000000000050a78c in tail_fs_event (ins=0x7f66c7c35640, config=0x7f66c7c116c0, in_context=0x7f66c6a4b1c0) at /builddir/build/BUILD/fluent-bit-1.8.6/plugins/in_tail/tail_fs_inotify.c:268
#11 0x00000000004b6fcf in flb_input_collector_fd (fd=26, config=0x7f66c7c116c0) at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_input.c:1097
#12 0x00000000004c932f in flb_engine_handle_event (config=0x7f66c7c116c0, mask=1, fd=26) at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_engine.c:411
#13 flb_engine_start (config=0x7f66c7c116c0) at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_engine.c:686
#14 0x00000000004abb70 in flb_lib_worker (data=0x7f66c7c10000) at /builddir/build/BUILD/fluent-bit-1.8.6/src/flb_lib.c:627
#15 0x00007f66cace8ea5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f66c9ebc9fd in clone () from /lib64/libc.so.6
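
To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern (the struct and function names are illustrative assumptions, not the real chunkio code): cio_chunk_is_locked() dereferences the chunk it is given, so whenever the lookup in input_chunk_get() ends up with a NULL chunk, an unguarded check crashes exactly like the backtrace above.

/* Illustrative sketch only; names and fields are assumptions, not the
 * actual chunkio definitions. */
#include <stdio.h>
#include <stddef.h>

struct chunk {
    int lock;
};

/* Unguarded check: dereferences ch, so ch == NULL crashes, which is the
 * pattern in the backtrace (cio_chunk_is_locked() called with ch=0x0). */
static int chunk_is_locked(struct chunk *ch)
{
    return ch->lock;
}

/* Defensive variant: a lookup that can fail has to be checked somewhere,
 * either by the caller or by tolerating NULL here. */
static int chunk_is_locked_safe(struct chunk *ch)
{
    return ch != NULL && ch->lock;
}

int main(void)
{
    struct chunk *missing = NULL;   /* e.g. chunk lookup/creation failed */

    (void) chunk_is_locked;         /* referenced only to silence -Wunused */
    printf("safe check: %d\n", chunk_is_locked_safe(missing));
    /* chunk_is_locked(missing) would SIGSEGV here, like the crash above. */
    return 0;
}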

@LionelCons
Contributor Author

In case it helps to find the culprit, I've noticed that SELinux records flb-pipeline aborting with a memory violation. For instance:

type=ANOM_ABEND msg=audit(1635242535.958:4790): auid=429967295 uid=0 gid=0 ses=429496295 subj=system_u:system_r:unconfined_service_t:s0 pid=6495 comm="flb-pipeline" reason="memory violation" sig=6

@LionelCons
Contributor Author

FWIW, I have downgraded once more (this time to 1.8.5) and I still see these crashes.

So the problem is present at least in 1.8.5, 1.8.6 and 1.8.7...

@nokute78
Collaborator

nokute78 commented Nov 8, 2021

#4164 (comment) is similar to aws/aws-for-fluent-bit#255 (comment) and #4137.
flb_tail_file_remove -> msgpack_sbuffer_destroy -> free (SIGSEGV)

The patch #4187 may fix this issue. It has been merged and is included in v1.8.9:
https://fluentbit.io/announcements/v1.8.9/

Avoid double free for multiline msgpack buffer
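
To illustrate the pattern with a generic msgpack-c sketch (this is not the actual #4187 patch): msgpack_sbuffer_destroy() only calls free() on sbuf->data and leaves the pointer dangling, so destroying the same buffer a second time frees the same pointer twice unless the buffer is re-initialized in between.

#include <msgpack.h>

int main(void)
{
    msgpack_sbuffer sbuf;
    msgpack_packer pck;

    msgpack_sbuffer_init(&sbuf);
    msgpack_packer_init(&pck, &sbuf, msgpack_sbuffer_write);
    msgpack_pack_int(&pck, 42);      /* sbuf.data now points to heap memory */

    msgpack_sbuffer_destroy(&sbuf);  /* free(sbuf.data); pointer left dangling */

    /* Calling msgpack_sbuffer_destroy(&sbuf) again right now would free the
     * same pointer twice, which is the crash pattern above.  Re-initializing
     * the buffer makes a later destroy harmless: */
    msgpack_sbuffer_init(&sbuf);     /* data = NULL, size = alloc = 0 */
    msgpack_sbuffer_destroy(&sbuf);  /* free(NULL) is a no-op */

    return 0;
}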

@LionelCons
Contributor Author

Maybe this indeed fixes the problem, but I cannot test 1.8.9 on our production machines because of #4255...

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Dec 10, 2021
@LionelCons
Contributor Author

FWIW, none of the nodes that I have upgraded to 1.8.10 have crashed so far. So it seems that this problem has indeed been fixed (for my use case).

@edsiper
Member

edsiper commented Dec 10, 2021

@LionelCons thanks for the update. Please close this ticket if you think it is safe to do so.

@LionelCons
Contributor Author

OK, closing then...

@LionelCons
Contributor Author

Well, I have upgraded to 1.8.11 and I now see the problem again. So either the problem was not fixed or there is a new problem that looks very similar.

Here is what a 1.8.11 stack trace looks like:

[2021/12/15 10:01:25] [engine] caught signal (SIGSEGV)
#0  0x7f88d1c5ddbb      in  ???() at ???:0
#1  0x7da87f            in  cio_file_write() at lib/chunkio/src/cio_file.c:881
#2  0x7d6de2            in  cio_chunk_write() at lib/chunkio/src/cio_chunk.c:183
#3  0x4e973a            in  flb_input_chunk_write() at src/flb_input_chunk.c:242
#4  0x4eb83a            in  flb_input_chunk_append_raw() at src/flb_input_chunk.c:1307
#5  0x5106f6            in  process_content() at plugins/in_tail/tail_file.c:476
#6  0x51280c            in  flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1210
#7  0x50ad83            in  in_tail_collect_event() at plugins/in_tail/tail.c:261
#8  0x50bf34            in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#9  0x4b750d            in  flb_input_collector_fd() at src/flb_input.c:1108
#10 0x4c9dd1            in  flb_engine_handle_event() at src/flb_engine.c:411
#11 0x4c9dd1            in  flb_engine_start() at src/flb_engine.c:696
#12 0x4ac00f            in  flb_lib_worker() at src/flb_lib.c:627
#13 0x7f88d2a33ea4      in  ???() at ???:0
#14 0x7f88d1c079fc      in  ???() at ???:0
#15 0xffffffffffffffff  in  ???() at ???:0

@LionelCons LionelCons reopened this Dec 15, 2021
@edsiper
Member

edsiper commented Dec 15, 2021

@LionelCons since the original issue was hit in the Lua filter, this looks different, e.g.:

#0  0x64163b            in  ???() at ???:0
#1  0x6293ad            in  ???() at ???:0
#2  0x5965be            in  lua_pushmsgpack() at plugins/filter_lua/lua.c:80
#3  0x596738            in  lua_pushmsgpack() at plugins/filter_lua/lua.c:110
#4  0x59774b            in  cb_lua_filter() at plugins/filter_lua/lua.c:544
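
For context, lua_pushmsgpack() converts each msgpack object into a Lua value and recurses into arrays and maps, which is why a corrupted record surfaces as a crash inside the Lua push calls. A simplified, self-contained sketch of that conversion (not the exact plugin code; it skips several types the real function handles):

#include <stdint.h>
#include <lua.h>
#include <lauxlib.h>
#include <msgpack.h>

/* Push one msgpack object as a Lua value, recursing into containers. */
static void push_msgpack(lua_State *L, msgpack_object *o)
{
    uint32_t i;

    switch (o->type) {
    case MSGPACK_OBJECT_BOOLEAN:
        lua_pushboolean(L, o->via.boolean);
        break;
    case MSGPACK_OBJECT_POSITIVE_INTEGER:
        lua_pushinteger(L, (lua_Integer) o->via.u64);
        break;
    case MSGPACK_OBJECT_FLOAT64:
        lua_pushnumber(L, o->via.f64);
        break;
    case MSGPACK_OBJECT_STR:
        lua_pushlstring(L, o->via.str.ptr, o->via.str.size);
        break;
    case MSGPACK_OBJECT_ARRAY:
        lua_createtable(L, (int) o->via.array.size, 0);
        for (i = 0; i < o->via.array.size; i++) {
            push_msgpack(L, &o->via.array.ptr[i]);    /* recursion */
            lua_rawseti(L, -2, (int) i + 1);
        }
        break;
    case MSGPACK_OBJECT_MAP:
        lua_createtable(L, 0, (int) o->via.map.size);
        for (i = 0; i < o->via.map.size; i++) {
            push_msgpack(L, &o->via.map.ptr[i].key);  /* key */
            push_msgpack(L, &o->via.map.ptr[i].val);  /* value */
            lua_rawset(L, -3);
        }
        break;
    default:
        lua_pushnil(L);
        break;
    }
}

int main(void)
{
    /* Pack {"log": "hello"}, unpack it, and push it onto a Lua stack. */
    msgpack_sbuffer sbuf;
    msgpack_packer pck;
    msgpack_unpacked result;
    size_t off = 0;

    msgpack_sbuffer_init(&sbuf);
    msgpack_packer_init(&pck, &sbuf, msgpack_sbuffer_write);
    msgpack_pack_map(&pck, 1);
    msgpack_pack_str(&pck, 3);
    msgpack_pack_str_body(&pck, "log", 3);
    msgpack_pack_str(&pck, 5);
    msgpack_pack_str_body(&pck, "hello", 5);

    msgpack_unpacked_init(&result);
    msgpack_unpack_next(&result, sbuf.data, sbuf.size, &off);

    lua_State *L = luaL_newstate();
    push_msgpack(L, &result.data);    /* leaves a Lua table on the stack */

    lua_close(L);
    msgpack_unpacked_destroy(&result);
    msgpack_sbuffer_destroy(&sbuf);
    return 0;
}

If the sizes or pointers inside the msgpack object are corrupted, the loops above walk garbage, which matches the frames dying inside lua_pushmsgpack().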

Do you have the associated full config and log file?

@github-actions github-actions bot removed the Stale label Dec 16, 2021
@LionelCons
Contributor Author

FWIW, downgrading to 1.8.10 makes the problem go away, so maybe the recent changes should be looked at.

Regarding the configuration files, they cannot be posted here but I can send them to you by email.

@nokute78
Collaborator

Note: #4164 (comment) is similar to #3412 (that issue is closed but not fixed).

@akshaal

akshaal commented Jan 18, 2022

Looks like the same or a similar issue here, on 1.8.11...

[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546412.804429731.flb' cannot be retried: task_id=15, input=systemd.0 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546412.803995427.flb' cannot be retried: task_id=12, input=systemd.0 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546414.157072111.flb' cannot be retried: task_id=32, input=emitter_for_rewrite_tag.22 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546416.990513928.flb' cannot be retried: task_id=18, input=emitter_for_rewrite_tag.22 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546413.804195219.flb' cannot be retried: task_id=27, input=systemd.0 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546424.241078338.flb' cannot be retried: task_id=78, input=emitter_for_rewrite_tag.22 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546413.907724005.flb' cannot be retried: task_id=28, input=systemd.0 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546411.804432028.flb' cannot be retried: task_id=10, input=emitter_for_rewrite_tag.55 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546414.693826950.flb' cannot be retried: task_id=39, input=systemd.0 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546418.349529903.flb' cannot be retried: task_id=41, input=systemd.0 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546424.247035033.flb' cannot be retried: task_id=2, input=systemd.0 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546424.247102008.flb' cannot be retried: task_id=71, input=systemd.0 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546424.247764831.flb' cannot be retried: task_id=76, input=systemd.0 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546424.349777584.flb' cannot be retried: task_id=83, input=systemd.0 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546424.350099818.flb' cannot be retried: task_id=85, input=systemd.0 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546424.350505829.flb' cannot be retried: task_id=87, input=systemd.0 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546434.350428972.flb' cannot be retried: task_id=97, input=systemd.0 > output=loki.0
[2022/01/18 23:54:06] [ warn] [engine] chunk '2886010-1642546434.597958950.flb' cannot be retried: task_id=102, input=systemd.0 > output=loki.0
[2022/01/18 23:54:08] [engine] caught signal (SIGSEGV)
#0  0x55e004520f3b      in  extent_arena_ind_get() at lib/jemalloc-5.2.1/include/jemalloc/internal/extent_inlines.h:40
#1  0x55e004520f3b      in  je_tcache_bin_flush_small() at lib/jemalloc-5.2.1/src/tcache.c:159
#2  0x55e0045226be      in  je_tcache_event_hard() at lib/jemalloc-5.2.1/src/tcache.c:55
#3  0x55e0044b6ed4      in  arena_malloc() at lib/jemalloc-5.2.1/include/jemalloc/internal/tcache_inlines.h:37
#4  0x55e0044b6ed4      in  iallocztm() at lib/jemalloc-5.2.1/include/jemalloc/internal/jemalloc_internal_inlines_c.h:53
#5  0x55e0044b6ed4      in  imalloc_no_sample() at lib/jemalloc-5.2.1/src/jemalloc.c:1949
#6  0x55e0044b6ed4      in  imalloc_body() at lib/jemalloc-5.2.1/src/jemalloc.c:2149
#7  0x55e0044b6ed4      in  imalloc() at lib/jemalloc-5.2.1/src/jemalloc.c:2258
#8  0x55e0044b6ed4      in  calloc() at lib/jemalloc-5.2.1/src/jemalloc.c:2494
#9  0x55e004855623      in  mbedtls_mpi_grow() at lib/mbedtls-2.27.0/library/bignum.c:125
#10 0x55e004855970      in  mbedtls_mpi_copy() at lib/mbedtls-2.27.0/library/bignum.c:243
#11 0x55e004858705      in  mbedtls_mpi_mul_mpi() at lib/mbedtls-2.27.0/library/bignum.c:1725
#12 0x55e004866036      in  mbedtls_mpi_mul_mod() at lib/mbedtls-2.27.0/library/ecp.c:1235
#13 0x55e004868dee      in  ecp_double_add_mxz() at lib/mbedtls-2.27.0/library/ecp.c:2538
#14 0x55e004869394      in  ecp_mul_mxz() at lib/mbedtls-2.27.0/library/ecp.c:2619
#15 0x55e0048695a1      in  mbedtls_ecp_mul_restartable() at lib/mbedtls-2.27.0/library/ecp.c:2700
#16 0x55e0048645a1      in  ecdh_compute_shared_restartable() at lib/mbedtls-2.27.0/library/ecdh.c:124
#17 0x55e004864659      in  mbedtls_ecdh_compute_shared() at lib/mbedtls-2.27.0/library/ecdh.c:153
#18 0x55e004864e7a      in  ecdh_calc_secret_internal() at lib/mbedtls-2.27.0/library/ecdh.c:673
#19 0x55e004864f8a      in  mbedtls_ecdh_calc_secret() at lib/mbedtls-2.27.0/library/ecdh.c:709
#20 0x55e004841ffc      in  ssl_write_client_key_exchange() at lib/mbedtls-2.27.0/library/ssl_cli.c:3866
#21 0x55e00484321d      in  mbedtls_ssl_handshake_client_step() at lib/mbedtls-2.27.0/library/ssl_cli.c:4495
#22 0x55e004838b42      in  mbedtls_ssl_handshake_step() at lib/mbedtls-2.27.0/library/ssl_tls.c:5778
#23 0x55e004838c3e      in  mbedtls_ssl_handshake() at lib/mbedtls-2.27.0/library/ssl_tls.c:5815
#24 0x55e0045722f7      in  tls_net_handshake() at src/tls/mbedtls.c:376
#25 0x55e004572a62      in  flb_tls_session_create() at src/tls/flb_tls.c:318
#26 0x55e00457ea7c      in  flb_io_net_connect() at src/flb_io.c:110
#27 0x55e00455e556      in  create_conn() at src/flb_upstream.c:542
#28 0x55e00455eaeb      in  flb_upstream_conn_get() at src/flb_upstream.c:687
#29 0x55e004607997      in  cb_loki_flush() at plugins/out_loki/loki.c:1184
#30 0x55e0045438f4      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#31 0x55e004ac47aa      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#32 0x7f8b4a1bcaff      in  ???() at ???:0
Aborted (core dumped)

@ioquatix

Looking at co_init... and where it's dying

https://github.com/jemalloc/jemalloc/blob/5.2.1/include/jemalloc/internal/extent_inlines.h#L40

I can't help wondering whether some TLS state is being clobbered or not copied correctly.
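
For what it's worth, here is a minimal sketch (assuming the plain libco API that flb_libco bundles) showing that a coroutine created with co_create() runs on its own stack but on the same OS thread, so thread-local data such as jemalloc's tcache is shared with the engine thread rather than copied; if that state ever gets clobbered from a coroutine, any later allocation on that thread can blow up, which could explain the varied jemalloc frames.

#include <libco.h>
#include <stdio.h>

static __thread int tls_counter = 0;   /* thread-local, like jemalloc's tcache */
static cothread_t main_co;

static void coro_entry(void)
{
    /* Different stack, same OS thread: the very same thread-local variable
     * is visible and mutated here. */
    tls_counter++;
    co_switch(main_co);
}

int main(void)
{
    main_co = co_active();
    cothread_t coro = co_create(64 * 1024, coro_entry);

    co_switch(coro);                              /* run the coroutine once */
    printf("tls_counter = %d\n", tls_counter);    /* prints 1 */

    co_delete(coro);
    return 0;
}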

@LionelCons
Contributor Author

FWIW, I have tried with 1.8.12 and the problem is still present...

@bharathiram

I am able to reproduce the issue in 1.9.0 as well.

@LionelCons
Contributor Author

@bharathiram Could you please describe how you can reproduce the problem?

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Jun 15, 2022
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@vaibhav-dhingra

I am also facing a similar issue. My Fluent Bit version is 1.8.13. I get this error when I try to terminate the Fluent Bit container using the 'kill' command in a bash script.
It fails with a SIGSEGV after receiving a SIGTERM. I also noticed that some of my logs were not flushed by the Fluent Bit container. Any help would be appreciated!

@dosten

dosten commented Dec 27, 2022

Hi, does anyone know if this was fixed in some fluent-bit version? We are facing this issue on 1.8.11.

@vaibhav-dhingra

Hi, does anyone know if this was fixed in some fluent-bit version? We are facing this issue on 1.8.11.

@dosten I don't know if the issue was fixed, but we moved to v2.0.3 and haven't faced this issue since.
