Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

frequent crash on tail plugin with multiline on and file rotation #255

Closed
wubn2000 opened this issue Oct 11, 2021 · 12 comments
Closed

frequent crash on tail plugin with multiline on and file rotation #255

wubn2000 opened this issue Oct 11, 2021 · 12 comments
Assignees
Labels
bug Something isn't working pending release

Comments

@wubn2000
Copy link

Describe the bug

When turn on the multiline parser, ( the one I used is
[PARSER]
Name multiline_stats
Format regex
Regex /^(?\d{10,}.*)/
), if monitoring on a rotate log file, it always crashed due to the file rotation (after the rotated file got deleted), with info like:

*** Error in `./fluent-bit': double free or corruption (out): 0x00007f66c2f65e80 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81329)[0x7f66fea7e329]
./fluent-bit[0x4b73ea]
./fluent-bit(flb_tail_file_remove+0x13e)[0x4ba469]
./fluent-bit(flb_tail_file_purge+0x293)[0x4bc068]
./fluent-bit(flb_input_collector_fd+0x3b2)[0x4614bd]
./fluent-bit(flb_engine_start+0x707)[0x472660]
./fluent-bit[0x456cf5]

Without turning on multiline parser, it won't crash. But with it on, it always crash during the file deletion.

To Reproduce

Rubular link if applicable:
Example log message if applicable:
{"log":"YOUR LOG MESSAGE HERE","stream":"stdout","time":"2018-06-11T14:37:30.681701731Z"}
Note here is one message, with the last field as a multiline field:

1633801021538,"customer2788",13,2788,1623239523,"java.lang.RuntimeException: Something has gone wrong, aborting!
at com.myproject.module.MyProject.badMethod(MyProject.java:22)
at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18)
at com.myproject.module.MyProject.anotherMethod(MyProject.java:14)
at com.myproject.module.MyProject.someMethod(MyProject.java:10)
at com.myproject.module.MyProject.main(MyProject.java:6)","Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam. Ut enim ad minim ven."

Steps to reproduce the problem:
FLuentbit is monitoring only stats-0.csv. I keep appending this message to a file stats-0.csv, when the size reaches a threshold (can be 500k), do file rotation by "mv stats-0.csv stats-0-timestamp1.csv", then empty stats-0.csv and keep appending the message again. After a while, some oldest stats-0-timestamp1.csv will be deleted from the directory. The crash always happens if the old file stats-0-timestamp1.csv got deleted.

Expected behavior

The fluentbit should not crash.

Screenshots

Your Environment

Version used: 1.8.8
Configuration:
[INPUT]
Name tail
Path /data/test/stats-0.csv
Path_Key file_name
Tag stats
Buffer_Chunk_Size 32KB
Buffer_Max_Size 128KB
Skip_Long_Lines On
Multiline On
Parser_Firstline multiline_stats
Multiline_Flush 5
Rotate_Wait 35
Refresh_Interval 1
Read_from_head true
Mem_Buf_Limit 500MB
storage.type filesystem
DB /data/tail-0.db
DB.locking true
[SERVICE]
Flush 1
Daemon off
Log_Level info
Parsers_File parser.example.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_PORT 2020
storage.path /data/storage
storage.backlog.mem_limit 500M
storage.max_chunks_up 200
@include inputs.example.conf

[OUTPUT]
Name stdout

Environment name and version (e.g. Kubernetes? What version?): ec2 machine
Server type and version:
Operating System and version:
Filters and plugins:
Additional context

@hossain-rayhan
Copy link
Contributor

Upstream issue: fluent/fluent-bit#4177
@wubn2000 thanks for opening here also.

@wubn2000
Copy link
Author

I tried using valgrind, and found that msgpack_sbuffer_destroy was called twice:
one is from tail_multiline.c:491, and second time is from tail_file.c:1024 when the rotated file got deleted, so that crashed the program, not sure if there is a quick fix for this? But I believe there are some other similar cases related to this, this is just one reason it crashed.

@PettitWesley
Copy link
Contributor

@wubn2000 please post your valgrind output logs

@PettitWesley
Copy link
Contributor

@wubn2000 please also attach your config files and example log files as separate files in the issue, or please use markdown formatting to make them clear like:

all file content goes here

Otherwise it's just a mess of text which is hard to read. Also I think I copied your config wrong, since I am getting:

[2021/10/12 00:14:46] [error] [parser:multiline_stats] Invalid regex pattern /^(?\d{10,}.*)/

@wubn2000
Copy link
Author

[input:tail:tail.0] purge: monitored file has been deleted: /data/testdata/v000/stats-0-10-11-21-21-24-03.csv (deleted)
[2021/10/11 23:13:16] [debug] [input:tail:tail.0] db: file deleted from database: /data/testdata/v000/stats-0-10-11-21-21-24-03.csv (deleted)
[2021/10/11 23:13:16] [debug] [input:tail:tail.0] inode=1074662979 removing file name /data/testdata/v000/stats-0-10-11-21-21-24-03.csv (deleted)
==19148== Thread 2 flb-pipeline:
==19148== Invalid free() / delete / delete[] / realloc()
==1914== at 0x4C2CD28: free (vg_replace_malloc.c:530)
==19148== by 0x4A8402: msgpack_sbuffer_destroy (sbuffer.h:41)
==19148== by 0x4AB473: flb_tail_file_remove (tail_file.c:1024)
==19148== by 0x4ACD71: check_purge_deleted_file (tail_file.c:1561)
==19148== by 0x4AD1F6: flb_tail_file_purge (tail_file.c:1633)
==19148== by 0x44FA97: flb_input_collector_fd (flb_input.c:1107)
==19148== by 0x462038: flb_engine_handle_event (flb_engine.c:411)
==19148== by 0x462038: flb_engine_start (flb_engine.c:696)
==19148== by 0x444666: flb_lib_worker (flb_lib.c:627)
==19148== by 0x4E3F40A: start_thread (in /usr/lib64/libpthread-2.26.so)
==19148== by 0x61690BE: clone (in /usr/lib64/libc-2.26.so)
==19148== Address 0xa783430 is 0 bytes inside a block of size 8,192 free'd
==19148== at 0x4C2CD28: free (vg_replace_malloc.c:530)
==19148== by 0x4A67B8: msgpack_sbuffer_destroy (sbuffer.h:41)
==19148== by 0x4A7B92: flb_tail_mult_flush (tail_multiline.c:491)
==19148== by 0x4A7C41: file_pending_flush (tail_multiline.c:516)
==19148== by 0x4A7E22: flb_tail_mult_pending_flush (tail_multiline.c:568)
==19148== by 0x44FA97: flb_input_collector_fd (flb_input.c:1107)
==19148== by 0x462038: flb_engine_handle_event (flb_engine.c:411)
==19148== by 0x462038: flb_engine_start (flb_engine.c:696)
==19148== by 0x444666: flb_lib_worker (flb_lib.c:627)
==19148== by 0x4E3F40A: start_thread (in /usr/lib64/libpthread-2.26.so)
==19148== by 0x61690BE: clone (in /usr/lib64/libc-2.26.so)
==19148== Block was alloc'd at
==19148== at 0x4C2BABC: malloc (vg_replace_malloc.c:298)
==19148== by 0x4C2DC6F: realloc (vg_replace_malloc.c:785)
==19148== by 0x4733BD: msgpack_sbuffer_write (sbuffer.h:81)
==19148== by 0x4730AE: msgpack_pack_map (pack_template.h:745)
==19148== by 0x4738CA: flb_parser_regex_do (flb_parser_regex.c:144)
==19148== by 0x472431: flb_parser_do (flb_parser.c:885)
==19148== by 0x4A73F5: flb_tail_mult_process_content (tail_multiline.c:294)
==19148== by 0x4A9A79: process_content (tail_file.c:428)
==19148== by 0x4ABD52: flb_tail_file_chunk (tail_file.c:1197)
==19148== by 0x4A41AD: in_tail_collect_event (tail.c:261)
==19148== by 0x4A5386: tail_fs_event (tail_fs_inotify.c:268)
==19148== by 0x44FA97: flb_input_collector_fd (flb_input.c:1107)
==19148==

@wubn2000
Copy link
Author

wubn2000 commented Oct 12, 2021

parser.example.conf

[PARSER]
    Name    multiline_stats
    Format  regex
    Regex   /^(?<key>\d{10,}.*)/

conf.conf

[SERVICE]
    Flush     1
    Daemon    off
    Log_Level debug
    Parsers_File parser.example.conf
    HTTP_Server               On
    HTTP_Listen               0.0.0.0
    HTTP_PORT                 2020
    Health_Check On
    HC_Errors_Count 5 
    HC_Retry_Failure_Count 5 
    HC_Period 5 
    storage.path              /data/storage
    storage.backlog.mem_limit 4000M
    storage.max_chunks_up     4000

@INCLUDE inputs.example.conf

[OUTPUT]
    Name            stdout
    Match           stats

inputs.example.conf

[INPUT]
    Name              tail
    Path              /data/testdata/**/stats-0.csv
    Path_Key          file_name
    Tag               stats
    Buffer_Chunk_Size 32KB
    Buffer_Max_Size   128KB
    Skip_Long_Lines   On
    Multiline         On
    Parser_Firstline  multiline_stats
    Multiline_Flush   5
    Rotate_Wait               10
    Refresh_Interval 1
    Read_from_head    true
    Mem_Buf_Limit     4000MB
    storage.type      filesystem
    DB                /data/tail-0.db
    DB.locking        true

@wubn2000
Copy link
Author

wubn2000 commented Oct 12, 2021

one sample line of message (it has multiline fields, but to crash the fluentbit, you can use single line here):

1633801021538,"customer2788",13,2788,1623239523,"java.lang.RuntimeException: Something has gone wrong, aborting! 
 at com.myproject.module.MyProject.badMethod(MyProject.java:22) 
 at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18) 
  at com.myproject.module.MyProject.anotherMethod(MyProject.java:14) 
 at com.myproject.module.MyProject.someMethod(MyProject.java:10) 
 at com.myproject.module.MyProject.main(MyProject.java:6)","ddddd"

@PettitWesley
Copy link
Contributor

@wubn2000 Thanks, I was able to reproduce it with a simple script:

for i in {0..9}
do
     cat log.txt > stats-0.csv
     sleep 1
done
echo "rotating log file"
mv stats-0.csv stats-0-timestamp1.csv
for i in {0..9}
do
     cat log.txt > stats-0.csv
     sleep 1
done
echo "removing rotated log file"
rm stats-0-timestamp1.csv
for i in {0..9}
do
     cat log.txt > stats-0.csv
     sleep 1
done

log.txt is just a file with the log lines in your comment above.

[2021/10/12 00:52:51] [engine] caught signal (SIGSEGV)
#0  0x7f5783d90228      in  ???() at ???:0
#1  0x4ab302            in  msgpack_sbuffer_destroy() at lib/msgpack-c/include/msgpack/sbuffer.h:41
#2  0x4ae373            in  flb_tail_file_remove() at plugins/in_tail/tail_file.c:1024
#3  0x4a81ba            in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:234
#4  0x4500fa            in  flb_input_collector_fd() at src/flb_input.c:1107
#5  0x462664            in  flb_engine_handle_event() at src/flb_engine.c:411
#6  0x462664            in  flb_engine_start() at src/flb_engine.c:696
#7  0x444cd6            in  flb_lib_worker() at src/flb_lib.c:627
#8  0x7f57853b840a      in  ???() at ???:0
#9  0x7f5783e020be      in  ???() at ???:0
#10 0xffffffffffffffff  in  ???() at ???:0
Aborted

Valgrind:

==12458== Thread 2 flb-pipeline:
==12458== Invalid free() / delete / delete[] / realloc()
==12458==    at 0x4C2CD28: free (vg_replace_malloc.c:530)
==12458==    by 0x4AB302: msgpack_sbuffer_destroy (sbuffer.h:41)
==12458==    by 0x4AE373: flb_tail_file_remove (tail_file.c:1024)
==12458==    by 0x4A81BA: tail_fs_event (tail_fs_inotify.c:234)
==12458==    by 0x4500FA: flb_input_collector_fd (flb_input.c:1107)
==12458==    by 0x462664: flb_engine_handle_event (flb_engine.c:411)
==12458==    by 0x462664: flb_engine_start (flb_engine.c:696)
==12458==    by 0x444CD6: flb_lib_worker (flb_lib.c:627)
==12458==    by 0x4E3F40A: start_thread (in /usr/lib64/libpthread-2.26.so)
==12458==    by 0x64340BE: clone (in /usr/lib64/libc-2.26.so)
==12458==  Address 0x9c8df50 is 0 bytes inside a block of size 1,920 free'd
==12458==    at 0x4C2CD28: free (vg_replace_malloc.c:530)
==12458==    by 0x4A96B8: msgpack_sbuffer_destroy (sbuffer.h:41)
==12458==    by 0x4AAA92: flb_tail_mult_flush (tail_multiline.c:491)
==12458==    by 0x4AAB41: file_pending_flush (tail_multiline.c:516)
==12458==    by 0x4AAD22: flb_tail_mult_pending_flush (tail_multiline.c:568)
==12458==    by 0x4500FA: flb_input_collector_fd (flb_input.c:1107)
==12458==    by 0x462664: flb_engine_handle_event (flb_engine.c:411)
==12458==    by 0x462664: flb_engine_start (flb_engine.c:696)
==12458==    by 0x444CD6: flb_lib_worker (flb_lib.c:627)
==12458==    by 0x4E3F40A: start_thread (in /usr/lib64/libpthread-2.26.so)
==12458==    by 0x64340BE: clone (in /usr/lib64/libc-2.26.so)
==12458==  Block was alloc'd at
==12458==    at 0x4C2DC25: realloc (vg_replace_malloc.c:785)
==12458==    by 0x4A97B5: msgpack_sbuffer_write (sbuffer.h:81)
==12458==    by 0x4A95D7: msgpack_pack_str_body (pack_template.h:784)
==12458==    by 0x4AA185: flb_tail_mult_append_raw (tail_multiline.c:238)
==12458==    by 0x4AA47E: flb_tail_mult_process_content (tail_multiline.c:348)
==12458==    by 0x4AC979: process_content (tail_file.c:428)
==12458==    by 0x4AEC52: flb_tail_file_chunk (tail_file.c:1197)
==12458==    by 0x4A7130: in_tail_collect_event (tail.c:261)
==12458==    by 0x4A8309: tail_fs_event (tail_fs_inotify.c:268)
==12458==    by 0x4500FA: flb_input_collector_fd (flb_input.c:1107)
==12458==    by 0x462664: flb_engine_handle_event (flb_engine.c:411)
==12458==    by 0x462664: flb_engine_start (flb_engine.c:696)
==12458==    by 0x444CD6: flb_lib_worker (flb_lib.c:627)
==12458==
^C==12458== Invalid free() / delete / delete[] / realloc()
==12458==    at 0x4C2CD28: free (vg_replace_malloc.c:530)
==12458==    by 0x4AB302: msgpack_sbuffer_destroy (sbuffer.h:41)
==12458==    by 0x4AE373: flb_tail_file_remove (tail_file.c:1024)
==12458==    by 0x4AE56B: flb_tail_file_remove_all (tail_file.c:1070)
==12458==    by 0x4A7566: in_tail_exit (tail.c:412)
==12458==    by 0x44F0C3: flb_input_instance_exit (flb_input.c:641)
==12458==    by 0x44F12A: flb_input_exit_all (flb_input.c:661)
==12458==    by 0x462A73: flb_engine_shutdown (flb_engine.c:810)
==12458==    by 0x46286E: flb_engine_start (flb_engine.c:734)
==12458==    by 0x444CD6: flb_lib_worker (flb_lib.c:627)
==12458==    by 0x4E3F40A: start_thread (in /usr/lib64/libpthread-2.26.so)
==12458==    by 0x64340BE: clone (in /usr/lib64/libc-2.26.so)
==12458==  Address 0xa7bdb80 is 0 bytes inside a block of size 1,920 free'd
==12458==    at 0x4C2CD28: free (vg_replace_malloc.c:530)
==12458==    by 0x4A96B8: msgpack_sbuffer_destroy (sbuffer.h:41)
==12458==    by 0x4AAA92: flb_tail_mult_flush (tail_multiline.c:491)
==12458==    by 0x4AAB41: file_pending_flush (tail_multiline.c:516)
==12458==    by 0x4AAD22: flb_tail_mult_pending_flush (tail_multiline.c:568)
==12458==    by 0x4500FA: flb_input_collector_fd (flb_input.c:1107)
==12458==    by 0x462664: flb_engine_handle_event (flb_engine.c:411)
==12458==    by 0x462664: flb_engine_start (flb_engine.c:696)
==12458==    by 0x444CD6: flb_lib_worker (flb_lib.c:627)
==12458==    by 0x4E3F40A: start_thread (in /usr/lib64/libpthread-2.26.so)
==12458==    by 0x64340BE: clone (in /usr/lib64/libc-2.26.so)
==12458==  Block was alloc'd at
==12458==    at 0x4C2DC25: realloc (vg_replace_malloc.c:785)
==12458==    by 0x4A97B5: msgpack_sbuffer_write (sbuffer.h:81)
==12458==    by 0x4A95D7: msgpack_pack_str_body (pack_template.h:784)
==12458==    by 0x4AA185: flb_tail_mult_append_raw (tail_multiline.c:238)
==12458==    by 0x4AA47E: flb_tail_mult_process_content (tail_multiline.c:348)
==12458==    by 0x4AC979: process_content (tail_file.c:428)
==12458==    by 0x4AEC52: flb_tail_file_chunk (tail_file.c:1197)
==12458==    by 0x4A7130: in_tail_collect_event (tail.c:261)
==12458==    by 0x4A8309: tail_fs_event (tail_fs_inotify.c:268)
==12458==    by 0x4500FA: flb_input_collector_fd (flb_input.c:1107)
==12458==    by 0x462664: flb_engine_handle_event (flb_engine.c:411)
==12458==    by 0x462664: flb_engine_start (flb_engine.c:696)
==12458==    by 0x444CD6: flb_lib_worker (flb_lib.c:627)
==12458==

Looks like its double free-ing the msgpack in some case.

@PettitWesley
Copy link
Contributor

@hossain-rayhan Can you own fixing this bug?

@hossain-rayhan hossain-rayhan self-assigned this Oct 12, 2021
@hossain-rayhan
Copy link
Contributor

I was able to reproduce similar behavior using Wesley's script and @wubn2000's config file. I will take a look for details.

==15788== Invalid free() / delete / delete[] / realloc()
==15788==    at 0x4C2CD28: free (vg_replace_malloc.c:530)
==15788==    by 0x4A9243: msgpack_sbuffer_destroy (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4AC2B4: flb_tail_file_remove (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4AC4AC: flb_tail_file_remove_all (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4A54A7: in_tail_exit (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x44EAD3: flb_input_instance_exit (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x44EB3A: flb_input_exit_all (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4628BE: flb_engine_shutdown (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x46267A: flb_engine_start (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4446B6: flb_lib_worker (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4E3F40A: start_thread (pthread_create.c:465)
==15788==    by 0x616940E: clone (clone.S:95)
==15788==  Address 0xb4154a0 is 0 bytes inside a block of size 480 free'd
==15788==    at 0x4C2CD28: free (vg_replace_malloc.c:530)
==15788==    by 0x4A75F9: msgpack_sbuffer_destroy (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4A89D3: flb_tail_mult_flush (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4A8A82: file_pending_flush (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4A8C63: flb_tail_mult_pending_flush (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x44FB0A: flb_input_collector_fd (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x462470: flb_engine_start (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4446B6: flb_lib_worker (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4E3F40A: start_thread (pthread_create.c:465)
==15788==    by 0x616940E: clone (clone.S:95)
==15788==  Block was alloc'd at
==15788==    at 0x4C2DC25: realloc (vg_replace_malloc.c:785)
==15788==    by 0x4A76F6: msgpack_sbuffer_write (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4A7518: msgpack_pack_str_body (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4A80C6: flb_tail_mult_append_raw (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4A83BF: flb_tail_mult_process_content (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4AA8BA: process_content (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4ACB93: flb_tail_file_chunk (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4A5071: in_tail_collect_event (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4A624A: tail_fs_event (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x44FB0A: flb_input_collector_fd (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x462470: flb_engine_start (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788==    by 0x4446B6: flb_lib_worker (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==15788== 

@PettitWesley
Copy link
Contributor

FYI: #100 (comment)

@PettitWesley
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pending release
Projects
None yet
Development

No branches or pull requests

4 participants