Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Frequent crash when in_tail multiline on and file rotation #4177

Closed
wubn2000 opened this issue Oct 9, 2021 · 4 comments · Fixed by #4187
Closed

Frequent crash when in_tail multiline on and file rotation #4177

wubn2000 opened this issue Oct 9, 2021 · 4 comments · Fixed by #4187

Comments

@wubn2000
Copy link

wubn2000 commented Oct 9, 2021

Bug Report

Describe the bug

When turn on the multiline parser, ( the one I used is
[PARSER]
Name multiline_stats
Format regex
Regex /^(?\d{10,}.*)/
), if monitoring on a rotate log file, it always crashed due to the file rotation (after the rotated file got deleted), with info like:

*** Error in `./fluent-bit': double free or corruption (out): 0x00007f66c2f65e80 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81329)[0x7f66fea7e329]
./fluent-bit[0x4b73ea]
./fluent-bit(flb_tail_file_remove+0x13e)[0x4ba469]
./fluent-bit(flb_tail_file_purge+0x293)[0x4bc068]
./fluent-bit(flb_input_collector_fd+0x3b2)[0x4614bd]
./fluent-bit(flb_engine_start+0x707)[0x472660]
./fluent-bit[0x456cf5]

Without turning on multiline parser, it won't crash. But with it on, it always crash during the file deletion.

To Reproduce

  • Rubular link if applicable:
  • Example log message if applicable:
{"log":"YOUR LOG MESSAGE HERE","stream":"stdout","time":"2018-06-11T14:37:30.681701731Z"}

Note here is one message, with the last field as a multiline field:

1633801021538,"customer2788",13,2788,1623239523,"java.lang.RuntimeException: Something has gone wrong, aborting!
at com.myproject.module.MyProject.badMethod(MyProject.java:22)
at com.myproject.module.MyProject.oneMoreMethod(MyProject.java:18)
at com.myproject.module.MyProject.anotherMethod(MyProject.java:14)
at com.myproject.module.MyProject.someMethod(MyProject.java:10)
at com.myproject.module.MyProject.main(MyProject.java:6)","Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam. Ut enim ad minim ven."

  • Steps to reproduce the problem:

FLuentbit is monitoring only stats-0.csv. I keep appending this message to a file stats-0.csv, when the size reaches a threshold (can be 500k), do file rotation by "mv stats-0.csv stats-0-timestamp1.csv", then empty stats-0.csv and keep appending the message again. After a while, some oldest stats-0-timestamp1.csv will be deleted from the directory. The crash always happens if the old file stats-0-timestamp1.csv got deleted.

Expected behavior

The fluentbit should not crash.

Screenshots

Your Environment

  • Version used: 1.8.8
  • Configuration:
  • [INPUT]
    Name tail
    Path /data/test/stats-0.csv
    Path_Key file_name
    Tag stats
    Buffer_Chunk_Size 32KB
    Buffer_Max_Size 128KB
    Skip_Long_Lines On
    Multiline On
    Parser_Firstline multiline_stats
    Multiline_Flush 5
    Rotate_Wait 35
    Refresh_Interval 1
    Read_from_head true
    Mem_Buf_Limit 500MB
    storage.type filesystem
    DB /data/tail-0.db
    DB.locking true
    [SERVICE]
    Flush 1
    Daemon off
    Log_Level info
    Parsers_File parser.example.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_PORT 2020
    storage.path /data/storage
    storage.backlog.mem_limit 500M
    storage.max_chunks_up 200

@include inputs.example.conf

[OUTPUT]
Name stdout

  • Environment name and version (e.g. Kubernetes? What version?): ec2 machine
  • Server type and version:
  • Operating System and version:
  • Filters and plugins:

Additional context

@PettitWesley
Copy link
Contributor

@wubn2000 Can you open an issue here to track this one on the AWS side: https://github.com/aws/aws-for-fluent-bit

We'll try to get some info on the root cause using the valgrind tool: https://www.valgrind.org/

@PettitWesley PettitWesley changed the title Frequent crash when multiline on and file rotation Frequent crash when in_tail multiline on and file rotation Oct 11, 2021
@wubn2000
Copy link
Author

Thanks Wesley, just opened one there.

@PettitWesley
Copy link
Contributor

I've been unable to reproduce this using the steps provided.

I did get a crash on startup though, and its telling me the regex is invalid:

$ ./bin/fluent-bit -c ~/workday/fluent-bit.conf
Fluent Bit v1.8.8
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/10/12 00:14:46] [error] [parser:multiline_stats] Invalid regex pattern /^(?\d{10,}.*)/
[2021/10/12 00:14:46] [ info] [engine] started (pid=12092)
[2021/10/12 00:14:46] [ info] [storage] version=1.1.4, initializing...
[2021/10/12 00:14:46] [ info] [storage] in-memory
[2021/10/12 00:14:46] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/10/12 00:14:46] [ info] [cmetrics] version=0.2.2
[2021/10/12 00:14:46] [engine] caught signal (SIGSEGV)
#0  0x7f7c5419d84a      in  ???() at ???:0
#1  0x472941            in  flb_parser_get() at src/flb_parser.c:872
#2  0x4a9ab6            in  flb_tail_mult_create() at plugins/in_tail/tail_multiline.c:68
#3  0x4b29c8            in  flb_tail_config_create() at plugins/in_tail/tail_config.c:186
#4  0x4a7191            in  in_tail_init() at plugins/in_tail/tail.c:283
#5  0x44eec6            in  flb_input_instance_init() at src/flb_input.c:568
#6  0x44efa3            in  flb_input_init_all() at src/flb_input.c:604
#7  0x4621ba            in  flb_engine_start() at src/flb_engine.c:587
#8  0x444cd6            in  flb_lib_worker() at src/flb_lib.c:627
#9  0x7f7c557b740a      in  ???() at ???:0
#10 0x7f7c542010be      in  ???() at ???:0
#11 0xffffffffffffffff  in  ???() at ???:0
Aborted

Valgrind gives this:

[2021/10/12 00:22:27] [error] [parser:multiline_stats] Invalid regex pattern /^(?\d{10,}.*)/
[2021/10/12 00:22:27] [ info] [engine] started (pid=12159)
[2021/10/12 00:22:27] [ info] [storage] version=1.1.4, initializing...
==12159== Thread 2 flb-pipeline:
==12159== Invalid read of size 8
==12159==    at 0x47292F: flb_parser_get (flb_parser.c:872)
==12159==    by 0x4A9AB6: flb_tail_mult_create (tail_multiline.c:68)
==12159==    by 0x4B29C8: flb_tail_config_create (tail_config.c:186)
==12159==    by 0x4A7191: in_tail_init (tail.c:283)
==12159==    by 0x44EEC6: flb_input_instance_init (flb_input.c:568)
==12159==    by 0x44EFA3: flb_input_init_all (flb_input.c:604)
==12159==    by 0x4621BA: flb_engine_start (flb_engine.c:587)
==12159==    by 0x444CD6: flb_lib_worker (flb_lib.c:627)
==12159==    by 0x4E3F40A: start_thread (in /usr/lib64/libpthread-2.26.so)
==12159==    by 0x64340BE: clone (in /usr/lib64/libc-2.26.so)
==12159==  Address 0x8315978 is 8 bytes inside a block of size 144 free'd
==12159==    at 0x4C2CD28: free (vg_replace_malloc.c:530)
==12159==    by 0x470611: flb_free (flb_mem.h:122)
==12159==    by 0x4712BE: flb_parser_create (flb_parser.c:212)
==12159==    by 0x471EA2: parser_conf_file (flb_parser.c:544)
==12159==    by 0x472890: flb_parser_conf_file (flb_parser.c:848)
==12159==    by 0x45926E: flb_config_set_property (flb_config.c:522)
==12159==    by 0x4394BC: flb_service_conf (fluent-bit.c:797)
==12159==    by 0x43A3A4: flb_main (fluent-bit.c:1235)
==12159==    by 0x43A51D: main (fluent-bit.c:1302)
==12159==  Block was alloc'd at
==12159==    at 0x4C2DA2E: calloc (vg_replace_malloc.c:711)
==12159==    by 0x4705D8: flb_calloc (flb_mem.h:78)
==12159==    by 0x4710C1: flb_parser_create (flb_parser.c:175)
==12159==    by 0x471EA2: parser_conf_file (flb_parser.c:544)
==12159==    by 0x472890: flb_parser_conf_file (flb_parser.c:848)
==12159==    by 0x45926E: flb_config_set_property (flb_config.c:522)
==12159==    by 0x4394BC: flb_service_conf (fluent-bit.c:797)
==12159==    by 0x43A3A4: flb_main (fluent-bit.c:1235)
==12159==    by 0x43A51D: main (fluent-bit.c:1302)
==12159==
==12159== Invalid read of size 1
==12159==    at 0x63D0830: __strcmp_sse2_unaligned (in /usr/lib64/libc-2.26.so)
==12159==    by 0x472941: flb_parser_get (flb_parser.c:872)
==12159==    by 0x4A9AB6: flb_tail_mult_create (tail_multiline.c:68)
==12159==    by 0x4B29C8: flb_tail_config_create (tail_config.c:186)
==12159==    by 0x4A7191: in_tail_init (tail.c:283)
==12159==    by 0x44EEC6: flb_input_instance_init (flb_input.c:568)
==12159==    by 0x44EFA3: flb_input_init_all (flb_input.c:604)
==12159==    by 0x4621BA: flb_engine_start (flb_engine.c:587)
==12159==    by 0x444CD6: flb_lib_worker (flb_lib.c:627)
==12159==    by 0x4E3F40A: start_thread (in /usr/lib64/libpthread-2.26.so)
==12159==    by 0x64340BE: clone (in /usr/lib64/libc-2.26.so)
==12159==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==12159==
[2021/10/12 00:22:27] [2021/10/12 00:22:27] [ info] [storage] in-memory
[engine] caught signal (SIGSEGV)
[2021/10/12 00:22:27] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/10/12 00:22:27] [ info] [cmetrics] version=0.2.2
#0  0x63d0830           in  ???() at ???:0
#1  0x472941            in  flb_parser_get() at src/flb_parser.c:872
#2  0x4a9ab6            in  flb_tail_mult_create() at plugins/in_tail/tail_multiline.c:68
#3  0x4b29c8            in  flb_tail_config_create() at plugins/in_tail/tail_config.c:186
#4  0x4a7191            in  in_tail_init() at plugins/in_tail/tail.c:283
#5  0x44eec6            in  flb_input_instance_init() at src/flb_input.c:568
#6  0x44efa3            in  flb_input_init_all() at src/flb_input.c:604
#7  0x4621ba            in  flb_engine_start() at src/flb_engine.c:587
#8  0x444cd6            in  flb_lib_worker() at src/flb_lib.c:627
#9  0x4e3f40a           in  ???() at ???:0
#10 0x64340be           in  ???() at ???:0
#11 0xffffffffffffffff  in  ???() at ???:0
==12159==
==12159== Process terminating with default action of signal 6 (SIGABRT)
==12159==    at 0x637AC20: raise (in /usr/lib64/libc-2.26.so)
==12159==    by 0x637C0C7: abort (in /usr/lib64/libc-2.26.so)
==12159==    by 0x438DD1: flb_signal_handler (fluent-bit.c:561)
==12159==    by 0x637AC8F: ??? (in /usr/lib64/libc-2.26.so)
==12159==    by 0x63D082F: ??? (in /usr/lib64/libc-2.26.so)
==12159==    by 0x4A9AB6: flb_tail_mult_create (tail_multiline.c:68)
==12159==    by 0x4B29C8: flb_tail_config_create (tail_config.c:186)
==12159==    by 0x4A7191: in_tail_init (tail.c:283)
==12159==    by 0x44EEC6: flb_input_instance_init (flb_input.c:568)
==12159==    by 0x44EFA3: flb_input_init_all (flb_input.c:604)
==12159==    by 0x4621BA: flb_engine_start (flb_engine.c:587)
==12159==    by 0x444CD6: flb_lib_worker (flb_lib.c:627)
==12159==
==12159== HEAP SUMMARY:
==12159==     in use at exit: 118,947 bytes in 554 blocks
==12159==   total heap usage: 1,083 allocs, 529 frees, 170,448 bytes allocated
==12159==
==12159== LEAK SUMMARY:
==12159==    definitely lost: 0 bytes in 0 blocks
==12159==    indirectly lost: 0 bytes in 0 blocks
==12159==      possibly lost: 39,071 bytes in 505 blocks
==12159==    still reachable: 79,876 bytes in 49 blocks
==12159==                       of which reachable via heuristic:
==12159==                         newarray           : 320 bytes in 4 blocks
==12159==         suppressed: 0 bytes in 0 blocks
==12159== Rerun with --leak-check=full to see details of leaked memory
==12159==
==12159== For counts of detected and suppressed errors, rerun with: -v
==12159== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
Aborted

I think this is a separate issue where an invalid parser causes it to crash. Changing the parser to remove the invalid warning got rid of the crash.

@PettitWesley
Copy link
Contributor

Valgrind logs for crash: aws/aws-for-fluent-bit#255 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants