
[storage layer] Fluent-bit crashes on high contention if corrupted chunks exists #1950

Closed
dgrala opened this issue Feb 13, 2020 · 13 comments
Labels: bug, fixed, waiting-for-user (Waiting for more information, tests or requested changes)

Comments


dgrala commented Feb 13, 2020

Bug Report

Describe the bug
It might be related to corrupted files in storage, or just a large backlog there. On start, Fluent Bit tries to load all of the files and crashes soon afterwards with a segfault.

To Reproduce

[2020/02/12 23:24:15] [error] [storage] format check failed: systemd.2/8-1581461952.525678621.flb
[2020/02/12 23:24:15] [error] [storage] format check failed: systemd.2/8-1581467643.662417426.flb
[2020/02/12 23:24:25] [error] [storage] format check failed: tail.0/163-1581549864.900113196.flb
[2020/02/12 23:24:25] [error] [storage] [cio file] cannot map chunk: tail.0/163-1581549864.900113196.flb
[2020/02/12 23:24:25] [error] [storage] format check failed: tail.0/163-1581549865.134681941.flb
[2020/02/12 23:24:25] [error] [storage] [cio file] cannot map chunk: tail.0/163-1581549865.134681941.flb
[engine] caught signal (SIGSEGV)
#0  0x5636a8cf8f0e      in  cio_file_st_get_meta_len() at lib/chunkio/include/chunkio/cio_file_st.h:72
#1  0x5636a8cf8f42      in  cio_file_st_get_content() at lib/chunkio/include/chunkio/cio_file_st.h:93
#2  0x5636a8cf93f1      in  cio_chunk_get_content() at lib/chunkio/src/cio_chunk.c:193
#3  0x5636a8a964f5      in  flb_input_chunk_flush() at src/flb_input_chunk.c:550
#4  0x5636a8a7c823      in  flb_engine_dispatch() at src/flb_engine_dispatch.c:146
#5  0x5636a8a79b22      in  flb_engine_flush() at src/flb_engine.c:85
#6  0x5636a8a7b318      in  flb_engine_handle_event() at src/flb_engine.c:247
#7  0x5636a8a7b318      in  flb_engine_start() at src/flb_engine.c:489
#8  0x5636a89ec813      in  main() at src/fluent-bit.c:853
#9  0x7f75cf5ecb96      in  ???() at ???:0
#10 0x5636a89ea9d9      in  ???() at ???:0
#11 0xffffffffffffffff  in  ???() at ???:0
Aborted (core dumped)
  • Steps to reproduce the problem:
    In one example we had 1.2 GB of files stuck in storage and the crash reproduced consistently. I moved the files out of storage, ran again, and it didn't crash; when I moved them back, it didn't seem to crash either. It might be hard to reproduce, but it seems to happen regularly in our production environment.
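
A minimal illustration of the failure mode suggested by the trace: the "cannot map chunk" errors above mean the chunk's backing memory map was never created, and reading header fields such as the metadata length through that missing mapping is exactly the kind of access that ends in SIGSEGV. The struct, helpers and header offsets below are hypothetical stand-ins for sketching purposes, not the actual chunkio code:

#include <stddef.h>

/* Hypothetical stand-in for a chunkio-style file chunk: the backing memory
 * map stays NULL when the format check or mapping step fails
 * ("cannot map chunk" in the log above). */
struct chunk_file {
    unsigned char *map;   /* NULL when the chunk could not be mapped */
    size_t map_size;      /* size of the mapping in bytes */
};

/* Unsafe access: reads header bytes unconditionally, which is the kind of
 * dereference that segfaults on an unmapped or corrupted chunk. The header
 * offsets here are illustrative, not the real on-disk layout. */
static int meta_len_unsafe(struct chunk_file *cf)
{
    return (cf->map[22] << 8) | cf->map[23];   /* crashes if map == NULL */
}

/* Defensive access: refuse to touch a chunk whose mapping is missing or
 * smaller than a fixed-size header. */
static int meta_len_safe(struct chunk_file *cf)
{
    if (cf->map == NULL || cf->map_size < 24) {
        return -1;
    }
    return (cf->map[22] << 8) | cf->map[23];
}

int main(void)
{
    struct chunk_file broken = { NULL, 0 };   /* simulates an unmapped chunk */

    (void) meta_len_unsafe;                   /* not called: it would crash */
    return meta_len_safe(&broken) == -1 ? 0 : 1;
}

This only illustrates why a missing guard around an unmapped chunk can bring down the whole engine; it is not the actual fix that later landed on the 1.3 branch.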

Expected behavior
No crashes

Screenshots

Your Environment

  • Version used: 1.3.2
  • Configuration:
  td-agent-bit.conf: |
    [SERVICE]
        # Rely on the supervisor service (eg kubelet) to restart
        # the fluentbit daemon when the configuration changes.
        Config_Watch              on
        # Given we run in a container stay in the foreground.
        Daemon                    off
        Flush                     1
        HTTP_Server               on
        HTTP_Listen               0.0.0.0
        HTTP_Port                 2020
        Log_Level                 warning
        Parsers_File              parsers.conf
        storage.path              /var/lib/fluentbit/storage/
        storage.sync              full
        storage.checksum          on
        # from https://github.com/fluent/fluent-bit/issues/1362#issuecomment-500166931
        storage.backlog.mem_limit 100M

Problematic input (one of 3):

[INPUT]
        Name              tail
        Tag               kube.<namespace_name>.<pod_name>.<container_name>.<docker_id>
        Tag_Regex         (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/lib/fluentbit/input_tail_kube.db
        Docker_Mode       On
        Mem_Buf_Limit     50MB
        Buffer_Chunk_Size 1MB
        Buffer_Max_Size   1MB
        Skip_Long_Lines   On
        storage.type      filesystem
        Refresh_Interval  10
  • Environment name and version (e.g. Kubernetes? What version?):
    k8s

  • Server type and version:

  • Operating System and version:
    NAME="Ubuntu"
    VERSION="18.04.3 LTS (Bionic Beaver)"

  • Filters and plugins:

Additional context
Ideally we would see no crashes. It seems that after any crash the storage files are corrupted and can't be loaded by subsequent runs.


edsiper commented Feb 13, 2020

@dgrala

Thanks for reporting this issue.

Are you able to reproduce the problem with only the chunk tail.0/163-1581549864.900113196.flb?

If so, can you provide me with that chunk? (It can be sent privately by email, Slack or another channel if required.)

edsiper self-assigned this Feb 13, 2020
edsiper added the bug, troubleshooting and waiting-for-user (Waiting for more information, tests or requested changes) labels Feb 13, 2020

dgrala commented Feb 13, 2020

Thanks @edsiper for looking into this. I don't have that pod or file anymore; I can verify when it pops up again.

One thing I forgot to mention: on the node that was crashing consistently on start after loading 1.2 GB of storage backlog, we increased storage.backlog.mem_limit from 100 MB to 2 GB and the crash no longer reproduced. So there is some relation between available memory and the files loaded from storage, which might make it hard to reproduce with a single file.
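
For reference, the workaround described above maps to a [SERVICE] change along these lines; this is a sketch based on the block posted in the report, with only the backlog limit raised (the 2G figure mirrors the comment above and is not a general recommendation):

[SERVICE]
    storage.path              /var/lib/fluentbit/storage/
    storage.sync              full
    storage.checksum          on
    # raised from 100M so the on-disk backlog loaded at startup fits in memory
    storage.backlog.mem_limit 2G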


edsiper commented Feb 14, 2020

Good news: I was able to reproduce the issue

Troubleshooting now


rmacian commented Feb 14, 2020

I am having a similar problem after enabling filesystem storage in 1.3.7

[2020/02/14 15:07:30] [debug] [storage] [cio file] alloc_size from 36864 to 102400
[2020/02/14 15:07:30] [debug] [storage] [cio file] synced at: tail.4/1-1581692850.200312113.flb
[2020/02/14 15:07:30] [debug] [in_tail] file=/var/log/containers/purchase-event-manager-5-vmmb9_gvp_purchase-event-manager-02060e929d3e460da6ac351689cd2f419d6ba192cced026a458110b095d93812.log read=32689 lines=71
[2020/02/14 15:07:30] [debug] [task] destroy task=0x7f14075faee0 (task_id=33)
[2020/02/14 15:07:30] [debug] [storage] tail.0:1-1581692755.636380503.flb mapped OK
[2020/02/14 15:07:30] [debug] [task] created task=0x7f14075faee0 id=33 OK
[2020/02/14 15:07:30] [debug] [out_forward] request 143790 bytes to flush
[2020/02/14 15:07:30] [debug] [out_fw] 154 entries tag='kube.var.log.containers.message-manager-north-5-f4cvk_tid_message-manager-north-4f3d19b958f53e2dc6e59555ba887569d0153f63a399ef1ba8d80b61f873e316.log' tag_len=148
[2020/02/14 15:07:30] [debug] [task] created task=0x7f140743ef40 id=126 OK
[2020/02/14 15:07:30] [debug] [out_forward] request 1239 bytes to flush
[2020/02/14 15:07:30] [debug] [out_fw] 1 entries tag='kube.var.log.containers.epg-agent-wait-2-46fc8_gvp_epg-agent-wait-e41122f15d432269e0156784949f781883d92669b0c5592300be5123a6060544.log' tag_len=134
[2020/02/14 15:07:30] [ warn] [task] retry for task 7 could not be re-scheduled
[2020/02/14 15:07:30] [debug] [retry] task retry=0x7f140740bae0, invalidated from the scheduler
[2020/02/14 15:07:30] [debug] [task] destroy task=0x7f14074400c0 (task_id=7)
[engine] caught signal (SIGSEGV)
#0  0x5653af67bebb      in  __mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:87
#1  0x5653af67bef2      in  mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:93
#2  0x5653af67c779      in  flb_sched_request_destroy() at src/flb_scheduler.c:314
#3  0x5653af67c8c7      in  flb_sched_event_handler() at src/flb_scheduler.c:375
#4  0x5653af67a17a      in  flb_engine_start() at src/flb_engine.c:548
#5  0x5653af5e8cc6      in  main() at src/fluent-bit.c:854
#6  0x7f14088b72e0      in  ???() at ???:0
#7  0x5653af5e71a9      in  ???() at ???:0
#8  0xffffffffffffffff  in  ???() at ???:0
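
For readers following this second trace: __mk_list_del / mk_list_del are the unlink helpers of an intrusive doubly-linked list, so a crash there usually means a node was removed after it (or its owning request) had already been unlinked or destroyed. A generic, hypothetical sketch of that pattern (not the actual scheduler code) is below:

#include <stddef.h>

/* Generic intrusive doubly-linked list node in the style of mk_list.
 * This is an illustration of the crash pattern, not the real
 * monkey/mk_core or Fluent Bit scheduler code. */
struct node {
    struct node *prev;
    struct node *next;
};

/* Unlink a node and poison its pointers, as many intrusive lists do. */
static void node_del(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = NULL;
    n->next = NULL;
}

int main(void)
{
    struct node head = { &head, &head };   /* empty circular list */
    struct node a;

    /* insert a right after head */
    a.prev = &head;
    a.next = head.next;
    head.next->prev = &a;
    head.next = &a;

    node_del(&a);        /* first removal: fine */
    /* node_del(&a);     a second removal would dereference the NULL prev/next
                         pointers and crash much like the trace above */
    return 0;
}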

edsiper added the work-in-process label and removed the waiting-for-user (Waiting for more information, tests or requested changes) label Feb 14, 2020

edsiper commented Feb 14, 2020

@dgrala

I've added the proper fixes on the 1.3 branch. Would you please deploy and evaluate the following test image?

edsiper/fluent-bit-1.3.8-next:1

If you confirm it is OK, I will proceed with a formal release.

Note: that image is based on Ubuntu and is heavier than the normal one.

edsiper added the fixed and waiting-for-user (Waiting for more information, tests or requested changes) labels and removed the troubleshooting and work-in-process labels Feb 14, 2020
edsiper changed the title from "Fluent-bit crashes" to "[storage layer] Fluent-bit crashes on high contention if corrupted chunks exists" Feb 14, 2020

edsiper commented Feb 14, 2020

@rmacian your issue looks different from this one; I've opened #1956 for it. Would you please follow up there?


dgrala commented Feb 15, 2020

Thanks for the quick fix @edsiper!

Where can I get the custom release? Normally we use apt-get to install it from https://packages.fluentbit.io/ubuntu/bionic


edsiper commented Feb 15, 2020 via email


edsiper commented Feb 15, 2020

@dgrala

Please give the following TEST PACKAGE a try:

DISCLAIMER: THIS IS NOT AN OFFICIAL RELEASE!


edsiper commented Feb 17, 2020

@dgrala do you have any feedback on the above?


dgrala commented Feb 18, 2020

Thanks @edsiper! We do run containers, but we use apt-get to build them. Thanks for the test package. Anything in particular we should watch out for? We don't have any decent test environment right now, so it will take us some time to set things up.


dgrala commented Feb 19, 2020

A new issue we're seeing today is td-agent-bit hanging and making no progress (#1958). Not sure if it's related; we're still on the old version.


dgrala commented Feb 21, 2020

Thanks, haven't seen crashes lately. We've updated to 1.3.8 today.

dgrala closed this as completed Feb 21, 2020