
[storage layer] Fluent-bit crashes on high contention if corrupted chunks exists #1950

Closed
dgrala opened this issue Feb 13, 2020 · 13 comments
Labels: bug, fixed, waiting-for-user (Waiting for more information, tests or requested changes)

Comments


dgrala commented Feb 13, 2020

Bug Report

Describe the bug
It might be related to corrupted files in storage, or just a large backlog there. On start, Fluent Bit tries to load all of the files and crashes soon afterwards with a segfault.

To Reproduce

[2020/02/12 23:24:15] [error] [storage] format check failed: systemd.2/8-1581461952.525678621.flb
[2020/02/12 23:24:15] [error] [storage] format check failed: systemd.2/8-1581467643.662417426.flb
[2020/02/12 23:24:25] [error] [storage] format check failed: tail.0/163-1581549864.900113196.flb
[2020/02/12 23:24:25] [error] [storage] [cio file] cannot map chunk: tail.0/163-1581549864.900113196.flb
[2020/02/12 23:24:25] [error] [storage] format check failed: tail.0/163-1581549865.134681941.flb
[2020/02/12 23:24:25] [error] [storage] [cio file] cannot map chunk: tail.0/163-1581549865.134681941.flb
[engine] caught signal (SIGSEGV)
#0  0x5636a8cf8f0e      in  cio_file_st_get_meta_len() at lib/chunkio/include/chunkio/cio_file_st.h:72
#1  0x5636a8cf8f42      in  cio_file_st_get_content() at lib/chunkio/include/chunkio/cio_file_st.h:93
#2  0x5636a8cf93f1      in  cio_chunk_get_content() at lib/chunkio/src/cio_chunk.c:193
#3  0x5636a8a964f5      in  flb_input_chunk_flush() at src/flb_input_chunk.c:550
#4  0x5636a8a7c823      in  flb_engine_dispatch() at src/flb_engine_dispatch.c:146
#5  0x5636a8a79b22      in  flb_engine_flush() at src/flb_engine.c:85
#6  0x5636a8a7b318      in  flb_engine_handle_event() at src/flb_engine.c:247
#7  0x5636a8a7b318      in  flb_engine_start() at src/flb_engine.c:489
#8  0x5636a89ec813      in  main() at src/fluent-bit.c:853
#9  0x7f75cf5ecb96      in  ???() at ???:0
#10 0x5636a89ea9d9      in  ???() at ???:0
#11 0xffffffffffffffff  in  ???() at ???:0
Aborted (core dumped)
  • Steps to reproduce the problem:
    In one example we had 1.2 GB of files stuck in storage and the crash reproduced consistently. I moved the files out of storage, ran again, and it didn't crash; when I moved them back, it didn't seem to crash either. It might be hard to reproduce, but it seems to happen regularly in our production environment.
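
A minimal illustration of the failure mode suggested by the trace: the "cannot map chunk" errors above mean the chunk's backing memory map was never created, and reading header fields such as the metadata length through that missing mapping is exactly the kind of access that ends in SIGSEGV. The struct, helpers and header offsets below are hypothetical stand-ins for sketching purposes, not the actual chunkio code:

#include <stddef.h>

/* Hypothetical stand-in for a chunkio-style file chunk: the backing memory
 * map stays NULL when the format check or mapping step fails
 * ("cannot map chunk" in the log above). */
struct chunk_file {
    unsigned char *map;   /* NULL when the chunk could not be mapped */
    size_t map_size;      /* size of the mapping in bytes */
};

/* Unsafe access: reads header bytes unconditionally, which is the kind of
 * dereference that segfaults on an unmapped or corrupted chunk. The header
 * offsets here are illustrative, not the real on-disk layout. */
static int meta_len_unsafe(struct chunk_file *cf)
{
    return (cf->map[22] << 8) | cf->map[23];   /* crashes if map == NULL */
}

/* Defensive access: refuse to touch a chunk whose mapping is missing or
 * smaller than a fixed-size header. */
static int meta_len_safe(struct chunk_file *cf)
{
    if (cf->map == NULL || cf->map_size < 24) {
        return -1;
    }
    return (cf->map[22] << 8) | cf->map[23];
}

int main(void)
{
    struct chunk_file broken = { NULL, 0 };   /* simulates an unmapped chunk */

    (void) meta_len_unsafe;                   /* not called: it would crash */
    return meta_len_safe(&broken) == -1 ? 0 : 1;
}

This only illustrates why a missing guard around an unmapped chunk can bring down the whole engine; it is not the actual fix that later landed on the 1.3 branch.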

Expected behavior
No crashes

Screenshots

Your Environment

  • Version used: 1.3.2
  • Configuration:
  td-agent-bit.conf: |
    [SERVICE]
        # Rely on the supervisor service (eg kubelet) to restart
        # the fluentbit daemon when the configuration changes.
        Config_Watch              on
        # Given we run in a container stay in the foreground.
        Daemon                    off
        Flush                     1
        HTTP_Server               on
        HTTP_Listen               0.0.0.0
        HTTP_Port                 2020
        Log_Level                 warning
        Parsers_File              parsers.conf
        storage.path              /var/lib/fluentbit/storage/
        storage.sync              full
        storage.checksum          on
        # from https://github.com/fluent/fluent-bit/issues/1362#issuecomment-500166931
        storage.backlog.mem_limit 100M

Problematic input (one of 3):

[INPUT]
        Name              tail
        Tag               kube.<namespace_name>.<pod_name>.<container_name>.<docker_id>
        Tag_Regex         (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/lib/fluentbit/input_tail_kube.db
        Docker_Mode       On
        Mem_Buf_Limit     50MB
        Buffer_Chunk_Size 1MB
        Buffer_Max_Size   1MB
        Skip_Long_Lines   On
        storage.type      filesystem
        Refresh_Interval  10
  • Environment name and version (e.g. Kubernetes? What version?):
    k8s

  • Server type and version:

  • Operating System and version:
    NAME="Ubuntu"
    VERSION="18.04.3 LTS (Bionic Beaver)"

  • Filters and plugins:

Additional context
Ideally we would see no crashes. It seems that after any crash the storage files are corrupted and can't be loaded by subsequent runs.


edsiper commented Feb 13, 2020

@dgrala

Thanks for reporting this issue.

Are you able to reproduce the problem with only the chunk tail.0/163-1581549864.900113196.flb?

If so, can you provide me with that chunk? (It can be sent privately by email, Slack or another channel if required.)

edsiper self-assigned this Feb 13, 2020
edsiper added the bug, troubleshooting and waiting-for-user (Waiting for more information, tests or requested changes) labels Feb 13, 2020

dgrala commented Feb 13, 2020

Thanks @edsiper for looking into this. I don't have that pod or file anymore; I can verify when it pops up again.

One thing I forgot to mention: on the node that was crashing consistently on start after loading 1.2 GB of storage backlog, we increased storage.backlog.mem_limit from 100 MB to 2 GB and the crash no longer reproduced. So there is some relation between available memory and the files loaded from storage, which might make it hard to reproduce with a single file.
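
For reference, the workaround described above maps to a [SERVICE] change along these lines; this is a sketch based on the block posted in the report, with only the backlog limit raised (the 2G figure mirrors the comment above and is not a general recommendation):

[SERVICE]
    storage.path              /var/lib/fluentbit/storage/
    storage.sync              full
    storage.checksum          on
    # raised from 100M so the on-disk backlog loaded at startup fits in memory
    storage.backlog.mem_limit 2G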


edsiper commented Feb 14, 2020

Good news: I was able to reproduce the issue

Troubleshooting now


rmacian commented Feb 14, 2020

I am having a similar problem after enabling filesystem storage in 1.3.7

[2020/02/14 15:07:30] [debug] [storage] [cio file] alloc_size from 36864 to 102400
[2020/02/14 15:07:30] [debug] [storage] [cio file] synced at: tail.4/1-1581692850.200312113.flb
[2020/02/14 15:07:30] [debug] [in_tail] file=/var/log/containers/purchase-event-manager-5-vmmb9_gvp_purchase-event-manager-02060e929d3e460da6ac351689cd2f419d6ba192cced026a458110b095d93812.log read=32689 lines=71
[2020/02/14 15:07:30] [debug] [task] destroy task=0x7f14075faee0 (task_id=33)
[2020/02/14 15:07:30] [debug] [storage] tail.0:1-1581692755.636380503.flb mapped OK
[2020/02/14 15:07:30] [debug] [task] created task=0x7f14075faee0 id=33 OK
[2020/02/14 15:07:30] [debug] [out_forward] request 143790 bytes to flush
[2020/02/14 15:07:30] [debug] [out_fw] 154 entries tag='kube.var.log.containers.message-manager-north-5-f4cvk_tid_message-manager-north-4f3d19b958f53e2dc6e59555ba887569d0153f63a399ef1ba8d80b61f873e316.log' tag_len=148
[2020/02/14 15:07:30] [debug] [task] created task=0x7f140743ef40 id=126 OK
[2020/02/14 15:07:30] [debug] [out_forward] request 1239 bytes to flush
[2020/02/14 15:07:30] [debug] [out_fw] 1 entries tag='kube.var.log.containers.epg-agent-wait-2-46fc8_gvp_epg-agent-wait-e41122f15d432269e0156784949f781883d92669b0c5592300be5123a6060544.log' tag_len=134
[2020/02/14 15:07:30] [ warn] [task] retry for task 7 could not be re-scheduled
[2020/02/14 15:07:30] [debug] [retry] task retry=0x7f140740bae0, invalidated from the scheduler
[2020/02/14 15:07:30] [debug] [task] destroy task=0x7f14074400c0 (task_id=7)
[engine] caught signal (SIGSEGV)
#0  0x5653af67bebb      in  __mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:87
#1  0x5653af67bef2      in  mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:93
#2  0x5653af67c779      in  flb_sched_request_destroy() at src/flb_scheduler.c:314
#3  0x5653af67c8c7      in  flb_sched_event_handler() at src/flb_scheduler.c:375
#4  0x5653af67a17a      in  flb_engine_start() at src/flb_engine.c:548
#5  0x5653af5e8cc6      in  main() at src/fluent-bit.c:854
#6  0x7f14088b72e0      in  ???() at ???:0
#7  0x5653af5e71a9      in  ???() at ???:0
#8  0xffffffffffffffff  in  ???() at ???:0
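
For readers following this second trace: __mk_list_del / mk_list_del are the unlink helpers of an intrusive doubly-linked list, so a crash there usually means a node was removed after it (or its owning request) had already been unlinked or destroyed. A generic, hypothetical sketch of that pattern (not the actual scheduler code) is below:

#include <stddef.h>

/* Generic intrusive doubly-linked list node in the style of mk_list.
 * This is an illustration of the crash pattern, not the real
 * monkey/mk_core or Fluent Bit scheduler code. */
struct node {
    struct node *prev;
    struct node *next;
};

/* Unlink a node and poison its pointers, as many intrusive lists do. */
static void node_del(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = NULL;
    n->next = NULL;
}

int main(void)
{
    struct node head = { &head, &head };   /* empty circular list */
    struct node a;

    /* insert a right after head */
    a.prev = &head;
    a.next = head.next;
    head.next->prev = &a;
    head.next = &a;

    node_del(&a);        /* first removal: fine */
    /* node_del(&a);     a second removal would dereference the NULL prev/next
                         pointers and crash much like the trace above */
    return 0;
}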

edsiper added the work-in-process label and removed the waiting-for-user (Waiting for more information, tests or requested changes) label Feb 14, 2020

edsiper commented Feb 14, 2020

@dgrala

I've added the proper fixes on the 1.3 branch. Would you please deploy and evaluate the following test image?

edsiper/fluent-bit-1.3.8-next:1

If you confirm it is OK, I will proceed with a formal release.

Note: that image is based on Ubuntu and is heavier than the normal one.

edsiper added the fixed and waiting-for-user (Waiting for more information, tests or requested changes) labels and removed the troubleshooting and work-in-process labels Feb 14, 2020
edsiper changed the title from "Fluent-bit crashes" to "[storage layer] Fluent-bit crashes on high contention if corrupted chunks exists" Feb 14, 2020

edsiper commented Feb 14, 2020

@rmacian your issue looks different from this one; I've opened #1956 for it. Would you please follow up there?


dgrala commented Feb 15, 2020

Thanks for the quick fix @edsiper!

Where can I get the custom release? Normally we use apt-get to install it from https://packages.fluentbit.io/ubuntu/bionic


edsiper commented Feb 15, 2020 via email


edsiper commented Feb 15, 2020

@dgrala

Please give the following TEST PACKAGE a try:

DISCLAIMER: THIS IS NOT AN OFFICIAL RELEASE!


edsiper commented Feb 17, 2020

@dgrala do you have any feedback on the above?


dgrala commented Feb 18, 2020

Thanks @edsiper! We do run containers, but we use apt-get to build them. Thanks for the test package. Anything in particular we should watch out for? We don't have any decent test environment right now, so it will take us some time to set things up.


dgrala commented Feb 19, 2020

A new issue we're seeing today is td-agent-bit hanging and making no progress (#1958). Not sure if it's related; we're still on the old version.


dgrala commented Feb 21, 2020

Thanks, haven't seen crashes lately. We've updated to 1.3.8 today.

dgrala closed this as completed Feb 21, 2020