
Fluent-bit stuck not doing any work #1958

Closed
dgrala opened this issue Feb 19, 2020 · 25 comments

dgrala commented Feb 19, 2020

Bug Report

td-agent-bit seems stuck at the same stack trace, not doing any work. New instances of the process work fine on the same node.

To Reproduce
It appears stuck at this stack trace:

Thread 2 (Thread 0x7f6bec7ff700 (LWP 39)):
#0  0x00007f6bedad1bb7 in epoll_wait (epfd=3, events=0x7f6becc3f0c0, maxevents=16, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00005615ca1a1a46 in _mk_event_wait (loop=0x7f6becc150a8) at /tmp/fluent-bit-1.3.2/lib/monkey/mk_core/mk_event_epoll.c:345
#2  0x00005615ca1a1d36 in mk_event_wait (loop=0x7f6becc150a8) at /tmp/fluent-bit-1.3.2/lib/monkey/mk_core/mk_event.c:163
#3  0x00005615c9dfa4fe in log_worker_collector (data=0x7f6becc68000) at /tmp/fluent-bit-1.3.2/src/flb_log.c:124
#4  0x00005615c9e0f22e in step_callback (data=0x7f6becc6e070) at /tmp/fluent-bit-1.3.2/src/flb_worker.c:51
#5  0x00007f6bee9ee6db in start_thread (arg=0x7f6bec7ff700) at pthread_create.c:463
#6  0x00007f6bedad188f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f6beee224c0 (LWP 6)):
#0  0x00007f6bee9f8384 in __libc_read (fd=99, buf=0x7f6be3c7b008, nbytes=5) at ../sysdeps/unix/sysv/linux/read.c:27
#1  0x00005615c9efc9a7 in mbedtls_net_recv (ctx=0x7f6becc6e7d0, buf=0x7f6be3c7b008 "\026\003\003", len=5)
    at /tmp/fluent-bit-1.3.2/lib/mbedtls-2.16.3/library/net_sockets.c:549
#2  0x00005615c9f02878 in mbedtls_ssl_fetch_input (ssl=0x7f6beb6c8d00, nb_want=5)
    at /tmp/fluent-bit-1.3.2/lib/mbedtls-2.16.3/library/ssl_tls.c:2715
#3  0x00005615c9f06ec1 in ssl_get_next_record (ssl=0x7f6beb6c8d00) at /tmp/fluent-bit-1.3.2/lib/mbedtls-2.16.3/library/ssl_tls.c:4971
#4  0x00005615c9f05a83 in mbedtls_ssl_read_record (ssl=0x7f6beb6c8d00, update_hs_digest=1)
    at /tmp/fluent-bit-1.3.2/lib/mbedtls-2.16.3/library/ssl_tls.c:4338
#5  0x00005615c9f0cade in mbedtls_ssl_read (ssl=0x7f6beb6c8d00, 
    buf=0x7f6beb6e8000 "# Kubernetes-managed hosts file.\n127.0.0.1\tlocalhost\n::1\tlocalhost ip6-localhost ip6-loopback\nfe00::0\tip6-localnet\nfe00::0\tip6-mcastprefix\nfe00::1\tip6-allnodes\nfe00::2\tip6-allrouters\n10.162.233.179\tfl"..., len=4095)
    at /tmp/fluent-bit-1.3.2/lib/mbedtls-2.16.3/library/ssl_tls.c:8330
#6  0x00005615c9e1bb38 in flb_io_tls_net_read (th=0x7f6becc6e700, u_conn=0x7f6becc6e770, buf=0x7f6beb6e8000, len=4095)
    at /tmp/fluent-bit-1.3.2/src/flb_io_tls.c:410
#7  0x00005615c9e26835 in flb_io_net_read (u_conn=0x7f6becc6e770, buf=0x7f6beb6e8000, len=4095)
    at /tmp/fluent-bit-1.3.2/src/flb_io.c:484
#8  0x00005615c9ea4c3a in flb_http_do (c=0x7f6becc65120, bytes=0x7f6be3c87e00) at /tmp/fluent-bit-1.3.2/src/flb_http_client.c:849
#9  0x00005615c9e7c250 in cb_stackdriver_flush (data=0x7f6beeddb0ab, bytes=532, 
    tag=0x7f6becc64ea0 "kube.prod.ourservice--1-0-1071--sf7g73f2-6db9f5768d-h99sn.application.a1193baa2f619f8fa729915e41d0b2d6512ab35e1688b4d72a4c68e86ad07b7d", tag_len=147, i_ins=0x7f6becc5e000, out_context=0x7f6becc64a40, config=0x7f6becc0f680)
    at /tmp/fluent-bit-1.3.2/plugins/out_stackdriver/stackdriver.c:662
#10 0x00005615c9e09c9f in output_pre_cb_flush () at /tmp/fluent-bit-1.3.2/include/fluent-bit/flb_output.h:315
#11 0x00005615ca1a4107 in co_init () at /tmp/fluent-bit-1.3.2/lib/monkey/deps/flb_libco/amd64.c:117
#12 0x0000000000000000 in ?? ()
  • Steps to reproduce the problem:
    Unclear. It hangs on our prod infra intermittently. We use fluentbit.output.proc.records.total.counter and fluentbit.input.records.total.counter to monitor, but a hung process doesn't produce input or output metrics.

Note that I don't see ourservice under /var/log/containers/ourservice--1-0-1071*, but it might've been there before.

Expected behavior
The application should crash (rather than hang) so that we can restart it.

Your Environment

  • Version used: 1.3.2
  • Configuration:
  td-agent-bit.conf: |
    [SERVICE]
        # Rely on the supervisor service (eg kubelet) to restart
        # the fluentbit daemon when the configuration changes.
        Config_Watch              on
        # Given we run in a container stay in the foreground.
        Daemon                    off
        Flush                     1
        HTTP_Server               on
        HTTP_Listen               0.0.0.0
        HTTP_Port                 2020
        Log_Level                 warning
        Parsers_File              parsers.conf
        storage.path              /var/lib/fluentbit/storage/
        storage.sync              full
        storage.checksum          on
        # from https://github.com/fluent/fluent-bit/issues/1362#issuecomment-500166931
        storage.backlog.mem_limit 100M
    [INPUT]
        Name              tail
        Tag               kube.<namespace_name>.<pod_name>.<container_name>.<docker_id>
        Tag_Regex         (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/lib/fluentbit/input_tail_kube.db
        Docker_Mode       On
        Mem_Buf_Limit     50MB
        Buffer_Chunk_Size 1MB
        Buffer_Max_Size   1MB
        Skip_Long_Lines   On
        storage.type      filesystem
        Refresh_Interval  10
  • Environment name and version (e.g. Kubernetes? What version?):
    k8s

  • Server type and version:

  • Operating System and version:
    NAME="Ubuntu"
    VERSION="18.04.3 LTS (Bionic Beaver)"

  • Filters and plugins:

Additional context

edsiper (Member) commented Feb 19, 2020

Please try the test image:

edsiper/fluent-bit-1.3.8-next:1

dgrala (Author) commented Feb 19, 2020

@edsiper we can't deploy the test image to our prod infra right away. Do you expect that the hang is related to the other crashes?

edsiper (Member) commented Feb 19, 2020

I don't see a crash here, just a normal trace.

I will release v1.3.8 today, so you can give it a try.

Between 1.3.2 and 1.3.7 there are many fixes; upgrading within the stable series is always encouraged. You can take a look here:

https://fluentbit.io/announcements/

dgrala (Author) commented Feb 19, 2020

Yes, the problem is that Fluent Bit is stuck at the same trace, making no progress despite logs waiting on the system.

Thanks, we will need to upgrade.

edsiper (Member) commented Feb 20, 2020

@dgrala

Fluent Bit v1.3.8 has been officially released, please upgrade and send us some feedback:

https://fluentbit.io/announcements/v1.3.8/

Docker image: fluent/fluent-bit:1.3.8

edsiper added the bug and waiting-for-user labels Feb 20, 2020
edsiper self-assigned this Feb 20, 2020
dgrala (Author) commented Feb 21, 2020

Hi @edsiper, we're still seeing Fluent Bit hanging with 1.3.8. What can we do to triage it better? It seems like something that might be specific to the Stackdriver output plugin.

The stack trace we're seemingly stuck at:

Thread 2 (Thread 0x7f70359ff700 (LWP 37)):
#0  0x00007f7036e62bb7 in epoll_wait (epfd=3, events=0x7f7035e3f0c0, maxevents=16, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055bd530239b0 in _mk_event_wait (loop=0x7f7035e150c0) at /tmp/fluent-bit-1.3.8/lib/monkey/mk_core/mk_event_epoll.c:345
#2  0x000055bd53023ca0 in mk_event_wait (loop=0x7f7035e150c0) at /tmp/fluent-bit-1.3.8/lib/monkey/mk_core/mk_event.c:163
#3  0x000055bd52c6731e in log_worker_collector (data=0x7f7035e68000) at /tmp/fluent-bit-1.3.8/src/flb_log.c:124
#4  0x000055bd52c7d313 in step_callback (data=0x7f7035e6e070) at /tmp/fluent-bit-1.3.8/src/flb_worker.c:51
#5  0x00007f7037d7f6db in start_thread (arg=0x7f70359ff700) at pthread_create.c:463
#6  0x00007f7036e6288f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f70381b34c0 (LWP 1)):
#0  0x00007f7037d89384 in __libc_read (fd=51, buf=0x7f70284da7c8, nbytes=5) at ../sysdeps/unix/sysv/linux/read.c:27
#1  0x000055bd52d6b617 in mbedtls_net_recv (ctx=0x7f7035e51970, buf=0x7f70284da7c8 "\026\003\003", len=5)
    at /tmp/fluent-bit-1.3.8/lib/mbedtls-2.16.3/library/net_sockets.c:549
#2  0x000055bd52d714e8 in mbedtls_ssl_fetch_input (ssl=0x7f7026051980, nb_want=5)
    at /tmp/fluent-bit-1.3.8/lib/mbedtls-2.16.3/library/ssl_tls.c:2715
#3  0x000055bd52d75b31 in ssl_get_next_record (ssl=0x7f7026051980) at /tmp/fluent-bit-1.3.8/lib/mbedtls-2.16.3/library/ssl_tls.c:4971
#4  0x000055bd52d746f3 in mbedtls_ssl_read_record (ssl=0x7f7026051980, update_hs_digest=1)
    at /tmp/fluent-bit-1.3.8/lib/mbedtls-2.16.3/library/ssl_tls.c:4338
#5  0x000055bd52d7b74e in mbedtls_ssl_read (ssl=0x7f7026051980, 
    buf=0x7f7035e48000 "POST /v2/entries:write HTTP/1.1\r\nHost: logging.googleapis.com:443\r\nContent-Length: 2706\r\nUser-Agent: Fluent-Bit\r\nContent-Type: application/json\r\nAuthorization: Bearer ya29.c.Ko8BvwfQtVBPh3QAEUSoJhfAWZ"..., len=4095)
    at /tmp/fluent-bit-1.3.8/lib/mbedtls-2.16.3/library/ssl_tls.c:8330
#6  0x000055bd52c89c68 in flb_io_tls_net_read (th=0x7f7035e6e690, u_conn=0x7f7035e51900, buf=0x7f7035e48000, len=4095)
    at /tmp/fluent-bit-1.3.8/src/flb_io_tls.c:410
#7  0x000055bd52c9491e in flb_io_net_read (u_conn=0x7f7035e51900, buf=0x7f7035e48000, len=4095)
    at /tmp/fluent-bit-1.3.8/src/flb_io.c:483
#8  0x000055bd52d138ab in flb_http_do (c=0x7f7035e64a40, bytes=0x7f70287df240) at /tmp/fluent-bit-1.3.8/src/flb_http_client.c:873
#9  0x000055bd52ceac96 in cb_stackdriver_flush (data=0x7f703812c085, bytes=435, 
    tag=0x7f7035e6e620 "kube.infrastructure.consul-vlz2j.application.ff6870f26ed9dc642db97e76ed37f88185b2207ff711de1333b7a6981f4e51aa", tag_len=109, i_ins=0x7f7035e5e280, out_context=0x7f7035e66200, config=0x7f7035e0f980)
    at /tmp/fluent-bit-1.3.8/plugins/out_stackdriver/stackdriver.c:662
#10 0x000055bd52c7759c in output_pre_cb_flush () at /tmp/fluent-bit-1.3.8/include/fluent-bit/flb_output.h:320
#11 0x000055bd53026067 in co_init () at /tmp/fluent-bit-1.3.8/lib/monkey/deps/flb_libco/amd64.c:117
#12 0x00007f7035e68de0 in ?? ()
#13 0x0000000152d139e3 in ?? ()
#14 0x00007f7035e6e5b0 in ?? ()
#15 0x00007f7035e6e5b0 in ?? ()
#16 0x00007f70287df3c0 in ?? ()
#17 0x000055bd52ceade7 in cb_stackdriver_flush (data=0x0, bytes=0, tag=0x0, tag_len=0, i_ins=0x0, out_context=0x0, config=0x0)
    at /tmp/fluent-bit-1.3.8/plugins/out_stackdriver/stackdriver.c:695
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

edsiper (Member) commented Feb 21, 2020

  • How did you generate this stack trace?
  • When that happens, how big is your filesystem queue?
  • To rule out a disk I/O performance problem, do you see any difference if you change the storage parameters to this?
        storage.sync           normal
        storage.checksum       off
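
    Applied to the [SERVICE] section posted above, that would look roughly like this (a sketch; only the two storage lines change, everything else stays as configured):

        [SERVICE]
            # ... other settings unchanged ...
            storage.path              /var/lib/fluentbit/storage/
            storage.sync              normal
            storage.checksum          off
            storage.backlog.mem_limit 100M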

dgrala (Author) commented Feb 21, 2020

I attached gdb to the running process.
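
Roughly like this (a sketch; assumes gdb is installed on the node and <pid> is the td-agent-bit process ID):

    # attach in batch mode, dump the stack of every thread, then detach
    gdb -p <pid> -batch -ex "thread apply all bt"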

The storage backlog is 62MB on one of the nodes I'm looking at.

I will change the storage settings. Will report back.

edsiper (Member) commented Feb 21, 2020

One thing to try, to rule out slow TLS handshakes: in the output plugin section, add the new keepalive feature:

KeepAlive On
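
For example (a sketch only; the actual output section is not shown in this report, so the Name/Match values below are placeholders based on the tail tag prefix):

    [OUTPUT]
        Name      stackdriver
        Match     kube.*
        KeepAlive On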

dgrala (Author) commented Feb 21, 2020

If I set storage.checksum to off, it crashes with:

[engine] caught signal (SIGSEGV)
[2020/02/21 22:39:45] [ warn] unknown time format 7a2cd9e0
#0  0x55eecb233fd2      in  flb_time_pop_from_msgpack() at src/flb_time.c:184
#1  0x55eecb2a065b      in  stackdriver_format() at plugins/out_stackdriver/stackdriver.c:528
#2  0x55eecb2a0b42      in  cb_stackdriver_flush() at plugins/out_stackdriver/stackdriver.c:633
#3  0x55eecb22d59b      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:320
#4  0x55eecb5dc066      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
Segmentation fault (core dumped)

edsiper (Member) commented Feb 21, 2020

@dgrala would it be possible for me to get a copy of your storage backlog? (Through a private mechanism, of course.)

dgrala (Author) commented Feb 21, 2020

@edsiper sorry, we won't be able to share logs from our production cluster :/. We could perhaps set up a test cluster with some toy applications running, but that will take some time to set up.

edsiper (Member) commented Feb 21, 2020

Can you identify the corrupted chunk file? (It's corrupted anyway.)
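
E.g. something like this (a rough sketch, assuming the storage.path from your config; chunks are stored in per-input subdirectories such as the tail.2 you mention later):

    # list chunk files under the configured storage.path
    ls -l /var/lib/fluentbit/storage/*/
    # with storage.checksum on, check the startup log for the chunks it reports as corrupted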

dgrala (Author) commented Feb 21, 2020

If I turn storage.checksum on, it shows multiple corrupted chunk files on start. I'm not clear whether they're a result of the issue we're seeing, or the cause.

edsiper (Member) commented Feb 21, 2020

If checksum is on and it warns about corrupted chunks, those chunks are not processed.

Now, with storage.sync normal and keepalive on, do you see any difference in the data flow?

dgrala (Author) commented Feb 21, 2020

If I start a new td-agent-bit process, it processes the backlog in storage just fine, so the issue doesn't reproduce on the same node.

We will have to apply the config change cluster-wide and wait. Let me try that.

dgrala (Author) commented Feb 24, 2020

Hi @edsiper, we've changed our settings to storage.sync normal and keepalive on, but we're still seeing pods get stuck. I see lots of logs being written to /var/log/containers/... and can confirm with tail -f that logs are constantly written there. However, there is nothing new in /var/lib/fluentbit/storage, other than some 768KB in 3 files under tail.2 (one log line per file). These we could share with you.

fluentbit.input.records.total.counter hasn't shown any new logs for a couple of hours.
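
(To double-check from the node itself, a quick sketch, assuming jq is installed, using the HTTP endpoint already enabled in our config:)

    watch -n 60 "curl -s http://127.0.0.1:2020/api/v1/metrics | jq '.input'"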

What would you suggest we do?

edsiper (Member) commented Feb 24, 2020

Hi @dgrala, I am troubleshooting a similar issue right now, but at a much higher scale of backlog files. I will post my findings shortly.

edsiper (Member) commented Feb 25, 2020

@dgrala on #1975 I am waiting for final confirmation before doing a new release with further performance improvements.

I am not 100% confident this is the exact same issue you are describing, but it could be related.

Do you have specific steps I can follow to reproduce the problem in an isolated way?

dgrala (Author) commented Feb 25, 2020

@edsiper thanks for the update - it seems like a different issue. For us, many pods just stop making progress and are blocked for good. It might be related to storage, but we don't see much data in storage.

We have to restart 30%+ of the fleet every couple of hours. We don't have an isolated repro at the moment, as we're only running this setup in our production cluster.

edsiper (Member) commented Feb 25, 2020

@dgrala if you look at the metrics exposed by the HTTP endpoint, do you see a lot of retries in the output plugins? E.g.:

$ curl -s http://127.0.0.1:2020/api/v1/metrics|jq
{
  "input": {
    "cpu.0": {
      "records": 17,
      "bytes": 5389
    }
  },
  "filter": {},
  "output": {
    "stdout.0": {
      "proc_records": 16,
      "proc_bytes": 5072,
      "errors": 0,
      "retries": 0,
      "retries_failed": 0
    }
  }
}

dgrala (Author) commented Feb 25, 2020

@edsiper no, not too many retries:

{
  "input": {
    "tail.0": {
      "records": 28609700,
      "bytes": 6132562878,
      "files_opened": 797,
      "files_closed": 722,
      "files_rotated": 612
    },
    "systemd.1": {
      "records": 1583,
      "bytes": 1784999
    },
    "systemd.2": {
      "records": 421,
      "bytes": 324958
    },
    "storage_backlog.3": {
      "records": 0,
      "bytes": 0
    }
  },
  "filter": {
    "lua.0": {
      "drop_records": 0,
      "add_records": 0
    },
    "modify.1": {
      "drop_records": 0,
      "add_records": 0
    }
  },
  "output": {
    "stackdriver.0": {
      "proc_records": 28610950,
      "proc_bytes": 13444541803,
      "errors": 0,
      "retries": 2,
      "retries_failed": 0
    }
  }
}

The metrics don't change if I run the query a few minutes later, on "stuck" pods.

I actually do see a little network traffic from td-agent-bit, so the app doesn't seem deadlocked:
[nethogs screenshot]

edsiper (Member) commented Mar 2, 2020

@dgrala would it be possible to arrange a Zoom session so we can troubleshoot?

dgrala (Author) commented Mar 3, 2020

Yes @edsiper, I emailed you the Zoom meeting invite, but it looks like you couldn't make it earlier today? What day/time works for you?

edsiper added the fixed label and removed the waiting-for-user label Mar 5, 2020
dgrala (Author) commented Jul 24, 2020

Confirmed solved as of Fluent Bit 1.4.6.
