
fluent-bit prints garbage on shutdown #429

Closed
jsravn opened this issue Nov 13, 2017 · 31 comments

jsravn (Contributor) commented Nov 13, 2017

Whenever I send a SIGTERM to fluent-bit, I get this output:

[2017/11/08 17:35:40] [ warn] [engine] service will stop in 5 seconds
[2017/11/08 17:35:44] [ info] [engine] service stopped
Fluent-Bit v0.12.8
Copyright (C) Treasure Data

"took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors"

Seems harmless otherwise, but looks like some sort of buffer overrun or something.

edsiper (Member) commented Nov 13, 2017

Hmm, this looks like it's related to the Elasticsearch output plugin.
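
For context, the repeated strings match the top-level keys of an Elasticsearch Bulk API response; an illustrative response body (values made up) looks like:

{"took": 30, "errors": false, "items": [{"index": {"_index": "logstash-2017.11.08", "status": 201}}]}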

zersh01 commented Nov 28, 2017

Hi.
I have this "error" too:

$ uname -a
Linux ott-1 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/issue
Ubuntu 16.04.1 LTS

Fluent-Bit v0.12.8
Copyright (C) Treasure Data

[2017/11/28 10:07:57] [ info] Configuration
[2017/11/28 10:07:57] [ info] flush time : 3 seconds
[2017/11/28 10:07:57] [ info] input plugins :
[2017/11/28 10:07:57] [ info] tail
[2017/11/28 10:07:57] [ info] tail
[2017/11/28 10:07:57] [ info] tail
[2017/11/28 10:07:57] [ info] collectors :
"took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors"....

This happens on shutdown and every 15 minutes.

dardymek commented Dec 5, 2017

Hi, I have the same problem a few minutes after restart, and part of the logs were not imported into Elasticsearch. There are no errors in the Elasticsearch logs.
3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Dec 05 14:19:14 newhope-LOCUST2 td-agent-bit[13024]: Fluent-Bit v0.12.9
Dec 05 14:19:14 newhope-LOCUST2 td-agent-bit[13024]: Copyright (C) Treasure Data
Dec 05 14:19:14 newhope-LOCUST2 td-agent-bit[13024]: "took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""er
Dec 05 14:24:18 newhope-LOCUST2 td-agent-bit[13024]: k""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors
Dec 05 14:24:18 newhope-LOCUST2 td-agent-bit[13024]: rrors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""to

Config:

[SERVICE]
    Flush         1
    Log_Level     info
    Daemon        off
    Parsers_File  parsers.conf
[INPUT]
    Name              systemd
    Tag               services.*
    Systemd_Filter    _SYSTEMD_UNIT=locust-master.service
    Systemd_Filter    [email protected]
    Buffer_Chunk_Size 400k
[OUTPUT]
    Name            es-services
    Match           services.*
    Host            10.91.118.85
    Port            9200
    Logstash_Format On
    Logstash_Prefix logstash
    Retry_Limit     False

hdiass commented Dec 13, 2017

Hello
I'm also facing this issue using ES output.

edsiper (Member) commented Feb 13, 2018

Is this garbage reproducible with the latest v0.12.13?

intelliguy commented Feb 20, 2018

v0.12.14 also produces the messages.
Details are below.

kubectl logs fluentbit-mzs9k

+ exec /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf
[2018/02/19 02:37:35] [ info] [engine] started
[2018/02/19 02:37:35] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/02/19 02:37:35] [ info] [filter_kube] local POD info OK
[2018/02/19 02:37:35] [ info] [filter_kube] testing connectivity with API server...
[2018/02/19 02:37:35] [error] [filter_kube] could not get meta for POD taco-aio
Fluent-Bit v0.12.14
Copyright (C) Treasure Data
"took""errors"............."took""errors""too2018-02-19T02:47:44.134022547Z
k""errors""took"........."errors""took""e2018-02-19T02:51:50.062630753Z
rrors""took"........"errors""took""error2018-02-19T02:55:57.010085357Z

onorua (Contributor) commented Feb 26, 2018

We faced the same issue. I built fluent-bit with debugging symbols, and here is what happened:

"took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors"init vars finishing
fluent-bit: /tmp/src/lib/monkey/deps/flb_libco/amd64.c:121: crash: Assertion `0' failed.
Aborted (core dumped)

core file analysis:

(gdb) thread apply all bt

Thread 6 (Thread 0x7ff5e09ff700 (LWP 26)):
#0  0x00007ff5e3a1824d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007ff5e3a1819a in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2  0x0000559ba72c0cd1 in mk_clock_worker_init (data=0x7ff5e261c000) at /tmp/src/lib/monkey/mk_server/mk_clock.c:112
#3  0x00007ff5e4ae3494 in start_thread (arg=0x7ff5e09ff700) at pthread_create.c:333
#4  0x00007ff5e3a48aff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

Thread 5 (Thread 0x7ff5e01fe700 (LWP 27)):
#0  0x00007ff5e3a490f3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
#1  0x0000559ba72c7e2b in _mk_event_wait (loop=0x7ff5df622000) at /tmp/src/lib/monkey/mk_core/mk_event_epoll.c:340
#2  0x0000559ba72c811b in mk_event_wait (loop=0x7ff5df622000) at /tmp/src/lib/monkey/mk_core/mk_event.c:163
#3  0x0000559ba72c1d86 in mk_server_worker_loop (server=0x7ff5e261c000) at /tmp/src/lib/monkey/mk_server/mk_server.c:435
#4  0x0000559ba72b93ba in mk_sched_launch_worker_loop (data=0x7ff5e0a0d008) at /tmp/src/lib/monkey/mk_server/mk_scheduler.c:411
#5  0x00007ff5e4ae3494 in start_thread (arg=0x7ff5e01fe700) at pthread_create.c:333
#6  0x00007ff5e3a48aff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

Thread 4 (Thread 0x7ff5e1dfe700 (LWP 24)):
#0  0x00007ff5e3a1824d in nanosleep () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007ff5e3a1819a in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2  0x0000559ba70a21c6 in time_ticker (args=0x7ff5e262c030) at /tmp/src/plugins/filter_throttle/throttle.c:83
#3  0x00007ff5e4ae3494 in start_thread (arg=0x7ff5e1dfe700) at pthread_create.c:333
#4  0x00007ff5e3a48aff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

Thread 3 (Thread 0x7ff5e15fd700 (LWP 25)):
#0  0x00007ff5e3a490f3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
#1  0x0000559ba72c7e2b in _mk_event_wait (loop=0x7ff5e262c318) at /tmp/src/lib/monkey/mk_core/mk_event_epoll.c:340
#2  0x0000559ba72c811b in mk_event_wait (loop=0x7ff5e262c318) at /tmp/src/lib/monkey/mk_core/mk_event.c:163
#3  0x0000559ba72b13cc in mk_lib_worker (data=0x7ff5e262c330) at /tmp/src/lib/monkey/mk_server/mk_lib.c:150
#4  0x00007ff5e4ae3494 in start_thread (arg=0x7ff5e15fd700) at pthread_create.c:333
#5  0x00007ff5e3a48aff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

Thread 2 (Thread 0x7ff5e25ff700 (LWP 23)):
#0  0x00007ff5e3a490f3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
#1  0x0000559ba72c7e2b in _mk_event_wait (loop=0x7ff5e262c018) at /tmp/src/lib/monkey/mk_core/mk_event_epoll.c:340
#2  0x0000559ba72c811b in mk_event_wait (loop=0x7ff5e262c018) at /tmp/src/lib/monkey/mk_core/mk_event.c:163
#3  0x0000559ba70430f4 in log_worker_collector (data=0x7ff5e266e060) at /tmp/src/src/flb_log.c:122
#4  0x0000559ba7053039 in step_callback (data=0x7ff5e2677070) at /tmp/src/src/flb_worker.c:50
#5  0x00007ff5e4ae3494 in start_thread (arg=0x7ff5e25ff700) at pthread_create.c:333
#6  0x00007ff5e3a48aff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

Thread 1 (Thread 0x7ff5e4e82d00 (LWP 22)):
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ff5e39943fa in __GI_abort () at abort.c:89
#2  0x00007ff5e398be37 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x559ba7370fa3 "0",
    file=file@entry=0x559ba7370f78 "/tmp/src/lib/monkey/deps/flb_libco/amd64.c", line=line@entry=121,
    function=function@entry=0x559ba7370fa5 <__PRETTY_FUNCTION__.2983> "crash") at assert.c:92
#3  0x00007ff5e398bee2 in __GI___assert_fail (assertion=0x559ba7370fa3 "0", file=0x559ba7370f78 "/tmp/src/lib/monkey/deps/flb_libco/amd64.c", line=121,
    function=0x559ba7370fa5 <__PRETTY_FUNCTION__.2983> "crash") at assert.c:101
#4  0x0000559ba72ca32a in crash () at /tmp/src/lib/monkey/deps/flb_libco/amd64.c:121
#5  0x0000000000000000 in ?? ()
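
For reference, the assertion at amd64.c:121 is libco's guard against a coroutine entry function returning instead of co_switch()ing away; the vendored flb_libco source contains roughly the following (paraphrased from upstream libco, so the exact line may differ by version):

static void crash() {
  assert(0);  /* called only if cothread_t entrypoint returns */
}

So the Thread 1 backtrace suggests a fluent-bit coroutine fell off the end of its entry function.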

Core file attached:
https://www.dropbox.com/s/e844qewbr5s1i3q/core-fluent-bit?dl=0

fluent-bit was compiled from master with the ES index strftime patch introduced in PR #512.

edsiper (Member) commented Feb 27, 2018

I've tried to reproduce the "garbage on exit" issue without success.

Would you please provide exact steps to reproduce it?

onorua (Contributor) commented Feb 27, 2018

Unfortunately, I can't reproduce it reliably either; it happens with a reasonably high message rate (more than 2000/sec in our case) and the es output plugin. It happens about once per 24 hours on average.
We have a bunch of other fluent-bit instances with lower message rates, and it doesn't happen there.

onorua (Contributor) commented Feb 27, 2018

I can try to gather any information you need if you tell me what you're interested in. I can provide the core file or anything else you request, if possible.

marians commented Mar 26, 2018

We see this error with fluent-bit 0.13 from the official Docker image fluent/fluent-bit-0.13-dev:0.13. We have the Elasticsearch output plugin in use.

I can't find any evidence that this occurs with throughput spikes. CPU, memory, and throughput have stayed at baseline levels: 230 log entries per minute, coming from 5 fluent-bit instances.

mudrii commented Mar 31, 2018

I have the same issue with both v0.12.14 and v0.12.15.
Running on OpenShift with external Elasticsearch 6.2. All pods in the cluster show the same output.
For example:
oc logs fluent-bit-dbscm
[2018/03/31 03:29:11] [ info] [engine] started
[2018/03/31 03:29:11] [ info] [filter_kube] https=1 host=kubernetes.default.svc port=443
[2018/03/31 03:29:11] [ info] [filter_kube] local POD info OK
[2018/03/31 03:29:11] [ info] [filter_kube] testing connectivity with API server...
[2018/03/31 03:29:12] [ info] [filter_kube] API server connectivity OK
Fluent-Bit v0.12.5
Copyright (C) Treasure Data

"took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took""errors""took"

This affects all pods

edsiper (Member) commented Mar 31, 2018

I've tried to reproduce this problem without success.

When this happens, is there any special condition in the remote Elasticsearch server, like a premature TCP close, a network outage, or similar?

jgsqware commented Mar 31, 2018 via email

edsiper (Member) commented Apr 1, 2018

Does it happen only on shutdown, or also at runtime?

jgsqware commented Apr 1, 2018 via email

intelliguy commented

I run more than 10 fluent-bit instances on k8s clusters, but only one (pod name = fluentbit-x8ddc) generates this message.
I'm not sure what is different.
Once it gets into this state, fluent-bit generates garbage and records it as a log line every three minutes, for some reason.

(screenshot attached)

StevenACoffman (Contributor) commented Apr 5, 2018

@edsiper Using fluent/fluent-bit-0.13-dev:0.14 on certain pods only, I am getting the same error (e.g. 344 restarts over two days):

fluent-bit: /tmp/src/lib/monkey/deps/flb_libco/amd64.c:121: crash: Assertion `0' failed.
Fluent-Bit v0.13.0
Copyright (C) Treasure Data

I am using Fluent-bit without elasticsearch output, but with two http outputs configured.

For instance, instead of fluent-bit-configmap.yaml:

    [OUTPUT]
        Name           http
        Match          kube.*
        Host           127.0.0.1
        Format         json
        Port           3000
        URI            /

    [OUTPUT]
        Name           http
        Match          kube.*
        Host           127.0.0.1
        Format         json
        Port           4000
        URI            /

I made HTTP web services that produce to Kafka and Elasticsearch in order to rule out those output plugins' execution paths, since I could not get meaningful error messages.

The logs aren't helpful, but the time between startup and the last success is very close to the three minutes mentioned by @intelliguy:

[2018/04/05 14:46:18] [ info] [engine] started
[2018/04/05 14:46:18] [ info] [filter_kube] https=1 host=kubernetes.default.svc.cluster.local port=443
[2018/04/05 14:46:18] [ info] [filter_kube] local POD info OK
[2018/04/05 14:46:18] [ info] [filter_kube] testing connectivity with API server...
[2018/04/05 14:46:23] [ info] [filter_kube] API server connectivity OK
[2018/04/05 14:46:23] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
... skipping successes ...
[2018/04/05 14:49:02] [ info] [out_http] HTTP STATUS=200

edsiper (Member) commented Apr 5, 2018

@StevenACoffman @intelliguy @jgsqware

When connecting to Elasticsearch, are you using plain HTTP or HTTPS (TLS)? And which specific Elasticsearch version is each of you using?

StevenACoffman (Contributor) commented Apr 5, 2018

I am not connecting to Elasticsearch at all and still get the error. I am using plain HTTP to talk to another container I added to the pod. Behind one HTTP port is a container that writes log files to AWS S3 storage; behind the other, one that processes the logs and sends them on to Kafka.

Mine may be an unrelated problem, but the error message was very similar, so I thought I would mention it here in the hope that there was a common root cause and this was a clue you could use.

edsiper (Member) commented Apr 6, 2018 via email

jgsqware commented Apr 6, 2018 via email

StevenACoffman (Contributor) commented

Plain HTTP for us, no TLS.

edsiper (Member) commented Apr 6, 2018

Update: I am finally able to reproduce the garbage without stopping the service (kill -15 PID). Work in progress.

edsiper added a commit that referenced this issue Apr 6, 2018
edsiper (Member) commented Apr 6, 2018

I've found the guilty code:

https://github.com/fluent/fluent-bit/blob/master/plugins/out_es/es.c#L409

and its author:

$ git blame ../plugins/out_es/es.c -L409,+1
81320ba8a (Eduardo Silva 2017-07-13 22:41:51 -0600 409)         msgpack_object_print(stdout, key);

It was a debug line that was never removed. It's not easy to catch, since it does an fprintf() to the buffered stdout, so the message only becomes visible when the buffer is eventually flushed.
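
A minimal sketch (not the fluent-bit code itself) of the mechanism, assuming typical stdio behavior: when stdout is redirected to a pipe or file, as in a container, it is fully buffered, so many small debug writes accumulate invisibly and appear in one burst when the buffer fills or the process exits.

#include <stdio.h>

int main(void)
{
    /* Force full buffering, as when stdout is redirected to a pipe or
       file (the usual case for a daemon or a container log stream). */
    setvbuf(stdout, NULL, _IOFBF, BUFSIZ);

    for (int i = 0; i < 500; i++) {
        /* Stand-in for the leftover msgpack_object_print(stdout, key)
           call, which printed each top-level key ("took", "errors", ...)
           of the Elasticsearch bulk response. */
        fprintf(stdout, "\"took\"\"errors\"");
    }

    /* Nothing is visible yet: the buffered bytes are emitted in one
       burst at exit(), interleaved with the unbuffered stderr log
       lines, which matches the pattern reported above. */
    return 0;
}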

Fixed by 02d9505

All: if you see the message amd64.c:121: crash: Assertion `0' failed somewhere after this fix (in the new images to be available), please let me know ASAP. I am not confident that error is associated with the garbage message just fixed.

edsiper (Member) commented Apr 6, 2018

@onorua

regarding the core file provided via Dropbox: do you have the exact fluent-bit binary that you built and that was used to generate the core file?

edsiper self-assigned this Apr 6, 2018
edsiper added a commit that referenced this issue Apr 6, 2018
StevenACoffman (Contributor) commented Apr 7, 2018

@edsiper In my configuration I am not using ElasticSearch output, and with the image fluent/fluent-bit-0.13-dev:0.15 (aka fluent/fluent-bit-0.13-dev@sha256:a4f22b6de839b5ca88e7092f8c238ecd983c1fc73288bdaabdf0ce67e6c3252c) configured (as described above) to use only the http output plugin (twice) over HTTP (no TLS) to a container in the same pod, I just now got:

[2018/04/07 01:44:14] [ info] [out_http] HTTP STATUS=200
[2018/04/07 01:44:15] [ info] [out_http] HTTP STATUS=200
fluent-bit: /tmp/src/lib/monkey/deps/flb_libco/amd64.c:121: crash: Assertion `0' failed.
Fluent-Bit v0.13.0
Copyright (C) Treasure Data

[engine] caught signal
#0  0x7f6defdfa529      in  ???() at ???:0
#1  0x7f6defdf1e66      in  ???() at ???:0
#2  0x7f6defdf1f11      in  ???() at ???:0
#3  0x55994d80c389      in  ???() at ???:0
#4  0xffffffffffffffff  in  ???() at ???:0

Again, this may be a separate issue.

StevenACoffman (Contributor) commented

By the way, I am using this to receive the termination message:

kubectl logs -n logging fluent-bit-k9phx -c fluent-bit --previous

jgsqware commented Apr 7, 2018 via email

StevenACoffman (Contributor) commented

@jgsqware To clarify, you get that error message with fluent/fluent-bit-0.13-dev:0.15, correct?
Also, are you using the ElasticSearch output plugin? Are you also using any other output plugins (such as HTTP)?

edsiper (Member) commented Apr 7, 2018

I've filed #557 to troubleshoot/fix the crash issue. Since the garbage problem is fixed and releases are available, I am closing this ticket:

ref: http://fluentbit.io/announcements/v0.12.17/

edsiper closed this as completed Apr 7, 2018
micklonge pushed a commit to micklonge/elk_fluent-bit that referenced this issue Jun 26, 2018