Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fluent-bit pod generate core dump every time the pod is restarted after migrating to 1.4.2 #2127

Closed
huzhou opened this issue Apr 22, 2020 · 17 comments
Assignees
Labels
waiting-for-user Waiting for more information, tests or requested changes

Comments

@huzhou
Copy link

huzhou commented Apr 22, 2020

Bug Report

Describe the bug
A core dump is generated every time the fluent-bit pod is restarted/shutdown, after we migrate to fluent-bit 1.4.2.

To Reproduce

  • Example log message if applicable:
{"log":"Generated core dump file: /var/lib/systemd/coredump/core.fluent-bit.1003060000.457ffee6fdf34f108294e8ab87adc013.74848.1587581403000000.xz","stream":"stdout","time":"2020-04-22T18:50:07.407147Z"}
  • Steps to reproduce the problem:
  • Need to have some sort of core-dump monitoring to monitor the core dump generated by killing a pod. In our case, we have setup the linux to call systemd coredump every time a container process is cored so that we won't lose core from pod/container restart.
  • Take the fluent-bit 1.4.2 official docker image as base image, use in_tail as input plugin, use a customer output plugin written in Golang using fluent-bit-go. Create a docker image and deployed in a kubenetes/openshift cluster. After the pod is up, simply do a kubectl delete pod <fluent-bit-pod>
    Expected behavior
    Pod restart/shutdown properly without core dump generated.

Screenshots

Your Environment

  • Version used: fluent-bit 1.4.2
  • Configuration:
    [SERVICE]
    Flush 1
    Daemon off
    Log_Level debug

[INPUT]
Name tail
Path /etc/fluent-bit/fdf/fluent-bit_fdf*
Refresh_Interval 2

[OUTPUT]
Name fdf_prom_plugin
Match *
where fdf_prom_plugin is written using fluent-bit-go

  • Environment name and version (e.g. Kubernetes? What version?):OpenShift 3.7/3.11(Kubernetes 1.7/1.11)
  • Server type and version:
  • Operating System and version:Red Hat Enterprise Linux 7.4 and above
  • Filters and plugins: see above

Additional context
I have the stacktrace of the core dump, it seems like that the signal handling has some issue or bug. Here is the stack trace:
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /fluent-bit/bin/fluent-bit...done.
[New LWP 1]
[New LWP 10]
[New LWP 12]
[New LWP 7]
[New LWP 8]
[New LWP 9]
[New LWP 11]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by /fluent-bit/bin/fluent-bit -c /etc/fluent-bit/config.flb -e /etc/plugin/out_plu'. Program terminated with signal 11, Segmentation fault. #0 0x00007f6c75ef2a17 in abort () from /lib64/libc.so.6 warning: Missing auto-load scripts referenced in section .debug_gdb_scripts of file /etc/plugin/out_plugin.so Use info auto-load python [REGEXP]' to list them.
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7.x86_64 libgcc-4.8.5-11.el7.x86_64
(gdb) bt
#0 0x00007f6c75ef2a17 in abort () from /lib64/libc.so.6
#1 0x000000000042b374 in flb_signal_handler ()
#2 0x00007f6c752bd81d in runtime.sigfwd () at /usr/lib/golang/src/runtime/sys_linux_amd64.s:286
#3 0x00007fffe7de3cd8 in ?? ()
#4 0x00007f6c752a47fb in runtime.sigfwdgo (sig=11, info=0x7fffe7de3f30, ctx=0x7fffe7de3e00, ~r3=false) at /usr/lib/golang/src/runtime/signal_unix.go:630
#5 0x00007f6c752a3b1b in runtime.sigtrampgo (sig=11, info=0x7fffe7de3f30, ctx=0x7fffe7de3e00) at /usr/lib/golang/src/runtime/signal_unix.go:272
#6 0x00007f6c752bd873 in runtime.sigtramp () at /usr/lib/golang/src/runtime/sys_linux_amd64.s:306
#7
#8 0x0000000000000000 in ?? ()
#9 0x00000000004578af in flb_proxy_cb_exit ()
#10 0x000000000043cee4 in flb_output_exit ()
#11 0x00000000004476e0 in flb_engine_shutdown ()
#12 0x0000000000447532 in flb_engine_start ()
#13 0x000000000042c4ee in main ()

@edsiper
Copy link
Member

edsiper commented Apr 22, 2020

is that golang plugin open source and available somewhere to see the code ?

@edsiper edsiper added the waiting-for-user Waiting for more information, tests or requested changes label Apr 22, 2020
@huzhou
Copy link
Author

huzhou commented Apr 22, 2020

that golang plugin is not opensourced, but it worked when I use fluent-bit 1.1.0. Actually, you can use the example from fluent-bit-go repo in github to reproduce..

@edsiper
Copy link
Member

edsiper commented Apr 22, 2020

I tried on 1.1 and 1.4 and I see no issues on SIGTERM:

$ bin/fluent-bit -e ../../coding/fluent-bit-go/examples/out_gstdout/out_gstdout.so -i cpu -o gstdout -f 1
Fluent Bit v1.4.3
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/04/22 15:52:44] [ info] [storage] version=1.0.3, initializing...
[2020/04/22 15:52:44] [ info] [storage] in-memory
[2020/04/22 15:52:44] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/04/22 15:52:44] [ info] [engine] started (pid=1246641)
[flb-go] plugin parameter = ''
[2020/04/22 15:52:44] [ info] [sp] stream processor started
[0] cpu.0: [2020-04-22 15:52:45.000234088 -0600 CST, {"cpu0.p_system": 0, "cpu2.p_system": 1, "cpu4.p_system": 1, "cpu5.p_cpu": 1, "user_p": 0.5714285714285714, "cpu1.p_user": 1, "cpu3.p_system": 0, "cpu4.p_user": 0, "cpu_p": 1.1428571428571428, "system_p": 0.5714285714285714, "cpu0.p_cpu": 1, "cpu2.p_cpu": 2, "cpu2.p_user": 1, "cpu4.p_cpu": 1, "cpu5.p_user": 0, "cpu5.p_system": 1, "cpu0.p_user": 1, "cpu1.p_cpu": 2, "cpu1.p_system": 1, "cpu3.p_cpu": 1, "cpu3.p_user": 1, "cpu6.p_cpu": 3, "cpu6.p_user": 1, "cpu6.p_system": 2, }
[0] cpu.0: [2020-04-22 15:52:46.000733516 -0600 CST, {"cpu2.p_system": 0, "cpu3.p_user": 1, "cpu3.p_system": 0, "cpu4.p_user": 0, "cpu6.p_cpu": 1, "cpu0.p_user": 1, "cpu1.p_cpu": 14, "cpu2.p_cpu": 0, "cpu6.p_system": 0, "cpu6.p_user": 1, "cpu0.p_cpu": 1, "cpu2.p_user": 0, "cpu5.p_cpu": 0, "cpu4.p_cpu": 2, "cpu4.p_system": 2, "cpu5.p_user": 0, "cpu5.p_system": 0, "user_p": 2.0000000000000004, "cpu0.p_system": 0, "cpu1.p_user": 10, "cpu3.p_cpu": 1, "cpu_p": 2.857142857142857, "system_p": 0.8571428571428571, "cpu1.p_system": 4, }
[0] cpu.0: [2020-04-22 15:52:47.000435377 -0600 CST, {"cpu_p": 1.5714285714285714, "system_p": 0.5714285714285714, "cpu0.p_system": 1, "cpu2.p_system": 0, "cpu3.p_user": 0, "cpu4.p_user": 1, "cpu6.p_cpu": 0, "cpu6.p_user": 0, "cpu6.p_system": 0, "user_p": 1.0000000000000002, "cpu0.p_user": 1, "cpu1.p_cpu": 2, "cpu1.p_user": 2, "cpu4.p_system": 1, "cpu0.p_cpu": 2, "cpu2.p_user": 4, "cpu3.p_cpu": 0, "cpu5.p_cpu": 1, "cpu5.p_system": 0, "cpu1.p_system": 0, "cpu2.p_cpu": 4, "cpu3.p_system": 0, "cpu4.p_cpu": 2, "cpu5.p_user": 1, }
[0] cpu.0: [2020-04-22 15:52:48.000482016 -0600 CST, {"cpu5.p_system": 2, "cpu3.p_user": 13, "cpu5.p_cpu": 15, "cpu0.p_cpu": 14, "cpu0.p_user": 13, "cpu1.p_cpu": 13, "cpu1.p_system": 1, "cpu2.p_cpu": 13, "cpu2.p_user": 13, "cpu_p": 13.714285714285714, "user_p": 12.571428571428571, "cpu6.p_system": 1, "cpu4.p_user": 12, "cpu4.p_system": 3, "cpu1.p_user": 12, "cpu2.p_system": 0, "cpu3.p_cpu": 14, "cpu3.p_system": 1, "cpu4.p_cpu": 15, "cpu5.p_user": 13, "cpu6.p_cpu": 13, "cpu6.p_user": 12, "system_p": 1.1428571428571428, "cpu0.p_system": 1, }
[0] cpu.0: [2020-04-22 15:52:49.000516386 -0600 CST, {"cpu1.p_system": 0, "cpu2.p_system": 0, "cpu5.p_user": 0, "cpu6.p_system": 1, "cpu_p": 1.5714285714285714, "user_p": 1.2857142857142858, "cpu0.p_cpu": 0, "cpu0.p_system": 0, "cpu2.p_user": 1, "cpu3.p_user": 2, "cpu4.p_user": 3, "system_p": 0.2857142857142857, "cpu0.p_user": 0, "cpu4.p_cpu": 3, "cpu4.p_system": 0, "cpu3.p_system": 0, "cpu5.p_cpu": 0, "cpu5.p_system": 0, "cpu6.p_cpu": 3, "cpu1.p_cpu": 1, "cpu1.p_user": 1, "cpu2.p_cpu": 1, "cpu3.p_cpu": 2, "cpu6.p_user": 2, }
[1] cpu.0: [2020-04-22 15:52:50.000170897 -0600 CST, {"cpu1.p_system": 0, "cpu2.p_system": 1, "cpu4.p_user": 3, "cpu5.p_user": 1, "cpu6.p_system": 0, "cpu5.p_system": 1, "user_p": 1.5714285714285714, "system_p": 0.5714285714285714, "cpu0.p_cpu": 1, "cpu0.p_system": 0, "cpu1.p_user": 2, "cpu3.p_user": 1, "cpu_p": 2.142857142857143, "cpu0.p_user": 1, "cpu1.p_cpu": 2, "cpu4.p_system": 1, "cpu6.p_cpu": 1, "cpu6.p_user": 1, "cpu2.p_cpu": 2, "cpu2.p_user": 1, "cpu3.p_cpu": 2, "cpu3.p_system": 1, "cpu4.p_cpu": 4, "cpu5.p_cpu": 2, }
[0] cpu.0: [2020-04-22 15:52:51.000155505 -0600 CST, {"cpu2.p_system": 0, "cpu4.p_cpu": 2, "cpu4.p_system": 0, "cpu5.p_system": 0, "cpu_p": 1.7142857142857142, "user_p": 1.2857142857142858, "system_p": 0.42857142857142855, "cpu0.p_user": 1, "cpu6.p_user": 1, "cpu5.p_user": 0, "cpu0.p_cpu": 1, "cpu1.p_user": 1, "cpu1.p_system": 2, "cpu4.p_user": 2, "cpu5.p_cpu": 0, "cpu0.p_system": 0, "cpu1.p_cpu": 3, "cpu2.p_cpu": 1, "cpu3.p_cpu": 4, "cpu6.p_system": 1, "cpu2.p_user": 1, "cpu3.p_user": 4, "cpu3.p_system": 0, "cpu6.p_cpu": 2, }
[engine] caught signal (SIGTERM)
[2020/04/22 15:52:51] [ info] [input] pausing cpu.0
[2020/04/22 15:52:51] [ warn] [engine] service will stop in 5 seconds
[2020/04/22 15:52:55] [ info] [engine] service stopped

I am happy to help if you provide a simplified plugin (and its source) to reproduce the issue.

@huzhou
Copy link
Author

huzhou commented Apr 22, 2020

Indeed, I just downloaded some example and I am testing with fluent-bit 1.4.2. There is no such error. So the problem is narrowed down to my plugin that is probably not working well. I will share my finding in this thread tomorrow, if I find the problem. Otherwise I will create a simplified version of my plugin and ask for your help @edsiper

Thank you for your support

@huzhou
Copy link
Author

huzhou commented Apr 23, 2020

I was wondering if my problem is related to this change: #1421. As the core dump stack trace is actually from flb_proxy_cb_exit ().
I am asking because the same plugin have no issue when using fluent-bit 1.1.0, I was wondering if there is any changes in the flb_proxy_cb_exit() could cause the incompatibility with my plugin code

@huzhou
Copy link
Author

huzhou commented Apr 23, 2020

actually, this is the log I have for the cored one:
Fluent Bit v1.4.2

  • Copyright (C) 2019-2020 The Fluent Bit Authors
  • Copyright (C) 2015-2018 Treasure Data
  • Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
  • https://fluentbit.io

[2020/04/23 17:21:52] [ info] Configuration:
[2020/04/23 17:21:52] [ info] flush time | 1.000000 seconds
[2020/04/23 17:21:52] [ info] grace | 5 seconds
[2020/04/23 17:21:52] [ info] daemon | 0
[2020/04/23 17:21:52] [ info] ___________
[2020/04/23 17:21:52] [ info] inputs:
[2020/04/23 17:21:52] [ info] tail
[2020/04/23 17:21:52] [ info] ___________
[2020/04/23 17:21:52] [ info] filters:
[2020/04/23 17:21:52] [ info] ___________
[2020/04/23 17:21:52] [ info] outputs:
[2020/04/23 17:21:52] [ info] fdf_prom_endpoint.0
[2020/04/23 17:21:52] [ info] ___________
[2020/04/23 17:21:52] [ info] collectors:
[2020/04/23 17:21:52] [debug] [storage] [cio stream] new stream registered: tail.0
[2020/04/23 17:21:52] [ info] [storage] version=1.0.3, initializing...
[2020/04/23 17:21:52] [ info] [storage] in-memory
[2020/04/23 17:21:52] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/04/23 17:21:52] [ info] [engine] started (pid=1)
[2020/04/23 17:21:52] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/04/23 17:21:52] [debug] [input:tail:tail.0] inotify watch fd=19
[2020/04/23 17:21:52] [debug] [input:tail:tail.0] scanning path /etc/fluent-bit/fdf/fluent-bit_fdf*
[2020/04/23 17:21:52] [error] [input:tail:tail.0] read error, check permissions: /etc/fluent-bit/fdf/fluent-bit_fdf*
[2020/04/23 17:21:52] [debug] [router] match rule tail.0:fdf_prom_endpoint.0
[2020/04/23 17:21:52] [ info] [sp] stream processor started
^C[engine] caught signal (SIGINT)
[2020/04/23 17:21:57] [ info] [input] pausing tail.0
[engine] caught signal (SIGSEGV)
[2020/04/23 17:21:57] [debug] [GO] running exit callback

And this is the one without core:
Fluent Bit v1.4.2

  • Copyright (C) 2019-2020 The Fluent Bit Authors
  • Copyright (C) 2015-2018 Treasure Data
  • Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
  • https://fluentbit.io

[2020/04/23 17:22:59] [ info] [storage] version=1.0.3, initializing...
[2020/04/23 17:22:59] [ info] [storage] in-memory
[2020/04/23 17:22:59] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/04/23 17:22:59] [ info] [engine] started (pid=1)
[flb-go] plugin parameter = ''
[2020/04/23 17:22:59] [error] [input:tail:tail.0] read error, check permissions: /tmp/log/fluent-bit-events_info_oxf.log*
[2020/04/23 17:22:59] [ info] [sp] stream processor started
^C[engine] caught signal (SIGINT)
[2020/04/23 17:23:06] [ info] [input] pausing tail.0

The main difference is at the exit time, for fluent-bit 1.4.2, there is another SIGSEGV which I think is the cause of the core. However, the one for 1.1.0 doesn't have that SIGSEGV. They are tested with the exactly same out_plugin.

@PettitWesley PettitWesley self-assigned this Apr 23, 2020
@PettitWesley
Copy link
Contributor

PettitWesley commented Apr 23, 2020

@huzhou The changes from #1421 haven't affected any of the AWS go plugins, which I maintain.

Looking at the stacktrace, I think this is a bug in your Go code. Before 1.4, the go plugin exit callback was not actually called. It would never be executed. So if your Go code had a bug, it would not show up because it was never run. Starting with 1.4, it will be run, so that is probably why you are seeing an issue now.

@huzhou
Copy link
Author

huzhou commented Apr 23, 2020

@PettitWesley Thanks for the info. Can you please share a bit more info on how the exit callback is working? In my code, I just simply return a output.FLB_OK. This will help me to understand and troubleshoot my code. Thank you!

@huzhou
Copy link
Author

huzhou commented Apr 23, 2020

Hi @PettitWesley and @edsiper
I just want to give you an update on my issue. I found out what is the bug in my code. Basically I am compiling the plugin using -buildmode=c-shared. And I forgot to put the cgo comment //export FLBPluginExit for FLBPluginExit(). After I add the comment and build again, the issue is solved.

@PettitWesley
Copy link
Contributor

@huzhou Awesome! Can we close the issue?

@huzhou
Copy link
Author

huzhou commented Apr 23, 2020

Yes

@huzhou huzhou closed this as completed Apr 23, 2020
@edsiper
Copy link
Member

edsiper commented Apr 23, 2020

Actually it is a bug in OUR code.

If the golang function is not exporting the symbol will not be resolved in our side, leading to have plugin->cb_exit equals NULL, but our code was always trying to invoke that function but there is no function reference.

I've pushed a fix into a test branch I am working now that will be merged shortly. The fix is here: 379b244

@huzhou
Copy link
Author

huzhou commented Apr 23, 2020

Do you think this fix can potentially be applied to all the register, initialize and flush steps of the plugin?

@edsiper
Copy link
Member

edsiper commented Apr 23, 2020

yeah found another place where there is no check...

ideally, the fix must do the checks and make sure at least one flush() callback type is defined, otherwise abort and let the user know about it

anyone ?

@huzhou
Copy link
Author

huzhou commented Apr 23, 2020

sounds good, actually, if there is no out plugin registered, you can probably already abort. If the plugin is registered, then it must be initialized and flushable.
How is that sound?
(Sorry if that sounds silly to you, as I am not familiar with the fluent-bit internal implementation, maybe this is already what you have today :D)

@edsiper
Copy link
Member

edsiper commented Apr 23, 2020

initialization check is there, but no for flush

@PettitWesley
Copy link
Contributor

If the plugin is registered, then it must be initialized and flushable.
ideally, the fix must do the checks and make sure at least one flush() callback type is defined, otherwise abort and let the user know about it

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting-for-user Waiting for more information, tests or requested changes
Projects
None yet
Development

No branches or pull requests

3 participants