Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault during shutdown. #613

Closed
m15o opened this issue Mar 29, 2023 · 11 comments
Closed

Segmentation fault during shutdown. #613

m15o opened this issue Mar 29, 2023 · 11 comments

Comments

@m15o
Copy link

m15o commented Mar 29, 2023

Describe the question/issue

Multiple output configuration cause SIGSEGV during shutdown.

Configuration

[SERVICE]
    Daemon             off

[OUTPUT]
    Name               cloudwatch
    log_group_name     /test-1
    log_stream_name    o1
    region             us-west-2

[OUTPUT]
    Name               cloudwatch
    log_group_name     /test-2
    log_stream_name    o2
    region             us-west-2

Fluent Bit Log Output

Fluent Bit v1.9.10
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/03/28 05:30:16] [ info] Configuration:
[2023/03/28 05:30:16] [ info]  flush time     | 1.000000 seconds
[2023/03/28 05:30:16] [ info]  grace          | 5 seconds
[2023/03/28 05:30:16] [ info]  daemon         | 0
[2023/03/28 05:30:16] [ info] ___________
[2023/03/28 05:30:16] [ info]  inputs:
[2023/03/28 05:30:16] [ info] ___________
[2023/03/28 05:30:16] [ info]  filters:
[2023/03/28 05:30:16] [ info] ___________
[2023/03/28 05:30:16] [ info]  outputs:
[2023/03/28 05:30:16] [ info]      cloudwatch.0
[2023/03/28 05:30:16] [ info]      cloudwatch.1
[2023/03/28 05:30:16] [ info] ___________
[2023/03/28 05:30:16] [ info]  collectors:
[2023/03/28 05:30:16] [ info] [fluent bit] version=1.9.10, commit=6d654b0570, pid=16

...

[2023/03/28 05:30:41] [engine] caught signal (SIGTERM)
[2023/03/28 05:30:41] [ warn] [engine] service will shutdown in max 5 seconds
[2023/03/28 05:30:42] [ info] [engine] service has stopped (0 pending tasks)
[2023/03/28 05:30:42] [debug] [GO] running exit callback
[2023/03/28 05:30:42] [engine] caught signal (SIGSEGV)
Aborted (core dumped)

Fluent Bit Version Info

I tried v2.31.7

  • public.ecr.aws/aws-observability/aws-for-fluent-bit:2.31.7
  • Fluent Bit v1.9.10

Cluster Details

It can be reproduced with docker run

Application Details

Steps to reproduce issue

  1. docker run --rm --name fluent-bit -it public.ecr.aws/aws-observability/aws-for-fluent-bit:2.31.7 bash
  2. /fluent-bit/bin/fluent-bit -e /fluent-bit/cloudwatch.so \
         -o cloudwatch -p region=us-west-2 -p log_group_name=/test-1 -p log_stream_name=o1 \
         -o cloudwatch -p region=us-west-2 -p log_group_name=/test-2 -p log_stream_name=o2
  3. kill -TERM <pid> or CTRL-C

Related Issues

@PettitWesley
Copy link
Contributor

docker run --rm --name fluent-bit -it public.ecr.aws/aws-observability/aws-for-fluent-bit:2.31.7 bash

Why run bash command inside the image?

Please explain more. Thanks!

@m15o
Copy link
Author

m15o commented Mar 30, 2023

docker run --rm --name fluent-bit -it public.ecr.aws/aws-observability/aws-for-fluent-bit:2.31.7 bash

Why run bash command inside the image?

You dont need bash. At first I had created the fluent-bit.conf file in heredoc using the reproduction procedure, but when I rewrote it, I must have forgotten to delete bash. Sorry for the distruction.

So the same thing happens with the following procedure:

docker run --rm --name fluent-bit -it public.ecr.aws/aws-observability/aws-for-fluent-bit:2.31.7 \
     /fluent-bit/bin/fluent-bit -e /fluent-bit/cloudwatch.so \
     -o cloudwatch -p region=us-west-2 -p log_group_name=/test-1 -p log_stream_name=o1 \
     -o cloudwatch -p region=us-west-2 -p log_group_name=/test-2 -p log_stream_name=o2

@PettitWesley
Copy link
Contributor

@m15o How quickly does the crash happen? Can you share example logs that you sent? What's the input in this case? I don't see an input...

@PettitWesley
Copy link
Contributor

When I run this myself it does not immediately crash...

@m15o
Copy link
Author

m15o commented Mar 31, 2023

You don't need to feed any input to reproduce.

I tried to reproduce my personal Apple Silicon Mac, the behavior is slightly different from a x86 machine.
You need to set ulimit to generate core file, and might not be able to see Bus error message without shell.

Here is my log:

docker run --name fluent-bit -it public.ecr.aws/aws-observability/aws-for-fluent-bit:2.31.7 bash
bash-4.2# ulimit -c unlimited
bash-4.2# /fluent-bit/bin/fluent-bit -e /fluent-bit/cloudwatch.so \
>      -o cloudwatch -p region=us-west-2 -p log_group_name=/test-1 -p log_stream_name=o1 \
>      -o cloudwatch -p region=us-west-2 -p log_group_name=/test-2 -p log_stream_name=o2
Fluent Bit v1.9.10
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/03/31 23:33:57] [ info] [fluent bit] version=1.9.10, commit=38ae287c8a, pid=7
[2023/03/31 23:33:57] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2023/03/31 23:33:57] [ info] [cmetrics] version=0.3.7
INFO[0000] [cloudwatch 0] plugin parameter log_group_name = '/test-1' 
INFO[0000] [cloudwatch 0] plugin parameter default_log_group_name = 'fluentbit-default' 
INFO[0000] [cloudwatch 0] plugin parameter log_stream_prefix = '' 
INFO[0000] [cloudwatch 0] plugin parameter log_stream_name = 'o1' 
INFO[0000] [cloudwatch 0] plugin parameter default_log_stream_name = '/fluentbit-default' 
INFO[0000] [cloudwatch 0] plugin parameter region = 'us-west-2' 
INFO[0000] [cloudwatch 0] plugin parameter log_key = '' 
INFO[0000] [cloudwatch 0] plugin parameter role_arn = '' 
INFO[0000] [cloudwatch 0] plugin parameter auto_create_group = 'false' 
INFO[0000] [cloudwatch 0] plugin parameter auto_create_stream = 'true' 
INFO[0000] [cloudwatch 0] plugin parameter new_log_group_tags = '' 
INFO[0000] [cloudwatch 0] plugin parameter log_retention_days = '0' 
INFO[0000] [cloudwatch 0] plugin parameter endpoint = '' 
INFO[0000] [cloudwatch 0] plugin parameter sts_endpoint = '' 
INFO[0000] [cloudwatch 0] plugin parameter external_id = '' 
INFO[0000] [cloudwatch 0] plugin parameter credentials_endpoint =  
INFO[0000] [cloudwatch 0] plugin parameter log_format = '' 
INFO[0000] [cloudwatch 1] plugin parameter log_group_name = '/test-2' 
INFO[0000] [cloudwatch 1] plugin parameter default_log_group_name = 'fluentbit-default' 
INFO[0000] [cloudwatch 1] plugin parameter log_stream_prefix = '' 
INFO[0000] [cloudwatch 1] plugin parameter log_stream_name = 'o2' 
INFO[0000] [cloudwatch 1] plugin parameter default_log_stream_name = '/fluentbit-default' 
INFO[0000] [cloudwatch 1] plugin parameter region = 'us-west-2' 
INFO[0000] [cloudwatch 1] plugin parameter log_key = '' 
INFO[0000] [cloudwatch 1] plugin parameter role_arn = '' 
INFO[0000] [cloudwatch 1] plugin parameter auto_create_group = 'false' 
INFO[0000] [cloudwatch 1] plugin parameter auto_create_stream = 'true' 
INFO[0000] [cloudwatch 1] plugin parameter new_log_group_tags = '' 
INFO[0000] [cloudwatch 1] plugin parameter log_retention_days = '0' 
INFO[0000] [cloudwatch 1] plugin parameter endpoint = '' 
INFO[0000] [cloudwatch 1] plugin parameter sts_endpoint = '' 
INFO[0000] [cloudwatch 1] plugin parameter external_id = '' 
INFO[0000] [cloudwatch 1] plugin parameter credentials_endpoint =  
INFO[0000] [cloudwatch 1] plugin parameter log_format = '' 
[2023/03/31 23:33:57] [ info] [sp] stream processor started
^C[2023/03/31 23:33:58] [engine] caught signal (SIGINT)
[2023/03/31 23:33:58] [ warn] [engine] service will shutdown in max 5 seconds
Bus error (core dumped)

@PettitWesley
Copy link
Contributor

As I expected, this problem is caused by the destroy callback running twice, once for each [OUTPUT] definition:

[2023/04/03 23:07:47] [engine] caught signal (SIGTERM)
[2023/04/03 23:07:47] [ warn] [engine] service will shutdown in max 5 seconds
[2023/04/03 23:07:47] [ info] [engine] service has stopped (0 pending tasks)
[2023/04/03 23:07:47] [ info] [go] cb_exit firehose
[2023/04/03 23:07:47] [ info] [go] cb_exit firehose
[2023/04/03 23:07:47] [engine] caught signal (SIGSEGV)

Each time it runs it frees the loaded Go output plugin... so the second time it runs there is a crash.

Need to figure out why it runs twice...

I checked that the go plugin seems to only be registered once...

@PettitWesley
Copy link
Contributor

@m15o

I think I've found a solution, it will take some time to test and clean up: https://github.com/PettitWesley/fluent-bit/commits/go-exit-debug-1_9

When it comes to plugins, there are two concepts:

  1. Output plugin => our cloudwatch plugin for example. Some piece of code that can send data. There's only one plugin named cloudwatch.
  2. Output plugin instance => this corresponds to a config [OUTPUT] section. You can have multiple instances of the cloudwatch plugin.

Basically, on shut down there are two separate steps:

  1. Calling exit on the plugin instance.
  2. De-registering and destroying the actual output plugin.

The current code basically mixes these two together and performs them in one function.

@PettitWesley
Copy link
Contributor

fluent/fluent-bit#7120

@PettitWesley
Copy link
Contributor

Looks like this was already fixed in master, I should've checked that first: fluent/fluent-bit#6469

This was referenced Apr 5, 2023
@PettitWesley
Copy link
Contributor

This will be resolved soon with 2.31.9

@PettitWesley
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants