Unrecoverable error "caught signal (SIGSEGV)" in the forward output #3940

Closed
panaji opened this issue Aug 11, 2021 · 18 comments


panaji commented Aug 11, 2021

Bug Report

Describe the bug
I'm seeing this issue with the forward output plugin, and restarting Fluent Bit doesn't fix it. To mitigate, I have to temporarily change the output to null and then revert it. I was using v1.7.9 and updated the image to v1.8.3 on the fly (and still saw this issue).

To Reproduce

  • Example log message if applicable:
[2021/08/10 18:36:01] [error] [upstream] connection #-1 to fluentd.pipeline:24224 timed out after 10 seconds
[2021/08/10 18:36:01] [error] [upstream] connection #-1 to fluentd.pipeline:24224 timed out after 10 seconds
[2021/08/10 18:36:01] [engine] caught signal (SIGSEGV)
#0  0x55dd33597564      in  mk_event_add() at lib/monkey/mk_core/mk_event.c:96
#1  0x55dd330b6f22      in  net_connect_async() at src/flb_network.c:369
#2  0x55dd330b7bf2      in  flb_net_tcp_connect() at src/flb_network.c:832
#3  0x55dd330dd254      in  flb_io_net_connect() at src/flb_io.c:89
#4  0x55dd330c2eb1      in  create_conn() at src/flb_upstream.c:497
#5  0x55dd330c337b      in  flb_upstream_conn_get() at src/flb_upstream.c:640
#6  0x55dd3313e726      in  cb_forward_flush() at plugins/out_forward/forward.c:1183
#7  0x55dd330ad0de      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:490
#8  0x55dd335999a6      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9  0x7fcce18671f5      in  ???() at ???:0
  • Steps to reproduce the problem:
    Not sure how to reproduce this, but I have seen it a few times.

Expected behavior
Fluent Bit should be able to recover gracefully.

Screenshots

Your Environment

  • Version used: v1.7.9/v1.8.3
  • Configuration:
[SERVICE]
    Flush                     1
    Log_Level                 info
    Parsers_File              /fluent-bit/etc/parsers.conf
    Parsers_File              /forwarder/etc/parsers_custom.conf
    Plugins_File              /fluent-bit/etc/plugins.conf
    HTTP_Server               On
    storage.path              /var/log/flb-storage/
    storage.max_chunks_up     128
    storage.backlog.mem_limit 256M
    storage.metrics           on
[INPUT]
    Name              tail
    Tag               kubernetes.*
    Path              /var/log/containers/*.log
    Parser            cri
    DB                /var/log/flb-tail.db
    DB.sync           normal
    Refresh_Interval  15
    Read_from_Head    On
    Buffer_Chunk_Size 128K
    Buffer_Max_Size   128K
    Skip_Long_Lines   On
    Mem_Buf_Limit     256M
    storage.type      filesystem
[FILTER]
    Name                kubernetes
    Match               kubernetes.var.log.containers.*
    Kube_Tag_Prefix     kubernetes.var.log.containers.
    Annotations         Off
    K8S-Logging.Exclude On
[OUTPUT]
    Name                       forward
    Match                      kubernetes.*
    Host                       aggregator
    Port                       24224
    Retry_Limit                False
    Require_ack_response       True
    storage.total_limit_size   16G
    net.keepalive              on
    net.keepalive_max_recycle  300
  • Environment name and version (e.g. Kubernetes? What version?): 1.19.x
  • Server type and version:
  • Operating System and version:
  • Filters and plugins: tail, kubernetes, forward

Additional context

From @edsiper: the Fluent Bit team is triaging a similar issue (Slack thread).


panaji commented Aug 12, 2021

Here's the complete debug log:
fluentbit-SIGSEGV-debug.log


panaji commented Aug 13, 2021

The mitigation I found was:

  • Delete the filesystem storage: rm -rf /var/log/flb-storage
  • Restart Fluent Bit

Not great, but simpler than changing the forward output to null and reverting it back (a sketch of that swap is below).
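
For reference, that temporary swap is just replacing the forward [OUTPUT] block with the null plugin and then restoring the forward block afterwards. A minimal sketch, assuming the same Match pattern as the forward output in my config above:

[OUTPUT]
    Name    null
    Match   kubernetes.*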

@senior88oqz

Hey @panaji, we're experiencing the same issue here with Fluent Bit 1.8.3. Have you tried any other 1.8.x versions?


panaji commented Aug 13, 2021

@senior88oqz, I tried 1.7.9 and 1.8.3 (the current latest) and both have the same issue, so I think anything in between would be the same.

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Sep 13, 2021

panaji commented Sep 13, 2021

unstale

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Oct 15, 2021

panaji commented Oct 15, 2021

unstale

github-actions bot removed the Stale label Oct 16, 2021
@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Nov 16, 2021

panaji commented Nov 16, 2021

unstale

@github-actions

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions bot added the Stale label Feb 25, 2022

panaji commented Feb 28, 2022

unstale

agup006 removed the Stale label Mar 1, 2022

agup006 commented Mar 1, 2022

Removing the stale label. Is this reproducible with the latest 1.8.12 and 1.9? I'm wondering if this might be related to bad chunks being created.


panaji commented Mar 1, 2022

I still see this in 1.8.7. We only recently deployed 1.8.12, but it hasn't been running long enough to know whether the issue still exists.


agup006 commented Mar 1, 2022 via email

@nokute78

Note: The backtrace is similar to #4107 (comment)

@github-actions

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions bot added the Stale label Jun 11, 2022
@github-actions

This issue was closed because it has been stalled for 5 days with no activity.
