segfault in 1.7.9 #3687

Closed
andsens opened this issue Jun 25, 2021 · 8 comments · Fixed by #4125

andsens commented Jun 25, 2021

Bug Report

Originally reported in #3661, but it seems to be somewhat different from the getaddrinfo() issue.
When starting fluent-bit, which is configured to forward logs to a fluentd server, the client segfaults after ~15s:

Jun 25 10:16:04 some-hostname td-agent-bit[26483]: [2021/06/25 10:16:04] [ warn] [net] getaddrinfo(host='fluentd.company.tld'): Unknown error
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: [2021/06/25 10:16:04] [engine] caught signal (SIGSEGV)
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #0  0x55ab671d3ad0      in  __mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:87
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #1  0x55ab671d3b07      in  mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:93
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #2  0x55ab671d460b      in  prepare_destroy_conn() at src/flb_upstream.c:390
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #3  0x55ab671d466d      in  prepare_destroy_conn_safe() at src/flb_upstream.c:412
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #4  0x55ab671d4943      in  create_conn() at src/flb_upstream.c:501
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #5  0x55ab671d4e3f      in  flb_upstream_conn_get() at src/flb_upstream.c:640
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #6  0x55ab6723dbb6      in  cb_forward_flush() at plugins/out_forward/forward.c:1183
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #7  0x55ab671bec49      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:470
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #8  0x55ab6766d846      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117

I have observed this on 7 different servers, all running v1.7.9, and they all had quite a few log messages in the queue.
Note that some entries actually make it through to fluentd.

Configuration:

[SERVICE]
  Flush     2
  Daemon    off
  Log_Level info
  Mem_Buf_Limit 10MB

@INCLUDE inputs.d/*.conf
@INCLUDE filters.d/*.conf

[OUTPUT]
  Name  forward
  Match *
  Host  fluentd.company.tld
  Port  24224
  Shared_Key XXX
  tls   on
  tls.ca_file  /etc/fluent-bit/tls/ca.crt
  tls.crt_file /etc/fluent-bit/tls/tls.crt
  tls.key_file /etc/fluent-bit/tls/tls.key

This is on EC2 HVM instances running Debian stretch.

My guess would be that either fluentd starts throttling new connection attempts and fluent-bit is unable to handle that, or that once the bandwidth is sufficiently utilized (because of the backlog of logs being sent), getaddrinfo() becomes slow enough to fail and the resulting error is not handled correctly.

Downgrading to 1.7.8 fixes the problem consistently across all 7 servers.
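
For illustration only (this is not the fluent-bit source, and whether this is exactly what happens at src/flb_upstream.c:390 is an assumption): the crash lands in __mk_list_del(), which unconditionally dereferences the node's prev/next pointers. The simplified, hypothetical reconstruction of the mk_list primitives below shows how tearing down a connection whose list node was never linked (or was already unlinked) turns into a NULL dereference; the fake_conn struct and busy_queue variable are made up for the sketch.

/* Hypothetical, simplified reconstruction of the linked-list primitives in
 * lib/monkey/include/monkey/mk_core/mk_list.h -- for illustration only. */
#include <stdio.h>
#include <string.h>

struct mk_list {
    struct mk_list *prev;
    struct mk_list *next;
};

static void mk_list_init(struct mk_list *list)
{
    /* an empty list points at itself */
    list->prev = list;
    list->next = list;
}

static void mk_list_add(struct mk_list *node, struct mk_list *head)
{
    /* append at the tail of the circular list */
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

static void mk_list_del(struct mk_list *node)
{
    /* like __mk_list_del(): prev/next are dereferenced unconditionally */
    node->prev->next = node->next;   /* NULL deref if node was never linked */
    node->next->prev = node->prev;
    node->prev = NULL;
    node->next = NULL;
}

/* stand-in for a connection object; only the embedded list node matters here */
struct fake_conn {
    struct mk_list _head;
};

int main(void)
{
    struct mk_list busy_queue;
    struct fake_conn ok, broken;

    mk_list_init(&busy_queue);
    memset(&ok, 0, sizeof(ok));
    memset(&broken, 0, sizeof(broken));

    /* normal lifecycle: link the connection, then unlink it on destroy */
    mk_list_add(&ok._head, &busy_queue);
    mk_list_del(&ok._head);

    /* failure sketch: if setup fails early (e.g. getaddrinfo returns an
     * error) and cleanup still unlinks a node that was never linked, or
     * unlinks the same node twice, prev/next are NULL */
    mk_list_del(&broken._head);      /* SIGSEGV here */

    puts("not reached");
    return 0;
}

Built and run, the last mk_list_del() dies with SIGSEGV inside the delete primitive, which is at least consistent with the top frame (mk_core/mk_list.h) in the backtrace above.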

sossickd commented Jul 8, 2021

Just to confirm, we are also experiencing this: our Fluent-bit pods enter a CrashLoop a few minutes after connecting to the fluentd service in the cluster.

Fluent-bit helm chart version: 0.15.15
Image: 1.7.9
Environment: AKS
Kubernetes: 1.19.11

kubectl logs fluent-bit-xxxxx --previous -n logging

[2021/07/08 07:33:53] [ warn] [net] getaddrinfo(host='fluentd.logging.svc'): Unknown error
[2021/07/08 07:33:53] [engine] caught signal (SIGSEGV)
#0  0x55ae38c24e1b      in  prepare_destroy_conn_safe() at src/flb_upstream.c:408
#1  0x55ae38c25114      in  create_conn() at src/flb_upstream.c:501
#2  0x55ae38c25576      in  flb_upstream_conn_get() at src/flb_upstream.c:640
#3  0x55ae38c8dc9e      in  cb_forward_flush() at plugins/out_forward/forward.c:1183
#4  0x55ae38c0f800      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:470
#5  0x55ae390c2046      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#6  0xffffffffffffffff  in  ???() at ???:0

Reverting the helm chart to version 0.15.14 (image version 1.7.8) resolves the issue.

Fluent-bit config:

config:
  filters: |
    [FILTER]
      Name kubernetes
      Match kube.*
      Merge_Log On
      Merge_Log_Key log_processed
      Merge_Log_Trim On
      Keep_Log Off
      K8S-Logging.Parser On
      K8S-Logging.Exclude On

    [FILTER]
      Name          nest
      Match         kube.*
      Operation     lift
      Nested_under  kubernetes
      Add_prefix    kubernetes_

    [FILTER]
      Name          nest
      Match         kube.*
      Operation     lift
      Nested_under  kubernetes_labels
      Add_prefix    kubernetes_labels_

    [FILTER]
      Name          nest
      Match         kube.*
      Operation     lift
      Nested_under  kubernetes_annotations
      Add_prefix    kubernetes_annotations_
  inputs: |
    [INPUT]
      Name tail
      Path /var/log/containers/*.log
      Parser cri
      Tag kube.*
      Skip_Long_Lines On
      Buffer_Chunk_Size 32k
      Buffer_Max_Size 256k
      DB /var/log/flb-storage/tail.db
      DB.Sync normal
      storage.type  filesystem

    [INPUT]
      Name systemd
      Systemd_Filter _SYSTEMD_UNIT=docker.service
      Systemd_Filter _SYSTEMD_UNIT=containerd.service
      Systemd_Filter _SYSTEMD_UNIT=kubelet.service
      Tag host.*
      Strip_Underscores On
      DB /var/log/flb-storage/systemd.db
      DB.Sync normal
      storage.type  filesystem
  outputs: |
    [OUTPUT]
      Name forward
      Match *
      Host fluentd.logging.svc
      Port 24224
  service: |
    [SERVICE]
      Daemon Off
      Flush 1
      Log_Level info
      HTTP_Server On
      HTTP_Listen 0.0.0.0
      HTTP_Port 2020
      storage.path /var/log/flb-storage/
      storage.sync normal
      storage.checksum off
      storage.max_chunks_up 128
      storage.backlog.mem_limit 16M
      storage.metrics on
      Parsers_File parsers.conf
      Parsers_File custom_parsers.conf

github-actions bot commented Aug 8, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Aug 8, 2021
andsens commented Aug 8, 2021

/remove-lifecycle stale

github-actions bot removed the Stale label Aug 9, 2021
@tlefevre

We're experiencing this issue as well.

github-actions bot commented Oct 1, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Oct 1, 2021
andsens commented Oct 1, 2021

/remove-lifecycle stale

github-actions bot removed the Stale label Oct 5, 2021
github-actions bot commented Nov 5, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Nov 5, 2021
@github-actions

This issue was closed because it has been stalled for 5 days with no activity.
