segfault in 1.7.9 #3687

Closed
andsens opened this issue Jun 25, 2021 · 8 comments · Fixed by #4125

andsens commented Jun 25, 2021

Bug Report

Originally reported in #3661, but it seems to be somewhat different from the getaddrinfo() issue.
When starting fluent-bit, which is configured to forward logs to a fluentd server, the client segfaults after ~15s:

Jun 25 10:16:04 some-hostname td-agent-bit[26483]: [2021/06/25 10:16:04] [ warn] [net] getaddrinfo(host='fluentd.company.tld'): Unknown error
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: [2021/06/25 10:16:04] [engine] caught signal (SIGSEGV)
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #0  0x55ab671d3ad0      in  __mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:87
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #1  0x55ab671d3b07      in  mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:93
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #2  0x55ab671d460b      in  prepare_destroy_conn() at src/flb_upstream.c:390
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #3  0x55ab671d466d      in  prepare_destroy_conn_safe() at src/flb_upstream.c:412
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #4  0x55ab671d4943      in  create_conn() at src/flb_upstream.c:501
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #5  0x55ab671d4e3f      in  flb_upstream_conn_get() at src/flb_upstream.c:640
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #6  0x55ab6723dbb6      in  cb_forward_flush() at plugins/out_forward/forward.c:1183
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #7  0x55ab671bec49      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:470
Jun 25 10:16:04 some-hostname td-agent-bit[26483]: #8  0x55ab6766d846      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117

I have observed this on 7 different servers, all running v1.7.9, and they all had quite a few log messages in the queue.
Note that some entries actually make it through to fluentd.

Configuration:

[SERVICE]
  Flush     2
  Daemon    off
  Log_Level info
  Mem_Buf_Limit 10MB

@INCLUDE inputs.d/*.conf
@INCLUDE filters.d/*.conf

[OUTPUT]
  Name  forward
  Match *
  Host  fluentd.company.tld
  Port  24224
  Shared_Key XXX
  tls   on
  tls.ca_file  /etc/fluent-bit/tls/ca.crt
  tls.crt_file /etc/fluent-bit/tls/tls.crt
  tls.key_file /etc/fluent-bit/tls/tls.key

This is on EC2 HVM instances running Debian stretch.

My guess would be that either fluentd starts throttling new connection attempts and fluent-bit is unable to handle that, or that once the bandwidth is sufficiently utilized (because of the backlog of logs being sent), getaddrinfo() becomes slow enough to fail and the resulting error is not handled correctly.

Downgrading to 1.7.8 fixes the problem consistently across all 7 servers.
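
For illustration only (this is not the fluent-bit source, and whether this is exactly what happens at src/flb_upstream.c:390 is an assumption): the crash lands in __mk_list_del(), which unconditionally dereferences the node's prev/next pointers. The simplified, hypothetical reconstruction of the mk_list primitives below shows how tearing down a connection whose list node was never linked (or was already unlinked) turns into a NULL dereference; the fake_conn struct and busy_queue variable are made up for the sketch.

/* Hypothetical, simplified reconstruction of the linked-list primitives in
 * lib/monkey/include/monkey/mk_core/mk_list.h -- for illustration only. */
#include <stdio.h>
#include <string.h>

struct mk_list {
    struct mk_list *prev;
    struct mk_list *next;
};

static void mk_list_init(struct mk_list *list)
{
    /* an empty list points at itself */
    list->prev = list;
    list->next = list;
}

static void mk_list_add(struct mk_list *node, struct mk_list *head)
{
    /* append at the tail of the circular list */
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

static void mk_list_del(struct mk_list *node)
{
    /* like __mk_list_del(): prev/next are dereferenced unconditionally */
    node->prev->next = node->next;   /* NULL deref if node was never linked */
    node->next->prev = node->prev;
    node->prev = NULL;
    node->next = NULL;
}

/* stand-in for a connection object; only the embedded list node matters here */
struct fake_conn {
    struct mk_list _head;
};

int main(void)
{
    struct mk_list busy_queue;
    struct fake_conn ok, broken;

    mk_list_init(&busy_queue);
    memset(&ok, 0, sizeof(ok));
    memset(&broken, 0, sizeof(broken));

    /* normal lifecycle: link the connection, then unlink it on destroy */
    mk_list_add(&ok._head, &busy_queue);
    mk_list_del(&ok._head);

    /* failure sketch: if setup fails early (e.g. getaddrinfo returns an
     * error) and cleanup still unlinks a node that was never linked, or
     * unlinks the same node twice, prev/next are NULL */
    mk_list_del(&broken._head);      /* SIGSEGV here */

    puts("not reached");
    return 0;
}

Built and run, the last mk_list_del() dies with SIGSEGV inside the delete primitive, which is at least consistent with the top frame (mk_core/mk_list.h) in the backtrace above.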

sossickd commented Jul 8, 2021

Just to confirm, we are also experiencing this: our Fluent-bit pods enter a CrashLoop a few minutes after connecting to the fluentd service in the cluster.

Fluent-bit helm chart version: 0.15.15
Image: 1.7.9
Environment: AKS
Kubernetes: 1.19.11

kubectl logs fluent-bit-xxxxx --previous -n logging

[2021/07/08 07:33:53] [ warn] [net] getaddrinfo(host='fluentd.logging.svc'): Unknown error
[2021/07/08 07:33:53] [engine] caught signal (SIGSEGV)
#0  0x55ae38c24e1b      in  prepare_destroy_conn_safe() at src/flb_upstream.c:408
#1  0x55ae38c25114      in  create_conn() at src/flb_upstream.c:501
#2  0x55ae38c25576      in  flb_upstream_conn_get() at src/flb_upstream.c:640
#3  0x55ae38c8dc9e      in  cb_forward_flush() at plugins/out_forward/forward.c:1183
#4  0x55ae38c0f800      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:470
#5  0x55ae390c2046      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#6  0xffffffffffffffff  in  ???() at ???:0

Reverting the helm chart to version 0.15.14 (image version 1.7.8) resolves the issue.

Fluent-bit config:

config:
  filters: |
    [FILTER]
      Name kubernetes
      Match kube.*
      Merge_Log On
      Merge_Log_Key log_processed
      Merge_Log_Trim On
      Keep_Log Off
      K8S-Logging.Parser On
      K8S-Logging.Exclude On

    [FILTER]
      Name          nest
      Match         kube.*
      Operation     lift
      Nested_under  kubernetes
      Add_prefix    kubernetes_

    [FILTER]
      Name          nest
      Match         kube.*
      Operation     lift
      Nested_under  kubernetes_labels
      Add_prefix    kubernetes_labels_

    [FILTER]
      Name          nest
      Match         kube.*
      Operation     lift
      Nested_under  kubernetes_annotations
      Add_prefix    kubernetes_annotations_
  inputs: |
    [INPUT]
      Name tail
      Path /var/log/containers/*.log
      Parser cri
      Tag kube.*
      Skip_Long_Lines On
      Buffer_Chunk_Size 32k
      Buffer_Max_Size 256k
      DB /var/log/flb-storage/tail.db
      DB.Sync normal
      storage.type  filesystem

    [INPUT]
      Name systemd
      Systemd_Filter _SYSTEMD_UNIT=docker.service
      Systemd_Filter _SYSTEMD_UNIT=containerd.service
      Systemd_Filter _SYSTEMD_UNIT=kubelet.service
      Tag host.*
      Strip_Underscores On
      DB /var/log/flb-storage/systemd.db
      DB.Sync normal
      storage.type  filesystem
  outputs: |
    [OUTPUT]
      Name forward
      Match *
      Host fluentd.logging.svc
      Port 24224
  service: |
    [SERVICE]
      Daemon Off
      Flush 1
      Log_Level info
      HTTP_Server On
      HTTP_Listen 0.0.0.0
      HTTP_Port 2020
      storage.path /var/log/flb-storage/
      storage.sync normal
      storage.checksum off
      storage.max_chunks_up 128
      storage.backlog.mem_limit 16M
      storage.metrics on
      Parsers_File parsers.conf
      Parsers_File custom_parsers.conf

github-actions bot commented Aug 8, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Aug 8, 2021
andsens commented Aug 8, 2021

/remove-lifecycle stale

github-actions bot removed the Stale label Aug 9, 2021
@tlefevre

We're experiencing this issue as well.

github-actions bot commented Oct 1, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Oct 1, 2021
andsens commented Oct 1, 2021

/remove-lifecycle stale

github-actions bot removed the Stale label Oct 5, 2021
github-actions bot commented Nov 5, 2021

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Nov 5, 2021
@github-actions

This issue was closed because it has been stalled for 5 days with no activity.
