
[engine] caught signal (SIGSEGV) on 2 pods in the daemonset #4195

Closed

bstolz opened this issue Oct 15, 2021 · 27 comments

@bstolz

bstolz commented Oct 15, 2021

Bug Report

Describe the bug
Over time I have observed that our Fluent Bit pods running 1.8.4 occasionally show pod restarts due to termination (Kubernetes exit code 139). Interestingly, the restarts seem to happen only on a few pods in the daemonset:

NAME               READY   STATUS    RESTARTS   AGE
fluent-bit-2cxmz   1/1     Running   0          7d7h
fluent-bit-5sxv7   1/1     Running   0          7d8h
fluent-bit-5twrh   1/1     Running   0          7d8h
fluent-bit-6dcqx   1/1     Running   0          7d8h
fluent-bit-6w8hw   1/1     Running   4          6d21h
fluent-bit-d5487   1/1     Running   0          7d7h
fluent-bit-fv2m6   1/1     Running   0          6d23h
fluent-bit-fxgs8   1/1     Running   0          7d8h
fluent-bit-kqd2n   1/1     Running   0          7d8h
fluent-bit-kw5hq   1/1     Running   0          7d8h
fluent-bit-kwsxl   1/1     Running   0          7d7h
fluent-bit-l9gsr   1/1     Running   0          7d8h
fluent-bit-lxkdx   1/1     Running   12         7d11h
fluent-bit-rqdnc   1/1     Running   0          7d8h
fluent-bit-z2w48   1/1     Running   0          7d8h

To Reproduce

  • Example log message if applicable:
"[2021/10/14 17:23:17]inotify_fs_remove(): inode=528908 watch_fd=6\n","stream":"stderr","time":"2021-10-14T23:23:17.829964755Z
...then nothing, then
"[2021/10/14 23:46:09] [engine] caught signal (SIGSEGV) "
#0  0x55f2ca5dd1c2      in  atomic_load_p() at lib/jemalloc-5.2.1/include/jemalloc/internal/atomic.h:62
#1  0x55f2ca5dd1c2      in  extent_arena_get() at lib/jemalloc-5.2.1/include/jemalloc/internal/extent_inlines.h:51
#2  0x55f2ca5dd1c2      in  je_large_dalloc() at lib/jemalloc-5.2.1/src/large.c:361
#3  0x55f2ca59efcc      in  arena_dalloc_large() at lib/jemalloc-5.2.1/include/jemalloc/internal/arena_inlines_b.h:281
#4  0x55f2ca59efcc      in  arena_dalloc() at lib/jemalloc-5.2.1/include/jemalloc/internal/arena_inlines_b.h:323
#5  0x55f2ca59efcc      in  idalloctm() at lib/jemalloc-5.2.1/include/jemalloc/internal/jemalloc_internal_inlines_c.h:118
#6  0x55f2ca59efcc      in  ifree() at lib/jemalloc-5.2.1/src/jemalloc.c:2589
#7  0x55f2ca59efcc      in  je_free_default() at lib/jemalloc-5.2.1/src/jemalloc.c:2799
#8  0x55f2ca64f100      in  flb_free() at include/fluent-bit/flb_mem.h:122
#9  0x55f2ca651806      in  flb_http_client_destroy() at src/flb_http_client.c:1325
#10 0x55f2ca6d1a0c      in  cb_stackdriver_flush() at plugins/out_stackdriver/stackdriver.c:2219
#11 0x55f2ca61d19d      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:509
#12 0x55f2cab15786      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#13 0x7f258a06ebc0      in  ???() at ???:0
  • Steps to reproduce the problem:

There doesn't seem to be anything specific that we do other than just running for a period of time.

Expected behavior
I suppose in an ideal world this error would not happen.

Your Environment

  • Version used: 1.8.4
  • Configuration:
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes GKE 1.19.13-gke.1200
  • Server type and version: GKE
  • Operating System and version: Container-Optimized OS with Docker (cos)
  • Filters and plugins: GKE

Additional context

It doesn't seem to affect us other than tripping our restart monitors. We're not sure why it's restarting (e.g., is it something we're doing?).

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Nov 15, 2021
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@GuillaumeSmaha
Contributor

I got a similar issue:

[2021/11/14 18:32:22] [engine] caught signal (SIGSEGV)
#0  0x55585d8f2fa1      in  destroy_chunk_list() at lib/msgpack-c/src/zone.c:39
#1  0x55585d8f336f      in  msgpack_zone_destroy() at lib/msgpack-c/src/zone.c:175
#2  0x55585d8f3492      in  msgpack_zone_free() at lib/msgpack-c/src/zone.c:220
#3  0x55585d8f0ac2      in  msgpack_unpacked_destroy() at lib/msgpack-c/include/msgpack/unpack.h:260
#4  0x55585d8f2e02      in  msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:661
#5  0x55585d4a1392      in  elasticsearch_format() at plugins/out_es/es.c:358
#6  0x55585d4a20f2      in  cb_es_flush() at plugins/out_es/es.c:772
#7  0x55585d41427c      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#8  0x55585d911046      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9  0x7ff320a13234      in  ???() at ???:0

or

[2021/11/16 23:07:04] [debug] [input:tail:tail.1] 0 new files found on path '/var/log/syslog'
[2021/11/16 23:07:04] [debug] [input:tail:tail.0] inode=40894548 events: IN_MODIFY 
[2021/11/16 23:07:05] [debug] [input:tail:tail.0] inode=40894626 events: IN_MODIFY 
[2021/11/16 23:07:05] [debug] [output:es:es.0] HTTP Status=400 URI=/_bulk
[2021/11/16 23:07:05] [error] [output:es:es.0] HTTP status=400 URI=/_bulk, response:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Malformed action/metadata line [3], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]"}],"type":"illegal_argument_exception","reason":"Malformed action/metadata line [3], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]"},"status":400}

[2021/11/16 23:07:05] [debug] [upstream] KA connection #120 to elasticsearch-master:9200 is now available
[2021/11/16 23:07:05] [debug] [input:tail:tail.0] inode=40894548 events: IN_MODIFY 
[2021/11/16 23:07:05] [debug] [http_client] not using http_proxy for header
[2021/11/16 23:07:05] [debug] [http_client] not using http_proxy for header
[2021/11/16 23:07:06] [debug] [out_es] converted_size is 0
[2021/11/16 23:07:06] [engine] caught signal (SIGSEGV)
#0  0x7f75158f9701      in  ???() at ???:0

or

[2021/11/25 11:16:36] [error] [output:es:es.0] HTTP status=400 URI=/_bulk, response:
{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"}],"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"},"status":400}

[2021/11/25 11:16:36] [ warn] [engine] failed to flush chunk '1-1637838990.757540396.flb', retry in 6 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
[2021/11/25 11:16:47] [engine] caught signal (SIGSEGV)
#0  0x55f0c1dd9861      in  cio_chunk_is_up() at lib/chunkio/src/cio_chunk.c:474
#1  0x55f0c1b047af      in  flb_input_chunk_flush() at src/flb_input_chunk.c:1456
#2  0x55f0c1ae3d4b      in  flb_engine_dispatch() at src/flb_engine_dispatch.c:243
#3  0x55f0c1ae1e25      in  flb_engine_flush() at src/flb_engine.c:115
#4  0x55f0c1ae31d1      in  flb_engine_handle_event() at src/flb_engine.c:396
#5  0x55f0c1ae31d1      in  flb_engine_start() at src/flb_engine.c:696
#6  0x55f0c1ac581e      in  flb_lib_worker() at src/flb_lib.c:627
#7  0x7f10f3fbefa2      in  ???() at ???:0
#8  0x7f10f369f4ce      in  ???() at ???:0
#9  0xffffffffffffffff  in  ???() at ???:0

Is it possible a parsing issue is creating a segfault?

@bstolz
Author

bstolz commented Nov 25, 2021

"Is it possible a parsing issue is creating a segfault ?"

I think so. I am currently experimenting with adding "more and more" functionality back into my parser.

Here's the parser line that I was using when I see the segfaults:

    [PARSER]
        Name        elasticsearch_multiline
        Format      regex
        Regex       ^*\\"type\\": \\"(?<type>server)\\", \\"timestamp\\": \\"(?<time>[^\\]*)\\", \\"level\\": \\"(?<severity>[^\\]*)\\", \\"component\\": \\"(?<component>[^\\]*)\\", \\"cluster.name\\": \\"(?<cluster_name>[^\\]*)\\", \\"node.name\\": \\"(?<node_name>[^\\]*)\\", \\"message\\": \\"(?<message>[^\\]*)\\", \\"cluster.uuid\\": \\"(?<cluster_uuid>[^\\]*)\\", \\"node.id\\": \\"(?<node_id>[^\\]*)\\"[,]?[\s]?(?<stacktrace>[^\`]*$)
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S,%L%z

@GuillaumeSmaha
Contributor

I did a test replacing the Elasticsearch output with a simple stdout output, and I no longer get the segfault.
Elasticsearch output:

    [OUTPUT]
        Name            es
        Match           *
        Host            elasticsearch-master
        Port            9200
        tls             Off
        tls.verify      Off
        Generate_ID     On
        Logstash_Format On
        Time_Key_Format %Y-%m-%dT%H:%M:%S
        Replace_Dots    On
        Time_Key        ts
        Retry_Limit     False
        Buffer_Size     64KB

stdout used:

    [OUTPUT]
        Name            stdout
        Match           *

@bstolz can you try with stdout on your side to confirm whether it could be an issue with the Elasticsearch output?

@bstolz
Author

bstolz commented Nov 26, 2021

Ahh, you're using the Elasticsearch output - I am exporting to Stackdriver and seeing the same thing. I can try stdout on a non-production environment.

@GuillaumeSmaha
Contributor

GuillaumeSmaha commented Nov 30, 2021

I upgraded to 1.8.10:

  • works with stdout
  • fails with es

Here is the log:

$ kubectl logs fluent-bit-xjmdk
Fluent Bit v1.8.10
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

Invalid configuration value at FILTER.Name
[2021/11/30 09:07:34] [ info] [engine] started (pid=1)
[2021/11/30 09:07:34] [ info] [storage] version=1.1.5, initializing...
[2021/11/30 09:07:34] [ info] [storage] in-memory
[2021/11/30 09:07:34] [ info] [storage] full synchronization mode, checksum enabled, max_chunks_up=128
[2021/11/30 09:07:34] [ info] [cmetrics] version=0.2.2
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] multiline core started
[2021/11/30 09:07:34] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2021/11/30 09:07:34] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2021/11/30 09:07:34] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2021/11/30 09:07:34] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2021/11/30 09:07:34] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/11/30 09:07:34] [ info] [sp] stream processor started
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224257 watch_fd=3 name=/var/log/containers/cilium-mlvm6_kube-system_cilium-agent-bdfb18631cb184400d7045bcf03e01eb592c8737af9963dead8eee41c6cb3aea.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224846 watch_fd=4 name=/var/log/containers/cilium-mlvm6_kube-system_cilium-agent-ee021432c3c584782aba98c77d8d6d90fca3be851e42403b68fb9911926829c1.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224845 watch_fd=5 name=/var/log/containers/cilium-mlvm6_kube-system_clean-cilium-state-2ad64fcca13c2b12be4b85b06118e09f82dfe372ec39dd02cd214be83951ccdc.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224823 watch_fd=6 name=/var/log/containers/cilium-mlvm6_kube-system_mount-cgroup-7e466ef8836bece0943fa6a65d9b2c9426f23c386b779cf3f4325ab7bc81ca02.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224838 watch_fd=7 name=/var/log/containers/cilium-mlvm6_kube-system_wait-for-node-init-2c1a981f06c37c999741ac23a609afdde95f05d80880d296a2935a34f4044887.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18220129 watch_fd=8 name=/var/log/containers/cilium-node-init-5cm7k_kube-system_node-init-26156fdfe5ef2b8a28e4f497bf5ae796c9fd54847979f4ceec5e7842158e13f3.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18224480 watch_fd=9 name=/var/log/containers/cilium-node-init-5cm7k_kube-system_node-init-d760121abf51c6dde80769dddf2c54dee2342588fbc83974ea7a0ce2c1a0812c.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18354769 watch_fd=11 name=/var/log/containers/fluent-bit-xjmdk_monitoring_fluent-bit-b731b932d1f4455c1bc1267337a799e3d09a6388774c27362ce1d0281ffc2338.log
...
2021/11/30 09:07:41] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18354768 watch_fd=47 name=/var/log/containers/fluent-bit-xjmdk_monitoring_fluent-bit-f004961b80d7ac0bc5a58506270837eb5e34350abd0250478afee6925727283d.log
[2021/11/30 09:07:34] [ info] [input:tail:tail.1] inotify_fs_add(): inode=18089658 watch_fd=1 name=/var/log/syslog
[2021/11/30 09:07:41] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:41] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:42] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:43] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:07:43] [ warn] [http_client] cannot increase buffer: current=64000 requested=96768 max=64000
[2021/11/30 09:08:38] [engine] caught signal (SIGSEGV)
#0  0x5566019fcb51      in  destroy_chunk_list() at lib/msgpack-c/src/zone.c:39
#1  0x5566019fcf1f      in  msgpack_zone_destroy() at lib/msgpack-c/src/zone.c:175
#2  0x5566019fd042      in  msgpack_zone_free() at lib/msgpack-c/src/zone.c:220
#3  0x5566019fa672      in  msgpack_unpacked_destroy() at lib/msgpack-c/include/msgpack/unpack.h:260
#4  0x5566019fc9b2      in  msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:661
#5  0x5566015aaa88      in  elasticsearch_format() at plugins/out_es/es.c:358
#6  0x5566015ab7e8      in  cb_es_flush() at plugins/out_es/es.c:772
#7  0x55660151d28e      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#8  0x556601a1ac06      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9  0x7ffbf3c13234      in  ???() at ???:0

@GuillaumeSmaha
Contributor

GuillaumeSmaha commented Nov 30, 2021

This is the first time I have seen the line Invalid configuration value at FILTER.Name, but I think it is not related to the issue.

UPDATE: I fixed the Invalid configuration value at FILTER.Name issue, and I still get the segfault:

[2021/11/30 09:54:38] [engine] caught signal (SIGSEGV)
#0  0x7f248b8e0820      in  ???() at ???:0
#1  0x5571447fc2cd      in  msgpack_sbuffer_write() at lib/msgpack-c/include/msgpack/sbuffer.h:88
#2  0x5571447fc09f      in  msgpack_pack_str_body() at lib/msgpack-c/include/msgpack/pack_template.h:784
#3  0x5571447fe0cb      in  flb_ml_flush_stream_group() at src/multiline/flb_ml.c:981
#4  0x5571447fc5d1      in  flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#5  0x5571447fd3bd      in  flb_ml_append() at src/multiline/flb_ml.c:617
#6  0x55714483a4c3      in  process_content() at plugins/in_tail/tail_file.c:375
#7  0x55714483c9f1      in  flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1210
#8  0x557144834ea0      in  in_tail_collect_event() at plugins/in_tail/tail.c:261
#9  0x557144836079      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#10 0x5571447ddc16      in  flb_input_collector_fd() at src/flb_input.c:1108
#11 0x5571447f03fa      in  flb_engine_handle_event() at src/flb_engine.c:411
#12 0x5571447f03fa      in  flb_engine_start() at src/flb_engine.c:696
#13 0x5571447d281e      in  flb_lib_worker() at src/flb_lib.c:627
#14 0x7f248c19cfa2      in  ???() at ???:0
#15 0x7f248b87d4ce      in  ???() at ???:0
#16 0xffffffffffffffff  in  ???() at ???:0

@bstolz
Author

bstolz commented Nov 30, 2021

As mentioned previously, I am logging to Stackdriver, and after changing it to stdout I still see the same issue. I'm slowly removing some of the "extra" lines I have in the input block - multiline, skip long lines, and reducing buffer_max_size - to see how it goes. Then I will set all of the buffers to default. After that, I will start investigating the regex.

@edsiper edsiper reopened this Nov 30, 2021
@edsiper
Member

edsiper commented Nov 30, 2021

Thanks for the info. Would you please provide a simple repro case so we can troubleshoot?

@bstolz
Author

bstolz commented Nov 30, 2021

I am observing this on all of our Elasticsearch 7 clusters (ECK) in GKE - it happens a few times within a 24-hour period. I am using the same configmap on other clusters not running Elasticsearch and I do not see the issue.

I am not sure how to provide a "simple" repro case given that an Elasticsearch cluster is involved. I've included the relevant bits of the configmap. Let me know how else I can provide what you need. I can also offer to help, as I have a demo environment.


[INPUT]
    Name              tail
    Path              /var/log/containers/*_elasticsearch*.log
    Exclude_Path      /var/log/containers/*_kube-system_*.log,/var/log/containers/*_istio-system_*.log,/var/log/containers/*_knative-serving_*.log,/var/log/containers/*_gke-system_*.log,/var/log/containers/*_config-management-system_*.log,*_istio-proxy*.log,/var/log/containers/*_dd-agent*.log, /var/log/containers/*_fluent-bit-custom*.log
    Tag               kube.elasticsearch.*
    DB                /var/log/flb_elasticsearch.db
    Read_from_head    true
    Multiline         off
    Parser_Firstline  elasticsearch_multiline
    Mem_Buf_Limit     20MB
    Refresh_Interval  10
    Buffer_Chunk_Size 256k
    Buffer_Max_Size   1024k

[FILTER]
    Name         parser
    Match        kube.*
    Key_Name     log
    Reserve_Data True
    Parser       docker

[FILTER]
    Name        modify
    Match       *
    Hard_rename log message

[FILTER]
    Name         parser
    Match        kube.*
    Key_Name     message
    Reserve_Data True
    Parser       glog
    Parser       json

[OUTPUT]
    Name    stackdriver
    Match   kube.*
    resource  k8s_container
    k8s_cluster_name  <redacted>
    k8s_cluster_location  <redacted>
    severity_key severity

[PARSER]
    Name        elasticsearch_multiline
    Format      regex
    Regex       ^*\\"type\\": \\"(?<type>server)\\", \\"timestamp\\": \\"(?<time>[^\\]*)\\", \\"level\\": \\"(?<severity>[^\\]*)\\", \\"component\\": \\"(?<component>[^\\]*)\\", \\"cluster.name\\": \\"(?<cluster_name>[^\\]*)\\", \\"node.name\\": \\"(?<node_name>[^\\]*)\\", \\"message\\": \\"(?<message>[^\\]*)\\", \\"cluster.uuid\\": \\"(?<cluster_uuid>[^\\]*)\\", \\"node.id\\": \\"(?<node_id>[^\\]*)\\"[,]?[\s]?(?<stacktrace>[^\`]*$)
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S,%L%z

[PARSER]
    Name   json
    Format json
    Time_Key time
    Time_Format %d/%b/%Y:%H:%M:%S %z

[PARSER]
    Name        glog
    Format      regex
    Regex       ^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source_file>[^ \]]+)\:(?<source_line>\d+)\]\s(?<message>.*)$
    Time_Key    time
    Time_Format %m%d %H:%M:%S.%L%z


@GuillaumeSmaha
Contributor

GuillaumeSmaha commented Nov 30, 2021

I did a fresh install of the latest Elasticsearch chart (Docker image: docker.elastic.co/elasticsearch/elasticsearch:7.15.0).

I cleaned up my configuration to keep fewer parsers/filters, and it still crashes with this configuration:

    [SERVICE]
        Config_Watch  On

        # Flush
        # =====
        # set an interval of seconds before to flush records to a destination
        flush        1

        # Daemon
        # ======
        # instruct Fluent Bit to run in foreground or background mode.
        daemon       Off

        # Log_Level
        # =========
        # Set the verbosity level of the service, values can be:
        #
        # - error
        # - warning
        # - info
        # - debug
        # - trace
        #
        # by default 'info' is set, that means it includes 'error' and 'warning'.
        log_level    info

        # Parsers File
        # ============
        # specify an optional 'Parsers' configuration file
        parsers_file parsers.conf
        parsers_file  custom_parsers.conf

        # HTTP Server
        # ===========
        # Enable/Disable the built-in HTTP Server for metrics
        http_server  On
        http_listen  0.0.0.0
        http_port    2020

        # Storage
        # =======
        storage.sync      full
        storage.checksum  on

    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        # Except fluent-bit to avoid feedback loop
        Exclude_Path      ["*fluent-bit*.log"]
        DB                /var/log/fluentbit/flb_kube.db
        Mem_Buf_Limit     64MB
        Skip_Long_Lines   On
        Refresh_Interval  10
        # Parser for containerd (Container Runtime Interface)
        multiline.parser  cri

    [INPUT]
        Name              tail
        Tag               system.*
        Path              /var/log/syslog
        Parser            rsyslog
        DB                /var/log/fluentbit/flb_syslog.db
        Mem_Buf_Limit     64MB
        Skip_Long_Lines   On
        Refresh_Interval  10

    [FILTER]
        Name           kubernetes
        Match          kube.*
        Kube_URL       https://kubernetes.default.svc:443
        Merge_Log On
        Keep_Log Off
        K8S-Logging.Parser On
        K8S-Logging.Exclude Off

    [OUTPUT]
        Name            es
        Match           *
        Host            elasticsearch-master
        Port            9200
        tls             Off
        tls.verify      Off
        Generate_ID     On
        Logstash_Format On
        Time_Key_Format %Y-%m-%dT%H:%M:%S
        Replace_Dots    On
        Time_Key        ts
        Retry_Limit     False
        Buffer_Size     64KB

    [PARSER]
        Name        rsyslog
        Format      regex
        Regex       (?<syslog_time>^[A-Za-z]+ [ 1-3][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2}) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<log>.*)$
        Time_Key    syslog_time
        Time_Format %b %d %H:%M:%S
        Time_Keep   On

log:

Fluent Bit v1.8.10
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/11/30 16:56:31] [ info] [engine] started (pid=1)
[2021/11/30 16:56:31] [ info] [storage] version=1.1.5, initializing...
[2021/11/30 16:56:31] [ info] [storage] in-memory
[2021/11/30 16:56:31] [ info] [storage] full synchronization mode, checksum enabled, max_chunks_up=128
[2021/11/30 16:56:31] [ info] [cmetrics] version=0.2.2
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] multiline core started
[2021/11/30 16:56:31] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2021/11/30 16:56:31] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2021/11/30 16:56:31] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2021/11/30 16:56:31] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2021/11/30 16:56:31] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/11/30 16:56:31] [ info] [sp] stream processor started
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885569 watch_fd=2 name=/var/log/containers/cilium-kvwhq_kube-system_cilium-agent-9f584588e4be9ae6ee66cfa45ad7458d2d891b6fb680d2f18bd2910f62bd142d.log
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885566 watch_fd=3 name=/var/log/containers/cilium-kvwhq_kube-system_clean-cilium-state-4949ccf208b74e3c36949068cec5bf4a8f22308e6ddf32a146294ee0940c2fe1.log
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885535 watch_fd=4 name=/var/log/containers/cilium-kvwhq_kube-system_mount-cgroup-b2939032ba97cc068dce2ac616d6a4929faff24f7304231ce62a4a1a5c24c1b3.log
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885563 watch_fd=5 name=/var/log/containers/cilium-kvwhq_kube-system_wait-for-node-init-c5d20e3c171339e307120e69fa64baa706b4d83a1514470666ad184ae1542ad8.log
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885560 watch_fd=6 name=/var/log/containers/cilium-node-init-2v4n6_kube-system_node-init-d5fab00554f962d3cea541308f6358e850638d0643f923ef5347c8c28a92c757.log
[2021/11/30 16:56:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=56885364 watch_fd=7 name=/var/log/containers/fluent-bit-9h2rs_monitoring_fluent-bit-ed4d4dc0f608994a6bcb613bc49a121b2fd74d4aebb5cace4fc6f3f316b6a660.log
...
[2021/11/30 16:56:31] [ info] [input:tail:tail.1] inotify_fs_add(): inode=56885366 watch_fd=1 name=/var/log/syslog
[2021/11/30 16:58:24] [engine] caught signal (SIGSEGV)
#0  0x55ff5827cb51      in  destroy_chunk_list() at lib/msgpack-c/src/zone.c:39
#1  0x55ff5827cf1f      in  msgpack_zone_destroy() at lib/msgpack-c/src/zone.c:175
#2  0x55ff5827d042      in  msgpack_zone_free() at lib/msgpack-c/src/zone.c:220
#3  0x55ff5827a672      in  msgpack_unpacked_destroy() at lib/msgpack-c/include/msgpack/unpack.h:260
#4  0x55ff5827c9b2      in  msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:661
#5  0x55ff57e2aa88      in  elasticsearch_format() at plugins/out_es/es.c:358
#6  0x55ff57e2b7e8      in  cb_es_flush() at plugins/out_es/es.c:772
#7  0x55ff57d9d28e      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:514
#8  0x55ff5829ac06      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9  0x7f19e864c04d      in  ???() at ???:0

Another log after the pod crashed:

[2021/11/30 17:02:06] [error] [output:es:es.0] HTTP status=400 URI=/_bulk, response:
{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"}],"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"},"status":400}

[2021/11/30 17:02:06] [ warn] [engine] failed to flush chunk '\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u00�*ec', retry in 9 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:02:07] [ warn] [engine] failed to flush chunk '1-1638291726.807180974.flb', retry in 11 seconds: task_id=3, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:02:17] [error] [output:es:es.0] HTTP status=400 URI=/_bulk, response:
{"error":{"root_cause":[{"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"}],"type":"json_parse_exception","reason":"Illegal character ((CTRL-CHAR, code 0)): only regular white space (\\r, \\n, \\t) is allowed between tokens\n at [Source: (ByteArrayInputStream); line: 1, column: 2]"},"status":400}
[2021/11/30 17:02:17] [ warn] [engine] failed to flush chunk '00\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u
', retry in 16 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:02:18] [ warn] [engine] failed to flush chunk '1-1638291726.807180974.flb', retry in 17 seconds: task_id=3, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:02:36] [ warn] [engine] failed to flush chunk '1-1638291726.807180974.flb', retry in 8 seconds: task_id=3, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:02:43] [ warn] [engine] failed to flush chunk '1-1638291726.807180974.flb', retry in 58 seconds: task_id=3, input=tail.0 > output=es.0 (out_id=0)
[2021/11/30 17:03:22] [engine] caught signal (SIGSEGV)
#0  0x7f6316a0c7b0      in  ???() at ???:0
#1  0x55600e56093a      in  flb_sds_cat() at src/flb_sds.c:132
#2  0x55600e5609a4      in  flb_sds_cat_safe() at src/flb_sds.c:145
#3  0x55600e583b82      in  package_content() at src/multiline/flb_ml.c:324
#4  0x55600e584030      in  process_append() at src/multiline/flb_ml.c:495
#5  0x55600e58424e      in  ml_append_try_parser() at src/multiline/flb_ml.c:564
#6  0x55600e584456      in  flb_ml_append() at src/multiline/flb_ml.c:630
#7  0x55600e5c14c3      in  process_content() at plugins/in_tail/tail_file.c:375
#8  0x55600e5c39f1      in  flb_tail_file_chunk() at plugins/in_tail/tail_file.c:1210
#9  0x55600e5bbea0      in  in_tail_collect_event() at plugins/in_tail/tail.c:261
#10 0x55600e5bd079      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:268
#11 0x55600e564c16      in  flb_input_collector_fd() at src/flb_input.c:1108
#12 0x55600e5773fa      in  flb_engine_handle_event() at src/flb_engine.c:411
#13 0x55600e5773fa      in  flb_engine_start() at src/flb_engine.c:696
#14 0x55600e55981e      in  flb_lib_worker() at src/flb_lib.c:627
#15 0x7f63172c8fa2      in  ???() at ???:0
#16 0x7f63169a94ce      in  ???() at ???:0
#17 0xffffffffffffffff  in  ???() at ???:0

Full log with trace enabled:
fluent-bit-9h2rs.log.zip

I saw that the CPU usage is high; I could allow more CPU or remove the checksum and full sync. (CPU usage screenshots attached.)

I added

        storage.sync      full
        storage.checksum  on

because I read an issue where it was recommended to enable these when there are too many warning logs like [ warn] [engine] failed to flush chunk '1-1638291872.61468128.flb', retry in 11 seconds: task_id=1, input=tail.0 > output=es.0 (out_id=0)

UPDATE: removing storage.sync full & storage.checksum on and increasing CPU to 2 (usage reaches 1 CPU) made no difference to the segfault.

@github-actions github-actions bot removed the Stale label Dec 2, 2021
@GuillaumeSmaha
Contributor

GuillaumeSmaha commented Dec 3, 2021

I ran valgrind on a debug build of the master branch (9ccc6f8): valgrind-fluent-bit-w67h2-2.log
And another run with the --leak-check=full option: valgrind-fluent-bit-w67h2-3.log

Both crash with the same stack:

#0  0x483c933           in  ???() at ???:0
#1  0x2b0116            in  es_bulk_append() at plugins/out_es/es_bulk.c:105
#2  0x2ada44            in  elasticsearch_format() at plugins/out_es/es.c:537
#3  0x2ae828            in  cb_es_flush() at plugins/out_es/es.c:772
#4  0x21f072            in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:511
#5  0x71c0c6            in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117

It seems output_pre_cb_flush() is present in all of the SEGV stacks that @bstolz and I got (https://github.com/fluent/fluent-bit/blob/master/include/fluent-bit/flb_output.h#L511); it is the coroutine entry point that invokes each output plugin's flush callback, so it shows up in any crash that happens during a flush.

@GuillaumeSmaha
Contributor

@bstolz I notice you use

    Parser_Firstline  elasticsearch_multiline

and I am using:

        multiline.parser  cri

I think there is an issue in the flush path of the multiline parser.
With gdb, I got this backtrace:

#0  __memmove_avx_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:384
#1  0x0000555555688943 in msgpack_sbuffer_write (data=0x7ffff61f4280, 
    buf=0x7ffff4aee010 "0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u00"..., len=1968975920)
    at /tmpdir/fluent-bit/lib/msgpack-c/include/msgpack/sbuffer.h:88
#2  0x0000555555688715 in msgpack_pack_str_body (x=0x7ffff61f4270, 
    b=0x7ffff4aee010, l=1968975920)
    at /tmpdir/fluent-bit/lib/msgpack-c/include/msgpack/pack_template.h:784
#3  0x000055555568a8e4 in flb_ml_flush_stream_group (ml_parser=0x7ffff663e070, 
    mst=0x7ffff543cf50, group=0x7ffff49e4180)
    at /tmpdir/fluent-bit/src/multiline/flb_ml.c:1012
#4  0x0000555555688c47 in flb_ml_flush_parser_instance (ml=0x7ffff5452150, 
    parser_i=0x7ffff5441f00, stream_id=17521209660683586930)
    at /tmpdir/fluent-bit/src/multiline/flb_ml.c:117
#5  0x0000555555689a33 in flb_ml_append (ml=0x7ffff5452150, 
    stream_id=951830318436279800, type=0, tm=0x7ffff61f4490, 
    buf=0x7ffff5527d40, size=182)
    at /tmpdir/fluent-bit/src/multiline/flb_ml.c:617
#6  0x00005555556c7532 in process_content (file=0x7ffff542f080, 
    bytes=0x7ffff61f45e8) at /tmpdir/fluent-bit/plugins/in_tail/tail_file.c:375
#7  0x00005555556c9a60 in flb_tail_file_chunk (file=0x7ffff542f080)
    at /tmpdir/fluent-bit/plugins/in_tail/tail_file.c:1210
#8  0x00005555556c1f0f in in_tail_collect_event (file=0x7ffff542f080, 
    config=0x7ffff6632f80) at /tmpdir/fluent-bit/plugins/in_tail/tail.c:261
#9  0x00005555556c30e8 in tail_fs_event (ins=0x7ffff6648f00, 
    config=0x7ffff6632f80, in_context=0x7ffff5456000)
    at /tmpdir/fluent-bit/plugins/in_tail/tail_fs_inotify.c:268
#10 0x0000555555669c67 in flb_input_collector_fd (fd=27, config=0x7ffff6632f80)
    at /tmpdir/fluent-bit/src/flb_input.c:1108
#11 0x000055555567c8bc in flb_engine_handle_event (config=0x7ffff6632f80, 
    mask=1, fd=27) at /tmpdir/fluent-bit/src/flb_engine.c:411
#12 flb_engine_start (config=0x7ffff6632f80)
    at /tmpdir/fluent-bit/src/flb_engine.c:696
#13 0x000055555565e83f in flb_lib_worker (data=0x7ffff660b340)
    at /tmpdir/fluent-bit/src/flb_lib.c:627
#14 0x00007ffff7fb1fa3 in start_thread (arg=<optimized out>)
    at pthread_create.c:486
#15 0x00007ffff76924cf in clone ()
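
One detail worth calling out in this backtrace: msgpack_sbuffer_write is asked to copy len=1968975920 bytes (~1.9 GB), while the flb_ml_append call in frame #5 only received size=182 bytes, so the length used when flushing the stream group looks corrupted or uninitialized rather than genuinely that large. Below is a purely hypothetical sanity check of the kind one could add while debugging; the struct, field names, and check are illustrative only and are not Fluent Bit internals.

    /* Hypothetical debugging guard, not actual Fluent Bit code: before packing the
     * concatenated multiline text, verify the length against the bytes that were
     * actually accumulated for the group. */
    #include <stdio.h>
    #include <stdlib.h>

    struct ml_group_text {              /* illustrative stand-in for a stream group */
        size_t len;                     /* length about to be packed                */
        size_t accumulated;             /* bytes appended into the group so far     */
    };

    static int ml_group_len_sane(const struct ml_group_text *g)
    {
        if (g->len > g->accumulated) {
            fprintf(stderr,
                    "[ml] corrupted group length: packing %zu bytes but only %zu accumulated\n",
                    g->len, g->accumulated);
            return 0;
        }
        return 1;
    }

    int main(void)
    {
        /* values taken from the backtrace above */
        struct ml_group_text g = { .len = 1968975920, .accumulated = 182 };

        return ml_group_len_sane(&g) ? EXIT_SUCCESS : EXIT_FAILURE;
    }

With the values from this backtrace the check would trip immediately, turning a silent out-of-bounds write into a loggable error.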

@GuillaumeSmaha
Contributor

GuillaumeSmaha commented Dec 8, 2021

I added one debug log in the es plugin (GuillaumeSmaha@5b908dd). That log should never appear, because from what I understand it means that even after the buffer is resized, it is still smaller than the size required by the memcpy at the end of the function:

memcpy(bulk->ptr + bulk->len, json, j_len);
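
For illustration, here is a minimal, self-contained C sketch of the capacity invariant that has to hold before that memcpy. The struct, field names, and growth policy are simplified stand-ins, not the actual out_es code.

    /* Hypothetical, simplified sketch of the invariant es_bulk_append() relies on:
     * before copying the action line and the JSON payload, the bulk buffer must
     * have been grown so that at least i_len + j_len free bytes remain. The debug
     * log mentioned above fires exactly when this invariant is broken. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct bulk_buf {
        char   *ptr;   /* buffer start       */
        size_t  len;   /* bytes already used */
        size_t  size;  /* bytes allocated    */
    };

    /* Grow the buffer until 'needed' more bytes fit; return -1 on OOM. */
    static int bulk_ensure(struct bulk_buf *b, size_t needed)
    {
        size_t new_size = b->size ? b->size : 4096;
        char *tmp;

        while (new_size - b->len < needed) {
            new_size *= 2;
        }
        if (new_size != b->size) {
            tmp = realloc(b->ptr, new_size);
            if (!tmp) {
                return -1;
            }
            b->ptr = tmp;
            b->size = new_size;
        }
        return 0;
    }

    static int bulk_append(struct bulk_buf *b,
                           const char *index, size_t i_len,
                           const char *json, size_t j_len)
    {
        /* The guard: only memcpy once capacity is guaranteed. */
        if (bulk_ensure(b, i_len + j_len) != 0) {
            return -1;
        }
        memcpy(b->ptr + b->len, index, i_len);
        b->len += i_len;
        memcpy(b->ptr + b->len, json, j_len);
        b->len += j_len;
        return 0;
    }

    int main(void)
    {
        struct bulk_buf b = {0};
        const char *index = "{\"create\":{\"_index\":\"logstash-2021.12.08\"}}\n";
        const char *json  = "{\"log\":\"hello\"}\n";

        if (bulk_append(&b, index, strlen(index), json, strlen(json)) == 0) {
            fwrite(b.ptr, 1, b.len, stdout);
        }
        free(b.ptr);
        return 0;
    }

In the gdb frame below, j_len=196967 while the debug line reports only 37999 bytes available, so the copy runs past the end of the allocation, which is consistent with the crash inside __memmove_avx_unaligned_erms.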

And I got a segv just after:


[2021/12/08 11:55:05] [debug] [out_es] converted_size is 0
[2021/12/08 11:55:05] [debug] [out_es] required size (197133, append: 4096) is still superior to available buffer (37999, previous available: 33903)

Thread 2 "flb-pipeline" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff61ff700 (LWP 45)]
__memmove_avx_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:262
262	../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0  __memmove_avx_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:262
#1  0x00005555556fc274 in es_bulk_append (bulk=0x7ffff5400d60, 
    index=0x7ffff4ac0a40 "{\"create\":{\"_index\":\"logstash-2021.12.08\",\"_type\":\"_doc\",\"_id\":\"a76df76b-d2c7-9d2b-e172-93e0ec3c3d4f\"}}\n", i_len=104, 
    json=0x7ffff253fad0 "{\"@timestamp\":\"2021-12-08T11:54:07.027Z\",\"time\":\"2021-12-08T11:54:07.027621077Z\",\"stream\":\"stdout\",\"_p\":\"P\",\"log\":\"\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\"..., j_len=196967, whole_size=33903, converted_size=0)
    at /tmpdir/fluent-bit/plugins/out_es/es_bulk.c:111
#2  0x00005555556f9af9 in elasticsearch_format (config=0x7ffff6632f80, 
    ins=0x7ffff6648f00, plugin_context=0x7ffff54a9600, flush_ctx=0x0, 
    tag=0x7ffff54a2700 "kube.var.log.containers.toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588_toolkit-jobs_toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588-1-db65750898be03ec7ef665854d8036800d9609adeb19a0cc5b021"..., 
    tag_len=215, data=0x7ffff2341f00, bytes=33903, out_data=0x7ffff4ac0f60, 
    out_size=0x7ffff4ac0f58) at /tmpdir/fluent-bit/plugins/out_es/es.c:537
#3  0x00005555556fa8c5 in cb_es_flush (data=0x7ffff2341f00, bytes=33903, 
    tag=0x7ffff54a2700 "kube.var.log.containers.toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588_toolkit-jobs_toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588-1-db65750898be03ec7ef665854d8036800d9609adeb19a0cc5b021"..., 
    tag_len=215, ins=0x7ffff6648f00, out_context=0x7ffff54a9600, 
    config=0x7ffff6632f80) at /tmpdir/fluent-bit/plugins/out_es/es.c:772
--Type <RET> for more, q to quit, c to continue without paging--
#4  0x000055555566b2df in output_pre_cb_flush ()
    at /tmpdir/fluent-bit/include/fluent-bit/flb_output.h:514
#5  0x0000555555b69d47 in co_init ()
    at /tmpdir/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#6  0x0000000000000000 in ?? ()
(gdb) 

gdb-fluent-bit-4vljw.log

@bstolz
Author

bstolz commented Dec 8, 2021

@GuillaumeSmaha great digging, thank you.

@GuillaumeSmaha
Contributor

GuillaumeSmaha commented Dec 9, 2021

@bstolz I will create another issue, because the bug I found is inside the Elasticsearch output plugin. Reverting #3788 seems to fix my issue (Fluent Bit doesn't crash anymore).
There may also be a separate issue with the multiline parser (see the backtrace I posted in #4195 (comment)).

I just created #4412

@GuillaumeSmaha
Contributor

@bstolz My fix for the es output was merged into master and the 1.8 branch. I am not sure it will fix your issue, but you can try it.

@bstolz
Author

bstolz commented Dec 9, 2021

@GuillaumeSmaha wonderful! Thank you!

That was quick ;-) I appreciate you diving in and posting that example.

@fhitchen

I have tried fluent-bit versions 1.7.x and 1.8.x, and they all crash when using the tail input and the elastic output with heavy JSON-formatted logs. I tried the change in commit 0a061fb; Fluent Bit was a bit more stable but still crashed. I have gone back to version 1.6.10, which "just runs".

I have noticed another issue with the 1.8.x series: it continually logs errors on writing to elastic, but the "status" element in the elastic response only ever shows status 201, which should be "document created" and not an error. When I see errors reported with 1.6, I always see a real error status in the response, like a 429 for example. I think the elastic response handling has also been broken.

@GuillaumeSmaha
Contributor

@fhitchen can you try the latest 1.8.11 with this change: #4361? (1.8.11 already includes a change like 0a061fb.)

@fhitchen

So I tried the nokute78:fix_4311_es_fix branch, and it ran for a few minutes before SEGV-ing. I'm back to 1.6.20.

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Mar 18, 2022
@fhitchen

Hi, I just built and tested the 1.8.13 fluent-bit release, and it crashed after processing for about 5 minutes. Again, 1.6.20 works flawlessly, so something has been badly broken in the elastic plugin's memory management since 1.7. It does not appear to be related to the input data; I have tried reprocessing the same file and it does not crash in the same place. If I replace the output with STDOUT it does not crash.

Fluent Bit v1.8.13
* Copyright (C) 2015-2021 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/03/21 14:59:12] [engine] caught signal (SIGSEGV)
#0  0x7ffff6647fbe      in  ???() at ???:0
#1  0x44dde2            in  flb_free() at include/fluent-bit/flb_mem.h:122
#2  0x44ef11            in  flb_sds_destroy() at src/flb_sds.c:408
#3  0x4e1dfb            in  elasticsearch_format() at plugins/out_es/es.c:540
#4  0x4e2b6b            in  cb_es_flush() at plugins/out_es/es.c:772
#5  0x456a08            in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:517
#6  0x787905            in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
Segmentation fault

@github-actions github-actions bot removed the Stale label Mar 22, 2022
@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Jun 20, 2022
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@omidraha

omidraha commented Sep 6, 2023

Similar issue or related issue #3585:

Info 1:

$ kubectl get pods  -n amazon-cloudwatch

NAME               READY   STATUS             RESTARTS        AGE
fluent-bit-8t7fk   0/1     CrashLoopBackOff   6 (2m42s ago)   8m29s
fluent-bit-98w2b   0/1     CrashLoopBackOff   6 (2m34s ago)   8m29s
fluent-bit-chzjh   0/1     CrashLoopBackOff   6 (2m45s ago)   8m29s
fluent-bit-gfs7x   0/1     CrashLoopBackOff   6 (2m32s ago)   8m29s
fluent-bit-szn69   0/1     CrashLoopBackOff   6 (2m38s ago)   8m29s
fluent-bit-wxbkm   0/1     CrashLoopBackOff   6 (2m46s ago)   8m29s

Info 2:

$ kubectl  logs fluent-bit-wxbkm -n amazon-cloudwatch

Fluent Bit v1.9.10
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/09/06 16:58:31] [ info] [fluent bit] version=1.9.10, commit=a2eaf59628, pid=1
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693618299.930665963.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693618299.938087318.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693655545.974952011.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693681220.89964544.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693681223.693057732.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693723157.518142203.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693731509.684250129.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693744367.53739593.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693771245.718195391.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693864172.706175934.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693866941.88886208.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693914041.952195655.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1693972809.563850435.flb
[2023/09/06 16:58:31] [error] [storage] format check failed: tail.0/1-1694017319.593944428.flb
[2023/09/06 16:58:31] [ info] [storage] version=1.4.0, type=memory+filesystem, sync=normal, checksum=disabled, max_chunks_up=128
[2023/09/06 16:58:31] [ info] [storage] backlog input plugin: storage_backlog.7
[2023/09/06 16:58:31] [ info] [cmetrics] version=0.3.7
[2023/09/06 16:58:31] [ info] [input:tail:tail.0] multiline core started
[2023/09/06 16:58:31] [ info] [input:tail:tail.1] multiline core started
[2023/09/06 16:58:31] [ info] [input:systemd:systemd.2] seek_cursor=s=2866bded7659498a96e0a8ad2beccf0a;i=7bf... OK
[2023/09/06 16:58:31] [ info] [input:tail:tail.3] multiline core started
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] queue memory limit: 4.8M
[2023/09/06 16:58:31] [ info] [filter:kubernetes:kubernetes.0] https=1 host=127.0.0.1 port=10250
[2023/09/06 16:58:31] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2023/09/06 16:58:31] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2023/09/06 16:58:31] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with Kubelet...
[2023/09/06 16:58:31] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2023/09/06 16:58:31] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] worker #0 started
[2023/09/06 16:58:31] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] worker #0 started
[2023/09/06 16:58:31] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] worker #0 started
[2023/09/06 16:58:31] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2023/09/06 16:58:31] [ info] [sp] stream processor started
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019169.378350213.flb
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019170.248838268.flb
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019187.497863650.flb
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019215.460809884.flb
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019256.480162041.flb
[2023/09/06 16:58:31] [ info] [input:storage_backlog:storage_backlog.7] register tail.0/1-1694019339.487434952.flb
[2023/09/06 16:58:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=14152341 watch_fd=3 name=/var/log/containers/coredns-6ff9c46cd8-c2h4p_kube-system_coredns-3c90a2ab2fae2141af415dc72026004233e8de1f17edd6bbc2df9e68f607ed92.log
[2023/09/06 16:58:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=14152350 watch_fd=4 name=/var/log/containers/coredns-6ff9c46cd8-c2h4p_kube-system_coredns-4dec2ea945a6ee553ffd5ff906c33358f5bef079cf13809111db2f95cdc082cd.log
[2023/09/06 16:58:31] [engine] caught signal (SIGSEGV)
AWS for Fluent Bit Container Image Version 2.31.12.20230821

Info 3:

$ kubectl  describe pod fluent-bit-8t7fk -n amazon-cloudwatch
Name:             fluent-bit-8t7fk
Namespace:        amazon-cloudwatch
Priority:         0
Service Account:  fluent-bit
Node:             ip-11-0-156-115.us-west-2.compute.internal/11.0.156.115
Start Time:       Wed, 06 Sep 2023 09:52:48 -0700
Labels:           controller-revision-hash=6cb6db756b
                  k8s-app=fluent-bit
                  kubernetes.io/cluster-service=true
                  pod-template-generation=1
                  version=v1
Annotations:      <none>
Status:           Running
IP:               11.0.156.115
IPs:
  IP:           11.0.156.115
Controlled By:  DaemonSet/fluent-bit
Containers:
  fluent-bit:
    Container ID:   containerd://69bacf029e31a28f243e051c72d49edf24416173b9e9c88e67d2d393cd775af4
    Image:          public.ecr.aws/aws-observability/aws-for-fluent-bit:latest
    Image ID:       public.ecr.aws/aws-observability/aws-for-fluent-bit@sha256:2e52d17a2f34197707dcfe27626cfa45c30ced37999004638768bf533cd7f444
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    139
      Started:      Wed, 06 Sep 2023 10:14:01 -0700
      Finished:     Wed, 06 Sep 2023 10:14:01 -0700
    Ready:          False
    Restart Count:  9
    Limits:
      memory:  200Mi
    Requests:
      cpu:     500m
      memory:  100Mi
    Environment:
      AWS_REGION:             <set to the key 'logs.region' of config map 'fluent-bit-cluster-info'>   Optional: false
      CLUSTER_NAME:           <set to the key 'cluster.name' of config map 'fluent-bit-cluster-info'>  Optional: false
      HTTP_SERVER:            <set to the key 'http.server' of config map 'fluent-bit-cluster-info'>   Optional: false
      HTTP_PORT:              <set to the key 'http.port' of config map 'fluent-bit-cluster-info'>     Optional: false
      READ_FROM_HEAD:         <set to the key 'read.head' of config map 'fluent-bit-cluster-info'>     Optional: false
      READ_FROM_TAIL:         <set to the key 'read.tail' of config map 'fluent-bit-cluster-info'>     Optional: false
      HOST_NAME:               (v1:spec.nodeName)
      HOSTNAME:               fluent-bit-8t7fk (v1:metadata.name)
      CI_VERSION:             k8s/1.3.16
    Mounts:
      /fluent-bit/etc/ from fluent-bit-config (rw)
      /run/log/journal from runlogjournal (ro)
      /var/fluent-bit/state from fluentbitstate (rw)
      /var/lib/docker/containers from varlibdockercontainers (ro)
      /var/log from varlog (ro)
      /var/log/dmesg from dmesg (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h2lml (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  fluentbitstate:
    Type:          HostPath (bare host directory volume)
    Path:          /var/fluent-bit/state
    HostPathType:  
  varlog:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  varlibdockercontainers:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/docker/containers
    HostPathType:  
  fluent-bit-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      fluent-bit-config
    Optional:  false
  runlogjournal:
    Type:          HostPath (bare host directory volume)
    Path:          /run/log/journal
    HostPathType:  
  dmesg:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/dmesg
    HostPathType:  
  kube-api-access-h2lml:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 :NoExecute op=Exists
                             :NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  22m                   default-scheduler  Successfully assigned amazon-cloudwatch/fluent-bit-8t7fk to ip-11-0-156-115.us-west-2.compute.internal
  Normal   Pulled     22m                   kubelet            Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest" in 160.524711ms (160.534769ms including waiting)
  Normal   Pulled     22m                   kubelet            Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest" in 180.919554ms (180.968511ms including waiting)
  Normal   Pulled     22m                   kubelet            Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest" in 166.16697ms (166.176944ms including waiting)
  Normal   Started    21m (x4 over 22m)     kubelet            Started container fluent-bit
  Normal   Pulled     21m                   kubelet            Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest" in 145.097853ms (145.109643ms including waiting)
  Normal   Pulling    21m (x5 over 22m)     kubelet            Pulling image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest"
  Normal   Created    21m (x5 over 22m)     kubelet            Created container fluent-bit
  Normal   Pulled     21m                   kubelet            Successfully pulled image "public.ecr.aws/aws-observability/aws-for-fluent-bit:latest" in 153.296922ms (153.322186ms including waiting)
  Warning  BackOff    2m29s (x94 over 22m)  kubelet

Info 4:

# fluent-bit.conf
fluent_bit_conf = """
[SERVICE]
    Flush             5
    Grace             30
    Log_Level         info
    Daemon            off
    Parsers_File          parsers.conf
    HTTP_Server           ${HTTP_SERVER}
    HTTP_Listen           0.0.0.0
    HTTP_Port         ${HTTP_PORT}
    storage.path          /var/fluent-bit/state/flb-storage/
    storage.sync          normal
    storage.checksum      off
    storage.backlog.mem_limit 5M
@INCLUDE application-log.conf
@INCLUDE dataplane-log.conf
@INCLUDE host-log.conf
"""
# application-log.conf
application_log_conf = """
[INPUT]
    Name        tail
    Tag     application.<namespace_name>.<deployment_name>.log
    Tag_Regex   ^\/var\/log\/containers\/(?<pod_name>[^_]+)_(?<namespace_name>[^_]+)_(?<deployment_name>[^_]+)-(?<docker_id>[a-z0-9]{64})\.log$
    Exclude_Path    /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
    Path        /var/log/containers/*.log
    multiline.parser    docker, cri
    DB          /var/fluent-bit/state/flb_container.db
    Mem_Buf_Limit       50MB
    Skip_Long_Lines     On
    Refresh_Interval    10
    Rotate_Wait     30
    storage.type    filesystem
    Read_from_Head      ${READ_FROM_HEAD}
[INPUT]
    Name        tail
    Tag         application.*
    Path        /var/log/containers/cloudwatch-agent*
    multiline.parser    docker, cri
    DB          /var/fluent-bit/state/flb_cwagent.db
    Mem_Buf_Limit       5MB
    Skip_Long_Lines     On
    Refresh_Interval    10
    Read_from_Head      ${READ_FROM_HEAD}
[FILTER]
    Name        kubernetes
    Match           application.*
    Kube_URL        https://kubernetes.default.svc:443
    Kube_Tag_Prefix     application.
    Regex_Parser        k8s-custom-tag
    Merge_Log       On
    Merge_Log_Key       log_processed
    K8S-Logging.Parser  On
    K8S-Logging.Exclude Off
    Labels          Off
    Annotations     Off
    Use_Kubelet     On
    Kubelet_Port    10250
    Buffer_Size     0
[OUTPUT]
    Name        cloudwatch_logs
    Match           application.*
    region          ${AWS_REGION}
    log_group_name      /${CLUSTER_NAME}/application
    log_stream_prefix   log-
    auto_create_group   true
    extra_user_agent    container-insights
"""

# dataplane-log.conf
dataplane_log_conf = """
[INPUT]
    Name        systemd
    Tag         dataplane.systemd.*
    Systemd_Filter      _SYSTEMD_UNIT=docker.service
    Systemd_Filter      _SYSTEMD_UNIT=containerd.service
    Systemd_Filter      _SYSTEMD_UNIT=kubelet.service
    DB          /var/fluent-bit/state/systemd.db
    Path        /var/log/journal
    Read_From_Tail      ${READ_FROM_TAIL}
[INPUT]
    Name        tail
    Tag         dataplane.tail.*
    Path        /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
    multiline.parser    docker, cri
    DB          /var/fluent-bit/state/flb_dataplane_tail.db
    Mem_Buf_Limit       50MB
    Skip_Long_Lines     On
    Refresh_Interval    10
    Rotate_Wait     30
    storage.type    filesystem
    Read_from_Head      ${READ_FROM_HEAD}
[FILTER]
    Name        modify
    Match           dataplane.systemd.*
    Rename          _HOSTNAME           hostname
    Rename          _SYSTEMD_UNIT           systemd_unit
    Rename          MESSAGE             message
    Remove_regex    ^((?!hostname|systemd_unit|message).)*$
[FILTER]
    Name        aws
    Match           dataplane.*
    imds_version    v1
[OUTPUT]
    Name        cloudwatch_logs
    Match           dataplane.*
    region          ${AWS_REGION}
    log_group_name      /${CLUSTER_NAME}/dataplane
    log_stream_prefix   log-
    auto_create_group   true
    extra_user_agent    container-insights
"""
# host-log.conf
host_log_conf = """
[INPUT]
    Name        tail
    Tag         host.dmesg
    Path        /var/log/dmesg
    Key         message
    DB          /var/fluent-bit/state/flb_dmesg.db
    Mem_Buf_Limit       5MB
    Skip_Long_Lines     On
    Refresh_Interval    10
    Read_from_Head      ${READ_FROM_HEAD}
[INPUT]
    Name        tail
    Tag         host.messages
    Path        /var/log/messages
    Parser          syslog
    DB          /var/fluent-bit/state/flb_messages.db
    Mem_Buf_Limit       5MB
    Skip_Long_Lines     On
    Refresh_Interval    10
    Read_from_Head      ${READ_FROM_HEAD}
[INPUT]
    Name        tail
    Tag         host.secure
    Path        /var/log/secure
    Parser          syslog
    DB          /var/fluent-bit/state/flb_secure.db
    Mem_Buf_Limit       5MB
    Skip_Long_Lines     On
    Refresh_Interval    10
    Read_from_Head      ${READ_FROM_HEAD}
[FILTER]
    Name        aws
    Match           host.*
    imds_version    v1
[OUTPUT]
    Name        cloudwatch_logs
    Match           host.*
    region          ${AWS_REGION}
    log_group_name      /${CLUSTER_NAME}/host
    log_stream_prefix   log-
    auto_create_group   true
    extra_user_agent    container-insights
"""
# parsers.conf
parsers_conf = r"""
[PARSER]
    Name        syslog
    Format          regex
    Regex           ^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
    Time_Key        time
    Time_Format     %b %d %H:%M:%S
[PARSER]
    Name        container_firstline
    Format          regex
    Regex           (?<log>(?<="log":")\S(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
    Time_Key        time
    Time_Format     %Y-%m-%dT%H:%M:%S.%LZ
[PARSER]
    Name        cwagent_firstline
    Format          regex
    Regex           (?<log>(?<="log":")\d{4}[\/-]\d{1,2}[\/-]\d{1,2}[ T]\d{2}:\d{2}:\d{2}(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
    Time_Key        time
    Time_Format     %Y-%m-%dT%H:%M:%S.%LZ
[PARSER]
    Name        k8s-custom-tag
    Format      regex
    Regex       ^(?<namespace_name>[^.]+)\.(?<deployment_name>[^.]+)\.log$
"""

Code:

https://github.com/omidraha/pulumi_example/blob/main/cw/cw.py
https://github.com/omidraha/pulumi_example/blob/main/cw/setup.py
