
Frequent errors when using output plugin cloudwatch_logs #274

Open
yangyang919 opened this issue Dec 1, 2021 · 29 comments

@yangyang919

yangyang919 commented Dec 1, 2021

Describe the question/issue

I know this is not a new issue, but I'm wondering whether the root cause has been identified. We are using the cloudwatch_logs plugin to send logs from our Pods to AWS CloudWatch. The errors look like below:

[2021/12/01 01:53:50] [ warn] [engine] failed to flush chunk '1-1638323628.679735747.flb', retry in 10 seconds: task_id=0, input=emitter_for_rewrite_tag.5 > output=http.2 (out_id=2)
[2021/12/01 02:02:41] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/12/01 02:02:41] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/12/01 02:02:41] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:02:41] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:02:41] [ warn] [engine] failed to flush chunk '1-1638324161.187277303.flb', retry in 7 seconds: task_id=0, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/01 02:02:55] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/12/01 02:02:55] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/12/01 02:02:55] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:02:55] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:02:55] [ warn] [engine] failed to flush chunk '1-1638324175.42999390.flb', retry in 7 seconds: task_id=0, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/01 02:04:29] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/12/01 02:04:29] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/12/01 02:04:29] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:04:29] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:04:29] [ warn] [engine] failed to flush chunk '1-1638324268.678872073.flb', retry in 9 seconds: task_id=0, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/01 02:04:38] [ warn] [http_client] malformed HTTP response from logs.eu-central-1.amazonaws.com:443 on connection #34
[2021/12/01 02:04:38] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:04:38] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:04:38] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:04:38] [ warn] [engine] chunk '1-1638324268.678872073.flb' cannot be retried: task_id=0, input=tail.0 > output=cloudwatch_logs.3
[2021/12/01 02:07:49] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/12/01 02:07:49] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/12/01 02:07:49] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:07:49] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:07:49] [ warn] [engine] failed to flush chunk '1-1638324469.352128961.flb', retry in 11 seconds: task_id=0, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/01 02:08:00] [ warn] [http_client] malformed HTTP response from logs.eu-central-1.amazonaws.com:443 on connection #102
[2021/12/01 02:08:00] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:08:00] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:08:00] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:08:00] [ warn] [engine] chunk '1-1638324469.352128961.flb' cannot be retried: task_id=0, input=tail.0 > output=cloudwatch_logs.3
[2021/12/01 02:10:31] [ warn] [http_client] malformed HTTP response from logs.eu-central-1.amazonaws.com:443 on connection #102
[2021/12/01 02:10:31] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:10:31] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:10:31] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:10:31] [ warn] [engine] failed to flush chunk '1-1638324630.866492143.flb', retry in 9 seconds: task_id=0, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/01 02:11:41] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/12/01 02:11:41] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Recieved code 200 but response was invalid, x-amzn-RequestId header not found

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    Parser custom_cri
    Tag kube.*
    Mem_Buf_Limit 5MB
    Skip_Long_Lines On

[INPUT]
    Name systemd
    Tag host.*
    Systemd_Filter _SYSTEMD_UNIT=kubelet.service
    Read_From_Tail On

[FILTER]
    Name record_modifier
    Match *
    Record cluster_identifier {{ .Values.cluster }}
[FILTER]
    Name kubernetes
    Match kube.*
    Merge_Log On
    Keep_Log Off
    K8S-Logging.Parser On
    K8S-Logging.Exclude On
[FILTER]
    Name                lua
    Match               kube.*
    script              kubernetes_map_keys.lua
    call                kubernetes_map_keys
[FILTER]
    Name                lua
    Match               *
    script              kubernetes_map_keys.lua
    call                add_source_type
[FILTER]
    Name                lua
    Match               *
    script              kubernetes_map_keys.lua
    call                add_aligned_sent_received
[FILTER]
    Name                rewrite_tag
    Match               kube.*
    Rule                $source_type ^(istio-envoy-proxy-access-log)$ istio-envoy false

[OUTPUT]
    Name s3
    Match kube.*
    bucket upscale-sre-fluentbit-logs-dev-use1
    region {{ .Values.region }}
    total_file_size 100M
    upload_timeout 5m
    s3_key_format /%Y/%m/%d/%H/
    s3_key_format_tag_delimiters .

[OUTPUT]
    Name                cloudwatch_logs
    Match               kube.*
    region              {{ .Values.region }}
    log_group_name      fluent-bit-cloudwatch/{{ .Values.clusterName }}
    log_stream_prefix   from-fluent-bit-
    auto_create_group   true
@PettitWesley
Contributor

The errors look like below:

Don't see anything below 😐

@yangyang919
Author

Just added...hit the wrong button...

@PettitWesley
Contributor

What version are you using? I think it's something old:

cannot increase buffer: current=4096 requested=36864 max=4096

I believe this was fixed a while ago.

Also, try the auto_retry_requests option in newer versions as well: https://github.com/aws/aws-for-fluent-bit/pull/266/files
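
For reference, a minimal sketch of where that option goes, reusing the cloudwatch_logs output from the config above (availability of auto_retry_requests depends on your Fluent Bit build, so treat this as an assumption to verify against the docs):

[OUTPUT]
    Name                cloudwatch_logs
    Match               kube.*
    region              {{ .Values.region }}
    log_group_name      fluent-bit-cloudwatch/{{ .Values.clusterName }}
    log_stream_prefix   from-fluent-bit-
    auto_create_group   true
    # immediately retry a request once on transient network errors
    # (assumption: this option is supported by the version in use)
    auto_retry_requests true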

@yangyang919
Author

What version are you using? I think it's something old:

cannot increase buffer: current=4096 requested=36864 max=4096

I believe this was fixed a while ago.

Also, try the auto_retry_requests option in newer versions as well: https://github.com/aws/aws-for-fluent-bit/pull/266/files

The version we use is 1.8.6, I believe. Is this version too old?

@PettitWesley
Contributor

@yangyang919 Unfortunately there have been a lot of bug fixes throughout the 1.8.x series, so yeah, 1.8.6 is a bit old.

Please use the AWS distro, the latest version is on 1.8.9: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.21.2

@yangyang919
Author

Thanks @PettitWesley, we'll upgrade the version and get back to you.

@yangyang919
Author

yangyang919 commented Dec 9, 2021

Hi @PettitWesley ,

We have upgraded Fluent Bit to version 1.8.10. A new issue appeared: the Pod gets OOMKilled and back-off restarts frequently. If we remove the CloudWatch output part, everything is normal.

Our config is below; do you see any problem?

[SERVICE]
    Daemon Off
    Flush 1
    Log_Level warn
    Parsers_File parsers.conf
    Parsers_File custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
    Health_Check On

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    Parser custom_cri
    Tag kube.*
    Mem_Buf_Limit 128MB
    Skip_Long_Lines On

[INPUT]
    Name systemd
    Tag host.*
    Systemd_Filter _SYSTEMD_UNIT=kubelet.service
    Read_From_Tail On

[FILTER]
    Name record_modifier
    Match *
    Record cluster_identifier stage-use1-saml

[FILTER]
    Name kubernetes
    Match kube.*
    Merge_Log On
    Keep_Log Off
    K8S-Logging.Parser On
    K8S-Logging.Exclude On

[FILTER]
    Name                lua
    Match               kube.*
    script              kubernetes_map_keys.lua
    call                kubernetes_map_keys

[FILTER]
    Name                lua
    Match               *
    script              kubernetes_map_keys.lua
    call                add_source_type

[FILTER]
    Name                lua
    Match               *
    script              kubernetes_map_keys.lua
    call                add_aligned_sent_received

[FILTER]
    Name                rewrite_tag
    Match               kube.*
    Rule                $source_type ^(istio-envoy-proxy-access-log)$ istio-envoy false

[OUTPUT]
    Name http
    Match kube.*
    Retry_Limit False
    Host host
    port 443
    uri /
    http_User user
    http_Passwd pwd
    format json
    tls on
    tls.debug 1

[OUTPUT]
    Name http
    Match approuter
    Retry_Limit False
    Host host
    port 443
    uri /
    http_User user
    http_Passwd pwd
    format json
    tls on
    tls.debug 1

[OUTPUT]
    Name http
    Match istio-envoy
    Retry_Limit False
    Host host
    port 443
    uri /customindex/istio-envoy
    http_User user
    http_Passwd pwd
    format json
    tls on
    tls.debug 1






[OUTPUT]
    Name                cloudwatch_logs
    Match               kube.*
    region              us-east-1
    log_group_name      fluent-bit-cloudwatch/stage-use1
    log_stream_prefix   from-fluent-bit-
    auto_create_group   true

[OUTPUT]
    Name                cloudwatch_logs
    Match               approuter
    region              us-east-1
    log_group_name      fluent-bit-cloudwatch/stage-use1
    log_stream_prefix   from-fluent-bit-
    auto_create_group   true

@DrewZhang13
Contributor

DrewZhang13 commented Dec 9, 2021

A memory growth issue was reported by another customer for AWS for Fluent Bit version 2.21.1, which is Fluent Bit version 1.8.9: #277

@PettitWesley
Contributor

Please also consider checking out relevant sections of our new debugging guide: https://github.com/aws/aws-for-fluent-bit/pull/266/files

@yangyang919
Author

yangyang919 commented Dec 10, 2021

Hi @PettitWesley @DrewZhang13

Thanks for replying. Here is what we observed:

  1. As you can see in our config, the outputs are sent to two places: one is FluentD for Elasticsearch, and the other is AWS CloudWatch.
  2. If both outputs are enabled, the pods suffer very frequent restarts due to OOMKilled.
  3. If only CloudWatch or only FluentD is enabled, OOMKills become much less frequent. They still happen, but very rarely.

So I'm thinking maybe tuning parameters can solve this issue. Our fluent-bit pod has resource limits like these:
Limits:
  cpu:     200m
  memory:  512Mi
Requests:
  cpu:     100m
  memory:  256Mi

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    Parser custom_cri
    Tag kube.*
    Mem_Buf_Limit 128MB
    Skip_Long_Lines On

You have seen more cases like ours. Do you see any parameters that could be optimized?

Some logs before it gets OOMKilled:

[2021/12/10 11:15:50] [debug] [out coro] cb_destroy coro_id=352
[2021/12/10 11:15:50] [ info] [engine] chunk '1-1639134890.24990947.flb' is not retried (no retry config): task_id=7, input=tail.0 > output=http.0 (out_id=0)
[2021/12/10 11:15:50] [debug] [task] destroy task=0x7f2633a37460 (task_id=7)
[2021/12/10 11:15:50] [debug] [socket] could not validate socket status for #99 (don't worry)
[2021/12/10 11:15:52] [debug] [input:tail:tail.0] inode=1166440 events: IN_MODIFY
[2021/12/10 11:15:53] [debug] [task] created task=0x7f2633a37460 id=7 OK
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: before SSL initialization
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS write client hello
[2021/12/10 11:15:53] [debug] [tls] connection #99 WANT_READ
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS write client hello
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS read server hello
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: TLSv1.3 read encrypted extensions
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS read server certificate
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: TLSv1.3 read server certificate verify
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS read finished
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS write change cipher spec
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS write finished
[2021/12/10 11:15:53] [debug] [http_client] not using http_proxy for header
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSL negotiation finished successfully
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSL negotiation finished successfully
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS read server session ticket
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSL negotiation finished successfully
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSL negotiation finished successfully
[2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS read server session ticket
[2021/12/10 11:15:54] [debug] [input:tail:tail.0] inode=1294459 events: IN_MODIFY
[2021/12/10 11:15:54] [ info] [output:http:http.0] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200
[2021/12/10 11:15:54] [debug] [upstream] KA connection #110 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available
[2021/12/10 11:15:54] [debug] [out coro] cb_destroy coro_id=416
[2021/12/10 11:15:54] [debug] [task] destroy task=0x7f2633a3b560 (task_id=8)
[2021/12/10 11:15:54] [debug] [task] created task=0x7f2633a3b560 id=8 OK
[2021/12/10 11:15:54] [debug] [upstream] KA connection #118 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 has been assigned (recycled)
[2021/12/10 11:15:54] [debug] [http_client] not using http_proxy for header
[2021/12/10 11:15:54] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sending 1 events
[2021/12/10 11:15:54] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sending log events to log stream from-fluent-bit-kube.var.log.containers.calico-node-7mgp2_kube-system_calico-node-87a7327d7d88b919f6b8fb8b7e42f3458010b8dc26292a5725636d1a96ee15c4.log
[2021/12/10 11:15:54] [debug] [upstream] KA connection #44 to logs.us-east-1.amazonaws.com:443 has been assigned (recycled)
[2021/12/10 11:15:54] [debug] [http_client] not using http_proxy for header
[2021/12/10 11:15:54] [debug] [aws_credentials] Requesting credentials from the env provider..
[2021/12/10 11:15:54] [ warn] [http_client] malformed HTTP response from logs.us-east-1.amazonaws.com:443 on connection #44
[2021/12/10 11:15:54] [debug] [aws_client] logs.us-east-1.amazonaws.com: http_do=-1, HTTP Status: 0
[2021/12/10 11:15:54] [debug] [upstream] KA connection #44 to logs.us-east-1.amazonaws.com:443 is now available
[2021/12/10 11:15:54] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/10 11:15:54] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/10 11:15:54] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/10 11:15:54] [debug] [out coro] cb_destroy coro_id=421
[2021/12/10 11:15:54] [debug] [retry] new retry created for task_id=8 attempts=1
[2021/12/10 11:15:54] [ warn] [engine] failed to flush chunk '1-1639134954.235814243.flb', retry in 8 seconds: task_id=8, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/10 11:15:54] [debug] [upstream] KA connection #44 to logs.us-east-1.amazonaws.com:443 has been disconnected by the remote service
[2021/12/10 11:15:54] [debug] [socket] could not validate socket status for #44 (don't worry)
[2021/12/10 11:15:55] [ info] [output:http:http.0] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200
[2021/12/10 11:15:55] [debug] [upstream] KA connection #104 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available
[2021/12/10 11:15:55] [debug] [out coro] cb_destroy coro_id=413
[2021/12/10 11:15:55] [debug] [task] destroy task=0x7f26299b8fa0 (task_id=1)
[2021/12/10 11:15:55] [ info] [output:http:http.0] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200
[2021/12/10 11:15:55] [debug] [upstream] KA connection #118 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available
[2021/12/10 11:15:55] [debug] [out coro] cb_destroy coro_id=420
[2021/12/10 11:15:55] [ info] [output:http:http.0] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200
[2021/12/10 11:15:55] [debug] [upstream] KA connection #120 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available
[2021/12/10 11:15:55] [debug] [out coro] cb_destroy coro_id=414
[2021/12/10 11:15:55] [debug] [task] destroy task=0x7f2633a3bf60 (task_id=4)
[2021/12/10 11:16:02] [ info] [output:http:http.0] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200
[2021/12/10 11:16:02] [debug] [upstream] KA connection #108 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available
[2021/12/10 11:16:02] [debug] [upstream] KA connection #108 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 has been disconnected by the remote service
[2021/12/10 11:16:02] [debug] [out coro] cb_destroy coro_id=390
[2021/12/10 11:16:02] [debug] [task] destroy task=0x7f2633a3b420 (task_id=3)
[2021/12/10 11:16:02] [debug] [socket] could not validate socket status for #108 (don't worry)
[2021/12/10 11:16:02] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sending 1 events
[2021/12/10 11:16:02] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sending log events to log stream from-fluent-bit-kube.var.log.containers.calico-node-7mgp2_kube-system_calico-node-87a7327d7d88b919f6b8fb8b7e42f3458010b8dc26292a5725636d1a96ee15c4.log
[2021/12/10 11:16:02] [debug] [http_client] not using http_proxy for header
[2021/12/10 11:16:02] [debug] [aws_credentials] Requesting credentials from the env provider..
[2021/12/10 11:16:02] [debug] [upstream] KA connection #108 to logs.us-east-1.amazonaws.com:443 is now available
[2021/12/10 11:16:02] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] PutLogEvents http status=200
[2021/12/10 11:16:02] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sent events to from-fluent-bit-kube.var.log.containers.calico-node-7mgp2_kube-system_calico-node-87a7327d7d88b919f6b8fb8b7e42f3458010b8dc26292a5725636d1a96ee15c4.log
[2021/12/10 11:16:02] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sent 1 events to CloudWatch
[2021/12/10 11:16:02] [debug] [out coro] cb_destroy coro_id=422
[2021/12/10 11:16:02] [debug] [task] destroy task=0x7f2633a3b560 (task_id=8)
[2021/12/10 11:16:02] [debug] [input:tail:tail.0] inode=1166440 events: IN_MODIFY
[2021/12/10 11:16:02] [debug] [input:tail:tail.0] inode=1166440 events: IN_MODIFY
[2021/12/10 11:16:03] [debug] [task] created task=0x7f2633a3b560 id=1 OK
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: before SSL initialization
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS write client hello
[2021/12/10 11:16:03] [debug] [tls] connection #102 WANT_READ
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS write client hello
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS read server hello
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: TLSv1.3 read encrypted extensions
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS read server certificate
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: TLSv1.3 read server certificate verify
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS read finished
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS write change cipher spec
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS write finished
[2021/12/10 11:16:03] [debug] [http_client] not using http_proxy for header
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSL negotiation finished successfully
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSL negotiation finished successfully
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS read server session ticket
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSL negotiation finished successfully
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSL negotiation finished successfully
[2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS read server session ticket
[2021/12/10 11:16:04] [ info] [output:http:http.2] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200
[2021/12/10 11:16:04] [debug] [upstream] KA connection #102 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available
[2021/12/10 11:16:04] [debug] [out coro] cb_destroy coro_id=59
[2021/12/10 11:16:04] [debug] [task] destroy task=0x7f2633a3b560 (task_id=1)
[2021/12/10 11:16:04] [ info] [output:http:http.0] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200
[2021/12/10 11:16:04] [debug] [upstream] KA connection #138 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available
[2021/12/10 11:16:04] [debug] [out coro] cb_destroy coro_id=396
[2021/12/10 11:16:04] [debug] [task] destroy task=0x7f26299b9040 (task_id=5)
[2021/12/10 11:16:04] [debug] [upstream] KA connection #138 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 has been disconnected by the remote service
[2021/12/10 11:16:04] [debug] [socket] could not validate socket status for #138 (don't worry)
[2021/12/10 11:16:05] [ info] [output:http:http.0] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200
[2021/12/10 11:16:05] [debug] [upstream] KA connection #105 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available
[2021/12/10 11:16:05] [debug] [out coro] cb_destroy coro_id=410
[2021/12/10 11:16:05] [debug] [task] destroy task=0x7f2633a3a020 (task_id=9)
[2021/12/10 11:16:09] [debug] [input:tail:tail.0] inode=1166155 events: IN_MODIFY
[2021/12/10 11:16:09] [debug] [task] created task=0x7f2633a3a020 id=1 OK
[2021/12/10 11:16:09] [debug] [upstream] KA connection #102 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 has been assigned (recycled)
[2021/12/10 11:16:09] [debug] [http_client] not using http_proxy for header
[2021/12/10 11:16:09] [debug] [input:tail:tail.0] inode=1170715 events: IN_MODIFY
[2021/12/10 11:16:09] [ info] [output:http:http.2] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200
[2021/12/10 11:16:09] [debug] [upstream] KA connection #102 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available
[2021/12/10 11:16:09] [debug] [out coro] cb_destroy coro_id=60
[2021/12/10 11:16:09] [debug] [task] destroy task=0x7f2633a3a020 (task_id=1)
[2021/12/10 11:16:10] [debug] [task] created task=0x7f2633a3a020 id=1 OK
[2021/12/10 11:16:10] [debug] [upstream] KA connection #110 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 has been assigned (recycled)
[2021/12/10 11:16:10] [debug] [http_client] not using http_proxy for header
[2021/12/10 11:16:10] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sending 1 events
[2021/12/10 11:16:10] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sending log events to log stream from-fluent-bit-kube.var.log.containers.oneagent-7tvjp_dynatrace_dynatrace-oneagent-10f778095fd6aa924c26a41c818872058bcf3c3c0a9832f2db0c96e73eda2bf9.log
[2021/12/10 11:16:10] [debug] [upstream] KA connection #108 to logs.us-east-1.amazonaws.com:443 has been assigned (recycled)
[2021/12/10 11:16:10] [debug] [http_client] not using http_proxy for header
[2021/12/10 11:16:10] [debug] [aws_credentials] Requesting credentials from the env provider..

@PettitWesley
Contributor

@yangyang919

We have upgraded Fluent Bit to version 1.8.10. A new issue appeared: the Pod gets OOMKilled and back-off restarts frequently.
As you can see in our config, the outputs are sent to two places: one is FluentD for Elasticsearch, and the other is AWS CloudWatch.

Which outputs are seeing errors? (Both? Or one more than the other?) Usually OOMKills/high memory usage are caused by errors in Fluent Bit which lead to retries, which means the logs pile up in the buffer. Consider using the monitoring interface to get error counts: https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit

Also, what are the error messages?

So I'm thinking maybe tuning parameters can solve this issue. Our fluent-bit pod has resource limits like these:
Limits: cpu: 200m memory: 512Mi Requests: cpu: 100m memory: 256Mi
Mem_Buf_Limit 128MB

So Mem_Buf_Limit is the amount of data the input can buffer in memory. (Note that you can also have file system buffering as an additional layer, though memory will still be used in that case and the usage can still be high. Don't think of it as filesystem instead of memory; think of it as filesystem in addition to memory...)
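
For illustration, a minimal sketch of what that additional filesystem layer looks like in config (the path and sizes are examples rather than recommendations; the option names are the standard Fluent Bit storage settings):

[SERVICE]
    # directory where chunks are persisted when they spill out of memory (example path)
    storage.path              /var/log/fluentbit-storage
    storage.backlog.mem_limit 10M

[INPUT]
    Name            tail
    Path            /var/log/containers/*.log
    Tag             kube.*
    Mem_Buf_Limit   128MB
    # buffer chunks on disk in addition to memory
    storage.type    filesystem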

When data is sent, it's copied into a new buffer for each output, and then the output might create additional buffers for the HTTP requests and for processing. Thus, it is very reasonable for the following to be true:

"actual memory usage" > "# of outputs" * "Total Mem_Buf_Limit for inputs"

That's just for normal/happy cases (and if your pods are producing tons of logs); in error cases, it could be much more...
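
As a rough worked example with the numbers posted above (an upper bound that ignores which tags each output actually matches): five outputs times the 128 MB Mem_Buf_Limit on tail is already 640 MB, above the 512 Mi container limit, before any per-request or filter buffers are counted.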

@yangyang919
Author

@PettitWesley Thanks for replying! This is a very important message for us:

"actual memory usage" > "# of outputs" * "Total Mem_Buf_Limit for inputs"

We have 6 outputs in total: 3 for FluentD and 3 for CloudWatch. With Mem_Buf_Limit at 128MB, the actual usage for FluentD alone is already at least 3*128MB. We found that FluentD has many retry logs, while CloudWatch is quite stable.

I'll lower the Mem_Buf_Limit to verify this.

@PettitWesley
Contributor

@yangyang919 Remember that lowering the Mem_Buf_Limit could lead to log loss under high throughput, so you should check the Fluent Bit logs for paused/overlimit warnings.
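
For reference, the paused/overlimit warning looks like this (taken verbatim from the logs posted later in this thread):

[2021/12/14 06:38:32] [ warn] [input] tail.0 paused (mem buf overlimit)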

@yangyang919
Author

@PettitWesley Yes, thanks for the reminder. I have also enabled file system storage. But to really solve this, we need to make sure the FluentD side digests our logs; otherwise, no matter how large a buffer we set, it will eventually fill up.
Another question: is there a way to monitor live buffer consumption in Fluent Bit?

@PettitWesley
Contributor

PettitWesley commented Dec 13, 2021

@yangyang919 See this: https://docs.fluentbit.io/manual/administration/monitoring#metrics-examples

I think there are disk storage metrics too but only if you use the config option noted here: https://docs.fluentbit.io/manual/administration/monitoring#rest-api-interface
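
For reference, a minimal sketch of the service settings involved, with the endpoints noted as comments (the listen address and port mirror the config already posted in this thread):

[SERVICE]
    HTTP_Server     On
    HTTP_Listen     0.0.0.0
    HTTP_Port       2020
    # expose per-input chunk and buffer usage under /api/v1/storage
    storage.metrics On
# once enabled, query the pod on port 2020:
#   /api/v1/metrics  (records, errors and retries per plugin)
#   /api/v1/storage  (chunk counts and buffer sizes per input)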

@yangyang919
Author

yangyang919 commented Dec 13, 2021

Thanks @PettitWesley, I can see that the buffer is fully consumed by the tail plugin. A further question: we have a bunch of filters; is there a way to check memory consumption for those plugins as well?
I already set Mem_Buf_Limit to 20MB, so with 6 outputs that is 120MB maximum. The requested memory is now 320MB, but over a 4-hour period we still see 5 out of 30 Pods restarted once due to OOM (which is already much, much better). I suspect some other plugins are eating the memory...

{"storage_layer":{"chunks":{"total_chunks":10,"mem_chunks":10,"fs_chunks":0,"fs_chunks_up":0,"fs_chunks_down":0}},"input_chunks":{"tail.0":{"status":{"overlimit":true,"mem_size":"19.1M","mem_limit":"19.1M"},"chunks":{"total":10,"up":10,"down":0,"busy":9,"busy_size":"17.6M"}},"systemd.1":{"status":{"overlimit":false,"mem_size":"0b","mem_limit":"4.8M"},"chunks":{"total":0,"up":0,"down":0,"busy":0,"busy_size":"0b"}},"storage_backlog.2":{"status":{"overlimit":false,"mem_size":"0b","mem_limit":"0b"},"chunks":{"total":0,"up":0,"down":0,"busy":0,"busy_size":"0b"}},"emitter_for_rewrite_tag.5":{"status":{"overlimit":false,"mem_size":"0b","mem_limit":"9.5M"},"chunks":{"total":0,"up":0,"down":0,"busy":0,"busy_size":"0b"}}}}

@PettitWesley
Contributor

@yangyang919 Unfortunately there's currently no way to see plugin-specific memory usage.

Also, remember the formula I gave used > not =; each plugin will have all sorts of temporary processing buffers, so it's possible for usage to be higher.

@yangyang919
Author

Thanks @PettitWesley, today I found very strange behavior. One Pod got restarted 3 times, but its memory consumption is definitely below the requested resources (request 400M, limit 800M). Do you have any thoughts?
[screenshot: "Screen Shot 2021-12-14 at 1 50 12 PM" (pod memory usage graph)]

@PettitWesley
Contributor

@yangyang919 Do the logs contain any evidence that it crashed? Can you get the stopped reason from k8s?

@yangyang919
Author

@PettitWesley The reason is OOMKilled...

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Tue, 14 Dec 2021 14:23:08 +0800
  Finished:     Tue, 14 Dec 2021 14:39:00 +0800

The logs before it gets restarted seem OK:

[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/12/14 06:38:32] [debug] [task] created task=0x7fa2f4e3aa20 id=1 OK
[2021/12/14 06:38:32] [debug] [task] created task=0x7fa2f4e3b060 id=2 OK
[2021/12/14 06:38:32] [debug] [task] created task=0x7fa2f4e3b1a0 id=3 OK
[2021/12/14 06:38:32] [debug] [task] created task=0x7fa2f4e3b2e0 id=4 OK
[2021/12/14 06:38:32] [debug] [task] created task=0x7fa2f4e3b4c0 id=5 OK
[2021/12/14 06:38:32] [debug] [task] created task=0x7fa2f4e3b600 id=6 OK
[2021/12/14 06:38:32] [debug] [upstream] KA connection #93 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 has been assigned (recycled)
[2021/12/14 06:38:32] [debug] [http_client] not using http_proxy for header
[2021/12/14 06:38:32] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] Sending 3 events
[2021/12/14 06:38:32] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] Sending log events to log stream from-fluent-bit-kube.var.log.containers.shipping-service-67cc57c794-vgmnd_upscale_istio-proxy-ea0865f2af6aaacf16da4f2e79ce8b3a040293d0c57f734c52a1b2ac28e97ab9.log
[2021/12/14 06:38:32] [debug] [upstream] KA connection #36 to logs.us-east-1.amazonaws.com:443 has been assigned (recycled)
[2021/12/14 06:38:32] [debug] [http_client] not using http_proxy for header
[2021/12/14 06:38:32] [debug] [aws_credentials] Requesting credentials from the env provider..

@PettitWesley
Contributor

@yangyang919 If K8s says it was OOMKilled I would assume that isn't a lie and is true... so somehow the memory usage spiked I guess.. the graph does show that right? It just isn't as high as you expect.

But I found this article: https://blog.freshtracks.io/a-deep-dive-into-kubernetes-metrics-part-3-container-resource-metrics-361c5ee46e66

And it says:

You might think that memory utilization is easily tracked with container_memory_usage_bytes, however, this metric also includes cached (think filesystem cache) items that can be evicted under memory pressure. The better metric is container_memory_working_set_bytes as this is what the OOM killer is watching for.

@yangyang919
Author

@PettitWesley Really weird... the pod below restarted once in the last 60 minutes due to OOMKilled, but container_memory_working_set_bytes still looks normal...

State:          Running
  Started:      Tue, 14 Dec 2021 15:02:15 +0800
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Tue, 14 Dec 2021 14:22:05 +0800
  Finished:     Tue, 14 Dec 2021 15:02:13 +0800
Ready:          True

[screenshot: "Screen Shot 2021-12-14 at 3 23 35 PM" (container_memory_working_set_bytes graph)]

@PettitWesley
Contributor

@yangyang919

[2021/12/14 06:38:32] [ warn] [input] tail.0 paused (mem buf overlimit)

You do have mem buf overlimit warning. Also, I wonder if the memory spikes very suddenly in your case, and the graph never catches up to the real memory usage. Each crash has a sudden increase in memory right before it. This is my guess.

As to what is causing that... I am not sure... no retries or other errors?

@yangyang919
Author

yangyang919 commented Dec 15, 2021

@PettitWesley
Yes, retries are set to unlimited. From the metrics API we can see there are quite a few retries, which means the output destination is somewhat stuck and requests are not flowing smoothly. This is something I'm checking with the FluentD side: why they keep rejecting our requests.

But I still suspect there is a memory leak somewhere in our Fluent Bit, though I'm not sure which plugin is leaking. The memory grows very fast, basically vertically, and then the pod gets OOMKilled. If I enable only the outputs to FluentD or only the outputs to CloudWatch, everything is fine; if both are enabled, OOMKills happen. I attached the config below; this is currently the most stable version. With it there are still restarts due to OOMKilled, but not as frequent. Some points worth mentioning:

  1. If Flush is changed from 5 to 10 with the other config unchanged, OOMKills become very frequent, around 3 times in 10 minutes.
  2. The inputs have a memory buffer limit and the storage layer has a memory limit. I tried disabling the filter plugins one by one, but there is still no strong proof that any of them leaks memory.

service: |
[SERVICE]
Daemon Off
Flush 5
Log_Level {{ .Values.logLevel }}
Parsers_File parsers.conf
Parsers_File custom_parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port {{ .Values.service.port }}
Health_Check On
storage.path /var/log/fluentbit
storage.sync normal
storage.checksum Off
storage.backlog.mem_limit 10M
storage.metrics On
inputs: |
[INPUT]
Name tail
Path /var/log/containers/*.log
Parser custom_cri
Tag kube.*
Mem_Buf_Limit 10MB
Skip_Long_Lines On
[INPUT]
Name systemd
Tag host.*
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
Read_From_Tail On
Mem_Buf_Limit 5MB
filters: |
[FILTER]
Name record_modifier
Match *
Record cluster_identifier {{ .Values.cluster }}
[FILTER]
Name kubernetes
Match kube.*
Merge_Log On
Keep_Log Off
K8S-Logging.Parser On
K8S-Logging.Exclude On
Buffer_Size 0
[FILTER]
Name lua
Match kube.*
script kubernetes_map_keys.lua
call kubernetes_map_keys
[FILTER]
Name lua
Match *
script kubernetes_map_keys.lua
call add_source_type
[FILTER]
Name lua
Match *
script kubernetes_map_keys.lua
call add_aligned_sent_received
[FILTER]
Name rewrite_tag
Match kube.*
Rule $source_type ^(istio-envoy-proxy-access-log)$ istio-envoy false
outputs: |
[OUTPUT]
Name http
Match kube.*
Retry_Limit False
Host <path:kv-v2/#Host>
port 443
uri /
http_User <path:kv-v2/#http_User>
http_Passwd <path:kv-v2/#http_Passwd>
format json
tls on
tls.debug 1

[OUTPUT]
    Name http
    Match istio-envoy
    Retry_Limit False
    Host <path:kv-v2/data/fluent-bit/{{ .Values.cluster }}#Host>
    port 443
    uri /customindex/istio-envoy
    http_User <path:kv-v2/#http_User>
    http_Passwd <path:kv-v2/#http_Passwd>
    format json
    tls on
    tls.debug 1

[OUTPUT]
    Name                cloudwatch_logs
    Match               kube.*
    region              {{ .Values.region }}
    log_group_name      fluent-bit-cloudwatch/{{ .Values.clusterName }}
    log_stream_prefix   from-fluent-bit-
    auto_create_group   true

[OUTPUT]
    Name                cloudwatch_logs
    Match               istio-envoy
    region              {{ .Values.region }}
    log_group_name      fluent-bit-cloudwatch/{{ .Values.clusterName }}
    log_stream_prefix   from-fluent-bit-
    auto_create_group   true

customParsers: |
[PARSER]
Name docker_no_time
Format json
Time_Keep Off
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
[PARSER]
Name custom_cri
Format regex
Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L%z

luaScripts:
kubernetes_map_keys.lua: |
function kubernetes_map_keys(tag, timestamp, record)
  if record.kubernetes == nil then
    return 0
  end
  map_keys(record.kubernetes.annotations)
  map_keys(record.kubernetes.labels)
  return 1, timestamp, record
end

function map_keys(table)
  if table == nil then
    return
  end
  local new_table = {}
  local changed_keys = {}
  for key, val in pairs(table) do
    local mapped_key = string.gsub(key, "[%/%.]", "_")
    if mapped_key ~= key then
      new_table[mapped_key] = val
      changed_keys[key] = true
    end
  end
  for key in pairs(changed_keys) do
    table[key] = nil
  end
  for key, val in pairs(new_table) do
    table[key] = val
  end
end

function add_source_type(tag, timestamp, record)
  if record.kubernetes ~= nil then
    local namespace = record.kubernetes.namespace_name
    if record.kubernetes.container_name == "istio-proxy" and
       record.protocol ~= nil then
      record.source_type = "istio-envoy-proxy-access-log"
    elseif string.find(namespace, "compass%-system") or
           string.find(namespace, "kube%-node%-lease") or
           string.find(namespace, "kube%-public") or
           string.find(namespace, "kube%-system") or
           string.find(namespace, "istio%-system") then
      record.source_type = "kube-system"
    else
      record.source_type = "application"
    end
  end
  return 1, timestamp, record
end

function add_aligned_sent_received(tag, timestamp, record)
  if record.source_type ~= "istio-envoy-proxy-access-log" then
    return 0
  end
  if record.upstream_cluster:find("outbound|", 1) == 1 then
    record.bytes_sent_aligned = record.bytes_received
    record.bytes_received_aligned = record.bytes_sent
  else
    record.bytes_sent_aligned = record.bytes_sent
    record.bytes_received_aligned = record.bytes_received
  end
  return 1, timestamp, record
end

@PettitWesley
Contributor

@yangyang919 Sorry I'm not really sure what's causing this for you.

For another issue, I recently did some testing/benchmarking and memory leak checking, and I found that our latest stable version is very stable in memory usage and does not have any leaks: https://github.com/aws/aws-for-fluent-bit/blob/mainline/AWS_FOR_FLUENT_BIT_STABLE_VERSION

Please try that version if you haven't already.

@vkadi

vkadi commented Apr 12, 2023

@PettitWesley I can confirm that the stable version 2.28.5 is much better with the cloudwatch_logs plugin than the latest "2.31.x".
We are also seeing the issues with the latest version, and the recommendation here is to switch to the older cloudwatch plugin (#340), but that has lower performance.
@yangyang919 did you find any solution to stabilize these errors/restarts?

@PettitWesley
Contributor

@vkadi please check this issue: #542

@vkadi

vkadi commented Apr 12, 2023

@PettitWesley
Contributor

@vkadi the recommendations and fluent bit config options can be used in EKS as well. Fluent Bit config language is the same no matter where you deploy it.
