Potential memory leak in v1.8.7 debug #4211

Closed
lmuhlha opened this issue Oct 21, 2021 · 25 comments

@lmuhlha

lmuhlha commented Oct 21, 2021

Bug Report

Describe the bug
fluent/fluent-bit:1.8.7-debug@sha256:024748e4aa934d5b53a713341608b7ba801d41a170f9870fdf67f4032a20146f

To Reproduce

  • Rubular link if applicable:
  • Example log message if applicable:
stream of "OOMKilling" warnings
  • Steps to reproduce the problem:
    Deploy fluent/fluent-bit:1.8.7-debug@sha256:024748e4aa934d5b53a713341608b7ba801d41a170f9870fdf67f4032a20146f and wait 10-15 mins. Container will OOM.

Expected behavior
Deploying fluent/fluent-bit:1.8.7-debug@sha256:024748e4aa934d5b53a713341608b7ba801d41a170f9870fdf67f4032a20146f with a specified amount of memory will work and not constantly increase / OOM.

Screenshots

image

Your Environment

  • Version used: fluent/fluent-bit:1.8.7-debug@sha256:024748e4aa934d5b53a713341608b7ba801d41a170f9870fdf67f4032a20146f
  • Configuration:
  fluent-bit.conf: |-
    [SERVICE]
        Flush         5
        Grace         120
        Log_Level     debug
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_PORT     3020

    @INCLUDE containers.input.conf
    @INCLUDE system.input.conf
    @INCLUDE filter.conf
    @INCLUDE output.conf

  containers.input.conf: |-
    [INPUT]
        Name             tail
        Alias            k8s_container
        Tag              k8s_container.<namespace_name>.<pod_name>.<container_name>
        Tag_Regex        (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
        Path             /var/log/containers/*.log
        DB               /var/run/google-fluentbit/pos-files/flb_kube.db
        Buffer_Max_Size  1MB
        Mem_Buf_Limit    50MB
        Skip_Long_Lines  On
        Refresh_Interval 5
        Read_from_Head   True

  system.input.conf: |-
    # Example:
    # Dec 21 23:17:22 gke-foo-1-1-4b5cbd14-node-4eoj startupscript: Finished running startup script /var/run/google.startup.script
    [INPUT]
        Name   tail
        Alias  syslog
        Parser syslog
        Path   /var/log/startupscript.log
        DB     /var/log/startupscript.db
        Alias  startupscript
        Tag    startupscript

    [INPUT]
        Name    tail
        Alias   docker
        Path    /var/log/docker.log
        Tag     docker
        Parser  docker
        Mem_Buf_Limit    1MB
        Skip_Long_Lines  On
        Refresh_Interval 1

    [INPUT]
        Name  tail
        Alias etcd
        Path  /var/log/etcd.log
        Tag   etcd
        Mem_Buf_Limit    1MB
        Skip_Long_Lines  On
        Refresh_Interval 1

    [INPUT]
        Name             tail
        Alias            kubelet
        Path             /var/log/kubelet.log
        Tag              kubelet
        Multiline        off
        Parser_Firstline firstline
        Parser_1         format1
        Mem_Buf_Limit    1MB
        Skip_Long_Lines  On
        Refresh_Interval 1

    # Example:
    # I1118 21:26:53.975789       6 proxier.go:1096] Port "nodePort for kube-system/default-http-backend:http" (:31429/tcp) was open before and is still needed
    [INPUT]
        Name            tail
        Alias           kube-proxy
        Tag             kube-proxy
        Path            /var/log/kube-proxy.log
        DB              /var/log/kube-proxy.db
        Buffer_Max_Size 1MB
        Mem_Buf_Limit   1MB
        Refresh_Interval 1
        Parser          glog

    [INPUT]
        Name             tail
        Alias            kube-apiserver
        Path             /var/log/kube-apiserver.log
        Tag              kube-apiserver
        Multiline        off
        Parser_Firstline firstline
        Parser_1         format1
        Mem_Buf_Limit    1MB
        Skip_Long_Lines  On
        Refresh_Interval 1

    [INPUT]
        Name             tail
        Alias            kube-controller-manager
        Path             /var/log/kube-controller-manager.log
        Tag              kube-controller-manager
        Multiline        off
        Parser_Firstline firstline
        Parser_1         format1
        Mem_Buf_Limit    1MB
        Skip_Long_Lines  On
        Refresh_Interval 1

    [INPUT]
        Name             tail
        Alias            kube-scheduler
        Path             /var/log/kube-scheduler.log
        Tag              kube-scheduler
        Multiline        off
        Parser_Firstline firstline
        Parser_1         format1
        Mem_Buf_Limit    1MB
        Skip_Long_Lines  On
        Refresh_Interval 1

    [INPUT]
        Name             tail
        Alias            rescheduler
        Path             /var/log/rescheduler.log
        Tag              rescheduler
        Multiline        off
        Parser_Firstline firstline
        Parser_1         format1
        Mem_Buf_Limit    1MB
        Skip_Long_Lines  On
        Refresh_Interval 1

    [INPUT]
        Name             tail
        Alias            glbc
        Path             /var/log/glbc.log
        Tag              glbc
        Multiline        off
        Parser_Firstline firstline
        Parser_1         format1
        Mem_Buf_Limit    1MB
        Skip_Long_Lines  On
        Refresh_Interval 1

    [INPUT]
        Name             tail
        Alias            cluster-autoscaler
        Path             /var/log/cluster-autoscaler.log
        Tag              cluster-autoscaler
        Multiline        off
        Parser_Firstline firstline
        Parser_1         format1
        Mem_Buf_Limit    1MB
        Skip_Long_Lines  On
        Refresh_Interval 1

    # Logs from systemd-journal for interesting services.
    [INPUT]
        Name           systemd
        Alias          sysd-docker
        Tag            docker
        Systemd_Filter _SYSTEMD_UNIT=docker.service
        Path           /var/log/journal
        DB             /var/log/gcp-journald-docker.db
        Read_from_head  true
        Buffer_Max_Size 1MB
        Mem_Buf_Limit   1MB
        Refresh_Interval 1

    [INPUT]
        Name           systemd
        Alias          sysd-container-runtime
        Tag            container-runtime
        Systemd_Filter _SYSTEMD_UNIT=containerd.service
        Path           /var/log/journal
        DB             /var/log/gcp-journald-container-runtime.db
        Read_from_head true
        Buffer_Max_Size 1MB
        Mem_Buf_Limit   1MB
        Refresh_Interval 1

    [INPUT]
        Name            systemd
        Alias           sysd-kubelet
        Tag             kubelet
        Systemd_Filter  _SYSTEMD_UNIT=kubelet.service
        Path            /var/log/journal
        DB              /var/log/gcp-journald-kubelet.db
        Read_from_head  true
        Buffer_Max_Size 1MB
        Mem_Buf_Limit   1MB
        Refresh_Interval 1

    [INPUT]
        Name           systemd
        Alias          sysd-node-problem-detector
        Tag            node-problem-detector
        Systemd_Filter _SYSTEMD_UNIT=node-problem-detector.service
        Path           /var/log/journal
        DB             /var/log/gcp-journald-node-problem-detector.db
        Read_from_head  true
        Buffer_Max_Size 1MB
        Mem_Buf_Limit   1MB
        Refresh_Interval 1

  filter.conf: |-

    [FILTER]
        Name         parser
        Match        k8s_container.*
        Key_Name     log
        Reserve_Data True
        Parser       docker
        Parser       containerd

    [FILTER]
        Name        modify
        Match       *
        Hard_rename log message

    [FILTER]
        Name         parser
        Match        k8s_container.*
        Key_Name     message
        Reserve_Data True
        Parser       glog
        Parser       json

    # level is a common synonym for severity,
    # the default field name in libraries such as GoLang's zap.
    # populate severity with level, if severity does not exist.
    [FILTER]
        Name        modify
        Match       k8s_container.*
        Copy        level severity

  output.conf: |-

    # handle namespaces in droplist first
    {% for namespace in log_droplist %}
    [OUTPUT]
        Name  null
        Alias null-{{namespace}}
        Match k8s_container.{{namespace}}.*
    {% endfor %}

    # Single output for all logs, project log routing handled by sinks in host project
    [OUTPUT]
        Name                       http
        Alias                      http-export-all
        Match                      *
        Host                       127.0.0.1
        Port                       3021
        URI                        /logs
        header_tag                 FLUENT-TAG
        Format                     msgpack
        Retry_Limit                2

  parsers.conf: |-
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z

    [PARSER]
        Name        containerd
        Format      regex
        Regex       ^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z

    [PARSER]
        Name        json
        Format      json

    [PARSER]
        Name        glog
        Format      regex
        Regex       ^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source_file>[^ \]]+)\:(?<source_line>\d+)\]\s(?<message>.*)$
        Time_Key    time
        Time_Format %m%d %H:%M:%S.%L

    [PARSER]
        Name        syslog
        Format      regex
        Regex       ^\<(?<pri>[0-9]+)\>(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key    time
        Time_Format %b %d %H:%M:%S

    [PARSER]
        Name firstline
        Format regex
        Regex  /^\w\d{4}/
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes
  • Server type and version:
  • Operating System and version: "Debian GNU/Linux 10 (buster)"
  • Filters and plugins: See config above

Additional context

@ggallagher0

ggallagher0 commented Oct 21, 2021

We experienced the same issue when upgrading from 1.5.2 to 1.8.8. One pod would consistently use up to 3GB of memory and then crash. Upping 'Flush' to 8 in the service config helped, but pods are still using 3x more memory than they did in 1.5.2.

[SERVICE]
Flush 8
Log_Level info
Daemon Off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
storage.path /tmp
storage.sync normal
storage.backlog.mem_limit 100M
storage.metrics on

[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/xxx.log
Parser docker
DB /tmp/flb_kube.xxx.db
Mem_Buf_Limit 500MB
Skip_Long_Lines On
Refresh_Interval 10
storage.type filesystem

[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/kube-system.log
Parser docker
DB /tmp/flb_kube.kube-system.db
Mem_Buf_Limit 500MB
Skip_Long_Lines On
Refresh_Interval 10
storage.type filesystem

[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/cloudability.log
Parser docker
DB /tmp/flb_kube.cloudability.db
Mem_Buf_Limit 500MB
Skip_Long_Lines On
Refresh_Interval 10
storage.type filesystem

[INPUT]
Name systemd
Tag nodes
DB /tmp/flb_systemd.db
Mem_Buf_Limit 500MB
Strip_Underscores On
Skip_Long_Lines On
Refresh_Interval 10
storage.type filesystem
Systemd_Filter _SYSTEMD_UNIT=kubelet.service

[INPUT]
Name tail
Tag k8s-audit
Path /opt/rke/var/log/kube-audit/k8s-audit-log.json
Parser k8s-audit
DB /tmp/flb_k8s_audit.db
Mem_Buf_Limit 500MB
Skip_Long_Lines On
Refresh_Interval 10
Rotate_Wait 10
storage.type filesystem

[FILTER]
Name kubernetes
Match kube.*
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude On

[OUTPUT]
Name forward
Match *
Host fluentd-forward.xxx.svc.cluster.local.
Port 24224
Retry_Limit 5

@NeckBeardPrince

Same here on 1.8.8

@NeckBeardPrince

Any update on this? It's happening in 1.8.8 non-debug as well.

@NeckBeardPrince

@lmuhlha Have you found a workaround for this?

@edsiper
Member

edsiper commented Oct 27, 2021

If you have 2.6G of data up in memory and then convert it to JSON, you will exceed 3GB for sure. Your mem_buf_limits are too high.
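
To make the arithmetic behind this concrete: the configuration posted above sets Mem_Buf_Limit 500MB on each of five inputs, which allows roughly 2.5GB of buffered data in total. A minimal sketch of a tighter setup for one of those inputs might look like the following. The 50MB and 64 values are illustrative assumptions, not figures from this thread; storage.max_chunks_up is the service-level option that bounds how many chunks stay in memory for inputs using storage.type filesystem.

```
[SERVICE]
    storage.path          /tmp
    # Assumed value for illustration: cap the number of chunks kept
    # in memory for filesystem-backed inputs (the default is 128).
    storage.max_chunks_up 64

[INPUT]
    Name             tail
    Tag              kube.*
    Path             /var/log/containers/xxx.log
    Parser           docker
    DB               /tmp/flb_kube.xxx.db
    # Assumed value for illustration, lowered from 500MB.
    Mem_Buf_Limit    50MB
    Skip_Long_Lines  On
    Refresh_Interval 10
    storage.type     filesystem
```

Whether this changes the behaviour reported here is untested; the sketch only shows where the relevant limits live.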

@ggallagher0

@edsiper our mem_buf_limits are 500MB and the OP's are 1MB. If this were just a configuration issue, it would be happening in both versions. When we rolled back to 1.5.2, memory use dropped right back to about 4MB per pod, versus the 20MB-3GB that the 1.8.8 pods used. In 1.8.8, one pod out of three would consistently run up to 3GB within hours, while the others would slowly rise and hang around 20MB.

@edsiper
Member

edsiper commented Oct 27, 2021

@ggallagher0 can you try reproducing the problem with the systemd input disabled? Can you help isolate the plugin triggering the problem?
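
A sketch of that experiment, assuming the configuration @ggallagher0 posted above: comment out the systemd section, leave everything else unchanged, and compare memory growth against a run with the section enabled.

```
# Disabled for the isolation test; every other section stays as posted.
# [INPUT]
#     Name              systemd
#     Tag               nodes
#     DB                /tmp/flb_systemd.db
#     Mem_Buf_Limit     500MB
#     Strip_Underscores On
#     Skip_Long_Lines   On
#     Refresh_Interval  10
#     storage.type      filesystem
#     Systemd_Filter    _SYSTEMD_UNIT=kubelet.service
```

If memory still climbs with only the tail inputs active, the systemd plugin can probably be ruled out as the sole trigger.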

@NeckBeardPrince

> @ggallagher0 can you try reproducing the problem with the systemd input disabled? Can you help isolate the plugin triggering the problem?

I have this same issue and I only use the tail input.

[FILTER]
    Name              aws
    Match             *
    imds_version      v1
    az                true
    ec2_instance_id   true
    ec2_instance_type true
    private_ip        true
    ami_id            true
    account_id        true
    hostname          true
    vpc_id            true

[FILTER]
    Name                kubernetes
    Match               ingress-nginx.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Kube_Tag_Prefix     ingress-nginx.
    Use_Kubelet         true
    Buffer_Size         0
    Merge_Log           On
    Keep_Log            False

[SERVICE]
    Flush             5
    Grace             120
    Log_Level         error
    Daemon            off
    Parsers_File      parsers.conf
    HTTP_Server       On
    HTTP_Listen       0.0.0.0
    HTTP_Port         2020
    storage.metrics   On
    storage.path      /var/log/flb-storage/

@INCLUDE input-kubernetes.conf
@INCLUDE filter-kubernetes.conf
@INCLUDE filter-aws.conf
@INCLUDE output-elasticsearch.conf
@INCLUDE output-s3.conf

[INPUT]
    Name              tail
    Alias             ingress_nginx_appdat-system
    Tag               ingress_<namespace_name>_<pod_name>_<container_name>
    Tag_Regex         (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    Path              /var/log/containers/ingress-nginx-controller*.log
    Parser            docker
    DB                /var/log/flb_ingress.db
    storage.type      filesystem
    Docker_Mode       On
    Skip_Long_Lines   On
    Refresh_Interval  5
    Buffer_Max_Size   1MB
    Mem_Buf_Limit     5MB

[OUTPUT]
    Name                      es
    Match                     *
    Host                      ${ELASTICSEARCH_HOST}
    Port                      ${ELASTICSEARCH_PORT}
    AWS_Auth                  ${ELASTICSEARCH_AWS_AUTH}
    AWS_Region                ${ELASTICSEARCH_AWS_REGION}
    TLS                       On
    Generate_ID               On
    Logstash_Prefix           access-logs
    Logstash_Format           On
    Replace_Dots              On
    Buffer_Size               False
    Retry_Limit               False
    storage.total_limit_size  2048M

[OUTPUT]
    Name                          s3
    Match                         *
    bucket                        ${S3_BUCKET_NAME}
    region                        ${S3_BUCKET_REGION}
    store_dir                     /var/log/flb-storage
    s3_key_format                 ${S3_BUCKET_KEY_FORMAT}
    s3_key_format_tag_delimiters  .-
    upload_timeout                5m
    Retry_Limit                   False
    storage.total_limit_size      2048M

[PARSER]
    Name        docker
    Format      json
    Time_Key    time
    Time_Format %Y-%m-%dT%H:%M:%S.%L
    Time_Keep   On

@NeckBeardPrince

Any update?

@gabegorelick

#4192 may be a related issue.

@NeckBeardPrince

Same issue with 1.8.9 non-debug.

@leonardo-albertovich
Collaborator

I wonder which case is the easiest one to reproduce locally. @lmuhlha's seems good output-wise because it uses the http plugin, but it's a bit convoluted configuration-wise. @ggallagher0's is good because it uses simpler inputs and the output plugin is forward, which means it can be set up locally without requiring any API keys.

Have you tried removing those outputs and adding a simple TCP endpoint to see if the leak is still there, @NeckBeardPrince?

I'm trying to come up with ideas about what these cases have in common and what simplifications could be made to test those ideas. The one thing two out of three have in common is the Kubernetes filter plugin, and all of them use parsers.
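
The simple TCP endpoint mentioned above could be sketched roughly as follows, replacing the es and s3 outputs. This assumes some local listener is accepting connections on 127.0.0.1:5170; the address, port, and json_lines format are illustrative choices, not values taken from this thread.

```
[OUTPUT]
    Name   tcp
    Match  *
    # Assumed local listener; any process accepting TCP on this port will do.
    Host   127.0.0.1
    Port   5170
    # One JSON record per line keeps the received data easy to inspect.
    Format json_lines
```

If the leak disappears with this output in place, the original output plugins become the main suspects; if it persists, the inputs, parsers, and filters remain in play.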

@lmuhlha
Author

lmuhlha commented Nov 18, 2021

Just an update from my end. I've been trying to get the k8s filter to work with my setup, but on 1.5.7 I can't seem to get it to connect properly: [ warn] [filter:kubernetes:kubernetes.0] could not get meta for POD ...
If I add the configs to the filter to use Kubelet, FluentBit crashes, I assume because the Kubelet features weren't supported in that version yet.
If I use gcr.io/gke-on-prem-release/fluent-bit:v1.8.3-gke.3 (provided by Google in other discussions), I am able to use Kubelet and connect properly, but we start to see several pods OOMing again.

Re: "I wonder which case is the easiest one to reproduce locally. @lmuhlha's seems good output-wise because it uses the http plugin, but it's a bit convoluted configuration-wise."
I can try to deploy a simplified config if that helps debug the issue.

@lmuhlha
Author

lmuhlha commented Nov 18, 2021

So I just tried this again with a simplified config and decreased the Mem_Buf_Limit and am still seeing the OOM on some pods.
FluentBit version: gcr.io/gke-on-prem-release/fluent-bit:v1.8.3-gke.3
w/ Google's Exporter: gke.gcr.io/fluent-bit-gke-exporter:v0.16.2-gke.0
Config:

  fluent-bit.conf: |-
    [SERVICE]
        Flush         5
        Grace         120
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_PORT     3020
    @INCLUDE containers.input.conf
    @INCLUDE filter.conf
    @INCLUDE output.conf
  containers.input.conf: |-
    [INPUT]
        Name             tail
        Alias            k8s_container
        Tag              k8s_container.<namespace_name>.<pod_name>.<container_name>
        Tag_Regex        (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
        Path             /var/log/containers/*.log
        Parser           docker
        DB               /var/run/google-fluentbit/pos-files/flb_kube.db
        Buffer_Max_Size  1MB
        Mem_Buf_Limit    1MB
        Skip_Long_Lines  On
        Refresh_Interval 5
  filter.conf: |-
    [FILTER]
        Name                kubernetes
        Match               k8s_container.<namespace_name>.<pod_name>.<container_name>
        Kube_URL            https://kubernetes.default.svc.cluster.local:443
        Merge_Log           On
        Buffer_Size         0
        Use_Kubelet         true
        Kubelet_Port        10250
  output.conf: |-
    # Single output for all logs, project log routing handled by sinks in host project
    [OUTPUT]
        Name                       http
        Alias                      http-export-all
        Match                      *
        Host                       127.0.0.1
        Port                       3021
        URI                        /logs
        header_tag                 FLUENT-TAG
        Format                     msgpack
        Retry_Limit                2
  parsers.conf: |-
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep    On

Pod 1:
Screen Shot 2021-11-18 at 6 12 29 PM
Pod 2:
Screen Shot 2021-11-18 at 6 14 14 PM

@lmuhlha
Author

lmuhlha commented Nov 23, 2021

Happening on 1.7.9 as well
Screen Shot 2021-11-22 at 7 49 06 PM

@NeckBeardPrince

> Happening on 1.7.9 as well

Do you mean 1.8.9? After going back to 1.7.9 I'm no longer having the issue. But 1.8.9 is also having the same problem.

@lmuhlha
Author

lmuhlha commented Nov 23, 2021

Nope, I actually have the issue in 1.7.9 as well. So far anything I try above 1.5.7 does it, will continue trying things out.

@lmuhlha
Author

lmuhlha commented Nov 23, 2021

Just tried 1.7.7 with no issue.

@triThirty

Same issue with 1.8.10. As the graph I posted shows, what confuses me is that the metric container_memory_working_set_bytes{endpoint="https-metrics", id="/kubepods/pod2cfb2523-0d79-43f5-a2a0-db07e0029bdd", instance="10.34.7.89:10250", job="kubelet", metrics_path="/metrics/cadvisor", namespace="logging", node="ip-10-34-7-89.ec2.internal", pod="fluent-bit-wr5wt", service="kube-prometheus-operator-k-kubelet"} has a peak.
Screen Shot 2021-12-17 at 2 50 03 PM

@namevic

namevic commented Feb 24, 2022

After updating to 1.8.12 I don't see the memory leak.

@KrishnaKant1509

Maybe these two are the same issue: #5147

@KrishnaKant1509

> After updating to 1.8.12 I don't see the memory leak.

Even with 1.8.12, I am seeing the problem when I turn K8S-Logging.Exclude On in the kubernetes filter plugin. Memory usage remains constant when I turn this option Off.
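
For anyone wanting to repeat this comparison, a minimal sketch of the filter section might look like the following. The Match pattern and the other options are assumptions borrowed from configurations posted earlier in this thread; only the K8S-Logging.Exclude line is meant to change between runs.

```
[FILTER]
    Name                kubernetes
    # Assumed tag pattern; adjust to match the tags your inputs produce.
    Match               kube.*
    Merge_Log           On
    K8S-Logging.Parser  On
    # Toggle only this line between runs (On vs. Off) and compare memory.
    K8S-Logging.Exclude Off
```

Keeping every other setting fixed makes it easier to attribute any difference in memory growth to this one option.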

@danielserrao

Same issue here. Tested with versions 1.8.11 and 1.8.12 and with K8S-Logging.Exclude Off, but the memory always keeps leaking.

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Jun 21, 2022
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

10 participants