
Fluent Bit S3 output causes Prometheus metrics endpoint to be briefly unavailable #254

Closed
bharatnc opened this issue Oct 6, 2021 · 16 comments

Comments

@bharatnc commented Oct 6, 2021

Describe the question/issue

The Prometheus metrics endpoint /api/v1/metrics/prometheus is not available immediately when the S3 output is used, and it takes a long time for the interface to become available.

Configuration

Note: For credentials, the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables are used. These are exported and present in the systemd environment when running Fluent Bit.
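For reference, they are set in the systemd unit roughly like this (a sketch; the actual unit file and values are not shown here):

# systemd unit, [Service] section (sketch; real values omitted)
[Service]
Environment=AWS_ACCESS_KEY_ID=<access key>
Environment=AWS_SECRET_ACCESS_KEY=<secret key>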

[SERVICE]
    flush        10
    daemon       Off
    log_level    info
    parsers_file /etc/flb/parsers.conf
    plugins_file /etc/flb/plugins.conf
    http_server  On
    http_listen  0.0.0.0
    http_port    3281
    storage.metrics On
    storage.path /fluentbit-buffer
    storage.sync normal
    storage.checksum off
    storage.backlog.mem_limit  8M
    storage.max_chunks_up  128

[INPUT]
    Name  tail
    Alias  test_tail
    Path  /var/log/test.log
    Read_from_Head  Off
    Path_Key  filename
    Tag  syslog
    Key  event
    exit_on_eof  false
    Rotate_Wait  5
    Refresh_Interval  60
    Skip_Long_Lines  Off
    DB.sync  normal
    DB.locking  false
    Buffer_Chunk_Size  32k
    Buffer_Max_Size  8M
    Multiline  Off
    Multiline_Flush  4
    Parser_Firstline  8192
    Docker_Mode  Off
    Docker_Mode_Flush  4

[FILTER]
    Name record_modifier
    Match *
    Record hostname ${HOSTNAME}

[OUTPUT]
    Name  s3
    Match  syslog
    endpoint  https://storage.googleapis.com
    bucket  fluentbit-test
    use_put_object true
    content_type  application/gzip
    compression gzip
    store_dir  /fluentbit/s3
    upload_timeout 1m
    region  us-west2
    total_file_size  1M
    s3_key_format  /99com25-test-k8s-s3/$UUID.gz
    s3_key_format_tag_delimiters .-

Fluent Bit Log Output

NA

Fluent Bit Version Info

  • Version used: 1.8.3 on bare metal using systemd unit (Debian stretch).

Steps to reproduce issue

curl the metrics endpoint:

curl  http://localhost:3281/api/v1/metrics/prometheus
curl: (7) Failed to connect to localhost port 3281: Connection refused

Expected behavior

I should be able to curl the endpoint immediately without seeing a connection refused error. Curl starts working only after a long time; this varies between tests from attempt to attempt, generally 5-10 minutes, and also happens to coincide with the first successful upload after buffering 1M of data (per the settings above).
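To measure the delay, a small polling loop like this can be used (a sketch, same port as above):

start=$(date +%s)
until curl -sf http://localhost:3281/api/v1/metrics/prometheus > /dev/null; do
    sleep 5
done
echo "endpoint became reachable after $(( $(date +%s) - start ))s"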

Related Issues

fluent/fluent-bit#4165

@bharatnc (Author) commented Oct 7, 2021

@PettitWesley, going to post updates to this issue:

2021-10-07T17:09:12.019  [2021/10/07 17:09:12] [error] [output:s3:s3.0] PutObject request failed
2021-10-07T17:22:21.592  [2021/10/07 17:22:21] [ info] [output:s3:s3.0] Successfully uploaded object /bar/M9BDOssx.gz
2021-10-07T17:28:21.867  [2021/10/07 17:28:21] [ info] [output:s3:s3.0] Successfully uploaded object /foo/8SAtFTI8.gz
2021-10-07T17:28:21.873  [2021/10/07 17:28:21] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
2021-10-07T17:28:21.873  [2021/10/07 17:28:21] [ info] [output:s3:s3.0] Successfully uploaded object /audit/awGjKksC.gz
2021-10-07T17:28:21.873  [2021/10/07 17:28:21] [engine] caught signal (SIGSEGV)
2021-10-07T17:28:22.001  #0  0x55ba609dc4ae      in  __mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:87
2021-10-07T17:28:22.001  #1  0x55ba609dc4e5      in  mk_list_del() at lib/monkey/include/monkey/mk_core/mk_list.h:93
2021-10-07T17:28:22.001  #2  0x55ba609dcfd8      in  prepare_destroy_conn() at src/flb_upstream.c:390
2021-10-07T17:28:22.001  #3  0x55ba609dd03a      in  prepare_destroy_conn_safe() at src/flb_upstream.c:412
2021-10-07T17:28:22.001  #4  0x55ba609dd888      in  cb_upstream_conn_ka_dropped() at src/flb_upstream.c:659
2021-10-07T17:28:22.001  #5  0x55ba609d6e68      in  flb_engine_start() at src/flb_engine.c:688
2021-10-07T17:28:22.001  #6  0x55ba609bbdc0      in  flb_lib_worker() at src/flb_lib.c:628
2021-10-07T17:28:22.001  #7  0x7f606bf4c4a3      in  ???() at ???:0
2021-10-07T17:28:22.001  #8  0x7f606a9bdd0e      in  ???() at ???:0
2021-10-07T17:28:22.001  #9  0xffffffffffffffff  in  ???() at ???:0
2021-10-07T17:28:22.192  [2021/10/07 17:28:22] [ info] [engine] started (pid=42095)
2021-10-07T17:28:22.192  [2021/10/07 17:28:22] [ info] [storage] version=1.1.1, initializing...
2021-10-07T17:28:22.192  [2021/10/07 17:28:22] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
2021-10-07T17:28:22.192  [2021/10/07 17:28:22] [ info] [cmetrics] version=0.1.6
2021-10-07T17:28:22.193  [2021/10/07 17:28:22] [ info] [input:storage_backlog:storage_backlog.4] register tail.0/12829-1633626921.399727758.flb
2021-10-07T17:28:22.194  [2021/10/07 17:28:22] [ info] [output:s3:s3.0] Using upload size 1000000 bytes
2021-10-07T17:34:23.180  [2021/10/07 17:34:23] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
2021-10-07T17:34:23.180  [2021/10/07 17:34:23] [ info] [output:s3:s3.0] Successfully uploaded object /foo/bVygdS6q.gz
2021-10-07T17:35:42.735  [2021/10/07 17:35:42] [engine] caught signal (SIGTERM)
2021-10-07T17:35:42.735     └─ down       : 3
2021-10-07T17:35:42.735  [2021/10/07 17:35:42] [ info] [output:s3:s3.0] Sending all locally buffered data to S3
  • Also, another observation: the metrics interface randomly becomes unavailable in between, and the metrics get reset and start again while Fluent Bit is still running. This leads to really inconsistent metric scrapes by Prometheus.

When this happens, I ran the following (and saw nothing running against that port):

sudo  lsof -i :3281

@hossain-rayhan (Contributor)

Hi @bharatnc, does this happen only with the S3 output plugin? Did you test with another output plugin, like stdout?

Also, did you check the normal metrics endpoint instead of the Prometheus endpoint? Do you see any difference?
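For example (assuming the same port as in your config):

curl http://localhost:3281/api/v1/metrics              # JSON metrics
curl http://localhost:3281/api/v1/metrics/prometheus   # Prometheus text format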

@matthewfala (Contributor)

Hi @bharatnc, I cannot replicate your issue...

I start fluent bit with the following config:

[SERVICE]
     flush        10
     daemon       Off
     Grace 30
     Log_Level debug
     parsers_file /etc/flb/parsers.conf
     plugins_file /etc/flb/plugins.conf
     http_server  On
     http_listen  0.0.0.0
     http_port    3281
     storage.metrics On
     storage.sync normal
     storage.checksum off
     storage.backlog.mem_limit  8M
     storage.max_chunks_up  128

[INPUT]
     Name        forward
     Listen      0.0.0.0
     Port        24224

[OUTPUT]
     Name s3 
     Match *
     bucket fluent-bit-bucket-3
     total_file_size 1M
     upload_timeout 5m
     use_put_object On
     s3_key_format /$TAG[2]/$TAG[0]/%Y-%m-%d/%H-%M-%S-$UUID.gz
     region us-west-2
     compression         gzip

Then I hit the prometheus endpoint immediately via curl, as you did:

$ curl http://localhost:3281/api/v1/metrics/prometheus
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="forward.0"} 0 1635447975795
# HELP fluentbit_input_records_total Number of input records.
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="forward.0"} 0 1635447975795
# HELP fluentbit_output_dropped_records_total Number of dropped records.
# TYPE fluentbit_output_dropped_records_total counter
fluentbit_output_dropped_records_total{name="s3.0"} 0 1635447975795
# HELP fluentbit_output_errors_total Number of output errors.
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="s3.0"} 0 1635447975795
# HELP fluentbit_output_proc_bytes_total Number of processed output bytes.
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="s3.0"} 0 1635447975795
# HELP fluentbit_output_proc_records_total Number of processed output records.
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="s3.0"} 0 1635447975795
# HELP fluentbit_output_retried_records_total Number of retried records.
# TYPE fluentbit_output_retried_records_total counter
fluentbit_output_retried_records_total{name="s3.0"} 0 1635447975795
# HELP fluentbit_output_retries_failed_total Number of abandoned batches because the maximum number of re-tries was reached.
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="s3.0"} 0 1635447975795
# HELP fluentbit_output_retries_total Number of output retries.
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="s3.0"} 0 1635447975795
# HELP fluentbit_uptime Number of seconds that Fluent Bit has been running.
# TYPE fluentbit_uptime counter
fluentbit_uptime 8
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1635447967
# HELP fluentbit_build_info Build version information.
# TYPE fluentbit_build_info gauge
fluentbit_build_info{version="1.9.0",edition="Community"} 1

Unfortunately I am not seeing the same problem (the 5-minute delay).
This leads me to believe that your issue could potentially be one of the following:

  1. Something else is running on port 3281. Could you run the lsof command when curl fails (I think you only ran lsof previously when you were facing the second problem, right)? Could you also try a different port, maybe 3000 or 8080?
  2. The old Fluent Bit process (maybe a daemon) was shutting down gracefully, which might take up to 5 minutes, so it kept the port occupied until it fully shut down. You can run kill -9 $(pgrep fluent-bit) to kill old Fluent Bit processes (see the sketch after this list). (This is the most likely problem; I have actually run into it with the http input plugin before. When I debug Fluent Bit, my startup script includes the kill command to clear out old daemons that are still shutting down.)
  3. It's something to do with the S3 endpoint configuration. I didn't test the endpoint option since I'm using AWS S3. (Unlikely to be the problem.)
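For points 1 and 2, a quick check could look roughly like this (a sketch, assuming the same port 3281 and a systemd-managed service):

# check whether anything is already bound to the port at the moment curl fails
sudo lsof -i :3281

# kill any old fluent-bit processes that are still shutting down, then restart the service
kill -9 $(pgrep fluent-bit)
sudo systemctl restart fluent-bit    # or however the service is normally started

# hit the endpoint right away
curl -i http://localhost:3281/api/v1/metrics/prometheus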

@matthewfala (Contributor)

@bharatnc,
Could you let me know if running kill -9 $(pgrep fluent-bit) before spinning up a new fluent-bit process resolves your problem (point 2 above)?

@fvasco commented Nov 8, 2021

Hi @matthewfala,
we suspect we are hitting the same issue.
In our case, fluent-bit is a process in a container, so the port should always be available.

spec:
  template:
    spec:
      containers:
        image: public.ecr.aws/aws-observability/aws-for-fluent-bit:2.11.0
        livenessProbe:
          exec:
            command:
            - sh
            - -c
            - find /var/log/flb-storage -cmin -30 | head -n 1 | grep . && curl -s
              http://localhost:2020/api/v1/metrics | grep '"errors":0'
          failureThreshold: 2
          initialDelaySeconds: 120
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 5
        name: fluent-bit
        ports:
        - containerPort: 2020
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - find /var/log/flb-storage -cmin -30 | head -n 1 | grep . && curl -s
              http://localhost:2020/api/v1/metrics | grep '"errors":0'
          failureThreshold: 1
          initialDelaySeconds: 30
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 100m
            memory: 80Mi
          requests:
            memory: 80Mi
        securityContext:
          capabilities:
            add:
            - SYSLOG
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/machine-id
          name: machine-id
          readOnly: true
        - mountPath: /var/log
          name: varlog
        - mountPath: /var/lib/docker/containers
          name: varlibdockercontainers
          readOnly: true
        - mountPath: /fluent-bit/etc/
          name: fluent-bit-config
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/role: node
      priorityClassName: production
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: fluent-bit
      serviceAccountName: fluent-bit
      terminationGracePeriodSeconds: 30
      tolerations:
      - operator: Exists
      volumes:
      - hostPath:
          path: /etc/machine-id
          type: ""
        name: machine-id
      - hostPath:
          path: /var/log
          type: ""
        name: varlog
      - hostPath:
          path: /var/lib/docker/containers
          type: ""
        name: varlibdockercontainers
      - configMap:
          defaultMode: 420
          name: fluent-bit-config
        name: fluent-bit-config

The last working version is 2.11.0; any newer version fails to expose Prometheus metrics (so the liveness check fails).

Our S3 configuration is simpler:

    [OUTPUT]
        Name            s3
        Match           *
        region          us-east-1
        bucket          ${S3_BUCKET}
        s3_key_format   /fluent-bit-logs/%Y/%m/%d/%H/%M/%S/$TAG
        total_file_size 4M
        upload_timeout  1m
        compression     gzip
        use_put_object  true

Any suggestions?

@fvasco commented Dec 3, 2021

Version 2.21.2 still hangs at startup.

I executed the command: kill -9 $(pgrep fluent-bit)

The full log is:

Fluent Bit v1.8.9
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/12/03 13:09:20] [ info] [engine] started (pid=1)
[2021/12/03 13:09:20] [ info] [storage] version=1.1.5, initializing...
[2021/12/03 13:09:20] [ info] [storage] root path '/var/log/flb-storage/'
[2021/12/03 13:09:20] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/12/03 13:09:20] [ info] [storage] backlog input plugin: storage_backlog.4
[2021/12/03 13:09:20] [ info] [cmetrics] version=0.2.2
[2021/12/03 13:09:20] [ info] [input:systemd:systemd] seek_cursor=s=f9ed1dac9afc49069ac4fad3a1c066da;i=2be... OK
[2021/12/03 13:09:20] [ info] [input:storage_backlog:storage_backlog.4] queue memory limit: 95.4M
[2021/12/03 13:09:20] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2021/12/03 13:09:20] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2021/12/03 13:09:20] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2021/12/03 13:09:20] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2021/12/03 13:09:20] [ info] [fstore] created root path /tmp/fluent-bit/s3/xxx
[2021/12/03 13:09:20] [ info] [output:s3:s3.0] Using upload size 4000000 bytes
[2021/12/03 13:09:20] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/12/03 13:09:20] [ info] [sp] stream processor started
[2021/12/03 13:09:20] [error] [storage backlog] could not retrieve chunk tag
[2021/12/03 13:09:20] [error] [storage backlog] error distributing chunk references
[2021/12/03 13:09:20] [error] [engine] could not segregate backlog chunks

@matthewfala (Contributor)

Thank you, @fvasco, for your details on the problem. I'll try to reproduce this again with your config next week.

matthewfala self-assigned this on Dec 3, 2021
@fvasco commented Dec 9, 2021

@matthewfala I have shared my configuration with you; I hope it helps you reproduce my issue.

@fvasco commented Dec 19, 2021

@matthewfala did you detect the memory leak?
Do you need more info?

@matthewfala (Contributor)

I still haven't been able to reproduce it, but the configuration you provided is helpful, @fvasco. I'm working on some other Fluent Bit issues as well, but this is still something I'm trying to get done.

Also, what memory leak are you talking about? I thought the issue was just that the Fluent Bit Prometheus endpoint hangs on startup?

@fvasco commented Dec 21, 2021

what memory leak are you talking about?

My bad.
I am using a fluent-bit version with a memory leak.
The leak has been fixed in newer versions, but those versions crash at startup.
Thank you for your support, I will stay tuned.

@matthewfala (Contributor)

Hmm, I am still unable to reproduce on Kubernetes...
It may be because I am using a simplified config to see if the S3 output plugin is affecting the Prometheus endpoint on startup. So far, it seems that it isn't.

I use 3 Kubernetes resource config files which model your config, but simplified:

  1. Fluent Bit config
apiVersion: v1
data:
  fluent-bit.conf: |
    [SERVICE]
        Grace 30
        Log_Level debug
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_PORT     2020
    [INPUT]
        Name        forward
        Listen      0.0.0.0
        Port        24224
    [OUTPUT]
        Name s3 
        Match *
        bucket fluent-bit-bzs3
        total_file_size 1M
        upload_timeout 5m
        use_put_object On
        s3_key_format /$TAG[2]/$TAG[0]/%Y-%m-%d/%H-%M-%S-$UUID.gz
        auto_retry_requests true
    #     s3_key_format_tag_delimiters .-
        region us-west-2  
kind: ConfigMap
metadata:
  name: fluent-bit-config
  2. Test App + FireLens
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pro-fbd
  labels:
    app: prometheus-fb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-fb
  template:
    metadata:
      labels:
        app: prometheus-fb
    spec:
      containers:
      - name: fluent-bit
        image: public.ecr.aws/aws-observability/aws-for-fluent-bit:stable
        ports:
        - containerPort: 2020
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 80Mi
          requests:
            memory: 80Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /fluent-bit/etc/
          name: fluent-bit-config
        envFrom:
        - secretRef:
            name: aws-credentials
      - name: firelens-datajet
        image: public.ecr.aws/fala-fluentbit/firelens-datajet:0.1.1.r2-synchronizer
        env:
        - name: CLIENT_REQUEST_PORT
          value: "3210"
      volumes:
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
  3. AWS Secrets
apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
data:
  AWS_SECRET_ACCESS_KEY: <b64 secret>
  AWS_ACCESS_KEY_ID: <b64  key>

Upon startup, which is retried via kubectl scale deployment/pro-fbd --replicas=0 followed by kubectl scale deployment/pro-fbd --replicas=1, I run these commands to see if the Prometheus endpoint is available:

kubectl get pods
kubectl exec -ti <my pod> --container=fluent-bit -- curl -i http://localhost:2020/api/v1/metrics/prometheus

The exec command results in the following:

HTTP/1.1 200 OK
Server: Monkey/1.7.0
Date: Wed, 22 Dec 2021 23:49:16 GMT
Transfer-Encoding: chunked
Content-Type: text/plain; version=0.0.4

# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="forward.0"} 0 1640216957278
# HELP fluentbit_input_records_total Number of input records.
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="forward.0"} 0 1640216957278
# HELP fluentbit_output_dropped_records_total Number of dropped records.
# TYPE fluentbit_output_dropped_records_total counter
fluentbit_output_dropped_records_total{name="s3.0"} 0 1640216957278
# HELP fluentbit_output_errors_total Number of output errors.
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="s3.0"} 0 1640216957278
# HELP fluentbit_output_proc_bytes_total Number of processed output bytes.
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="s3.0"} 0 1640216957278
# HELP fluentbit_output_proc_records_total Number of processed output records.
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="s3.0"} 0 1640216957278
# HELP fluentbit_output_retried_records_total Number of retried records.
# TYPE fluentbit_output_retried_records_total counter
fluentbit_output_retried_records_total{name="s3.0"} 0 1640216957278
# HELP fluentbit_output_retries_failed_total Number of abandoned batches because the maximum number of re-tries was reached.
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="s3.0"} 0 1640216957278
# HELP fluentbit_output_retries_total Number of output retries.
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="s3.0"} 0 1640216957278
# HELP fluentbit_uptime Number of seconds that Fluent Bit has been running.
# TYPE fluentbit_uptime counter
fluentbit_uptime 40
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1640216917
# HELP fluentbit_build_info Build version information.
# TYPE fluentbit_build_info gauge
fluentbit_build_info{version="1.8.9",edition="Community"} 1

The S3 endpoint is tested via the test app, and logs are successfully sent to S3. The Prometheus endpoint is available on startup.

Would it be possible for you to try running the kubectl exec -ti <my pod> --container=fluent-bit -- curl -i http://localhost:2020/api/v1/metrics/prometheus command instead of exposing the endpoint via a service/tunnel and poking it on the host, or using it as a readiness probe? That way we can see whether Fluent Bit is the problem or some networking between Kubernetes and the host.

@fvasco commented Dec 23, 2021

Hi @matthewfala,
I tested your hint with Fluent Bit 2.21.5.

The pod log is:

Fluent Bit v1.8.11
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/12/23 07:42:40] [ info] [engine] started (pid=1)
[2021/12/23 07:42:40] [ info] [storage] version=1.1.5, initializing...
[2021/12/23 07:42:40] [ info] [storage] root path '/var/log/flb-storage/'
[2021/12/23 07:42:40] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/12/23 07:42:40] [ info] [storage] backlog input plugin: storage_backlog.4
[2021/12/23 07:42:40] [ info] [cmetrics] version=0.2.2
[2021/12/23 07:42:40] [ info] [input:systemd:systemd] seek_cursor=s=32d7791647e1412480517c88b5344722;i=8a4... OK
[2021/12/23 07:42:40] [ info] [input:storage_backlog:storage_backlog.4] queue memory limit: 95.4M
[2021/12/23 07:42:40] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
[2021/12/23 07:42:40] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2021/12/23 07:42:40] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2021/12/23 07:42:40] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2021/12/23 07:42:40] [ info] [fstore] created root path /tmp/fluent-bit/s3/xxx
[2021/12/23 07:42:40] [ info] [output:s3:s3.0] Using upload size 4000000 bytes
[2021/12/23 07:42:41] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2021/12/23 07:42:41] [ info] [sp] stream processor started
[2021/12/23 07:42:41] [ info] [input:tail:containers] inotify_fs_add(): inode=512884 watch_fd=1 name=/var/log/containers/kube-proxy-ip-172-31-68-199.ec2.internal_kube-system_kube-proxy-90f51daafff7d57a760a0608f95d9b744ed3391f6a350cedf3ef0cc1360d224d.log
[2021/12/23 07:42:41] [ info] [input:tail:dmesg] inotify_fs_add(): inode=65233 watch_fd=1 name=/var/log/dmesg

It hangs here.

The exec output is:

$ kubectl exec -n logging -ti fluent-bit-kjphc --container=fluent-bit -- curl -i http://localhost:2020/api/v1/metrics/prometheus
HTTP/1.1 404 Not Found
Server: Monkey/1.7.0
Date: Thu, 23 Dec 2021 07:44:31 GMT
Transfer-Encoding: chunked

I tested it multiple times; I always received a 404.
We don't use variables for AWS credentials.

@bharatnc (Author)

@matthewfala and @hossain-rayhan

Please pardon the late reply.

Hi @bharatnc, does this happen only with S3 output plugin? Did you test with other output plugin like stdout?

I actually tried to use the S3 plugin with output going to GCS! fluent/fluent-bit#4165 (comment)

I observed that this issue only happens with this setup. It doesn't happen with other output plugins and they all work fine.

I also tried with AWS S3, and that seems to work fine. So I suspect there is some incompatibility with GCS (though it sends logs, it makes the metrics interface unavailable).

Could you let me know if running kill -9 $(pgrep fluent-bit) before spinning up a new fluent-bit process resolves your problem [2]?

It did not make a difference when I was trying to use the S3 plugin to send logs to GCS (it is supposed to be S3-compatible, but it looks like it's not). As mentioned above, everything worked fine with the other plugins.

The issue I observed happened only with this setup. I'm not sure how many Fluent Bit users have tried using the S3 output plugin to send to GCS this way ;)

@matthewfala (Contributor)

@bharatnc, I see. Thank you for your response and for letting me know. I'm glad you were able to get this plugin working with AWS S3. I wasn't aware that GCS has an AWS S3 compatibility option.

@fvasco
I'm not sure I'll be able to help. My attempts to reproduce the issue so far haven't turned up anything, and I think my test shows that the S3 plugin isn't interfering with the Prometheus endpoint. Are you able to try to isolate the issue to S3? For example, what happens if you delete the S3 output plugin and replace it with stdout, as in the sketch below?
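A minimal replacement for the S3 [OUTPUT] block would look roughly like this:

[OUTPUT]
    Name   stdout
    Match  *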

Also, what happens if you delete your other filters and plugins one by one? Does the error go away?
I don't think the use of ENV vars for credentials would cause this problem.

matthewfala removed their assignment on Jan 26, 2022
@matthewfala (Contributor)

Closing due to no response. Please reopen if you have any followup questions or concerns. Thank you.
