Fluent Bit S3 output causes the Prometheus metrics endpoint to be briefly unavailable #254
@PettitWesley, I'm going to post updates to this issue:
When this happens I ran the following (and saw nothing running against that port):
Hi @bharatnc, does this happen only with the S3 output plugin? Did you test with other output plugins? Also, did you check the normal metrics endpoint instead of the Prometheus endpoint? Do you see any difference?
Hi @bharatnc, I cannot replicate your issue. I started Fluent Bit with the following config:
Then I hit the Prometheus endpoint immediately via curl, as you did:
Unfortunately I am not seeing the same problem (the 5-minute delay).
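As a side note, the startup delay being discussed can be quantified by polling the endpoint until it answers. A minimal sketch (not from this thread; the URL, timeout, and polling interval are assumptions):

```python
import time
import urllib.error
import urllib.request

def wait_for_endpoint(url, timeout=600.0, interval=1.0):
    """Poll `url` until it returns HTTP 200; return seconds waited."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return time.monotonic() - start
        except (urllib.error.URLError, ConnectionError, OSError):
            # connection refused / not yet listening; keep polling
            pass
        time.sleep(interval)
    raise TimeoutError(f"{url} not reachable within {timeout}s")

if __name__ == "__main__":
    # Hypothetical usage against a locally running Fluent Bit:
    # delay = wait_for_endpoint("http://localhost:2020/api/v1/metrics/prometheus")
    # print(f"endpoint became available after {delay:.1f}s")
    pass
```

Pointing this at `http://localhost:2020/api/v1/metrics/prometheus` right after starting Fluent Bit would report roughly how long the endpoint took to come up.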
@bharatnc,
Hi @matthewfala, here is our Deployment spec:

```yaml
spec:
  template:
    spec:
      containers:
      - image: public.ecr.aws/aws-observability/aws-for-fluent-bit:2.11.0
        livenessProbe:
          exec:
            command:
            - sh
            - -c
            - find /var/log/flb-storage -cmin -30 | head -n 1 | grep . && curl -s http://localhost:2020/api/v1/metrics | grep '"errors":0'
          failureThreshold: 2
          initialDelaySeconds: 120
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 5
        name: fluent-bit
        ports:
        - containerPort: 2020
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - find /var/log/flb-storage -cmin -30 | head -n 1 | grep . && curl -s http://localhost:2020/api/v1/metrics | grep '"errors":0'
          failureThreshold: 1
          initialDelaySeconds: 30
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 100m
            memory: 80Mi
          requests:
            memory: 80Mi
        securityContext:
          capabilities:
            add:
            - SYSLOG
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /etc/machine-id
          name: machine-id
          readOnly: true
        - mountPath: /var/log
          name: varlog
        - mountPath: /var/lib/docker/containers
          name: varlibdockercontainers
          readOnly: true
        - mountPath: /fluent-bit/etc/
          name: fluent-bit-config
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/role: node
      priorityClassName: production
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: fluent-bit
      serviceAccountName: fluent-bit
      terminationGracePeriodSeconds: 30
      tolerations:
      - operator: Exists
      volumes:
      - hostPath:
          path: /etc/machine-id
          type: ""
        name: machine-id
      - hostPath:
          path: /var/log
          type: ""
        name: varlog
      - hostPath:
          path: /var/lib/docker/containers
          type: ""
        name: varlibdockercontainers
      - configMap:
          defaultMode: 420
          name: fluent-bit-config
        name: fluent-bit-config
```

The last working version is 2.11.0; any newer version fails to expose the Prometheus metrics (so the liveness check fails). Our S3 configuration is simpler.
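For readers following along, the probe command (`find ... && curl ...`) encodes two conditions: a file in the storage path was modified recently, and the metrics JSON reports zero errors. A rough, purely illustrative Python equivalent (the function names are invented here, not part of any config):

```python
# Purely illustrative re-implementation of the probe's two checks;
# not the actual probe used in the cluster.
import json
import os
import time

def has_recent_file(storage_dir, max_age_min=30):
    """True if any file under storage_dir was modified within max_age_min
    minutes (the `find /var/log/flb-storage -cmin -30` half of the probe)."""
    cutoff = time.time() - max_age_min * 60
    for root, _dirs, files in os.walk(storage_dir):
        for name in files:
            if os.path.getmtime(os.path.join(root, name)) >= cutoff:
                return True
    return False

def metrics_ok(metrics):
    """True if no "errors" counter anywhere in the parsed metrics JSON is
    nonzero (the `curl ... | grep '"errors":0'` half, made strict)."""
    if isinstance(metrics, dict):
        return all(
            (key != "errors" or value == 0) and metrics_ok(value)
            for key, value in metrics.items()
        )
    if isinstance(metrics, list):
        return all(metrics_ok(item) for item in metrics)
    return True

def probe(storage_dir, metrics_json_text):
    """Combined liveness condition, mirroring the `&&` in the probe command."""
    return has_recent_file(storage_dir) and metrics_ok(json.loads(metrics_json_text))
```

Note one difference: the real `grep` check passes if any plugin reports `"errors":0`, whereas this sketch requires every errors counter to be zero.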
Any suggestions?
Version 2.21.2 still hangs at startup. I executed the command:
The full log is:
Thank you, @fvasco, for the details on the problem. I'll try to reproduce this again with your config next week.
@matthewfala, I'm sharing my configuration with you; I hope this helps you reproduce my issue.
@matthewfala, did you detect the memory leak?
Still haven't been able to reproduce yet, but the configuration you provided is helpful, @fvasco. I'm working on some other Fluent Bit issues as well, but this is still something I'm trying to get done. Also, what memory leak are you talking about? I thought the issue was just that the Fluent Bit Prometheus endpoint hangs on startup?
My bad.
Hmm, I am still unable to reproduce on Kubernetes... I used 3 Kubernetes resource config files that model your config, simplified down:
Upon startup, which is retried via:
The exec command results in the following:
The S3 endpoint is tested via the test app, and logs are successfully sent to S3. The Prometheus endpoint is available on startup. Would it be possible for you to try running the
Hi @matthewfala, the pod log is:
It hangs here. The
I tested it multiple times and always received a 404.
@matthewfala and @hossain-rayhan Please pardon the late reply.
I actually tried to use the S3 plugin with output to GCS! fluent/fluent-bit#4165 (comment) I observed that this issue only happens with that setup. It doesn't happen with other output plugins; they all work fine. I also tried with AWS S3, which works fine. So I suspect there is some incompatibility there (though it sends logs, it makes the metrics interface unavailable).
It did not make a difference when I was trying to use the S3 plugin to send logs to GCS (it was supposed to be S3-compatible, but apparently it is not). As mentioned above, it all worked fine with the other plugins; the issue I observed happened only with this setup. I'm not sure how many Fluent Bit users have tried using the S3 output plugin to send to GCS ;)
@bharatnc, I see. Thank you for your response and for letting me know. Glad you are able to get this plugin working with AWS S3. I wasn't aware that GCS has an AWS S3 compatibility option. @fvasco Also, if you delete your other filters and plugins one by one, does the error go away?
Closing due to no response. Please reopen if you have any followup questions or concerns. Thank you. |
Describe the question/issue
The Prometheus metrics endpoint /api/v1/metrics/prometheus is not available immediately with the S3 output, and it takes a long time for the interface to become available.
Configuration
Note: For credentials, the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables are used. These are exported and present in the systemd environment when running Fluent Bit.
Fluent Bit Log Output
NA
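As an illustration of the credentials note above, such variables are typically supplied to a systemd-managed Fluent Bit via a drop-in unit; the path and placeholder values below are hypothetical, not taken from this issue:

```ini
# Hypothetical drop-in: /etc/systemd/system/fluent-bit.service.d/override.conf
[Service]
Environment="AWS_ACCESS_KEY_ID=<your-key-id>"
Environment="AWS_SECRET_ACCESS_KEY=<your-secret-key>"
```

After adding such a drop-in, `systemctl daemon-reload` and a service restart would be needed for the variables to take effect.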
Fluent Bit Version Info
Steps to reproduce issue
curl the metrics endpoint:
Expected behavior
I should be able to curl the endpoint immediately without seeing a connection refused error. Curl starts working only after a long time; this varies between my tests from attempt to attempt, generally 5-10 minutes, and also happens to coincide with the first successful upload after buffering 1M of data (according to the settings I use above).
Related Issues
fluent/fluent-bit#4165