Fluent-bit stuck not doing any work #1958
Please try the test image:
@edsiper we can't deploy the test image to our prod infra right away. Do you expect that the hang is related to the other crashes?
I don't see a crash here, just a normal trace. I will release v1.3.8 today, so you can give it a try. Between 1.3.2 and 1.3.7 there are many fixes; upgrading within the stable series is always encouraged. You can take a look here:
Yes, the problem is that Fluent Bit is stuck at the same trace, making no progress despite there being logs waiting on the system. Thanks, we will need to upgrade.
Fluent Bit v1.3.8 has been officially released; please upgrade and send us some feedback: https://fluentbit.io/announcements/v1.3.8/ Docker image:
Hi @edsiper, we're still seeing Fluent Bit hanging with 1.3.8. What can we do to triage it better? It seems like something that might be specific to the Stackdriver output plugin. The stack trace we're seemingly stuck at:
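For reference, a hedged sketch of how a backtrace like that can be captured from the running daemon; the binary name td-agent-bit matches this deployment, and the rest is a plain gdb invocation:

```
# Attach to the running process, dump every thread's backtrace,
# then detach without keeping the daemon stopped.
gdb -p "$(pidof td-agent-bit)" -batch \
    -ex 'set pagination off' \
    -ex 'thread apply all bt'
```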
I attached gdb to the running process. The storage backlog is 62MB on one of the nodes I'm looking at. I will change the storage settings. Will report back.
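For context, a minimal sketch of the filesystem-buffering settings being discussed. The option names come from the Fluent Bit storage/buffering docs; the path and values below are illustrative assumptions, not the actual configuration used here:

```
[SERVICE]
    # Enable filesystem buffering and cap how much backlog is
    # loaded back into memory at startup.
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.checksum          off
    storage.backlog.mem_limit 5M

[INPUT]
    Name         tail
    Path         /var/log/containers/*.log
    # Buffer this input's chunks on disk instead of memory only.
    storage.type filesystem
```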
One thing to try, to rule out slow TLS handshakes: in the output plugin section, add the new keepalive feature:
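A hedged sketch of what such an output section could look like. The plugin name stackdriver matches this thread; net.keepalive is the option name in the current networking docs, and whether that exact key applies to the 1.3.x series is an assumption to verify against the docs for the version in use:

```
[OUTPUT]
    Name          stackdriver
    Match         *
    # Reuse established TLS connections instead of re-handshaking
    # on every flush (key name per current networking docs).
    net.keepalive on
```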
If I set [...]
@dgrala would it be possible for me to get a copy of your storage backlog? (Through a private mechanism, of course.)
@edsiper sorry, we won't be able to share logs from our production cluster :/ We could perhaps set up a test cluster with some toy applications running, but that will take some time to set up.
Can you identify the corrupted chunk file? (It's corrupted anyway.)
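A hedged way to look for suspect chunks on disk, assuming filesystem buffering is enabled and storage.path points at /var/log/flb-storage/ (an example path only):

```
# List the buffered chunk files, largest first; unusually large or
# very old chunks are the usual suspects for corruption.
ls -lhSR /var/log/flb-storage/
```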
If I turn [...]
If checksum is [...]. Now keeping [...]
If I start a new [...]. We will have to apply the config change cluster-wide and wait. Let me try that.
Hi @edsiper, we've changed our settings to [...]
What would you suggest we do?
Hi @dgrala, I am troubleshooting an issue similar to this one, but at a higher scale of backlog files. I will post my findings shortly.
@dgrala on #1975 I am waiting for final confirmation before doing a new release with further performance improvements. I am not 100% confident this is the exact same issue you are mentioning, but it could be related. Do you have specific steps I can follow to reproduce the problem in an isolated way?
@edsiper thanks for the update; it seems like a different issue. For us, many pods just stop making progress and are blocked for good. It might be related to storage, but we don't see much data in storage. We have to restart 30%+ of the fleet every couple of hours. We don't have an isolated repro at this moment, as we're only running this setup in our production cluster.
@dgrala if you look at the metrics exposed by the HTTP endpoint, do you see a lot of retries?
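For reference, a hedged example of pulling those counters; it assumes HTTP_Server is enabled in the [SERVICE] section and listening on the default port 2020:

```
# Dump the internal counters (input/output records, errors, retries)
# from the built-in monitoring endpoint.
curl -s http://127.0.0.1:2020/api/v1/metrics
```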
@edsiper no, not too many retries:
The metrics don't change if I run the query a few minutes later, on "stuck" pods. I actually do see a little network traffic from td-agent-bit, so the app doesn't seem deadlocked: |
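A hedged example of one way to check that: ss with -p resolves which process owns each established connection (root is typically needed to see process names):

```
# Show established TCP connections owned by td-agent-bit,
# e.g. the TLS sessions towards the Stackdriver API.
sudo ss -tnp | grep td-agent-bit
```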
@dgrala would it be possible to arrange a Zoom session so we can troubleshoot?
Yes @edsiper, I emailed you the Zoom meeting invite, but it looks like you couldn't make it earlier today? What day/time works for you?
Confirmed solved as of Fluent Bit 1.4.6.
Bug Report
td-agent-bit seems stuck at the same stack trace, not doing any work. New instances of the process work fine on the same node.
To Reproduce
Seems stuck at this place:
Unclear. It hangs on our prod infra intermittently. We use fluentbit.output.proc.records.total.counter and fluentbit.input.records.total.counter to monitor, but a hung process doesn't produce input or output metrics. Note that I don't see ourservice under /var/log/containers/ourservice--1-0-1071*, but it might've been there before.
Expected behavior
The application should crash so we can restart it.
Your Environment
Environment name and version (e.g. Kubernetes? What version?):
k8s
Server type and version:
Operating System and version:
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
Filters and plugins:
Additional context