scheduler corruption on high number of retries #1956
Please provide a full copy of your configuration, so I can try to reproduce the problem. |
I am using Ansible to deploy, which is why you will see a loop defining one input per defined namespace.
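For illustration only (the reporter's actual configuration was not attached here), the rendered result of such a loop might look like the following, with one tail input per namespace; the namespace names, paths, and tags are hypothetical:

```
[INPUT]
    Name  tail
    Tag   kube.namespace-a.*
    Path  /var/log/containers/*_namespace-a_*.log
    DB    /var/lib/fluentbit/namespace-a.db

[INPUT]
    Name  tail
    Tag   kube.namespace-b.*
    Path  /var/log/containers/*_namespace-b_*.log
    DB    /var/lib/fluentbit/namespace-b.db
```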
I noticed that the CPU on the node that crashed shot up to the limit established in Kubernetes (800 millicores) while the other 3 pods stayed around 20 mc. I checked the fluentd from OpenShift on the same node and it was only taking 60 mc. The difference is that my Fluent Bit uses the tail input, while fluentd on OpenShift uses fluent-plugin-systemd. Is it more efficient to get the logs directly from the journal instead of the filesystem? I cannot see the pods' output in any systemctl unit. |
What about trying the Fluent Bit systemd input plugin? |
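A minimal sketch of that input's configuration; the Systemd_Filter value is an assumption and would need to match whatever unit the container runtime actually logs to:

```
[INPUT]
    Name            systemd
    Tag             host.*
    Systemd_Filter  _SYSTEMD_UNIT=docker.service
    Read_From_Tail  On
```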
I can't find where the logs from my containers are in OpenShift within systemd. I thought the error might be because /var/lib was not properly mounted, but the crash has happened again.
It is now 11:27 AM and I also see:
It looks like a performance issue; I cannot send as fast as I read. |
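When an input outpaces the output, the usual mitigations in Fluent Bit are Mem_Buf_Limit (which pauses the input when the in-memory buffer is full) and filesystem storage (which buffers pending chunks on disk). A minimal sketch with illustrative values, assuming storage.path is set in the [SERVICE] section:

```
[INPUT]
    Name           tail
    Path           /var/log/containers/*.log
    Mem_Buf_Limit  5MB
    storage.type   filesystem
```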
Things are getting worse:
root@vepboanvllbo002:/var/lib/fluentbit # du -hs *
and I am also seeing errors like:
I set the image to 1.3.7-debug and put the log level at trace, and it crashes:
[2020/02/19 10:14:43] [debug] [storage] [cio stream] new stream registered: tail.1
[engine] caught signal (SIGTERM) |
there are two errors:
|
Last week we had to open an issue with Red Hat because the fluentd queues in OpenShift were stopped by this error: "invalid byte sequence in UTF-8", not sending logs to Elasticsearch at all. We have an app which causes this error and we had to add the ENABLE_UTF8_FILTER in fluentd: https://bugzilla.redhat.com/show_bug.cgi?id=1562004 Could this be affecting Fluent Bit too? |
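For reference, the workaround described in that Bugzilla is an environment variable on the OpenShift logging fluentd daemonset; a sketch assuming the stock daemonset name logging-fluentd:

```
oc set env ds/logging-fluentd ENABLE_UTF8_FILTER=true
```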
Fluent Bit v1.3.8 has been officially released, please upgrade and send us some feedback: https://fluentbit.io/announcements/v1.3.8/ Docker image: |
I was on 1.3.8-next 1 in one of the pods yesterday.
and finally a
Before upgrading to 1.3.8 I checked that the upstreams were available, so it looks like it was not trying to reconnect. After upgrading to 1.3.8 it started to send logs and after a while stopped again, with some errors writing the content body, and finally I stopped it when I got
After a restart it looks like it is checking something (it does not crash as before), but after 30 minutes the pod is still not ready and these are the last lines I see:
The next line was written after 32 minutes:
and now it is sending some logs... what has it been doing during these 32 minutes? |
32 minutes is a lot of time, but I can see you are facing network issues: while writing data, the socket gets disconnected from the remote end-point. Would you please try the new keepalive option?
Note that I've removed your Upstream entry. |
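A sketch of what the suggested change could look like on a forward output with keepalive enabled; net.keepalive is the option name in current Fluent Bit docs and may differ in the 1.3 series, and the host/port are placeholders:

```
[OUTPUT]
    Name           forward
    Match          *
    Host           td-agent.example.internal
    Port           24224
    net.keepalive  on
```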
I've enabled the keepalive and restarted everything. What I see is that the more files you have on disk, the more delayed the container init is. Note this
a delay of 30 seconds between reading the storage and beginning to process logs:
and the first flush 1 second after
Now check this other pod:
22 minutes of delay between reading the storage and processing the first log
but when it begins to process, more files are added, so there is more waiting and no flush going on, because of ?
Now it is stuck here. I have got the config of the td-agent listening on the other side:
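The actual td-agent config was not captured above; for context only, a typical forward listener on the td-agent side looks like this (hypothetical values):

```
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>
```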
|
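The startup pauses described above are consistent with Fluent Bit loading and replaying filesystem-buffered chunks at boot. A sketch of the [SERVICE] storage settings that govern this behaviour, with illustrative values:

```
[SERVICE]
    storage.path               /var/lib/fluentbit/
    storage.sync               normal
    storage.checksum           off
    storage.backlog.mem_limit  5M
```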
What's the average memory usage of the Fluent Bit pod? |
|
No crashes once the network issues were fixed, but still getting pauses of 1 hour and high CPU usage. |
I had a Zoom meeting with Raul; this is no longer an issue. Fixed after upgrading to v1.3.8. |
originally reported on #1950 by @rmacian