Fluent-bit sidecar is killed because of networking error #66
In case it helps, we have gotten the same error in our non-live environment, where we have:
Hi @angulito, thanks for your detailed report. To me, it seems like an issue with datadog. However, can you share the fluent-bit config file you are using?
Also, the fluent-bit container no longer needs to be marked essential (996). So, if you do not set it as essential, your task should keep running even if fluent-bit is killed.
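As a sketch (container name and image tag here are illustrative, not taken from the reporter's setup), a non-essential FireLens sidecar in an ECS task definition looks roughly like:

```json
{
  "name": "log_router",
  "image": "amazon/aws-for-fluent-bit:latest",
  "essential": false,
  "firelensConfiguration": {
    "type": "fluentbit"
  }
}
```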
Hey @hossain-rayhan, thanks for your help! Yep, marking the container as non-essential is a good point, but it's a pain in the neck to lose logs in our production environment because of this issue. This is our fluent-bit config (automatically generated by FireLens):
And this is the external file included in the config:
I would really appreciate any help! 🙇
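For readers following along, a FireLens-generated config with a datadog output typically looks roughly like the sketch below; the match pattern, host, service name, and API key are placeholders, not the values from the report above.

```ini
# Sketch of a FireLens-generated config (placeholder values)
[INPUT]
    Name       forward
    unix_path  /var/run/fluent.sock

[OUTPUT]
    Name        datadog
    Match       app-firelens*
    Host        http-intake.logs.datadoghq.com
    TLS         on
    apikey      <redacted>
    dd_service  my-app
    dd_source   ecs
```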
@hossain-rayhan I also want to give you a little more info here:
I'm happy to help you with whatever you need to figure out where this issue comes from! Thanks!
Hi @angulito, thanks for your config and logs. From what I found, exit code 139 occurs because of a segmentation fault, which is mostly related to the environment setup (570, 65). I will come back after further investigation. Looping in @PettitWesley in case he can provide more insights. However, were you able to reproduce it?
@angulito Check this, I think it's the same issue: fluent/fluent-bit#2387. DataDog gave me a test account for this integration; I am trying to reproduce it.
Hey @hossain-rayhan, thanks for the update. We are using the ECS AMI (see the Docker and ECS agent versions in the description), so it's a pity because we cannot upgrade to Docker v19 due to another issue with fluent-bit (if the fluent-bit sidecar dies, the other task containers hang forever, unresponsive, and cannot be killed): fluent/fluent-logger-golang#82. We were not able to reproduce the issue; it just happens "randomly". Sometimes we see the network issue and the container does not die, and other times it dies. Thank you!!
Thank you @PettitWesley! Yep, it seems to be very similar, as the exit code is the same! I will look at that issue, but don't hesitate to ask if you need further info. Thanks!
I set up 2 FireLens tasks with datadog outputs last night to try to repro this. One is just the latest AWS for Fluent Bit image; the other is a custom image built with the valgrind tool, which should help diagnose the segfault if it can be reproduced.
@PettitWesley I have alerts set up so I know when the issue happens in my system. I'll let you know when it happens, as it might be related to datadog network issues and might happen at the same time in your service and mine.
Hey @PettitWesley - the issue happened multiple times for my services during the weekend. Here is the list of dates when any of my containers restarted, in case the same happened to your service at any point (hours are in UTC):
As you can see, around 10 of our 60+ services crashed each day.
@angulito My datadog FireLens tasks still haven't crashed a single time, days after I launched them, so I have not yet been able to reproduce this. Next I will try turning off the network and logging at a high rate, and see if either of those triggers this.
Thank you @PettitWesley! Also, if you want, you can share how you have set up the valgrind tool (I haven't used it before), and if it's easy, I can try to include it in my system to help with the research.
@angulito Here it is:
If you can catch the crash with this, Valgrind will tell you where in the code it originated from.
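For anyone else who wants to try it, a sketch of the approach; it assumes the image is based on Amazon Linux (yum available) and the stock binary/config paths, so adjust as needed:

```dockerfile
# Sketch: run the stock fluent-bit binary under valgrind so a crash produces a trace.
# Assumptions: Amazon Linux base image and the standard aws-for-fluent-bit paths.
FROM amazon/aws-for-fluent-bit:latest
RUN yum install -y valgrind
ENTRYPOINT ["valgrind", "--leak-check=full", "/fluent-bit/bin/fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.conf"]
```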
We are also struggling to reproduce on Datadog's end, still to no avail. We have tried to simulate connection failures and were able to force TCP resets by peer:
But these were not followed by segmentation faults or inappropriate ioctl calls (though those may just be a red herring; they are not uncommon on TTY operations but should not crash the process). It seems to me like this might be triggered not just by networking events/failures, but possibly also by a race condition, in which case increasing the log rate to repro might help. If we do this, we must take into account the side effects of valgrind (as amazing a tool as it is): because valgrind is essentially a VM that recompiles the source binary and has an overhead that slows down runtime execution, it often makes it harder to trigger race conditions. So we might have to keep that in mind. It may be easier to enable core dumps if we can't get good results with valgrind. I think it's safe to assume we're all trying to reproduce on aws-for-fluent-bit v2.6.1 (thus fluent-bit v1.5.2). Also, I have come across this very similar case with the Elasticsearch fluent-bit plugin, fluent/fluent-bit#2416, so perhaps the issue is in the fluent-bit core and not in the plugins themselves.
@truthbk Curious... how do you simulate connection failures and force TCP resets?
I haven't tried to replicate this or looked into it much, but this issue caught my eye as sounding vaguely similar to what is being reported here: fluent/fluent-bit#2497
Hey @PettitWesley and @truthbk, I was finally able to reproduce the issue using valgrind. Here are the fluent-bit logs. Hope it helps. Let me know if I can help with anything else!
@angulito @PettitWesley the valgrind output suggests that we are indeed looking at the same issue being worked on here: fluent/fluent-bit#2507. That PR is still trying to get at the root cause; we're clearly trying to free memory through a bad pointer, and it seems to be a product of destroying that connection twice and of lingering events in the event queue that should be purged. It seems like the discussion is definitely going places, thank you @PettitWesley 🙇 Let me know if I can help in any way.
We don't have a solution yet, but the suggestion in the PR is to try turning off keepalive to prevent the bad code path from being executed:
The behavior of different servers/endpoints must somehow influence this... we have 2 reports from datadog users; I don't have any reports from users of AWS destinations.
@PettitWesley @truthbk thank you guys! 🙇
@angulito It's simple; the net.keepalive option is passed just like any other plugin option. For example, your logConfiguration could be:
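A sketch of that shape (placeholder host, service name, and API key; the relevant line is net.keepalive):

```json
"logConfiguration": {
    "logDriver": "awsfirelens",
    "options": {
        "Name": "datadog",
        "Host": "http-intake.logs.datadoghq.com",
        "TLS": "on",
        "apikey": "<your-api-key>",
        "dd_service": "my-app",
        "net.keepalive": "false"
    }
}
```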
@PettitWesley thank you for looking into this and providing a workaround! Unfortunately, it doesn't seem to work in our case. :( I updated a service with 8 tasks yesterday, and 2 of them already have a dead fluent-bit this morning, with the usual:
Each task has 3 containers (+ fluent-bit), and all of them have the net.keepalive false option set.
A fix has been submitted and will be released soon: fluent/fluent-bit#2531. @florent-tails interesting... that makes me worry there are two issues at play here. Is there anything unique about your networking setup?
Hi guys! I'm experiencing the same issue. I'm using ECS Fargate; the aws-for-fluent-bit sidecar is configured with the default config to send logs to a Fluentd server on an EC2 instance. From time to time the fluent-bit sidecar fails with status code 139 and this error log:
I often see log messages like this, but they are not always fatal:
The SIGSEGV error appears not only with the datadog output.
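For illustration (not @krabradosty's exact setup), a FireLens logConfiguration that ships logs to a Fluentd host over the forward protocol might look like this; the hostname below is hypothetical:

```json
"logConfiguration": {
    "logDriver": "awsfirelens",
    "options": {
        "Name": "forward",
        "Host": "fluentd.internal.example.com",
        "Port": "24224"
    }
}
```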
@florent-tails Your "Inappropriate ioctl for device" is possibly a different error; can you open a separate issue for it upstream on fluent/fluent-bit?
@krabradosty Have you tried disabling net.keepalive as a temporary workaround? (Scroll up a few comments.) If that fixes it, there will be an upstream release soon that fixes the net.keepalive issue.
Fluent Bit v1.5.5 has been released with a fix for the keepalive connection issue: https://fluentbit.io/announcements/v1.5.5/ Please upgrade to this version and let us know if your problems persist.
@PettitWesley I am getting a lot of these error messages with v1.5.5:
Once I downgrade back to v1.5.4, I don't see these "Resource temporarily unavailable" errors but I still see "Inappropriate ioctl for device". While on v1.5.5, I turned on debug mode and got some related logs:
Looking at the GCP Cloud Log API stats, I am only sending 20 logs per second.
@stevenarvar Can you open an upstream fluent/fluent-bit issue with that info? This must be something in the core. Also, these issues are sporadic, right?
The "Resource temporarily unavailable" error is pretty consistent in my case. So, it is not sporadic. |
Hey @PettitWesley, unfortunately, the keepalive false didn't fix the issue for us. This is the datadog config we were using:
and the log error was the same one we were getting before setting the keepalive config to false. I saw there is a new fluent-bit version that contains the fix; I will try it out once it is released in the aws-for-fluent-bit image, and I will let you know if the issue is mitigated! Thanks!
fluent/fluent-bit 1.5.6 has been released with some more fixes: https://fluentbit.io/announcements/v1.5.6/ We have been a bit busy lately... an AWS for Fluent Bit release should come sometime next week.
@PettitWesley we've built a version of aws-for-fluent-bit internally with fluent-bit 1.5.6, and (fingers crossed! 🤞) it seems to be working pretty well! Deployed two days ago, and no crashes yet across 8 replicas. We're going to keep an eye on it, and I'll report back on Monday.
@florent-tails We also released 2.7.0, which includes the fix.
Is anyone still experiencing issues, or can we close this?
@PettitWesley I have been running 2.7.0 for a bit and have not seen the networking issue. For what it's worth, I have seen intermittent OOM issues appear. I had moved to 2.6.1 to solve the fluent-bit memory leak issue with the DataDog plugin. That solved the memory leak but introduced this issue. Now 2.7.0 has solved this issue, but I seem to be having memory issues again.
@PettitWesley our issue has been resolved with AWS for Fluent Bit 2.7.0 and we are good now! We have not experienced any SIGTERM in the last week, and we don't see other issues. Thank you so much for working hard on it!!
AWS for Fluent Bit 2.7.0 contains Fluent Bit 1.5.6. I just tested that version (with the datadog output) under Valgrind and didn't see any memory leaks. What is your memory limit and your rough peak log throughput? You might be interested in this: aws/containers-roadmap#964
@PettitWesley This has been happening across many of our applications with various memory limits and throughputs, but here is one example on the lower-throughput end:
- Task CPU: 512
- App container reserved memory: 512 (no hard limit)
- Approximate log throughput per task: 3k/min
- Destinations: New Relic + DataDog

The FireLens container usually runs at ~60-80 MB consistently, then intermittently jumps to 300-500 MB of memory usage and crashes the task, with no apparent surge in traffic or log throughput. I have just learned that the FireLens sidecar can now be non-essential. I'll definitely look into that as a mitigation for these failures, but I'd like to understand the seemingly random large spikes in memory usage.
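One knob worth checking while investigating the spikes, as a sketch only (the name, tag, and limits below are illustrative, not tuned recommendations): giving the log router its own hard memory limit in the task definition so a runaway sidecar is OOM-killed on its own instead of exhausting the task's memory:

```json
{
  "name": "log_router",
  "image": "amazon/aws-for-fluent-bit:latest",
  "essential": false,
  "memoryReservation": 64,
  "memory": 256,
  "firelensConfiguration": {
    "type": "fluentbit"
  }
}
```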
@PettitWesley from my side, the described problem was solved and is not happening anymore with 2.7.0, so feel free to close this issue. Thanks a lot!
Hey AWS team, I want to flag an issue we are currently having in our production system. We are using the following software versions:
Our tasks are running with the EC2 launch type; I didn't check the behavior with Fargate.
The fluent-bit sidecar container was killed with exit code 139, and as it is an essential container, our task suddenly stopped.
Fluent-bit logs during the crash
Docker daemon logs during the crash
Fluent bit Metrics
Memory and CPU for the fluent-bit logging sidecar are stable over time.
There are no detected errors in our Prometheus metrics for the fluent-bit container around 09:33 (CEST, 07:33 UTC), but that might be because it crashed and the metrics were not scraped. Anyway, I'm sharing the screenshot to show that there are a bunch of datadog errors in the last 3 hours (query interval is 10 minutes).
I'm not sure if it's related to #63; the error messages are different, but the behavior seems similar.