Frequent errors when using output plugin cloudwatch_logs #274
Don't see anything below 😐 |
Just added...hit the wrong button... |
What version are you using? I think it's something old:
I believe this was fixed a while ago. Also, try the |
The version we use is, I believe, 1.8.6. Is this version too old? |
@yangyang919 Unfortunately there have been a lot of bug fixes throughout the 1.8.x series, so yeah, 1.8.6 is a bit old. Please use the AWS distro, the latest version is on 1.8.9: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.21.2 |
thanks @PettitWesley we'll upgrade the version and get back to you |
Hi @PettitWesley, we have upgraded Fluent Bit to version 1.8.10. New issues appeared: the Pod gets OOMKilled and is back-off restarted frequently. If we remove the cloudwatch output part, everything is normal. Our config is below, do you see any problem? [SERVICE]
|
A memory growth issue was reported by another customer for aws-for-fluent-bit version 2.21.1, which is based on Fluent Bit 1.8.9: #277 |
Please also consider checking out relevant sections of our new debugging guide: https://github.com/aws/aws-for-fluent-bit/pull/266/files |
Thanks for replying. Here is what we observed:
So I'm thinking maybe tuning parameters can solve this issue. Our fluent-bit pod has resource limits like this: [INPUT] You have seen more cases like ours. Do you foresee any parameters that could be optimized? Some logs before it gets OOMKilled: [2021/12/10 11:15:50] [debug] [out coro] cb_destroy coro_id=352 |
Which outputs are seeing errors? (Both? Or one more than the other?) Usually OOMKill/high memory usage is caused by errors in Fluent Bit which lead to retries, which means the logs pile up in the buffer. Consider using the monitoring interface to get error counts: https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit Also, what are the error messages?
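For reference, turning on the monitoring interface and health check from that link needs something roughly like the following in the [SERVICE] section (a minimal sketch; the port and thresholds are illustrative):

    [SERVICE]
        # Expose the built-in monitoring API, e.g. /api/v1/metrics and /api/v1/health
        HTTP_Server            On
        HTTP_Listen            0.0.0.0
        HTTP_Port              2020
        # Report unhealthy when error/retry-failure counts exceed these thresholds
        # within each evaluation period (seconds)
        Health_Check           On
        HC_Errors_Count        5
        HC_Retry_Failure_Count 5
        HC_Period              60

The per-output error and retry counters then show up under /api/v1/metrics.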
So the ... When data is sent, it's copied into a new buffer for each output, and then the output might create additional buffers for the HTTP requests and for processing. Thus, it is very reasonable for the following to be true:
That is just for normal/happy cases (and if your pods are producing tons of logs); in error cases, it could be much more... |
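To make that concrete with the numbers that come up later in this thread (6 outputs and a 128M Mem_Buf_Limit), a rough, purely illustrative back-of-the-envelope estimate, not an exact formula, would be:

    128M (input buffer at its limit) + 6 x 128M (one copy per output) ≈ 900M

before counting the extra HTTP/processing buffers or anything piling up on retries.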
@PettitWesley Thanks for replying! Very very important message for us:
We have 6 outputs in total: 3 for FluentD and 3 for CloudWatch. Considering Mem_Buf_Limit is 128M, the actual usage for FluentD is already at least 3*128MB, because we found that FluentD causes many retry logs, while the CloudWatch outputs are quite stable. I'll lower the Mem_Buf_Limit buffer to verify this |
@yangyang919 Remember that lowering the mem_buf_limit could lead to log loss under high throughput and you should check the fluent bit logs for paused/overlimit warnings. |
@PettitWesley yes, thanks for the reminder. I also enabled file system storage. But sure, to really solve this we need to make sure the FluentD side digests our logs; otherwise, no matter how large a buffer we set, it will eventually get full... |
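For anyone following along, enabling filesystem buffering on the input side looks roughly like this (a minimal sketch; the path and sizes are illustrative, not tuned recommendations):

    [SERVICE]
        # Directory where buffered chunks are spilled to disk
        storage.path          /var/log/flb-storage/
        storage.sync          normal
        # Cap how many chunks are kept in memory at any one time
        storage.max_chunks_up 64

    [INPUT]
        Name          tail
        Path          /var/log/containers/*.log
        Tag           kube.*
        # Back this input's chunks with the filesystem; in-memory chunks are
        # then governed by storage.max_chunks_up above
        storage.type  filesystem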
@yangyang919 See this: https://docs.fluentbit.io/manual/administration/monitoring#metrics-examples I think there are disk storage metrics too but only if you use the config option noted here: https://docs.fluentbit.io/manual/administration/monitoring#rest-api-interface |
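In case it saves someone a lookup, the config option referred to there is, as far as I understand it, the storage.metrics toggle; once it is on (together with the HTTP server), the chunk/storage numbers show up under /api/v1/storage:

    [SERVICE]
        HTTP_Server     On
        HTTP_Port       2020
        # Expose chunk/filesystem buffer metrics on the monitoring API
        storage.metrics on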
thanks @PettitWesley, I can see that the buffer is entirely consumed by the tail plugin. A further question: we have a bunch of Filters; is there a way to check memory consumption by those plugins as well?
{"storage_layer":{"chunks":{"total_chunks":10,"mem_chunks":10,"fs_chunks":0,"fs_chunks_up":0,"fs_chunks_down":0}},"input_chunks":{"tail.0":{"status":{"overlimit":true,"mem_size":"19.1M","mem_limit":"19.1M"},"chunks":{"total":10,"up":10,"down":0,"busy":9,"busy_size":"17.6M"}},"systemd.1":{"status":{"overlimit":false,"mem_size":"0b","mem_limit":"4.8M"},"chunks":{"total":0,"up":0,"down":0,"busy":0,"busy_size":"0b"}},"storage_backlog.2":{"status":{"overlimit":false,"mem_size":"0b","mem_limit":"0b"},"chunks":{"total":0,"up":0,"down":0,"busy":0,"busy_size":"0b"}},"emitter_for_rewrite_tag.5":{"status":{"overlimit":false,"mem_size":"0b","mem_limit":"9.5M"},"chunks":{"total":0,"up":0,"down":0,"busy":0,"busy_size":"0b"}}}} |
@yangyang919 Unfortunately there's currently no way to see plugin-specific memory usage... Also, remember the formula I gave used |
thanks @PettitWesley, today I found a very strange behavior. One Pod was restarted 3 times, but if you check its memory consumption, it's definitely below the requested resources (request 400M, limit 800M). Do you have any thoughts? |
@yangyang919 Do the logs contain any evidence that it crashed? Can you get the stopped reason from k8s? |
@PettitWesley The reason is OOMKilled...
The logs before it gets restarted seem ok: [2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY |
@yangyang919 If K8s says it was OOMKilled, I would assume that isn't a lie and is true... so somehow the memory usage spiked, I guess... the graph does show that, right? It just isn't as high as you expect. But I found this article: https://blog.freshtracks.io/a-deep-dive-into-kubernetes-metrics-part-3-container-resource-metrics-361c5ee46e66 And it says:
|
@PettitWesley Really weird... The pod below has restarted once in the last 60 minutes due to OOMKilled, but container_memory_working_set_bytes still looks completely normal...
|
You do have a mem buf overlimit warning. Also, I wonder if the memory spikes very suddenly in your case and the graph never catches up to the real memory usage. Each crash has a sudden increase in memory right before it. This is my guess. As to what is causing that... I am not sure... no retries or other errors? |
@PettitWesley But I'm still guessing there is a memory leak somewhere in our Fluent Bit; I'm not sure which plugin is leaking. The memory grows very fast, basically ascending vertically, and then the pod gets OOMKilled. If I only enable the outputs to FluentD or the outputs to CloudWatch, everything is fine. If both are enabled, then OOMKills happen. I attached the config below; this is currently the most stable version. In this version there are still restarts due to OOMKilled, but not as frequent. Some points worth mentioning:
service: |
customParsers: |
luaScripts: |
@yangyang919 Sorry I'm not really sure what's causing this for you. For another issue, I recently did some testing/benchmarking and memory leak checking, and I found that our latest stable version is very stable in memory usage and does not have any leaks: https://github.com/aws/aws-for-fluent-bit/blob/mainline/AWS_FOR_FLUENT_BIT_STABLE_VERSION Please try that version if you haven't already. |
@PettitWesley I can confirm that the stable version 2.28.5 is much better with the cloudwatch_logs plugin vs. the latest "2.31.x" |
@PettitWesley is https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/oomkill-prevention applicable to EKS as well? |
@vkadi the recommendations and fluent bit config options can be used in EKS as well. Fluent Bit config language is the same no matter where you deploy it. |
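As a starting point, the output-side knobs from that kind of guidance translate to EKS unchanged; a minimal, illustrative sketch (the values and the log group/stream names are placeholders, not tuned recommendations):

    [OUTPUT]
        Name              cloudwatch_logs
        Match             kube.*
        region            eu-central-1
        # hypothetical placeholder names
        log_group_name    my-log-group
        log_stream_prefix my-prefix-
        # Cap retries per chunk so failed data does not pile up indefinitely
        Retry_Limit       2
        # Run this output in its own worker thread
        workers           1
        # Cap this destination's buffered backlog (with filesystem buffering enabled)
        storage.total_limit_size 256M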
Describe the question/issue
I know this is not a new issue, but I'm wondering whether the root cause has been identified. We are using the cloudwatch_logs plugin to send logs from our Pods to AWS CloudWatch. The errors look like the below:
[2021/12/01 01:53:50] [ warn] [engine] failed to flush chunk '1-1638323628.679735747.flb', retry in 10 seconds: task_id=0, input=emitter_for_rewrite_tag.5 > output=http.2 (out_id=2)
[2021/12/01 02:02:41] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/12/01 02:02:41] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/12/01 02:02:41] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:02:41] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:02:41] [ warn] [engine] failed to flush chunk '1-1638324161.187277303.flb', retry in 7 seconds: task_id=0, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/01 02:02:55] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/12/01 02:02:55] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/12/01 02:02:55] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:02:55] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:02:55] [ warn] [engine] failed to flush chunk '1-1638324175.42999390.flb', retry in 7 seconds: task_id=0, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/01 02:04:29] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/12/01 02:04:29] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/12/01 02:04:29] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:04:29] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:04:29] [ warn] [engine] failed to flush chunk '1-1638324268.678872073.flb', retry in 9 seconds: task_id=0, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/01 02:04:38] [ warn] [http_client] malformed HTTP response from logs.eu-central-1.amazonaws.com:443 on connection #34
[2021/12/01 02:04:38] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:04:38] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:04:38] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:04:38] [ warn] [engine] chunk '1-1638324268.678872073.flb' cannot be retried: task_id=0, input=tail.0 > output=cloudwatch_logs.3
[2021/12/01 02:07:49] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/12/01 02:07:49] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/12/01 02:07:49] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:07:49] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:07:49] [ warn] [engine] failed to flush chunk '1-1638324469.352128961.flb', retry in 11 seconds: task_id=0, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/01 02:08:00] [ warn] [http_client] malformed HTTP response from logs.eu-central-1.amazonaws.com:443 on connection #102
[2021/12/01 02:08:00] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:08:00] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:08:00] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:08:00] [ warn] [engine] chunk '1-1638324469.352128961.flb' cannot be retried: task_id=0, input=tail.0 > output=cloudwatch_logs.3
[2021/12/01 02:10:31] [ warn] [http_client] malformed HTTP response from logs.eu-central-1.amazonaws.com:443 on connection #102
[2021/12/01 02:10:31] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:10:31] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:10:31] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:10:31] [ warn] [engine] failed to flush chunk '1-1638324630.866492143.flb', retry in 9 seconds: task_id=0, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/01 02:11:41] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/12/01 02:11:41] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[INPUT]
Name tail
Path /var/log/containers/*.log
Parser custom_cri
Tag kube.*
Mem_Buf_Limit 5MB
Skip_Long_Lines On
[INPUT]
Name systemd
Tag host.*
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
Read_From_Tail On