
Fluent-bit sidecar is killed because of networking error #66

Closed
angulito opened this issue Aug 17, 2020 · 44 comments

@angulito

Hey AWS team, I want to flag an issue we are currently having in our production system. We are using the following software versions:

  • aws-for-fluent-bit: 2.6.1
  • docker: 18.09.9-ce
  • ecs agent: 1.36.2

Our tasks are running in EC2 mode; I didn't check the behavior with Fargate.

The fluent-bit sidecar container was killed with exit code 139, and since it is an essential container, our task suddenly stopped.

Fluent-bit logs during the crash

[2020/08/17 07:32:29] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
[engine] caught signal (SIGSEGV)
[2020/08/17 07:33:00] [error] [tls] SSL error: NET - Connection was reset by peer
[2020/08/17 07:33:00] [error] [src/flb_http_client.c:1077 errno=25] Inappropriate ioctl for device
[2020/08/17 07:33:00] [error] [output:datadog:datadog.1] could not flush records to http-intake.logs.datadoghq.com:443 (http_do=-1)

Docker daemon logs during the crash

Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.678373199Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.910700627Z" level=error msg="Failed to log msg \"\" for logger fluentd: write unix @->/var/run/fluent.sock: write: broken pipe"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.926290270Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.931976771Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.932007976Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.932909970Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.933183296Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.933869906Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.979820912Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.979910901Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.980036496Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.983592287Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:00 <ip> dockerd[3364]: time="2020-08-17T07:33:00.992278230Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.112475840Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.112527938Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.115452827Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.132519235Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.132559388Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.136487318Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.140677989Z" level=error msg="Failed to log msg \"\" for logger fluentd: fluent#send: can't send logs, client is reconnecting"
Aug 17 07:33:01 <ip> dockerd[3364]: time="2020-08-17T07:33:01.295414547Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

Fluent bit Metrics

Memory and CPU for the fluent-bit logging sidecar are stable over time:

(screenshot: fluent-bit sidecar memory and CPU usage over time)

There are no errors detected in our Prometheus metrics for the fluent-bit container around 09:33 (CEST, 07:33 UTC), but that might be because it crashed and the metrics were not scraped. Anyway, I'm sharing the screenshot to show that there are a bunch of Datadog errors in the last 3 hours (query interval is 10 minutes):
(screenshot: Prometheus metrics showing Datadog output errors over the last 3 hours)

I'm not sure if it's related to #63 - the error messages are different, but the behavior seems similar.

@angulito
Author

In case it helps, we got the same error in our non-live environment, where we have debug-level logging enabled for fluent-bit. These are the fluent-bit debug logs for this issue:

[2020/08/17 10:05:30] [debug] [task] created task=0x7f805a245780 id=0 OK
[2020/08/17 10:05:30] [debug] [upstream] KA connection #35 to http-intake.logs.datadoghq.com:443 has been assigned (recycled)
[2020/08/17 10:05:30] [ info] [output:datadog:datadog.2] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
[2020/08/17 10:05:30] [debug] [upstream] KA connection #35 to http-intake.logs.datadoghq.com:443 is now available
{"status":"ok","batchID":"<id>"}
[2020/08/17 10:05:30] [debug] [task] destroy task=0x7f805a245780 (task_id=0)
[2020/08/17 10:05:31] [debug] [task] created task=0x7f805a245780 id=0 OK
[2020/08/17 10:05:31] [debug] [upstream] KA connection #35 to http-intake.logs.datadoghq.com:443 has been assigned (recycled)
[2020/08/17 10:05:31] [ info] [output:datadog:datadog.2] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
[2020/08/17 10:05:31] [debug] [upstream] KA connection #35 to http-intake.logs.datadoghq.com:443 is now available
{"status":"ok","batchID":"<id>"}
[2020/08/17 10:05:31] [debug] [task] destroy task=0x7f805a245780 (task_id=0)
[2020/08/17 10:06:01] [debug] [upstream] drop keepalive connection to http-intake.logs.datadoghq.com:443 (keepalive idle timeout)
[2020/08/17 10:06:01] [debug] [task] created task=0x7f805a245500 id=0 OK
[2020/08/17 10:06:01] [debug] [upstream] KA connection #35 to http-intake.logs.datadoghq.com:443 has been assigned (recycled)
[2020/08/17 10:06:01] [error] [tls] SSL error: NET - Connection was reset by peer
[2020/08/17 10:06:01] [error] [src/flb_http_client.c:1077 errno=25] Inappropriate ioctl for device
[2020/08/17 10:06:01] [error] [output:datadog:datadog.2] could not flush records to http-intake.logs.datadoghq.com:443 (http_do=-1)
[2020/08/17 10:06:01] [debug] [upstream] KA connection #-1 to http-intake.logs.datadoghq.com:443 has been disconnected by the remote service
[engine] caught signal (SIGSEGV)

@hossain-rayhan
Contributor

Hi @angulito, thanks for your detailed report. To me it seems like an issue with Datadog. However, can you share the fluent-bit config file you are using?

@hossain-rayhan
Contributor

Also, the fluent-bit container no longer needs to be marked essential (996). So if you do not set it as essential, your task should keep running even if fluent-bit is killed.
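
For illustration, a minimal task definition sketch of that setup (container names and images here are placeholders, not taken from this issue), where the FireLens log router is not essential, so the application container keeps running if fluent-bit dies:

"containerDefinitions": [
    {
        "name": "app",
        "image": "my-app:latest",
        "essential": true,
        "logConfiguration": {
            "logDriver": "awsfirelens",
            "options": { "Name": "datadog" }
        }
    },
    {
        "name": "log_router",
        "image": "amazon/aws-for-fluent-bit:latest",
        "essential": false,
        "firelensConfiguration": { "type": "fluentbit" }
    }
]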

@angulito
Author

Hey @hossain-rayhan, thanks for your help! Yep, marking the container as non-essential is a good point, but it's still a pain in the neck to lose logs in our production environment because of this issue.

This is our fluent-bit config (automatically generated by awsfirelens)

## fluent-bit.config

[INPUT]
    Name forward
    unix_path /var/run/fluent.sock

[INPUT]
    Name forward
    Listen 0.0.0.0
    Port 24224

[INPUT]
    Name tcp
    Tag firelens-healthcheck
    Listen 127.0.0.1
    Port 8877

[FILTER]
    Name record_modifier
    Match *
    Record ec2_instance_id i-123456789
    Record ecs_cluster my-cluster
    Record ecs_task_arn arn:aws:ecs:<aws-region>:<id>:task/<task-id>
    Record ecs_task_definition my-task:1

@INCLUDE /fluent-bit/etc/external.conf

[OUTPUT]
    Name null
    Match firelens-healthcheck

[OUTPUT]
    Name datadog
    Match my-task-firelens*
    Host http-intake.logs.datadoghq.com
    TLS true
    apikey ${apikey_0}
    dd_message_key log
    dd_service my-task
    dd_source go
    dd_tags env:prod
    provider ecs

And this is the external file included in the config:

## /fluent-bit/etc/external.conf

[SERVICE]
    # Flush
    # =====
    # Set an interval of seconds before to flush records to a destination
    Flush 1

    # Grace
    # =====
    # Set the grace period in seconds
    Grace 30

    # HTTP Server
    # Enable/Disable the built-in HTTP Server for metrics
    # ===========
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020

I would really appreciate any help! 🙇
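
For reference, here is how that external file is wired in on the FireLens side - a minimal sketch of the firelensConfiguration block (assuming the file is baked into the log router image; the path mirrors the @INCLUDE above):

"firelensConfiguration": {
    "type": "fluentbit",
    "options": {
        "config-file-type": "file",
        "config-file-value": "/fluent-bit/etc/external.conf"
    }
}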

@angulito
Author

@hossain-rayhan I also want to provide a little more info here:

  • container exit status is 139
  • SIGSEGV is not always received - sometimes I see the error logs and, after that, a retry is scheduled and succeeds:
    2020-08-18T21:48:39.617+02:00
[2020/08/18 19:48:39] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
2020-08-18T21:49:09.492+02:00
[2020/08/18 19:49:09] [error] [tls] SSL error: NET - Connection was reset by peer
2020-08-18T21:49:09.492+02:00
[2020/08/18 19:49:09] [error] [src/flb_http_client.c:1077 errno=25] Inappropriate ioctl for device
2020-08-18T21:49:09.492+02:00
[2020/08/18 19:49:09] [error] [output:datadog:datadog.1] could not flush records to http-intake.logs.datadoghq.com:443 (http_do=-1)
2020-08-18T21:49:09.492+02:00
[2020/08/18 19:49:09] [ warn] [engine] failed to flush chunk '1-1597780149.339654606.flb', retry in 10 seconds: task_id=0, input=forward.0 > output=datadog.1
2020-08-18T21:49:11.787+02:00
[2020/08/18 19:49:11] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
2020-08-18T21:49:12.703+02:00
[2020/08/18 19:49:12] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
2020-08-18T21:49:13.628+02:00
[2020/08/18 19:49:13] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
2020-08-18T21:49:16.587+02:00
[2020/08/18 19:49:16] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
2020-08-18T21:49:18.627+02:00
[2020/08/18 19:49:18] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
2020-08-18T21:49:19.606+02:00
[2020/08/18 19:49:19] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
2020-08-18T21:49:19.606+02:00
[2020/08/18 19:49:19] [ info] [engine] flush chunk '1-1597780149.339654606.flb' succeeded at retry 1: task_id=1, input=forward.0 > output=datadog.1
2020-08-18T21:49:55.801+02:00
[2020/08/18 19:49:55] [ info] [output:datadog:datadog.1] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}

I'm happy to help with whatever you need to figure out where this issue comes from. Thanks!

@hossain-rayhan
Contributor

hossain-rayhan commented Aug 20, 2020

Hi @angulito, thanks for your config and logs. From what I found, exit code 139 occurs because of a segmentation fault, which is mostly related to the environment setup (570, 65).

I will come back after further investigation. Looping in @PettitWesley in case he can provide more insight.

Also, were you able to reproduce it?

@PettitWesley
Contributor

@angulito Check this; I think it's the same issue: fluent/fluent-bit#2387

Datadog gave me a test account for this integration, and I am trying to reproduce it.

@angulito
Author

Hey @hossain-rayhan, thanks for the update. We are using the ECS AMI (see the Docker and ECS agent versions in the issue description), so unfortunately we cannot upgrade to Docker v19 because of another issue with fluent-bit (if the fluent-bit sidecar dies, the other task containers hang forever, unresponsive, and cannot be killed): fluent/fluent-logger-golang#82.

We were not able to reproduce the issue; it just happens "randomly". Sometimes we see the network error and the container does not die, and other times it does.

Thank you!!

@angulito
Author

Thank you @PettitWesley! Yep, it seems very similar, as the exit code is the same. I will look at that issue, but don't hesitate to ask if you need further info. Thanks!

@PettitWesley
Contributor

I set up two FireLens tasks with Datadog outputs last night to try to repro this. One is just the latest AWS for Fluent Bit image; the other is a custom image built with the Valgrind tool, which should help diagnose the segfault if it can be reproduced.

@angulito
Author

@PettitWesley I have alerts set up to notify me when the issue happens in my system. I'll let you know when it happens, since it might be related to Datadog network issues and could occur at the same time in your service and mine.
Thank you!

@angulito
Author

Hey @PettitWesley - the issue happened multiple times for my services over the weekend. Here is the list of times when any of my containers restarted, in case the same happened to your service - times are in UTC:

2020/08/22 3:13 AM
2020/08/22 9:23 AM
2020/08/22 10:50 AM
2020/08/22 11:13 AM
2020/08/22 12:09 PM
2020/08/22 2:04 PM
2020/08/22 3:51 PM
2020/08/22 4:10 PM
2020/08/22 5:20 PM
2020/08/22 10:34 PM
2020/08/22 11:39 PM
2020/08/23 1:45 AM
2020/08/23 2:09 AM
2020/08/23 3:56 AM
2020/08/23 5:41 AM
2020/08/23 11:23 AM
2020/08/23 3:42 PM
2020/08/23 5:59 PM
2020/08/23 6:45 PM
2020/08/23 7:03 PM
2020/08/23 10:08 PM
2020/08/23 10:51 PM
2020/08/24 4:08 AM
2020/08/24 6:37 AM

As you can see, around 10 of our 60+ services crashed each day.

@PettitWesley
Contributor

@angulito My Datadog FireLens tasks still haven't crashed a single time, days after I launched them, so I have not yet been able to reproduce this.

Next I will try turning off the network and logging at a high rate, and see if either of those triggers this.

@angulito
Author

Thank you @PettitWesley! Also, if you want, you can share how you set up the Valgrind tool (I haven't used it before); if it's easy, I can try to include it in my system to help with the research.

@PettitWesley
Contributor

@angulito Here it is:

FROM amazon/aws-for-fluent-bit:latest
# Add Valgrind on top of the stock image
RUN yum -y install valgrind
# Run Fluent Bit under Valgrind so a crash reports where it originated
ENTRYPOINT valgrind /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf

If you can catch the crash with this, Valgrind will tell you where in the code it originated from.
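
If you want to sanity-check the image locally before wiring it into a task definition, something like this should work (the tag name and mounted config path are just examples):

docker build -t fluent-bit-valgrind .
docker run --rm -v $(pwd)/fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf fluent-bit-valgrind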

@truthbk

truthbk commented Aug 26, 2020

We are also trying to reproduce this on Datadog's end, so far to no avail.

We have tried to simulate connection failures and were able to force TCP resets by the peer:

flb_io_tls.c:356 NET - Connection was reset by peer

But these were not followed by segmentation faults or inappropriate ioctl errors (though those may just be a red herring; they are not uncommon in TTY operations, but should not crash the process).

It seems to me like this might be triggered not just by networking events/failures but possibly also by a race condition, in which case increasing the log rate to repro might help. If we do this, we must take into account the side effects of Valgrind (as amazing a tool as it is): because Valgrind is essentially a VM that recompiles the binary, with overhead that slows down execution, it often makes race conditions harder to trigger. We should keep that in mind; it may be easier to enable core dumps if we can't get good results with Valgrind.
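
On the core-dump route, for reference, a rough local-Docker sketch (the paths and core pattern here are assumptions, not anything from this thread):

# on the host: write cores to a known directory (containers inherit this pattern)
sudo sysctl -w kernel.core_pattern=/tmp/cores/core.%e.%p
mkdir -p /tmp/cores
# run the log router with an unlimited core size and the core directory mounted
docker run --ulimit core=-1 -v /tmp/cores:/tmp/cores amazon/aws-for-fluent-bit:latest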

I think it's safe to assume we're all trying to reproduce on aws-for-fluent-bit: v2.6.1 (thus fluent-bit v1.5.2).

Also, I have come across a very similar case with the Elasticsearch fluent-bit plugin (fluent/fluent-bit#2416), so perhaps the issue is in the fluent-bit core and not in the plugins themselves.

@PettitWesley
Contributor

@truthbk Curious... how do you simulate connection failures and force TCP resets?

@PettitWesley
Contributor

I haven't tried to replicate this or looked into it much, but this issue caught my eye as sounding vaguely similar to what is being reported here: fluent/fluent-bit#2497

@angulito
Author

Hey @PettitWesley and @truthbk, I was finally able to reproduce the issue using Valgrind. Here are the fluent-bit logs:
fluent-bit-restart-logs.txt

Hope it helps. Let me know if I can help with anything else!

@truthbk

truthbk commented Aug 31, 2020

@angulito @PettitWesley the Valgrind output suggests that we're indeed looking at the same issue being worked on here: fluent/fluent-bit#2507. That PR is still trying to get at the root cause; we're clearly freeing memory through a bad pointer, and it definitely seems to be a product of destroying that connection twice and of lingering events in the event queue that should have been purged. It seems like the discussion is definitely going places - thank you @PettitWesley 🙇

Let me know if I can help in any way.

@PettitWesley
Contributor

We don't have a solution yet, but the suggestion in the PR is to try turning off keepalive to prevent the bad code path from being executed:

[OUTPUT]
    name          something
    match         *
    net.keepalive false

The behavior of different servers/endpoints must somehow influence this... we have two reports from Datadog users; I don't have any reports from users of AWS destinations.

@angulito
Author

angulito commented Sep 2, 2020

@PettitWesley @truthbk thank you guys! 🙇
@PettitWesley I would like to ask if there is any way to set net.keepalive false via the awsfirelens log driver automatically (without needing to manage the entire config ourselves).

@PettitWesley
Contributor

I would like to ask if there is any way to set net.keepalive false via the awsfirelens log driver automatically (without needing to manage the entire config ourselves).

@angulito It's simple; the net.keepalive option is just like any other option.

For example, your logConfiguration could be:

			 "logConfiguration": {
				 "logDriver":"awsfirelens",
				 "options": {
					"Name": "datadog",
					"apiKey": "<key>",
					"provider": "ecs",
					"dd_service": "my-app-service",
					"TLS": "On",
					"dd_source": "my-app",
                                        "net.keepalive": "false"
				}
			}

@florent-tails

florent-tails commented Sep 3, 2020

@PettitWesley thank you for looking into this and providing a workaround! Unfortunately it doesn't seem to work in our case. :(

I updated a service with 8 tasks yesterday, and 2 of them already have a dead fluent-bit this morning, with the usual:

[2020/09/02 21:01:20] [ info] [output:datadog:datadog.3] https://http-intake.logs.datadoghq.com, port=443, HTTP status=200 payload={}
[2020/09/02 21:01:21] [error] [tls] SSL error: NET - Connection was reset by peer
[2020/09/02 21:01:21] [error] [src/flb_http_client.c:1077 errno=25] Inappropriate ioctl for device
[2020/09/02 21:01:21] [error] [output:datadog:datadog.1] could not flush records to http-intake.logs.datadoghq.com:443 (http_do=-1)
[engine] caught signal (SIGSEGV)

Each task has 3 containers (plus fluent-bit), and all of them have "net.keepalive": "false" in logConfiguration -> options.

@PettitWesley
Contributor

A fix has been submitted and will be released soon: fluent/fluent-bit#2531

@florent-tails interesting... that makes me worry there are two issues at play here. Is there anything unique about your networking setup?

@krabradosty

krabradosty commented Sep 7, 2020

Hi guys!

I'm experiencing the same issue. I'm using ECS Fargate; the aws-for-fluent-bit sidecar is configured with the default config to send logs to a Fluentd server on an EC2 instance. From time to time the fluent-bit sidecar fails with exit code 139 and this error log:

[2020/08/24 16:25:30] [debug] [upstream] drop keepalive connection to elk.ecov1stage.internal.opt:24224 (keepalive idle timeout)
[2020/08/24 16:25:30] [debug] [task] created task=0x7f9257880280 id=0 OK
[2020/08/24 16:25:30] [debug] [output:forward:forward.1] request 1050 bytes to flush
[2020/08/24 16:25:30] [debug] [upstream] KA connection #24 to elk.ecov1stage.internal.opt:24224 has been assigned (recycled)
[2020/08/24 16:25:30] [error] [output:forward:forward.1] could not write forward header
[2020/08/24 16:25:30] [debug] [upstream] KA connection #-1 to elk.ecov1stage.internal.opt:24224 has been disconnected by the remote service
[engine] caught signal (SIGSEGV)

I often see log messages like this, but they are not always fatal:

[2020/08/24 16:24:00] [error] [output:forward:forward.1] could not write forward header

The SIGSEGV error appears not only with the Datadog output.
Hope this information helps.

@PettitWesley
Contributor

@florent-tails Your "Inappropriate ioctl for device" is possibly a different error- can you open a separate issue for it on upstream fluent/fluent-bit?

@PettitWesley
Contributor

@krabradosty Have you tried disabling net.keepalive as a temporary workaround? (Scroll up a few comments.) If that fixes it, there will be an upstream release soon that addresses the net.keepalive issue.

@PettitWesley
Contributor

Fluent Bit v1.5.5 has been released with a fix for the keepalive connection issue: https://fluentbit.io/announcements/v1.5.5/

Please upgrade to this version and let us know if your problems persist.

@stevenarvar

stevenarvar commented Sep 8, 2020

@PettitWesley I am getting a lot of these error messages with v1.5.5:

[2020/09/08 22:46:59] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:46:59] [error] [src/flb_upstream.c:356 errno=25] Inappropriate ioctl for device
[2020/09/08 22:47:03] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:47:04] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:47:18] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:47:19] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:47:33] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:47:34] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:47:48] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:47:49] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:48:03] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:48:04] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:48:18] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:48:19] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:48:33] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:48:34] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:48:48] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:48:49] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:49:03] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:49:04] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:49:18] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable
[2020/09/08 22:49:19] [error] [src/flb_upstream.c:356 errno=11] Resource temporarily unavailable

Once I downgrade back to v1.5.4, I don't see the "Resource temporarily unavailable" errors, but I still see "Inappropriate ioctl for device".

While on v1.5.5, I turned on debug mode and got some related logs:

[2020/09/08 22:45:15] [debug] [input:tail:myfile.tail] inode=25194448 events: IN_MODIFY
[2020/09/08 22:45:15] [debug] [input:tail:qa02_uswest2.kubernetes.tail] inode=25194448 events: IN_MODIFY
[2020/09/08 22:45:15] [debug] [task] created task=0x7f6c93ab2540 id=0 without routes, dropping.
[2020/09/08 22:45:15] [debug] [task] destroy task=0x7f6c93ab2540 (task_id=0)
[2020/09/08 22:45:15] [debug] [task] created task=0x7f6c93ab2540 id=0 without routes, dropping.
[2020/09/08 22:45:15] [debug] [task] destroy task=0x7f6c93ab2540 (task_id=0)
[2020/09/08 22:45:15] [debug] [task] created task=0x7f6c93ab2540 id=0 OK
[2020/09/08 22:45:15] [error] [src/flb_upstream.c:356 errno=25] Inappropriate ioctl for device
[2020/09/08 22:45:15] [debug] [upstream] KA connection #23 is in a failed state to: logging.googleapis.com:443, cleaning up
[2020/09/08 22:45:16] [debug] [output:stackdriver:stackdriver.0] HTTP Status=200
[2020/09/08 22:45:16] [debug] [upstream] KA connection #23 to logging.googleapis.com:443 is now available

Looking at the GCP Cloud Log API stats, I am only sending 20 logs per second.

@PettitWesley
Contributor

@stevenarvar Can you open an upstream fluent/fluent-bit issue with that info? This must be something in the core.

Also - these issues are sporadic, right?

@stevenarvar

The "Resource temporarily unavailable" error is pretty consistent in my case. So, it is not sporadic.


@angulito
Author

angulito commented Sep 9, 2020

Hey @PettitWesley, unfortunately net.keepalive false didn't fix the issue for us. This is the Datadog config we were using:

[OUTPUT]
    Name datadog
    Match our-service*
    Host http-intake.logs.datadoghq.com
    TLS true
    apikey ${apikey_0}
    dd_message_key log
    dd_service our-service
    dd_source our-source
    dd_tags env:prod
    net.keepalive false
    provider ecs

and the error log was the same one we were getting before setting net.keepalive to false:
(screenshot: the same error log lines as before)

I saw there is a new fluent-bit version that contains the fix. I will try it out once it is released in the aws-for-fluent-bit image, and I will let you know if the issue is mitigated. Thanks!

@PettitWesley
Contributor

fluent/fluent-bit 1.5.6 has been released with some more fixes: https://fluentbit.io/announcements/v1.5.6/

We have been a bit busy lately... an AWS for Fluent Bit release should come sometime next week.

@florent-tails

@PettitWesley we've built a version of aws-for-fluent-bit internally with fluent-bit 1.5.6, and (fingers crossed 🤞) it seems to be working pretty well! We deployed it two days ago and have had no crashes yet across 8 replicas. We're going to keep an eye on it, and I'll report back on Monday.

@hossain-rayhan
Contributor

@florent-tails We also released aws-for-fluent-bit 2.7.0 (which includes fluent-bit 1.5.6) yesterday. If you want to try it and report back, that would be great.
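
For anyone picking it up, the switch is just the log router image tag in the task definition - a minimal sketch (the container name here is illustrative):

{
    "name": "log_router",
    "image": "amazon/aws-for-fluent-bit:2.7.0",
    "essential": false,
    "firelensConfiguration": { "type": "fluentbit" }
}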

@PettitWesley
Contributor

Is anyone still experiencing issues or can we close this?

@blytheaw

@PettitWesley I have been running 2.7.0 for a bit and have not seen the networking issue. For what it's worth, I have seen intermittent OOM issues appear. I had moved to 2.6.1 to solve the fluent-bit memory leak issue with the Datadog plugin. That solved the memory leak but introduced this issue. Now 2.7.0 has solved this issue, but I seem to be having memory issues again.

@angulito
Author

@PettitWesley our issue has been resolved with aws-for-fluent-bit 2.7.0 and we are good now! We have not seen any SIGSEGV crashes in the last week, and we don't see other issues. Thank you so much for working hard on it!!

@PettitWesley
Contributor

@blytheaw

Now 2.7.0 has solved this issue, but I seem to be having memory issues again.

AWS for Fluent Bit 2.7.0 contains Fluent Bit 1.5.6. I just tested that version (with the Datadog output) under Valgrind and didn't see any memory leaks.

What is your memory limit and your rough peak log throughput?

You might be interested in this: aws/containers-roadmap#964

@blytheaw

blytheaw commented Oct 9, 2020

@PettitWesley This has been happening across many of our applications with various memory limits and throughputs, but here is one example on the lower throughput end:

Task CPU: 512
Task Memory: 1024

App container reserved memory: 512 (no hard limit)
Firelens reserved memory: 128 (no hard limit)

Approximate log throughput per task: 3k/min

I am sending to two destinations: New Relic + Datadog
I am using amazon/aws-for-fluent-bit:2.7.0

FireLens usually runs at ~60-80 MB consistently, then intermittently jumps to 300-500 MB of memory usage and crashes the task, with no apparent surge in traffic or log throughput.

I have just learned that the FireLens sidecar can now be non-essential. I'll definitely look into that as a mitigation for these failures, but I'd like to understand the seemingly random large spikes in memory usage.

@angulito
Author

angulito commented Jan 12, 2021

@PettitWesley from my side, the described problem was solved and is not happening anymore with 2.7.0, so feel free to close this issue. Thanks a lot!
