
Losing logs with S3 plugin #495

fhitchen opened this issue Dec 13, 2022 · 11 comments

@fhitchen

Describe the question/issue

Team members have been complaining that log file entries are missing when they search for them in Kibana. The fluent-bit logs do show failures and retries happening quite frequently, but there appears to be one particular error sequence that never seems to get retried. On our 64-node EKS cluster I can see the error Could not send chunk with tag happening around 1,600 times a day in OpenSearch.

I have gone through all the steps in the debugging issues guide and nothing has helped so far, so I am hoping for more guidance here.

Configuration

[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    Exclude_Path      /var/log/containers/bad*.log
    multiline.parser  docker, cri
    DB                /var/log/flb_kube.db
    Mem_Buf_Limit     50MB
    Skip_Long_Lines   On
    Refresh_Interval  10
    storage.type      filesystem

[OUTPUT]
    Name                         s3
    Match                        kube.*
    bucket                       my-bucket
    region                       us-west-2
    total_file_size              5M
    s3_key_format                /%Y/%m/%d/%H/$UUID.log
    s3_key_format_tag_delimiters .-
    upload_timeout               5m
    store_dir                    /tmp/fluent-bit/s3
    workers                      1

Fluent Bit Log Output

I have tried turning on debug logging, but it was so verbose that I could not see the error happening.

This is the error sequence that does not appear to be subject to any kind of retry.

[2022/12/13 15:58:59] [error] [http_client] broken connection to s3.us-west-2.amazonaws.com:443 ?
[2022/12/13 15:58:59] [error] [http_client] broken connection to s3.us-west-2.amazonaws.com:443 ?
[2022/12/13 15:58:59] [error] [output:s3:s3.1] PutObject request failed
[2022/12/13 15:58:59] [error] [output:s3:s3.1] Could not send chunk with tag kube.var.log.containers.x.log
[2022/12/13 15:58:59] [error] [http_client] broken connection to s3.us-west-2.amazonaws.com:443 ?
[2022/12/13 15:58:59] [error] [http_client] broken connection to s3.us-west-2.amazonaws.com:443 ?
[2022/12/13 15:58:59] [error] [output:s3:s3.1] PutObject request failed
[2022/12/13 15:58:59] [error] [output:s3:s3.1] Could not send chunk with tag kube.var.log.containers.y.log
[2022/12/13 15:58:59] [error] [tls] error: error:00000006:lib(0):func(0):EVP lib
[2022/12/13 15:58:59] [error] [src/flb_http_client.c:1189 errno=25] Inappropriate ioctl for device

Fluent Bit Version Info

AWS for Fluent Bit Container Image Version 2.28.4
Fluent Bit v1.9.9

Cluster Details

  • No service mesh.
  • network restricted VPC
  • Not sure if throttling from the destination is part of the problem. I don't know how to see throttling on S3.
  • EKS
  • DaemonSet deployment for Fluent Bit

Application Details

We are shipping logs from around 1,800 pods. The log activity varies dramatically from pod to pod. Each day we create around 60 million documents in OpenSearch, and the total index size is around 50 GB.

Input and output rates are in the attached Grafana screenshots.

[Grafana screenshots: input and output rates]

Steps to reproduce issue

From the fluent-bit logs I see that we successfully upload around 250,000 objects to S3 in a 24-hour period, so the failure rate is quite low, but it is still getting noticed.

I added the workers setting but it made no difference. I have also tried increasing the upload timeout from 3 to 5 minutes, and that also did not make any difference.

There does not seem to be any correlation between the failure rate and the input byte rate. The failure rate remains more or less constant across the day, between 60 and 100 failures an hour.

Related Issues

@PettitWesley
Contributor

Apologies for this issue. Currently, the S3 output has a hard-coded 5 retries for chunks, but it doesn't output any messages to tell you that it is retrying. In addition, the timestamp in the S3 key of the file in S3 is not actually the timestamp of any log record in the file; it's just the time when the S3 output created the buffer file.

How do you search for the logs? Is the Kibana search based on the timestamp in the log or the timestamp of the S3 file? If it's the S3 file, then you unfortunately need to widen your search time range.

The team and especially I apologize for these bugs. We have been working on fixing them this quarter and are hoping to get a release out very soon with the fixes:

One immediate mitigation step you can take is to enable our auto_retry_requests option which will issue an immediate retry for any network errors.
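
Explicitly, enabling it would look something like the sketch below (illustrative only; auto_retry_requests is the relevant option, and the other values are just copied from your configuration above):

[OUTPUT]
    Name                s3
    Match               kube.*
    bucket              my-bucket
    region              us-west-2
    # Immediately retry any request that fails with a network error
    auto_retry_requests true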

Additionally, we could build you a custom image with our fixes, would you be interested in that?

@PettitWesley
Contributor

@Claych FYI

@Claych
Contributor

Claych commented Dec 14, 2022

These two PRs will fix the bug with the S3 timestamp and will let you set the number of retries with retry_limit in the configuration file. The doc will be updated when these two PRs are released, and you can see the details for s3 there.
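
Once those changes are released, configuring it should look something like this (an illustrative sketch: retry_limit is the new option, the value is just an example, and the other lines are copied from the configuration earlier in this issue):

[OUTPUT]
    Name            s3
    Match           kube.*
    bucket          my-bucket
    region          us-west-2
    # Replaces the previously hard-coded 5 retries per chunk
    retry_limit     10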

@fhitchen
Author

Thank you very much for the reply.

We are indexing the log file entries not on the S3 timestamp but on the timestamp from either the application or the container runtime logger, so the timestamp is not a problem. We know logs are missing because our searches using an application timestamp sometimes come up empty, but we can find the entries when we search the container logs.

Just a bit confused by the auto_retry_requests option as the doc says it is true by default. Is this not the case?

And sure, I would be interested in a custom image; I have several non-production clusters that also show the error where I could try it out. Thanks again.

@PettitWesley
Contributor

Sorry @fhitchen for this issue you are experiencing.

Just a bit confused by the auto_retry_requests option as the doc says it is true by default. Is this not the case?

My bad, yea you do not need to make any config update to enable it.

We know logs are missing because our searches using an application timestamp sometimes come up empty, but we can find the entries when we search the container logs.

Can you elaborate on what this means? Do you mean that when you check the actual log files on disk or use kubectl logs that you can find the missing logs?

Also, we have some guidance on investigating log loss here: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#log-loss

One thing that'd be super useful is if you can provide us with details on any correlations you see in the lost logs, as the above guide notes. For example, do they all come in a series, as if one chunk/buffer file was lost, or are they very mixed over time?

@fhitchen
Author

Hi @PettitWesley,

My bad this time. What I meant about knowing we had missing log entries had nothing to do with timestamps. All our log entries have a traceId in them. We could not find the traceId in OpenSearch, but we could when we did a kubectl logs on the pod where we knew the issue probably was.

I have found a pattern in the distribution of the Could not send chunk with tag error messages over a 24-hour period. It just appears to be related to the volume of logging on the particular worker node.

    Occurrences     Fluent Bit pod (one per worker node)
    112             "pod_name" : "fluentbit-6ghkf"
    109             "pod_name" : "fluentbit-tq6cn"
    105             "pod_name" : "fluentbit-xdzwl"
    105             "pod_name" : "fluentbit-58hsd"
    101             "pod_name" : "fluentbit-cwrch"
    100             "pod_name" : "fluentbit-srx5l"
     98             "pod_name" : "fluentbit-gb57h"
     89             "pod_name" : "fluentbit-9xcmz"
     86             "pod_name" : "fluentbit-6cq4z"
     85             "pod_name" : "fluentbit-2pg97"
     79             "pod_name" : "fluentbit-x6s5s"
     75             "pod_name" : "fluentbit-vpdfs"
     70             "pod_name" : "fluentbit-5jxt2"
     60             "pod_name" : "fluentbit-m7sgh"
     50             "pod_name" : "fluentbit-dvstl"
     50             "pod_name" : "fluentbit-7vhxg"
     47             "pod_name" : "fluentbit-cvh5q"
     41             "pod_name" : "fluentbit-8wchc"
     40             "pod_name" : "fluentbit-49pz2"
     38             "pod_name" : "fluentbit-tf4jr"
     35             "pod_name" : "fluentbit-d9k8s"
     31             "pod_name" : "fluentbit-nwqq6"
     28             "pod_name" : "fluentbit-wmfbw"
     28             "pod_name" : "fluentbit-2zdsw"
     26             "pod_name" : "fluentbit-lwfjb"
     22             "pod_name" : "fluentbit-c5cmd"
     21             "pod_name" : "fluentbit-srzwp"
     21             "pod_name" : "fluentbit-mkr9j"
     20             "pod_name" : "fluentbit-jrmhd"
     19             "pod_name" : "fluentbit-wpjfq"
     18             "pod_name" : "fluentbit-2p7d2"
     16             "pod_name" : "fluentbit-znnkw"
     16             "pod_name" : "fluentbit-b4b2m"
     15             "pod_name" : "fluentbit-cxtgk"
     13             "pod_name" : "fluentbit-7fk4n"
     11             "pod_name" : "fluentbit-hwqsb"
      8             "pod_name" : "fluentbit-zgnvh"
      8             "pod_name" : "fluentbit-m8nzp"
      6             "pod_name" : "fluentbit-2gh6m"
      5             "pod_name" : "fluentbit-lbnxb"
      3             "pod_name" : "fluentbit-xzt5t"
      3             "pod_name" : "fluentbit-tbwnt"
      2             "pod_name" : "fluentbit-xh8gr"
      2             "pod_name" : "fluentbit-hwb8v"
      2             "pod_name" : "fluentbit-djsn4"
      2             "pod_name" : "fluentbit-45hc9"
      1             "pod_name" : "fluentbit-v7sdh"
      1             "pod_name" : "fluentbit-tr8l9"
      1             "pod_name" : "fluentbit-qkz4k"
      1             "pod_name" : "fluentbit-p5kfm"
      1             "pod_name" : "fluentbit-k59cl"

If I look at the input bytes/second on the nodes with the highest number of occurrences, it is much busier than on the nodes with the lowest.

Highest occurrences: [Grafana screenshot]

Lowest occurrences: [Grafana screenshot]

Here is the actual distribution over time of the error message from the worst fluent-bit pod.

[2022/12/15 12:05:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 13:14:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 13:21:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 13:31:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 13:38:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 13:44:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 14:10:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 15:30:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 15:30:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 15:57:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 16:00:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 16:29:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 16:53:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 16:56:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 17:22:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 17:36:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 18:20:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 18:29:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 18:42:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 18:51:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 19:01:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 19:05:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 19:15:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 19:25:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 19:35:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 19:47:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 20:03:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 20:39:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 21:00:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 21:14:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 21:18:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 21:40:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 21:48:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 21:50:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 22:04:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 22:14:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 22:25:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 22:29:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 22:43:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 22:53:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 22:59:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 23:04:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 23:08:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 23:19:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 23:27:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 23:46:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 23:50:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/15 23:53:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 00:14:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 00:19:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 00:43:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 00:44:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 00:45:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 01:04:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 01:10:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 01:31:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 02:04:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 02:25:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 02:48:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 02:55:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 02:56:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 03:07:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 03:13:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 03:13:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 03:18:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 03:25:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 03:27:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 03:44:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 03:44:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 03:56:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 04:31:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 04:47:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 04:54:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 05:11:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 05:24:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 05:54:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 06:32:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 06:44:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 07:04:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 07:05:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 07:29:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 07:33:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 08:04:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 08:12:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 08:23:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 08:38:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 08:40:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 08:52:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 09:04:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 09:09:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 09:23:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 09:59:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 10:11:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 10:15:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 10:24:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 10:26:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 10:33:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 10:40:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 10:41:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 10:45:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 10:47:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 10:51:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 11:19:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 11:20:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 11:35:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 11:54:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 12:19:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 12:44:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 13:29:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 13:41:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 13:47:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 14:19:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 14:21:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 14:54:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 14:56:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 15:11:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 15:13:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 15:25:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 15:27:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 15:33:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 15:35:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 15:54:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 15:55:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 16:02:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 16:04:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 16:05:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 16:09:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 16:19:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 16:37:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 16:44:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 17:17:29] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 17:36:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 17:48:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 17:54:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 18:00:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 18:31:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 19:09:59] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 19:21:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 19:21:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 19:30:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 19:36:39] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 19:50:49] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 20:03:19] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 20:14:09] [error] [output:s3:s3.1] Could not send chunk with tag
[2022/12/16 20:20:49] [error] [output:s3:s3.1] Could not send chunk with tag

From what I can see, it looks as though it happens with no noticeable pattern, just 3-8 times an hour or so. One strange thing: why does the seconds value in the error timestamp always end in 9? I checked another fluent-bit pod, and there it always ended in 5. Must be some internal flush interval for logs?

@PettitWesley
Contributor

@fhitchen

One strange thing: why does the seconds value in the error timestamp always end in 9? I checked another fluent-bit pod, and there it always ended in 5. Must be some internal flush interval for logs?

So the S3 plugin is unfortunately very complicated, much more so than other plugins. In most outputs, the Fluent Bit engine sends chunks of records to the plugin via its "flush callback function", which sends the records before that callback function returns.

S3 is more complicated because it has to buffer data locally so that it can produce large files in S3. A Fluent Bit chunk is targeted to be around 2 megabytes, which is too small for most S3 use cases. So when the S3 "flush callback function" gets a chunk of logs, most of the time it just buffers it to the file system in order to build up a larger file.

S3 also has a "timer callback function" which runs at a regular interval and checks whether the pending buffer files are ready to send. The regular intervals at which you see these errors possibly indicate that the failures are all happening in the "timer callback function".
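
Roughly speaking, these are the options from the configuration earlier in this issue that drive that timer-based decision (annotated as an illustrative sketch, not an exact description of the internals):

[OUTPUT]
    Name            s3
    Match           kube.*
    # Send a pending buffer file once it grows to this size...
    total_file_size 5M
    # ...or once it has been pending this long, whichever happens first
    upload_timeout  5m
    # Pending buffer files are kept here between timer callback runs
    store_dir       /tmp/fluent-bit/s3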

Fluent Bit is a concurrent log routing tool: https://github.com/fluent/fluent-bit/blob/master/DEVELOPER_GUIDE.md#concurrency

Recently, @matthewfala pointed out to me that S3 has a potential concurrency bug that will affect the "timer callback function" execution. I am working on fixing this.

The pending fix is built into these images:

144718711470.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.29.0-init-s3-sync-io

144718711470.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.29.0-s3-sync-io

One is for init, one is the normal image without init. I think you are currently not using our init tag, so use the second one. https://github.com/aws/aws-for-fluent-bit#using-the-init-tag

Code Source:

@fhitchen There is also one more S3 issue I need to mention. Apologies for how many S3 issues there are right now. Currently, the retry and error metrics for S3 are broken. I have opened this issue for the design of the fix: fluent/fluent-bit#6141

@fhitchen
Author

Hi @PettitWesley

I switched to the 144718711470.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.29.0-s3-sync-io image and immediately noticed that the number of errors increased.


So instead of going back to the 2.23.4 version I went to the 2.29.0 image to see if there was a difference, and forgot I had made the change. When I looked 24 hours later I could see that with this fluent-bit version the number of errors had gone from about 180 a day to 3,500. I am going back to 2.23.4 and will let you know how it goes.

@PettitWesley
Contributor

@fhitchen Can you confirm you mean 2.23.4? That version is quite old: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.23.4

It is on Fluent Bit 1.8.15 and came out in April. Many bugs have been fixed since then.

@fhitchen
Author

I can confirm that I switched back to 2.23.4. I am already seeing a big drop in the error count, down from about 50 per 30 minutes to 8 in the same period. The original issue I reported was from our production cluster, which is on 2.28.4. I will make them the same.

@srikanth-burra

@PettitWesley what is the expected behaviour of the plugin if the hard retry count is exhausted? I am seeing the following log: Chunk file failed to send 5 times, will not retry. Does this mean data loss?
