
out_s3: use retry_limit in fluent-bit to replace MAX_UPLOAD_ERROR … #6475

Merged: 1 commit (Feb 21, 2023)

Conversation

@Claych (Contributor) commented Nov 28, 2022

Signed-off-by: Clay Cheng [email protected]


Enter [N/A] in the box if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change (see also the Retry_Limit note at the end of this Testing section)
    [OUTPUT]
    Name s3
    Match *
    bucket clay-bucket-5-s3-test
    region us-east-1
    total_file_size 60M
    auto_retry_requests true
    use_put_object off
    upload_chunk_size 5M
  • Debug log output from testing the change

    (screenshot attached: Screen Shot 2022-10-13 at 1 48 19 PM)

  • Attached Valgrind output that shows no leaks or memory corruption was found
    Test result when connected to S3: (Valgrind output screenshot attached)

Test results when disconnected from S3: (Valgrind output screenshot attached)

If this is a change to the packaging of containers or native binaries, please confirm it works for all targets.
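
Since this change keys off the engine's per-output retry setting rather than the plugin's old hard-coded upload error limit, a configuration exercising the new path would typically also set the generic Retry_Limit output property explicitly. The snippet below is illustrative only; the Retry_Limit value is an assumption and was not part of the configuration actually tested above:

    [OUTPUT]
    Name s3
    Match *
    bucket clay-bucket-5-s3-test
    region us-east-1
    total_file_size 60M
    auto_retry_requests true
    use_put_object off
    upload_chunk_size 5M
    Retry_Limit 3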

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0; by submitting this pull request, I understand that this code will be released under the terms of that license.

@Claych requested a review from PettitWesley as a code owner on November 28, 2022
@Claych changed the title from "out_s3: use retry_limit in fluent-bit to replace MAXMAX_UPLOAD_ERROR …" to "out_s3: use retry_limit in fluent-bit to replace MAX_UPLOAD_ERROR …" on Nov 28, 2022
PettitWesley previously approved these changes on Nov 28, 2022
@Claych force-pushed the clay-retry-limit-1.9 branch 2 times, most recently from 59a69a9 to e7898af on November 28, 2022
PettitWesley previously approved these changes on Nov 29, 2022
@Claych force-pushed the clay-retry-limit-1.9 branch from e7898af to 97ed2e9 on December 1, 2022
@Claych force-pushed the clay-retry-limit-1.9 branch 2 times, most recently from ce0f890 to 6c38848 on December 1, 2022
Comment on lines 134 to 137
"failed to flush chunk tag=%s, create_time=%s"
"(out_id=%d)",
tag, create_time_str, ctx->ins->id);
Contributor commented:

So I know that in the original design I said we want to match the format of the normal retry messages as much as possible... and we didn't want to add the "retry in X seconds" part since we don't have a way of calculating the retry time... but note that I still had:

[ warn] [engine] failed to flush chunk tag=xxxxx, create_time=2022-08-18T21:34:42+0000, retry issued: input=forward.1 > output=s3.0 (out_id=0)

I think the "retry issued" part is important to let the user know clearly that we will retry.
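
For context, here is a rough sketch of what an s3_retry_warn() helper along these lines could look like. The signature is only inferred from the call sites visible later in this PR (ctx, tag, input_name, create_time, less_than_limit), and the time formatting is illustrative; this is not the code that was merged.

/*
 * Hypothetical sketch of the s3_retry_warn() helper discussed above.
 * Assumes the out_s3 plugin's existing headers (struct flb_s3, FLB_TRUE,
 * flb_plg_warn) plus <time.h>.
 */
static void s3_retry_warn(struct flb_s3 *ctx, const char *tag,
                          char *input_name, time_t create_time,
                          int less_than_limit)
{
    struct tm tm_buf;
    char create_time_str[24];

    /* Render the chunk creation time in the same style as the engine's
     * normal retry messages, e.g. 2022-08-18T21:34:42 */
    gmtime_r(&create_time, &tm_buf);
    strftime(create_time_str, sizeof(create_time_str),
             "%Y-%m-%dT%H:%M:%S", &tm_buf);

    if (less_than_limit == FLB_TRUE) {
        /* Still under retry_limit: tell the user a retry was issued */
        flb_plg_warn(ctx->ins,
                     "failed to flush chunk tag=%s, create_time=%s, "
                     "retry issued: input=%s > output=s3.%d (out_id=%d)",
                     tag, create_time_str, input_name,
                     ctx->ins->id, ctx->ins->id);
    }
    else {
        /* retry_limit exceeded: the chunk will be dropped */
        flb_plg_warn(ctx->ins,
                     "chunk tag=%s, create_time=%s cannot be retried: "
                     "retry_limit was reached (out_id=%d)",
                     tag, create_time_str, ctx->ins->id);
    }
}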

return -1;
}

/* data was sent successfully- delete the local buffer */
s3_store_file_delete(ctx, chunk);
s3_store_file_delete(ctx, chunk);
Contributor commented:

remove stray tab

Comment on lines 1112 to 1130
if (chunk->failures > ctx->ins->retry_limit){
less_than_limit = FLB_FALSE;
}
s3_retry_warn(ctx, tag, chunk->input_name, create_time, less_than_limit);
if (less_than_limit == FLB_FALSE) {
s3_store_file_unlock(chunk);
return FLB_RETRY;
}
else {
s3_store_file_delete(ctx, chunk);
return FLB_ERROR;
}
}
Contributor commented:

Are you sure this logic is correct?

So if chunk failures is greater than retry limit:

  • You retry

If failures is less than retry limit

  • You delete the file.

Contributor commented:

Oh good catch... I didn't realize this logic is inverted... @Claych please fix
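
For clarity, the corrected branching the reviewers are asking for would look roughly like this. This is only a sketch built from the identifiers in the hunk above; the actual fix landed in a later push:

if (chunk->failures > ctx->ins->retry_limit) {
    less_than_limit = FLB_FALSE;
}
s3_retry_warn(ctx, tag, chunk->input_name, create_time, less_than_limit);

if (less_than_limit == FLB_TRUE) {
    /* Still under the configured retry_limit: keep the buffered chunk
     * and ask the engine to retry it */
    s3_store_file_unlock(chunk);
    return FLB_RETRY;
}
else {
    /* retry_limit exceeded: drop the buffered chunk and give up */
    s3_store_file_delete(ctx, chunk);
    return FLB_ERROR;
}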

@Claych force-pushed the clay-retry-limit-1.9 branch from 6c38848 to 2e03190 on December 1, 2022
if (less_than_limit == FLB_TRUE) {
flb_plg_warn(ctx->ins,
"failed to flush chunk tag=%s, create_time=%s, "
"retry issues: (out_id=%d)",
Contributor commented:

retry issued

@Claych force-pushed the clay-retry-limit-1.9 branch from 2e03190 to 9f27bd3 on December 1, 2022
Comment on lines 1569 to 1572
if (tmp_upload->upload_errors > ctx->ins->retry_limit) {
tmp_upload->upload_state = MULTIPART_UPLOAD_STATE_COMPLETE_IN_PROGRESS;
flb_plg_error(ctx->ins, "Upload for %s has reached max upload errors",
tmp_upload->s3_key);
s3_retry_warn(ctx, tmp_upload->tag, tmp_upload->input_name,
tmp_upload->init_time, FLB_FALSE);
Contributor commented:

I am wondering if we actually need a message here anymore... and if we should actually be deleting the upload here...

@PettitWesley (Contributor) commented:

@Claych Here are my thoughts on each type of failure and how we should handle them.

  1. Chunk failures => what we have in this PR is good to go.
  2. Multipart upload complete_errors => Currently we only check this in the cb_s3_upload function, which I think is fine. I think we should use the user-configured retry_limit here, but the message should not be the new s3_retry_warn. This is a different type of failure, and the user should be told that clearly by a different message. Currently we have "Upload for %s has reached max completion errors, plugin will give up", which I think is good. We could improve it by making it clear this is a "Multipart Upload" and also giving the S3 API name, so I would change it to "Multipart Upload for %s has reached max s3:CompleteMultipartUpload errors, plugin will give up". I also noticed that after this message we have a potential memory leak, since we remove the upload from the list but do not call multipart_upload_destroy. Please fix that bug, thanks :)
  3. For multipart upload upload_errors: currently I see that in the get_upload function, if the upload has reached the retry_limit, we do not try to upload more data to it and mark it for completion instead. I think this works... basically we already have data uploaded for this file, but if for some reason we can't upload any more, we just try to complete the upload so the already-uploaded data is not lost. So let's keep that, and let's make the message clear to the user about what we are doing. Let's not use s3_retry_warn here either; let's give an explicit message like "Multipart Upload for %s has reached max s3:UploadPart errors, plugin will try to complete this upload to prevent loss of already uploaded data". There is also a special case we need to handle here. If you check the struct multipart_upload, it has a parts_uploaded field; we only want to mark the upload for completion if it has at least one part uploaded. Otherwise, the upload has no pending data, and we can just remove it from the list and free it without notifying the user IMO. The code will automatically create a new upload and try again. Finally, related to this, please see the call to create_multipart_upload in the code; if this fails, we need to increment the multipart upload's upload_errors integer. (A rough sketch of the changes for points 2 and 3 follows after this list.)
  4. Upload queue retry_counter: since the upload queue just tracks a buffer chunk file and uses the same upload_data function to upload data, I think we do not need a message here at all. The user will already see s3_retry_warn messages from the change you made to upload_data, so a message here would just duplicate them. So you can simply remove the upload queue entry and free it on failure, without giving the user any message. Basically, the struct upload_queue is just a wrapper around a struct s3_file chunk, so it doesn't actually need to track its own failures. You could therefore simplify and improve the code by removing the struct upload_queue retry_counter and changing the calling code to use the s3_file failures integer instead. I think this change can be optional.
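
A rough sketch of the cleanup described in points 2 and 3 above. The variable name m_upload is illustrative, and the _head list field, the complete_errors/upload_errors/parts_uploaded fields, and the mk_list_del()/multipart_upload_destroy() helpers are assumed from the diff hunks and the comment above, not taken from the final commit:

/* Point 2: when s3:CompleteMultipartUpload has failed too many times,
 * give up, unlink the upload from the list and free it so the entry
 * is not leaked. */
if (m_upload->complete_errors > ctx->ins->retry_limit) {
    flb_plg_error(ctx->ins,
                  "Multipart Upload for %s has reached max "
                  "s3:CompleteMultipartUpload errors, plugin will give up",
                  m_upload->s3_key);
    mk_list_del(&m_upload->_head);
    multipart_upload_destroy(m_upload);
}

/* Point 3: only mark an upload for completion if it already has at least
 * one uploaded part; otherwise discard it silently and let the plugin
 * create a fresh upload on the next flush. */
if (tmp_upload->upload_errors > ctx->ins->retry_limit) {
    if (tmp_upload->parts_uploaded > 0) {
        tmp_upload->upload_state = MULTIPART_UPLOAD_STATE_COMPLETE_IN_PROGRESS;
        flb_plg_error(ctx->ins,
                      "Multipart Upload for %s has reached max s3:UploadPart "
                      "errors, plugin will try to complete this upload to "
                      "prevent loss of already uploaded data",
                      tmp_upload->s3_key);
    }
    else {
        mk_list_del(&tmp_upload->_head);
        multipart_upload_destroy(tmp_upload);
    }
}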

@Claych force-pushed the clay-retry-limit-1.9 branch from 96587ee to 76ae393 on February 3, 2023
@Claych force-pushed the clay-retry-limit-1.9 branch from 76ae393 to dce85f4 on February 20, 2023
@Claych force-pushed the clay-retry-limit-1.9 branch from 967cbcc to 7c0c025 on February 20, 2023
@Claych force-pushed the clay-retry-limit-1.9 branch from cbc95c5 to 7c0c025 on February 20, 2023
… update s3 warn output messages with function s3_retry_warn()

Signed-off-by: Clay Cheng <[email protected]>