
out_s3: use retry_limit in fluent-bit to replace MAX_UPLOAD_ERROR … #6475

Merged: 1 commit (Feb 21, 2023)

Conversation

@Claych (Contributor) commented Nov 28, 2022

Signed-off-by: Clay Cheng [email protected]


Enter [N/A] in the box if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change (see also the Retry_Limit note at the end of this Testing section)
    [OUTPUT]
    Name s3
    Match *
    bucket clay-bucket-5-s3-test
    region us-east-1
    total_file_size 60M
    auto_retry_requests true
    use_put_object off
    upload_chunk_size 5M
  • Debug log output from testing the change

    (screenshot attached: Screen Shot 2022-10-13 at 1 48 19 PM)

  • Attached Valgrind output that shows no leaks or memory corruption was found
    Test result when connected to S3: (Valgrind output screenshot attached)

Test results when disconnected from S3: (Valgrind output screenshot attached)

If this is a change to the packaging of containers or native binaries, please confirm it works for all targets.
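
Since this change keys off the engine's per-output retry setting rather than the plugin's old hard-coded upload error limit, a configuration exercising the new path would typically also set the generic Retry_Limit output property explicitly. The snippet below is illustrative only; the Retry_Limit value is an assumption and was not part of the configuration actually tested above:

    [OUTPUT]
    Name s3
    Match *
    bucket clay-bucket-5-s3-test
    region us-east-1
    total_file_size 60M
    auto_retry_requests true
    use_put_object off
    upload_chunk_size 5M
    Retry_Limit 3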

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0; by submitting this pull request, I understand that this code will be released under the terms of that license.

@Claych requested a review from PettitWesley as a code owner on November 28, 2022
@Claych changed the title from "out_s3: use retry_limit in fluent-bit to replace MAXMAX_UPLOAD_ERROR …" to "out_s3: use retry_limit in fluent-bit to replace MAX_UPLOAD_ERROR …" on Nov 28, 2022
PettitWesley previously approved these changes on Nov 28, 2022
@Claych force-pushed the clay-retry-limit-1.9 branch 2 times, most recently from 59a69a9 to e7898af on November 28, 2022
PettitWesley previously approved these changes on Nov 29, 2022
@Claych force-pushed the clay-retry-limit-1.9 branch from e7898af to 97ed2e9 on December 1, 2022
@Claych force-pushed the clay-retry-limit-1.9 branch 2 times, most recently from ce0f890 to 6c38848 on December 1, 2022
Comment on lines 134 to 137
"failed to flush chunk tag=%s, create_time=%s"
"(out_id=%d)",
tag, create_time_str, ctx->ins->id);
Contributor commented:

So I know that in the original design I said we want to match the format of the normal retry messages as much as possible... and we didn't want to add the "retry in X seconds" part since we don't have a way of calculating the retry time... but note that I still had:

[ warn] [engine] failed to flush chunk tag=xxxxx, create_time=2022-08-18T21:34:42+0000, retry issued: input=forward.1 > output=s3.0 (out_id=0)

I think the "retry issued" part is important to let the user know clearly that we will retry.
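
For context, here is a rough sketch of what an s3_retry_warn() helper along these lines could look like. The signature is only inferred from the call sites visible later in this PR (ctx, tag, input_name, create_time, less_than_limit), and the time formatting is illustrative; this is not the code that was merged.

/*
 * Hypothetical sketch of the s3_retry_warn() helper discussed above.
 * Assumes the out_s3 plugin's existing headers (struct flb_s3, FLB_TRUE,
 * flb_plg_warn) plus <time.h>.
 */
static void s3_retry_warn(struct flb_s3 *ctx, const char *tag,
                          char *input_name, time_t create_time,
                          int less_than_limit)
{
    struct tm tm_buf;
    char create_time_str[24];

    /* Render the chunk creation time in the same style as the engine's
     * normal retry messages, e.g. 2022-08-18T21:34:42 */
    gmtime_r(&create_time, &tm_buf);
    strftime(create_time_str, sizeof(create_time_str),
             "%Y-%m-%dT%H:%M:%S", &tm_buf);

    if (less_than_limit == FLB_TRUE) {
        /* Still under retry_limit: tell the user a retry was issued */
        flb_plg_warn(ctx->ins,
                     "failed to flush chunk tag=%s, create_time=%s, "
                     "retry issued: input=%s > output=s3.%d (out_id=%d)",
                     tag, create_time_str, input_name,
                     ctx->ins->id, ctx->ins->id);
    }
    else {
        /* retry_limit exceeded: the chunk will be dropped */
        flb_plg_warn(ctx->ins,
                     "chunk tag=%s, create_time=%s cannot be retried: "
                     "retry_limit was reached (out_id=%d)",
                     tag, create_time_str, ctx->ins->id);
    }
}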

return -1;
}

/* data was sent successfully- delete the local buffer */
s3_store_file_delete(ctx, chunk);
s3_store_file_delete(ctx, chunk);
Contributor commented:

remove stray tab

Comment on lines 1112 to 1130
if (chunk->failures > ctx->ins->retry_limit){
less_than_limit = FLB_FALSE;
}
s3_retry_warn(ctx, tag, chunk->input_name, create_time, less_than_limit);
if (less_than_limit == FLB_FALSE) {
s3_store_file_unlock(chunk);
return FLB_RETRY;
}
else {
s3_store_file_delete(ctx, chunk);
return FLB_ERROR;
}
}
Contributor commented:

Are you sure this logic is correct?

So if chunk failures is greater than retry limit:

  • You retry

If failures is less than retry limit

  • You delete the file.

Contributor commented:

Oh good catch... I didn't realize this logic is inverted... @Claych please fix
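
For clarity, the corrected branching the reviewers are asking for would look roughly like this. This is only a sketch built from the identifiers in the hunk above; the actual fix landed in a later push:

if (chunk->failures > ctx->ins->retry_limit) {
    less_than_limit = FLB_FALSE;
}
s3_retry_warn(ctx, tag, chunk->input_name, create_time, less_than_limit);

if (less_than_limit == FLB_TRUE) {
    /* Still under the configured retry_limit: keep the buffered chunk
     * and ask the engine to retry it */
    s3_store_file_unlock(chunk);
    return FLB_RETRY;
}
else {
    /* retry_limit exceeded: drop the buffered chunk and give up */
    s3_store_file_delete(ctx, chunk);
    return FLB_ERROR;
}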

@Claych force-pushed the clay-retry-limit-1.9 branch from 6c38848 to 2e03190 on December 1, 2022
if (less_than_limit == FLB_TRUE) {
flb_plg_warn(ctx->ins,
"failed to flush chunk tag=%s, create_time=%s, "
"retry issues: (out_id=%d)",
Contributor commented:

retry issued

@Claych force-pushed the clay-retry-limit-1.9 branch from 2e03190 to 9f27bd3 on December 1, 2022
Comment on lines 1569 to 1572
if (tmp_upload->upload_errors > ctx->ins->retry_limit) {
tmp_upload->upload_state = MULTIPART_UPLOAD_STATE_COMPLETE_IN_PROGRESS;
flb_plg_error(ctx->ins, "Upload for %s has reached max upload errors",
tmp_upload->s3_key);
s3_retry_warn(ctx, tmp_upload->tag, tmp_upload->input_name,
tmp_upload->init_time, FLB_FALSE);
Contributor commented:

I am wondering if we actually need a message here anymore... and if we should actually be deleting the upload here...

@PettitWesley (Contributor) commented:

@Claych Here are my thoughts on each type of failure and how we should handle them.

  1. Chunk failures => what we have in this PR is good to go.
  2. Multipart upload complete_errors => Currently we only check this in the cb_s3_upload function, which I think is fine. I think we should use the user-configured retry_limit here, but the message should not be the new s3_retry_warn. This is a different type of failure, and the user should be told that clearly by a different message. Currently we have "Upload for %s has reached max completion errors, plugin will give up", which I think is good. We could improve it by making it clear this is a "Multipart Upload" and also giving the S3 API name, so I would change it to "Multipart Upload for %s has reached max s3:CompleteMultipartUpload errors, plugin will give up". I also noticed that after this message we have a potential memory leak, since we remove the upload from the list but do not call multipart_upload_destroy. Please fix that bug, thanks :)
  3. For multipart upload upload_errors: currently I see that in the get_upload function, if the upload has reached the retry_limit, we do not try to upload more data to it and mark it for completion instead. I think this works... basically we already have data uploaded for this file, but if for some reason we can't upload any more, we just try to complete the upload so the already-uploaded data is not lost. So let's keep that, and let's make the message clear to the user about what we are doing. Let's not use s3_retry_warn here either; let's give an explicit message like "Multipart Upload for %s has reached max s3:UploadPart errors, plugin will try to complete this upload to prevent loss of already uploaded data". There is also a special case we need to handle here. If you check the struct multipart_upload, it has a parts_uploaded field; we only want to mark the upload for completion if it has at least one part uploaded. Otherwise, the upload has no pending data, and we can just remove it from the list and free it without notifying the user IMO. The code will automatically create a new upload and try again. Finally, related to this, please see the call to create_multipart_upload in the code; if this fails, we need to increment the multipart upload's upload_errors integer. (A rough sketch of the changes for points 2 and 3 follows after this list.)
  4. Upload queue retry_counter: since the upload queue just tracks a buffer chunk file and uses the same upload_data function to upload data, I think we do not need a message here at all. The user will already see s3_retry_warn messages from the change you made to upload_data, so a message here would just duplicate them. So you can simply remove the upload queue entry and free it on failure, without giving the user any message. Basically, the struct upload_queue is just a wrapper around a struct s3_file chunk, so it doesn't actually need to track its own failures. You could therefore simplify and improve the code by removing the struct upload_queue retry_counter and changing the calling code to use the s3_file failures integer instead. I think this change can be optional.
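
A rough sketch of the cleanup described in points 2 and 3 above. The variable name m_upload is illustrative, and the _head list field, the complete_errors/upload_errors/parts_uploaded fields, and the mk_list_del()/multipart_upload_destroy() helpers are assumed from the diff hunks and the comment above, not taken from the final commit:

/* Point 2: when s3:CompleteMultipartUpload has failed too many times,
 * give up, unlink the upload from the list and free it so the entry
 * is not leaked. */
if (m_upload->complete_errors > ctx->ins->retry_limit) {
    flb_plg_error(ctx->ins,
                  "Multipart Upload for %s has reached max "
                  "s3:CompleteMultipartUpload errors, plugin will give up",
                  m_upload->s3_key);
    mk_list_del(&m_upload->_head);
    multipart_upload_destroy(m_upload);
}

/* Point 3: only mark an upload for completion if it already has at least
 * one uploaded part; otherwise discard it silently and let the plugin
 * create a fresh upload on the next flush. */
if (tmp_upload->upload_errors > ctx->ins->retry_limit) {
    if (tmp_upload->parts_uploaded > 0) {
        tmp_upload->upload_state = MULTIPART_UPLOAD_STATE_COMPLETE_IN_PROGRESS;
        flb_plg_error(ctx->ins,
                      "Multipart Upload for %s has reached max s3:UploadPart "
                      "errors, plugin will try to complete this upload to "
                      "prevent loss of already uploaded data",
                      tmp_upload->s3_key);
    }
    else {
        mk_list_del(&tmp_upload->_head);
        multipart_upload_destroy(tmp_upload);
    }
}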

@Claych force-pushed the clay-retry-limit-1.9 branch from 96587ee to 76ae393 on February 3, 2023
@Claych force-pushed the clay-retry-limit-1.9 branch from 76ae393 to dce85f4 on February 20, 2023
@Claych force-pushed the clay-retry-limit-1.9 branch from 967cbcc to 7c0c025 on February 20, 2023
@Claych force-pushed the clay-retry-limit-1.9 branch from cbc95c5 to 7c0c025 on February 20, 2023
… update s3 warn output messages with function s3_retry_warn()

Signed-off-by: Clay Cheng <[email protected]>