Add asynchronous ACK handling to S3 and SQS inputs #40699

Merged: faec merged 41 commits into elastic:main from awss3-ack-handling on Oct 16, 2024

Conversation

@faec (Contributor) commented Sep 5, 2024

Modify SQS ingestion to listen for ACKs asynchronously so that input workers can keep reading new objects after a previous one has been published, instead of blocking on full upstream ingestion. This addresses the bottleneck where ingesting many small objects is slow as each one waits for a full ingestion round trip.
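For illustration, here is a minimal, standalone Go sketch of the asynchronous ACK pattern described above. It is not the actual awss3 input code: the message, ackTracker, and publish names are invented for this example, and the real input ties acknowledgements to SQS message deletion through the Beats publishing pipeline.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// message stands in for one SQS notification; in the real input, "deleting"
// the message would be the SQS DeleteMessage call that finalizes it.
type message struct {
	id     string
	events []string
}

// ackTracker counts outstanding events for one message and runs done()
// once every published event has been acknowledged upstream.
type ackTracker struct {
	pending int64
	done    func()
}

func (t *ackTracker) ack() {
	if atomic.AddInt64(&t.pending, -1) == 0 {
		t.done() // all events acked: safe to delete the SQS message
	}
}

// publish hands an event to the pipeline; the "output" acknowledges on a
// separate goroutine to mimic asynchronous upstream ingestion.
func publish(event string, onACK func(), wg *sync.WaitGroup) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		_ = event // pretend the event was shipped to the output
		onACK()
	}()
}

func main() {
	var wg sync.WaitGroup
	msgs := []message{
		{id: "sqs-msg-1", events: []string{"a", "b"}},
		{id: "sqs-msg-2", events: []string{"c"}},
	}
	for _, m := range msgs {
		m := m
		t := &ackTracker{
			pending: int64(len(m.events)),
			done:    func() { fmt.Println("deleting", m.id) },
		}
		// The worker publishes its events and immediately moves on to the
		// next message instead of waiting for the acknowledgements.
		for _, e := range m.events {
			publish(e, t.ack, &wg)
		}
	}
	wg.Wait()
}
```

The key point is that the worker loop never blocks on acknowledgements; each message's callback fires only after its last event is acked, which is the moment the SQS message can safely be deleted.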

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

This is best tested by ingesting data from a live S3 bucket or SQS queue. The scenario that most highlights the changed performance is ingesting many small individual objects.
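If it helps reproduce the small-object scenario, here is a hypothetical seeding script (not part of this PR) that uploads many tiny objects to a test bucket using aws-sdk-go-v2; the bucket name and object count are placeholders, and credentials are assumed to come from the environment.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)
	bucket := "my-test-bucket" // placeholder

	// Upload many tiny single-line objects so that, on the old blocking path,
	// each object's full ingestion round trip dominates the worker's time.
	for i := 0; i < 1000; i++ {
		key := fmt.Sprintf("small/object-%04d.log", i)
		body := fmt.Sprintf(`{"message":"test event %d"}`+"\n", i)
		_, err := client.PutObject(ctx, &s3.PutObjectInput{
			Bucket: aws.String(bucket),
			Key:    aws.String(key),
			Body:   strings.NewReader(body),
		})
		if err != nil {
			log.Fatal(err)
		}
	}
}
```

Pointing a bucket-notification SQS queue at these uploads and running the aws-s3 input against that queue should make the before/after throughput difference easy to see.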

Related issues

@faec faec added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team backport-8.15 Automated backport to the 8.15 branch with mergify labels Sep 5, 2024
@faec faec self-assigned this Sep 5, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Sep 5, 2024
@jlind23 (Collaborator) commented Sep 10, 2024

@faec is this ready to be reviewed by O11y team?

@faec (Contributor, Author) commented Sep 24, 2024

> @faec is this ready to be reviewed by O11y team?

I think we've talked directly since this ping, but for visibility: I don't consider this ready for review until I can try it on a live SQS queue, which has been deferred for a few reasons (previously SDH rotation and illness, currently the OTel remote-offsite). I expect that running on a live queue will immediately fail in some obvious fixable ways and I want to get those out of the way before proper review.

You also asked me to integrate #39709 (which was never merged to main) with this PR before finalizing, which hasn't been started and will probably take a day or two beyond the basic smoke check.

mergify bot (Contributor) commented Sep 24, 2024

This pull request now has conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See the documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b awss3-ack-handling upstream/awss3-ack-handling
git merge upstream/main
git push upstream awss3-ack-handling

mergify bot (Contributor) commented Sep 24, 2024

The backport-8.x label has been added to help with the transition to the new 8.x branch.
If you don't need it, please add the backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Sep 24, 2024
@pierrehilbert (Collaborator) commented:

Let's separate the work here; I added issues to your next sprint to take care of #39718.

@jlind23 (Collaborator) commented Sep 24, 2024

Thanks @faec for the reply, which closes some of the knowledge gaps on my end.
I'll let you focus on the OTel-related workshop and will come back in a few days.
The target remains unchanged, and we should try to land all of those changes before the 8.16 feature freeze.

@faec faec requested a review from a team as a code owner October 14, 2024 23:08
@@ -46,6 +46,7 @@ https://github.com/elastic/beats/compare/v8.8.1\...main[Check the HEAD diff]
- Added `container.image.name` to `journald` Filebeat input's Docker-specific translated fields. {pull}40450[40450]
- Change log.file.path field in awscloudwatch input to nested object. {pull}41099[41099]
- Remove deprecated awscloudwatch field from Filebeat. {pull}41089[41089]
- `max_number_of_messages` config for S3 input's SQS mode is now ignored. Instead use `number_of_workers` to scale ingestion rate in both S3 and SQS modes. {pull}40699[40699]
Reviewer (Member):

What impact will this have for users that were relying on max_number_of_messages to try to tune the input?

Is it just going to perform better with no user intervention? Could there be unintended side effects, like agents running out of memory because they can now fill up their queues?

Under what circumstances would number_of_workers be something a user has to adjust?

@faec (Contributor, Author):

> What impact will this have for users that were relying on max_number_of_messages to try to tune the input?

Users who set the field unreasonably high to improve throughput (previously the only way to work around this bottleneck) will instead get good performance by default at no added cost. The only scenario this change makes "worse" is when the input objects are very small (the worst-case bottleneck) and max_number_of_messages was intentionally tuned low so that ingestion would be extremely slow. In that case, ingestion speed will increase significantly (which costs a fair amount more network bandwidth and marginally more CPU). [Though "I used this special config value to intentionally throttle my ingestion to one event per second" is very much in the "holding the escape key down to warm up the room" category imo -- total ingestion cost should not go up, the initial queue should just empty faster.]

> agents running out of memory because they can now fill up their queues?

It isn't impossible for this to happen -- if users configured a queue larger than they actually used, and were relying on the fact that this input never has more than 5-10 active events, then the new version will use more memory. However, even a full queue is usually less than 10% of Filebeat's memory footprint, so this wouldn't be a significant factor unless they had explicitly configured a very large queue that they never use.

> Under what circumstances would number_of_workers be something a user has to adjust?

If they have a cap on network resources or, to a lesser extent, CPU ("number of workers" is also the number of potentially parallel AWS API calls/downloads, and the number of downloaded JSON blobs parsed in parallel). Or in the other direction, I suppose, if they have a lot of cores and bandwidth and really want the SQS queue to move fast.
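As a rough illustration of that answer (not the input's actual implementation), the sketch below caps concurrent object processing with a fixed pool of goroutines, the same way number_of_workers bounds how many downloads and parses run in parallel; all names here are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// numberOfWorkers plays the role of the input's number_of_workers setting:
// it is the maximum number of objects downloaded and parsed at once.
const numberOfWorkers = 4

func main() {
	objects := make(chan string)
	var wg sync.WaitGroup

	// One goroutine per worker: each pulls the next object as soon as it is
	// free, so raising numberOfWorkers raises parallel network and CPU use.
	for i := 0; i < numberOfWorkers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for obj := range objects {
				time.Sleep(10 * time.Millisecond) // stand-in for download + parse
				fmt.Printf("worker %d processed %s\n", id, obj)
			}
		}(i)
	}

	for j := 0; j < 20; j++ {
		objects <- fmt.Sprintf("object-%d", j)
	}
	close(objects)
	wg.Wait()
}
```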

Reviewer (Member):

Thanks, let's add a little bit more context around this.

I think this changelog entry is a bit too modest; it should specifically mention the performance improvement and that tuning max_number_of_messages should no longer be necessary.

You should also mention the small-object use case and the potential increase in memory usage.

The changelogs here are usually terse, but that isn't actually a requirement. You could add an entire sub-section if you wanted to.

@faec (Contributor, Author):

Done

Reviewer (Member):

Looks great, thanks!

@faec faec enabled auto-merge (squash) October 15, 2024 23:44
@faec faec merged commit d2867fd into elastic:main Oct 16, 2024
141 of 143 checks passed
mergify bot pushed a commit that referenced this pull request Oct 16, 2024
Modify SQS ingestion to listen for ACKs asynchronously so that input workers can keep reading new objects after a previous one has been published, instead of blocking on full upstream ingestion. This addresses the bottleneck where ingesting many small objects is slow as each one waits for a full ingestion round trip. With a default configuration, SQS queues with many small objects are now ingested up to 60x faster.

(cherry picked from commit d2867fd)

# Conflicts:
#	go.sum
#	x-pack/filebeat/input/awss3/input_benchmark_test.go
#	x-pack/filebeat/input/awss3/s3_objects.go
#	x-pack/filebeat/input/awss3/sqs_s3_event_test.go
mergify bot pushed a commit that referenced this pull request Oct 16, 2024
Modify SQS ingestion to listen for ACKs asynchronously so that input workers can keep reading new objects after a previous one has been published, instead of blocking on full upstream ingestion. This addresses the bottleneck where ingesting many small objects is slow as each one waits for a full ingestion round trip. With a default configuration, SQS queues with many small objects are now ingested up to 60x faster.

(cherry picked from commit d2867fd)

# Conflicts:
#	x-pack/filebeat/input/awss3/input_benchmark_test.go
#	x-pack/filebeat/input/awss3/sqs_s3_event_test.go
faec added a commit that referenced this pull request Oct 16, 2024
…puts (#41249)

* Add asynchronous ACK handling to S3 and SQS inputs (#40699)

Modify SQS ingestion to listen for ACKs asynchronously so that input workers can keep reading new objects after a previous one has been published, instead of blocking on full upstream ingestion. This addresses the bottleneck where ingesting many small objects is slow as each one waits for a full ingestion round trip. With a default configuration, SQS queues with many small objects are now ingested up to 60x faster.

(cherry picked from commit d2867fd)

# Conflicts:
#	x-pack/filebeat/input/awss3/input_benchmark_test.go
#	x-pack/filebeat/input/awss3/sqs_s3_event_test.go

* fix broken merge

---------

Co-authored-by: Fae Charlton <[email protected]>
@faec faec deleted the awss3-ack-handling branch October 16, 2024 13:45
belimawr pushed a commit to belimawr/beats that referenced this pull request Oct 18, 2024
Modify SQS ingestion to listen for ACKs asynchronously so that input workers can keep reading new objects after a previous one has been published, instead of blocking on full upstream ingestion. This addresses the bottleneck where ingesting many small objects is slow as each one waits for a full ingestion round trip. With a default configuration, SQS queues with many small objects are now ingested up to 60x faster.
Labels
  • aws (Enable builds in the CI for aws cloud testing)
  • backport-8.x (Automated backport to the 8.x branch with mergify)
  • backport-8.15 (Automated backport to the 8.15 branch with mergify)
  • Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team)
Development

Successfully merging this pull request may close these issues.

aws-s3 input workers shouldn't wait for objects to be fully ingested before starting the next object
8 participants