Add asynchronous ACK handling to S3 and SQS inputs #40699
Conversation
@faec is this ready to be reviewed by the O11y team?
I think we've talked directly since this ping, but for visibility: I don't consider this ready for review until I can try it on a live SQS queue, which has been deferred for a few reasons (previously SDH rotation and illness, currently the OTel remote-offsite). I expect that running on a live queue will immediately fail in some obvious fixable ways and I want to get those out of the way before proper review. You also asked me to integrate #39709 (which was never merged to main) with this PR before finalizing, which hasn't been started and will probably take a day or two beyond the basic smoke check.
This pull request is now in conflicts. Could you fix it? 🙏
Let's separate the work here; I added issues to your next sprint to take care of #39718.
Thanks @faec for the reply, which closes some of the knowledge gaps on my end.
CHANGELOG.next.asciidoc (Outdated)
@@ -46,6 +46,7 @@ https://github.com/elastic/beats/compare/v8.8.1\...main[Check the HEAD diff]
- Added `container.image.name` to `journald` Filebeat input's Docker-specific translated fields. {pull}40450[40450]
- Change log.file.path field in awscloudwatch input to nested object. {pull}41099[41099]
- Remove deprecated awscloudwatch field from Filebeat. {pull}41089[41089]
- `max_number_of_messages` config for S3 input's SQS mode is now ignored. Instead use `number_of_workers` to scale ingestion rate in both S3 and SQS modes. {pull}40699[40699]
What impact will this have for users that were relying on max_number_of_messages to try to tune the input?
Is it just going to perform better with no user intervention? Could there be unintended side effects, like agents running out of memory because they can now fill up their queues?
Under what circumstances would number_of_workers be something a user has to adjust?
> What impact will this have for users that were relying on max_number_of_messages to try to tune the input?

Users that set the field to something unreasonably high to improve throughput (which was previously the only way to work around this bottleneck) will instead get good performance by default with no added cost. The only scenario that should be "bad" in this change is if the input objects are very small (the worst-case bottleneck) and they intentionally tuned max_number_of_messages low so that ingestion would be extremely slow. In that case, ingestion speed will significantly increase (which costs a fair amount more network bandwidth, and marginally more CPU). [Though "I used this special config value to intentionally throttle my ingestion to one event per second" is very much in the "holding the escape key down to warm up the room" category imo -- total ingestion costs should not go up; it should just empty the initial queue faster.]

> agents running out of memory because they can now fill up their queues?

It isn't impossible for this to happen -- if users configured a queue larger than they actually used, and were relying on the fact that this input never has more than 5-10 active events, then the new version will use more memory. However, even a full queue is usually less than 10% of Filebeat's memory footprint, so this wouldn't be a significant factor unless they had explicitly configured a very large queue that they never use.

> Under what circumstances would number_of_workers be something a user has to adjust?

If they have a cap on network resources, or to a lesser extent CPU ("number of workers" is also the number of potentially parallel AWS API calls/downloads, and the number of downloaded JSON blobs parsed in parallel). Or in the other direction, I suppose, if they have a lot of cores and bandwidth and really want the SQS queue to move fast.
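For concreteness, here is a minimal filebeat.yml sketch of the kind of tuning discussed above. It assumes the documented aws-s3 option names (queue_url, number_of_workers); the queue URL and worker count are placeholder values, not recommendations.

```yaml
filebeat.inputs:
  # aws-s3 input in SQS-notification mode (illustrative values only).
  - type: aws-s3
    queue_url: https://sqs.us-east-1.amazonaws.com/123456789012/my-notification-queue
    # number_of_workers is now the main throughput knob: it bounds how many
    # SQS messages / S3 objects are processed in parallel, and therefore the
    # peak number of concurrent AWS API calls, downloads, and JSON parses.
    # Lower it to cap network or CPU use; raise it on hosts with spare cores
    # and bandwidth to drain the queue faster.
    number_of_workers: 5
    # max_number_of_messages is ignored after this change and can be removed.
```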
Thanks, let's add a little bit more context around this.
I think this changelog is a bit too modest; it should specifically mention the performance improvement and that tuning max_number_of_messages should no longer be necessary.
You should also mention the small object use case and the potential to increase memory usage.
The changelogs here are usually terse but that isn't actually a requirement. You could add an entire sub-section if you wanted to.
Done
Looks great, thanks!
Modify SQS ingestion to listen for ACKs asynchronously so that input workers can keep reading new objects after a previous one has been published, instead of blocking on full upstream ingestion. This addresses the bottleneck where ingesting many small objects is slow as each one waits for a full ingestion round trip. With a default configuration, SQS queues with many small objects are now ingested up to 60x faster.

(cherry picked from commit d2867fd)

# Conflicts:
#	go.sum
#	x-pack/filebeat/input/awss3/input_benchmark_test.go
#	x-pack/filebeat/input/awss3/s3_objects.go
#	x-pack/filebeat/input/awss3/sqs_s3_event_test.go
Modify SQS ingestion to listen for ACKs asynchronously so that input workers can keep reading new objects after a previous one has been published, instead of blocking on full upstream ingestion. This addresses the bottleneck where ingesting many small objects is slow as each one waits for a full ingestion round trip. With a default configuration, SQS queues with many small objects are now ingested up to 60x faster.

(cherry picked from commit d2867fd)

# Conflicts:
#	x-pack/filebeat/input/awss3/input_benchmark_test.go
#	x-pack/filebeat/input/awss3/sqs_s3_event_test.go
…puts (#41249)

* Add asynchronous ACK handling to S3 and SQS inputs (#40699)

Modify SQS ingestion to listen for ACKs asynchronously so that input workers can keep reading new objects after a previous one has been published, instead of blocking on full upstream ingestion. This addresses the bottleneck where ingesting many small objects is slow as each one waits for a full ingestion round trip. With a default configuration, SQS queues with many small objects are now ingested up to 60x faster.

(cherry picked from commit d2867fd)

# Conflicts:
#	x-pack/filebeat/input/awss3/input_benchmark_test.go
#	x-pack/filebeat/input/awss3/sqs_s3_event_test.go

* fix broken merge

---------

Co-authored-by: Fae Charlton <[email protected]>
Modify SQS ingestion to listen for ACKs asynchronously so that input workers can keep reading new objects after a previous one has been published, instead of blocking on full upstream ingestion. This addresses the bottleneck where ingesting many small objects is slow as each one waits for a full ingestion round trip. With a default configuration, SQS queues with many small objects are now ingested up to 60x faster.
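To illustrate the pattern the description refers to (not the actual implementation in x-pack/filebeat/input/awss3), here is a minimal, self-contained Go sketch. The message and ackTracker types are hypothetical stand-ins, and the print statement stands in for the SQS DeleteMessage call that the real code issues once every event from an object has been acknowledged.

```go
// Minimal sketch of the asynchronous ACK pattern: hypothetical types, not the
// real awss3 input code.
package main

import (
	"fmt"
	"sync"
)

// message stands in for one SQS notification pointing at one S3 object.
type message struct {
	id     string
	events int // number of events the object expands into
}

// ackTracker deletes the SQS message only after every event from it is ACKed.
type ackTracker struct {
	mu      sync.Mutex
	pending int
	onDone  func() // e.g. the SQS DeleteMessage call for this receipt handle
}

func (t *ackTracker) ack(n int) {
	t.mu.Lock()
	t.pending -= n
	done := t.pending == 0
	t.mu.Unlock()
	if done {
		t.onDone()
	}
}

func main() {
	msgs := []message{{"msg-a", 3}, {"msg-b", 1}, {"msg-c", 2}}
	var wg sync.WaitGroup

	for _, m := range msgs {
		m := m
		wg.Add(1)
		tracker := &ackTracker{
			pending: m.events,
			onDone: func() {
				fmt.Println("all events ACKed, deleting SQS message", m.id)
				wg.Done()
			},
		}

		// The worker publishes the object's events and moves on immediately;
		// it does NOT block waiting for upstream ingestion. The pipeline's
		// ACK callback (simulated here by a goroutine) fires later.
		go func() {
			for i := 0; i < m.events; i++ {
				tracker.ack(1)
			}
		}()
	}
	wg.Wait()
}
```

The key design point is that message deletion is driven by the ACK callback rather than by the worker loop, so slow upstream ingestion no longer throttles how fast new SQS messages are read.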
Checklist

- I have made corresponding change to the default configuration files
- I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally
This can best be tested by ingesting data from a live S3 or SQS queue. The scenario that most highlights the changed performance is ingesting many small individual objects.
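One way to seed that many-small-objects scenario is sketched below, assuming the standard aws-sdk-go-v2 modules, default AWS credentials, and a placeholder bucket (my-test-bucket) that already has event notifications wired to the SQS queue the input reads from.

```go
// Hypothetical helper to fill a test bucket with many small objects, the
// scenario that most benefits from this change.
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// Upload many tiny objects; with bucket notifications routed to SQS,
	// each upload produces one message for the aws-s3 input to consume.
	for i := 0; i < 10000; i++ {
		key := fmt.Sprintf("smoke-test/object-%05d.log", i)
		body := fmt.Sprintf(`{"message":"test event %d"}`+"\n", i)
		_, err := client.PutObject(ctx, &s3.PutObjectInput{
			Bucket: aws.String("my-test-bucket"),
			Key:    aws.String(key),
			Body:   strings.NewReader(body),
		})
		if err != nil {
			log.Fatal(err)
		}
	}
}
```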
Related issues
- aws-s3 input workers shouldn't wait for objects to be fully ingested before starting the next object #39414