Amazon SQS input stalls on new queue flush timeout defaults #37754
Looks like @andrewkroh did the original implementation here in #27199; he might be best placed to comment on whether this is something we can improve in the implementation. CC @lucabelluccini. Also CC @elastic/obs-cloud-monitoring, since it doesn't look like the team label did anything. I think we'll want to document this recommendation in:
One complication with getting this recommendation consistently applied through documentation alone is that the awss3 input is an implementation detail of several integrations, so users might not be aware that they are affected until they observe the performance regression.
The
If both Additionally with the concepts being separated it would be an opportunity to concurrently process multiple S3 objects that can be contained in a single SQS notification. Today, the S3 objects contained in one SQS message are processed serially by one goroutine.
Is there a higher value of
There's a very old PR that does something similar: #33659. It's not exactly the same, as far as I understand what you are proposing, @andrewkroh. The changes in the PR creates While I guess you propose to have Or is it something even different? In the PR
Thanks @cmacknz for the notification ❤️
I think just warning users about this recommendation is tricky, as the chances it will be missed are high. This is one example of what can happen with Elastic Agent:
As a consequence, if the user has the AWS SQS input in any of the deployed integrations, they get the performance regression detailed here.
As an additional piece of information: it was not possible to customize the queue settings, including the flush timeout, via Fleet output settings before 8.12.
Known issue PRs for agent and beats:
It sounds like we'll need to coordinate between teams to get this implemented and resolve the core performance issue with the SQS input. @jlind23 can we figure out how to divvy this up to get this fix together?
Removing the agent label. Hoping it gets routed properly this time.
@aspacca's PR above may be worth another look. Based on a novice read, it seems to suggest the same fix as what @andrewkroh has mentioned. Wouldn't a configurable (I don't know why that PR was closed)
@strawgate Nima escalated this to the o11y team that owns the SQS input; there is an ongoing mail thread to get this sorted out. @bturquet you probably want to track this issue on your end.
@bturquet Shall we add this to one of your boards for prioritisation purposes?
@jlind23 we are tracking our progress in a separate issue for the S3 input: We are still in the performance testing stage (playing with different mixes of parameters and versions). We plan to have conclusions before the end of the week and will then decide if we need to make changes in the input logic. cc @aspacca for coordination and communication
I'm closing this one then in favour of yours.
Short version if you're here because your SQS ingestion slowed down after installing 8.12: if your configuration uses a performance preset, switch it to `preset: latency`. If you use no preset or a custom preset, then set `queue.mem.flush.timeout: 1`.
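For illustration, here is a minimal standalone-Beats-style sketch of the two options; the exact placement of these settings differs for Elastic Agent policies and Fleet output settings, the host is a placeholder, and `1s` below is the one-second value recommended above:

```yaml
# Option 1: with the Elasticsearch output presets introduced in 8.12,
# the latency preset restores a short flush interval.
output.elasticsearch:
  hosts: ["https://localhost:9200"]  # placeholder host
  preset: latency

# Option 2 (no preset, or a custom preset): set the memory queue
# flush timeout back to one second explicitly.
queue.mem:
  flush.timeout: 1s
```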
Long version:
In 8.12 the default memory queue flush interval was raised from 1 second to 10 seconds. In many configurations this improves performance because it allows the output to batch more events per round trip, which improves efficiency. However, the SQS input has an extra bottleneck that interacts badly with the new value.
The SQS input is configured with a number of input workers, by default 5. Each worker reads one message from the SQS queue, fetches and publishes the events it references, waits for those events to be acknowledged upstream, and then deletes the original message. The worker will not proceed to handling the next message until the previous one is fully acknowledged.
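For context, the worker count is an input-side setting. A hypothetical filebeat-style sketch, assuming the awss3 input's `max_number_of_messages` option (default 5) is what sets the number of SQS workers; the queue URL is a placeholder:

```yaml
filebeat.inputs:
  - type: aws-s3
    # Placeholder queue URL.
    queue_url: https://sqs.us-east-1.amazonaws.com/123456789012/example-queue
    # Assumed here to be the setting behind the "5 input workers" above:
    # how many SQS messages are read and processed concurrently.
    max_number_of_messages: 5
```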
Now suppose we are using default settings, and each SQS message corresponds to 200 events. 5 workers will read 5 SQS messages and publish 1000 events. However, this is less than the queue's `flush.min_events` value of 1600, so the queue will continue waiting for a full 10 seconds before making those events available to the output. Once it does, the output will need to fully ingest and acknowledge those events before the input workers resume. So no matter how fast the reading and ingestion is, the pipeline will be capped at 5 SQS messages every 10 seconds.
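To spell out that arithmetic against the queue settings involved, here is a sketch using the standalone-Beats key names and the default values described above:

```yaml
# 8.12 memory queue defaults involved in the stall described above:
queue.mem:
  flush.min_events: 1600  # flush once this many events are buffered...
  flush.timeout: 10s      # ...or after this long, whichever comes first

# With the default 5 SQS workers and ~200 events per message:
#   5 workers * 200 events = 1000 events in flight
#   1000 < 1600, so the min_events threshold is never reached
#   => every flush waits out the full 10s timeout
#   => throughput is capped at roughly 5 SQS messages per 10 seconds
```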
The pipeline expects backpressure to come from the outputs as their throughput is saturated, to propagate from there to the queue, and then to block the input's `Publish` calls once the queue becomes full. However, in many scenarios the current SQS input will never use more than a tiny fraction of the queue, and will be entirely dependent on the queue's flush interval to make progress.

One important question is whether the current approach is de facto imposed by Amazon APIs, rate limits, or similar. If that's the case then we'll need to look for other workarounds based on those constraints. However, the ideal solution would be for the SQS input to decouple message cleanup from the worker lifecycle, by saving acknowledgment metadata and moving on to the next SQS message before the previous one has been fully acknowledged. This would let the input take full advantage of the configured queue to accumulate data and improve ingestion performance. It would also improve performance beyond the existing baseline in some scenarios (even before 8.12, an SQS queue with small payloads could never be processed faster than 5 messages per second, no matter how fast the actual ingestion was).