Amazon SQS input stalls on new queue flush timeout defaults #37754

Closed

faec opened this issue Jan 25, 2024 · 14 comments
Labels
bug, Team:Cloud-Monitoring

Comments

@faec
Contributor

faec commented Jan 25, 2024

Short version, if you're here because your SQS ingestion slowed down after installing 8.12: if your configuration uses a performance preset, switch it to preset: latency. If you use no preset or a custom preset, set queue.mem.flush.timeout: 1.
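For reference, a minimal standalone-Beat YAML sketch of the two workarounds (the Elasticsearch host is a placeholder; apply whichever option matches your setup):

```yaml
# Option 1: you already use a performance preset. Switch it to the latency preset.
output.elasticsearch:
  hosts: ["https://localhost:9200"]  # placeholder
  preset: latency

# Option 2: you use no preset or a custom preset. Restore the pre-8.12 flush behavior.
queue.mem:
  flush.timeout: 1   # interpreted as 1 second, as recommended above
```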

Long version:

In 8.12 the default memory queue flush interval was raised from 1 second to 10 seconds. In many configurations this improves performance because it allows the output to batch more events per round trip, which improves efficiency. However, the SQS input has an extra bottleneck that interacts badly with the new value.

The SQS input is configured with a number of input workers, by default 5. Each worker reads one message from the SQS queue, fetches and publishes the events it references, waits for those events to be acknowledged upstream, and then deletes the original message. The worker will not proceed to handling the next message until the previous one is fully acknowledged.

Now suppose we are using default settings, and each SQS message corresponds to 200 events. 5 workers will read 5 SQS messages and publish 1000 events. However, this is less than the queue's flush.min_events value of 1600, so the queue will continue waiting for a full 10 seconds before making those events available to the output. Once it does, the output will need to fully ingest and acknowledge those events before the input workers resume. So no matter how fast the reading and ingestion is, the pipeline will be capped at 5 SQS messages every 10 seconds.
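To make the arithmetic concrete, here is a sketch of the settings involved, using the default values described above (the queue URL is a placeholder, and the 200 events per message is just the example from this scenario):

```yaml
queue.mem:
  flush.min_events: 1600  # batch target; 5 workers x 200 events = 1000 < 1600
  flush.timeout: 10s      # 8.12 default, so each batch waits the full 10 seconds

filebeat.inputs:
  - type: aws-s3
    queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder
    max_number_of_messages: 5  # default number of SQS workers
```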

The pipeline expects backpressure to come from the outputs as their throughput is saturated, to propagate from there to the queue, and then to block the input's Publish calls once the queue becomes full. However, in many scenarios the current SQS input will never use more than a tiny fraction of the queue, and will be entirely dependent on the queue's flush interval to make progress.

One important question is whether the current approach is de facto imposed by Amazon APIs, rate limits, or similar. If that's the case then we'll need to look for other workarounds based on those constraints. However, the ideal solution would be for the SQS input to decouple message cleanup from the worker lifecycle, by saving acknowledgment metadata and moving on to the next SQS message before the previous one has been fully acknowledged. This would let the input take full advantage of the configured queue to accumulate data and improve ingestion performance. It would also improve performance beyond the existing baseline in some scenarios (even before 8.12, an SQS queue with small payloads could never be processed faster than 5 messages per second, no matter how fast the actual ingestion was).

@faec added the bug, Team:Elastic-Agent, and Team:Cloud-Monitoring labels on Jan 25, 2024
@cmacknz
Member

cmacknz commented Jan 25, 2024

Looks like @andrewkroh did the original implementation in #27199; he might be best placed to comment on whether this is something we can improve in the implementation.

CC @lucabelluccini. Also CC @elastic/obs-cloud-monitoring since it doesn't look like the team label did anything.

I think we'll want to document this recommendation in:

  1. The 8.12 release notes for beats and agent.
  2. The preset documentation for beats and agent.
  3. The support knowledgebase.

One complication with relying on documentation alone is that the awss3 input is an implementation detail of several integrations, so users might not realize they are affected until they observe the performance regression.

@andrewkroh
Member

The max_number_of_messages configuration option controls the number of SQS messages that can be in flight (received from a queue by a consumer, but not yet deleted from the queue) at any time for the input. Each queue has an in-flight quota associated with it. Ideally you would keep the number of inputs * max_number_of_messages below the quota, but in practice this isn't a problem because the quota is high AND ReceiveMessage will silently stop handing out more SQS messages until you fall back below the quota.

max_number_of_messages also implicitly controls the number of goroutines that are used to process messages. I think this is where there is some flexibility to decouple the control of max in-flight SQS messages from the max number of goroutines. In fact, there is a number_of_workers setting with an aligned definition, but it's only used in S3 listing mode.

If both max_number_of_messages and number_of_workers were available for use with SQS mode then you could set max_number_of_messages to like 100 while keeping number_of_workers at a more conservative 5 to account for large internal queues with long flush intervals.
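Under that proposal, the input configuration might look something like this hypothetical sketch (today number_of_workers is only honored in S3 listing mode, so this is illustrative only; the queue URL is a placeholder):

```yaml
filebeat.inputs:
  - type: aws-s3
    queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder
    max_number_of_messages: 100  # proposed: cap on in-flight SQS messages
    number_of_workers: 5         # proposed: goroutines processing those messages
```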

Additionally, separating the two concepts would create an opportunity to concurrently process the multiple S3 objects that can be contained in a single SQS notification. Today, the S3 objects contained in one SQS message are processed serially by one goroutine.

@strawgate
Contributor

Is there a higher value of max_number_of_messages that we would feel comfortable defaulting to, in order to restore at least some performance?

@aspacca

aspacca commented Jan 26, 2024

If both max_number_of_messages and number_of_workers were available for use with SQS mode then you could set max_number_of_messages to like 100 while keeping number_of_workers at a more conservative 5 to account for large internal queues with long flush intervals.

There's a very old PR that does something similar: #33659

It's not exactly the same as what you are proposing, @andrewkroh, as far as I understand it.

The changes in the PR create number_of_workers goroutines, each consuming max_number_of_messages SQS messages.

Whereas I guess you propose to have number_of_workers goroutines, each consuming max_number_of_messages / number_of_workers messages. Is that correct?

Or is it something different still?

In the PR, number_of_sqs_consumers is introduced instead of reusing number_of_workers, but that's just a minor detail we can get rid of.

@lucabelluccini
Contributor

Thanks @cmacknz for the notification ❤️
I think the actions you propose are great (I can cover the knowledge article part).
This problem can affect:

  • Beats users (who use the AWS input explicitly)
  • Beats module users (who use the AWS input almost implicitly)
  • Elastic Agent integration users (who use the AWS SQS input almost implicitly)

I think just warning users about this recommendation is tricky, as the chances it will be missed are high.

This is one example of what can happen with Elastic Agent:

As a consequence, if the user has the AWS SQS input in any of the deployed integrations, they get the performance regression detailed here.

@strawgate
Contributor

The personalized settings didn't include the queue.mem.flush.timeout.

As an additional piece of information: it was not possible to customize the queue settings, including the flush timeout, via Fleet output settings before 8.12.

@cmacknz
Member

cmacknz commented Jan 26, 2024

@strawgate
Contributor

strawgate commented Jan 31, 2024

max_number_of_messages also implicitly controls the number of goroutines that are used to process messages. I think this is where there is some flexibility to decouple the control of max inflight SQS messages vs the max goroutines. In fact there is a number_of_workers setting with an aligned definition, but it's only used in S3 listing mode.

If both max_number_of_messages and number_of_workers were available for use with SQS mode then you could set max_number_of_messages to like 100 while keeping number_of_workers at a more conservative 5 to account for large internal queues with long flush intervals.

It sounds like we'll need to coordinate between teams to get this implemented and resolve the core performance issue with the SQS input. @jlind23 can we figure out how to divvy this up to get this fix together?

@nimarezainia removed the Team:Elastic-Agent label on Feb 1, 2024
@nimarezainia
Contributor

Removing the agent label, hoping it gets routed properly this time.

@nimarezainia
Contributor

@aspacca's PR above may be worth another look. Based on a novice read, it seems to suggest the same fix that @andrewkroh has mentioned. Wouldn't configurable max_number_of_messages and number_of_workers settings be a fix for this?

(I don't know why that PR was closed.)

@jlind23
Collaborator

jlind23 commented Feb 2, 2024

It sounds like we'll need to coordinate between teams to get this implemented and resolve the core performance issue with the SQS input. @jlind23 can we figure out how to divvy this up to get this fix together?

@strawgate Nima escalated this to the o11y team that owns the SQS input; there is an ongoing mail thread to get this sorted out.

@bturquet you probably want to track this issue on your end.

@jlind23
Collaborator

jlind23 commented Feb 12, 2024

@bturquet Shall we add this to one of your boards for prioritisation purposes?

@bturquet

@jlind23 we are tracking our progress in a separate issue for the S3 input:

We are still in the performance testing stage (playing with different mixes of parameters and versions). We plan to have conclusions before the end of the week and will then decide whether we need to make changes in the input logic.

cc @aspacca for coordination and communication

@jlind23
Collaborator

jlind23 commented Feb 13, 2024

I'm closing this one then in favour of yours.
