
Allow throttling SQS #53

Open · wants to merge 1 commit into master
Conversation

AlexITC commented Aug 22, 2022

Adds new config options to throttle SQS.

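For illustration, here is a minimal sketch of what throttling the SQS source with Akka Streams' built-in `throttle` operator can look like. The config values and the message/sink stand-ins are hypothetical, not the ones added in this PR:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

object ThrottledSqsSketch extends App {
  implicit val system: ActorSystem = ActorSystem("throttle-sketch")
  import system.dispatcher

  // Hypothetical throttle settings that new config options could expose.
  val maxMessages = 100      // emit at most this many messages...
  val interval    = 1.second // ...per this interval

  // Stand-in for the real SQS source: an endless stream of fake messages.
  val sqsMessages: Source[String, _] = Source.repeat("sqs-message")

  sqsMessages
    .throttle(maxMessages, interval) // cap the source at the configured rate
    .take(500)                       // bounded here only so the sketch terminates
    .runWith(Sink.foreach(_ => ()))  // stand-in for the Kinesis write sink
    .onComplete(_ => system.terminate())
}
```

The idea is simply that the source never emits more than the configured number of messages per interval, regardless of how fast SQS can deliver them.
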
istreeter (Contributor) commented

Hi @AlexITC sorry I've been slow to get around to looking at this PR.

Please can you tell me more about why this change is needed? What behaviour were you finding without this change? For example, was it a problem with the app running out of memory? Or exceeding kinesis rate limits?

Are you using it as part of a Snowplow pipeline setup, or is it for some other purpose of copying from sqs to kinesis?

AlexITC (Author) commented Sep 22, 2022

We have a project that copies data from SQS to Kinesis. Still, from time to time, messages are consumed faster than Kinesis can process them, leaving us with many errors.

We introduced a throttling strategy so that we don't pull more data from SQS than our Kinesis instance can actually handle, and so far it has worked reasonably well for us.

So we decided to contribute this minor patch in case others find it useful.

AlexITC (Author) commented Oct 31, 2022

@istreeter any luck checking this?

istreeter (Contributor) commented

Hi @AlexITC sorry I've been very slow to respond to this. I will explain why I am hesitating about this change...

Akka-streams apps typically work with the concept of sinks applying backpressure towards the source, not the other way around. In sqs2kinesis, the "sink" is writing to kinesis. Sqs2kinesis works on the principle that it should write to kinesis as quickly as it is allowed to. When it is throttled by kinesis... then that's OK! It retries sending to kinesis until it is successful. Then it fetches more records from SQS.

This design of the app seems to fit the akka-streams philosophy, that slow sinks set the overall rate, and quick sources adapt to that rate. By throttling the source, it seems we are flipping that paradigm the wrong way around.
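
As an illustrative aside (not sqs2kinesis code), here is that principle in a minimal Akka Streams sketch: a deliberately slow sink backpressures a fast source, so the source is only asked for more elements once the sink has spare capacity, without any explicit throttle:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

object BackpressureSketch extends App {
  implicit val system: ActorSystem = ActorSystem("backpressure-sketch")
  import system.dispatcher

  // A "fast" source standing in for SQS polling.
  val fastSource = Source(1 to 1000)

  // A "slow" sink standing in for the Kinesis writer: at most 10 writes in
  // flight, each taking ~50ms. Demand only flows upstream as writes complete.
  val slowKinesisWrite = Sink.foreachAsync[Int](10) { _ =>
    Future(Thread.sleep(50)) // pretend PutRecords latency (or a throttled retry)
  }

  fastSource.runWith(slowKinesisWrite).onComplete(_ => system.terminate())
}
```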

When you say...

> messages are consumed faster than Kinesis can process them, leaving us with many errors.

Are you just referring to messages in the logs saying that it got throttled? Or is there some other error that I'm not aware of?

If you just mean the messages in the logs, then I wouldn't really call them errors. I admit that we are currently logging them at ERROR logging level, and that is probably a mistake. Maybe we just need to re-think how we log this transient problem?

Sorry again to be slow with this PR. If there is a good reason why the fundamental design of the app is problematic then I'll try to help you out. I just want to make sure we put in the right fix in the right place.

AlexITC (Author) commented Dec 2, 2022

Thanks for taking the time to review and explain.

I should mention that I was hired for a few hours to look into a problem faced by a customer, so I don't have as much context as I'd like.

> Akka-streams apps typically work with the concept of sinks applying backpressure towards the source, not the other way around. In sqs2kinesis, the "sink" is writing to kinesis. Sqs2kinesis works on the principle that it should write to kinesis as quickly as it is allowed to. When it is throttled by kinesis... then that's OK! It retries sending to kinesis until it is successful. Then it fetches more records from SQS.

Thanks for explaining this; I'm aware of this back-pressure mechanism. What my customer shared is that their Kinesis instance ended up crashing when processing a large number of records (I don't know the exact number), which caused considerable delays before it recovered.

> This design of the app seems to fit the akka-streams philosophy, that slow sinks set the overall rate, and quick sources adapt to that rate. By throttling the source, it seems we are flipping that paradigm the wrong way around.

Fair point. Still, this has been the simplest solution for my customer, who is currently using my fork.

> Are you just referring to messages in the logs saying that it got throttled? Or is there some other error that I'm not aware of?

My customer told me that their Kinesis instance was constantly crashing until the throttling strategy was introduced. As far as I saw, failed messages ended up being retried.

> Sorry again to be slow with this PR. If there is a good reason why the fundamental design of the app is problematic then I'll try to help you out. I just want to make sure we put in the right fix in the right place.

I have raised this with my customer, who is interested in getting rid of my fork. Given the lack of details, do you have any better suggestion?

istreeter (Contributor) commented

Hi @AlexITC thank you that is very helpful extra context!

> their Kinesis instance ended up crashing when processing a large number of records

I find this very interesting! If we're talking about the original Kinesis™ service managed by AWS... then in my experience it does not crash like this, because it's a highly resilient cloud service.

I wonder, is the customer using a "Kinesis-compatible" non-AWS instance? Something like Localstack? If I am guessing right, then that's an interesting angle I had never thought about for sqs2kinesis. But I can see it would be nice to support these less resilient tools.

To avoid crashing the Kinesis sink, it might be possible to use some of the existing settings for backoff and retry. However, our backoff mechanism assumes that the Kinesis instance issues WriteThroughputExceeded messages. If we're talking about a non-AWS version of Kinesis, then I wonder whether it supports WriteThroughputExceeded at all.
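
For example, a retry-with-backoff wrapper around the Kinesis write flow could look roughly like the sketch below, using Akka Streams' RetryFlow. The request/response types here are hypothetical stand-ins, not the actual sqs2kinesis code or AWS SDK types:

```scala
import akka.NotUsed
import akka.stream.scaladsl.{Flow, RetryFlow}
import scala.concurrent.duration._

object KinesisRetrySketch {
  // Hypothetical request/response models standing in for the AWS SDK types.
  final case class PutBatch(records: List[String])
  final case class PutResult(throttled: Boolean)

  // Stand-in for the flow that actually calls PutRecords against Kinesis.
  val putToKinesis: Flow[PutBatch, PutResult, NotUsed] =
    Flow[PutBatch].map(_ => PutResult(throttled = false))

  // Resend a batch with exponential backoff for as long as the response
  // indicates it was throttled.
  // Arguments: minBackoff, maxBackoff, randomFactor, maxRetries, flow.
  val putWithBackoff: Flow[PutBatch, PutResult, NotUsed] =
    RetryFlow.withBackoff(100.millis, 10.seconds, 0.2, 10, putToKinesis) {
      (request, response) =>
        if (response.throttled) Some(request) // back off, then retry the same batch
        else None                             // success: emit the result downstream
    }
}
```

Whether something like this helps in your case depends on the non-AWS Kinesis implementation actually reporting throttling in a way the app can detect.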

By the way, I am leaning towards merging this for you. I never wanted to be a blocker for you or for your customer, I just wanted to understand the change.

AlexITC (Author) commented Dec 9, 2022

@istreeter apologies for the delay. I can't share much context about this just yet; what I understand so far is that the problem fixed by this throttling strategy occurs rarely and requires some specific settings/environment (I still don't know those details).
