Runaway stream when `max.poll.interval.ms` is exceeded #1262
Hi @myazinn. Thanks for the extensive report! Very useful. This is complicated stuff, I have to think about it more before I can give a good answer. For now: because zio-kafka doesn't run the streams (the user does), zio-kafka does not have full control over them. In particular, we cannot interrupt the stream at any moment, only when it needs to fetch more records. In the example this happens every 5 records. This explains why offset 0 is committed even though processing took 10 seconds. Unfortunately, the commit does not complete, otherwise the retry would skip offset 0. I have to think a bit more on why the first consumer does not completely close and even tries to complete the commit (and eventually fails).
As a quick fix: it is perfectly fine to raise `max.poll.interval.ms`.
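As a minimal sketch of that quick fix (the broker address and group id are placeholders, not from the original thread), the setting can be raised via `ConsumerSettings`:

```scala
import zio.kafka.consumer.ConsumerSettings

// Sketch: raise max.poll.interval.ms (here to 10 minutes) so that slow
// record handlers are not evicted from the consumer group.
// "localhost:9092" and "my-group" are placeholder values.
val settings: ConsumerSettings =
  ConsumerSettings(List("localhost:9092"))
    .withGroupId("my-group")
    .withProperty("max.poll.interval.ms", "600000")
```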
Thanks @erikvanoosten! Yeah, that's what we've decided to do as well (increasing `max.poll.interval.ms`).
Here's a self-contained test that reproduces the issue. It models a scenario where event handling takes more time than the `max.poll.interval.ms` value (which happens for us from time to time :( ).

The expected behaviour seems to be that once the poll interval is exceeded, the stream is interrupted along with all child fibers. So each time we "retry" the Kafka stream, it should behave as if it starts from scratch and nothing happened.
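The original test is not reproduced here, but the shape of the scenario might be sketched as follows (topic name, handler, and timings are hypothetical, not from the report):

```scala
import zio._
import zio.kafka.consumer.{Consumer, Subscription}
import zio.kafka.serde.Serde

// Hypothetical handler that takes longer than max.poll.interval.ms,
// so the broker considers the consumer dead mid-processing.
def slowHandler(value: String): Task[Unit] =
  ZIO.sleep(10.seconds) // longer than the configured poll interval

// Sketch: consume, process slowly, then commit. The commit issued after
// the interval is exceeded is the one that later fails with CommitTimeout.
val slowStream: ZIO[Consumer, Throwable, Unit] =
  Consumer
    .plainStream(Subscription.topics("events"), Serde.string, Serde.string)
    .mapZIO(record => slowHandler(record.value) *> record.offset.commit)
    .runDrain
```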
But that's not what we see here. Once `max.poll.interval.ms` is exceeded, zio-kafka "forgets" the stream and re-subscribes to Kafka. But the "forgotten" stream is not actually dead and messes with the RunLoop state. Here's what you'll get if you run the test.

Note the `CommitTimeout` exception. The failure itself could be OK, but the exact exception is extremely misleading.

And there's more. In a more "real-world" scenario it can lead to the stream hanging indefinitely. Here's how (leaving only the important part):
Here we have two topics, one of which is perfectly fine, while the second one fails on the first iteration. Eventually that "broken" iteration fails both streams, and the test never completes. Here's what you'll get when you run the test.
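The two-topic setup described above might be sketched like this (topic names and the failure are hypothetical placeholders for the lost code block):

```scala
import zio._
import zio.kafka.consumer.{Consumer, Subscription}
import zio.kafka.serde.Serde
import zio.stream.ZStream

// Hypothetical sketch: one healthy stream and one that fails immediately.
val healthy: ZStream[Consumer, Throwable, Unit] =
  Consumer
    .plainStream(Subscription.topics("good-topic"), Serde.string, Serde.string)
    .mapZIO(record => record.offset.commit)

val broken: ZStream[Consumer, Throwable, Unit] =
  Consumer
    .plainStream(Subscription.topics("bad-topic"), Serde.string, Serde.string)
    .mapZIO(_ => ZIO.fail(new RuntimeException("boom")))

// When `broken` dies after max.poll.interval.ms is exceeded, the leftover
// RunLoop state can stall `healthy` too, so this program never completes.
val program: ZIO[Consumer, Throwable, Unit] =
  ZStream.mergeAllUnbounded()(healthy, broken).runDrain
```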
And that's what we've actually encountered :(
It seems that it is caused by a workaround for this issue, but I'm not sure.
Let me know if you need anything else from my side.