On kafka consumer rebalance, Vector consumer stops consuming. #22006

Open
ADustyOldMuffin opened this issue Dec 10, 2024 · 2 comments
Labels
type: bug A code related bug.

Comments

@ADustyOldMuffin

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We have observed that when a connectivity problem triggers a consumer group rebalance, Vector will intermittently stop consuming from the affected topic, and consumer lag builds up in Kafka.

rdkafka has several known issues that were introduced in versions 0.34.0 through 0.36.0.

Vector's Kafka source component consumes through the StreamConsumer, which is one of the consumers reported as affected.

Those issues report that v0.35.0 should be okay, but judging from fede1024/rust-rdkafka#666, the PR that fixes the problem, the bug would still be present in v0.35.0 for the StreamConsumer, as seen here.

Configuration

No response

Version

0.40.0

Debug Output

No response

Example Data

No response

Additional Context

We'd like to propose upgrading rdkafka to version 0.37.0, since we've identified that the StreamConsumer fix is included in that release.

Until then, we have to monitor Kafka and restart Vector whenever this happens. It is also worth noting that, because of #21134, we can't simply monitor Vector itself: the metrics that would indicate something is wrong are currently incorrect.
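For anyone needing the same workaround, here is a minimal sketch of the broker-side stall check, assuming you already collect the group's committed offset and the partition's log end offset at regular intervals (for example from your Kafka monitoring). The names `OffsetSample` and `is_stalled` are hypothetical and not part of Vector or rdkafka; the restart itself is left to whatever supervises Vector.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class OffsetSample:
    """One broker-side observation of a single partition (hypothetical type)."""
    committed: int  # last offset committed by the consumer group
    log_end: int    # latest offset written to the partition


def is_stalled(samples: Sequence[OffsetSample], min_samples: int = 3) -> bool:
    """Return True when the group has lag but committed nothing across the window."""
    if len(samples) < min_samples:
        return False  # not enough history to decide
    window = samples[-min_samples:]
    # No progress: the committed offset is identical in every sample of the window.
    no_progress = all(s.committed == window[0].committed for s in window)
    # Lag: there are messages on the partition the group has not committed yet.
    has_lag = window[-1].log_end > window[-1].committed
    return no_progress and has_lag


if __name__ == "__main__":
    healthy = [OffsetSample(100, 120), OffsetSample(115, 130), OffsetSample(130, 140)]
    stuck = [OffsetSample(100, 120), OffsetSample(100, 150), OffsetSample(100, 190)]
    print(is_stalled(healthy), is_stalled(stuck))
```

Requiring both conditions keeps an idle topic (no new messages, no lag) from being flagged as a stall.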

References

No response

@ADustyOldMuffin ADustyOldMuffin added the type: bug A code related bug. label Dec 10, 2024
@ADustyOldMuffin
Author

The chore PR bumping rdkafka is #21929. I'm noting it here to document that the bump will resolve a fairly large issue.

@sam6258
Contributor

sam6258 commented Dec 12, 2024

So I think #21134 is actually caused by the StreamConsumer race. I took a look at where the metrics are produced in Vector, and it's just a callback into the Rust rdkafka library. It would make sense that if the consumer thread goes idle for a particular partition, it no longer updates that partition's lag metric.
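To illustrate the hypothesis only (this is not Vector's or rdkafka's code), here is a toy sketch of a gauge that is written solely from a callback. If the callback stops firing, the gauge keeps reporting its last value, so the number looks plausible while being out of date. The class `CallbackGauge` and its methods are hypothetical names.

```python
import time
from typing import Callable, Optional


class CallbackGauge:
    """Toy lag gauge that only changes when a stats callback fires (hypothetical)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self.value: Optional[int] = None
        self._updated_at: Optional[float] = None

    def on_stats(self, lag: int) -> None:
        """Callback entry point: record the lag and when it was reported."""
        self.value = lag
        self._updated_at = self._clock()

    def is_stale(self, max_age_s: float) -> bool:
        """True if no callback has updated the gauge within max_age_s seconds."""
        if self._updated_at is None:
            return True
        return self._clock() - self._updated_at > max_age_s
```

Tracking the last update time alongside the value is one way to tell "lag is 5" apart from "lag was 5 the last time anyone reported it".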

@ADustyOldMuffin ADustyOldMuffin changed the title from "rdkafka v0.35.0 has a StreamConsumer race condition that causes a deadlock on consumer group changes" to "On kafka consumer rebalance, Vector consumer stops consuming." Dec 28, 2024