cooperative-sticky performReassignments can get stuck in an infinite loop #4783
Comments
OK, I can reproduce it. As long as I have different consumers in the same consumer group subscribed to different topics and try to upscale, the assignor gets stuck in this loop. When I make the topics use different consumer groups, it no longer hangs. I am not sure at the moment whether this is a multi-process issue or an issue that only affects a single-process setup (shared state? name collision?) with multiple librdkafka consumers subscribed to multiple topics within the same consumer group. I will continue the investigation.
Here are some extensive debug logs. The moment I see the 100% CPU spike, I kill -9 the whole process. I cannot, however, pinpoint (yet) where exactly it starts in the logs. What is also interesting is that the consumer is still responding (I can poll data), but it hangs when I attempt to close it.
Kafka logs from the moment of the hanging assignment:
What is crazy is that when I use a random, time-based client id, it does not hang. So I suspect it is somehow related to metadata requests. I cannot always reproduce it within Karafka.
What is SUPER interesting to me is the fact that this does not happen if I set
but when I use a constant
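A minimal sketch of this mitigation with the plain librdkafka C API (illustrative only; the helper name is made up and this is not the actual Karafka code):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <librdkafka/rdkafka.h>

/* Hypothetical helper: give each consumer instance a unique, time-based
 * client.id instead of a shared constant one (the mitigation described
 * above). Error handling omitted for brevity. */
static void set_time_based_client_id(rd_kafka_conf_t *conf) {
        char errstr[512];
        char client_id[64];

        snprintf(client_id, sizeof(client_id), "consumer-%ld-%d",
                 (long)time(NULL), rand() % 1000);
        rd_kafka_conf_set(conf, "client.id", client_id, errstr,
                          sizeof(errstr));
}
```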
One more update: I wanted to check whether this isn't a Kafka issue with how it (incorrectly) caches some of the metadata responses, but with RedPanda librdkafka presents the same behaviour and the same mitigation works.
Thanks a lot @mensfeld! I could reproduce it and found the cause. It's because of using the same variable at librdkafka/src/rdkafka_sticky_assignor.c, Line 821 in 6eaf89f.
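To illustrate the failure mode generically (hypothetical names only; this is not the actual code at line 821): when nested loops share one index variable, the outer loop's progress is clobbered on every pass and, depending on the bounds, the function can cycle forever.

```c
/* Buggy pattern: the inner loop reuses the outer loop's counter. After
 * the inner loop finishes, i == n_partitions; if n_partitions + 1 is
 * still below n_members, the outer loop re-enters, the inner loop resets
 * i to 0 again, and the function never terminates. */
static void nested_scan_buggy(int n_members, int n_partitions) {
        int i;
        for (i = 0; i < n_members; i++) {
                for (i = 0; i < n_partitions; i++) {
                        /* ... examine partition i ... */
                }
        }
}

/* Fixed pattern: each loop owns its own index, so both always progress. */
static void nested_scan_fixed(int n_members, int n_partitions) {
        int i, j;
        for (i = 0; i < n_members; i++) {
                for (j = 0; j < n_partitions; j++) {
                        /* ... examine partition j for member i ... */
                }
        }
}
```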
@emasab you are welcome. Can you explain to me why it was mitigated by client id randomization?
Not in all cases; I could reproduce it in Python even with
@emasab do you maybe have an ETA for this? Even just a PR would help me, because I could temporarily cherry-pick it and publish a special release for people affected on my side.
This seems to work:
Description
I do not know why, but the cooperative-sticky assignment can get stuck in an infinite loop causing 100% CPU usage and hanging.
How to reproduce
At the moment I am able to reproduce it reliably.
I subscribe a few consumers from the same consumer group to two topics. Each consumer is subscribed to one of the two topics. I start with one consumer instance per topic and then, every 5 seconds, I add one more consumer instance. Everything is fine until I start adding consumers subscribed to the second topic.
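Roughly, the setup corresponds to something like the following plain librdkafka C sketch (illustrative only: broker address, group name, and topic names are placeholders; the actual reproduction is a Ruby/Karafka process):

```c
#include <librdkafka/rdkafka.h>

/* Create a consumer in a shared consumer group, with a shared client.id
 * and the cooperative-sticky strategy, subscribed to a single topic.
 * Error handling is omitted for brevity. */
static rd_kafka_t *make_consumer(const char *topic) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "group.id", "shared_group",
                          errstr, sizeof(errstr));
        /* Constant client.id shared by every instance: the configuration
         * under which the hang was observed. */
        rd_kafka_conf_set(conf, "client.id", "same_for_all_instances",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "partition.assignment.strategy",
                          "cooperative-sticky", errstr, sizeof(errstr));

        rd_kafka_t *rk =
            rd_kafka_new(RD_KAFKA_CONSUMER, conf, errstr, sizeof(errstr));
        rd_kafka_poll_set_consumer(rk);

        rd_kafka_topic_partition_list_t *topics =
            rd_kafka_topic_partition_list_new(1);
        rd_kafka_topic_partition_list_add(topics, topic,
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_subscribe(rk, topics);
        rd_kafka_topic_partition_list_destroy(topics);
        return rk;
}

int main(void) {
        /* One consumer per topic first, then add more instances for the
         * second topic to trigger cooperative rebalances. */
        rd_kafka_t *consumers[3];
        consumers[0] = make_consumer("topic_a");
        consumers[1] = make_consumer("topic_b");
        consumers[2] = make_consumer("topic_b");

        for (;;) {
                for (int i = 0; i < 3; i++) {
                        rd_kafka_message_t *msg =
                            rd_kafka_consumer_poll(consumers[i], 100);
                        if (msg)
                                rd_kafka_message_destroy(msg);
                }
        }
        return 0;
}
```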
Reproduction: when client.id is not set (or is set to the same value) it will cause the described behaviour. If I randomize it, it will not.

Topics configuration:
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:

- librdkafka version: 2.4.0 and 2.5.0
- Apache Kafka version: confluentinc/cp-kafka:7.6.1
- librdkafka client configuration: partition.assignment.strategy: cooperative-sticky, client.id: same_for_all_instances
- Operating system: Linux shinra 5.15.0-113-generic #123-Ubuntu SMP Mon Jun 10 08:16:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- Provide logs (with debug=.. as necessary) from librdkafka: will be provided, trying to repro with logs

Full librdkafka config
Broker logs
Since I'm running a multi-threaded setup with a few consumer instances in one Ruby process to simulate it, and with the randomness of them joining and rejoining, there are some logs, but they do not differ from those of an instance that is not stuck.
Additional info
watch ps -T -p pid output that shows extreme usage of CPU on the rdk:main thread:

gdb of this thread:
It looks like this code never exits: librdkafka/src/rdkafka_sticky_assignor.c, Line 899 in 6eaf89f.
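For context, a heavily simplified sketch of the convergence-loop shape such a reassignment step typically has (hypothetical names and structure, not librdkafka's actual code). If a pass keeps reporting that it moved a partition, the outer loop never exits, which would match the observed spin:

```c
/* Sketch only: repeatedly try to improve the balance by moving partitions
 * between members, and stop once a full pass changes nothing. A bug that
 * makes every pass appear to have performed a reassignment (for example
 * through a reused/shared variable) turns this into a busy loop. */
static int perform_reassignments_sketch(void) {
        int reassignment_performed = 0;
        int modified;

        do {
                modified = 0;
                /* For each reassignable partition:
                 *   if moving it to a less-loaded member improves balance:
                 *     move it, set modified = 1 and
                 *     reassignment_performed = 1. */
        } while (modified);

        return reassignment_performed;
}
```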
I suspect (though I have not confirmed this yet) that it may be caused by a consumer instance leaving during the sticky rebalance, or something similar, causing this to run forever.
I tried stopping it with GDB several times and it is always in the same place (when inspecting the backtrace).
frame details:
locals for this frame
Forcing a rebalance by keeping GDB attached beyond max.poll.interval.ms does not trigger an exit from this loop. It keeps running forever.
Assignor logs from the moment it happens:
More detailed logs (assignor, generic, broker, cgrp, interceptor):
Once in a while, when I try to shut down such a hanging consumer, I get: