Cluster entering a recovery loop may be caused by Kafka sink QueueFull (Local: Queue full) #16640
Comments
@tabVersion Any recommendations / thoughts on this?
Seeing the log here
It indicates the batching is too small, which triggers a small failover that re-ingests the data within the same epoch over and over again.
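(The thread does not quote the exact properties being suggested. Assuming they refer to librdkafka's producer queue and batching settings, the sketch below shows where such options would apply on the producer side; the values are placeholders, not tuning recommendations. If I understand correctly, a RisingWave Kafka sink passes these librdkafka options through via the properties. prefix in the sink's WITH clause.)

```rust
use rdkafka::config::ClientConfig;
use rdkafka::error::KafkaError;
use rdkafka::producer::BaseProducer;

/// Build a producer whose local queue/batching limits are set explicitly.
/// The numbers are placeholders, not recommendations.
fn build_producer(brokers: &str) -> Result<BaseProducer, KafkaError> {
    ClientConfig::new()
        .set("bootstrap.servers", brokers)
        // Maximum number of messages allowed in the local producer queue.
        .set("queue.buffering.max.messages", "100000")
        // Maximum total size of the local producer queue, in kilobytes.
        .set("queue.buffering.max.kbytes", "1048576")
        // How long to wait for more messages before sending a batch (ms).
        .set("queue.buffering.max.ms", "100")
        // Upper bound on the number of messages batched into one request.
        .set("batch.num.messages", "10000")
        .create()
}
```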
@tabVersion Thanks for the suggestions, but I have a concern about how to set those properties well. In practice, the upstream throughput can change frequently: a configuration may work fine when the upstream throughput is 1k/s, yet enter a recovery loop when the throughput gets high (100k/s) or Kafka is experiencing high load. And when we encounter this error, the only thing we can do is drop the sink to prevent the cluster from continuing to crash. The relevant retry logic is in risingwave/src/connector/src/sink/kafka.rs, lines 511 to 522 at 91b7ee2.
In this case, I think we should await all in-flight deliveries being drained, or wait a sufficient amount of time, before retrying; otherwise the producer queue may keep reaching full.
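A minimal sketch of that idea (not the actual RisingWave implementation), using rust-rdkafka's BaseProducer: when send fails with QueueFull, the producer is polled so in-flight deliveries can drain before the same record is retried. The broker address and topic below are placeholders.

```rust
use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::error::{KafkaError, RDKafkaErrorCode};
use rdkafka::producer::{BaseProducer, BaseRecord, Producer};

/// Send one record, backing off whenever librdkafka's local queue is full.
fn send_with_backpressure(
    producer: &BaseProducer,
    topic: &str,
    key: &[u8],
    payload: &[u8],
) -> Result<(), KafkaError> {
    let mut record = BaseRecord::to(topic).key(key).payload(payload);
    loop {
        match producer.send(record) {
            Ok(()) => return Ok(()),
            // Local queue is full: poll so in-flight deliveries can drain,
            // instead of retrying immediately in a tight loop.
            Err((KafkaError::MessageProduction(RDKafkaErrorCode::QueueFull), rec)) => {
                producer.poll(Duration::from_millis(100));
                record = rec;
            }
            Err((e, _)) => return Err(e),
        }
    }
}

fn main() -> Result<(), KafkaError> {
    // Placeholder broker address and topic.
    let producer: BaseProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .create()?;
    send_with_backpressure(&producer, "demo-topic", b"key-1", b"hello queue")?;
    // Block until everything queued so far has been delivered (or timed out).
    let _ = producer.flush(Duration::from_secs(10));
    Ok(())
}
```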
It seems this issue is not urgent, because I can't reproduce it. 😅
remove from milestone, keep open for tracking
It seems that this error (Queue Full) can happen when the network bandwidth reaches its limit; after we increased the bandwidth, the error disappeared.
close as false alarm
Describe the bug
Description
The cluster entered a recovery loop when creating a Kafka sink from an MV (about 4 million records); when the problematic sink was dropped, the system went back to normal.
About 20 minutes later, the sink was successfully recreated.
Other observed phenomena
Kafka itself should be working fine, because within the recovery period the topic had 140 million records written into it while the upstream MV only had 4 million records (which suggests the same data was re-ingested repeatedly during the recovery loop).
Not sure whether increasing properties.retry.max can solve this issue.
Error message/log
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
v1.8.2
Additional context
No response