kv: Rangefeeds appear to be stuck #86818
cc @cockroachdb/cdc
cc @cockroachdb/replication
We're hopeful this may have been addressed by #106053, since we've seen this on multi-store nodes. However, we'll keep this issue open until we're confident it has been resolved. We've since added a stuck rangefeed watcher, which will restart rangefeeds that don't emit events, masking the problem. The way to confirm this would be to inspect the telemetry for stuck rangefeed restarts (…).
@miretskiy Is the above telemetry sufficient here, or do we need metrics as well?
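For illustration, here is a minimal Go sketch of what such a watcher could look like. The startRangefeed function, event type, and threshold are hypothetical stand-ins, not CockroachDB's actual API; the point is only the restart-on-silence mechanism, which masks the underlying bug rather than fixing it.

```go
// stuckwatcher.go: a minimal sketch of a "stuck rangefeed watcher".
// startRangefeed and event are hypothetical placeholders, not the real API.
package main

import (
	"context"
	"log"
	"time"
)

// event stands in for a rangefeed value or checkpoint (assumed type).
type event struct{}

// startRangefeed is a placeholder for establishing a rangefeed stream.
func startRangefeed(ctx context.Context) <-chan event {
	// A real implementation would stream KV events here.
	return make(chan event)
}

// watchRangefeed consumes the feed and restarts it whenever no event or
// checkpoint has been observed for stuckThreshold.
func watchRangefeed(ctx context.Context, stuckThreshold time.Duration) {
	for ctx.Err() == nil {
		feedCtx, cancel := context.WithCancel(ctx)
		events := startRangefeed(feedCtx)
		timer := time.NewTimer(stuckThreshold)

	consume:
		for {
			select {
			case <-ctx.Done():
				cancel()
				timer.Stop()
				return
			case _, ok := <-events:
				if !ok {
					break consume // stream closed; reconnect
				}
				// Any event or checkpoint proves the feed is alive.
				if !timer.Stop() {
					select {
					case <-timer.C:
					default:
					}
				}
				timer.Reset(stuckThreshold)
			case <-timer.C:
				log.Printf("rangefeed silent for %s; restarting", stuckThreshold)
				break consume
			}
		}
		cancel()
		timer.Stop()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	watchRangefeed(ctx, time.Second)
}
```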
We have …
Given the number of changes in this area recently, I'm closing this until we have further reports of stuck rangefeeds.
This has been observed in large-scale deployments; see https://github.com/cockroachlabs/support/issues/1729.
The cause of changefeed/rangefeed stuckness is not understood. However, what should never happen is stacks like the following: the DistSender blocked in (gRPC) Recv for 107 minutes, since each range should be producing either events or range checkpoints at least once per kv.closed_timestamp.side_transport_interval.
This is a repeat of an issue we observed about a year ago for the same customer.
It seems that this happens when there is significant activity, with ranges getting split or moved (possibly to different nodes and/or stores). There appears to be some sort of race where the rangefeed is not disconnected; it remains in a zombie state in which the matching goroutines still exist on the server side, but nothing is emitted, which causes the stuckness.
We should add a defense-in-depth mechanism to the DistSender (being worked on; see the sketch below), and we should also figure out what is actually going on.
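For illustration, a minimal Go sketch of the kind of defense-in-depth check described above. The eventStream interface and recvWithDeadline helper are hypothetical, not the real DistSender or gRPC types; the idea is simply that, because each range is expected to emit events or checkpoints roughly every kv.closed_timestamp.side_transport_interval, a Recv that blocks for many multiples of that interval can be treated as a zombie stream and torn down.

```go
// recvdeadline.go: a sketch of a DistSender-side defense-in-depth check,
// using a hypothetical eventStream interface rather than the real gRPC types.
package main

import (
	"context"
	"errors"
	"time"
)

// eventStream stands in for a per-range rangefeed stream (assumed).
type eventStream interface {
	Recv() (any, error)
	Close() // unblocks a pending Recv
}

var errStuckStream = errors.New("rangefeed stream went silent; restarting")

// recvWithDeadline fails the stream if Recv blocks far longer than the
// expected checkpoint cadence, instead of waiting forever.
func recvWithDeadline(
	ctx context.Context, s eventStream, sideTransportInterval time.Duration,
) (any, error) {
	// Allow generous slack: checkpoints may lag, but never by ~107 minutes.
	deadline := 100 * sideTransportInterval
	if deadline < time.Minute {
		deadline = time.Minute
	}

	type result struct {
		ev  any
		err error
	}
	resC := make(chan result, 1)
	go func() {
		ev, err := s.Recv()
		resC <- result{ev, err}
	}()

	timer := time.NewTimer(deadline)
	defer timer.Stop()
	select {
	case <-ctx.Done():
		s.Close()
		return nil, ctx.Err()
	case r := <-resC:
		return r.ev, r.err
	case <-timer.C:
		// This looks like the zombie state described above: server-side
		// goroutines still exist, but nothing is emitted. Tear down and retry.
		s.Close()
		return nil, errStuckStream
	}
}

func main() {
	// Wiring a real stream is out of scope for this sketch.
	_ = recvWithDeadline
}
```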
Jira issue: CRDB-18946