kvcoord: Add a rangefeed retry reason for stream termination #101330
cc @cockroachdb/replication
@miretskiy Any clues how I can easily reproduce the scenario?
So I tried it with different arrangements where either the leaseholder or a follower is serving the rangefeed, and the behaviour is consistent: you only get nil if the processor is stopped because all rangefeeds were closed. But in that case it is the client who requests termination by doing …
@miretskiy is it specific to mux rangefeed maybe? When we start to wind down a node we cancel rangefeeds, and that's why we get nil errors when the underlying rangefeeds stop before they are stopped by the stopper. Let me try to verify that.
Yes, it is specific to mux rangefeed. The regular rangefeed RPC stream returns a nil error when it is closed, which indicates …
Yeah, looks like the errors library correctly encodes/decodes …
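(For reference, a minimal standalone sketch of the kind of round trip being checked here, assuming the cockroachdb/errors EncodeError/DecodeError API; this is illustrative, not code from the issue:)

```go
package main

import (
	"context"
	"fmt"
	"io"

	"github.com/cockroachdb/errors"
)

func main() {
	ctx := context.Background()

	// Encode the error the way it would be marshalled across an RPC
	// boundary, then decode it back on the "other side".
	encoded := errors.EncodeError(ctx, io.EOF)
	decoded := errors.DecodeError(ctx, encoded)

	// If the round trip preserves identity, errors.Is still recognizes io.EOF.
	fmt.Println(errors.Is(decoded, io.EOF)) // expected: true
}
```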
I still think I don't completely understand the use case here.
It uses a gateway node that is always up during the test.
I'm forcing mux rangefeed to be always on. I also have logging where we check for nil, and I could never get a nil error there by just running the code in the gist. If we see it happening, maybe there's a race of some sort where we try to stop the rangefeed, but mux is not yet aware and tries to trigger a reconnect?
I guess the main question is: if the rangefeed was closed by the client, why does it need to be restarted?
I think the idea was to plug a hole that could potentially cause leaks or hangs. It's possible for a rangefeed to complete successfully without an error, and if it does, this must be signalled to the mux client so that it can restart the rangefeed. Otherwise, we won't receive further closed timestamp updates from the range, and the entire changefeed stalls. You may well be right that this won't happen in practice, because we only successfully close a rangefeed when the client shuts it down, but are you 100% sure that this is completely impossible and will never happen? I'm not, so it seems prudent to plug this hole and explicitly signal the client when it happens, so that we can eliminate it as a possible cause of problems. If the client isn't there anymore, it doesn't hurt anyway.
So we want to switch that to socket-like semantics, where if someone closes the feed, the listener will get an EOF/custom error code instead of nil.
Well, currently the mux listener/client doesn't get anything at all. The nil is simply dropped and never propagated to the client. The client thinks the rangefeed is still running.
There will still be a difference: the non-mux rangefeed will get an …
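(To make the intent concrete, here is a rough, hypothetical sketch of the server-side behaviour under discussion. The `forwardRangeFeedCompletion` helper and its `send` callback are placeholders; only `kvpb.NewRangeFeedRetryError` and `REASON_REPLICA_REMOVED` come from the issue text.)

```go
package sketch

import (
	"github.com/cockroachdb/cockroach/pkg/kv/kvpb"
)

// forwardRangeFeedCompletion is a hypothetical sketch (not the actual
// CockroachDB code) of the behaviour discussed above: when the per-range
// rangefeed on the mux server finishes, a nil error must not be silently
// dropped; it is translated into an explicit retry error and sent to the
// mux client, which can then restart the feed or tear it down.
func forwardRangeFeedCompletion(send func(error), err error) {
	if err == nil {
		// The rangefeed completed "successfully" (e.g. its processor was
		// unloaded during node shutdown). Surface that explicitly instead of
		// letting the client believe the stream is still live. The issue
		// proposes a dedicated reason for this; REASON_REPLICA_REMOVED is
		// what the server currently reuses.
		err = kvpb.NewRangeFeedRetryError(kvpb.RangeFeedRetryError_REASON_REPLICA_REMOVED)
	}
	send(err)
}
```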
Only send the new error code when …
The mux rangefeed server-side code currently sends a
kvpb.NewRangeFeedRetryError(kvpb.RangeFeedRetryError_REASON_REPLICA_REMOVED)
error when the logical rangefeed completes with a nil error.
A nil error can be returned when the processor is being unloaded (e.g. during node shutdown).
This error is semantically equivalent to sending io.EOF to the client, but then we would want to make sure that io.EOF can be correctly encoded across RPC boundaries.
We should add an explicit rangefeed retry reason instead.
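(A rough sketch of what client-side handling in kvcoord might look like once a dedicated reason exists. The helper and the `REASON_RANGEFEED_CLOSED` name are hypothetical placeholders for whatever reason gets added; only `REASON_REPLICA_REMOVED` is taken from the issue text, so the proposed case is shown commented out.)

```go
package sketch

import (
	"github.com/cockroachdb/cockroach/pkg/kv/kvpb"
	"github.com/cockroachdb/errors"
)

// isRestartableRangeFeedError is a hypothetical helper sketching how the
// rangefeed client could classify a terminal stream error: retry-reason
// errors (including a new, explicit stream-termination reason) mean the
// single-range feed should simply be restarted rather than failing the
// whole changefeed.
func isRestartableRangeFeedError(err error) bool {
	var retryErr *kvpb.RangeFeedRetryError
	if !errors.As(err, &retryErr) {
		return false
	}
	switch retryErr.Reason {
	case kvpb.RangeFeedRetryError_REASON_REPLICA_REMOVED:
		// Today's stand-in for "the stream was terminated server-side".
		return true
	// case kvpb.RangeFeedRetryError_REASON_RANGEFEED_CLOSED:
	//	// Hypothetical new reason proposed by this issue: an explicit
	//	// "stream terminated" signal, distinct from replica removal.
	//	return true
	default:
		return false
	}
}
```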
#100649
Jira issue: CRDB-26904