
kvcoord: Add a rangefeed retry reason for stream termination #101330

Closed
miretskiy opened this issue Apr 12, 2023 · 11 comments · Fixed by #103049
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@miretskiy
Contributor

miretskiy commented Apr 12, 2023

The mux rangefeed node server-side code currently sends a kvpb.NewRangeFeedRetryError(kvpb.RangeFeedRetryError_REASON_REPLICA_REMOVED) error when the logical rangefeed completes with a nil error.
A nil error can be returned when the processor is being unloaded (e.g. during node shutdown).
The error is semantically equivalent to sending io.EOF to the client, but we would need to make sure
that io.EOF can be correctly encoded across RPC boundaries.

We should add an explicit rangefeed retry reason.
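
As a rough sketch of what that could look like on the node side (the reason name REASON_RANGEFEED_CLOSED and the helper below are purely illustrative; the actual name and wiring are whatever the fix PR settles on):

```go
// Sketch only: map a clean (nil) processor shutdown to an explicit retry
// reason instead of reusing REASON_REPLICA_REMOVED. The reason name is
// hypothetical.
package rangefeedsketch

import "github.com/cockroachdb/cockroach/pkg/kv/kvpb"

func muxRangeFeedCompletionError(err error) *kvpb.Error {
	if err == nil {
		// The processor finished cleanly (e.g. it was unloaded during node
		// shutdown). Tell the client explicitly so it can restart the
		// rangefeed for this range rather than silently dropping it.
		return kvpb.NewError(kvpb.NewRangeFeedRetryError(
			kvpb.RangeFeedRetryError_REASON_RANGEFEED_CLOSED))
	}
	return kvpb.NewError(err)
}
```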

#100649

Jira issue: CRDB-26904

@miretskiy miretskiy added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. A-kv Anything in KV that doesn't belong in a more specific category. T-kv KV Team T-kv-replication labels Apr 12, 2023
@blathers-crl

blathers-crl bot commented Apr 12, 2023

cc @cockroachdb/replication

@aliher1911
Contributor

@miretskiy
I'm trying to reproduce this exact behaviour in a test. When the node goes away, the processor sends NodeUnavailableError because it stops on the stopper's quiesce channel. The only time it sends a nil error is when it is stopped via Stop(), but I can only get that when I stop the rangefeed itself.

Any clues how I can easily reproduce the scenario?

@aliher1911
Contributor

I tried different arrangements where either the leaseholder or a follower is serving the rangefeed, and the behaviour is consistent: you only get nil if the processor is stopped because all rangefeeds were closed. But in that case it is the client who requests termination by calling rangefeed.RangeFeed.Close(), so a nil error is justified.

@aliher1911
Contributor

@miretskiy is this specific to mux rangefeed, maybe? When we start to wind down a node we cancel rangefeeds, and that's why we get nil errors when the underlying rangefeeds stop before they are stopped by the stopper. Let me try to verify that.

@miretskiy
Contributor Author

miretskiy commented May 9, 2023

Yes, it is specific to mux rangefeed. The regular rangefeed RPC stream returns a nil error when it is closed, which indicates
normal termination. The library then treats this as a transient error and simply restarts that rangefeed.
In the case of mux rangefeed, I needed a way to communicate that one range, part of a larger mux rangefeed, terminated and needs to be restarted. Erik had questions on whether returning a direct io.EOF in that case would work correctly (i.e. whether the error is correctly encoded on the wire, etc.). I think it is, but it felt safer to just put in an explicit error.

@erikgrinaker
Contributor

Yeah, it looks like the errors library correctly encodes/decodes io.EOF across the network. I'm not sure exactly how it does this (maybe by reflection or by comparing error strings; worth finding out), but I know this isn't true for non-stdlib errors like gRPC errors or our own error types.

https://github.com/cockroachdb/errors/blob/509daddce11c4769d4d6eae34d7ec6afef8d81d5/markers/markers_test.go#L104-L114
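
For reference, the behaviour the linked test exercises can be checked in isolation with the library's EncodeError/DecodeError round trip (a minimal sketch, not the test itself):

```go
package main

import (
	"context"
	"fmt"
	"io"

	"github.com/cockroachdb/errors"
)

func main() {
	ctx := context.Background()

	// Simulate the wire crossing: encode io.EOF into the library's
	// protobuf-friendly form and decode it back on the "other side".
	enc := errors.EncodeError(ctx, io.EOF)
	dec := errors.DecodeError(ctx, enc)

	// Identity is preserved, so errors.Is still recognizes io.EOF.
	fmt.Println(errors.Is(dec, io.EOF)) // true
}
```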

@aliher1911
Contributor

aliher1911 commented May 9, 2023

I still think I don't completely understand the use case here.
I'm running this test:
https://gist.github.com/aliher1911/341c3dfa93c9f6486db6a4778dc6d32a
It tries to create a rangefeed and then:

  • relocate the replica off the node
  • stop the node that is serving the feed
  • Close() the rangefeed

It uses a gateway node that is always up during the test.
I added logging to the processor where we wind it down, to see which errors are going back to the client. What I observe is:

replica relocated:   PROCESSOR for r64 STOPPED: retry rangefeed (REASON_REPLICA_REMOVED)
stop processor node: PROCESSOR for r64 STOPPED (QUIESCE): node unavailable; try another peer
rangefeed Close():   PROCESSOR for r64 STOPPED: <nil>

I'm forcing muxrangefeed to be always on.

I also have logging where we check for nil, and I could never get a nil error there by just running the code in the gist. If we see it happening, maybe there's a race of some sort where we try to stop the rangefeed, but the mux is not yet aware and tries to trigger a reconnect?
Or are we trying to blindly replace the nil error with io.EOF or a new reason value to make it more transparent and avoid dealing with nil?

@aliher1911
Contributor

I guess the main question is: if the rangefeed was closed by the client, why does it need to be restarted?

@erikgrinaker
Contributor

erikgrinaker commented May 9, 2023

I think the idea was to plug a hole that could potentially cause leaks or hangs. It's possible for a rangefeed to complete successfully without an error, and if it does, this must be signalled to the mux client such that it can restart the rangefeed. Otherwise, we won't be receiving further closed timestamp updates from the range, and the entire changefeed stalls.

You may well be right that this won't happen in practice, because we only successfully close a rangefeed when the client shuts it down, but are you 100% sure that this is completely impossible and will never ever happen? I'm not, so it seems prudent to plug this hole and explicitly signal the client when it happens, so that we can eliminate it as a possible cause of problems. If the client isn't there anymore then it doesn't hurt anyway.

@aliher1911
Contributor

aliher1911 commented May 10, 2023

So we want to switch to socket-like semantics where, if someone closes the feed, the listener gets an EOF/custom error code instead of nil.
I think we should stick with a new custom error code here, as handling would be uniform rather than checking the error type in one case and the error code in the other. We need to think about how it will behave in a mixed-version cluster, but it is likely fine: we already substitute nil with RangeFeedRetryError_REASON_REPLICA_REMOVED, which is 0, so any new code will fall back to it as well and the retry will happen.
That case should be trivial. And since we don't know if it exists in the wild, I'm going to limit testing to a unit test, since I have not been able to trigger such behaviour so far.

@erikgrinaker
Contributor

erikgrinaker commented May 10, 2023

the listener gets an EOF/custom error code instead of nil

Well, currently the mux listener/client doesn't get anything at all. The nil is simply dropped and never propagated to the client. The client thinks the rangefeed is still running.

I think we should stick with a new custom error code here, as handling would be uniform rather than checking the error type in one case and the error code in the other.

There will still be a difference: the non-mux rangefeed will get an io.EOF when the client's gRPC stream is closed. The mux rangefeed will get the new error code when one of the component streams is closed, and io.EOF when the entire mux gRPC stream is closed. I think that's fine (the io.EOF and error-code cases are different and need different handling).
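
To make that distinction concrete, here is a sketch of how the mux client loop could branch on the two signals (the helper and names are illustrative, not the actual DistSender code):

```go
// Illustrative only. recvErr is what Recv() on the mux gRPC stream returned;
// streamErr is a per-range error event carried inside the mux stream.
package muxsketch

import (
	"io"

	"github.com/cockroachdb/cockroach/pkg/kv/kvpb"
	"github.com/cockroachdb/errors"
)

type action int

const (
	keepRunning action = iota
	restartSingleRangefeed
	restartMuxStream
)

func handleMuxTermination(recvErr error, streamErr *kvpb.Error) action {
	switch {
	case errors.Is(recvErr, io.EOF):
		// The entire mux gRPC stream closed: re-establish the whole mux
		// connection (or shut down, if the client asked for it).
		return restartMuxStream
	case streamErr != nil:
		// A single component rangefeed terminated (e.g. with the new retry
		// reason): restart only that range's rangefeed.
		return restartSingleRangefeed
	default:
		return keepRunning
	}
}
```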

We need to think about how it will behave in a mixed-version cluster, though.

Only send the new error code when V23_2 is active, otherwise send the current error code (REASON_REPLICA_REMOVED).
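
A hedged sketch of that version gate (clusterversion.V23_2 and Settings.Version.IsActive exist in the tree; the helper itself and the REASON_RANGEFEED_CLOSED name are assumptions carried over from the sketch above):

```go
// Sketch only: choose the retry reason to send on a clean (nil) processor
// shutdown, depending on whether the 23.2 upgrade has been finalized.
package rangefeedsketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/kv/kvpb"
	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
)

func retryReasonForCleanShutdown(
	ctx context.Context, st *cluster.Settings,
) kvpb.RangeFeedRetryError_Reason {
	if st.Version.IsActive(ctx, clusterversion.V23_2) {
		// All nodes understand the new, explicit reason.
		return kvpb.RangeFeedRetryError_REASON_RANGEFEED_CLOSED
	}
	// Mixed-version cluster: older nodes don't know the new reason, so keep
	// sending the historical substitute.
	return kvpb.RangeFeedRetryError_REASON_REPLICA_REMOVED
}
```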
