
kvcoord: Add a rangefeed retry reason for stream termination #101330

Closed
miretskiy opened this issue Apr 12, 2023 · 11 comments · Fixed by #103049
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@miretskiy
Contributor

miretskiy commented Apr 12, 2023

The mux rangefeed node server-side code currently sends a kvpb.NewRangeFeedRetryError(kvpb.RangeFeedRetryError_REASON_REPLICA_REMOVED) error when the logical rangefeed completes with a nil error.
A nil error can be returned when the processor is being unloaded (e.g. during node shutdown).
The error is semantically equivalent to sending io.EOF to the client, but we would need to make sure
that io.EOF can be correctly encoded across RPC boundaries.

We should add an explicit rangefeed retry reason.
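
As a rough sketch of what that could look like on the node side (the reason name REASON_RANGEFEED_CLOSED and the helper below are purely illustrative; the actual name and wiring are whatever the fix PR settles on):

```go
// Sketch only: map a clean (nil) processor shutdown to an explicit retry
// reason instead of reusing REASON_REPLICA_REMOVED. The reason name is
// hypothetical.
package rangefeedsketch

import "github.com/cockroachdb/cockroach/pkg/kv/kvpb"

func muxRangeFeedCompletionError(err error) *kvpb.Error {
	if err == nil {
		// The processor finished cleanly (e.g. it was unloaded during node
		// shutdown). Tell the client explicitly so it can restart the
		// rangefeed for this range rather than silently dropping it.
		return kvpb.NewError(kvpb.NewRangeFeedRetryError(
			kvpb.RangeFeedRetryError_REASON_RANGEFEED_CLOSED))
	}
	return kvpb.NewError(err)
}
```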

#100649

Jira issue: CRDB-26904

@miretskiy miretskiy added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. A-kv Anything in KV that doesn't belong in a more specific category. T-kv KV Team T-kv-replication labels Apr 12, 2023
@blathers-crl

blathers-crl bot commented Apr 12, 2023

cc @cockroachdb/replication

@aliher1911
Contributor

@miretskiy
I'm trying to reproduce this exact behaviour in a test. When the node goes away, the processor sends NodeUnavailableError because it stops on the stopper's quiesce channel. The only time it sends a nil error is when it is stopped via Stop(), but I can only get that when I stop the rangefeed itself.

Any clues how I can easily reproduce the scenario?

@aliher1911
Contributor

I tried different arrangements where either the leaseholder or a follower is serving the rangefeed, and the behaviour is consistent: you only get nil if the processor is stopped because all rangefeeds were closed. But in that case it is the client who requests termination by calling rangefeed.RangeFeed.Close(), so a nil error is justified.

@aliher1911
Contributor

@miretskiy is this specific to mux rangefeed, maybe? When we start to wind down a node we cancel rangefeeds, and that's why we get nil errors when the underlying rangefeeds stop before they are stopped by the stopper. Let me try to verify that.

@miretskiy
Contributor Author

miretskiy commented May 9, 2023

Yes, it is specific to mux rangefeed. The regular rangefeed RPC stream returns a nil error when it is closed, which indicates
normal termination. The library then treats this as a transient error and simply restarts that rangefeed.
In the case of mux rangefeed, I needed a way to communicate that one range, part of a larger mux rangefeed, terminated and needs to be restarted. Erik had questions on whether returning a direct io.EOF in that case would work correctly (i.e. whether the error is correctly encoded on the wire, etc.). I think it is, but it felt safer to just put in an explicit error.

@erikgrinaker
Contributor

Yeah, it looks like the errors library correctly encodes/decodes io.EOF across the network. I'm not sure exactly how it does this (maybe by reflection or by comparing error strings; worth finding out), but I know this isn't true for non-stdlib errors like gRPC errors or our own error types.

https://github.com/cockroachdb/errors/blob/509daddce11c4769d4d6eae34d7ec6afef8d81d5/markers/markers_test.go#L104-L114
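
For reference, the behaviour the linked test exercises can be checked in isolation with the library's EncodeError/DecodeError round trip (a minimal sketch, not the test itself):

```go
package main

import (
	"context"
	"fmt"
	"io"

	"github.com/cockroachdb/errors"
)

func main() {
	ctx := context.Background()

	// Simulate the wire crossing: encode io.EOF into the library's
	// protobuf-friendly form and decode it back on the "other side".
	enc := errors.EncodeError(ctx, io.EOF)
	dec := errors.DecodeError(ctx, enc)

	// Identity is preserved, so errors.Is still recognizes io.EOF.
	fmt.Println(errors.Is(dec, io.EOF)) // true
}
```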

@aliher1911
Contributor

aliher1911 commented May 9, 2023

I still think I don't completely understand the use case here.
I'm running this test:
https://gist.github.com/aliher1911/341c3dfa93c9f6486db6a4778dc6d32a
It tries to create a rangefeed and then:

  • relocate the replica off the node
  • stop the node that is serving the feed
  • Close() the rangefeed

It uses a gateway node that is always up during the test.
I added logging to the processor where we wind it down, to see which errors are going back to the client. What I observe is:

replica relocated:   PROCESSOR for r64 STOPPED: retry rangefeed (REASON_REPLICA_REMOVED)
stop processor node: PROCESSOR for r64 STOPPED (QUIESCE): node unavailable; try another peer
rangefeed Close():   PROCESSOR for r64 STOPPED: <nil>

I'm forcing muxrangefeed to be always on.

I also have logging where we check for nil, and I could never get a nil error there by just running the code in the gist. If we see it happening, maybe there's a race of some sort where we try to stop the rangefeed, but the mux is not yet aware and tries to trigger a reconnect?
Or are we trying to blindly replace the nil error with io.EOF or a new reason value to make it more transparent and avoid dealing with nil?

@aliher1911
Contributor

I guess the main question is: if the rangefeed was closed by the client, why does it need to be restarted?

@erikgrinaker
Contributor

erikgrinaker commented May 9, 2023

I think the idea was to plug a hole that could potentially cause leaks or hangs. It's possible for a rangefeed to complete successfully without an error, and if it does, this must be signalled to the mux client such that it can restart the rangefeed. Otherwise, we won't be receiving further closed timestamp updates from the range, and the entire changefeed stalls.

You may well be right that this won't happen in practice, because we only successfully close a rangefeed when the client shuts it down, but are you 100% sure that this is completely impossible and will never ever happen? I'm not, so it seems prudent to plug this hole and explicitly signal the client when it happens, so that we can eliminate it as a possible cause of problems. If the client isn't there anymore then it doesn't hurt anyway.

@aliher1911
Contributor

aliher1911 commented May 10, 2023

So we want to switch to socket-like semantics where, if someone closes the feed, the listener gets an EOF/custom error code instead of nil.
I think we should stick with a new custom error code here, as handling would be uniform rather than checking the error type in one case and the error code in the other. We need to think about how it will behave in a mixed-version cluster, but it is likely fine: we already substitute nil with RangeFeedRetryError_REASON_REPLICA_REMOVED, which is 0, so any new code will fall back to it as well and the retry will happen.
That case should be trivial. And since we don't know if it exists in the wild, I'm going to limit testing to a unit test, since I have not been able to trigger such behaviour so far.

@erikgrinaker
Contributor

erikgrinaker commented May 10, 2023

the listener gets an EOF/custom error code instead of nil

Well, currently the mux listener/client doesn't get anything at all. The nil is simply dropped and never propagated to the client. The client thinks the rangefeed is still running.

I think we should stick with a new custom error code here, as handling would be uniform rather than checking the error type in one case and the error code in the other.

There will still be a difference: the non-mux rangefeed will get an io.EOF when the client's gRPC stream is closed. The mux rangefeed will get the new error code when one of the component streams is closed, and io.EOF when the entire mux gRPC stream is closed. I think that's fine (the io.EOF and error-code cases are different and need different handling).
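
To make that distinction concrete, here is a sketch of how the mux client loop could branch on the two signals (the helper and names are illustrative, not the actual DistSender code):

```go
// Illustrative only. recvErr is what Recv() on the mux gRPC stream returned;
// streamErr is a per-range error event carried inside the mux stream.
package muxsketch

import (
	"io"

	"github.com/cockroachdb/cockroach/pkg/kv/kvpb"
	"github.com/cockroachdb/errors"
)

type action int

const (
	keepRunning action = iota
	restartSingleRangefeed
	restartMuxStream
)

func handleMuxTermination(recvErr error, streamErr *kvpb.Error) action {
	switch {
	case errors.Is(recvErr, io.EOF):
		// The entire mux gRPC stream closed: re-establish the whole mux
		// connection (or shut down, if the client asked for it).
		return restartMuxStream
	case streamErr != nil:
		// A single component rangefeed terminated (e.g. with the new retry
		// reason): restart only that range's rangefeed.
		return restartSingleRangefeed
	default:
		return keepRunning
	}
}
```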

We need to think about how it will behave in a mixed-version cluster, though.

Only send the new error code when V23_2 is active, otherwise send the current error code (REASON_REPLICA_REMOVED).
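
A hedged sketch of that version gate (clusterversion.V23_2 and Settings.Version.IsActive exist in the tree; the helper itself and the REASON_RANGEFEED_CLOSED name are assumptions carried over from the sketch above):

```go
// Sketch only: choose the retry reason to send on a clean (nil) processor
// shutdown, depending on whether the 23.2 upgrade has been finalized.
package rangefeedsketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/kv/kvpb"
	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
)

func retryReasonForCleanShutdown(
	ctx context.Context, st *cluster.Settings,
) kvpb.RangeFeedRetryError_Reason {
	if st.Version.IsActive(ctx, clusterversion.V23_2) {
		// All nodes understand the new, explicit reason.
		return kvpb.RangeFeedRetryError_REASON_RANGEFEED_CLOSED
	}
	// Mixed-version cluster: older nodes don't know the new reason, so keep
	// sending the historical substitute.
	return kvpb.RangeFeedRetryError_REASON_REPLICA_REMOVED
}
```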
