all: fix various badness when resuscitating a node #8163
I don't understand what the first commit is doing. In the third commit: s/following/fouling/

Reviewed 1 of 1 files at r1, 2 of 2 files at r2, 3 of 3 files at r3.

cmd/zerosum/cluster.go, line 182 [r3]:
This looks unrelated, and I can't tell why it's being changed from the commit message.

kv/dist_sender.go, line 205 [r2]:
Extract a local context? Also, this needs a test. Ah, I see you filed an issue.
6f69bd7 to 1aa6ad6
The first commit is fixing a silliness Spencer introduced in

Review status: 2 of 6 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.

cmd/zerosum/cluster.go, line 182 [r3]:
Right, but why does that matter?

Reviewed 4 of 4 files at r4, 3 of 3 files at r5.

It doesn't matter for correctness, just something I noticed. Why should we retrieve the connection when the active queue already has the connection? When the connection closes, the associated streams are closed, which closes the queue, and the next time

Review status: all files reviewed at latest revision, all discussions resolved, some commit checks pending.

Ah, I see. Could you expand the comment?

Review status: all files reviewed at latest revision, all discussions resolved, some commit checks pending.
Previously we were getting the cached grpc connection on every call to RaftSender.SendAsync even though it was only needed when a queue was being created. This was unnecessary waste. In the majority of cases it was retrieving the cached grpc connection and never bothering to use it. Note that when the grpc connection is closed (e.g. because the remote terminates) the associated streams will be closed which will in turn cause the queues to be removed.
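The change described in this commit message can be illustrated with a minimal, standard-library-only sketch. It is not the actual RaftSender code: `fakeConn`, `fakeConnCache`, `sendQueue` and `sketchSender` are hypothetical stand-ins, and the `lookups` counter exists only to show how often the connection cache is consulted.

```go
package main

import "sync"

// fakeConn is a hypothetical stand-in for a cached gRPC client connection.
type fakeConn struct{ addr string }

// fakeConnCache is a hypothetical stand-in for the connection cache.
// lookups counts how many times the cache was consulted.
type fakeConnCache struct {
	mu      sync.Mutex
	conns   map[string]*fakeConn
	lookups int
}

func (c *fakeConnCache) get(addr string) *fakeConn {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lookups++
	if c.conns == nil {
		c.conns = make(map[string]*fakeConn)
	}
	cn, ok := c.conns[addr]
	if !ok {
		cn = &fakeConn{addr: addr}
		c.conns[addr] = cn
	}
	return cn
}

// sendQueue holds the connection it was created with, so later sends do
// not need to go back to the cache.
type sendQueue struct {
	conn *fakeConn
	msgs chan string
}

type sketchSender struct {
	mu     sync.Mutex
	cache  *fakeConnCache
	queues map[string]*sendQueue
}

func newSketchSender(cache *fakeConnCache) *sketchSender {
	return &sketchSender{cache: cache, queues: make(map[string]*sendQueue)}
}

// sendAsync enqueues msg for addr and reports whether it was queued.
// The connection is looked up only when a new queue has to be created.
func (s *sketchSender) sendAsync(addr, msg string) bool {
	s.mu.Lock()
	q, ok := s.queues[addr]
	if !ok {
		q = &sendQueue{conn: s.cache.get(addr), msgs: make(chan string, 16)}
		s.queues[addr] = q
	}
	s.mu.Unlock()
	select {
	case q.msgs <- msg:
		return true
	default:
		return false // queue full; the caller decides what to do
	}
}

// removeQueue models what happens when the connection's streams close:
// the queue goes away and the next send creates a fresh one.
func (s *sketchSender) removeQueue(addr string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.queues, addr)
}
```

In this sketch, repeated calls to `sendAsync` for the same address consult the cache once; a second lookup happens only after `removeQueue` has dropped the queue.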
We want heartbeat RPCs to be fail-fast so that we get notified of transport failures and close the connection. Failure to do this left the gRPC client connections permanently open and trying to reconnect to down nodes, fouling up the circuit breaker expectations in Raft transport.

Changed client.NewSender to create a separate gRPC connection which will not be heartbeat by the rpc.Context. We don't want the heartbeat service and we don't want these connections closed.

Fixes cockroachdb#8130
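The fail-fast heartbeat behaviour can be sketched roughly as follows. This is a standard-library-only illustration, not the rpc.Context implementation: `hbConn`, `hbCache`, `errUnavailable` and the injected `ping` function are all hypothetical, and no real gRPC calls are made. The point it shows is that a heartbeat failure is reported immediately and the connection is closed and evicted, rather than being left open to retry against a down node.

```go
package main

import (
	"errors"
	"sync"
)

// errUnavailable is a hypothetical error standing in for a transport failure.
var errUnavailable = errors.New("transport unavailable")

// hbConn is a hypothetical stand-in for a heartbeated client connection.
type hbConn struct {
	addr   string
	closed bool
}

// hbCache is a hypothetical connection cache keyed by address.
type hbCache struct {
	mu    sync.Mutex
	conns map[string]*hbConn
}

// dial returns the cached connection for addr, creating one if needed.
func (c *hbCache) dial(addr string) *hbConn {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.conns == nil {
		c.conns = make(map[string]*hbConn)
	}
	cn, ok := c.conns[addr]
	if !ok {
		cn = &hbConn{addr: addr}
		c.conns[addr] = cn
	}
	return cn
}

// heartbeat runs a single heartbeat against addr using ping. If ping
// fails, the error is returned right away and the connection is closed
// and evicted, so callers see the failure instead of a connection that
// silently keeps reconnecting.
func (c *hbCache) heartbeat(addr string, ping func(addr string) error) error {
	cn := c.dial(addr)
	if err := ping(addr); err != nil {
		c.mu.Lock()
		cn.closed = true
		delete(c.conns, addr)
		c.mu.Unlock()
		return err
	}
	return nil
}
```

With this shape, code layered on top (such as a circuit breaker) observes a closed connection after a failed heartbeat, and the next `dial` creates a new one.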
1aa6ad6 to a79b1b9
Expanded the first commit message. I also reverted the usage of fail-fast by

Review status: 1 of 5 files reviewed at latest revision, all discussions resolved, some commit checks pending.

Reviewed 1 of 1 files at r1, 5 of 5 files at r6, 2 of 2 files at r7, 2 of 2 files at r8.
I briefly stress tested this to make sure it isn't obviously flaky, and also verified that it fails without cockroachdb#8163. Fixes cockroachdb#8164.
Prior to this PR, zerosum would experience a significant hiccup when restarting a node:
And now: