kv: connect transport lazily
A user observed that after decommissioning a node, connections would
still be attempted to the node, seemingly forever. This was easily
reproduced (gRPC warnings are logged when our onlyOnceDialer returns
grpcutil.ErrCannotReuseClientConn).
The root cause is that when the node disappears, many affected range
descriptors are not evicted by DistSender: when the decommissioned node
was not the lease holder, there will be no reason to evict.
Similarly, if it was, we'll opportunistically send an RPC to the other
(possibly stale, but often not) replicas, which frequently discovers the
new lease holder.
This doesn't seem like a problem in itself; the stale replica is carried
through and usually placed last (behind the lease holder and any other
healthy replica). The only issue was that the transport layer connected
to all of the replicas eagerly, which provoked the spew of log messages.
This commit changes the initialization process so that we don't actually
GRPCDial replicas until they are needed. The previous call site to
GRPCDial and the new one are very close to each other, so nothing is
lost. In the common case, the rpc context already has a connection open
and no actual work is done anyway.
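To illustrate the idea, here is a minimal, self-contained sketch of lazy
dialing. It is not the code from this commit: conn, dialFunc,
lazyTransport, newLazyTransport and sendNext are hypothetical names, and
dialFunc merely stands in for a GRPCDial-like call.

```go
package main

import (
	"errors"
	"fmt"
)

// conn stands in for a gRPC client connection (hypothetical type).
type conn struct{ addr string }

// dialFunc stands in for a GRPCDial-like call (hypothetical signature).
type dialFunc func(addr string) (*conn, error)

// lazyTransport remembers the replica addresses in the order they should
// be tried, but dials none of them at construction time.
type lazyTransport struct {
	addrs []string
	dial  dialFunc
	next  int // index of the next replica to try
}

// newLazyTransport only records its arguments; it never calls dial.
func newLazyTransport(addrs []string, dial dialFunc) *lazyTransport {
	return &lazyTransport{addrs: addrs, dial: dial}
}

// sendNext dials the next replica only now, when it is actually needed,
// and returns a placeholder reply in lieu of a real RPC.
func (t *lazyTransport) sendNext() (string, error) {
	if t.next >= len(t.addrs) {
		return "", errors.New("no replicas left to try")
	}
	addr := t.addrs[t.next]
	t.next++
	c, err := t.dial(addr)
	if err != nil {
		return "", err
	}
	return "reply from " + c.addr, nil
}

func main() {
	dialed := 0
	dial := func(addr string) (*conn, error) {
		dialed++
		return &conn{addr: addr}, nil
	}
	// The decommissioned node (n3 here) is ordered last, behind the
	// healthy replicas, so it is normally never reached.
	t := newLazyTransport([]string{"n1", "n2", "n3"}, dial)
	fmt.Println("dials after construction:", dialed)
	reply, err := t.sendNext()
	fmt.Println(reply, err, "dials so far:", dialed)
}
```

In this sketch, a replica that sorts last is dialed only if every replica
ahead of it has already been tried, which is the property that avoids the
repeated connection attempts to the decommissioned node.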
I diagnosed this using an in-progress decommissioning roachtest (and
printf debugging); I hope to be able to prevent regression of this bug
in that test when it's done.
Touches #21882.
Release note (bug fix): Avoid connection attempts to former members of
the cluster and the associated spurious log messages.