Commit
A user observed that after decommissioning a node, connections would still be attempted to the node, seemingly forever. This was easily reproduced (gRPC warnings are logged when our onlyOnceDialer returns `grpcutil.ErrCannotReuseClientConn`).

The root cause is that when the node disappears, many affected range descriptors are not evicted by `DistSender`: when the decommissioned node was not the lease holder, there is no reason to evict. Similarly, if it was, we opportunistically send an RPC to the other (possibly stale, but often not) replicas, which frequently discovers the new lease holder.

This by itself doesn't seem like a problem; the decommissioned replica is carried through and usually placed last (behind the lease holder and any other healthy replica). The only issue was that the transport layer connected to all of the replicas eagerly, which provoked the spew of log messages.

This commit changes the initialization process so that we don't actually GRPCDial replicas until they are needed. The previous call site to GRPCDial and the new one are very close to each other, so nothing is lost. In the common case, the rpc context already has a connection open and no actual work is done anyway.

I diagnosed this using an in-progress decommissioning roachtest (and printf debugging); I hope to be able to prevent regression of this bug in that test when it's done.

Touches #21882.

Release note (bug fix): Avoid connection attempts to former members of the cluster and the associated spurious log messages.
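
For illustration, the move from eager to lazy dialing might look roughly like the sketch below. This is not the code in this commit: `dialer`, `conn`, `lazyTransport`, `countingDialer`, and `echoConn` are hypothetical names invented for the example, and the real transport and rpc context have different interfaces. The point is only that constructing the transport records the replica order without dialing, and a replica is dialed when a request is about to be sent to it.

```go
package main

import (
	"errors"
	"fmt"
)

// conn and dialer are hypothetical stand-ins for a client connection and for
// the dialing part of an rpc context. They are assumptions for this sketch.
type conn interface {
	Send(req string) (string, error)
}

type dialer interface {
	Dial(addr string) (conn, error)
}

// lazyTransport remembers the ordered replica addresses but opens no
// connections when it is constructed.
type lazyTransport struct {
	d     dialer
	addrs []string
	next  int
}

func newLazyTransport(d dialer, addrs []string) *lazyTransport {
	return &lazyTransport{d: d, addrs: addrs}
}

// IsExhausted reports whether every replica has already been tried.
func (t *lazyTransport) IsExhausted() bool { return t.next >= len(t.addrs) }

// SendNext dials the next replica only at the moment it is needed, then
// sends the request to it.
func (t *lazyTransport) SendNext(req string) (string, error) {
	if t.IsExhausted() {
		return "", errors.New("no replicas left to try")
	}
	addr := t.addrs[t.next]
	t.next++
	c, err := t.d.Dial(addr)
	if err != nil {
		return "", err
	}
	return c.Send(req)
}

// countingDialer is a test double that records every address it dials.
type countingDialer struct{ dialed []string }

type echoConn struct{ addr string }

func (c echoConn) Send(req string) (string, error) { return c.addr + ":" + req, nil }

func (d *countingDialer) Dial(addr string) (conn, error) {
	d.dialed = append(d.dialed, addr)
	return echoConn{addr: addr}, nil
}

func main() {
	d := &countingDialer{}
	// The decommissioned node is sorted last, as described above.
	tr := newLazyTransport(d, []string{"leaseholder", "healthy", "decommissioned"})
	resp, err := tr.SendNext("get")
	// Only the first replica has been dialed; the decommissioned one is untouched.
	fmt.Println(resp, err, d.dialed)
}
```

With this shape, a request that succeeds against the first replica never touches the decommissioned node, so no dial attempt (and no warning) is produced for it.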