Skip to content

Commit

Permalink
kv: connect transport lazily
Browse files Browse the repository at this point in the history
A user observed that after decommissioning a node, connections would
still be attempted to the node, seemingly forever. This was easily
reproduced (gRPC warnings are logged when our onlyOnceDialer returns
`grpcutil.ErrCannotReuseClientConn`).

The root cause is that when the node disappears, many affected range
descriptors are not evicted by `DistSender`: when the decommissioned
node was not the lease holder, there will be no reason to evict.
Similarly, if it was, we'll opportunistically send an RPC to the other
possibly stale, but often not) replicas which frequently discovers the
new lease holder.

This doesn't seem like a problem; the replica is carried through and
usually placed last (behind the lease holder and any other healthy
replica) - the only issue was that the transport layer connects to all
of the replicas eagerly, which provokes the spew of log messages.

This commit changes the initialization process so that we don't actually
GRPCDial replicas until they are needed. The previous call site to
GRPCDial and the new one are very close to each other, so nothing is
lost. In the common case, the rpc context already has a connection open
and no actual work is done anyway.

I diagnosed this using an in-progress decommissioning roachtest (and
printf debugging); I hope to be able to prevent regression of this bug
in that test when it's done.

Touches cockroachdb#21882.

Release note (bug fix): Avoid connection attempts to former members of
the cluster and the associated spurious log messages.
  • Loading branch information
tbg committed Mar 8, 2018
1 parent d7f2636 commit d548f4a
Showing 1 changed file with 3 additions and 5 deletions.
8 changes: 3 additions & 5 deletions pkg/kv/transport.go
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,6 @@ type SendOptions struct {

type batchClient struct {
remoteAddr string
conn *rpc.Connection
args roachpb.BatchRequest
healthy bool
pending bool
Expand Down Expand Up @@ -118,15 +117,14 @@ func grpcTransportFactoryImpl(
) (Transport, error) {
clients := make([]batchClient, 0, len(replicas))
for _, replica := range replicas {
conn := rpcContext.GRPCDial(replica.NodeDesc.Address.String())
argsCopy := args
argsCopy.Replica = replica.ReplicaDescriptor
remoteAddr := replica.NodeDesc.Address.String()
healthy := rpcContext.ConnHealth(remoteAddr) == nil
clients = append(clients, batchClient{
remoteAddr: remoteAddr,
conn: conn,
args: argsCopy,
healthy: rpcContext.ConnHealth(remoteAddr) == nil,
healthy: healthy,
})
}

Expand Down Expand Up @@ -244,7 +242,7 @@ func (gt *grpcTransport) send(
}

log.VEventf(ctx, 2, "sending request to %s", client.remoteAddr)
conn, err := client.conn.Connect(ctx)
conn, err := gt.rpcContext.GRPCDial(client.remoteAddr).Connect(ctx)
if err != nil {
return nil, err
}
Expand Down

0 comments on commit d548f4a

Please sign in to comment.