
kv: connect transport lazily #23521

Merged: 1 commit merged into cockroachdb:master on Mar 8, 2018

Conversation

@tbg tbg (Member) commented Mar 7, 2018

kv: connect transport lazily

A user observed that after decommissioning a node, connections would
still be attempted to the node, seemingly forever. This was easily
reproduced (gRPC warnings are logged when our `onlyOnceDialer` returns
`grpcutil.ErrCannotReuseClientConn`).
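
For illustration only, here is a minimal sketch of the "dial at most once" idea that produces those warnings: a dialer that refuses to re-establish a dropped connection. The names (`onceDialer`, `dialOnce`, `errCannotReuse`) are hypothetical and this uses the current gRPC-Go dial options, not the actual `onlyOnceDialer` code.

```go
package transportsketch

import (
	"context"
	"errors"
	"net"
	"sync"

	"google.golang.org/grpc"
)

// errCannotReuse stands in for grpcutil.ErrCannotReuseClientConn.
var errCannotReuse = errors.New("cannot reuse client connection")

// onceDialer allows exactly one dial attempt; any later reconnect
// attempt by gRPC fails, which is what surfaces as the logged warnings
// when a decommissioned node keeps being retried.
type onceDialer struct {
	mu     sync.Mutex
	dialed bool
}

func (d *onceDialer) dial(ctx context.Context, target string) (net.Conn, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.dialed {
		return nil, errCannotReuse
	}
	d.dialed = true
	var nd net.Dialer
	return nd.DialContext(ctx, "tcp", target)
}

// dialOnce opens a connection whose underlying transport can never be
// re-established once it drops.
func dialOnce(ctx context.Context, target string) (*grpc.ClientConn, error) {
	d := &onceDialer{}
	return grpc.DialContext(ctx, target,
		grpc.WithInsecure(),
		grpc.WithContextDialer(d.dial),
	)
}
```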

The root cause is that when the node disappears, many affected range
descriptors are not evicted by `DistSender`: when the decommissioned
node was not the lease holder, there will be no reason to evict.
Similarly, if it was, we'll opportunistically send an RPC to the other
(possibly stale, but often not) replicas, which frequently discovers
the new lease holder.

This doesn't seem like a problem; the replica is carried through and
usually placed last (behind the lease holder and any other healthy
replica) - the only issue was that the transport layer connects to all
of the replicas eagerly, which provokes the spew of log messages.
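
As a rough sketch of that ordering (with hypothetical types, not the actual transport code), the lease holder sorts first, healthy replicas next, and anything unhealthy sorts to the back, so under a lazy transport the decommissioned node is usually never contacted:

```go
package transportsketch

import "sort"

// replicaInfo is a hypothetical stand-in for the per-replica metadata
// the transport carries.
type replicaInfo struct {
	addr          string
	isLeaseHolder bool
	isHealthy     bool
}

// orderReplicas puts the lease holder first, healthy replicas next, and
// anything unhealthy (e.g. a decommissioned node) last.
func orderReplicas(replicas []replicaInfo) {
	rank := func(r replicaInfo) int {
		switch {
		case r.isLeaseHolder:
			return 0
		case r.isHealthy:
			return 1
		default:
			return 2
		}
	}
	sort.SliceStable(replicas, func(i, j int) bool {
		return rank(replicas[i]) < rank(replicas[j])
	})
}
```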

This commit changes the initialization process so that we don't actually
GRPCDial replicas until they are needed. The previous call site to
GRPCDial and the new one are very close to each other, so nothing is
lost. In the common case, the rpc context already has a connection open
and no actual work is done anyway.
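
A minimal sketch of the lazy-dial idea (hypothetical names such as `replicaSlot`, `lazyTransport`, and `dialFn`; not the actual DistSender transport): the connection for a replica is only dialed at the moment that replica is about to receive an RPC, so replicas that are never tried are never connected to at all.

```go
package transportsketch

import (
	"context"

	"google.golang.org/grpc"
)

// replicaSlot pairs a replica's address with its connection, which
// stays nil until the replica is actually used.
type replicaSlot struct {
	addr string
	conn *grpc.ClientConn
}

// lazyTransport dials through dialFn only when a replica is tried, so
// a decommissioned node ordered last is usually never dialed.
type lazyTransport struct {
	dialFn   func(ctx context.Context, addr string) (*grpc.ClientConn, error)
	replicas []replicaSlot
	next     int
}

// sendNext sends to the next replica in order, dialing it on first use.
// In the common case the dial function finds a cached connection and
// does no real work. Callers are assumed to check for exhaustion before
// calling.
func (t *lazyTransport) sendNext(ctx context.Context, send func(*grpc.ClientConn) error) error {
	r := &t.replicas[t.next]
	t.next++
	if r.conn == nil {
		conn, err := t.dialFn(ctx, r.addr)
		if err != nil {
			return err
		}
		r.conn = conn
	}
	return send(r.conn)
}
```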

I diagnosed this using an in-progress decommissioning roachtest (and
printf debugging); I hope to be able to prevent regression of this bug
in that test when it's done.

Touches #21882.

Release note (bug fix): Avoid connection attempts to former members of
the cluster and the associated spurious log messages.

@tbg tbg requested a review from a team March 7, 2018 07:14

@tbg tbg requested a review from bdarnell March 7, 2018 07:16
@bdarnell bdarnell (Contributor) commented Mar 7, 2018

:lgtm:

I agree that we should probably make it lazy in all cases at this layer. It might make sense to optionally open connections to all healthy nodes eagerly (an analogue to #23510), but for that it would be better to make it truly eager (at startup) instead of here where it is shortly before the point of use.


@tbg tbg (Member, Author) commented Mar 8, 2018

TFTR! Done, was a trivial change. The proactive connecting stuff could make sense to reduce tail latencies during rolling upgrades, but I don't think it's worth doing just yet.


@tbg tbg force-pushed the fix/connections-decommissioned branch from 8c0f6e2 to d548f4a on March 8, 2018 19:08
@tbg tbg changed the title kv: connect lazily to non-live nodes kv: connect transport lazily Mar 8, 2018
@tbg tbg merged commit 2115e6d into cockroachdb:master Mar 8, 2018
@tbg tbg deleted the fix/connections-decommissioned branch March 8, 2018 20:06
@tbg tbg restored the fix/connections-decommissioned branch April 16, 2018 15:24
@tbg tbg deleted the fix/connections-decommissioned branch May 8, 2018 15:07