
kv: connect transport lazily #23521

Merged: 1 commit merged into cockroachdb:master on Mar 8, 2018

Conversation

@tbg tbg (Member) commented Mar 7, 2018

kv: connect transport lazily

A user observed that after decommissioning a node, connections would
still be attempted to the node, seemingly forever. This was easily
reproduced (gRPC warnings are logged when our `onlyOnceDialer` returns
`grpcutil.ErrCannotReuseClientConn`).
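
For illustration only, here is a minimal sketch of the "dial at most once" idea that produces those warnings: a dialer that refuses to re-establish a dropped connection. The names (`onceDialer`, `dialOnce`, `errCannotReuse`) are hypothetical and this uses the current gRPC-Go dial options, not the actual `onlyOnceDialer` code.

```go
package transportsketch

import (
	"context"
	"errors"
	"net"
	"sync"

	"google.golang.org/grpc"
)

// errCannotReuse stands in for grpcutil.ErrCannotReuseClientConn.
var errCannotReuse = errors.New("cannot reuse client connection")

// onceDialer allows exactly one dial attempt; any later reconnect
// attempt by gRPC fails, which is what surfaces as the logged warnings
// when a decommissioned node keeps being retried.
type onceDialer struct {
	mu     sync.Mutex
	dialed bool
}

func (d *onceDialer) dial(ctx context.Context, target string) (net.Conn, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.dialed {
		return nil, errCannotReuse
	}
	d.dialed = true
	var nd net.Dialer
	return nd.DialContext(ctx, "tcp", target)
}

// dialOnce opens a connection whose underlying transport can never be
// re-established once it drops.
func dialOnce(ctx context.Context, target string) (*grpc.ClientConn, error) {
	d := &onceDialer{}
	return grpc.DialContext(ctx, target,
		grpc.WithInsecure(),
		grpc.WithContextDialer(d.dial),
	)
}
```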

The root cause is that when the node disappears, many affected range
descriptors are not evicted by `DistSender`: when the decommissioned
node was not the lease holder, there will be no reason to evict.
Similarly, if it was, we'll opportunistically send an RPC to the other
(possibly stale, but often not) replicas, which frequently discovers
the new lease holder.

This doesn't seem like a problem; the replica is carried through and
usually placed last (behind the lease holder and any other healthy
replica) - the only issue was that the transport layer connects to all
of the replicas eagerly, which provokes the spew of log messages.
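
As a rough sketch of that ordering (with hypothetical types, not the actual transport code), the lease holder sorts first, healthy replicas next, and anything unhealthy sorts to the back, so under a lazy transport the decommissioned node is usually never contacted:

```go
package transportsketch

import "sort"

// replicaInfo is a hypothetical stand-in for the per-replica metadata
// the transport carries.
type replicaInfo struct {
	addr          string
	isLeaseHolder bool
	isHealthy     bool
}

// orderReplicas puts the lease holder first, healthy replicas next, and
// anything unhealthy (e.g. a decommissioned node) last.
func orderReplicas(replicas []replicaInfo) {
	rank := func(r replicaInfo) int {
		switch {
		case r.isLeaseHolder:
			return 0
		case r.isHealthy:
			return 1
		default:
			return 2
		}
	}
	sort.SliceStable(replicas, func(i, j int) bool {
		return rank(replicas[i]) < rank(replicas[j])
	})
}
```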

This commit changes the initialization process so that we don't actually
GRPCDial replicas until they are needed. The previous call site to
GRPCDial and the new one are very close to each other, so nothing is
lost. In the common case, the rpc context already has a connection open
and no actual work is done anyway.
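
A minimal sketch of the lazy-dial idea (hypothetical names such as `replicaSlot`, `lazyTransport`, and `dialFn`; not the actual DistSender transport): the connection for a replica is only dialed at the moment that replica is about to receive an RPC, so replicas that are never tried are never connected to at all.

```go
package transportsketch

import (
	"context"

	"google.golang.org/grpc"
)

// replicaSlot pairs a replica's address with its connection, which
// stays nil until the replica is actually used.
type replicaSlot struct {
	addr string
	conn *grpc.ClientConn
}

// lazyTransport dials through dialFn only when a replica is tried, so
// a decommissioned node ordered last is usually never dialed.
type lazyTransport struct {
	dialFn   func(ctx context.Context, addr string) (*grpc.ClientConn, error)
	replicas []replicaSlot
	next     int
}

// sendNext sends to the next replica in order, dialing it on first use.
// In the common case the dial function finds a cached connection and
// does no real work. Callers are assumed to check for exhaustion before
// calling.
func (t *lazyTransport) sendNext(ctx context.Context, send func(*grpc.ClientConn) error) error {
	r := &t.replicas[t.next]
	t.next++
	if r.conn == nil {
		conn, err := t.dialFn(ctx, r.addr)
		if err != nil {
			return err
		}
		r.conn = conn
	}
	return send(r.conn)
}
```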

I diagnosed this using an in-progress decommissioning roachtest (and
printf debugging); I hope to be able to prevent regression of this bug
in that test when it's done.

Touches #21882.

Release note (bug fix): Avoid connection attempts to former members of
the cluster and the associated spurious log messages.

@tbg tbg requested a review from a team March 7, 2018 07:14

@tbg tbg requested a review from bdarnell March 7, 2018 07:16
@bdarnell bdarnell (Contributor) commented Mar 7, 2018

:lgtm:

I agree that we should probably make it lazy in all cases at this layer. It might make sense to optionally open connections to all healthy nodes eagerly (an analogue to #23510), but for that it would be better to make it truly eager (at startup) instead of here where it is shortly before the point of use.


@tbg tbg (Member, Author) commented Mar 8, 2018

TFTR! Done, was a trivial change. The proactive connecting stuff could make sense to reduce tail latencies during rolling upgrades, but I don't think it's worth doing just yet.


@tbg tbg force-pushed the fix/connections-decommissioned branch from 8c0f6e2 to d548f4a on March 8, 2018 19:08
@tbg tbg changed the title kv: connect lazily to non-live nodes kv: connect transport lazily Mar 8, 2018
@tbg tbg merged commit 2115e6d into cockroachdb:master Mar 8, 2018
@tbg tbg deleted the fix/connections-decommissioned branch March 8, 2018 20:06
@tbg tbg restored the fix/connections-decommissioned branch April 16, 2018 15:24
@tbg tbg deleted the fix/connections-decommissioned branch May 8, 2018 15:07