rpc: Perform initial-heartbeat validation on GRPC reconnections (take 2) #22658

bdarnell · 2018-02-13T19:27:49Z

The only difference from the previous version of this change is that
the new error has been moved from pkg/rpc to pkg/util/grpcutil,
and is included in IsClosedConnection. This suppresses the log message
that was causing intermittent failures in test_missing_log_output.tcl

Fixes #22536

This reverts commit 9b20d8a.

The only difference from the previous version of this change is that the new error has been moved from pkg/rpc to pkg/util/grpcutil, and is included in IsClosedConnection. This suppresses the log message that was causing intermittent failures in test_missing_log_output.tcl Fixes cockroachdb#22536 This reverts commit 9b20d8a.

cockroach-teamcity · 2018-02-13T19:27:55Z

This change is

benesch · 2018-02-13T21:19:45Z

LGTM, but should we also put this back the way it was?

cockroach/pkg/cli/interactive_tests/common.tcl

Lines 106 to 109 in 5e18f7a

    
           ## Re-enable this once issue #22536 is fixed. 
        
           # system "for i in `seq 1 5`; do kill -CONT `cat server_pid` 2>/dev/null || exit 0; echo still waiting; sleep 1; done; echo 'server still running?'; exit 0" 
        
           # Until #22536 is fixed use a violent kill. 
        
           system "for i in `seq 1 5`; do kill -CONT `cat server_pid` 2>/dev/null || exit 0; echo still waiting; sleep 1; done; echo 'server still running?'; kill -9 `cat server_pid`; exit 0"

It was commented out in a61e303.

Release note: None

bdarnell · 2018-02-13T23:11:12Z

Added another commit to fix that.

Local connections can fail if the machine's network configuration changes (for example, if you turn wifi off on a laptop). We disabled heartbeating of local connections in cockroachdb#16526 to work around a UI issue (in which the UI would fail to recover after a local connection failed). This relied on GRPC's internal reconnections, which were disabled in cockroachdb#22658. This commit re-enables local heartbeats so that our reconnection logic works as usual (In the meantime, the UI has gotten better about error recovery, so cockroachdb#16526 is no longer necessary). Fixes cockroachdb#22951 Release note (bug fix): The admin UI no longer hangs after a node's network configuration has changed.

WIP because needs a test. ---- The recent PR cockroachdb#22658 introduced a regression in `(*rpcContext).ConnHealth` which caused DistSQL to continue planning on unavailable nodes for about an hour (`ttlNodeDescriptorGossip`) if the leaseholder cache happened to not be updated by other non-DistSQL requests. Instead, consult node liveness and avoid planning on dead nodes. This reduces the problem to a <10s window. The defunct `ConnHealth` mechanism still protects against planning in some of cases (supposedly due to a once-per-second reconnection policy) and is retained for that reason, with issue cockroachdb#23829 filed to decide its future. NB: I'm not putting a release note since this was introduced after 1.1. We released it in a beta, though, so it may be worth calling out there. Touches cockroachdb#23601. (Not fixing it because this issue should only close when there's a roachtest). Release note (bug fix): NB: this fixes a regression introduced in 2.0-beta, and not present in 1.1: Avoid planning DistSQL errors against unavailable nodes.

The recent PR cockroachdb#22658 introduced a regression in `(*rpcContext).ConnHealth` which caused DistSQL to continue planning on unavailable nodes for about an hour (`ttlNodeDescriptorGossip`) if the leaseholder cache happened to not be updated by other non-DistSQL requests. Instead, consult node liveness and avoid planning on dead nodes. This reduces the problem to a <10s window. The defunct `ConnHealth` mechanism still protects against planning in some of cases (supposedly due to a once-per-second reconnection policy) and is retained for that reason, with issue cockroachdb#23829 filed to decide its future. NB: I'm not putting a release note since this was introduced after 1.1. We released it in a beta, though, so it may be worth calling out there. Touches cockroachdb#23601. (Not fixing it because this issue should only close when there's a roachtest). Release note (bug fix): NB: this fixes a regression introduced in 2.0-beta, and not present in 1.1: Avoid planning DistSQL errors against unavailable nodes.

23916: backport-2.0: distsql: consult liveness during physical planning r=tschottdorf a=tschottdorf Backport 1/1 commits from #23834. /cc @cockroachdb/release --- The recent PR #22658 introduced a regression in `(*rpcContext).ConnHealth` which caused DistSQL to continue planning on unavailable nodes for about an hour (`ttlNodeDescriptorGossip`) if the leaseholder cache happened to not be updated by other non-DistSQL requests. Instead, consult node liveness and avoid planning on dead nodes. This reduces the problem to a <10s window. The defunct `ConnHealth` mechanism still protects against planning in some of cases (supposedly due to a once-per-second reconnection policy) and is retained for that reason, with issue #23829 filed to decide its future. NB: I'm not putting a release note since this was introduced after 1.1. We released it in a beta, though, so it may be worth calling out there. Touches #23601. (Not fixing it because this issue should only close when there's a roachtest). Release note (bug fix): NB: this fixes a regression introduced in 2.0-beta, and not present in 1.1: Avoid planning DistSQL errors against unavailable nodes.

Acceptance tests are (infrequently) failing with this message, when it looks like this test has historically swallowed RPC errors here. I'm not sure that's completely correct, but this change should restore the test to its previous level of reliability. Updates cockroachdb#24138 Release note: None

When cockroachdb#22658 changed ConnHealth to be pessimistic instead of optimistic, it meant that distsql could theoretically get stuck in a state without connections to the necessary nodes (distsql would never initiate connections on its own; it only attempts to use connections for which ConnHealth returns true so it was effectively relying on raft/kv to initiate these connections). Now ConnHealth will attempt to start a connection process if none is in flight to ensure that we will eventually discover the health of any address we are concerned about. Updated the ConnHealth docstring to reflect this change and the change to pessimistic behavior. Fixes cockroachdb#23829 Release note: None

24804: opt: Enhance optsteps command to support exploration rules r=andy-kimball a=andy-kimball This PR has 2 commits, for easier review: 1. Add AppliedRule event to memo: adds a new callback that the optimizer fires once it's done applying a rule; opsteps uses this to determine which part of the tree is affected by a rule 2. Enhance optsteps command to support exploration rules 24835: testutils: Update RPC error matcher for #22658 r=bdarnell a=bdarnell Acceptance tests are (infrequently) failing with this message, when it looks like this test has historically swallowed RPC errors here. I'm not sure that's completely correct, but this change should restore the test to its previous level of reliability. Release note: None Co-authored-by: Andrew Kimball <[email protected]> Co-authored-by: Ben Darnell <[email protected]>

When cockroachdb#22658 changed ConnHealth to be pessimistic instead of optimistic, it meant that distsql could theoretically get stuck in a state without connections to the necessary nodes (distsql would never initiate connections on its own; it only attempts to use connections for which ConnHealth returns true so it was effectively relying on raft/kv to initiate these connections). Now ConnHealth will attempt to start a connection process if none is in flight to ensure that we will eventually discover the health of any address we are concerned about. Updated the ConnHealth docstring to reflect this change and the change to pessimistic behavior. Fixes cockroachdb#23829 Release note: None

24845: rpc: ConnHealth now kicks off a connection attempt r=tschottdorf a=bdarnell When #22658 changed ConnHealth to be pessimistic instead of optimistic, it meant that distsql could theoretically get stuck in a state without connections to the necessary nodes (distsql would never initiate connections on its own; it only attempts to use connections for which ConnHealth returns true so it was effectively relying on raft/kv to initiate these connections). Now ConnHealth will attempt to start a connection process if none is in flight to ensure that we will eventually discover the health of any address we are concerned about. Updated the ConnHealth docstring to reflect this change and the change to pessimistic behavior. Fixes #23829 Release note: None 24852: ui, sql: fix display of "database dropped" event. r=couchand a=vilterp The DROP_DATABASE event in `system.eventlog` was changed from having the key `DroppedTables` to having the key `DroppedTablesAndViews`. The corresponding UI code which displays the dropped tables was not changed, creating bug #18523. This commit fixes #18523 by renaming the key to `DroppedSchemaObjects`, both in the event and in the UI code. The term "Schema Objects" accounts for the existence of sequences in addition to views and tables. Q: is it worth adding a migration to change old events from `DroppedTablesAndViews` to `DroppedSchemaObjects`? I'm leaning toward no, since (a) it doesn't look like we added a migration when we renamed `DroppedTables` to `DroppedTablesAndViews`, and (b) without the migration, you'll still see the "drop database" event in the UI, just not the table names (but they'll be in `system.eventlog` under the old key). Release note (admin ui change): display the names of dropped schema objects on `DROP DATABASE ... CASCADE` Co-authored-by: Ben Darnell <[email protected]> Co-authored-by: Pete Vilter <[email protected]>

bdarnell requested a review from a team February 13, 2018 19:27

cli/interactive_tests: Revert test change from cockroachdb#22568

8e0a505

Release note: None

bdarnell requested a review from a team as a code owner February 13, 2018 23:10

benesch approved these changes Feb 13, 2018

View reviewed changes

bdarnell merged commit 0c433fe into cockroachdb:master Feb 13, 2018

bdarnell deleted the grpc-2 branch February 13, 2018 23:37

solongordon mentioned this pull request Feb 22, 2018

core: endless "cannot reuse client connection" warnings in log #22951

Closed

bdarnell mentioned this pull request Mar 1, 2018

rpc: Heartbeat local connections too #23298

Merged

bdarnell mentioned this pull request Mar 3, 2018

cherrypick-2.0: rpc: Heartbeat local connections too #23348

Merged

solongordon mentioned this pull request Mar 8, 2018

sql: distsql plans against unavailable node #23601

Closed

tbg mentioned this pull request Mar 14, 2018

rpc: evaluate what to do with ConnHealth #23829

Closed

tbg mentioned this pull request Mar 14, 2018

distsql: consult liveness during physical planning #23834

Merged

tbg mentioned this pull request Mar 15, 2018

backport-2.0: distsql: consult liveness during physical planning #23916

Merged

bdarnell mentioned this pull request Apr 16, 2018

acceptance: Unexpected RPC errors at SQL level #24138

Closed

bdarnell mentioned this pull request Apr 16, 2018

rpc: ConnHealth now kicks off a connection attempt #24845

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rpc: Perform initial-heartbeat validation on GRPC reconnections (take 2) #22658

rpc: Perform initial-heartbeat validation on GRPC reconnections (take 2) #22658

bdarnell commented Feb 13, 2018

cockroach-teamcity commented Feb 13, 2018

benesch commented Feb 13, 2018

bdarnell commented Feb 13, 2018

rpc: Perform initial-heartbeat validation on GRPC reconnections (take 2) #22658

rpc: Perform initial-heartbeat validation on GRPC reconnections (take 2) #22658

Conversation

bdarnell commented Feb 13, 2018

cockroach-teamcity commented Feb 13, 2018

benesch commented Feb 13, 2018

bdarnell commented Feb 13, 2018