
rpc: Perform initial-heartbeat validation on GRPC reconnections (take 2) #22658

Merged
merged 2 commits into cockroachdb:master from bdarnell:grpc-2 on Feb 13, 2018

Conversation

bdarnell
Contributor

The only difference from the previous version of this change is that
the new error has been moved from pkg/rpc to pkg/util/grpcutil,
and is included in IsClosedConnection. This suppresses the log message
that was causing intermittent failures in test_missing_log_output.tcl.

Fixes #22536

This reverts commit 9b20d8a.

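For illustration, here is a minimal Go sketch of the classification described above, assuming a sentinel error for the failed initial heartbeat; the package and identifier names (`grpcutilsketch`, `errInitialHeartbeatFailed`, `isClosedConnection`) are placeholders, not the actual pkg/util/grpcutil code:

```go
package grpcutilsketch

import (
	"errors"
	"strings"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errInitialHeartbeatFailed stands in for the error this PR moves into
// pkg/util/grpcutil; the real error value and message may differ.
var errInitialHeartbeatFailed = errors.New("initial connection heartbeat failed")

// isClosedConnection sketches the idea of treating an initial-heartbeat
// failure like any other closed-connection error, so callers (including the
// logging that tripped test_missing_log_output.tcl) do not report it as
// unexpected.
func isClosedConnection(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, errInitialHeartbeatFailed) {
		return true
	}
	if s, ok := status.FromError(err); ok {
		switch s.Code() {
		case codes.Canceled, codes.Unavailable:
			return true
		}
	}
	return strings.Contains(err.Error(), "use of closed network connection")
}
```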
@bdarnell bdarnell requested a review from a team February 13, 2018 19:27
@cockroach-teamcity
Member

This change is Reviewable

@benesch
Contributor

benesch commented Feb 13, 2018

LGTM, but should we also put this back the way it was?

## Re-enable this once issue #22536 is fixed.
# system "for i in `seq 1 5`; do kill -CONT `cat server_pid` 2>/dev/null || exit 0; echo still waiting; sleep 1; done; echo 'server still running?'; exit 0"
# Until #22536 is fixed use a violent kill.
system "for i in `seq 1 5`; do kill -CONT `cat server_pid` 2>/dev/null || exit 0; echo still waiting; sleep 1; done; echo 'server still running?'; kill -9 `cat server_pid`; exit 0"

It was commented out in a61e303.

@bdarnell bdarnell requested a review from a team as a code owner February 13, 2018 23:10
@bdarnell
Contributor Author

Added another commit to fix that.

@bdarnell bdarnell merged commit 0c433fe into cockroachdb:master Feb 13, 2018
@bdarnell bdarnell deleted the grpc-2 branch February 13, 2018 23:37
bdarnell added a commit to bdarnell/cockroach that referenced this pull request Mar 1, 2018
Local connections can fail if the machine's network configuration
changes (for example, if you turn wifi off on a laptop). We disabled
heartbeating of local connections in cockroachdb#16526 to work around a UI issue
(in which the UI would fail to recover after a local connection
failed). This relied on GRPC's internal reconnections, which were
disabled in cockroachdb#22658. This commit re-enables local heartbeats so that
our reconnection logic works as usual (In the meantime, the UI has
gotten better about error recovery, so cockroachdb#16526 is no longer necessary).

Fixes cockroachdb#22951

Release note (bug fix): The admin UI no longer hangs after a node's
network configuration has changed.
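A rough sketch of the behavioral change, with invented names (`shouldRunHeartbeat`, `isLocalConnection`) rather than the actual rpc.Context code:

```go
package rpcsketch

// shouldRunHeartbeat sketches the policy change described above.
func shouldRunHeartbeat(isLocalConnection bool) bool {
	// Old behavior: return !isLocalConnection, i.e. skip heartbeats for
	// connections to the local node and rely on gRPC's internal reconnects.
	// New behavior: heartbeat every connection, because #22658 disabled
	// gRPC's internal reconnection and the heartbeat now drives ours.
	return true
}
```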
bdarnell added a commit to bdarnell/cockroach that referenced this pull request Mar 5, 2018
tbg added a commit to tbg/cockroach that referenced this pull request Mar 14, 2018
WIP because needs a test.

----

The recent PR cockroachdb#22658 introduced a regression in
`(*rpcContext).ConnHealth` which caused DistSQL to continue planning on
unavailable nodes for about an hour (`ttlNodeDescriptorGossip`) if the
leaseholder cache happened to not be updated by other non-DistSQL
requests.

Instead, consult node liveness and avoid planning on dead nodes. This
reduces the problem to a <10s window. The defunct `ConnHealth` mechanism
still protects against planning in some cases (supposedly due to a
once-per-second reconnection policy) and is retained for that reason,
with issue cockroachdb#23829 filed to decide its future.

NB: I'm not putting a release note since this was introduced after 1.1.
We released it in a beta, though, so it may be worth calling out there.

Touches cockroachdb#23601. (Not fixing it because this issue should only close
when there's a roachtest).

Release note (bug fix): NB: this fixes a regression introduced in
2.0-beta, and not present in 1.1: avoid planning DistSQL queries on
unavailable nodes.
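A hedged sketch of the planning-time filter this describes; the interfaces and names below are illustrative stand-ins, not CockroachDB's actual DistSQL planner types:

```go
package distsqlplansketch

// NodeID identifies a node for planning purposes.
type NodeID int

// livenessProvider abstracts "is this node live right now?".
type livenessProvider interface {
	IsLive(NodeID) bool
}

// connHealthChecker abstracts the older ConnHealth-style signal.
type connHealthChecker interface {
	ConnHealth(NodeID) error
}

// planningCandidates keeps only nodes that node liveness considers live;
// ConnHealth is still consulted as a secondary filter, mirroring the decision
// to retain it (see #23829).
func planningCandidates(nodes []NodeID, l livenessProvider, c connHealthChecker) []NodeID {
	var out []NodeID
	for _, n := range nodes {
		if !l.IsLive(n) {
			continue // dead node: don't plan DistSQL flows here
		}
		if err := c.ConnHealth(n); err != nil {
			continue // connection known unhealthy: prefer other nodes
		}
		out = append(out, n)
	}
	return out
}
```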
tbg added a commit to tbg/cockroach that referenced this pull request Mar 15, 2018
tbg added a commit to tbg/cockroach that referenced this pull request Mar 15, 2018
craig bot added a commit that referenced this pull request Apr 6, 2018
23916: backport-2.0: distsql: consult liveness during physical planning r=tschottdorf a=tschottdorf

Backport 1/1 commits from #23834.

/cc @cockroachdb/release

bdarnell added a commit to bdarnell/cockroach that referenced this pull request Apr 16, 2018
Acceptance tests are (infrequently) failing with this message, when it
looks like this test has historically swallowed RPC errors here. I'm
not sure that's completely correct, but this change should restore the
test to its previous level of reliability.

Updates cockroachdb#24138

Release note: None
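A minimal sketch of the kind of matcher update this describes, assuming the acceptance tests swallow transient RPC errors by regexp; the pattern contents and names are illustrative, not the actual testutils code:

```go
package testutilssketch

import "regexp"

// acceptableRPCErrors lists error fragments the acceptance tests tolerate;
// the last alternative stands in for the new initial-heartbeat failure.
var acceptableRPCErrors = regexp.MustCompile(
	`rpc error|connection is unavailable|initial connection heartbeat failed`,
)

// isAcceptableRPCError reports whether an error message should be swallowed
// rather than failing the test.
func isAcceptableRPCError(msg string) bool {
	return acceptableRPCErrors.MatchString(msg)
}
```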
bdarnell added a commit to bdarnell/cockroach that referenced this pull request Apr 16, 2018
When cockroachdb#22658 changed ConnHealth to be pessimistic instead of
optimistic, it meant that distsql could theoretically get stuck in a
state without connections to the necessary nodes (distsql would never
initiate connections on its own; it only attempts to use connections
for which ConnHealth returns true, so it was effectively relying on
raft/kv to initiate these connections).

Now ConnHealth will attempt to start a connection process if none is
in flight to ensure that we will eventually discover the health of any
address we are concerned about.

Updated the ConnHealth docstring to reflect this change and the change
to pessimistic behavior.

Fixes cockroachdb#23829

Release note: None
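A hedged sketch of the ConnHealth behavior described above, with invented types (`healthTracker`, `errNotYetHeartbeated`) standing in for the real pkg/rpc code:

```go
package connhealthsketch

import (
	"errors"
	"sync"
)

// errNotYetHeartbeated stands in for the pessimistic "no successful
// heartbeat yet" error.
var errNotYetHeartbeated = errors.New("connection not yet heartbeated")

// healthTracker is an illustrative stand-in for rpc.Context's per-address
// connection state.
type healthTracker struct {
	mu      sync.Mutex
	healthy map[string]bool   // addresses with a successful heartbeat
	dialing map[string]bool   // addresses with a connection attempt in flight
	dial    func(addr string) // kicks off an asynchronous connection attempt
}

func newHealthTracker(dial func(addr string)) *healthTracker {
	return &healthTracker{
		healthy: make(map[string]bool),
		dialing: make(map[string]bool),
		dial:    dial,
	}
}

// ConnHealth reports health pessimistically, but starts a connection attempt
// if none is in flight so the health of the address is eventually discovered.
func (t *healthTracker) ConnHealth(addr string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.healthy[addr] {
		return nil
	}
	if !t.dialing[addr] {
		t.dialing[addr] = true
		go t.dial(addr) // async; a later successful heartbeat flips healthy
	}
	return errNotYetHeartbeated
}

// onHeartbeatSuccess would be called by the heartbeat loop once a heartbeat
// succeeds.
func (t *healthTracker) onHeartbeatSuccess(addr string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.healthy[addr] = true
	t.dialing[addr] = false
}
```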
craig bot pushed a commit that referenced this pull request Apr 17, 2018
24804: opt: Enhance optsteps command to support exploration rules r=andy-kimball a=andy-kimball

This PR has 2 commits, for easier review:

1. Add AppliedRule event to memo: adds a new callback that the optimizer fires once it's done applying a rule; opsteps uses this to determine which part of the tree is affected by a rule

2. Enhance optsteps command to support exploration rules

24835: testutils: Update RPC error matcher for #22658 r=bdarnell a=bdarnell


Co-authored-by: Andrew Kimball <[email protected]>
Co-authored-by: Ben Darnell <[email protected]>
bdarnell added a commit to bdarnell/cockroach that referenced this pull request Apr 17, 2018
craig bot pushed a commit that referenced this pull request Apr 17, 2018
24845: rpc: ConnHealth now kicks off a connection attempt r=tschottdorf a=bdarnell


24852: ui, sql: fix display of "database dropped" event. r=couchand a=vilterp

The DROP_DATABASE event in `system.eventlog`  was changed from having the key `DroppedTables` to having the key `DroppedTablesAndViews`. The corresponding UI code which displays the dropped tables was not changed, creating bug #18523.

This commit fixes #18523 by renaming the key to `DroppedSchemaObjects`, both in the event and in the UI code. The term "Schema Objects" accounts for the existence of sequences in addition to views and tables.

Q: is it worth adding a migration to change old events from `DroppedTablesAndViews` to `DroppedSchemaObjects`? I'm leaning toward no, since (a) it doesn't look like we added a migration when we renamed `DroppedTables` to `DroppedTablesAndViews`, and (b) without the migration, you'll still see the "drop database" event in the UI, just not the table names (but they'll be in `system.eventlog` under the old key).

Release note (admin ui change): display the names of dropped schema objects on `DROP DATABASE ... CASCADE`
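A hedged sketch of the rename on the event side, with an invented struct name; the actual event-log details type in CockroachDB may differ:

```go
package eventlogsketch

// DropDatabaseDetails sketches the payload of a DROP_DATABASE event.
type DropDatabaseDetails struct {
	DatabaseName string `json:"DatabaseName"`
	// Renamed from DroppedTablesAndViews (and originally DroppedTables):
	// "schema objects" covers tables, views, and sequences alike.
	DroppedSchemaObjects []string `json:"DroppedSchemaObjects"`
}
```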

Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: Pete Vilter <[email protected]>