backport-2.0: distsql: consult liveness during physical planning #23916

tbg · 2018-03-15T18:28:23Z

Backport 1/1 commits from #23834.

/cc @cockroachdb/release

The recent PR #22658 introduced a regression in
(*rpcContext).ConnHealth which caused DistSQL to continue planning on
unavailable nodes for about an hour (ttlNodeDescriptorGossip) if the
leaseholder cache happened to not be updated by other non-DistSQL
requests.

Instead, consult node liveness and avoid planning on dead nodes. This
reduces the problem to a <10s window. The defunct ConnHealth mechanism
still protects against planning on unavailable nodes in some cases
(supposedly due to a once-per-second reconnection policy) and is
retained for that reason, with issue #23829 filed to decide its future.

NB: I'm not putting a release note since this was introduced after 1.1.
We released it in a beta, though, so it may be worth calling out there.

Touches #23601. (Not closing it, because that issue should only be
closed once there is a roachtest.)

Release note (bug fix): NB: this fixes a regression introduced in
2.0-beta, and not present in 1.1: Avoid DistSQL errors caused by
planning on unavailable nodes.

tbg · 2018-04-05T23:48:59Z

Hmm, I missed that this didn't get reviewed and/or merged. Should definitely get this into 2.0.1. Ping @bdarnell

tbg · 2018-04-05T23:51:00Z

bors try

andreimatei · 2018-04-05T23:54:31Z

Review status: 0 of 4 files reviewed at latest revision, all discussions resolved, all commit checks successful.

Comments from Reviewable

craig · 2018-04-06T00:18:35Z

try

Build succeeded

GitHub CI (Cockroach)

tbg · 2018-04-06T00:24:22Z

Thanks @andreimatei!

bors r+

23916: backport-2.0: distsql: consult liveness during physical planning r=tschottdorf a=tschottdorf

Backport 1/1 commits from #23834.

craig · 2018-04-06T00:50:36Z

Build succeeded

GitHub CI (Cockroach)

tbg requested review from a team March 15, 2018 18:28

tbg requested a review from bdarnell April 5, 2018 23:50

craig bot added a commit that referenced this pull request Apr 5, 2018

Try #23916:

839b9b5

craig bot merged commit 93a7164 into cockroachdb:release-2.0 Apr 6, 2018

tbg deleted the backport2.0-23834 branch May 8, 2018 15:04
