distsql: consult liveness during physical planning #23834

tbg · 2018-03-14T15:02:13Z

The recent PR #22658 introduced a regression in
(*rpcContext).ConnHealth which caused DistSQL to continue planning on
unavailable nodes for about an hour (ttlNodeDescriptorGossip) if the
leaseholder cache happened to not be updated by other non-DistSQL
requests.

Instead, consult node liveness and avoid planning on dead nodes. This
reduces the problem to a <10s window. The defunct ConnHealth mechanism
still protects against planning in some of cases (supposedly due to a
once-per-second reconnection policy) and is retained for that reason,
with issue #23829 filed to decide its future.

NB: I'm not putting a release note since this was introduced after 1.1.
We released it in a beta, though, so it may be worth calling out there.

Touches #23601. (Not fixing it because this issue should only close
when there's a roachtest).

Release note (bug fix): NB: this fixes a regression introduced in
2.0-beta, and not present in 1.1: Avoid planning DistSQL errors against
unavailable nodes.

cockroach-teamcity · 2018-03-14T15:02:22Z

This change is

tbg · 2018-03-14T15:03:46Z

@solongordon with this change I'm unable to repro any planning errors (sleeping 10s after the kill to make liveness time out). Could you give it a spin too to make sure I'm not messing this up (again)?

solongordon · 2018-03-14T16:48:47Z

Yup, I'm also unable to repro the error with this patch (except within 10s of killing the node).

andreimatei · 2018-03-15T14:33:10Z

Thanks Tobias. You also did DistSender improvements elsewhere, right?

Review status: 0 of 3 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.

pkg/sql/distsql_physical_planner.go, line 530 at r1 (raw file):

	//
	// TODO(tschottdorf): it's not clear that this adds anything to the liveness
	// check below. The node descriptor TTL is an hour at the time of writing.

nit nit: i prefer "as of 03/2018"

pkg/sql/distsql_physical_planner.go, line 561 at r1 (raw file):

	}

	// NB: not all tests populate a NodeLiveness. Everything using the

I'm not a fan of this comment. Better say on the field definition "can be nil for tests", and remove the panic from the ctor too.

Comments from Reviewable

tbg · 2018-03-15T16:02:56Z

PTAL, added rudimentary unit testing.

Review status: 0 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.

pkg/sql/distsql_physical_planner.go, line 530 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit nit: i prefer "as of 03/2018"

Done.

pkg/sql/distsql_physical_planner.go, line 561 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm not a fan of this comment. Better say on the field definition "can be nil for tests", and remove the panic from the ctor too.

No, that increases the risk that production code passes a nil node liveness as well.

Comments from Reviewable

The recent PR cockroachdb#22658 introduced a regression in `(*rpcContext).ConnHealth` which caused DistSQL to continue planning on unavailable nodes for about an hour (`ttlNodeDescriptorGossip`) if the leaseholder cache happened to not be updated by other non-DistSQL requests. Instead, consult node liveness and avoid planning on dead nodes. This reduces the problem to a <10s window. The defunct `ConnHealth` mechanism still protects against planning in some of cases (supposedly due to a once-per-second reconnection policy) and is retained for that reason, with issue cockroachdb#23829 filed to decide its future. NB: I'm not putting a release note since this was introduced after 1.1. We released it in a beta, though, so it may be worth calling out there. Touches cockroachdb#23601. (Not fixing it because this issue should only close when there's a roachtest). Release note (bug fix): NB: this fixes a regression introduced in 2.0-beta, and not present in 1.1: Avoid planning DistSQL errors against unavailable nodes.

andreimatei · 2018-03-15T18:12:16Z

but why no roachtest? Still working on it?

Review status: 0 of 4 files reviewed at latest revision, all discussions resolved, all commit checks successful.

Comments from Reviewable

tbg · 2018-03-15T18:26:59Z

No, haven't even started working on it. But it should be straightforward given the shell snippet that we have.

tbg · 2018-03-15T18:27:10Z

TFTR!

tbg · 2018-03-15T18:38:43Z

PS DistSender PR is here: #23885

23916: backport-2.0: distsql: consult liveness during physical planning r=tschottdorf a=tschottdorf Backport 1/1 commits from #23834. /cc @cockroachdb/release --- The recent PR #22658 introduced a regression in `(*rpcContext).ConnHealth` which caused DistSQL to continue planning on unavailable nodes for about an hour (`ttlNodeDescriptorGossip`) if the leaseholder cache happened to not be updated by other non-DistSQL requests. Instead, consult node liveness and avoid planning on dead nodes. This reduces the problem to a <10s window. The defunct `ConnHealth` mechanism still protects against planning in some of cases (supposedly due to a once-per-second reconnection policy) and is retained for that reason, with issue #23829 filed to decide its future. NB: I'm not putting a release note since this was introduced after 1.1. We released it in a beta, though, so it may be worth calling out there. Touches #23601. (Not fixing it because this issue should only close when there's a roachtest). Release note (bug fix): NB: this fixes a regression introduced in 2.0-beta, and not present in 1.1: Avoid planning DistSQL errors against unavailable nodes.

tbg requested review from a team March 14, 2018 15:02

tbg requested review from andreimatei and solongordon March 14, 2018 15:03

tbg force-pushed the distsql/plan-dead-nodes branch from 5e8a89a to 7129623 Compare March 14, 2018 15:25

tbg force-pushed the distsql/plan-dead-nodes branch from 7129623 to 0c75e1e Compare March 15, 2018 16:00

tbg force-pushed the distsql/plan-dead-nodes branch from 0c75e1e to 52681e4 Compare March 15, 2018 16:08

tbg changed the title ~~[wip]: distsql: consult liveness during physical planning~~ distsql: consult liveness during physical planning Mar 15, 2018

tbg merged commit 0c03d37 into cockroachdb:master Mar 15, 2018

tbg deleted the distsql/plan-dead-nodes branch March 15, 2018 18:27

tbg mentioned this pull request Mar 15, 2018

backport-2.0: distsql: consult liveness during physical planning #23916

Merged

tbg restored the distsql/plan-dead-nodes branch April 16, 2018 15:23

tbg deleted the distsql/plan-dead-nodes branch May 8, 2018 15:06

jordanlewis mentioned this pull request Jul 19, 2018

sql: distsql plans against unavailable node #23601

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

distsql: consult liveness during physical planning #23834

distsql: consult liveness during physical planning #23834

tbg commented Mar 14, 2018 •

edited

Loading

cockroach-teamcity commented Mar 14, 2018

tbg commented Mar 14, 2018

solongordon commented Mar 14, 2018

andreimatei commented Mar 15, 2018

tbg commented Mar 15, 2018

andreimatei commented Mar 15, 2018

tbg commented Mar 15, 2018

tbg commented Mar 15, 2018

tbg commented Mar 15, 2018

distsql: consult liveness during physical planning #23834

distsql: consult liveness during physical planning #23834

Conversation

tbg commented Mar 14, 2018 • edited Loading

cockroach-teamcity commented Mar 14, 2018

tbg commented Mar 14, 2018

solongordon commented Mar 14, 2018

andreimatei commented Mar 15, 2018

tbg commented Mar 15, 2018

andreimatei commented Mar 15, 2018

tbg commented Mar 15, 2018

tbg commented Mar 15, 2018

tbg commented Mar 15, 2018

tbg commented Mar 14, 2018 •

edited

Loading