Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distsql: consult liveness during physical planning #23834

Merged
merged 1 commit into from
Mar 15, 2018

Conversation

tbg
Copy link
Member

@tbg tbg commented Mar 14, 2018

The recent PR #22658 introduced a regression in
(*rpcContext).ConnHealth which caused DistSQL to continue planning on
unavailable nodes for about an hour (ttlNodeDescriptorGossip) if the
leaseholder cache happened to not be updated by other non-DistSQL
requests.

Instead, consult node liveness and avoid planning on dead nodes. This
reduces the problem to a <10s window. The defunct ConnHealth mechanism
still protects against planning in some of cases (supposedly due to a
once-per-second reconnection policy) and is retained for that reason,
with issue #23829 filed to decide its future.

NB: I'm not putting a release note since this was introduced after 1.1.
We released it in a beta, though, so it may be worth calling out there.

Touches #23601. (Not fixing it because this issue should only close
when there's a roachtest).

Release note (bug fix): NB: this fixes a regression introduced in
2.0-beta, and not present in 1.1: Avoid planning DistSQL errors against
unavailable nodes.

@tbg tbg requested review from a team March 14, 2018 15:02
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg requested review from andreimatei and solongordon March 14, 2018 15:03
@tbg
Copy link
Member Author

tbg commented Mar 14, 2018

@solongordon with this change I'm unable to repro any planning errors (sleeping 10s after the kill to make liveness time out). Could you give it a spin too to make sure I'm not messing this up (again)?

@tbg tbg force-pushed the distsql/plan-dead-nodes branch from 5e8a89a to 7129623 Compare March 14, 2018 15:25
@solongordon
Copy link
Contributor

Yup, I'm also unable to repro the error with this patch (except within 10s of killing the node).

@andreimatei
Copy link
Contributor

:lgtm:

Thanks Tobias. You also did DistSender improvements elsewhere, right?


Review status: 0 of 3 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/sql/distsql_physical_planner.go, line 530 at r1 (raw file):

	//
	// TODO(tschottdorf): it's not clear that this adds anything to the liveness
	// check below. The node descriptor TTL is an hour at the time of writing.

nit nit: i prefer "as of 03/2018"


pkg/sql/distsql_physical_planner.go, line 561 at r1 (raw file):

	}

	// NB: not all tests populate a NodeLiveness. Everything using the

I'm not a fan of this comment. Better say on the field definition "can be nil for tests", and remove the panic from the ctor too.


Comments from Reviewable

@tbg tbg force-pushed the distsql/plan-dead-nodes branch from 7129623 to 0c75e1e Compare March 15, 2018 16:00
@tbg
Copy link
Member Author

tbg commented Mar 15, 2018

PTAL, added rudimentary unit testing.


Review status: 0 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


pkg/sql/distsql_physical_planner.go, line 530 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit nit: i prefer "as of 03/2018"

Done.


pkg/sql/distsql_physical_planner.go, line 561 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm not a fan of this comment. Better say on the field definition "can be nil for tests", and remove the panic from the ctor too.

No, that increases the risk that production code passes a nil node liveness as well.


Comments from Reviewable

The recent PR cockroachdb#22658 introduced a regression in
`(*rpcContext).ConnHealth` which caused DistSQL to continue planning on
unavailable nodes for about an hour (`ttlNodeDescriptorGossip`) if the
leaseholder cache happened to not be updated by other non-DistSQL
requests.

Instead, consult node liveness and avoid planning on dead nodes. This
reduces the problem to a <10s window. The defunct `ConnHealth` mechanism
still protects against planning in some of cases (supposedly due to a
once-per-second reconnection policy) and is retained for that reason,
with issue cockroachdb#23829 filed to decide its future.

NB: I'm not putting a release note since this was introduced after 1.1.
We released it in a beta, though, so it may be worth calling out there.

Touches cockroachdb#23601. (Not fixing it because this issue should only close
when there's a roachtest).

Release note (bug fix): NB: this fixes a regression introduced in
2.0-beta, and not present in 1.1: Avoid planning DistSQL errors against
unavailable nodes.
@tbg tbg force-pushed the distsql/plan-dead-nodes branch from 0c75e1e to 52681e4 Compare March 15, 2018 16:08
@tbg tbg changed the title [wip]: distsql: consult liveness during physical planning distsql: consult liveness during physical planning Mar 15, 2018
@andreimatei
Copy link
Contributor

:lgtm: but why no roachtest? Still working on it?


Review status: 0 of 4 files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from Reviewable

@tbg
Copy link
Member Author

tbg commented Mar 15, 2018

No, haven't even started working on it. But it should be straightforward given the shell snippet that we have.

@tbg tbg merged commit 0c03d37 into cockroachdb:master Mar 15, 2018
@tbg
Copy link
Member Author

tbg commented Mar 15, 2018

TFTR!

@tbg tbg deleted the distsql/plan-dead-nodes branch March 15, 2018 18:27
@tbg
Copy link
Member Author

tbg commented Mar 15, 2018

PS DistSender PR is here: #23885

craig bot added a commit that referenced this pull request Apr 6, 2018
23916: backport-2.0: distsql: consult liveness during physical planning r=tschottdorf a=tschottdorf

Backport 1/1 commits from #23834.

/cc @cockroachdb/release

---

The recent PR #22658 introduced a regression in
`(*rpcContext).ConnHealth` which caused DistSQL to continue planning on
unavailable nodes for about an hour (`ttlNodeDescriptorGossip`) if the
leaseholder cache happened to not be updated by other non-DistSQL
requests.

Instead, consult node liveness and avoid planning on dead nodes. This
reduces the problem to a <10s window. The defunct `ConnHealth` mechanism
still protects against planning in some of cases (supposedly due to a
once-per-second reconnection policy) and is retained for that reason,
with issue #23829 filed to decide its future.

NB: I'm not putting a release note since this was introduced after 1.1.
We released it in a beta, though, so it may be worth calling out there.

Touches #23601. (Not fixing it because this issue should only close
when there's a roachtest).

Release note (bug fix): NB: this fixes a regression introduced in
2.0-beta, and not present in 1.1: Avoid planning DistSQL errors against
unavailable nodes.
@tbg tbg restored the distsql/plan-dead-nodes branch April 16, 2018 15:23
@tbg tbg deleted the distsql/plan-dead-nodes branch May 8, 2018 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants