Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlbase: avoid a race in sqlbase.CancelChecker #35548

Merged
merged 1 commit into from
Mar 8, 2019

Conversation

knz
Copy link
Contributor

@knz knz commented Mar 8, 2019

Fixes #35539.

Prior to the distsql-ification of planNode execution, all the
execution steps in a local query were interleaved (using
coroutine-style concurrency), so that the calls to
(*CancelChecker).Check() were all sequential.

With distsql now it's possible for different planNode Next() or
startExec() methods to be running in concurrent goroutines on
multiple cores, i.e. truly parallel.

This in turn requires atomic access to the counter in the cancel
checker. Without atomic access, there is a race condition and the
possibility that the cancel checker does not work well (some
increments can be performed two times, which could cause the condition
of the check to occasionally fail).

Found with SQLSmith.

Release note: None

@knz knz requested review from maddyblue and asubiotto March 8, 2019 16:48
@knz knz requested review from a team March 8, 2019 16:48
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great find. I wonder why callsSinceLastCheck is an int32, looks like we might as well make it a uint32 but not a big deal, I don't think we care about overflow in this case anyway (could maybe use a smaller uint).

Prior to the distsql-ification of planNode execution, all the
execution steps in a local query were interleaved (using
coroutine-style concurrency), so that the calls to
`(*CancelChecker).Check()` were all sequential.

With distsql now it's possible for different planNode `Next()` or
`startExec()` methods to be running in concurrent goroutines on
multiple cores, i.e. truly parallel.

This in turn requires atomic access to the counter in the cancel
checker. Without atomic access, there is a race condition and the
possibility that the cancel checker does not work well (some
increments can be performed two times, which could cause the condition
of the check to occasionally fail).

Found with SQLSmith.

Release note: None
@knz knz force-pushed the 20190308-atomic-ancel branch from fbc9b43 to 1ca3629 Compare March 8, 2019 17:20
@knz
Copy link
Contributor Author

knz commented Mar 8, 2019

I wonder why callsSinceLastCheck is an int32, looks like we might as well make it a uint32

Good idea. Done.

@knz
Copy link
Contributor Author

knz commented Mar 8, 2019

TFYR!

bors r+

craig bot pushed a commit that referenced this pull request Mar 8, 2019
35521: sql: reduce non-determinism in the show_trace logic tests r=knz a=knz

Fixes #33720.
First commit from #35519.

The test is really about asserting the KV operations sent on behalf of
SQL statements. The distsender traffic is really irrelevant. Since
that's where most of the test flakes / non-determinism is coming from,
simply remove it.

Release note: None

35548: sqlbase: avoid a race in sqlbase.CancelChecker r=knz a=knz

Fixes  #35539.

Prior to the distsql-ification of planNode execution, all the
execution steps in a local query were interleaved (using
coroutine-style concurrency), so that the calls to
`(*CancelChecker).Check()` were all sequential.

With distsql now it's possible for different planNode `Next()` or
`startExec()` methods to be running in concurrent goroutines on
multiple cores, i.e. truly parallel.

This in turn requires atomic access to the counter in the cancel
checker. Without atomic access, there is a race condition and the
possibility that the cancel checker does not work well (some
increments can be performed two times, which could cause the condition
of the check to occasionally fail).

Found with SQLSmith.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig
Copy link
Contributor

craig bot commented Mar 8, 2019

Build succeeded

@craig craig bot merged commit 1ca3629 into cockroachdb:master Mar 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants