Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: concurrent ADD REGION, DROP REGION, (CREATE|ALTER) REGIONAL BY ROW combinations do not mix #63462

Closed
otan opened this issue Apr 12, 2021 · 3 comments · Fixed by #64117
Closed
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker

Comments

@otan
Copy link
Contributor

otan commented Apr 12, 2021

Assume that a DROP REGION job fails. If we do not include regions that are being dropped in the SynthesizeRegionConfig during this state, a number of issues arise on REGIONAL BY ROW tables in their given callsites, such as:

  • New REGIONAL BY ROW tables created will be missing the dropped region partition.
  • alter to/from REGIONAL BY ROW will be missing partitions
  • an active CREATE INDEX job on RBR tables may miss the dropped partition
  • an active ALTER PRIMARY KEY job on may miss the dropped partition
  • an active CREATE UNIQUE CONSTRAINT job will miss the dropped partition
  • Partitioning validation is broken during a concurrent ADD/DROP REGION, as per discovery in sql: add validation for partitions matching REGIONAL BY ROW #63459. This is because an add will repartition every table, but will discard the dropped region.

I've tried taking the tack of including DROPPING regions for these operations here in the latter two commits, but unfortunately we hit the error that the enum is not PUBLIC on the repartition case.

We should probably block any REGIONAL BY ROW creating / altering behaviour if there is an ADD/DROP REGION underway (i.e. creating an index, unique constraint, create table ... regional by row, some already handled by #63368) and vice versa - this is the quickest and easiest way to unblock the release. Furthermore, it is probably worth prohibiting concurrent ADD/DROP regions as well - again to unblock the release.

@otan otan added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 12, 2021
@otan otan changed the title sql: region configs should include regions being DROPPED by default sql: concurrent ADD/DROP REGIONS and REGIONAL BY ROW tables do not mix Apr 12, 2021
@otan otan changed the title sql: concurrent ADD/DROP REGIONS and REGIONAL BY ROW tables do not mix sql: concurrent ADD REGION, DROP REGION, (CREATE|ALTER) REGIONAL BY ROW combinations do not mix Apr 12, 2021
@otan
Copy link
Contributor Author

otan commented Apr 12, 2021

cc @arulajmani happy to do this the CREATE INDEX, etc. stuff if you prefer, i'd ask you to do the ADD/DROP concurrency bit because i prefer you to rewrite that test, hehe

@arulajmani
Copy link
Collaborator

I think all but the last concern can be accounted for by re-partitioning all Regional By Row Tables even if the Type Schema Change job is rolled back. Concretely, consider the scenario:

  • Op 1: Region A is being dropped, but will fail validation in the type schema changer, therefore requiring a rollback.
  • Op 2: A RBR table t is being created.

Let's assume these operations are run concurrently. If Op2 is serialized before Op1, then Region A should still be PUBLIC. As such, we should be fine.
The problem, as you pointed out, arises if Op1 is serialized before Op2 and the type schema change job runs and rolls back after Op2. In this case, table t was created without a partition for Region A as it was moved to READ ONLY state by Op1. If validation fails and the operation is rolled back, Region A will be promoted back to PUBLIC. The right thing here would be to re-synthesize a region config here and repartition all RBR tables from this new set of regions. This would mean that t would get a partition for Region A.

You can extend this CREATE RBR scenario to all the other ones enumerated in the issue -- the fundamental problem is the same.

I'm still munching on the last point that pertains to validation.

@arulajmani
Copy link
Collaborator

Partitioning validation -- I think we're being too strict here, trying to map out all possible scenarios when enum values are transitioning (READ_ONLY with either DIRECTION_ADD or DIRECTION_REMOVE. I've got a long form explanation on the draft PR #63459 (review), but I think the right answer here is to relax this validation a bit.

arulajmani added a commit to arulajmani/cockroach that referenced this issue Apr 13, 2021
When altering to a regional by row table we only add partitions for
PUBLIC enum values (regions). The type schema change job that
negotiates promotion/demotion is responsible for re-partitioning
regional by row tables. Previously, this re-partitioning only happened
if the type schema change job succeeded.

This behavior missed an edge case, which looks like:
- Client 1 drops region A from the database. The user txn commits.
- Client 2 alters table t to a RBR.
- The drop region fails, and therefore rolls back.
- table t is now missing a partition for region A.

This patch fixes this problem by trigerring a re-partition even when
the type schema change job is rolling back.

Informs cockroachdb#63462

Release note (bug fix): Previously if a user altered a table to
REGIONAL BY ROW when a region was being dropped, and the drop failed
and had to be rolled back, it could have resulted in the Regional By
Row table missing a partition for this region. This is now fixed and
region drop failure rollbacks are sane.
arulajmani added a commit to arulajmani/cockroach that referenced this issue Apr 13, 2021
When altering to a regional by row table we only add partitions for
PUBLIC enum values (regions). The type schema change job that
negotiates promotion/demotion is responsible for re-partitioning
regional by row tables. Previously, this re-partitioning only happened
if the type schema change job succeeded.

This behavior missed an edge case, which looks like:
- Client 1 drops region A from the database. The user txn commits.
- Client 2 alters table t to a RBR.
- The drop region fails, and therefore rolls back.
- table t is now missing a partition for region A.

This patch fixes this problem by trigerring a re-partition even when
the type schema change job is rolling back.

Informs cockroachdb#63462

Release note (bug fix): Previously if a user altered a table to
REGIONAL BY ROW when a region was being dropped, and the drop failed
and had to be rolled back, it could have resulted in the Regional By
Row table missing a partition for this region. This is now fixed and
region drop failure rollbacks are sane.
craig bot pushed a commit that referenced this issue Apr 13, 2021
62683: optgen: add multi-variable bind expression r=mgartner a=mgartner

This commit adds a new optgen expression syntax for binding multiple
variables to the return values of a custom function. See the updated
documentation in this commit for more details.

Release note: None


63502: sql: remove transitioning regions from zone config validation r=arulajmani,ajstorm a=otan

See individual commits for details.

Refs: #63462 

63544: sqlsmith: fix tests under bazel r=RaduBerinde a=otan

It was missing the GEOS data dependency.

Release note: None

63566: Added COPY FROM diagram to docgen r=ericharmeling a=ericharmeling

Release justification: non-production code changes

Release note: None

Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
arulajmani added a commit to arulajmani/cockroach that referenced this issue Apr 15, 2021
When altering to a regional by row table we only add partitions for
PUBLIC enum values (regions). The type schema change job that
negotiates promotion/demotion is responsible for re-partitioning
regional by row tables. Previously, this re-partitioning only happened
if the type schema change job succeeded.

This behavior missed an edge case, which looks like:
- Client 1 drops region A from the database. The user txn commits.
- Client 2 alters table t to a RBR.
- The drop region fails, and therefore rolls back.
- table t is now missing a partition for region A.

This patch fixes this problem by trigerring a re-partition even when
the type schema change job is rolling back.

Informs cockroachdb#63462

Release note (bug fix): Previously if a user altered a table to
REGIONAL BY ROW when a region was being dropped, and the drop failed
and had to be rolled back, it could have resulted in the Regional By
Row table missing a partition for this region. This is now fixed and
region drop failure rollbacks are sane.
arulajmani added a commit to arulajmani/cockroach that referenced this issue Apr 15, 2021
When altering to a regional by row table we only add partitions for
PUBLIC enum values (regions). The type schema change job that
negotiates promotion/demotion is responsible for re-partitioning
regional by row tables. Previously, this re-partitioning only happened
if the type schema change job succeeded.

This behavior missed an edge case, which looks like:
- Client 1 drops region A from the database. The user txn commits.
- Client 2 alters table t to a RBR.
- The drop region fails, and therefore rolls back.
- table t is now missing a partition for region A.

This patch fixes this problem by trigerring a re-partition even when
the type schema change job is rolling back.

Informs cockroachdb#63462

Release note (bug fix): Previously if a user altered a table to
REGIONAL BY ROW when a region was being dropped, and the drop failed
and had to be rolled back, it could have resulted in the Regional By
Row table missing a partition for this region. This is now fixed and
region drop failure rollbacks are sane.
arulajmani added a commit to arulajmani/cockroach that referenced this issue Apr 16, 2021
When altering to a regional by row table we only add partitions for
PUBLIC enum values (regions). The type schema change job that
negotiates promotion/demotion is responsible for re-partitioning
regional by row tables. Previously, this re-partitioning only happened
if the type schema change job succeeded.

This behavior missed an edge case, which looks like:
- Client 1 drops region A from the database. The user txn commits.
- Client 2 alters table t to a RBR.
- The drop region fails, and therefore rolls back.
- table t is now missing a partition for region A.

This patch fixes this problem by trigerring a re-partition even when
the type schema change job is rolling back.

Informs cockroachdb#63462

Release note (bug fix): Previously if a user altered a table to
REGIONAL BY ROW when a region was being dropped, and the drop failed
and had to be rolled back, it could have resulted in the Regional By
Row table missing a partition for this region. This is now fixed and
region drop failure rollbacks are sane.
arulajmani added a commit to arulajmani/cockroach that referenced this issue Apr 16, 2021
Informs cockroachdb#63462

This patch adds TestRegionAddDropEnclosingRegionalByRowOps, which
contains subtests with the following sketch:

- Client 1 performs an ALTER ADD / DROP REGION. Let the user txn commit.
- Block in the type schema changer.
- Client 2 performs an operation on a REGIONAL BY ROW table, such as creating
 an index, unique constraint, altering the primary key etc.
- Test both a rollback in the type schema changer and a successful execution.
- Ensure the partitions on the REGIONAL BY ROW table are sane.

Release note: None
arulajmani added a commit to arulajmani/cockroach that referenced this issue Apr 19, 2021
Informs cockroachdb#63462

This patch adds TestRegionAddDropEnclosingRegionalByRowOps, which
contains subtests with the following sketch:

- Client 1 performs an ALTER ADD / DROP REGION. Let the user txn commit.
- Block in the type schema changer.
- Client 2 performs an operation on a REGIONAL BY ROW table, such as creating
 an index, unique constraint, altering the primary key etc.
- Test both a rollback in the type schema changer and a successful execution.
- Ensure the partitions on the REGIONAL BY ROW table are sane.

Release note: None
arulajmani added a commit to arulajmani/cockroach that referenced this issue Apr 19, 2021
When altering to a regional by row table we only add partitions for
PUBLIC enum values (regions). The type schema change job that
negotiates promotion/demotion is responsible for re-partitioning
regional by row tables. Previously, this re-partitioning only happened
if the type schema change job succeeded.

This behavior missed an edge case, which looks like:
- Client 1 drops region A from the database. The user txn commits.
- Client 2 alters table t to a RBR.
- The drop region fails, and therefore rolls back.
- table t is now missing a partition for region A.

This patch fixes this problem by trigerring a re-partition even when
the type schema change job is rolling back.

Informs cockroachdb#63462

Release note (bug fix): Previously if a user altered a table to
REGIONAL BY ROW when a region was being dropped, and the drop failed
and had to be rolled back, it could have resulted in the Regional By
Row table missing a partition for this region. This is now fixed and
region drop failure rollbacks are sane.
craig bot pushed a commit that referenced this issue Apr 19, 2021
63804: sql: add more testing for concurrent region/rbr interactions r=otan a=arulajmani

Informs #63462

This patch adds TestRegionAddDropEnclosingRegionalByRowOps, which
contains subtests with the following sketch:

- Client 1 performs an ALTER ADD / DROP REGION. Let the user txn commit.
- Block in the type schema changer.
- Client 2 performs an operation on a REGIONAL BY ROW table, such as creating
 an index, unique constraint, altering the primary key etc.
- Test both a rollback in the type schema changer and a successful execution.
- Ensure the partitions on the REGIONAL BY ROW table are sane.

Release note: None

63843: jobs: skip flakey test that should be removed r=ajwerner a=ajwerner

This tests seems to flake but exercises old code. It is scheduled for
removal in #61417 which didn't make the 21.1 release.

Touches #63842.

Release note: None

Co-authored-by: arulajmani <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
arulajmani added a commit to arulajmani/cockroach that referenced this issue Apr 19, 2021
Informs cockroachdb#63462

This patch adds TestRegionAddDropEnclosingRegionalByRowOps, which
contains subtests with the following sketch:

- Client 1 performs an ALTER ADD / DROP REGION. Let the user txn commit.
- Block in the type schema changer.
- Client 2 performs an operation on a REGIONAL BY ROW table, such as creating
 an index, unique constraint, altering the primary key etc.
- Test both a rollback in the type schema changer and a successful execution.
- Ensure the partitions on the REGIONAL BY ROW table are sane.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants