Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: TRUNCATE and ongoing index builds are broken #62842

Closed
ajwerner opened this issue Mar 31, 2021 · 5 comments · Fixed by #62944
Closed

sql: TRUNCATE and ongoing index builds are broken #62842

ajwerner opened this issue Mar 31, 2021 · 5 comments · Fixed by #62944
Assignees
Labels
A-schema-changes A-schema-descriptors Relating to SQL table/db descriptor handling. branch-release-20.2 Used to mark GA and release blockers, technical advisories, and bugs for 20.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-technical-advisory Caused a technical advisory S-1 High impact: many users impacted, serious risk of high unavailability or data loss

Comments

@ajwerner
Copy link
Contributor

ajwerner commented Mar 31, 2021

Describe the problem

In 20.2 we changed the implementation of TRUNCATE to create all new indexes instead of creating a new table (#52235). This was wonderful in that it made table IDs stable. Unfortunately, it appears that that PR had some severe logic bugs in the face of concurrent index builds.

The logic has two bugs around

// Resolve all outstanding mutations. Make all new schema elements
// public because the table is empty and doesn't need to be backfilled.
for _, m := range tableDesc.Mutations {
if err := tableDesc.MakeMutationComplete(m); err != nil {
return err
}
}

The first is that it should make the in-progress indexes public prior to calculating the re-mapping. Secondly, it needs to deal with the ongoing jobs.

To Reproduce

Create an index backfill which takes a long time and then truncate the table. The index will become public but this index will contain data from prior to the truncate. The job will also remain in the queue which will prevent subsequent schema changes.

Expected behavior

The index is truncated and made public.

Additional context

This affects all current 20.2 versions and 21.1.

@ajwerner ajwerner added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-schema-changes A-schema-descriptors Relating to SQL table/db descriptor handling. labels Mar 31, 2021
@ajwerner ajwerner added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Mar 31, 2021
@ajwerner
Copy link
Contributor Author

@fqazi any chance I can enlist you on this? I think the fix is small and the test should be straightforward. This is higher priority than anything else you're currently working on and I believe you have sufficient context.

@fqazi
Copy link
Collaborator

fqazi commented Mar 31, 2021

Sure I'll take it.

@ajwerner
Copy link
Contributor Author

ajwerner commented Apr 1, 2021

I took a stab at this #62944.

@ajwerner
Copy link
Contributor Author

ajwerner commented Apr 1, 2021

Alright, after some thinking about this implementation, there's still a lot of scary edge cases. For now the thinking is that we're going to disallow this hazardous behavior. It's modestly unfortunate as there are some cases which would have allowed you to quickly finish long-running schema changes that take a long time to cancel but getting this right in all of the cases seems too hard. Instead we'll allowlist the cases we can convince ourselves are safe and disallow the rest.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Apr 2, 2021
…ma changes

Prior to this change, the truncate implementation since 20.2 was totally
broken. The code creates new indexes for all existing indexes. It also
makes all mutations public. The problem with this is that it makes the
mutations public *after* deciding which indexes to clone. Because of this
it just makes currently being backfilled or validated indexes public. This
leads to inconsistency between the old secondary index and the newly
truncated indexes.

The code is also problematic in that it does not do anything about the mutation
jobs. The code assumes that these jobs will happily continue and detect that
they correspond to no mutations and stop doing things as well as clean
themselves up. Unfortunately, it seems that these changes can lead to certain
failures. For example, new index validation may now fail. Reverting that index
build will now also fail because the index is public. This will leave the
descriptor in a state where it has an orphaned mutation job. We fix this by
only permitting TRUNCATE to run concurrently with mutations we deem safe and
then to fix the behavior for those cases.

Fixes cockroachdb#62842.

Release note (bug fix): Fixed bugs where TRUNCATE concurrent with index
construction and other schema changes could result in corruption.
craig bot pushed a commit that referenced this issue Apr 5, 2021
62944: sql: fix or disallow TRUNCATE and concurrent schema changes r=jordanlewis a=ajwerner


Prior to this change, the truncate implementation since 20.2 was totally
broken. The code creates new indexes for all existing indexes. It also
makes all mutations public. The problem with this is that it makes the
mutations public *after* deciding which indexes to clone. Because of this
it just makes currently being backfilled or validated indexes public. This
leads to inconsistency between the old secondary index and the newly
truncated indexes.

The code is also problematic in that it does not do anything about the mutation
jobs. The code assumes that these jobs will happily continue and detect that
they correspond to no mutations and stop doing things as well as clean
themselves up. Unfortunately, it seems that these changes can lead to certain
failures. For example, new index validation may now fail. Reverting that index
build will now also fail because the index is public. This will leave the
descriptor in a state where it has an orphaned mutation job. We fix this by
only permitting TRUNCATE to run concurrently with mutations we deem safe and
then to fix the behavior for those cases.

Fixes #62842.

Release note (bug fix): Fixed bugs where TRUNCATE concurrent with index
construction and other schema changes could result in corruption.


63038: ccl/bulk-io: clean up the processors a bit r=yuzefovich a=yuzefovich

This commit removes some redundant implementations of `ConsumerClosed`
method that simply call `InternalClose` - the reason is that it is the
default implementation provided by the `ProcessorBase`.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@craig craig bot closed this as completed in 5f390d8 Apr 5, 2021
ajwerner added a commit to ajwerner/cockroach that referenced this issue Apr 6, 2021
…ma changes

Prior to this change, the truncate implementation since 20.2 was totally
broken. The code creates new indexes for all existing indexes. It also
makes all mutations public. The problem with this is that it makes the
mutations public *after* deciding which indexes to clone. Because of this
it just makes currently being backfilled or validated indexes public. This
leads to inconsistency between the old secondary index and the newly
truncated indexes.

The code is also problematic in that it does not do anything about the mutation
jobs. The code assumes that these jobs will happily continue and detect that
they correspond to no mutations and stop doing things as well as clean
themselves up. Unfortunately, it seems that these changes can lead to certain
failures. For example, new index validation may now fail. Reverting that index
build will now also fail because the index is public. This will leave the
descriptor in a state where it has an orphaned mutation job. We fix this by
only permitting TRUNCATE to run concurrently with mutations we deem safe and
then to fix the behavior for those cases.

Fixes cockroachdb#62842.

Release note (bug fix): Fixed bugs where TRUNCATE concurrent with index
construction and other schema changes could result in corruption.
ajwerner added a commit to ajwerner/cockroach that referenced this issue Apr 20, 2021
…ma changes

Prior to this change, the truncate implementation since 20.2 was totally
broken. The code creates new indexes for all existing indexes. It also
makes all mutations public. The problem with this is that it makes the
mutations public *after* deciding which indexes to clone. Because of this
it just makes currently being backfilled or validated indexes public. This
leads to inconsistency between the old secondary index and the newly
truncated indexes.

The code is also problematic in that it does not do anything about the mutation
jobs. The code assumes that these jobs will happily continue and detect that
they correspond to no mutations and stop doing things as well as clean
themselves up. Unfortunately, it seems that these changes can lead to certain
failures. For example, new index validation may now fail. Reverting that index
build will now also fail because the index is public. This will leave the
descriptor in a state where it has an orphaned mutation job. We fix this by
only permitting TRUNCATE to run concurrently with mutations we deem safe and
then to fix the behavior for those cases.

Fixes cockroachdb#62842.

Release note (bug fix): Fixed bugs where TRUNCATE concurrent with index
construction and other schema changes could result in corruption.
@ajwerner
Copy link
Contributor Author

You can detect this situation in a cluster by running:

  WITH descs AS (
				SELECT d
				  FROM (
						SELECT crdb_internal.pb_to_json(
								'cockroach.sql.sqlbase.Descriptor',
								descriptor,
								false
						       )->'table' AS d
						  FROM system.descriptor
				       )
				 WHERE d IS NOT NULL
             ),
       bad_descs AS (
					SELECT d, idx
					  FROM (
							SELECT d,
							       d->'primaryIndex' AS pi,
							       json_array_elements(d->'indexes') AS idx
							  FROM descs
					       )
					 WHERE (pi->>'id')::INT8 > (idx->>'id')::INT8
                 )
SELECT database_name,
       schema_name,
       name AS table_name,
       table_id,
       idx->>'name' AS index_name,
       (idx->>'id')::INT8 AS index_id
  FROM crdb_internal.tables JOIN bad_descs ON (table_id = (d->>'id')::INT8);

@rytaft rytaft added C-technical-advisory Caused a technical advisory branch-release-20.2 Used to mark GA and release blockers, technical advisories, and bugs for 20.2 labels Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-schema-changes A-schema-descriptors Relating to SQL table/db descriptor handling. branch-release-20.2 Used to mark GA and release blockers, technical advisories, and bugs for 20.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-technical-advisory Caused a technical advisory S-1 High impact: many users impacted, serious risk of high unavailability or data loss
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants