sql: TRUNCATE and ongoing index builds are broken #62842

ajwerner · 2021-03-31T04:14:46Z

Describe the problem

In 20.2 we changed the implementation of TRUNCATE to create all new indexes instead of creating a new table (#52235). This was wonderful in that it made table IDs stable. Unfortunately, it appears that that PR had some severe logic bugs in the face of concurrent index builds.

The logic has two bugs around

cockroach/pkg/sql/truncate.go

Lines 221 to 227 in d145e9f

    
           // Resolve all outstanding mutations. Make all new schema elements 
        
           // public because the table is empty and doesn't need to be backfilled. 
        
           for _, m := range tableDesc.Mutations { 
        
           	if err := tableDesc.MakeMutationComplete(m); err != nil { 
        
           		return err 
        
           	} 
        
           }

The first is that it should make the in-progress indexes public prior to calculating the re-mapping. Secondly, it needs to deal with the ongoing jobs.

To Reproduce

Create an index backfill which takes a long time and then truncate the table. The index will become public but this index will contain data from prior to the truncate. The job will also remain in the queue which will prevent subsequent schema changes.

Expected behavior

The index is truncated and made public.

Additional context

This affects all current 20.2 versions and 21.1.

ajwerner · 2021-03-31T04:17:19Z

@fqazi any chance I can enlist you on this? I think the fix is small and the test should be straightforward. This is higher priority than anything else you're currently working on and I believe you have sufficient context.

fqazi · 2021-03-31T04:25:00Z

Sure I'll take it.

ajwerner · 2021-04-01T06:32:11Z

I took a stab at this #62944.

ajwerner · 2021-04-01T15:30:35Z

Alright, after some thinking about this implementation, there's still a lot of scary edge cases. For now the thinking is that we're going to disallow this hazardous behavior. It's modestly unfortunate as there are some cases which would have allowed you to quickly finish long-running schema changes that take a long time to cancel but getting this right in all of the cases seems too hard. Instead we'll allowlist the cases we can convince ourselves are safe and disallow the rest.

…ma changes Prior to this change, the truncate implementation since 20.2 was totally broken. The code creates new indexes for all existing indexes. It also makes all mutations public. The problem with this is that it makes the mutations public *after* deciding which indexes to clone. Because of this it just makes currently being backfilled or validated indexes public. This leads to inconsistency between the old secondary index and the newly truncated indexes. The code is also problematic in that it does not do anything about the mutation jobs. The code assumes that these jobs will happily continue and detect that they correspond to no mutations and stop doing things as well as clean themselves up. Unfortunately, it seems that these changes can lead to certain failures. For example, new index validation may now fail. Reverting that index build will now also fail because the index is public. This will leave the descriptor in a state where it has an orphaned mutation job. We fix this by only permitting TRUNCATE to run concurrently with mutations we deem safe and then to fix the behavior for those cases. Fixes cockroachdb#62842. Release note (bug fix): Fixed bugs where TRUNCATE concurrent with index construction and other schema changes could result in corruption.

62944: sql: fix or disallow TRUNCATE and concurrent schema changes r=jordanlewis a=ajwerner Prior to this change, the truncate implementation since 20.2 was totally broken. The code creates new indexes for all existing indexes. It also makes all mutations public. The problem with this is that it makes the mutations public *after* deciding which indexes to clone. Because of this it just makes currently being backfilled or validated indexes public. This leads to inconsistency between the old secondary index and the newly truncated indexes. The code is also problematic in that it does not do anything about the mutation jobs. The code assumes that these jobs will happily continue and detect that they correspond to no mutations and stop doing things as well as clean themselves up. Unfortunately, it seems that these changes can lead to certain failures. For example, new index validation may now fail. Reverting that index build will now also fail because the index is public. This will leave the descriptor in a state where it has an orphaned mutation job. We fix this by only permitting TRUNCATE to run concurrently with mutations we deem safe and then to fix the behavior for those cases. Fixes #62842. Release note (bug fix): Fixed bugs where TRUNCATE concurrent with index construction and other schema changes could result in corruption. 63038: ccl/bulk-io: clean up the processors a bit r=yuzefovich a=yuzefovich This commit removes some redundant implementations of `ConsumerClosed` method that simply call `InternalClose` - the reason is that it is the default implementation provided by the `ProcessorBase`. Release note: None Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>

…ma changes Prior to this change, the truncate implementation since 20.2 was totally broken. The code creates new indexes for all existing indexes. It also makes all mutations public. The problem with this is that it makes the mutations public *after* deciding which indexes to clone. Because of this it just makes currently being backfilled or validated indexes public. This leads to inconsistency between the old secondary index and the newly truncated indexes. The code is also problematic in that it does not do anything about the mutation jobs. The code assumes that these jobs will happily continue and detect that they correspond to no mutations and stop doing things as well as clean themselves up. Unfortunately, it seems that these changes can lead to certain failures. For example, new index validation may now fail. Reverting that index build will now also fail because the index is public. This will leave the descriptor in a state where it has an orphaned mutation job. We fix this by only permitting TRUNCATE to run concurrently with mutations we deem safe and then to fix the behavior for those cases. Fixes cockroachdb#62842. Release note (bug fix): Fixed bugs where TRUNCATE concurrent with index construction and other schema changes could result in corruption.

ajwerner · 2021-04-30T16:54:03Z

You can detect this situation in a cluster by running:

  WITH descs AS (
				SELECT d
				  FROM (
						SELECT crdb_internal.pb_to_json(
								'cockroach.sql.sqlbase.Descriptor',
								descriptor,
								false
						       )->'table' AS d
						  FROM system.descriptor
				       )
				 WHERE d IS NOT NULL
             ),
       bad_descs AS (
					SELECT d, idx
					  FROM (
							SELECT d,
							       d->'primaryIndex' AS pi,
							       json_array_elements(d->'indexes') AS idx
							  FROM descs
					       )
					 WHERE (pi->>'id')::INT8 > (idx->>'id')::INT8
                 )
SELECT database_name,
       schema_name,
       name AS table_name,
       table_id,
       idx->>'name' AS index_name,
       (idx->>'id')::INT8 AS index_id
  FROM crdb_internal.tables JOIN bad_descs ON (table_id = (d->>'id')::INT8);

ajwerner added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-schema-changes A-schema-descriptors Relating to SQL table/db descriptor handling. labels Mar 31, 2021

ajwerner added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Mar 31, 2021

fqazi self-assigned this Mar 31, 2021

ajwerner self-assigned this Apr 1, 2021

ajwerner mentioned this issue Apr 1, 2021

sql: fix or disallow TRUNCATE and concurrent schema changes #62944

Merged

craig bot closed this as completed in 5f390d8 Apr 5, 2021

ajwerner mentioned this issue Apr 6, 2021

release-21.1: sql: fix or disallow TRUNCATE and concurrent schema changes #63143

Merged

ajwerner mentioned this issue Apr 6, 2021

release-20.2: sql: fix or disallow TRUNCATE and concurrent schema changes #63153

Merged

exalate-issue-sync bot unassigned ajwerner Sep 27, 2021

rytaft added C-technical-advisory Caused a technical advisory branch-release-20.2 Used to mark GA and release blockers, technical advisories, and bugs for 20.2 labels Dec 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: TRUNCATE and ongoing index builds are broken #62842

sql: TRUNCATE and ongoing index builds are broken #62842

ajwerner commented Mar 31, 2021 •

edited

Loading

ajwerner commented Mar 31, 2021

fqazi commented Mar 31, 2021

ajwerner commented Apr 1, 2021

ajwerner commented Apr 1, 2021

ajwerner commented Apr 30, 2021

sql: TRUNCATE and ongoing index builds are broken #62842

sql: TRUNCATE and ongoing index builds are broken #62842

Comments

ajwerner commented Mar 31, 2021 • edited Loading

ajwerner commented Mar 31, 2021

fqazi commented Mar 31, 2021

ajwerner commented Apr 1, 2021

ajwerner commented Apr 1, 2021

ajwerner commented Apr 30, 2021

ajwerner commented Mar 31, 2021 •

edited

Loading