Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

schema: allow transaction retry in schema changer backfill #107954

Merged
merged 5 commits into from
Aug 15, 2023
Merged

schema: allow transaction retry in schema changer backfill #107954

merged 5 commits into from
Aug 15, 2023

Conversation

ecwall
Copy link
Contributor

@ecwall ecwall commented Aug 1, 2023

Fixes #106696

This change adds a test to simulate transaction retries during the schema
changer backfill.

Some retries caused ingested key collides with an existing one errors
during addSSTable because the schema change's source data changed.
For example: CREATE TABLE t AS SELECT nextval('seq'); - nextval is
non-transactional so each time it is selected, the value changes.

This change also allows the schema changer to specify that addSSTable is
allowed to shadow (replace) keys' values if they are different to handle
this case.

Release note: None

@ecwall ecwall requested a review from stevendanna August 1, 2023 15:16
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ecwall ecwall marked this pull request as ready for review August 9, 2023 19:21
@ecwall ecwall requested review from a team as code owners August 9, 2023 19:21
@ecwall ecwall requested a review from rytaft August 9, 2023 19:21
@ecwall ecwall changed the title sql_ctas_txn_retry schema: allow transaction retry in schema changer backfill Aug 9, 2023
Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 9 files at r1, 13 of 13 files at r2, 8 of 8 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ecwall, @rytaft, and @stevendanna)


pkg/kv/kvclient/kvcoord/testing_knobs.go line 64 at r1 (raw file):

	// loops to randomly inject retriable errors.
	EnableRandomTransactionRetryErrors bool
}

Should this setting be a probability value instead?


pkg/kv/kvclient/kvcoord/txn_coord_sender.go line 43 at r1 (raw file):

//
// See: https://github.com/cockroachdb/cockroach/pull/73512.
var DisableCommitSanityCheck = envutil.EnvOrDefaultBool("COCKROACH_DISABLE_COMMIT_SANITY_CHECK", false)

Should this be in this commit?


pkg/sql/rowexec/bulk_row_writer.go line 146 at r3 (raw file):

			Name:          sp.tableDesc.GetName(),
			MinBufferSize: bufferSize,
			// We disallow shadowing here to ensure that we report errors when builds

Update the comment here, also I think this will prevent errors from surfacing I think in other cases?

@rytaft rytaft removed request for a team and rytaft August 10, 2023 19:36
Copy link
Contributor Author

@ecwall ecwall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi and @stevendanna)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go line 43 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Should this be in this commit?

This is a change from @stevendanna that has not been merged yet. My change extends some of the functionality.


pkg/sql/rowexec/bulk_row_writer.go line 146 at r3 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Update the comment here, also I think this will prevent errors from surfacing I think in other cases?

After talking with the DR team (https://cockroachlabs.slack.com/archives/C03JCUUSCD6/p1691669801528989), I will have to rethink this part of the PR.

Copy link
Contributor Author

@ecwall ecwall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi and @stevendanna)


pkg/kv/kvclient/kvcoord/testing_knobs.go line 64 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Should this setting be a probability value instead?

Only some of the filtering is probability based so I left it to the filter impl to do that calculation.

@ecwall ecwall requested review from dt and fqazi August 11, 2023 20:14
@ecwall ecwall added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Aug 14, 2023
@@ -133,6 +133,21 @@ func (sp *bulkRowWriter) wrapDupError(ctx context.Context, orig error) error {

func (sp *bulkRowWriter) ingestLoop(ctx context.Context, kvCh chan row.KVBatch) error {
writeTS := sp.spec.Table.CreateAsOfTime

// Delete existing span before ingestion to prevent key collisions.
if sp.spec.DeleteSpan {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this operation is distributed, don't we risk stepping on ourselves? Should this be done at a higher level?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think we should be doing this in the processor; I think we should be doing this in the coordinator before we setup the flow, since a) now every proc is deleting the same span, which will all the levels of the LSM with duplicate range keys pushing all writes into L0, and more importantly b) one proc might already be writing by the time another starts up, gets scheduled and runs its deletes, leading to data corruption.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, I moved this up into backfillQueryIntoTable().

stevendanna and others added 4 commits August 15, 2023 11:42
Occasionally, we see bugs caused by mismangement of state inside a
function that will be retried. Namely, the functions `(*kv.DB).Txn`
and `(*sql.InternalExecutor).Txn` take funcs that must be suitably
idempotent.

This is testing knob provides cheap way to help us hunt down such
bugs.

In the long run, it would be nice to turn this on by default, but we
should probably make sure TestServer runs reliably with this turned on
before flipping the default.

I also considered retrying _every_ transaction once rather than
relying on randomness. However, that would require adding more state
in kv.Txn, which I wasn't sure I really wanted to do for testing
purposes.

Informs #106417

Epic: none

Release note: None
…etryFilter

Adds TransactionRetryFilter which allows the retry behavior to be customized.

EnableRandomTransactionRetryErrors has been refactored to a
TransactionRetryFilter implementation that randomly retries.

Release note: None
@ecwall ecwall requested a review from fqazi August 15, 2023 15:44
Fixes #106696

This change adds a test to simulate transaction retries during the schema
changer backfill.

Some retries caused `ingested key collides with an existing one` errors
during addSSTable because the schema change's source data changed.
For example: `CREATE TABLE t AS SELECT nextval('seq');` - `nextval` is
non-transactional so each time it is selected, the value changes.

This change also allows the schema changer to specify that addSSTable is
allowed to shadow (replace) keys' values if they are different to handle
this case.

Release note: None
Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 18 of 18 files at r9, 13 of 13 files at r10, 1 of 1 files at r11, 1 of 1 files at r12, 8 of 8 files at r13, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @stevendanna)

@ecwall
Copy link
Contributor Author

ecwall commented Aug 15, 2023

bors r=fqazi

@craig
Copy link
Contributor

craig bot commented Aug 15, 2023

Build succeeded:

@craig craig bot merged commit aba50b9 into cockroachdb:master Aug 15, 2023
@blathers-crl
Copy link

blathers-crl bot commented Aug 15, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from c9ed884 to blathers/backport-release-23.1-107954: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 16, 2023
This commit includes a collection of kv-level cleanups for the
randomized retryable error injection functionality recently introduced
in cockroachdb#107954. Notably, it:
- removes the dependency from `kvcoord.TxnCoordSender` on `kv.Txn` by adjusting
  the parameter of `ClientTestingKnobs.TransactionRetryFilter`.
- unexports `kv.Txn.DebugNameLocked`
- deletes `COCKROACH_DISABLE_COMMIT_SANITY_CHECK` again, which was unused
- removes rng sources in favor of the global rand, for simplicity
- cleans up and adds some comments

Epic: None
Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 20, 2023
This commit includes a collection of kv-level cleanups for the
randomized retryable error injection functionality recently introduced
in cockroachdb#107954. Notably, it:
- removes the dependency from `kvcoord.TxnCoordSender` on `kv.Txn` by adjusting
  the parameter of `ClientTestingKnobs.TransactionRetryFilter`.
- unexports `kv.Txn.DebugNameLocked`
- deletes `COCKROACH_DISABLE_COMMIT_SANITY_CHECK` again, which was unused
- removes rng sources in favor of the global rand, for simplicity
- cleans up and adds some comments

Epic: None
Release note: None
craig bot pushed a commit that referenced this pull request Aug 20, 2023
108828: kv: clean up randomized retryable error injection r=nvanbenschoten a=nvanbenschoten

This commit includes a collection of kv-level cleanups for the randomized retryable error injection functionality recently introduced in #107954. Notably, it:
- removes the dependency from `kvcoord.TxnCoordSender` on `kv.Txn` by adjusting the parameter of `ClientTestingKnobs.TransactionRetryFilter`.
- unexports `kv.Txn.DebugNameLocked`
- deletes `COCKROACH_DISABLE_COMMIT_SANITY_CHECK` again, which was unused
- removes rng sources in favor of the global rand, for simplicity
- cleans up and adds some comments

Epic: None
Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
5 participants