Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RESTORE: Restores may split ranges between column families of same row #109483

Closed
dt opened this issue Aug 25, 2023 · 4 comments · Fixed by #109713
Closed

RESTORE: Restores may split ranges between column families of same row #109483

dt opened this issue Aug 25, 2023 · 4 comments · Fixed by #109713
Assignees
Labels
A-disaster-recovery branch-master Failures and bugs on the master branch. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-technical-advisory Caused a technical advisory release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. T-disaster-recovery

Comments

@dt
Copy link
Member

dt commented Aug 25, 2023

This is coming from https://github.com/cockroachlabs/support/issues/2549#issuecomment-1692104972

If a DB was restored, and the schema of that DB has a table with multiple column families, SQL may fail reading or return wrong results.

Restore calls EnsureSafeSplitKey() before using a key as a split point, and then passes it to AdminSplit(). If the key was a valid key then EnsureSafeSplitKey() modifies the key to not be between column families, meaning it ensures the key would be a safe split point.

If the key is not a table key then EnsureSafeSplitKey() fails, but Restore can still split on that key - it is safe to split on non-table keys. But, if the key is a table key that EnsureSafeSplitKey() cannot parse then it will fail, and Restore (wrongfully) assumes that the key is again, not a table key.

In the bug here, the key was produced by Key.Next() which adds a 0 (NULL) to the end of the key, and that key cannot be parsed properly by EnsureSafeSplitKey(). Restore still used that key as a split point even though it may be between column families.

If the schema has column families, and an unsafe split point falls between column families then, when reading the table, SQL may either fail because it cannot find the column that is in another range (a different column family), or worse, SQL may succeed but return NULL if those missing columns are nullable (wrong result).

Worth mentioning that the data is not lost, and therefore by changing the range splits the DB can be fixed.

There are a things to do here:

  • Fix: avoid sending a potentially unsafe split point to KV. If we only implement this fix, then Restore will be slower because we will not issue about half of the split point.
  • Fix: avoid using Key.Next() for restore spans (or at least avoid allowing those to be used as split point).
  • Improve testing: the reason this was not found in our nightly roachtests is that we run a KV level fingerprinting on the restored DB, but from the perspective of KV the data exists and correct. In order to avoid this issue we need to run a SQL fingerprinting in those tests.
  • Produce a script/tool/instructions on how to fix a DB that was restored and may have unsafe split points.

Jira issue: CRDB-30944

@dt dt added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. branch-master Failures and bugs on the master branch. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.1.9-rc labels Aug 25, 2023
@blathers-crl
Copy link

blathers-crl bot commented Aug 25, 2023

cc @cockroachdb/disaster-recovery

@dt dt added S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. T-disaster-recovery and removed T-disaster-recovery labels Aug 25, 2023
@blathers-crl
Copy link

blathers-crl bot commented Aug 25, 2023

cc @cockroachdb/disaster-recovery

rhu713 pushed a commit to rhu713/cockroach that referenced this issue Aug 30, 2023
Remove all calls to key.Next() from generateAndSendImportSpans so that all cover
spans have valid keys as start and end keys.

Fixes: cockroachdb#109483

Release note: None
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Aug 30, 2023
Remove all calls to key.Next() from generateAndSendImportSpans so that all cover
spans have valid keys as start and end keys.

This fixes an issue where a split can be called on an invalid key that's in the
form of `someValidKey.Next()` during restore. These invalid keys will generally
have a `NULL` at the end of the key, which will result in an error when calling
`EnsureSafeSplits` on this split key. Currently errors from `EnsureSafeSplits`
are ignored, and thus a split will always be attempted on this type of invalid
split key. This split key can land in the middle of a row with column families,
and thus result in failing SQL queries when querying the restored table.

Fixes: cockroachdb#109483

Release note: None
lidorcarmel added a commit to lidorcarmel/cockroach that referenced this issue Aug 30, 2023
Restore may use unsafe keys as split points, which may cause unsafe splits
between column families, which may cause SQL to fail when reading the row, or
worse, return wrong resutls.

This commit avoids splitting on keys that might be unsafe.

See the issue for more info.

Epic: none
Informs: cockroachdb#109483

Release note: None.
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Aug 30, 2023
Remove all calls to key.Next() from generateAndSendImportSpans so that all cover
spans have valid keys as start and end keys.

This fixes an issue where a split can be called on an invalid key that's in the
form of `someValidKey.Next()` during restore. These invalid keys will generally
have a `NULL` at the end of the key, which will result in an error when calling
`EnsureSafeSplits` on this split key. Currently errors from `EnsureSafeSplits`
are ignored, and thus a split will always be attempted on this type of invalid
split key. This split key can land in the middle of a row with column families,
and thus result in failing SQL queries when querying the restored table.

A slight modification was also made to restore span generation to only create
spans out of the start keys of the file instead of both start and end keys.
This is not expected to change much to the generated spans as most of the time
the end key of a file span is the same as the start key of the subsequent file,
but it simplifies the restore span generation logic and may result in slighly
larger spans during restore.

Fixes: cockroachdb#109483

Release note: None
craig bot pushed a commit that referenced this issue Oct 13, 2023
112321: roachtest: maybe init tpcc with column families in backup-restore tests r=adityamaru a=msbutler

This patch inits the tpcc workload with column families 50% of the time in the backup-mixed-version and backup-restore/roundtrip tests.

This patch is the first step to recreating a roachtest workload that can reproduce #109483.

Release note: None

Epic: none

Co-authored-by: Michael Butler <[email protected]>
msbutler added a commit to msbutler/cockroach that referenced this issue Oct 13, 2023
This patch metamorphically reduces the default range size for user databases in
the backup-restore/round-trip roachtest to simulate a larger cluster, in terms
of range count, at smaller data sizes. In addition, the patch metamorphically
reduces the backup.restore_span.target_size setting to increase test coverage
of generateImportSpans.

This patch should allow us to recreate a roachtest workload that can simulate
the conditions that lead to cockroachdb#109483 but at smaller data sizes.

Informs cockroachdb#109483

Release note: None
blathers-crl bot pushed a commit that referenced this issue Oct 16, 2023
This patch inits the tpcc workload with column families 50% of the time in the
backup-mixed-version and backup-restore/roundtrip tests.

This patch is the first step to recreating a roachtest workload that can
reproduce #109483.

Release note: None

Epic: none
blathers-crl bot pushed a commit that referenced this issue Oct 16, 2023
This patch inits the tpcc workload with column families 50% of the time in the
backup-mixed-version and backup-restore/roundtrip tests.

This patch is the first step to recreating a roachtest workload that can
reproduce #109483.

Release note: None

Epic: none
msbutler added a commit to msbutler/cockroach that referenced this issue Oct 17, 2023
This patch metamorphically reduces the default range size for user databases in
the backup-restore/round-trip roachtest to simulate a larger cluster, in terms
of range count, at smaller data sizes. In addition, the patch metamorphically
reduces the backup.restore_span.target_size setting to increase test coverage
of generateImportSpans.

This patch should allow us to recreate a roachtest workload that can simulate
the conditions that lead to cockroachdb#109483 but at smaller data sizes.

Informs cockroachdb#109483

Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue Oct 19, 2023
This patch adds a new backup-restore roachtest variant that reduces default
range size for user databases in the backup-restore/round-trip roachtest to
simulate a larger cluster, in terms of range count, at smaller data sizes. In
addition, the roachtest scales down a few cluster settings such that the ratio
of rangeSize/clusterSetting remains constant.

This patch should allow us to recreate a roachtest workload that can simulate
the conditions that lead to cockroachdb#109483 but at smaller data sizes.

Informs cockroachdb#109483

Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue Oct 19, 2023
This patch adds a new backup-restore roachtest variant that reduces default
range size for user databases in the backup-restore/round-trip roachtest to
simulate a larger cluster, in terms of range count, at smaller data sizes. In
addition, the roachtest scales down a few cluster settings such that the ratio
of rangeSize/clusterSetting remains constant.

This patch should allow us to recreate a roachtest workload that can simulate
the conditions that lead to cockroachdb#109483 but at smaller data sizes.

Informs cockroachdb#109483

Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue Oct 19, 2023
This patch adds a new backup-restore roachtest variant that reduces default
range size for user databases in the backup-restore/round-trip roachtest to
simulate a larger cluster, in terms of range count, at smaller data sizes. In
addition, the roachtest scales down a few cluster settings such that the ratio
of rangeSize/clusterSetting remains constant.

This patch should allow us to recreate a roachtest workload that can simulate
the conditions that lead to cockroachdb#109483 but at smaller data sizes.

Informs cockroachdb#109483

Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue Oct 24, 2023
This patch adds a new backup-restore roachtest variant that reduces default
range size for user databases in the backup-restore/round-trip roachtest to
simulate a larger cluster, in terms of range count, at smaller data sizes. In
addition, the roachtest scales down a few cluster settings such that the ratio
of rangeSize/clusterSetting remains constant.

This patch should allow us to recreate a roachtest workload that can simulate
the conditions that lead to cockroachdb#109483 but at smaller data sizes.

Informs cockroachdb#109483

Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue Oct 25, 2023
This patch adds a new backup-restore roachtest variant that reduces default
range size for user databases in the backup-restore/round-trip roachtest to
simulate a larger cluster, in terms of range count, at smaller data sizes. In
addition, the roachtest scales down a few cluster settings such that the ratio
of rangeSize/clusterSetting remains constant.

This patch should allow us to recreate a roachtest workload that can simulate
the conditions that lead to cockroachdb#109483 but at smaller data sizes.

Informs cockroachdb#109483

Release note: None
craig bot pushed a commit that referenced this issue Oct 26, 2023
112356: roachtest: add backup-restore/small-ranges  r=renatolabs a=msbutler

This patch adds a new backup-restore roachtest variant that reduces default
range size for user databases in the backup-restore/round-trip roachtest to
simulate a larger cluster, in terms of range count, at smaller data sizes. In
addition, the roachtest scales down a few cluster settings such that the ratio
of rangeSize/clusterSetting remains constant.

This patch should allow us to recreate a roachtest workload that can simulate
the conditions that lead to #109483 but at smaller data sizes.

Informs #109483

Release note: None

Co-authored-by: Michael Butler <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Oct 27, 2023
This patch adds a new backup-restore roachtest variant that reduces default
range size for user databases in the backup-restore/round-trip roachtest to
simulate a larger cluster, in terms of range count, at smaller data sizes. In
addition, the roachtest scales down a few cluster settings such that the ratio
of rangeSize/clusterSetting remains constant.

This patch should allow us to recreate a roachtest workload that can simulate
the conditions that lead to #109483 but at smaller data sizes.

Informs #109483

Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 3, 2023
This patch adds a new backup-restore roachtest variant that reduces default
range size for user databases in the backup-restore/round-trip roachtest to
simulate a larger cluster, in terms of range count, at smaller data sizes. In
addition, the roachtest scales down a few cluster settings such that the ratio
of rangeSize/clusterSetting remains constant.

This patch should allow us to recreate a roachtest workload that can simulate
the conditions that lead to cockroachdb#109483 but at smaller data sizes.

Informs cockroachdb#109483

Release note: None
@rytaft rytaft added C-technical-advisory Caused a technical advisory branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 labels Nov 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery branch-master Failures and bugs on the master branch. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-technical-advisory Caused a technical advisory release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. T-disaster-recovery
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants