backupccl: cluster backups include offline tables #88043

msbutler · 2022-09-16T13:53:48Z

In pre-22.2 releases, it was assumed that cluster backups exclude offline tables, but in fact they include them. This misconception can lead to corrupt data on restore. Consider:

t0: begin IMPORT on foo
t1: conduct cluster backup - captures foo's pre-import state and some importing data
t2: roll back the import on foo via a non-MVCC clear range
t3: conduct incremental backup
t4: restore foo to latest time

Because the clear range is non-MVCC, it leaves nothing for the incremental backup to capture, so the incremental backup is completely unaware of the rollback. As a result, the importing data captured at t1 gets restored.
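The timeline above can be reproduced with a toy model. The sketch below is not CockroachDB code: the `write` type, `backupLayer`, `restore`, and `clearKey` are hypothetical names, and timestamps are plain integers. It assumes a backup layer captures every MVCC write in its time window, and that a non-MVCC clear range erases data without leaving a tombstone.

```go
package main

import "fmt"

// write is a hypothetical model of one MVCC write.
// An empty value stands in for an MVCC tombstone.
type write struct {
	key   string
	value string
	ts    int
}

// backupLayer models one backup in a chain: it captures every write
// still present in the history with a timestamp in (start, end].
func backupLayer(history []write, start, end int) []write {
	var out []write
	for _, w := range history {
		if w.ts > start && w.ts <= end {
			out = append(out, w)
		}
	}
	return out
}

// clearKey models a non-MVCC clear range: all versions of the key
// disappear from the history and no tombstone is written.
func clearKey(history []write, key string) []write {
	var kept []write
	for _, w := range history {
		if w.key != key {
			kept = append(kept, w)
		}
	}
	return kept
}

// restore replays the layers in order and returns the visible keys.
func restore(layers ...[]write) map[string]string {
	kv := map[string]string{}
	for _, layer := range layers {
		for _, w := range layer {
			if w.value == "" {
				delete(kv, w.key)
			} else {
				kv[w.key] = w.value
			}
		}
	}
	return kv
}

// simulate runs t0..t4 and returns the restored contents of foo.
func simulate(mvccRollback bool) map[string]string {
	history := []write{
		{"foo/1", "pre-import", 1},
		{"foo/2", "importing", 2}, // t0: IMPORT starts writing
	}
	full := backupLayer(history, 0, 3) // t1: cluster backup
	if mvccRollback {
		// An MVCC rollback would leave a tombstone at ts 4.
		history = append(history, write{"foo/2", "", 4})
	} else {
		history = clearKey(history, "foo/2") // t2: non-MVCC clear range
	}
	inc := backupLayer(history, 3, 5) // t3: incremental backup
	return restore(full, inc)         // t4: restore to latest time
}

func main() {
	fmt.Println(simulate(false)) // map[foo/1:pre-import foo/2:importing]
	fmt.Println(simulate(true))  // map[foo/1:pre-import]
}
```

With the non-MVCC rollback, the incremental layer is empty for `foo`, so the importing row from the full backup survives the restore. With a hypothetical MVCC rollback, the tombstone in the incremental layer would have removed it.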

For cluster backups without revision history, this bug could be fixed by simply excluding the offline table from the backup. For cluster backups with revision history, a more complex fix is necessary, as outlined in #88042.
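For the case without revision history, the exclusion could look roughly like the sketch below. This is an illustration only: `tableDesc`, `tableState`, and `tablesToBackUp` are hypothetical names, not the descriptor types or backup planning code in CockroachDB.

```go
package main

import "fmt"

// tableState is a hypothetical stand-in for a descriptor's state.
type tableState int

const (
	statePublic tableState = iota
	stateOffline // e.g. a table with an in-progress IMPORT
)

// tableDesc is a hypothetical, minimal table descriptor.
type tableDesc struct {
	name  string
	state tableState
}

// tablesToBackUp returns only the tables that are not offline, so a
// backup without revision history never captures importing data.
func tablesToBackUp(tables []tableDesc) []tableDesc {
	var out []tableDesc
	for _, t := range tables {
		if t.state == stateOffline {
			continue
		}
		out = append(out, t)
	}
	return out
}

func main() {
	tables := []tableDesc{
		{"foo", stateOffline},
		{"bar", statePublic},
	}
	for _, t := range tablesToBackUp(tables) {
		fmt.Println(t.name)
	}
}
```

This simple filter is not enough for revision history backups, which is why the issue defers that case to #88042.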

Jira issue: CRDB-19658

Currently, RESTORE may restore invalid backup data from a backed-up table that underwent an IMPORT rollback. See cockroachdb#87305 for a detailed explanation. This patch ensures that RESTORE elides older backup data that was deleted via a non-MVCC operation. Because incremental backups always reintroduce spans (i.e. back them up from timestamp 0) that may have undergone a non-MVCC operation, restore can identify restoring spans with potentially corrupt data in the backup chain and ingest only the spans' reintroduced data, to any system time, without the corrupt data.

Here's the basic implementation in Restore:
- For each table we want to restore:
  - identify the last time, l, the table was re-introduced, using the manifests
  - don't restore the table using a backup if backup.EndTime < l

This implementation rests on the following assumption: the input spans for each restoration flow (created in createImportingDescriptors) and the restoreSpanEntries (created by makeSimpleImportSpans) do not span across multiple tables. Given this assumption, makeSimpleImportSpans skips adding files from a backup for a given input span that was reintroduced in a subsequent backup.

It's worth noting that all significant refactoring occurs on code run by the restore coordinator; therefore, no special care needs to be taken for mixed / cross version backups. In other words, if the coordinator has been updated, the cluster restores properly; otherwise, the bug will exist on the restored cluster. It's also worth noting that other forms of this bug are apparent on older cluster versions (cockroachdb#88042, cockroachdb#88043) and have not been noticed by customers; thus, there is no need to fail a mixed version restore to protect the customer from this already existing bug.

Fixes cockroachdb#87305

Release justification: bug fix

Release note: none
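The elision rule described in this commit message can be sketched as follows. This is a simplified illustration, not the code in makeSimpleImportSpans: the `manifest` type, `lastReintroduced`, and `filesToRestore` are hypothetical, tables stand in for spans, and timestamps are integers. The message does not say which timestamp l refers to, so the sketch assumes l is the end time of the most recent backup that reintroduced the table.

```go
package main

import "fmt"

// manifest is a hypothetical, minimal backup manifest.
type manifest struct {
	endTime int
	// reintroduced lists tables this backup backed up from timestamp 0.
	reintroduced map[string]bool
	// files maps a table name to the backup files that cover it.
	files map[string][]string
}

// lastReintroduced returns l: the end time of the latest backup in
// the chain that reintroduced the table, or 0 if none did.
func lastReintroduced(chain []manifest, table string) int {
	l := 0
	for _, m := range chain {
		if m.reintroduced[table] && m.endTime > l {
			l = m.endTime
		}
	}
	return l
}

// filesToRestore skips every backup with endTime < l, so data that
// predates the reintroduction is never ingested.
func filesToRestore(chain []manifest, table string) []string {
	l := lastReintroduced(chain, table)
	var files []string
	for _, m := range chain {
		if m.endTime < l {
			continue
		}
		files = append(files, m.files[table]...)
	}
	return files
}

// exampleChain builds a full backup followed by two incrementals;
// the first incremental reintroduces foo after its IMPORT rollback.
func exampleChain() []manifest {
	return []manifest{
		{endTime: 3, files: map[string][]string{
			"foo": {"full/foo.sst"}, "bar": {"full/bar.sst"}}},
		{endTime: 5, reintroduced: map[string]bool{"foo": true},
			files: map[string][]string{
				"foo": {"inc1/foo.sst"}, "bar": {"inc1/bar.sst"}}},
		{endTime: 7, files: map[string][]string{
			"foo": {"inc2/foo.sst"}}},
	}
}

func main() {
	chain := exampleChain()
	fmt.Println(filesToRestore(chain, "foo")) // [inc1/foo.sst inc2/foo.sst]
	fmt.Println(filesToRestore(chain, "bar")) // [full/bar.sst inc1/bar.sst]
}
```

The full backup's files for `foo` are skipped because that backup ended before `foo` was reintroduced, while `bar`, which was never reintroduced, is restored from every layer.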

blathers-crl · 2022-09-20T15:51:59Z

Hi @msbutler, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Currently, RESTORE may restore invalid backup data from a backed-up table that underwent an IMPORT rollback. See cockroachdb#87305 for a detailed explanation. This patch ensures that RESTORE elides older backup data that was deleted via a non-MVCC operation. Because incremental backups always reintroduce spans (i.e. back them up from timestamp 0) that may have undergone a non-MVCC operation, restore can identify restoring spans with potentially corrupt data in the backup chain and ingest only the spans' reintroduced data, to any system time, without the corrupt data.

Here's the basic implementation in Restore:
- For each span we want to restore:
  - identify the last time, l, the span was introduced, using the manifests
  - don't restore the span using a backup if backup.EndTime < l

This implementation rests on the following assumption: the input spans for each restoration flow (created in createImportingDescriptors) and the restoreSpanEntries (created by makeSimpleImportSpans) do not span across multiple tables. Given this assumption, makeSimpleImportSpans skips adding files from a backup for a given input span that was reintroduced in a subsequent backup.

It's worth noting that all significant refactoring occurs on code run by the restore coordinator; therefore, no special care needs to be taken for mixed / cross version backups. In other words, if the coordinator has been updated, the cluster restores properly; otherwise, the bug will exist on the restored cluster. It's also worth noting that other forms of this bug are apparent on older cluster versions (cockroachdb#88042, cockroachdb#88043) and have not been noticed by customers; thus, there is no need to fail a mixed version restore to protect the customer from this already existing bug.

Informs cockroachdb#87305

Release justification: bug fix

Release note: fix for TA advisory https://cockroachlabs.atlassian.net/browse/TSE-198

87312: backupccl: elide spans from backups that were subsequently reintroduced r=dt,adityamaru a=msbutler

Currently, RESTORE may restore invalid backup data from a backed-up table that underwent an IMPORT rollback. See #87305 for a detailed explanation. This patch ensures that RESTORE elides older backup data that was deleted via a non-MVCC operation. Because incremental backups always reintroduce spans (i.e. back them up from timestamp 0) that may have undergone a non-MVCC operation, restore can identify restoring spans with potentially corrupt data in the backup chain and ingest only the spans' reintroduced data, to any system time, without the corrupt data.

Here's the basic implementation in Restore:
- For each span we want to restore:
  - identify the last time, l, the span was introduced, using the manifests
  - don't restore the span using a backup if backup.EndTime < l

This implementation rests on the following assumption: the input spans for each restoration flow (created in createImportingDescriptors) and the restoreSpanEntries (created by makeSimpleImportSpans) do not span across multiple tables. Given this assumption, makeSimpleImportSpans skips adding files from a backup for a given input span that was reintroduced in a subsequent backup.

It's worth noting that all significant refactoring occurs on code run by the restore coordinator; therefore, no special care needs to be taken for mixed / cross version backups. In other words, if the coordinator has been updated, the cluster restores properly; otherwise, the bug will exist on the restored cluster. It's also worth noting that other forms of this bug are apparent on older cluster versions (#88042, #88043) and have not been noticed by customers; thus, there is no need to fail a mixed version restore to protect the customer from this already existing bug.

Informs #87305

Release justification: bug fix

Release note: fix for TA advisory https://cockroachlabs.atlassian.net/browse/TSE-198

88384: server: return elapsed time for active executions r=xinhaoz a=xinhaoz

Previously, we calculated the time elapsed for an active stmt or txn based on the start time returned from the server and the time the response was last received. Calculating this value on the client is not reliable and can lead to negative values when the server time is slightly ahead. This commit fixes this issue by including the time elapsed as part of the active txns and stmts response.

Release note (bug fix): time elapsed for active txns and stmts is never negative.

88449: kvserver: fix flaky test for consistency checks r=erikgrinaker a=pavelkalinnikov

There was a race in selecting between a canceled context.Done and 0-time timer.

Fixes #88133

Release justification: flaky test fix

Release note: None

Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>


msbutler · 2022-10-13T20:15:03Z

Closed via:
- 22.1 PR: #88488
- 21.2 PR: #89019

msbutler added the C-bug, A-disaster-recovery, and T-disaster-recovery labels Sep 16, 2022

msbutler self-assigned this Sep 16, 2022

msbutler added the GA-blocker label Sep 20, 2022

msbutler added the branch-release-22.2 label Sep 20, 2022

msbutler mentioned this issue Sep 21, 2022

backupccl: elide spans from backups that were subsequently reintroduced #87312

Merged

blathers-crl bot mentioned this issue Sep 22, 2022

release-22.2: backupccl: elide spans from backups that were subsequently reintroduced #88474

Merged

msbutler mentioned this issue Sep 22, 2022

release-22.1: backupccl: elide spans from backups that were subsequently reintroduced #88488

Merged

msbutler mentioned this issue Sep 29, 2022

release-21.2: release-22.1: backupccl: elide spans from backups that were subsequently reintroduced #89019

Merged

msbutler mentioned this issue Sep 30, 2022

release-22.2: release-22.1: backupccl: reintroduce previously offline tables with manifest.DescriptorChanges #89102

Merged

msbutler mentioned this issue Oct 12, 2022

backupccl: offline table data in revision history backups can leak into restored cluster #88042

Closed

msbutler closed this as completed Oct 13, 2022

github-project-automation bot added this to Disaster Recovery Backlog Aug 28, 2024

github-project-automation bot moved this to Done in Disaster Recovery Backlog Aug 28, 2024
