
backupccl: issue protected timestamp for restore spans #91148

Closed
msbutler opened this issue Nov 2, 2022 · 3 comments · Fixed by #91991
Assignees: msbutler
Labels: A-disaster-recovery, C-bug, T-disaster-recovery

Comments

@msbutler
Collaborator

msbutler commented Nov 2, 2022

Currently, we don't issue a protected timestamp over the spans we're about to restore, which on the surface does not seem like an issue, since restore doesn't overwrite existing keys. The lack of a PTS can cause the restore to fail, however, if the span we're restoring into has a very low GC TTL, say 5 minutes. Specifically, if an AddSSTable request takes longer than the GC TTL, then its batch request timestamp may be earlier than the replica's GC threshold, causing the AddSSTable to fail.

We can prevent this failure by preventing the GC threshold from advancing in the replicas where the restore is occurring. To do this, a PTS should be issued over the whole restore span before any restore flow begins.

Note that the AddSSTable request can't merely be retried at a higher batch request timestamp, as all the keys in the SST have timestamps written at the batch request timestamp. Bumping the batch timestamp would require rewriting the SST, which seems quite expensive.
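For illustration, here is a minimal Go sketch of the ordering proposed above, using hypothetical `PTSManager`, `Record`, and `Span` types rather than the real `protectedts` API: write one record covering all restore spans before any restore flow starts, and release it once the job finishes.

```go
// Hypothetical sketch (not the actual CockroachDB API): issue a protected
// timestamp record over the full set of restore spans before any restore
// flow starts, and release it when the job finishes.
package restoredemo

import (
	"context"
	"fmt"
	"time"
)

// Span and Record are simplified stand-ins for the real key spans and
// protectedts record types.
type Span struct{ StartKey, EndKey string }

type Record struct {
	ProtectedAt time.Time // GC may not advance past this timestamp on these spans
	Spans       []Span
}

// PTSManager is a hypothetical interface over the protected timestamp subsystem.
type PTSManager interface {
	Protect(ctx context.Context, rec Record) (recordID string, err error)
	Release(ctx context.Context, recordID string) error
}

// runRestore sketches the ordering the issue asks for: protect first, then
// start the restore flows, then release the record on every exit path.
func runRestore(ctx context.Context, pts PTSManager, restoreSpans []Span) error {
	rec := Record{ProtectedAt: time.Now(), Spans: restoreSpans}
	id, err := pts.Protect(ctx, rec)
	if err != nil {
		return fmt.Errorf("protecting restore spans: %w", err)
	}
	// Release the record whether the restore succeeds or fails, so GC can
	// resume on these spans afterwards.
	defer func() { _ = pts.Release(ctx, id) }()

	return runRestoreFlows(ctx, restoreSpans) // the actual distributed restore work
}

func runRestoreFlows(ctx context.Context, spans []Span) error { return nil }
```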

Jira issue: CRDB-21123

@msbutler msbutler added the C-bug and T-disaster-recovery labels Nov 2, 2022
@msbutler msbutler self-assigned this Nov 2, 2022
@blathers-crl

blathers-crl bot commented Nov 2, 2022

cc @cockroachdb/disaster-recovery

adityamaru added a commit to adityamaru/cockroach that referenced this issue Nov 3, 2022
Previously, we would ignore SQL updates for databases
that are offline. As explained in cockroachdb#91148, in some situations
we require writing a protected timestamp to an offline database
during a restore. This will require the spanconfig machinery to
reconcile the PTS record so that the GC queue can observe
and respect it.

We already reconcile SQL updates for offline tables, so it seems
reasonable to do the same for offline databases.

Fixes: cockroachdb#91149

Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Nov 4, 2022
craig bot pushed a commit that referenced this issue Nov 5, 2022
91173: spanconfigsqltranslator: translate spanconfigs for offline dbs r=adityamaru a=adityamaru

Previously, we would ignore SQL updates for databases that are offline. As explained in #91148, in some situations we require writing a protected timestamp to an offline database during a restore. This will require the spanconfig machinery to reconcile the PTS record so that the GC queue can observe and respect it.

We already reconcile SQL updates for offline tables, so it seems reasonable to do the same for offline databases.

Fixes: #91149

Release note: None

Co-authored-by: adityamaru <[email protected]>
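For context, a hypothetical sketch of the behavior change this PR describes; the types and the `shouldTranslate` helper below are simplified stand-ins, not the actual `spanconfigsqltranslator` code:

```go
// Hypothetical sketch: translate span configs for offline databases instead
// of skipping them, so PTS records written during a restore are reconciled
// and visible to the GC queue. Types are illustrative, not the real
// spanconfig/descpb definitions.
package spanconfigdemo

type descriptorState int

const (
	statePublic descriptorState = iota
	stateOffline
	stateDropped
)

type dbDescriptor struct {
	name  string
	state descriptorState
}

// shouldTranslate decides whether a database descriptor participates in
// span config reconciliation.
func shouldTranslate(db dbDescriptor) bool {
	switch db.state {
	case statePublic:
		return true
	case stateOffline:
		// Previously this returned false; offline databases are now
		// translated as well.
		return true
	default:
		return false
	}
}
```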
blathers-crl bot pushed a commit that referenced this issue Nov 5, 2022
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 16, 2022
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 19, 2022
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 22, 2022
This allows the in-memory job in the restore resumer to capture
PTS modifications.

Informs cockroachdb#91148

Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 23, 2022
craig bot pushed a commit that referenced this issue Nov 24, 2022
92295: jobs: pass in memory job to pts manager api r=fqazi a=msbutler

This allows the in-memory job in the restore resumer to stay up to date with PTS modifications.

Informs #91148

Release note: None

Co-authored-by: Michael Butler <[email protected]>
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 30, 2022
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 30, 2022
msbutler added a commit to msbutler/cockroach that referenced this issue Dec 5, 2022
craig bot pushed a commit that referenced this issue Dec 5, 2022
89721: multitenant: re-enable admission control fairness tests r=irfansharif a=cucaroach

Previously these tests were disabled for being flaky. Re-enable them
and increase tenant resource limits to prevent throughput collapse;
it's not clear why this wasn't an issue originally. Also disable running the
tests without admission control, as that mode is flaky and no longer of
interest.

Also includes some commented-out code to attempt to use Prometheus and
Grafana; I couldn't get it to work, but it's probably close.

Fixes: #82033, #83994

Release note: None


91324: server: implement unified status server r=knz,abarganier a=dhartunian

Previously, we had a separate tenant status server that implemented a subset of
status server RPC handlers for SQL tenants to use. This commit modifies the
existing status server to create a single implementation that can be shared
between app and system tenants.

There are two primary reasons why the two implementations diverged:

1. The tenant server simply does not have many capabilities that would allow it
to serve certain StatusServer requests. For example: gossip. This is simple to
reconcile as certain RPCs will return errors. Some additional work is necessary
to ensure that tenant implementations don't panic and return proper responses.
Previously, this was accomplished via the base implementation that would return
"Unimplemented" errors for all handlers; now we will have to modify our
handlers to know when they are run as tenants.

2. Requests that require fan-out to either nodes in a cluster or instances in a
tenant require different code to execute. This has now been moved behind an
interface called `ServerIterator` that has two implementations: one for nodes,
and another for tenant instances.

Once we have the admin server migrated as well, the full API V2 server can be
implemented on tenants and we should have simpler feature parity between the
two.
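As a rough illustration of that design, here is a hypothetical sketch of the fan-out pattern; the names below are placeholders, not the actual interface added by the PR:

```go
// Hypothetical sketch of the fan-out abstraction: one interface, two
// implementations, so status handlers can be shared between the system
// tenant (fanning out to KV nodes) and app tenants (fanning out to SQL
// instances). Names are illustrative only.
package statusdemo

import "context"

// ServerIterator abstracts "all the servers this status server should
// fan a request out to".
type ServerIterator interface {
	// ListServers returns identifiers for every reachable server.
	ListServers(ctx context.Context) ([]string, error)
	// Dial returns a per-server RPC client for the given identifier.
	Dial(ctx context.Context, id string) (StatusClient, error)
}

// StatusClient is a stand-in for a per-server status RPC client.
type StatusClient interface {
	LocalStatus(ctx context.Context) (string, error)
}

// fanOutStatus runs the same handler body regardless of whether the
// iterator walks KV nodes or tenant SQL instances.
func fanOutStatus(ctx context.Context, it ServerIterator) (map[string]string, error) {
	ids, err := it.ListServers(ctx)
	if err != nil {
		return nil, err
	}
	out := make(map[string]string, len(ids))
	for _, id := range ids {
		client, err := it.Dial(ctx, id)
		if err != nil {
			return nil, err
		}
		status, err := client.LocalStatus(ctx)
		if err != nil {
			return nil, err
		}
		out[id] = status
	}
	return out, nil
}
```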

Contributes to: #80789
Epic: [CRDB-17356](https://cockroachlabs.atlassian.net/browse/CRDB-17356)

Release note: None

91991: backupccl: issue protected timestamps during on restore spans r=adityamaru a=msbutler

Fixes #91148

Release note: None

93008: rowexec: high frequency cancel checking for row exec engine r=DrewKimball,yuzefovich a=msirek

Informs #92753

The row execution engine is slower than the vectorized one, and any additional slowdowns caused by contention or other factors may make the cancel checker unresponsive, because each call to `Check()` could occur at intervals of 350 ms or longer. This can impact SQLSmith tests, which expect a 1-minute statement timeout to be honored, causing the test to time out with an error after 5 minutes have elapsed.

The solution is to increase the frequency of the cancel checker for calls to `Check()` from the row engine, from once every 1024 calls to once every 128 calls.
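For illustration, a simplified sketch of the interval-based checking pattern being tuned here; the type and constant below are hypothetical stand-ins, not the actual `cancelchecker` implementation:

```go
// Hypothetical sketch of an interval-based cancel checker: the context is
// only consulted once every checkInterval calls so the common path stays
// cheap. Lowering the interval (e.g. 1024 -> 128) makes cancellation more
// responsive at slightly higher cost per call.
package canceldemo

import "context"

const checkInterval = 128 // per the PR, the row engine previously checked once every 1024 calls

type CancelChecker struct {
	ctx       context.Context
	callCount uint32
}

func NewCancelChecker(ctx context.Context) *CancelChecker {
	return &CancelChecker{ctx: ctx}
}

// Check returns the context's error, but only actually inspects the
// context once every checkInterval invocations.
func (c *CancelChecker) Check() error {
	c.callCount++
	if c.callCount%checkInterval != 0 {
		return nil
	}
	return c.ctx.Err()
}
```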

Release note: None

93063: roachtest: update version map for 22.2.0 r=ZhouXing19 a=ZhouXing19

Links epic: https://cockroachlabs.atlassian.net/browse/REL-184

Release note: None

93069: vendor: bump Pebble to 4a63cdb3a71e r=coolcom200 a=jbowens

```
4a63cdb3 crossversion: gracefully handle parallel test failures
0fd6d402 docs: update virtual sstables RFC filename, status
a08baf44 ci: temporarily skip the linux-race job
a3c599e2 crossversion: allow run dir to not exist
fb84a7b8 db: change LazyFetcher.ValueFetcher to an interface
8e5e7973 db: make EnableValueBlocks dynamically configurable
7d9a5b2e db: read path for values in value blocks
936e011b rfc: virtual sstables in the ingestion path
fcf9e404 internal/rangekey: avoid Transform allocation
fece1a6f db: use bytealloc.A for key buffering
ec94ead4 internal/rangekey: reuse merging buffers
630e6e90 internal/rangekey: reuse defragmenting buffers
4613f12b db: reuse RangeKeyData slice
6ee5cca6 vfs: fix typo
b9289d76 base: add LazyValue.TryGetShortAttribute
87eccabb metamorphic: prevent directory collisions
```

Release note: None
Epic: None

93085: ptcache: use simpler TestServer for tests r=andreimatei a=andreimatei

These tests were creating a 1-node TestCluster instead of a simpler TestServer for no apparent reason. This patch switches to TestServer.

Besides being more straightforward, the TestServer is better because the test is doing funky stuff: combining the server with an external Cache using the server's stopper. Before, it was using the cluster's stopper, which is different from the server's stopper, which in turn was causing problems because their tracers are different.

Release note: None
Epic: None

Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Mark Sirek <[email protected]>
Co-authored-by: Jane Xing <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
@craig craig bot closed this as completed in 0bfc1b0 Dec 6, 2022
@shermanCRL
Contributor

I assume we can’t backport due to proto change but I’ll ask anyway...

@msbutler
Collaborator Author

I don't think we can backport this. If a customer does encounter this issue, which we haven't ever seen in an escalation, they can always increase the GC TTL on the target (or its parent) that they are attempting to restore into as a workaround.
