
backupccl: issue protected timestamp for restore spans #91148

Closed
msbutler opened this issue Nov 2, 2022 · 3 comments · Fixed by #91991
Assignees: msbutler
Labels: A-disaster-recovery, C-bug, T-disaster-recovery

Comments

@msbutler
Collaborator

msbutler commented Nov 2, 2022

Currently, we don't issue a protected timestamp over the spans we're about to restore, which on the surface does not seem like an issue, since restore doesn't overwrite existing keys. The lack of a PTS can cause the restore to fail, however, if the span we're restoring into has a very low GC TTL, say 5 minutes. Specifically, if an AddSSTable request takes longer than the GC TTL, then its batch request timestamp may be earlier than the replica's GC threshold, causing the AddSSTable to fail.

We can prevent this failure by preventing the GC threshold from advancing in the replicas where the restore is occurring. To do this, a PTS should be issued over the whole restore span before any restore flow begins.

Note that the AddSSTable request can't merely be retried at a higher batch request timestamp, as all the keys in the SST have timestamps written at the batch request timestamp. Bumping the batch timestamp would require rewriting the SST, which seems quite expensive.
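For illustration, here is a minimal Go sketch of the ordering proposed above, using hypothetical `PTSManager`, `Record`, and `Span` types rather than the real `protectedts` API: write one record covering all restore spans before any restore flow starts, and release it once the job finishes.

```go
// Hypothetical sketch (not the actual CockroachDB API): issue a protected
// timestamp record over the full set of restore spans before any restore
// flow starts, and release it when the job finishes.
package restoredemo

import (
	"context"
	"fmt"
	"time"
)

// Span and Record are simplified stand-ins for the real key spans and
// protectedts record types.
type Span struct{ StartKey, EndKey string }

type Record struct {
	ProtectedAt time.Time // GC may not advance past this timestamp on these spans
	Spans       []Span
}

// PTSManager is a hypothetical interface over the protected timestamp subsystem.
type PTSManager interface {
	Protect(ctx context.Context, rec Record) (recordID string, err error)
	Release(ctx context.Context, recordID string) error
}

// runRestore sketches the ordering the issue asks for: protect first, then
// start the restore flows, then release the record on every exit path.
func runRestore(ctx context.Context, pts PTSManager, restoreSpans []Span) error {
	rec := Record{ProtectedAt: time.Now(), Spans: restoreSpans}
	id, err := pts.Protect(ctx, rec)
	if err != nil {
		return fmt.Errorf("protecting restore spans: %w", err)
	}
	// Release the record whether the restore succeeds or fails, so GC can
	// resume on these spans afterwards.
	defer func() { _ = pts.Release(ctx, id) }()

	return runRestoreFlows(ctx, restoreSpans) // the actual distributed restore work
}

func runRestoreFlows(ctx context.Context, spans []Span) error { return nil }
```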

Jira issue: CRDB-21123

@msbutler msbutler added the C-bug and T-disaster-recovery labels Nov 2, 2022
@msbutler msbutler self-assigned this Nov 2, 2022
@blathers-crl

blathers-crl bot commented Nov 2, 2022

cc @cockroachdb/disaster-recovery

adityamaru added a commit to adityamaru/cockroach that referenced this issue Nov 3, 2022
Previously, we would ignore SQL updates for databases
that are offline. As explained in cockroachdb#91148, in some situations
we require writing a protected timestamp to an offline database
during a restore. This will require the spanconfig machinery to
reconcile the PTS record so that the GC queue can observe
and respect it.

We already reconcile SQL updates for offline tables, so it seems
reasonable to do the same for offline databases.

Fixes: cockroachdb#91149

Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Nov 4, 2022
craig bot pushed a commit that referenced this issue Nov 5, 2022
91173: spanconfigsqltranslator: translate spanconfigs for offline dbs r=adityamaru a=adityamaru

Previously, we would ignore SQL updates for databases that are offline. As explained in #91148, in some situations we require writing a protected timestamp to an offline database during a restore. This will require the spanconfig machinery to reconcile the PTS record so that the GC queue can observe and respect it.

We already reconcile SQL updates for offline tables, so it seems reasonable to do the same for offline databases.

Fixes: #91149

Release note: None

Co-authored-by: adityamaru <[email protected]>
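For context, a hypothetical sketch of the behavior change this PR describes; the types and the `shouldTranslate` helper below are simplified stand-ins, not the actual `spanconfigsqltranslator` code:

```go
// Hypothetical sketch: translate span configs for offline databases instead
// of skipping them, so PTS records written during a restore are reconciled
// and visible to the GC queue. Types are illustrative, not the real
// spanconfig/descpb definitions.
package spanconfigdemo

type descriptorState int

const (
	statePublic descriptorState = iota
	stateOffline
	stateDropped
)

type dbDescriptor struct {
	name  string
	state descriptorState
}

// shouldTranslate decides whether a database descriptor participates in
// span config reconciliation.
func shouldTranslate(db dbDescriptor) bool {
	switch db.state {
	case statePublic:
		return true
	case stateOffline:
		// Previously this returned false; offline databases are now
		// translated as well.
		return true
	default:
		return false
	}
}
```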
blathers-crl bot pushed a commit that referenced this issue Nov 5, 2022
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 16, 2022
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 19, 2022
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 22, 2022
This allows the in-memory job in the restore resumer to capture
PTS modifications.

Informs cockroachdb#91148

Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 23, 2022
craig bot pushed a commit that referenced this issue Nov 24, 2022
92295: jobs: pass in memory job to pts manager api r=fqazi a=msbutler

This allows the in-memory job in the restore resumer to stay up to date with PTS modifications.

Informs #91148

Release note: None

Co-authored-by: Michael Butler <[email protected]>
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 30, 2022
msbutler added a commit to msbutler/cockroach that referenced this issue Nov 30, 2022
msbutler added a commit to msbutler/cockroach that referenced this issue Dec 5, 2022
craig bot pushed a commit that referenced this issue Dec 5, 2022
89721: multitenant: re-enable admission control fairness tests r=irfansharif a=cucaroach

Previously these tests were disabled for being flaky. Re-enable them
and increase tenant resource limits to prevent throughput collapse;
it's not clear why this wasn't an issue originally. Also disable running the
tests without admission control, as that mode is flaky and no longer of
interest.

Also includes some commented-out code to attempt to use Prometheus and
Grafana; I couldn't get it to work, but it's probably close.

Fixes: #82033, #83994

Release note: None


91324: server: implement unified status server r=knz,abarganier a=dhartunian

Previously, we had a separate tenant status server that implemented a subset of
status server RPC handlers for SQL tenants to use. This commit modifies the
existing status server to create a single implementation that can be shared
between app and system tenants.

There are two primary reasons why the two implementations diverged:

1. The tenant server simply does not have many capabilities that would allow it
to serve certain StatusServer requests. For example: gossip. This is simple to
reconcile as certain RPCs will return errors. Some additional work is necessary
to ensure that tenant implementations don't panic and return proper responses.
Previously, this was accomplished via the base implementation that would return
"Unimplemented" errors for all handlers; now we will have to modify our
handlers to know when they are run as tenants.

2. Requests that require fan-out to either nodes in a cluster or instances in a
tenant require different code to execute. This has now been moved behind an
interface called `ServerIterator` that has two implementations: one for nodes,
and another for tenant instances.

Once we have the admin server migrated as well, the full API V2 server can be
implemented on tenants and we should have simpler feature parity between the
two.
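As a rough illustration of that design, here is a hypothetical sketch of the fan-out pattern; the names below are placeholders, not the actual interface added by the PR:

```go
// Hypothetical sketch of the fan-out abstraction: one interface, two
// implementations, so status handlers can be shared between the system
// tenant (fanning out to KV nodes) and app tenants (fanning out to SQL
// instances). Names are illustrative only.
package statusdemo

import "context"

// ServerIterator abstracts "all the servers this status server should
// fan a request out to".
type ServerIterator interface {
	// ListServers returns identifiers for every reachable server.
	ListServers(ctx context.Context) ([]string, error)
	// Dial returns a per-server RPC client for the given identifier.
	Dial(ctx context.Context, id string) (StatusClient, error)
}

// StatusClient is a stand-in for a per-server status RPC client.
type StatusClient interface {
	LocalStatus(ctx context.Context) (string, error)
}

// fanOutStatus runs the same handler body regardless of whether the
// iterator walks KV nodes or tenant SQL instances.
func fanOutStatus(ctx context.Context, it ServerIterator) (map[string]string, error) {
	ids, err := it.ListServers(ctx)
	if err != nil {
		return nil, err
	}
	out := make(map[string]string, len(ids))
	for _, id := range ids {
		client, err := it.Dial(ctx, id)
		if err != nil {
			return nil, err
		}
		status, err := client.LocalStatus(ctx)
		if err != nil {
			return nil, err
		}
		out[id] = status
	}
	return out, nil
}
```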

Contributes to: #80789
Epic: [CRDB-17356](https://cockroachlabs.atlassian.net/browse/CRDB-17356)

Release note: None

91991: backupccl: issue protected timestamps during on restore spans r=adityamaru a=msbutler

Fixes #91148

Release note: None

93008: rowexec: high frequency cancel checking for row exec engine r=DrewKimball,yuzefovich a=msirek

Informs #92753

The row execution engine is slower than the vectorized one, and any additional slowdowns caused by contention or other factors may make the cancel checker unresponsive, because each call to `Check()` could occur at intervals of 350 ms or longer. This can impact SQLSmith tests, which expect a 1-minute statement timeout to be honored, causing the test to time out with an error after 5 minutes have elapsed.

The solution is to increase the frequency of the cancel checker for calls to `Check()` from the row engine, from once every 1024 calls to once every 128 calls.
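For illustration, a simplified sketch of the interval-based checking pattern being tuned here; the type and constant below are hypothetical stand-ins, not the actual `cancelchecker` implementation:

```go
// Hypothetical sketch of an interval-based cancel checker: the context is
// only consulted once every checkInterval calls so the common path stays
// cheap. Lowering the interval (e.g. 1024 -> 128) makes cancellation more
// responsive at slightly higher cost per call.
package canceldemo

import "context"

const checkInterval = 128 // per the PR, the row engine previously checked once every 1024 calls

type CancelChecker struct {
	ctx       context.Context
	callCount uint32
}

func NewCancelChecker(ctx context.Context) *CancelChecker {
	return &CancelChecker{ctx: ctx}
}

// Check returns the context's error, but only actually inspects the
// context once every checkInterval invocations.
func (c *CancelChecker) Check() error {
	c.callCount++
	if c.callCount%checkInterval != 0 {
		return nil
	}
	return c.ctx.Err()
}
```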

Release note: None

93063: roachtest: update version map for 22.2.0 r=ZhouXing19 a=ZhouXing19

Links epic: https://cockroachlabs.atlassian.net/browse/REL-184

Release note: None

93069: vendor: bump Pebble to 4a63cdb3a71e r=coolcom200 a=jbowens

```
4a63cdb3 crossversion: gracefully handle parallel test failures
0fd6d402 docs: update virtual sstables RFC filename, status
a08baf44 ci: temporarily skip the linux-race job
a3c599e2 crossversion: allow run dir to not exist
fb84a7b8 db: change LazyFetcher.ValueFetcher to an interface
8e5e7973 db: make EnableValueBlocks dynamically configurable
7d9a5b2e db: read path for values in value blocks
936e011b rfc: virtual sstables in the ingestion path
fcf9e404 internal/rangekey: avoid Transform allocation
fece1a6f db: use bytealloc.A for key buffering
ec94ead4 internal/rangekey: reuse merging buffers
630e6e90 internal/rangekey: reuse defragmenting buffers
4613f12b db: reuse RangeKeyData slice
6ee5cca6 vfs: fix typo
b9289d76 base: add LazyValue.TryGetShortAttribute
87eccabb metamorphic: prevent directory collisions
```

Release note: None
Epic: None

93085: ptcache: use simpler TestServer for tests r=andreimatei a=andreimatei

These tests were creating a 1-node TestCluster instead of a simpler TestServer for no apparent reason. This patch switches to TestServer.

Besides being more straightforward, the TestServer is better because the test is doing funky stuff: combining the server with an external Cache using the server's stopper. Before, it was using the cluster's stopper, which is different from the server's stopper, which in turn was causing problems because their tracers are different.

Release note: None
Epic: None

Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Mark Sirek <[email protected]>
Co-authored-by: Jane Xing <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
@craig craig bot closed this as completed in 0bfc1b0 Dec 6, 2022
@shermanCRL
Contributor

I assume we can’t backport due to proto change but I’ll ask anyway...

@msbutler
Collaborator Author

I don't think we can backport this. If a customer does encounter this issue, which we haven't ever seen in an escalation, they can always increase the GC TTL on the target (or its parent) that they are attempting to restore into as a workaround.
