backupccl: issue protected timestamp for restore spans #91148
Labels: A-disaster-recovery · C-bug (code not up to spec/doc; specs & docs deemed correct, solution expected to change code/behavior) · T-disaster-recovery
Comments
msbutler added the C-bug and T-disaster-recovery labels on Nov 2, 2022
cc @cockroachdb/disaster-recovery
This was referenced Nov 2, 2022
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Nov 3, 2022:
Previously, we would ignore sql updates for databases that are offline. As explained in cockroachdb#91148, in some situations we require writing a protected timestamp to an offline database during a restore. This will require the spanconfig machinery to reconcile the PTS record so that the GC queue can observe and respect it. We already reconcile sql updates for offline tables, so it seems reasonable to do the same for offline databases. Fixes: cockroachdb#91149 Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Nov 4, 2022:
Previously, we would ignore sql updates for databases that are offline. As explained in cockroachdb#91148, in some situations we require writing a protected timestamp to an offline database during a restore. This will require the spanconfig machinery to reconcile the PTS record so that the GC queue can observe and respect it. We already reconcile sql updates for offline tables, so it seems reasonable to do the same for offline databases. Fixes: cockroachdb#91149 Release note: None
craig bot pushed a commit that referenced this issue on Nov 5, 2022:
91173: spanconfigsqltranslator: translate spanconfigs for offline dbs r=adityamaru a=adityamaru Previously, we would ignore sql updates for databases that are offline. As explained in #91148, in some situations we require writing a protected timestamp to an offline database during a restore. This will require the spanconfig machinery to reconcile the PTS record so that the GC queue can observe and respect it. We already reconcile sql updates for offline tables, so it seems reasonable to do the same for offline databases. Fixes: #91149 Release note: None Co-authored-by: adityamaru <[email protected]>
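A minimal Go sketch of the behavior change above, using invented type and function names rather than the actual `spanconfigsqltranslator` code: descriptors in the OFFLINE state are no longer skipped during span config translation, so a PTS written to an offline database gets reconciled and the GC queue can respect it.

```go
// Illustrative sketch only; types and names are hypothetical, not the
// CockroachDB spanconfig implementation.
package main

import "fmt"

type DescriptorState int

const (
	StatePublic DescriptorState = iota
	StateOffline
	StateDropped
)

type Descriptor struct {
	Name  string
	State DescriptorState
}

// shouldTranslate reports whether span configs should be generated for a
// descriptor. Previously an offline database returned false here; after the
// change it is translated, just like an offline table.
func shouldTranslate(d Descriptor) bool {
	switch d.State {
	case StatePublic, StateOffline: // offline now included
		return true
	default:
		return false
	}
}

func main() {
	restoring := Descriptor{Name: "restoring_db", State: StateOffline}
	fmt.Println(shouldTranslate(restoring)) // true: its PTS can be reconciled
}
```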
blathers-crl bot pushed a commit that referenced this issue on Nov 5, 2022:
Previously, we would ignore sql updates for databases that are offline. As explained in #91148, in some situations we require writing a protected timestamp to an offline database during a restore. This will require the spanconfig machinery to reconcile the PTS record so that the GC queue can observe and respect it. We already reconcile sql updates for offline tables, so it seems reasonable to do the same for offline databases. Fixes: #91149 Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue on Nov 16, 2022:
Fixes cockroachdb#91148 Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue on Nov 19, 2022:
Fixes cockroachdb#91148 Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue on Nov 22, 2022:
This allows the in-memory job in the restore resumer to capture pts modifications. Informs cockroachdb#91148 Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue on Nov 23, 2022:
This allows the in-memory job in the restore resumer to capture pts modifications. Informs cockroachdb#91148 Release note: None
craig bot pushed a commit that referenced this issue on Nov 24, 2022:
92295: jobs: pass in-memory job to pts manager api r=fqazi a=msbutler This allows the in-memory job in the restore resumer to stay up to date with pts modifications. Informs #91148 Release note: None Co-authored-by: Michael Butler <[email protected]>
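To make the idea in #92295 concrete, here is a hedged sketch with invented names (not the actual jobs package API): the PTS manager mutates the job record when it writes a protection, so handing it the resumer's in-memory job keeps that copy current instead of leaving it stale until the next reload.

```go
// Illustrative sketch only; Job, Manager, and Protect are hypothetical
// stand-ins for the real jobs/protectedts machinery.
package main

import "fmt"

type Job struct {
	ID          int64
	PTSRecordID string // set when a PTS record is written for this job
}

type Manager struct{}

// Protect stands in for writing a PTS record (persistence elided) and
// stamping its ID onto the caller-supplied in-memory job, so the caller
// observes the modification immediately.
func (m *Manager) Protect(j *Job, recordID string) {
	// ... persist the PTS record and the job update transactionally ...
	j.PTSRecordID = recordID
}

func main() {
	mgr := &Manager{}
	job := &Job{ID: 42} // the restore resumer's in-memory job
	mgr.Protect(job, "pts-uuid-1234")
	fmt.Println(job.PTSRecordID) // up to date without re-reading the job row
}
```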
msbutler added a commit to msbutler/cockroach that referenced this issue on Nov 30, 2022:
Fixes cockroachdb#91148 Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue on Nov 30, 2022:
Fixes cockroachdb#91148 Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue on Dec 5, 2022:
Fixes cockroachdb#91148 Release note: None
craig bot pushed a commit that referenced this issue on Dec 5, 2022:
89721: multitenant: re-enable admission control fairness tests r=irfansharif a=cucaroach

Previously these tests were disabled for being flaky. Re-enable them and increase tenant resource limits to prevent throughput collapse; it is not clear why this wasn't an issue originally. Also disable running the tests without admission control, as that mode is flaky and no longer of interest. Also includes some commented-out code to attempt to use Prometheus/Grafana; I couldn't get it to work but it's probably close. Fixes: #82033, #83994 Release note: None

91324: server: implement unified status server r=knz,abarganier a=dhartunian

Previously, we had a separate tenant status server that implemented a subset of status server RPC handlers for SQL tenants to use. This commit modifies the existing status server to create a single implementation that can be shared between app and system tenants. There are two primary reasons why the two implementations diverged:

1. The tenant server simply does not have many capabilities that would allow it to serve certain StatusServer requests, for example gossip. This is simple to reconcile, as certain RPCs will return errors. Some additional work is necessary to ensure that tenant implementations don't panic and return proper responses. Previously, this was accomplished via the base implementation that would return "Unimplemented" errors for all handlers; now we will have to modify our handlers to know when they are run as tenants.
2. Requests that require fan-out, to either nodes in a cluster or instances in a tenant, require different code to execute. This has now been moved behind an interface called `ServerIterator` that has two implementations: one for nodes, and another for tenant instances.

Once we have the admin server migrated as well, the full API V2 server can be implemented on tenants and we should have simpler feature parity between the two. Contributes to: #80789 Epic: [CRDB-17356](https://cockroachlabs.atlassian.net/browse/CRDB-17356) Release note: None

91991: backupccl: issue protected timestamps on restore spans r=adityamaru a=msbutler

Fixes #91148 Release note: None

93008: rowexec: high-frequency cancel checking for row exec engine r=DrewKimball,yuzefovich a=msirek

Informs #92753. The row execution engine is slower than the vectorized one, and any additional slowdowns caused by contention or other factors may make the cancel checker unresponsive, because each call to `Check()` could occur at intervals of 350 ms or longer. This can impact SQLSmith tests, which expect a 1-minute statement timeout to be honored, timing out the test with an error after 5 minutes have elapsed. The solution is to increase the frequency of the cancel checker for calls to `Check()` from the row engine from once every 1024 calls to once every 128 calls. Release note: None

93063: roachtest: update version map for 22.2.0 r=ZhouXing19 a=ZhouXing19

Links epic https://cockroachlabs.atlassian.net/browse/REL-184 Release note: None

93069: vendor: bump Pebble to 4a63cdb3a71e r=coolcom200 a=jbowens

```
4a63cdb3 crossversion: gracefully handle parallel test failures
0fd6d402 docs: update virtual sstables RFC filename, status
a08baf44 ci: temporarily skip the linux-race job
a3c599e2 crossversion: allow run dir to not exist
fb84a7b8 db: change LazyFetcher.ValueFetcher to an interface
8e5e7973 db: make EnableValueBlocks dynamically configurable
7d9a5b2e db: read path for values in value blocks
936e011b rfc: virtual sstables in the ingestion path
fcf9e404 internal/rangekey: avoid Transform allocation
fece1a6f db: use bytealloc.A for key buffering
ec94ead4 internal/rangekey: reuse merging buffers
630e6e90 internal/rangekey: reuse defragmenting buffers
4613f12b db: reuse RangeKeyData slice
6ee5cca6 vfs: fix typo
b9289d76 base: add LazyValue.TryGetShortAttribute
87eccabb metamorphic: prevent directory collisions
```

Release note: None Epic: None

93085: ptcache: use simpler TestServer for tests r=andreimatei a=andreimatei

These tests were creating a 1-node TestCluster instead of a simpler TestServer for no apparent reason. This patch switches to TestServer. Besides being more straightforward, the TestServer is better because the test is doing funky stuff, combining the server with an external Cache using the server's stopper. Before, it was using the cluster's stopper, which is different from the server's stopper, which in turn was causing problems because their tracers are different. Release note: None Epic: None

Co-authored-by: Tommy Reilly <[email protected]> Co-authored-by: irfan sharif <[email protected]> Co-authored-by: David Hartunian <[email protected]> Co-authored-by: Michael Butler <[email protected]> Co-authored-by: Mark Sirek <[email protected]> Co-authored-by: Jane Xing <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: Andrei Matei <[email protected]>
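The rowexec change in #93008 is easy to picture with a toy cancel checker (illustrative only; the struct and its fields are assumptions, not the CockroachDB implementation): `Check()` amortizes the cost of reading the context by consulting it only every N calls, and shrinking N from 1024 to 128 for the row engine bounds how long a cancellation can go unnoticed.

```go
// Toy cancel checker sketching the mechanism described above.
package main

import (
	"context"
	"fmt"
)

type CancelChecker struct {
	ctx      context.Context
	calls    uint64
	interval uint64 // e.g. 128 instead of 1024 for the slower row engine
}

// Check consults the context's cancellation status once every interval
// calls, so a tighter interval means faster reaction to a statement timeout.
func (c *CancelChecker) Check() error {
	c.calls++
	if c.calls%c.interval == 0 {
		if err := c.ctx.Err(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate a statement timeout firing immediately

	cc := &CancelChecker{ctx: ctx, interval: 128}
	for i := 0; i < 1024; i++ {
		if err := cc.Check(); err != nil {
			fmt.Println("noticed cancellation after", i+1, "calls:", err)
			return
		}
	}
}
```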
I assume we can't backport due to the proto change, but I'll ask anyway...
I don't think we can backport this. If a customer does encounter this issue (which we have never seen in an escalation), they can always work around it by increasing the GC TTL on the target, or the parent of the target, they are attempting to restore.
Issue description
Currently, we don't issue a protected timestamp over the spans we're about to restore, which on the surface does not seem like an issue, since restore doesn't overwrite existing keys. The lack of a PTS can cause the restore to fail, however, if the span we're restoring into has a very low GC TTL, say 5 minutes. Specifically, if an AddSSTable request takes longer than the GC TTL, its batch request timestamp may be earlier than the replica's GC threshold, causing the AddSSTable to fail.
We can prevent this failure by preventing the GC threshold from advancing on replicas where the restore is occurring. To do this, a PTS should be issued over the whole restore span before any restore flow begins.
Note that the AddSSTable request can't merely be retried at a higher batch request timestamp, because all the keys in the SST have timestamps written at the batch request timestamp. Bumping the batch timestamp would require rewriting the SST, which seems quite expensive.
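For intuition, here is a minimal Go sketch of the proposed shape of the fix, with hypothetical helper names (a real implementation would go through the protected timestamp subsystem): write one PTS record covering every restore span before the flow starts, and release it only once the restore completes.

```go
// Illustrative sketch only; Span, ptsRecord, and protectRestoreSpans are
// hypothetical stand-ins for the real protectedts machinery.
package main

import (
	"fmt"
	"time"
)

type Span struct{ Key, EndKey string }

type ptsRecord struct {
	protectedAt time.Time
	spans       []Span
}

// protectRestoreSpans stands in for persisting a PTS record (elided): GC of
// data below protectedAt is blocked on the covered spans until release.
func protectRestoreSpans(spans []Span, asOf time.Time) *ptsRecord {
	return &ptsRecord{protectedAt: asOf, spans: spans}
}

func main() {
	restoreSpans := []Span{{Key: "/Table/104", EndKey: "/Table/105"}}
	rec := protectRestoreSpans(restoreSpans, time.Now())
	defer fmt.Println("release PTS record once the restore completes")

	fmt.Printf("protected %d span(s) at %s before starting the restore flow\n",
		len(rec.spans), rec.protectedAt.Format(time.RFC3339))
	// ... run the restore flow; AddSSTable requests can no longer race the
	// GC threshold on the target replicas ...
}
```

Releasing the record only after the restore finishes is what keeps the GC threshold pinned for the whole flow, including AddSSTable requests that outlast a short GC TTL.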
Jira issue: CRDB-21123