
[persist] Restore Blob state based on the contents of Consensus #20482

Merged (15 commits) on Oct 19, 2023

Conversation


@bkirwi bkirwi commented Jul 12, 2023

Persist state is a mix of stuff in CRDB and stuff in S3. This complicates restoring from backup: if we restore CRDB to an old state, the blobs it references might be deleted.

It's possible to "undelete" deleted blobs in S3 by removing the delete marker from a versioned bucket. This PR implements that logic as a CLI tool, and adds a basic test that restores a CRDB state and checks that the database is in the expected state after the CLI runs.
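In a version-enabled bucket, a delete doesn't erase data; it stacks a delete marker on top of the newest version, and removing that marker makes the prior version current again. A minimal in-memory model of that mechanic (illustrative types, not the S3 API or the PR's code):

```rust
use std::collections::HashMap;

// A versioned "bucket": each key maps to a stack of versions, newest
// last. A delete appends a marker instead of erasing anything.
#[derive(Clone, Debug, PartialEq)]
enum Version {
    Data(Vec<u8>),
    DeleteMarker,
}

#[derive(Default)]
struct Bucket {
    objects: HashMap<String, Vec<Version>>,
}

impl Bucket {
    fn put(&mut self, key: &str, data: &[u8]) {
        self.objects
            .entry(key.to_string())
            .or_default()
            .push(Version::Data(data.to_vec()));
    }

    // "Soft" delete: push a delete marker on top of the stack.
    fn delete(&mut self, key: &str) {
        self.objects
            .entry(key.to_string())
            .or_default()
            .push(Version::DeleteMarker);
    }

    // Current value: the newest version, unless it's a delete marker.
    fn get(&self, key: &str) -> Option<&[u8]> {
        match self.objects.get(key)?.last()? {
            Version::Data(d) => Some(d.as_slice()),
            Version::DeleteMarker => None,
        }
    }

    // Undelete: pop the newest delete marker, resurfacing the version
    // underneath. Returns true if a marker was actually removed.
    fn undelete(&mut self, key: &str) -> bool {
        let versions = match self.objects.get_mut(key) {
            Some(v) => v,
            None => return false,
        };
        if matches!(versions.last(), Some(Version::DeleteMarker)) {
            versions.pop();
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut bucket = Bucket::default();
    bucket.put("blob/0001", b"batch data");
    bucket.delete("blob/0001");
    assert_eq!(bucket.get("blob/0001"), None);
    assert!(bucket.undelete("blob/0001"));
    assert_eq!(bucket.get("blob/0001"), Some(&b"batch data"[..]));
    println!("undelete resurfaced blob/0001");
}
```

This is also why the restore only works if versioning was enabled before the deletes happened: without the version stack, a delete is irreversible.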

Motivation

See #18157 for the design.

Tips for reviewer

One subtlety here is that we can't rely on the usual state iterators, since those rely on the existence of rollups, which may not exist yet. The current draft just iterates over the diffs manually.
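Skipping the rollups means reconstructing state as a left fold over the raw diffs: start empty and apply each diff in sequence. A toy sketch of that replay, with a stand-in diff type (the real persist diffs are much richer than a pair of key lists):

```rust
use std::collections::BTreeSet;

// Stand-in for one consensus diff: blob keys added and removed at a
// single sequence number.
struct StateDiff {
    inserts: Vec<String>,
    deletes: Vec<String>,
}

// With no rollup to start from, reconstruct the live blob set by
// replaying every diff in order from the very beginning.
fn replay_diffs(diffs: &[StateDiff]) -> BTreeSet<String> {
    let mut live = BTreeSet::new();
    for diff in diffs {
        for key in &diff.inserts {
            live.insert(key.clone());
        }
        for key in &diff.deletes {
            live.remove(key);
        }
    }
    live
}

fn main() {
    let diffs = vec![
        StateDiff { inserts: vec!["b0".into(), "b1".into()], deletes: vec![] },
        StateDiff { inserts: vec!["b2".into()], deletes: vec!["b0".into()] },
    ];
    let live = replay_diffs(&diffs);
    assert_eq!(live.into_iter().collect::<Vec<_>>(), vec!["b1", "b2"]);
}
```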

I have not yet tested this tool against a real environment. I will probably wait for at least a first pass of code review first!

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@bkirwi bkirwi force-pushed the undelete branch 2 times, most recently from 25ee5bb to 630a94e on October 6, 2023
src/persist/src/location.rs (review thread, resolved)
@philip-stoev (Contributor):

@bkirwi as discussed in Slack, we would like to add backup/restore to various other CI workloads. Can you please add me as a reviewer to this PR once it reaches the state where this can be done? Thanks!

@bkirwi bkirwi changed the title [sketch] Restore Blob state based on the contents of Consensus [persist] Restore Blob state based on the contents of Consensus Oct 12, 2023
@bkirwi bkirwi (Contributor, Author) left a comment:

@philip-stoev - I think the test is now in decent shape. I've left a few comments on the test file in case they're useful as you're thinking about extracting code.

(FWIW, I'm a bit nervous about this part: a correct restore process involves some environmental assumptions (i.e. no MZs running) and a specific sequence of calls to unrelated tools, which seems pretty brittle. I'm not sure if there's a good way to add guardrails to this logic that would make it harder to mess that process up, or if we'll just need to be very careful as we're adding tests...)

Service(
    name="persistcli",
    config={
        # TODO: depends!
bkirwi (Contributor, Author):

Dependencies definitely don't matter for this script, but I haven't thought about whether we'd want to make that explicit if we extract this out.

"minioadmin",
)
c.run("mc", "version", "enable", "persist/persist")
blob_uri = "s3://minioadmin:minioadmin@persist/persist?endpoint=http://minio:9000/&region=minio"
bkirwi (Contributor, Author):

AFAICT this would always be worthwhile for minio-based tests, but it does involve yet another container in the mix, which the existing minio tool goes out of its way to avoid.

Contributor:

I am going to do this in Platform checks and other frameworks, so most minio-based tests will start using versioning.

"materialized"
) # Very important that MZ is not running during the restore process!
c.exec(
"cockroach", "cockroach", "sql", "--insecure", "-e", "DROP DATABASE defaultdb;"
bkirwi (Contributor, Author):

This immediately puts the database in an invalid state. It does not return to a valid state until the successful restore-blob call below. This is sort of a "critical section"; missing one of these steps, or having a materialize instance running while either of these things is happening, is likely to cause crashes or other breakage... but that breakage is expected.


bkirwi commented Oct 12, 2023

Oh, and the backup/restore test seems to be "passing" but fails when uploading results... I'm not sure if that goes away once it's merged or whether we need some manual intervention. (Or if @philip-stoev & co would prefer to fold it into an existing suite instead?)

@bkirwi bkirwi marked this pull request as ready for review October 12, 2023 20:27
@bkirwi bkirwi requested review from a team as code owners October 12, 2023 20:27
@danhhz danhhz (Contributor) left a comment:

cool, looks great!

ran out of time to completely finish the review, so I might trickle in a few more nits on the next pass, but I think this should be most of it

the python/mzcompose stuff all seems sane to me, but it's probably best to get another reviewer for that

src/persist-client/src/cli/admin.rs (4 review threads, resolved)
src/persist-client/src/internal/metrics.rs (review thread, resolved)
.await?;

let temp_dir = tempfile::tempdir().map_err(Error::from)?;
blob_impl_test(move |path| {
Contributor:

let's give MemBlob the same treatment (a flag to enable tombstoning and a test)
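One way to read this suggestion: give the in-memory blob a flag so deletes leave tombstones (bytes kept, but marked deleted) that a later restore can flip back. A hypothetical sketch, not the actual MemBlob API:

```rust
use std::collections::HashMap;

// Hypothetical in-memory blob store. When `tombstone` is set, delete
// marks an entry as deleted instead of erasing its bytes, so a later
// restore can bring it back.
struct MemBlob {
    tombstone: bool,
    // key -> (bytes, deleted?)
    entries: HashMap<String, (Vec<u8>, bool)>,
}

impl MemBlob {
    fn new(tombstone: bool) -> Self {
        MemBlob { tombstone, entries: HashMap::new() }
    }

    fn set(&mut self, key: &str, value: Vec<u8>) {
        self.entries.insert(key.to_string(), (value, false));
    }

    // Reads never see tombstoned entries.
    fn get(&self, key: &str) -> Option<&[u8]> {
        match self.entries.get(key) {
            Some((bytes, deleted)) if !*deleted => Some(bytes.as_slice()),
            _ => None,
        }
    }

    fn delete(&mut self, key: &str) {
        if self.tombstone {
            if let Some(entry) = self.entries.get_mut(key) {
                entry.1 = true; // keep the bytes, mark deleted
            }
        } else {
            self.entries.remove(key); // hard delete
        }
    }

    // Un-tombstone a blob; returns false if it's truly gone.
    fn restore(&mut self, key: &str) -> bool {
        match self.entries.get_mut(key) {
            Some(entry) => {
                entry.1 = false;
                true
            }
            None => false,
        }
    }
}

fn main() {
    let mut blob = MemBlob::new(true);
    blob.set("shard/part0", b"data".to_vec());
    blob.delete("shard/part0");
    assert_eq!(blob.get("shard/part0"), None);
    assert!(blob.restore("shard/part0"));
    assert_eq!(blob.get("shard/part0"), Some(&b"data"[..]));
}
```

With this shape, a unit test of the restore path can delete blobs "out from under" persist and then resurrect them, as the next comment suggests.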

}
Ok(())
}

/// Attempt to restore all the blobs referenced by the current state in consensus.
/// Returns a list of blobs that were not possible to restore.
async fn restore_blob(
Contributor:

Would be nice to have some basic unit test coverage of this (the MemBlob comment I left will make this easier). Kinda just thinking: start up persist, write some data, delete blobs out from under it, restore them, and then read the data. I think that won't tickle any of the correctness issues that we'd expect in an online restore?

src/persist/src/mem.rs (review thread, resolved)
src/persist-client/src/cli/admin.rs (review thread)

let mut shards = consensus.list_keys();
let mut not_restored = vec![];
while let Some(shard) = shards.next().await {
Contributor:

is it crazy to run restore_blob concurrently? feels like we might want some amount of concurrency in the tool for big envs and this might be an easy place to do it. otoh, I'm 100% okay leaving this as a TODO

bkirwi (Contributor, Author):

Not crazy but not trivial... I'll leave a TODO for now.

(If we do allow concurrency, that should probably be a CLI arg?)

Contributor:

👍 I think what I'm worried about here is that we end up in an incident and find out that the restore tool is gonna take many hours because we don't have any sort of concurrency. If we had some data that running this on a huge env still finished in O(minutes), then I definitely don't care. Fine with making it a cli arg, but I'd lean toward making the default of that arg to be some amount of concurrency, unless we have evidence that things tend to be fast enough even without it
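The shard loop above is the natural place to bound fan-out. A crude sketch of batched, thread-based concurrency with a hypothetical per-shard restore function (the real tool is async and talks to Consensus and Blob; this only illustrates the shape of a bounded "not restored" collection):

```rust
use std::thread;

// Stand-in for restoring one shard; the real restore_blob does real
// work against Consensus and Blob. Err marks a shard we couldn't fix.
fn restore_shard(shard: &str) -> Result<(), String> {
    if shard.is_empty() {
        Err("empty shard id".to_string())
    } else {
        Ok(())
    }
}

// Restore shards in batches of at most `concurrency` worker threads,
// collecting the failures (the tool's "not restored" list). Batching
// is a crude bound; an async tool would more likely use something
// like buffered concurrent futures.
fn restore_all(shards: Vec<String>, concurrency: usize) -> Vec<String> {
    let mut not_restored = Vec::new();
    for batch in shards.chunks(concurrency.max(1)) {
        let handles: Vec<_> = batch
            .iter()
            .cloned()
            .map(|shard| {
                thread::spawn(move || {
                    let ok = restore_shard(&shard).is_ok();
                    (shard, ok)
                })
            })
            .collect();
        for handle in handles {
            let (shard, ok) = handle.join().expect("worker panicked");
            if !ok {
                not_restored.push(shard);
            }
        }
    }
    not_restored
}

fn main() {
    let shards = vec!["s1".to_string(), "".to_string(), "s2".to_string()];
    let failed = restore_all(shards, 2);
    assert_eq!(failed, vec!["".to_string()]);
}
```

Making `concurrency` a CLI argument with a nonzero default matches the direction the reviewers land on below this thread.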

@philip-stoev philip-stoev requested review from philip-stoev and removed request for a team October 16, 2023 07:10
@philip-stoev (Contributor):

@bkirwi my plan on this is as follows:

  • I am going to look into the CI failure
  • I am going to add backup/restore to a few more testing frameworks. Depending on how straightforward it is, I will push the required changes to this PR or a follow-up one.

@philip-stoev philip-stoev self-requested a review October 16, 2023 11:33
@philip-stoev (Contributor):

I am getting the following on stdout when doing a restore:

2023-10-16T11:36:52.396497Z  WARN mz_persist_client::cli::admin: unhandled mz_persist_external_blob_sizes metric type: HISTOGRAM

is that expected?


bkirwi commented Oct 16, 2023

I am getting the following on stdout when doing a restore [...]

That's not too concerning -- we're just dumping out a bunch of internal metrics as a convenience here. I'll see if that warning is easy to avoid, though if not I may take it as a followup.


shepherdlybot bot commented Oct 16, 2023

This PR has higher risk. In addition to having a knowledgeable reviewer, it may be useful to add observability and/or a feature flag.

Risk Score: 🔴 80 / 100. Probability Buggy: 60%. File Hotspots: 0.

@bkirwi bkirwi force-pushed the undelete branch 2 times, most recently from e3c4767 to 2ed130d on October 16, 2023
@bkirwi bkirwi requested a review from danhhz October 16, 2023 18:44
@philip-stoev (Contributor):

@bkirwi I was able to get everything in Platform Checks backed up and then restored, but unfortunately persistcli admin --commit restore-blob seems to hang indefinitely, with all threads of the process blocked in tokio. I will continue investigating tomorrow.

@danhhz danhhz (Contributor) left a comment:

LGTM once QA is happy!

I'll see if that warning is easy to avoid, though if not I may take it as a followup.

I think just downgrade it to an info



src/persist-client/src/cli/admin.rs (review thread)
src/persist-client/src/internal/restore.rs (review thread, resolved)
src/persist-client/tests/machine/restore_blob (review thread)
@philip-stoev (Contributor):

@bkirwi The platform checks test that fails can be found at:

To github.com:philip-stoev/materialize.git
 * [new branch]            undelete-platform-checks-test -> undelete-platform-checks-test

To run:

cd test/platform-checks
./mzcompose down -v ; ./mzcompose run default --scenario=BackupAndRestoreAfterManipulate 

It will then hang as follows:

$ docker compose run persistcli admin --commit restore-blob --blob-uri='s3://minioadmin:minioadmin@persist/persist?endpoint=http://minio:9000/&region=minio' --consensus-uri='postgres://root@cockroach:26257?options=--search_path=consensus'
2023-10-17T06:39:36.391558Z  INFO tokio_postgres::connection: NOTICE: relation "consensus" already exists, skipping    
2023-10-17T06:39:36.402296Z  INFO lazy_load_credentials: aws_credential_types::cache::lazy_caching: credentials cache miss occurred; added new AWS credentials (took 32.472µs)
....

and I cannot Ctrl-C or SIGTERM it; SIGKILL is required to terminate persistcli.


bkirwi commented Oct 18, 2023

I spent a couple days looking into the issue Philip mentioned.

It turns out that the restore CLI is hanging during a query to CRDB. (In a fairly odd way: always the second query, without triggering connection or user timeouts, and the connection itself is obtained successfully.)

I'm at a bit of a loss as to why this is the case at the moment, but a timeout and retry seems to work around it effectively for now. I'll continue to investigate -- or write an issue, if I remain at a loss -- but for now hopefully this unblocks the PR.

@philip-stoev - The final commit is enough to get your platform check passing -- though if you don't mind, I'd rather do further CI tests as a followup PR.
@danhhz - I put the retry loop in the restore code itself, to avoid sullying the normal implementation. (Though it may eventually make sense to move it if it turns out to be a more general problem.) Let me know if you'd like to see any changes!
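The timeout-and-retry workaround can be sketched generically: put each attempt under a deadline and reissue it when the deadline fires. A simplified, thread-based illustration (the actual fix lives in the restore code and would use async timeouts; here a truly hung attempt leaks its worker thread, which is fine only for a short-lived CLI):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run `op` under a deadline; if an attempt doesn't answer in time,
// abandon it and reissue, up to `max_attempts` tries total.
fn with_timeout_and_retry<T, F>(op: F, timeout: Duration, max_attempts: usize) -> Option<T>
where
    T: Send + 'static,
    F: Fn() -> T + Send + Clone + 'static,
{
    for _ in 0..max_attempts {
        let (tx, rx) = mpsc::channel();
        let op = op.clone();
        thread::spawn(move || {
            // The receiver may already have timed out; ignore send errors.
            let _ = tx.send(op());
        });
        if let Ok(value) = rx.recv_timeout(timeout) {
            return Some(value);
        }
        // Deadline hit: fall through and retry with a fresh attempt.
    }
    None
}

fn main() {
    // A fast "query" succeeds on the first attempt.
    let fast = with_timeout_and_retry(|| 42, Duration::from_secs(1), 3);
    assert_eq!(fast, Some(42));

    // A hung "query" exhausts its attempts and returns None.
    let hung = with_timeout_and_retry(
        || thread::sleep(Duration::from_secs(5)),
        Duration::from_millis(50),
        2,
    );
    assert_eq!(hung, None);
}
```

This matches the observed behavior: the second CRDB query hangs without tripping connection timeouts, so an application-level deadline plus a fresh retry is what actually unblocks the tool.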


danhhz commented Oct 19, 2023

Weird! Agreed that it would be good to figure this out at some point, but what you have looks fine to unblock things


bkirwi commented Oct 19, 2023

Thanks for the reviews!

@bkirwi bkirwi merged commit 8608b5f into MaterializeInc:main Oct 19, 2023
@philip-stoev (Contributor):

The retry loop takes 5 minutes to exit, so a separate ticket has been opened: https://github.com/MaterializeInc/database-issues/issues/6812

@bkirwi bkirwi mentioned this pull request Oct 20, 2023
@joacoc joacoc mentioned this pull request Oct 29, 2023
@joacoc joacoc mentioned this pull request Nov 10, 2023
3 participants