[persist] Restore Blob state based on the contents of Consensus #20482
Conversation
Force-pushed from 25ee5bb to 630a94e
@bkirwi as discussed in Slack, we would like to add backup/restore to various other CI workloads. Can you please add me as a reviewer to this PR once it reaches the state where this can be done? Thanks!
@philip-stoev - I think the test is now in decent shape. I've left a few comments on the test file in case they're useful as you're thinking about extracting code.
(FWIW, I'm a bit nervous about this part: a correct restore process involves some environmental assumptions (i.e. no MZs running) and a specific sequence of calls to unrelated tools, which seems pretty brittle. I'm not sure if there's a good way to add guardrails to this logic that would make it harder to mess that process up, or if we'll just need to be very careful as we're adding tests...)
Service(
    name="persistcli",
    config={
        # TODO: depends!
Dependencies definitely don't matter for this script, but I haven't thought about whether we'd want to make that explicit if we extract this out.
    "minioadmin",
)
c.run("mc", "version", "enable", "persist/persist")
blob_uri = "s3://minioadmin:minioadmin@persist/persist?endpoint=http://minio:9000/&region=minio"
AFAICT this would always be worthwhile for minio-based tests, but it does involve yet another container in the mix, which the existing minio tool goes out of its way to avoid.
I am going to do this in Platform checks and other frameworks, so most minio-based tests will start using versioning.
    "materialized"
)  # Very important that MZ is not running during the restore process!
c.exec(
    "cockroach", "cockroach", "sql", "--insecure", "-e", "DROP DATABASE defaultdb;"
This immediately puts the database in an invalid state, and it does not return to a valid state until the successful restore-blob call below. This is sort of a "critical section": missing one of these steps, or having a Materialize instance running while either of these things is happening, is likely to cause crashes or other breakage... but that breakage is expected.
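To make the ordering of that critical section concrete, here is a rough Python sketch. `Composition` is a minimal stand-in for mzcompose's composition object, not the real API, and the persistcli invocation is hypothetical; the point is only the sequence: stop MZ, invalidate state, restore, then restart.

```python
# Hypothetical sketch of the restore "critical section". `Composition` is a
# toy stand-in for mzcompose's composition object, not the real API.
class Composition:
    def __init__(self):
        self.calls = []

    def stop(self, service):
        self.calls.append(("stop", service))

    def exec(self, service, *args):
        self.calls.append(("exec", service))

    def run(self, service, *args):
        self.calls.append(("run", service))

    def up(self, service):
        self.calls.append(("up", service))


def restore(c):
    # MZ must not be running while consensus and blob state disagree.
    c.stop("materialized")
    # --- start of critical section: state is invalid from here...
    c.exec(
        "cockroach", "cockroach", "sql", "--insecure", "-e",
        "DROP DATABASE defaultdb;",
    )
    # (restore the CRDB backup here; details elided)
    c.run("persistcli", "restore-blob")  # hypothetical invocation
    # --- ...until restore-blob succeeds.
    c.up("materialized")
```

A guardrail could be as simple as asserting that the "materialized" service is stopped before any of the critical-section steps run.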
Oh, and the backup/restore test seems to be "passing" but fails when uploading results... I'm not sure if that goes away once it's merged or whether we need some manual intervention. (Or if @philip-stoev & co would prefer to fold it into an existing suite instead?)
cool, looks great!
ran out of time to completely finish the review, so I might trickle in a few more nits on the next pass, but I think this should be most of it
the python/mzcompose stuff all seems sane to me, but it's probably best to get another reviewer for that
    .await?;

let temp_dir = tempfile::tempdir().map_err(Error::from)?;
blob_impl_test(move |path| {
let's give MemBlob the same treatment (a flag to enable tombstoning and a test)
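To illustrate what "the same treatment" might mean, here is a toy Python model of an in-memory blob store with an opt-in tombstone flag. The real MemBlob is Rust and all names here are invented; the sketch only shows the semantics: with tombstoning enabled, a delete keeps the bytes recoverable instead of erasing them.

```python
# Toy in-memory blob store with an optional "tombstone" mode, loosely modeled
# on the suggestion above. The real MemBlob is Rust; these names are invented.
class MemBlob:
    def __init__(self, tombstone: bool = False):
        self.tombstone = tombstone
        self.data = {}          # key -> bytes
        self.tombstones = set() # keys deleted but still recoverable

    def set(self, key, value):
        self.data[key] = value
        self.tombstones.discard(key)

    def get(self, key):
        if key in self.tombstones:
            return None
        return self.data.get(key)

    def delete(self, key):
        if self.tombstone:
            # Keep the bytes around so a later restore can undelete them.
            self.tombstones.add(key)
        else:
            self.data.pop(key, None)

    def restore(self, key):
        # Undelete: only possible when the bytes were tombstoned, not erased.
        if key in self.tombstones:
            self.tombstones.discard(key)
            return True
        return key in self.data
```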
src/persist-client/src/cli/admin.rs
    }
    Ok(())
}

/// Attempt to restore all the blobs referenced by the current state in consensus.
/// Returns a list of blobs that were not possible to restore.
async fn restore_blob(
would be nice to have some basic unit test coverage of this (the MemBlob comment I left will make this easier). Kind of just thinking out loud: start up persist, write some data, delete blobs out from under it, restore them, and then read the data. I think that won't tickle any of the correctness issues that we'd expect in an online restore?
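The suggested test flow can be sketched with plain dictionaries standing in for persist's blob store; none of this is the real API, it just traces the shape of the test: write, delete out from under the reader, restore, read back.

```python
# Toy stand-ins for persist's blob store. "tombstoned" models blobs that were
# deleted but remain recoverable (via versioning / tombstones).
blob = {}        # live blobs: key -> bytes
tombstoned = {}  # deleted-but-recoverable blobs

def write(key, value):
    blob[key] = value

def delete(key):
    # Simulate a deletion that leaves the bytes recoverable.
    tombstoned[key] = blob.pop(key)

def restore_blob():
    # Undelete everything that was deleted out from under us.
    while tombstoned:
        key, value = tombstoned.popitem()
        blob[key] = value

# The test flow from the comment above:
write("shard-0/part-0", b"data")
delete("shard-0/part-0")
assert "shard-0/part-0" not in blob
restore_blob()
assert blob["shard-0/part-0"] == b"data"
```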
let mut shards = consensus.list_keys();
let mut not_restored = vec![];
while let Some(shard) = shards.next().await {
is it crazy to run restore_blob concurrently? feels like we might want some amount of concurrency in the tool for big envs and this might be an easy place to do it. otoh, I'm 100% okay leaving this as a TODO
Not crazy but not trivial... I'll leave a TODO for now.
(If we do allow concurrency, that should probably be a CLI arg?)
👍 I think what I'm worried about here is that we end up in an incident and find out that the restore tool is gonna take many hours because we don't have any sort of concurrency. If we had some data that running this on a huge env still finished in O(minutes), then I definitely don't care. Fine with making it a cli arg, but I'd lean toward making the default of that arg to be some amount of concurrency, unless we have evidence that things tend to be fast enough even without it
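A bounded-concurrency version of this loop might look like the following asyncio sketch. The real tool is Rust and `restore_shard` here is a placeholder, but it shows the semaphore pattern and a default concurrency level that could be surfaced as a CLI arg, per the discussion above.

```python
import asyncio

# Hypothetical bounded-concurrency restore loop. The real tool is Rust;
# restore_shard is a placeholder for the per-shard restore work.
async def restore_shard(shard):
    await asyncio.sleep(0)  # stand-in for the actual restore of one shard
    return shard

async def restore_all(shards, concurrency=8):
    # Default concurrency would come from a CLI arg in the real tool.
    sem = asyncio.Semaphore(concurrency)

    async def one(shard):
        async with sem:  # at most `concurrency` shards in flight at once
            return await restore_shard(shard)

    # gather preserves input order in its results.
    return await asyncio.gather(*(one(s) for s in shards))

restored = asyncio.run(restore_all([f"s{i}" for i in range(20)], concurrency=4))
```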
@bkirwi my plan on this is as follows:
I am getting the following on stdout when doing a restore:
is that expected?
That's not too concerning -- we're just dumping out a bunch of internal metrics as a convenience here. I'll see if that warning is easy to avoid, though if not I may take it as a followup.
This PR has higher risk. In addition to having a knowledgeable reviewer, it may be useful to add observability and/or a feature flag.
Force-pushed from e3c4767 to 2ed130d
@bkirwi I was able to get the entire stuff in Platform Checks to be backed up and then restored, but unfortunately
LGTM once QA is happy!
I'll see if that warning is easy to avoid, though if not I may take it as a followup.
I think just downgrade it to an info
@bkirwi The platform checks test that fails can be found at:
To run:
It will then hang as follows:
and it cannot be interrupted with Ctrl-C or SIGTERM; SIGKILL is required to terminate persistcli.
Probably should not call this on the hot path, but it's useful in the same circumstances as `restore` is.
- do not use a mounted volume for Mc(); use a persistent container instead
- do not make Mc() depend on Minio(), as for some reason docker compose insists on restarting minio when mc is started, and minio fails to restart properly
I spent a couple of days looking into the issue Philip mentioned. It turns out that the restore CLI is hanging during a query to CRDB. (In a fairly odd way: always the second query, without triggering connection or user timeouts, and the connection itself is obtained successfully.) I'm at a bit of a loss as to why this is the case at the moment, but a timeout and retry seems to work around it effectively for now. I'll continue to investigate -- or write an issue, if I remain at a loss -- but for now hopefully this unblocks the PR. @philip-stoev - The final commit is enough to get your platform check passing -- though if you don't mind, I'd rather do further CI tests as a followup PR.
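The timeout-and-retry workaround can be sketched as follows. The names and timings are illustrative, not the actual implementation; the idea is just to bound each query attempt and retry on a fresh attempt if it hangs.

```python
import asyncio

# Illustrative timeout-and-retry wrapper for a query that sometimes hangs.
# Names and timings are invented; this is not the actual implementation.
async def with_retries(op, attempts=3, timeout=1.0):
    last_err = None
    for _ in range(attempts):
        try:
            # Bound each attempt; a hung query is cancelled, not waited on.
            return await asyncio.wait_for(op(), timeout)
        except asyncio.TimeoutError as e:
            last_err = e  # the query hung; retry on a fresh attempt
    raise last_err

calls = 0

async def flaky_query():
    # Simulates the observed behavior: the query hangs on one attempt
    # but succeeds on a retry.
    global calls
    calls += 1
    if calls == 1:
        await asyncio.sleep(10)  # simulate the hang
    return "rows"

result = asyncio.run(with_retries(flaky_query, attempts=3, timeout=0.05))
```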
Weird! Agreed that it would be good to figure this out at some point, but what you have looks fine to unblock things
Thanks for the reviews!
The retry loop takes 5 minutes to exit, so a separate ticket has been opened: https://github.com/MaterializeInc/database-issues/issues/6812
Persist state is a mix of stuff in CRDB and stuff in S3. This complicates restoring from backup: if we restore CRDB to an old state, the blobs it references might be deleted.
It's possible to "undelete" deleted blobs in S3 by deleting the delete marker from a versioned bucket. This PR implements that logic as a CLI tool, and implements a basic test that restores a CRDB state and checks that the database is in the expected state after the CLI runs.
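As a toy model of the S3 versioning semantics this relies on: a DELETE on a versioned bucket writes a delete marker on top of the version stack rather than erasing the data, and removing that marker "undeletes" the object. The real tool talks to S3; this dict-based sketch only demonstrates the mechanism.

```python
# Toy model of S3 versioning. Each key maps to a stack of versions, newest
# last; a delete pushes a delete marker instead of removing the data.
bucket = {}  # key -> list of ("data", value) or ("delete-marker", None)

def put(key, value):
    bucket.setdefault(key, []).append(("data", value))

def delete(key):
    bucket.setdefault(key, []).append(("delete-marker", None))

def get(key):
    versions = bucket.get(key, [])
    if not versions or versions[-1][0] == "delete-marker":
        return None  # latest version is a delete marker: key reads as gone
    return versions[-1][1]

def undelete(key):
    # Remove the delete marker to expose the previous version again.
    versions = bucket.get(key, [])
    if versions and versions[-1][0] == "delete-marker":
        versions.pop()
        return True
    return False

put("blob", b"bytes")
delete("blob")
assert get("blob") is None     # deleted, but the bytes are still there
undelete("blob")
assert get("blob") == b"bytes" # restored by removing the delete marker
```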
Motivation
See #18157 for the design.
Tips for reviewer
One subtlety here is that we can't rely on the usual state iterators, since those rely on the existence of rollups, which may not exist yet. The current draft just iterates over the diffs manually.
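Iterating over the diffs manually amounts to a fold: apply each diff in order and track which blobs the resulting state references. This is a toy sketch; the diff shape here is invented for illustration and bears no resemblance to persist's actual state diff encoding.

```python
# Toy fold over a sequence of state diffs to recover the set of referenced
# blobs without relying on a rollup. The diff shape is invented.
def referenced_blobs(diffs):
    blobs = set()
    for diff in diffs:
        blobs.update(diff.get("insert", []))
        blobs.difference_update(diff.get("delete", []))
    return blobs

diffs = [
    {"insert": ["b0", "b1"]},
    {"insert": ["b2"], "delete": ["b0"]},
]
refs = referenced_blobs(diffs)
```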
I have not yet tested this tool against a real environment. I will probably wait for at least a first pass of code review first!
Checklist
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.