feat(barrier): support database failure isolation (part 2, local) #19579
Conversation
This stack of pull requests is managed by Graphite.
The logic LGTM. I think we need some tests that trigger a per-database failure and make sure the other databases are unaffected. Simulation tests are preferred.
```rust
// TODO: may only report this as a database failure instead of resetting the
// stream, once the HummockUploader supports partial recovery. Currently the
// HummockUploader enters the `Err` state and stops working until a global
// recovery clears the uploader.
self.control_stream_handle.reset_stream_with_err(Status::internal(format!(
    "failed to complete epoch: {} {} {:?} {:?}",
    database_id,
    partial_graph_id.0,
    barrier.epoch,
    err.as_report()
)));
```
We skip `try_find_root_actor_failure` here. Is it intentional?
Yes, because this error is likely to be caused by a hummock sync error, and is not related to actor failure.
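For context, here is a hypothetical sketch of the end state the TODO above points at: once the HummockUploader supports partial recovery, an epoch-completion error could be reported as a database-scoped failure instead of resetting the whole control stream. None of the types or functions below (`SyncError`, `FailureScope`, `classify`) are the actual crate API; they are illustrative assumptions only.

```rust
// Hypothetical sketch of the TODO's intent, not the actual risingwave code.

#[derive(Debug)]
struct SyncError(String);

enum FailureScope {
    /// Current behavior: tear down the control stream, forcing a global recovery.
    ResetStream(String),
    /// Intended behavior once partial recovery lands: suspend only the failed
    /// database; other databases on the compute node keep running.
    DatabaseFailure { database_id: u32, message: String },
}

fn classify(
    database_id: u32,
    err: &SyncError,
    uploader_supports_partial_recovery: bool,
) -> FailureScope {
    if uploader_supports_partial_recovery {
        FailureScope::DatabaseFailure {
            database_id,
            message: err.0.clone(),
        }
    } else {
        // The uploader enters an `Err` state and stops working until a global
        // recovery clears it, so the whole stream must be reset.
        FailureScope::ResetStream(format!(
            "failed to complete epoch: {} {:?}",
            database_id, err
        ))
    }
}

fn main() {
    let err = SyncError("hummock sync failed".into());
    match classify(1, &err, false) {
        FailureScope::ResetStream(msg) => println!("reset stream: {msg}"),
        FailureScope::DatabaseFailure { database_id, message } => {
            println!("suspend database {database_id}: {message}")
        }
    }
}
```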
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
After #19664. The local barrier manager part of database failure isolation.

Previously in `ManagedBarrierState`, all databases are in a `Running` status, as a `DatabaseManagedBarrierState`, because in case of any failure we wait for a global recovery. To support database failure isolation, besides the `Running` status we introduce the following statuses. The lifecycle of a `DatabaseStatus` will be (a state-machine sketch follows the list):

- Created on the first `add_partial_graph` request. The status is at `Running`.
- On a local failure, report a `ReportDatabaseFailureResponse` and enter the `Suspended` status.
- Receive a `ResetDatabaseRequest` from the meta global barrier manager, and then enter `Resetting`, which spawns a task to clear all actors and clear the hummock uploader state of the state tables. Note that we may enter `Resetting` directly from `Running` when the database reset is triggered by a failure reported from other CNs.
- After the reset finishes, the `DatabaseStatus` is removed from `ManagedBarrierState`, and will be recreated by the subsequent `add_partial_graph` request from the meta global barrier manager.
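To make the lifecycle concrete, here is a minimal, self-contained sketch of the state machine described above. The status names (`Running`, `Suspended`, `Resetting`) come from this PR; the `Event` type, the `step` function, and everything else are illustrative assumptions, not the actual `ManagedBarrierState` code.

```rust
#[derive(Debug)]
enum DatabaseStatus {
    /// Created on the first `add_partial_graph` request.
    Running,
    /// A local failure was reported (`ReportDatabaseFailureResponse`); waiting
    /// for a `ResetDatabaseRequest` from the meta global barrier manager.
    Suspended,
    /// A spawned task is clearing the database's actors and the hummock
    /// uploader state of its state tables.
    Resetting,
}

enum Event {
    /// A failure observed on this compute node.
    LocalFailure,
    /// `ResetDatabaseRequest` from meta. May arrive while still `Running` if
    /// the failure was reported by another CN.
    ResetRequest,
    /// The clearing task finished.
    ResetFinished,
}

/// Returns the next status, or `None` when the entry is removed from
/// `ManagedBarrierState` (to be recreated by a later `add_partial_graph`).
fn step(status: DatabaseStatus, event: Event) -> Option<DatabaseStatus> {
    use DatabaseStatus::*;
    use Event::*;
    match (status, event) {
        (Running, LocalFailure) => Some(Suspended),
        (Running, ResetRequest) | (Suspended, ResetRequest) => Some(Resetting),
        (Resetting, ResetFinished) => None, // removed from `ManagedBarrierState`
        (other, _) => Some(other),          // other combinations are no-ops here
    }
}

fn main() {
    let mut status = Some(DatabaseStatus::Running);
    for event in [Event::LocalFailure, Event::ResetRequest, Event::ResetFinished] {
        status = status.and_then(|s| step(s, event));
        println!("{status:?}");
    }
    // Prints: Some(Suspended), Some(Resetting), None
}
```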
Checklist

- `./risedev check` (or alias, `./risedev c`)
)Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.