-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(barrier): support database failure isolation (part 1, meta) #19664
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
@@ -979,33 +982,11 @@ impl LocalBarrierManager { | |||
} | |||
|
|||
/// A [`StreamError`] with a score, used to find the root cause of actor failures. | |||
#[derive(Debug, Clone)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to src/error/src/tonic/extra.rs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some early comments.
if self.checkpoint_control.is_failed_at_worker_err(worker_id) { | ||
let errors = self.control_stream_manager.collect_errors(worker_id, e).await; | ||
let err = merge_node_rpc_errors("get error from control stream", errors); | ||
self.report_collect_failure(&err); | ||
Self::report_collect_failure(&self.env, &err); | ||
self.failure_recovery(err).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that, the per-database recovery is only triggered by receiving a ReportDatabaseFailureResponse from CN, and when receiving an error from bidi-stream to any compute node will still trigger global recovery.
With node label introduced in the future, I think the flow on bidi-stream failure will become:
- Get the databases assigned to the failing CN
- Trigger database recovery for each of those databases
In other words, we will only rely on database recovery and deprecate the global recovery logics ultimately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Global recovery may still be triggered in cases such as manually triggering recovery, and some internal errors of global barrier manager such as failed to commit epoch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not saying that we no longer have global recovery. My point is it is better that the global recovery is done by triggering DB recovery for each existing DB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline. In future PRs, the logic when handling bidi-stream failure will become something similar to:
- Get the databases assigned to the failing CN
- Trigger database recovery for each of those databases
The reason why this PR still triggers a global recovery is because the CN side per database recovery logic is not included in this PR so to avoid breaking CI, we need to keep the logic global recovery separately in this PR.
if let Some(entering_recovery) = self.checkpoint_control.on_report_failure(resp, &mut self.control_stream_manager) { | ||
let database_id = entering_recovery.database_id(); | ||
warn!(database_id = database_id.database_id, "database entering recovery"); | ||
self.context.abort_and_mark_blocked(Some(database_id), RecoveryReason::Failover(anyhow!("reset database: {}", database_id).into())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @yezizp2012 Will it be an issue if we remove the barrier_manager.check_status_running check completely? Previously the check seems to be used to avoid DDL and scale (1, 2). However, the check is not 100% accurate because it is possible that recovery is triggered immediately after the check has passed.
In this case, we still need to rely on the validation in barrier manager to make sure the corresponding command is not allowed. In other words, if we change to only rely on the barrier manager validation, will it be a problem? We may need to check whether there can be partial metadata written before barrier injection and whether these partial metadata will be cleaned up correctly on barrier injection/collection failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest LGTM.
I have two more suggestions:
- The PR description is awesome and can help people understand the state machine in the codes. How about including the description in the codes as documentation of
GlobalBarrierWorker::run_inner
(with references to other codes as well) - The state machine is a bit more complicated so we need to improve observability on that. Let's include elapsed time and an info log with necessary information (db_id, state, status action, retry, etc) before and after each state machine transition.
let temp_place_holder = DatabaseRecoveringStage::Resetting { | ||
remaining_workers: Default::default(), | ||
reset_workers: Default::default(), | ||
reset_request_id: 0, | ||
backoff_future: None, | ||
}; | ||
match replace(&mut state.stage, temp_place_holder) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nits: What do we need the replace
here? The whole database_status
will be reset in L400 anyway so I guess we only need to do match state.stage
here?
cc @shanicky @st1page for awareness. In this PR, global recovery is still present when the CN-meta RPC fails, which can happen on CN crashes. In the future we will trigger DB recovery instead of global recovery in such cases and global recovery will only be triggered on manual recovery or errors inside meta. See context here. By comparing the metadata recovery logic for DB recovery and global recovery, there is one missing piece in the former: Per DB |
We need to determine whether we need a database level recovery or a resource group level group. cc @shanicky |
Can you elaborate why it requires additional support on each CN to support per DB recovery/actor migration? Currently I think we make a strong assumption that there will be no edges between DB objects in different DB but I am not sure whether the same assumption holds for resource group. Also, the metadata like barrier command and catalog are bound to DB id so I think it is more natural to do per DB recovery actually. |
7baa0d1
to
07de9ce
Compare
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
In this PR, we implement the logic of database failure isolation in meta global barrier manager part.
We will add a new variant
ReportDatabaseFailureResponse
inStreamingControlStreamResponse
to report the failure of single database. After receiving such response, instead of triggering a global recovery, we will only recover the streaming job of the failed database, and other databases won't be affected by this failure. Note that, the per-database recovery is only triggered by receiving aReportDatabaseFailureResponse
from CN, and when receiving an error from bidi-stream to any compute node will still trigger global recovery. However, in this PR we only implement the logic in the meta global barrier manager, and the CN local barrier manager will not send anyReportDatabaseFailureResponse
yet. Therefore, the per-database recovery logic will not be triggered in this PR. It will be triggered only after #19579, which implements the logic in CN local barrier manager, is merged.Previously in
CheckpointControl
, we store the runtime info of each database inDatabaseCheckpointControl
. In this PR, each database will be stored asDatabaseCheckpointControlStatus
, which can be eitherRunning
orRecovering
.In
Recovery
, it has two stages,Resetting
andInitializing
. In theResetting
stage, it sendResetDatabaseRequest
to all CNs, and wait for theResetDatabaseResponse
from all CNs. After receiving theResetDatabaseResponse
from all CNs, and will enter theInitializing
stage. By now, we will have ensured that, any information of the reset database will not exist in any CN, and there won't be any inflight message of the database exist in the bidi-stream. In theInitializing
stage, we will send initial barriers to all CNs to recreate the actors of the database. After we successfully collects the initial barrier from all CNs, the database is recovered and enter the normalRunning
.We can treat each database as a state machine of 3 states:
Running
,Resetting
andInitializing
. The state transition can be triggered when receiving 3 variants of response:ReportDatabaseFailure
,BarrierComplete
,DatabaseReset
. The logic of state transition can be summarized as followed:Running
ReportDatabaseFailure
BarrierCompletingTask
to finish if there is any, mark the database as blocked in command queueResetDatabaseRequest
withreset_request_id
as 0 to all CNs, and savereset_request_id
and the set of nodes that need to collect response.Resetting
state.BarrierComplete
: update theDatabaseCheckpointControl
.DatabaseReset
: unreachableResetting
ReportDatabaseFailure
orBarrierComplete
: ignoreDatabaseReset
:reset_request_id
in the response is less than the savedreset_request_id
, ignoreInitializing
stateInitializing
BarrierComplete
:Running
ReportDatabaseFailure
reset_request_id
, and sendResetDatabaseRequest
to all CNsResetting
DatabaseReset
: unreachableNote that we use the
reset_request_id
to handle the case of staleResetDatabaseResponse
, though in current state machine, there should not be staleResetDatabaseResponse
, because inResetting
state, we either enterInitializing
when we collect theResetDatabaseResponse
from previously sent CNs, or get error and enter global recovery.Checklist
./risedev check
(or alias,./risedev c
)Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.