[dag] dag rebootstrap #9967

Merged: 6 commits merged into main on Sep 15, 2023
Conversation

@ibalajiarun (Contributor) commented on Sep 7, 2023

Description

This PR introduces the rebootstrap logic for the DAG. Essentially, when the DAG needs to state sync, we abort the existing handlers and return to the bootstrapper, which state syncs the DAG, recreates all the components, and starts the handlers again. This unifies the logic so that the same bootstrapping path is used for both recovery and state sync.
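To make the flow concrete, here is a minimal, hypothetical sketch of the bootstrapper loop described above (names like `Bootstrapper`, `Handler`, and `sync_dag_to` are illustrative stand-ins, not the PR's actual types):

```rust
use tokio::sync::{mpsc, oneshot};

// Illustrative stand-ins only; the real components live in the DAG crate.
struct Handler;
struct FetchService;
struct Bootstrapper;

impl Handler {
    // Runs until the enclosing select! drops this future.
    async fn run(self, rpc_rx: &mut mpsc::Receiver<u64>) {
        while let Some(_msg) = rpc_rx.recv().await {
            // process DAG RPC messages
        }
    }
}

impl FetchService {
    async fn start(self) { /* fetch missing nodes until aborted */ }
}

impl Bootstrapper {
    fn bootstrap_components(&self) -> (Handler, FetchService, mpsc::Receiver<u64>) {
        // In the real code the sender side of this channel would go to the
        // state sync trigger so the handler can request a rebootstrap.
        let (_tx, rebootstrap_rx) = mpsc::channel(1);
        (Handler, FetchService, rebootstrap_rx)
    }
    async fn sync_dag_to(&self, _target: &u64) -> Result<(), String> {
        Ok(())
    }
}

async fn run(
    bootstrapper: Bootstrapper,
    mut shutdown_rx: oneshot::Receiver<()>,
    mut rpc_rx: mpsc::Receiver<u64>,
) {
    loop {
        // (Re)create the DAG components and start the handlers.
        let (handler, fetch_service, mut rebootstrap_rx) = bootstrapper.bootstrap_components();
        let fetch_handle = tokio::spawn(fetch_service.start());

        tokio::select! {
            // Normal shutdown: stop the background task and return.
            _ = &mut shutdown_rx => {
                fetch_handle.abort();
                return;
            },
            // State sync requested: tear everything down, sync the DAG, then
            // loop back and rebuild every component from the synced state.
            Some(target) = rebootstrap_rx.recv() => {
                fetch_handle.abort();
                let _ = fetch_handle.await;
                if let Err(e) = bootstrapper.sync_dag_to(&target).await {
                    // The PR's code logs this via error!(error = ?e, "unable to sync").
                    eprintln!("unable to sync: {e:?}");
                }
            },
            // The handler runs until one of the branches above completes.
            _ = handler.run(&mut rpc_rx) => {},
        }
    }
}
```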

Test Plan

Existing Tests

@ibalajiarun ibalajiarun force-pushed the balaji/dag-state-sync branch from 7e41139 to 3932158 Compare September 7, 2023 21:48
@ibalajiarun ibalajiarun force-pushed the balaji/dag-state-sync branch 2 times, most recently from fab0ad8 to 6dd6796 Compare September 7, 2023 23:41
Base automatically changed from balaji/dag-state-sync to main September 8, 2023 14:45
@ibalajiarun ibalajiarun changed the base branch from main to balaji/bcast-certified-node-msg September 8, 2023 16:35
@ibalajiarun ibalajiarun force-pushed the balaji/dag-rebootstrap branch from b00b863 to 017824b Compare September 8, 2023 16:35
error!(error = ?e, "unable to sync");
}
},
_ = handler.start(&mut dag_rpc_rx) => {}
Contributor:

I don't think this works? Once it starts, it will never exit the inner loop.

Contributor Author:

I think it works. As long as we have an `.await` within the loop, the future will yield so another future in the outer `select!` can be polled. If we get a rebootstrap notification, that branch completes and we simply never poll the handler again; instead, we start a new handler.
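To illustrate the point (a generic tokio example, not the PR's code): the inner loop yields at each `.await`, so the outer `select!` keeps being polled, and once the notification branch completes the inner future is dropped and never polled again.

```rust
use tokio::sync::mpsc;

// Inner loop: would run forever, but yields at every `.await`, so the
// enclosing select! still gets to poll its other branches.
async fn inner_loop(mut rx: mpsc::Receiver<u64>) {
    while let Some(msg) = rx.recv().await {
        println!("handled message {msg}");
    }
}

#[tokio::main]
async fn main() {
    let (msg_tx, msg_rx) = mpsc::channel(8);
    let (notify_tx, mut notify_rx) = mpsc::channel::<()>(1);

    // Simulate some traffic followed by a rebootstrap notification.
    tokio::spawn(async move {
        for i in 0..3u64 {
            let _ = msg_tx.send(i).await;
        }
        let _ = notify_tx.send(()).await;
    });

    tokio::select! {
        // When this branch fires, the `inner_loop` future below is dropped:
        // it is never polled again, and a fresh handler would be created.
        Some(()) = notify_rx.recv() => println!("rebootstrap requested"),
        _ = inner_loop(msg_rx) => {},
    }
}
```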

},
Some(node) = rebootstrap_notification_rx.recv() => {
df_handle.abort();
let _ = df_handle.await;
Contributor:

We should probably have a guard for these services.

@zekun000 (Contributor) left a comment:

Looking at it again, why do we need to separate the trigger and the sync manager? Since we block the handler anyway, why not just check and sync directly inside the handler?

AggregateSignature::empty(),
);

let mut shutdown_rx = shutdown_rx.into_stream();
Contributor:

I don't think you need this; just `&mut shutdown_rx` is enough.

Contributor Author:

I do need this, because the `oneshot::Receiver` appears to be moved within the `select!` statement otherwise.
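For context, a sketch of the general pattern (not the exact PR code, and assuming the futures crate's channels): a `oneshot::Receiver` is a future that is consumed once it completes, so converting it into a one-element stream with `FutureExt::into_stream()` lets the loop keep selecting on it by mutable reference.

```rust
use futures::{
    channel::{mpsc, oneshot},
    FutureExt, StreamExt,
};

// Sketch only: a long-lived handler loop that also listens for a one-shot
// shutdown signal without moving the receiver on every iteration.
async fn handler_loop(shutdown_rx: oneshot::Receiver<()>, mut msg_rx: mpsc::Receiver<u64>) {
    let mut shutdown_rx = shutdown_rx.into_stream();
    loop {
        futures::select! {
            _ = shutdown_rx.select_next_some() => {
                // Shutdown signal received; stop processing.
                break;
            },
            msg = msg_rx.next() => {
                match msg {
                    Some(m) => println!("processing {m}"),
                    None => break, // all senders dropped
                }
            },
        }
    }
}
```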

@ibalajiarun (Contributor Author):

> Looking at it again, why do we need to separate the trigger and the sync manager? Since we block the handler anyway, why not just check and sync directly inside the handler?

I think I need to abort the fetch service before starting an ad-hoc fetch for state sync; otherwise, there could be two fetches for the same nodes happening concurrently.

Base automatically changed from balaji/bcast-certified-node-msg to main September 11, 2023 20:22
@ibalajiarun ibalajiarun force-pushed the balaji/dag-rebootstrap branch from 60eace0 to c1842f6 Compare September 11, 2023 22:31
@ibalajiarun ibalajiarun force-pushed the balaji/dag-rebootstrap branch from c1842f6 to a8e5453 Compare September 11, 2023 22:36
@ibalajiarun ibalajiarun marked this pull request as ready for review September 11, 2023 22:38
@ibalajiarun ibalajiarun requested a review from sasha8 as a code owner September 11, 2023 22:38
@ibalajiarun ibalajiarun force-pushed the balaji/dag-rebootstrap branch from a8e5453 to c921272 Compare September 11, 2023 23:50

let dag_fetcher = DagFetcher::new(self.epoch_state.clone(), self.dag_network_sender.clone(), self.time_service.clone());

if let Err(e) = sync_manager.sync_dag_to(&certified_node_msg, dag_fetcher, dag_store.clone()).await {
Contributor:

I thought we're creating a new DAG store instead of re-using the current one?

Contributor Author:

Yes, we are. I pass the existing DAG store to do some assertion checks on whether to actually state sync.

Contributor:

Hmm, that sounds weird; shouldn't the check be done in the check function?

Contributor Author:

Yes, I am being paranoid: I check in the check function and assert in the sync_to function.

let (handler, fetch_service) =
self.bootstrap_components(dag_store.clone(), order_rule, state_sync_trigger);

let df_handle = tokio::spawn(fetch_service.start());
Contributor:

nit: we could just create a drop guard here and avoid the abort/await lines in both branches?
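A small hedged sketch of what such a guard could look like (a generic abort-on-drop wrapper; aptos-core may already have an equivalent helper):

```rust
use tokio::task::JoinHandle;

/// Aborts the wrapped task when the guard goes out of scope.
struct AbortOnDrop<T>(JoinHandle<T>);

impl<T> Drop for AbortOnDrop<T> {
    fn drop(&mut self) {
        self.0.abort();
    }
}

// Usage sketch: whichever select! branch completes, the fetch service is
// torn down automatically, with no explicit abort()/await in each branch.
// let _fetch_guard = AbortOnDrop(tokio::spawn(fetch_service.start()));
```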

match certified_node_msg.verify(&self.epoch_state.verifier) {
Ok(_) => match self.state_sync_trigger.check(certified_node_msg).await {
ret @ (NeedsSync(_), None) => return Ok(ret.0),
(Synced, Some(certified_node_msg)) => self
Contributor:

The message can be carried in `StateSyncStatus::Synced` to avoid the second `Option`?

Contributor Author:

I need to send the `Synced` status from the `process_rpc` function as well, so I either clone the `certified_node_msg` or use the second `Option`.

Alternatively, I could have two enums: the state sync check returns one enum and I convert it to another one for `process_rpc`. I thought that was too much.

Contributor:

I don't think we need to define two enums; this function can just return a `Result<(), CertifiedNodeMessage>`?
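To make the first suggestion concrete, a hedged sketch of carrying the message inside the status (the `Option` on `Synced` reflects the point above that `process_rpc` also needs to emit `Synced` without a message); the `Result<(), CertifiedNodeMessage>` shape mentioned here would avoid the extra enum entirely:

```rust
// Stub standing in for the real consensus type.
struct CertifiedNodeMessage;

// One possible shape for the first suggestion: the status itself carries the
// message, so the (StateSyncStatus, Option<CertifiedNodeMessage>) tuple
// collapses into a single value.
enum StateSyncStatus {
    // DAG is caught up; the message (when there is one) is handed back so
    // normal processing can continue.
    Synced(Option<CertifiedNodeMessage>),
    // DAG is behind; the message becomes the state sync target.
    NeedsSync(CertifiedNodeMessage),
}
```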

@ibalajiarun ibalajiarun force-pushed the balaji/dag-rebootstrap branch from c921272 to 51158fc Compare September 13, 2023 19:18

@sasha8 (Contributor) left a comment:

Nice!

Contributor:

I think we're now ready to increase the dag_window.

Contributor Author:

Yes, when we introduce the onchain config.

dag.clone(),
self.time_service.clone(),
);
let fetch_requester = Arc::new(fetch_requester);
Contributor:

nit: Why can't you return an `Arc` from `DagFetcherService::new`?
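A tiny sketch of the nit (hypothetical signature; the real constructor takes the DAG, time service, and other dependencies): the constructor could hand the requester back already wrapped in an `Arc`:

```rust
use std::sync::Arc;

struct FetchRequester;
struct DagFetcherService;

impl DagFetcherService {
    // Returning the requester pre-wrapped saves every call site the extra
    // `let fetch_requester = Arc::new(fetch_requester);` line.
    fn new() -> (Self, Arc<FetchRequester>) {
        (DagFetcherService, Arc::new(FetchRequester))
    }
}
```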

@ibalajiarun ibalajiarun force-pushed the balaji/dag-rebootstrap branch from c41b645 to d85cfd6 Compare September 15, 2023 15:28
@ibalajiarun ibalajiarun enabled auto-merge (squash) September 15, 2023 16:00
@github-actions:

✅ Forge suite compat success on aptos-node-v1.6.2 ==> 3318585eae993e20a272c2f33de24bec49f75c65

Compatibility test results for aptos-node-v1.6.2 ==> 3318585eae993e20a272c2f33de24bec49f75c65 (PR)
1. Check liveness of validators at old version: aptos-node-v1.6.2
compatibility::simple-validator-upgrade::liveness-check : committed: 4660 txn/s, latency: 6649 ms, (p50: 6900 ms, p90: 9200 ms, p99: 10000 ms), latency samples: 181760
2. Upgrading first Validator to new version: 3318585eae993e20a272c2f33de24bec49f75c65
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1786 txn/s, latency: 16391 ms, (p50: 18800 ms, p90: 22300 ms, p99: 22600 ms), latency samples: 92900
3. Upgrading rest of first batch to new version: 3318585eae993e20a272c2f33de24bec49f75c65
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1824 txn/s, latency: 15760 ms, (p50: 19300 ms, p90: 22000 ms, p99: 22600 ms), latency samples: 91220
4. upgrading second batch to new version: 3318585eae993e20a272c2f33de24bec49f75c65
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 3372 txn/s, latency: 9450 ms, (p50: 9400 ms, p90: 13900 ms, p99: 25000 ms), latency samples: 138280
5. check swarm health
Compatibility test for aptos-node-v1.6.2 ==> 3318585eae993e20a272c2f33de24bec49f75c65 passed
Test Ok

@github-actions:

✅ Forge suite realistic_env_max_load success on 3318585eae993e20a272c2f33de24bec49f75c65

two traffics test: inner traffic : committed: 5841 txn/s, latency: 6617 ms, (p50: 6300 ms, p90: 8400 ms, p99: 14200 ms), latency samples: 2564200
two traffics test : committed: 100 txn/s, latency: 3078 ms, (p50: 3000 ms, p90: 3500 ms, p99: 7300 ms), latency samples: 1880
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.228, avg: 0.211", "QsPosToProposal: max: 0.299, avg: 0.183", "ConsensusProposalToOrdered: max: 0.687, avg: 0.630", "ConsensusOrderedToCommit: max: 0.569, avg: 0.531", "ConsensusProposalToCommit: max: 1.214, avg: 1.161"]
Max round gap was 1 [limit 4] at version 826366. Max no progress secs was 3.968346 [limit 10] at version 2618845.
Test Ok


@github-actions:

✅ Forge suite framework_upgrade success on aptos-node-v1.5.1 ==> 3318585eae993e20a272c2f33de24bec49f75c65

Compatibility test results for aptos-node-v1.5.1 ==> 3318585eae993e20a272c2f33de24bec49f75c65 (PR)
Upgrade the nodes to version: 3318585eae993e20a272c2f33de24bec49f75c65
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 5128 txn/s, latency: 6332 ms, (p50: 5400 ms, p90: 9500 ms, p99: 17500 ms), latency samples: 189740
5. check swarm health
Compatibility test for aptos-node-v1.5.1 ==> 3318585eae993e20a272c2f33de24bec49f75c65 passed
Test Ok

@ibalajiarun ibalajiarun merged commit fcf58d0 into main Sep 15, 2023
70 of 73 checks passed
@ibalajiarun ibalajiarun deleted the balaji/dag-rebootstrap branch September 15, 2023 18:17
3 participants