
[dag] preliminary state sync implementation #9724

Merged 7 commits on Sep 8, 2023

Conversation

@ibalajiarun (Contributor) commented Aug 22, 2023

Description

This PR provides the preliminary implementation of state sync for the DAG. The integration part will be in another PR.

Test Plan

Unit tests will be added in a follow-up PR.

@ibalajiarun force-pushed the balaji/dag-bitmask-fix branch from 984b7a8 to 178b9f5 on August 24, 2023 20:43
@ibalajiarun force-pushed the balaji/dag-state-sync branch from 918b442 to ac67202 on August 24, 2023 20:43
@ibalajiarun force-pushed the balaji/dag-bitmask-fix branch from 178b9f5 to 2bcdbd4 on August 24, 2023 20:54
@ibalajiarun force-pushed the balaji/dag-state-sync branch from ac67202 to d4409ca on August 24, 2023 20:54
@ibalajiarun force-pushed the balaji/dag-bitmask-fix branch from 2bcdbd4 to bc4d9ee on August 24, 2023 21:01
Base automatically changed from balaji/dag-bitmask-fix to main on August 24, 2023 21:34
@ibalajiarun force-pushed the balaji/dag-state-sync branch 2 times, most recently from abf8a81 to 7071576, on August 30, 2023 04:39
@ibalajiarun changed the base branch from main to balaji/dag-fixes on August 30, 2023 05:14
@ibalajiarun force-pushed the balaji/dag-state-sync branch from 7071576 to 023b879 on August 30, 2023 05:14
@ibalajiarun marked this pull request as ready for review on August 30, 2023 05:15
@ibalajiarun requested a review from zekun000 as a code owner on August 30, 2023 05:15
@ibalajiarun changed the title from "[dag][wip] state sync" to "[dag] preliminary state sync implementation" on Aug 30, 2023
@ibalajiarun requested a review from sasha8 on August 30, 2023 05:17
@@ -418,6 +419,33 @@ impl TDAGMessage for CertifiedNode {
}
}

#[derive(Serialize, Deserialize, Clone, Debug, PartialEq)]
pub struct CertifiedNodeWithLedgerInfo {

Contributor:
the convention we have today is just to call it CertifiedNodeMessage
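
For reference, a possible shape of the renamed type under that convention; the fields here are assumptions based on how the value is used later in this PR (node.ledger_info()), not the actual definition:

// Sketch only: field names are assumptions, not the PR's actual definition.
#[derive(Serialize, Deserialize, Clone, Debug, PartialEq)]
pub struct CertifiedNodeMessage {
    certified_node: CertifiedNode,
    // Ledger info anchoring this node to a committed state.
    ledger_info: LedgerInfoWithSignatures,
}

impl CertifiedNodeMessage {
    pub fn ledger_info(&self) -> &LedgerInfoWithSignatures {
        &self.ledger_info
    }
}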


async fn send_epoch_change(&self, proof: EpochChangeProof);

async fn send_commit_proof(&self, ledger_info: LedgerInfoWithSignatures);

Contributor:
I don't want to couple these two with the network sender; this is more like a self-notifier. Maybe another adapter trait?
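
One way to decouple them along those lines: keep peer traffic on the network sender and introduce a separate notifier trait for these two local effects, which the state sync path holds a handle to. A minimal sketch, with a hypothetical trait name:

// Hypothetical adapter trait, separate from the DAG network sender.
#[async_trait::async_trait]
pub trait TCommitNotifier: Send + Sync {
    // Notify local consensus components that a new commit proof is available.
    async fn send_commit_proof(&self, ledger_info: LedgerInfoWithSignatures);

    // Notify the local epoch manager that the epoch has ended.
    async fn send_epoch_change(&self, proof: EpochChangeProof);
}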

// Note: ledger info round <= highest ordered round
if dag_reader.highest_committed_round().unwrap_or_default()
< ledger_info.commit_info().round()
&& dag_reader.exists_by_round_digest(

Contributor:
Do we need this existence check? If the round is between the committed and the ordered round, the node should always exist unless we violate safety.
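
If that invariant holds, the existence check could arguably become an assertion rather than part of the condition; an illustrative alternative only (the arguments to exists_by_round_digest are elided in the diff, so round and digest below are placeholders):

// Sketch: treat "a node between the committed and ordered rounds exists in
// the DAG" as a safety invariant instead of a gating condition.
if dag_reader.highest_committed_round().unwrap_or_default()
    < ledger_info.commit_info().round()
{
    debug_assert!(
        dag_reader.exists_by_round_digest(round, digest), // placeholder args
        "node between committed and ordered rounds must exist in the DAG"
    );
    // ... decide whether to state sync ...
}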

let commit_li = node.ledger_info();

// TODO: there is a case where the DAG fetches missing nodes in the window, a crash happens, and when we restart
// we end up with a gap in the DAG; we need to be smart enough to clean up the DAG before the gap.

Contributor:
Ideally we pass the ledger info in the constructor and do the cleanup there.
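
Concretely, that cleanup could be driven from the ledger info handed in at construction time; a minimal sketch, where latest_ledger_info is the hypothetical constructor argument and prune_below_round a placeholder helper:

// Sketch, inside the hypothetical constructor: derive the cutoff from the
// ledger info and drop everything older than the commit window right away.
let cutoff = latest_ledger_info
    .commit_info()
    .round()
    .saturating_sub(DAG_WINDOW);
dag.prune_below_round(cutoff); // hypothetical cleanup helper on the Dag store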

// Create a new DAG store and fetch blocks
let target_round = node.round();
let start_round = commit_li.commit_info().round().saturating_sub(DAG_WINDOW);
let sync_dag_store = Arc::new(RwLock::new(Dag::new(

Contributor:
We probably need a different constructor for this; this one will re-construct the current DAG from storage.
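
A sketch of what a dedicated constructor for the temporary sync DAG could look like; new_empty, new_internal, and the parameter list are assumptions, not code from this PR:

impl Dag {
    // Hypothetical: build an empty DAG for syncing rather than replaying the
    // current DAG from storage the way `Dag::new` does.
    pub fn new_empty(
        epoch_state: Arc<EpochState>,
        storage: Arc<dyn DAGStorage>,
        start_round: Round,
    ) -> Self {
        // Seed the same internal representation `Dag::new` builds, but with no
        // certified nodes and `start_round` as the lowest retained round.
        Self::new_internal(epoch_state, storage, start_round, vec![])
    }
}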

{
let mut dag_writer = sync_dag_store.write();
dag_writer.prune();
if let Some(node_status) = dag_writer.get_node_ref_mut_by_round_digest(

Contributor:
This is not enough; we need to recursively mark all previously committed anchors in the window and all of their causal history.
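
For illustration, the recursive marking could be a reachability walk over parent links starting from the committed anchor; anchor_round, anchor_digest, and the status/node accessors below are placeholders, not this PR's API:

// Sketch: mark the anchor and everything in its causal history as committed.
let mut stack = vec![(anchor_round, anchor_digest)];
while let Some((round, digest)) = stack.pop() {
    if let Some(node_status) = dag_writer.get_node_ref_mut_by_round_digest(round, digest) {
        if node_status.is_committed() {
            continue; // this node's history was already marked
        }
        node_status.mark_committed();
        for parent in node_status.as_node().parents() {
            stack.push((parent.metadata().round(), *parent.metadata().digest()));
        }
    }
}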

Contributor:
Discussed this with @sasha8; I think the best way is probably to read the events from the block metadata that contain the previous anchor information.

Contributor:
Also, it is not enough to mark only the anchor; we should mark all of the causal history.

Contributor:
Also, we are not persisting it. Won't this be a problem in recovery? Should we persist before calling state sync?

Contributor (Author):
We don't persist in the normal path either. I think the db will give us the latest ledger info, from which we can always get this info back even after recovery.

Contributor:
But how will we know which nodes were already committed after recovery?

}
}

if commit_li.ledger_info().ends_epoch() {

Contributor:
I think we can move this check up and skip the syncing.
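
Concretely, that would be an early exit on epoch-ending ledger infos before any sync DAG store is built; a rough sketch, where self.notifier stands in for whatever component ends up owning send_epoch_change and the surrounding function is assumed to return a Result:

// Sketch of the proposed ordering: handle the epoch change first and return,
// so no sync DAG is constructed for a ledger info that ends the epoch.
if commit_li.ledger_info().ends_epoch() {
    let proof = EpochChangeProof::new(vec![commit_li.to_owned()], /*more=*/ false);
    self.notifier.send_epoch_change(proof).await;
    return Ok(());
}
// ... only then compute the window and fetch the missing nodes ...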

use itertools::Itertools;
use std::sync::Arc;

pub const DAG_WINDOW: u64 = 10;

Contributor:
We probably need to move this to the order rule, since it affects whether a node is included in a commit.

Contributor (Author):
I was thinking this should be an onchain config.
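
If it does become an onchain config, the hardcoded constant would turn into a field read from that config; a hypothetical shape, not an existing Aptos type:

// Hypothetical config struct; the actual onchain consensus config layout
// is not part of this PR.
#[derive(Serialize, Deserialize, Clone, Debug)]
pub struct DagConsensusConfig {
    // Number of rounds retained behind the latest commit for ordering and
    // state sync (replaces the hardcoded DAG_WINDOW).
    pub window_size: u64,
}

impl Default for DagConsensusConfig {
    fn default() -> Self {
        Self { window_size: 10 }
    }
}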

@ibalajiarun force-pushed the balaji/dag-state-sync branch from fab0ad8 to 6dd6796 on September 7, 2023 23:41

github-actions bot commented Sep 8, 2023

✅ Forge suite compat success on aptos-node-v1.6.2 ==> 6dd67967a3c9f46769b8f50a4acbb582b7449617

Compatibility test results for aptos-node-v1.6.2 ==> 6dd67967a3c9f46769b8f50a4acbb582b7449617 (PR)
1. Check liveness of validators at old version: aptos-node-v1.6.2
compatibility::simple-validator-upgrade::liveness-check : committed: 4252 txn/s, latency: 7318 ms, (p50: 6300 ms, p90: 11100 ms, p99: 22600 ms), latency samples: 170080
2. Upgrading first Validator to new version: 6dd67967a3c9f46769b8f50a4acbb582b7449617
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1779 txn/s, latency: 16369 ms, (p50: 19000 ms, p90: 22300 ms, p99: 22400 ms), latency samples: 92540
3. Upgrading rest of first batch to new version: 6dd67967a3c9f46769b8f50a4acbb582b7449617
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1435 txn/s, latency: 17093 ms, (p50: 19000 ms, p90: 22900 ms, p99: 38400 ms), latency samples: 91840
4. upgrading second batch to new version: 6dd67967a3c9f46769b8f50a4acbb582b7449617
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 3478 txn/s, latency: 9173 ms, (p50: 9500 ms, p90: 12900 ms, p99: 13200 ms), latency samples: 146080
5. check swarm health
Compatibility test for aptos-node-v1.6.2 ==> 6dd67967a3c9f46769b8f50a4acbb582b7449617 passed
Test Ok

github-actions bot commented Sep 8, 2023

✅ Forge suite realistic_env_max_load success on 6dd67967a3c9f46769b8f50a4acbb582b7449617

two traffics test: inner traffic : committed: 6548 txn/s, latency: 5987 ms, (p50: 5700 ms, p90: 7700 ms, p99: 11600 ms), latency samples: 2829080
two traffics test : committed: 100 txn/s, latency: 3097 ms, (p50: 3000 ms, p90: 3800 ms, p99: 4800 ms), latency samples: 1820
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.220, avg: 0.211", "QsPosToProposal: max: 0.166, avg: 0.152", "ConsensusProposalToOrdered: max: 0.623, avg: 0.589", "ConsensusOrderedToCommit: max: 0.554, avg: 0.525", "ConsensusProposalToCommit: max: 1.160, avg: 1.115"]
Max round gap was 1 [limit 4] at version 1363788. Max no progress secs was 3.631506 [limit 10] at version 1363788.
Test Ok

github-actions bot commented Sep 8, 2023

✅ Forge suite framework_upgrade success on aptos-node-v1.5.1 ==> 6dd67967a3c9f46769b8f50a4acbb582b7449617

Compatibility test results for aptos-node-v1.5.1 ==> 6dd67967a3c9f46769b8f50a4acbb582b7449617 (PR)
Upgrade the nodes to version: 6dd67967a3c9f46769b8f50a4acbb582b7449617
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 5143 txn/s, latency: 6282 ms, (p50: 6500 ms, p90: 9600 ms, p99: 12400 ms), latency samples: 190320
5. check swarm health
Compatibility test for aptos-node-v1.5.1 ==> 6dd67967a3c9f46769b8f50a4acbb582b7449617 passed
Test Ok
