[Quorum Store] networking integration #5779

Merged: 14 commits, Dec 18, 2022

Conversation

bchocho
Contributor

@bchocho bchocho commented Dec 6, 2022

Description

Quorum Store related changes to consensus networking.

  • Add messages for Fragment, BatchRequest, Batch, SignedDigest, ProofOfStore
  • Filter these messages if quorum store is not enabled
  • Add sender-only verification for some messages
  • Add a separate FIFO local channel for quorum store messages
  • Add and use broadcast without sending to self
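To illustrate the last bullet, here is a minimal sketch of how the recipients of a broadcast that skips the local node could be chosen. `PeerId` as a `u64` and the helper `recipients_without_self` are hypothetical names for illustration only; this is not the PR's actual `broadcast_without_self` implementation.

```rust
/// Hypothetical stand-in for a validator identity; the real code has its own peer ID type.
type PeerId = u64;

/// Returns every validator except `self_id`, preserving the input order.
/// A broadcast "without self" would then send to exactly these recipients.
fn recipients_without_self(validators: &[PeerId], self_id: PeerId) -> Vec<PeerId> {
    validators
        .iter()
        .copied()
        .filter(|peer| *peer != self_id)
        .collect()
}
```

Skipping the local node avoids routing a message back through the sender's own receive path when the sender already holds the data.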

Test Plan

Existing tests. The logic is coming in future PRs.

@bchocho bchocho force-pushed the brian/quorum-store-networking branch from a36cc36 to 4c22759 Compare December 8, 2022 23:28
}
}

// TODO: implement properly (and proper place) w.o. public fields.
Contributor Author

@sasha8 @gelash would you consider this resolved?

Contributor

Not sure. Fields in SignedDigestInfo and SignedDigest are public. @gelash ?

Contributor

maybe it's okay, but I guess we wanted to revisit whether the fields should be pub or not.

Contributor

This looks very weird, especially since one field is private with a getter while the others are public; we should be consistent.

pub(crate) batch_info: BatchInfo,
}

// TODO: make epoch, source, signature fields treatment consistent across structs.
Contributor Author

@sasha8 @gelash would you consider this resolved?

@bchocho bchocho marked this pull request as ready for review December 9, 2022 00:57
@bchocho bchocho requested a review from igor-aptos December 9, 2022 00:57
Contributor

@sasha8 sasha8 left a comment

Left a few comments. Mostly around quorum_store_enabled verification.

let msg = ConsensusMsg::ProofOfStoreMsg(Box::new(proof_of_store));
self.broadcast_without_self(msg).await
}

Contributor

Should we move the two above methods to QuorumStoreSender trait?

/// Quorum Store: Send a signed batch digest. This is a vote for the batch and a promise that
/// the batch of transactions was received and will be persisted until batch expiration.
SignedDigestMsg(Box<SignedDigest>),
/// Quorum Store: Broadcast a completed proof of store (a digest that received 2f+1 votes).
Contributor

Completed -> certified or valid?

validator: &ValidatorVerifier,
quorum_store_enabled: bool,
) -> anyhow::Result<()> {
match (quorum_store_enabled, self) {
(false, Payload::DirectMempool(_)) => Ok(()),
(true, Payload::InQuorumStore(proof_with_status)) => {
for proof in proof_with_status.proofs.iter() {
proof.verify(validator)?;
proof.verify(peer_id, validator, quorum_store_enabled)?;
Contributor

Do we need to pass quorum_store_enabled here? We know it is true from the match so why check again inside the verify?
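The point can be sketched as follows: once the `match` has selected the `(true, InQuorumStore)` arm, the flag is known to be true, so the per-proof check needs no flag parameter. The types below are simplified, hypothetical stand-ins (the real `verify` takes a validator verifier and checks signatures), and the fallback arm is an assumption, since the diff excerpt does not show it.

```rust
/// Hypothetical stand-in for a proof of store; the real type carries signatures.
struct Proof {
    valid: bool,
}

impl Proof {
    /// Verifies only the proof itself; it takes no `quorum_store_enabled` flag.
    fn verify(&self) -> Result<(), String> {
        if self.valid {
            Ok(())
        } else {
            Err("invalid proof".to_string())
        }
    }
}

/// Simplified payload with just the two variants visible in the diff above.
enum Payload {
    DirectMempool(Vec<u8>),
    InQuorumStore(Vec<Proof>),
}

fn verify_payload(payload: &Payload, quorum_store_enabled: bool) -> Result<(), String> {
    match (quorum_store_enabled, payload) {
        (false, Payload::DirectMempool(_)) => Ok(()),
        // This arm is only reached when the flag is true, so the proofs
        // do not need to check it again.
        (true, Payload::InQuorumStore(proofs)) => {
            for proof in proofs {
                proof.verify()?;
            }
            Ok(())
        }
        // Assumed fallback: a mismatch between flag and payload type is an error.
        (enabled, _) => Err(format!(
            "payload type does not match quorum_store_enabled = {}",
            enabled
        )),
    }
}
```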

) -> anyhow::Result<()> {
if !quorum_store_enabled {
return Err(anyhow::anyhow!(
"Quorum store is not enabled locally. Sender: {}, epoch: {}",
Contributor

@sasha8 sasha8 Dec 9, 2022

I am not sure we need this as this verify will always be called with quorum_store_enabled = true. See the comment above.

@@ -91,6 +98,23 @@ impl UnverifiedEvent {
cd.verify(validator)?;
VerifiedEvent::CommitDecision(cd)
}
UnverifiedEvent::FragmentMsg(f) => {
f.verify(peer_id, quorum_store_enabled)?;
Contributor

@sasha8 sasha8 Dec 9, 2022

Would it be cleaner to check quorum_store_enabled earlier? We already know here that if it is false we should not even get quorum store messages.
Maybe check it in process_message() in epoch_manager?

"Quorum store is not enabled locally. Sender: {}, epoch: {}",
peer_id,
self.epoch(),
));
Contributor

@sasha8 sasha8 Dec 9, 2022

Same comment as above. I would check quorum_store_enabled when getting the message.

- Add messages for Fragment, Batch, SignedDigest, ProofOfStore
- Add sender-only verification for some messages
- Add a separate FIFO local channel for quorum store messages
- Add and use broadcast without sending to self
@bchocho bchocho force-pushed the brian/quorum-store-networking branch from 19adda4 to cc72c96 Compare December 9, 2022 19:49
@@ -836,6 +846,29 @@ impl EpochManager {
Ok(None)
}

async fn check_quorum_store_enabled(
Contributor

The name is confusing. Maybe check_unverified_event_type?

Contributor

this is not an async function; also, isn't this redundant, since we already pass self.quorum_store_enabled to the verify function?

Contributor Author

I originally did the check in the verify function, but changed it to do the separate check based on Sasha's feedback (#5779 (comment)). I think this is a bit clearer; it avoids dirtying the verify function which should really verify the correctness of the message.
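A minimal sketch of the separation described here, written as a plain (non-async) function since nothing in it awaits. The trimmed-down `UnverifiedEvent` enum, the helper `is_quorum_store_message`, and the `String` error type are assumptions for illustration; the real enum carries payloads and has more variants.

```rust
/// Hypothetical, trimmed-down event type without payloads.
enum UnverifiedEvent {
    ProposalMsg,
    FragmentMsg,
    BatchMsg,
    SignedDigestMsg,
    ProofOfStoreMsg,
}

impl UnverifiedEvent {
    /// True for the message kinds that belong to quorum store.
    fn is_quorum_store_message(&self) -> bool {
        matches!(
            self,
            Self::FragmentMsg | Self::BatchMsg | Self::SignedDigestMsg | Self::ProofOfStoreMsg
        )
    }
}

/// Gate applied when a message is received, before any verification.
/// `verify` can then focus on the correctness of the message itself.
fn check_quorum_store_enabled(
    event: &UnverifiedEvent,
    quorum_store_enabled: bool,
) -> Result<(), String> {
    if event.is_quorum_store_message() && !quorum_store_enabled {
        return Err("Quorum store is not enabled locally".to_string());
    }
    Ok(())
}
```

With this split, a quorum store message that arrives while the feature is off is rejected before any signature work is done.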

Contributor

@davidiw davidiw left a comment

how are we testing backwards compatibility?

@@ -289,6 +344,11 @@ impl NetworkTask {
) -> (NetworkTask, NetworkReceivers) {
let (consensus_messages_tx, consensus_messages) =
aptos_channel::new(QueueStyle::LIFO, 1, Some(&counters::CONSENSUS_CHANNEL_MSGS));
let (quorum_store_messages_tx, quorum_store_messages) = aptos_channel::new(
QueueStyle::FIFO,
1000,
Contributor

this is a lot of messages, why do we need to buffer so many?

Contributor Author

We really shouldn't need to buffer this many, especially if back pressure is working well. However, it looks like we'll have to redo back pressure, so I'm inclined to just keep this for now and revisit it later.

Contributor

This may consume a significant amount of memory if messages are not dequeued fast enough; imagine a fragment message at the maximum message size (~64MB) * 1000.

Contributor Author

Yeah that's probably risky even with a single bad validator and quorum store off. Reduced to 50 for now.
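The worst case discussed in this thread is simple arithmetic: queue capacity times maximum message size. The sketch below uses the reviewer's approximate 64 MB figure (taken here as 64 MiB) and assumes the capacity bounds a single queue; if the limit turns out to apply per sender, the bound scales accordingly. The constant and function names are hypothetical.

```rust
/// Approximate maximum message size quoted in the review (~64 MB), taken as 64 MiB.
const MAX_MESSAGE_BYTES: u64 = 64 * 1024 * 1024;

/// Upper bound on buffered bytes if every slot holds a maximum-size message.
fn worst_case_buffer_bytes(capacity: u64, max_message_bytes: u64) -> u64 {
    capacity * max_message_bytes
}

/// Same bound expressed in whole mebibytes, for easier reading.
fn worst_case_buffer_mib(capacity: u64, max_message_bytes: u64) -> u64 {
    worst_case_buffer_bytes(capacity, max_message_bytes) / (1024 * 1024)
}
```

Under these assumptions a capacity of 1000 allows roughly 67 GB of buffered messages, while a capacity of 50 caps it at about 3.4 GB.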

BlockStage::NETWORK_RECEIVED,
);
}
if let Err(e) = self.consensus_messages_tx.push(
Contributor

this piece of code seems repetitive

}
}

pub(crate) fn take_transactions(self) -> Vec<SerializedTransaction> {
Contributor

this is typically called into_transactions, take mostly refers to a &mut self function
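The naming convention referred to here can be shown side by side. `TxnBundle` and the `Vec<u8>` alias for `SerializedTransaction` are hypothetical stand-ins, not the struct from the PR.

```rust
/// Hypothetical alias; the real `SerializedTransaction` is its own type.
type SerializedTransaction = Vec<u8>;

/// Hypothetical owner of a list of transactions.
struct TxnBundle {
    transactions: Vec<SerializedTransaction>,
}

impl TxnBundle {
    /// `into_*`: consumes `self` and hands over the owned data.
    fn into_transactions(self) -> Vec<SerializedTransaction> {
        self.transactions
    }

    /// `take_*`: borrows mutably, moves the data out, and leaves an empty
    /// vector behind so the value stays usable.
    fn take_transactions(&mut self) -> Vec<SerializedTransaction> {
        std::mem::take(&mut self.transactions)
    }
}
```

Since the method in the diff takes `self` by value, the `into_` prefix matches what it does.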

#[derive(Clone, Debug, Deserialize, Serialize, PartialEq, Eq)]
pub struct Batch {
pub(crate) source: PeerId,
// None is a request, Some(payload) is a response.
Contributor

this seems hacky, we should just have two types
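A sketch of the two-type alternative, in line with the PR description, which lists both `BatchRequest` and `Batch` messages. The field sets, the `u64` peer ID, the fixed-size digest, and the `respond` helper are all assumptions for illustration.

```rust
/// Hypothetical stand-ins for the real identity and digest types.
type PeerId = u64;
type Digest = [u8; 32];

/// A request names the batch it wants and never carries a payload.
#[derive(Clone, Debug, PartialEq, Eq)]
struct BatchRequest {
    source: PeerId,
    digest: Digest,
}

/// A response always carries the payload, so no `Option` is needed.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Batch {
    source: PeerId,
    digest: Digest,
    payload: Vec<Vec<u8>>,
}

impl BatchRequest {
    /// Builds the response for this request on behalf of `responder`.
    fn respond(&self, responder: PeerId, payload: Vec<Vec<u8>>) -> Batch {
        Batch {
            source: responder,
            digest: self.digest,
            payload,
        }
    }
}
```

With separate types, a handler cannot mistake a request for a response, and the compiler enforces that every `Batch` has a payload.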

- Private for all members of "message" types. We will likely have to implement getters/setters when integrating QS implementation
- Public for all members of *Info types. These are pure containers.
- Remove (crate) where the mod is already pub(crate)
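The visibility convention in this commit message could look like the sketch below: a pure container with public members, and a message type with private members read through getters. The specific fields (`digest`, `expiration_round`, `epoch`) are hypothetical and chosen only to show the pattern.

```rust
/// Pure container ("*Info" type): all members public.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SignedDigestInfo {
    pub digest: [u8; 32],
    pub expiration_round: u64,
}

/// "Message" type: members private, accessed through getters.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SignedDigest {
    epoch: u64,
    info: SignedDigestInfo,
}

impl SignedDigest {
    pub fn new(epoch: u64, info: SignedDigestInfo) -> Self {
        Self { epoch, info }
    }

    /// Returns the epoch by value, since `u64` is `Copy`.
    pub fn epoch(&self) -> u64 {
        self.epoch
    }

    /// Returns a shared reference to the contained info.
    pub fn info(&self) -> &SignedDigestInfo {
        &self.info
    }
}
```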
@bchocho bchocho added the CICD:run-e2e-tests label (when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR) Dec 16, 2022
Contributor Author

bchocho commented Dec 16, 2022

@davidiw for these integration PRs the "compat" e2e test should be enough.

For quorum store as a whole, it will be enabled via onchain configs, and we plan to write forge tests that test the transition when the config is flipped.

Contributor Author

bchocho commented Dec 17, 2022

@zekun000 I iterated on your comments and also merged the latest changes in main. Please review!

@bchocho bchocho enabled auto-merge (squash) December 18, 2022 19:13
@github-actions
Contributor

✅ Forge suite land_blocking success on 0b90884d4a47b1f2235f44fee95324c07a313b6f

performance benchmark with full nodes : 5734 TPS, 6914 ms latency, 12500 ms p99 latency,(!) expired 940 out of 2449500 txns
Test Ok

@github-actions
Contributor

✅ Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 0b90884d4a47b1f2235f44fee95324c07a313b6f

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 0b90884d4a47b1f2235f44fee95324c07a313b6f (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7240 TPS, 5339 ms latency, 6900 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 0b90884d4a47b1f2235f44fee95324c07a313b6f
compatibility::simple-validator-upgrade::single-validator-upgrade : 4510 TPS, 9229 ms latency, 12400 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 0b90884d4a47b1f2235f44fee95324c07a313b6f
compatibility::simple-validator-upgrade::half-validator-upgrade : 4163 TPS, 9873 ms latency, 13200 ms p99 latency,no expired txns
4. upgrading second batch to new version: 0b90884d4a47b1f2235f44fee95324c07a313b6f
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6232 TPS, 6293 ms latency, 11800 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 0b90884d4a47b1f2235f44fee95324c07a313b6f passed
Test Ok

@bchocho bchocho merged commit d264c3b into main Dec 18, 2022
@bchocho bchocho deleted the brian/quorum-store-networking branch December 18, 2022 20:02
@Markuze Markuze mentioned this pull request Dec 26, 2022