
[quorum store] new onchain config for turning on quorum store and turning off mempool broadcast #5400

Merged: bchocho merged 9 commits into main from brian/qs-onchain-config on Dec 16, 2022

Conversation

@bchocho (Contributor) commented Nov 1, 2022

Description

Adds quorum_store_enabled to the consensus onchain config. This is read by

  • Mempool, to turn off mempool broadcast.
  • Epoch manager, to determine whether to use quorum store in the epoch.

Note, there is a race between mempool (long-lived) and the consensus + quorum store components (recreated by the epoch manager) during the config transition. We believe txns can be duplicated, but we do not expect txns to be “lost” or “stuck”. We need to test this further once the quorum store implementation is merged into main.
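Below is a minimal, self-contained sketch of the second consumer: the epoch manager caches the flag when per-epoch components are rebuilt. The type shapes here are stand-ins for illustration; the real accessor appears later in this thread.

    // Stand-in for the on-chain config added in this PR.
    pub enum OnChainConsensusConfig {
        V1, // quorum store off
        V2, // quorum store on
    }

    impl OnChainConsensusConfig {
        pub fn quorum_store_enabled(&self) -> bool {
            matches!(self, OnChainConsensusConfig::V2)
        }
    }

    struct EpochManager {
        quorum_store_enabled: bool,
    }

    impl EpochManager {
        // Per-epoch components are recreated on epoch change, so the flag
        // is re-read from the new epoch's on-chain config each time.
        fn start_new_epoch(&mut self, config: &OnChainConsensusConfig) {
            self.quorum_store_enabled = config.quorum_store_enabled();
        }
    }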

Test Plan

Existing tests. In particular, test_txn_broadcast.



@bchocho force-pushed the brian/qs-onchain-config branch from 602b919 to d1754ed on November 8, 2022 00:52
@bchocho bchocho changed the title [DRAFT] onchain config for turning on quorum store and turning off mempool broadcast [quorum store] new onchain config for turning on quorum store and turning off mempool broadcast Nov 8, 2022
@bchocho (author) left a comment:

The PR is larger than the actual logic change. Added some pointers in the comments above.

@@ -67,6 +68,17 @@ pub(crate) async fn coordinator<V>(
let workers_available = smp.config.shared_mempool_max_concurrent_inbound_syncs;
let bounded_executor = BoundedExecutor::new(workers_available, executor.clone());

let initial_reconfig = mempool_reconfig_events
@bchocho (author):

Previously mempool was not reading the initial config before starting, which seems like an oversight.

This change required a lot of refactoring of the tests. MockDbReaderWriter could not handle the on-chain configs.

}
}

 pub fn broadcast_within_validator_network(&self) -> bool {
-    self.config.shared_mempool_validator_broadcast
+    *self.broadcast_within_validator_network.read()
@bchocho (author):
This is essentially the logic change. Instead of taking a config value, take the onchain config value.

This should not be too expensive, because it's only read on the validators which receive broadcasted transactions in batches.
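For context, a minimal sketch of the pattern under discussion, using std's RwLock for self-containment (the PR uses a lock type whose read() does not return a Result):

    use std::sync::{Arc, RwLock};

    struct SharedMempool {
        // Shared with the reconfig listener, which flips it when the
        // on-chain config changes; previously this was a fixed value
        // taken from the node's local config.
        broadcast_within_validator_network: Arc<RwLock<bool>>,
    }

    impl SharedMempool {
        fn broadcast_within_validator_network(&self) -> bool {
            *self.broadcast_within_validator_network.read().unwrap()
        }
    }

    // Called by the reconfig listener through the same Arc.
    fn on_reconfig(flag: &Arc<RwLock<bool>>, quorum_store_enabled: bool) {
        *flag.write().unwrap() = !quorum_store_enabled;
    }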

Contributor:

Is changing broadcast_within_validator_network underneath safe from mempool's perspective? I.e., if it was false and we change it to true, do we need to re-broadcast transactions in mempool that were skipped before?

It seems odd to have the onchain config modify things underneath the mempool without any coordination. It might be safe, but if it is, it deserves an inline comment explaining why that is the case.

@bchocho (author):

It's well-behaved going from true to false, but the other way around we can have transactions that sit in mempool and are never broadcast or included in a batch.

We really don't want to go back from quorum store, so I think it's OK to have this disruption. WDYT?

I can add this as a comment if it makes sense.

@bchocho added the CICD:run-e2e-tests label (when present, GitHub Actions will run all land-blocking e2e tests from the PR) on Nov 10, 2022
@bchocho bchocho marked this pull request as ready for review November 10, 2022 00:50
@bchocho bchocho requested a review from JoshLind as a code owner November 10, 2022 00:50

@@ -12,6 +12,7 @@ use serde::{Deserialize, Serialize};
#[derive(Clone, Debug, Deserialize, PartialEq, Eq, Serialize)]
pub enum OnChainConsensusConfig {
V1(ConsensusConfigV1),
V2(ConsensusConfigV2),
Contributor:
Noob question: why do we have to create a new config version in this case? Can't we add the new flag to V1 and annotate it with #[serde(default = ...)] to maintain backwards compatibility? In other words, is there a need for a V2 config?
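A sketch of what this suggestion would look like, with the existing fields elided. Note that, as the thread concludes below, a serde default only helps with self-describing formats such as JSON; it does not make BCS bytes forward-compatible:

    use serde::{Deserialize, Serialize};

    #[derive(Clone, Debug, Deserialize, PartialEq, Eq, Serialize)]
    pub struct ConsensusConfigV1 {
        // ...existing fields...
        #[serde(default)] // absent in old JSON payloads => false
        pub quorum_store_enabled: bool,
    }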

Contributor:

+1. cc @igor-aptos

@bchocho (author):

Awesome. So if I make the change and see compat test succeed we should be good, right?

@movekevin (Contributor) commented Nov 11, 2022:

I'm not 100% sure (need to look at the code more later), which is why I want to check with @igor-aptos. The consensus config is stored on-chain as bytes, so if the structure changes, deserialization can break. But yes, tests would fail if deserialization fails. I'd also recommend adding a specific test for the new config + corresponding usage.

@bchocho (author):

Discussed with @zekun000 offline. Since this is using BCS, it will fail, because BCS is not extendable (in order to be canonical).

The on-chain config probably doesn't need to use BCS, but it's using BCS today.
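A self-contained sketch of the failure mode, assuming the bcs and serde crates: bytes encoded from the old struct cannot be decoded into an extended struct, even with a serde default, because BCS is not self-describing:

    use serde::{Deserialize, Serialize};

    #[derive(Serialize, Deserialize)]
    struct ConfigOld {
        exclude_round: u64,
    }

    #[derive(Serialize, Deserialize)]
    struct ConfigNew {
        exclude_round: u64,
        #[serde(default)]
        quorum_store_enabled: bool, // the new field
    }

    fn main() {
        let bytes = bcs::to_bytes(&ConfigOld { exclude_round: 20 }).unwrap();
        // The input ends before the new field, so decoding errors out
        // instead of falling back to the serde default.
        assert!(bcs::from_bytes::<ConfigNew>(&bytes).is_err());
    }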

Contributor:

I think it makes sense to move away from BCS encoding to JSON or something more flexible. The data is stored on-chain just as a vector of bytes, and the interpretation happens off-chain, so we can be more flexible here. But we'd need to do a migration first.

Contributor:

If we don't plan to turn off quorum store, we could have V2 itself represent quorum store, i.e. have:
V1(ConsensusConfigV1),
V2(ConsensusConfigV1),
so you don't need to create another config struct that duplicates ConsensusConfigV1. That's what I did for LeaderReputationType::ProposerAndVoterV2.

Not sure if that is better here or not, though.

@bchocho (author):

Oh, that's interesting. So enabling would look like the below?

    pub fn enable_quorum_store(&self) -> bool {
        match &self {
            OnChainConsensusConfig::V1(_) => false,
            OnChainConsensusConfig::V2(_) => true,
        }
    }

I like that it avoids a lot of repetition. If in an emergency we need to revert, we just push a new V1 config, right?

Contributor:

yes

use std::sync::Arc;
use storage_interface::DbReaderWriter;

pub fn create_database() -> Arc<RwLock<DbReaderWriter>> {
@JoshLind (Contributor) commented Nov 13, 2022:

FYI: I need to look at the details in this PR, but from a quick skim, it seems like overkill to me to create a new crate for a single test helper. I'd just copy/paste this 😄

@bchocho (author):

Cool, makes things easier :) CLion was giving me shame about the copied code :P

counters::VM_RECONFIG_UPDATE_FAIL_COUNT.inc();
error!(LogSchema::event_log(LogEntry::ReconfigUpdate, LogEvent::VMUpdateFail).error(&e));
}

let consensus_config: anyhow::Result<OnChainConsensusConfig> = config_update.get();
if let Err(error) = &consensus_config {
Contributor:

This seems a good place to do

match consensus_config {
    Ok(config) => { /* apply */ }
    Err(error) => { /* log */ }
}

@bchocho (author):

The reason the match was not used: if we fail to get the on-chain consensus config, we instead use the default value (so we still want to proceed, with .unwrap_or_default(), in the Err case).

This is tricky though. Does it make sense to use the default, or just ignore the update? (In the latter case, if a previous config was good, it would continue to use that value.) Epoch manager uses the default when starting round manager: https://github.com/aptos-labs/aptos-core/blob/main/consensus/src/epoch_manager.rs#L720

Contributor:

If the new config is not valid, I think it is safer to keep the old config than to revert to the default? What would be the reason to revert to the default?

Also, given the error! below, this is something unexpected that we want to be alerted on, correct?

@bchocho (author):

Yes, this is something unexpected that we should alert on. I guess using the previous value also makes sense -- but it's really all unexpected.
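A minimal sketch of the handling this thread converges on: apply the flag only on success, and on failure alert while leaving the previously applied value intact. The config type here is a stand-in, and anyhow is assumed for the Result type, matching the snippets above:

    use std::sync::{Arc, RwLock};

    #[derive(Default)]
    struct OnChainConsensusConfig {
        quorum_store_enabled: bool,
    }

    impl OnChainConsensusConfig {
        fn quorum_store_enabled(&self) -> bool {
            self.quorum_store_enabled
        }
    }

    fn apply_consensus_config(
        flag: &Arc<RwLock<bool>>,
        consensus_config: anyhow::Result<OnChainConsensusConfig>,
    ) {
        match consensus_config {
            Ok(config) => {
                *flag.write().unwrap() = !config.quorum_store_enabled();
            }
            Err(error) => {
                // Unexpected: alert, but keep the old value rather than
                // resetting to the default.
                eprintln!("failed to read on-chain consensus config: {error:?}");
            }
        }
    }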





back_pressure_limit: 10,
exclude_round: 20,
max_failed_authors_to_store: 10,
proposer_election_type: ProposerElectionType::LeaderReputation(
Contributor:

Let's have a default for ProposerElectionType, so this is not repeated?
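A sketch of the suggestion with stand-in types (the real ProposerElectionType has more variants and a richer LeaderReputation config): define Default once so fixtures stop repeating the full construction:

    #[derive(Clone, Debug, Default, PartialEq, Eq)]
    pub struct ProposerAndVoterConfig; // stand-in for the real config struct

    #[derive(Clone, Debug, PartialEq, Eq)]
    pub enum ProposerElectionType {
        LeaderReputation(ProposerAndVoterConfig),
        // ...other variants elided...
    }

    impl Default for ProposerElectionType {
        fn default() -> Self {
            ProposerElectionType::LeaderReputation(ProposerAndVoterConfig::default())
        }
    }

A fixture can then write proposer_election_type: ProposerElectionType::default(), or use ..Default::default() on the containing struct.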

@bchocho (author) commented Dec 7, 2022:

@igor-aptos I added inline responses. GitHub is so confusing, I can only see these responses in "Files changed" :(

@bchocho force-pushed the brian/qs-onchain-config branch from 2130b23 to 507db70 on December 9, 2022 01:21


@@ -20,13 +21,15 @@ impl OnChainConsensusConfig {
pub fn leader_reputation_exclude_round(&self) -> u64 {
match &self {
OnChainConsensusConfig::V1(config) => config.exclude_round,
Contributor:

ditto, pattern?

}
}

/// Decouple execution from consensus or not.
pub fn decoupled_execution(&self) -> bool {
match &self {
OnChainConsensusConfig::V1(config) => config.decoupled_execution,
OnChainConsensusConfig::V2(config) => config.decoupled_execution,
Contributor:

ditto

@@ -47,13 +51,22 @@ impl OnChainConsensusConfig {
pub fn max_failed_authors_to_store(&self) -> usize {
match &self {
OnChainConsensusConfig::V1(config) => config.max_failed_authors_to_store,
Contributor:

same, I hope I don't forget the pattern syntax
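Presumably the reviewer means Rust's or-pattern syntax: since V2 reuses ConsensusConfigV1 (per the suggestion earlier in this thread), both variants bind the same type, and one arm replaces the duplicated arms. A self-contained sketch:

    pub struct ConsensusConfigV1 {
        pub max_failed_authors_to_store: usize,
    }

    pub enum OnChainConsensusConfig {
        V1(ConsensusConfigV1),
        V2(ConsensusConfigV1),
    }

    impl OnChainConsensusConfig {
        pub fn max_failed_authors_to_store(&self) -> usize {
            match self {
                OnChainConsensusConfig::V1(config)
                | OnChainConsensusConfig::V2(config) => config.max_failed_authors_to_store,
            }
        }
    }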

}
}

// Type and configuration used for proposer election.
pub fn proposer_election_type(&self) -> &ProposerElectionType {
match &self {
OnChainConsensusConfig::V1(config) => &config.proposer_election_type,
OnChainConsensusConfig::V2(config) => &config.proposer_election_type,
Contributor:

here as well

@@ -641,6 +641,7 @@ impl EpochManager {
));

// Start QuorumStore
self.quorum_store_enabled = onchain_config.quorum_store_enabled();
Contributor:

Nit: why in start_round_manager() and not earlier, in start_new_epoch()?

let consensus_config: anyhow::Result<OnChainConsensusConfig> = config_update.get();
match consensus_config {
Ok(consensus_config) => {
*broadcast_within_validator_network.write() = !consensus_config.quorum_store_enabled();
Contributor:

It is possible that mempool will broadcast for some time after quorum_store is enabled, right?

@bchocho (author):

There's a small window where this can happen. But once broadcast_within_validator_network is set, the validator won't broadcast anymore.

pub fn quorum_store_enabled(&self) -> bool {
match &self {
OnChainConsensusConfig::V1(_config) => false,
OnChainConsensusConfig::V2(_config) => true,
Contributor:

So we cannot use ConsensusConfigV2 and add it there?

@bchocho (author):

We could, but as Igor pointed out, it adds a lot of boilerplate from repeated configs.

@bchocho bchocho enabled auto-merge (squash) December 16, 2022 18:28

@github-actions:

✅ Forge suite land_blocking success on 55a58ed9754f44e3323bbc690e67f553d4d5ddd1

performance benchmark with full nodes: 5736 TPS, 6914 ms latency, 14100 ms p99 latency, (!) expired 940 out of 2450380 txns
Test Ok

@github-actions:

✅ Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 55a58ed9754f44e3323bbc690e67f553d4d5ddd1

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 55a58ed9754f44e3323bbc690e67f553d4d5ddd1 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7260 TPS, 5411 ms latency, 7900 ms p99 latency, no expired txns
2. Upgrading first Validator to new version: 55a58ed9754f44e3323bbc690e67f553d4d5ddd1
compatibility::simple-validator-upgrade::single-validator-upgrade : 4490 TPS, 8962 ms latency, 12600 ms p99 latency, no expired txns
3. Upgrading rest of first batch to new version: 55a58ed9754f44e3323bbc690e67f553d4d5ddd1
compatibility::simple-validator-upgrade::half-validator-upgrade : 4608 TPS, 9233 ms latency, 12400 ms p99 latency, no expired txns
4. Upgrading second batch to new version: 55a58ed9754f44e3323bbc690e67f553d4d5ddd1
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6540 TPS, 6032 ms latency, 9600 ms p99 latency, no expired txns
5. Check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 55a58ed9754f44e3323bbc690e67f553d4d5ddd1 passed
Test Ok

@bchocho bchocho merged commit 05510f2 into main Dec 16, 2022
@bchocho bchocho deleted the brian/qs-onchain-config branch December 16, 2022 19:34
@Markuze Markuze mentioned this pull request Dec 26, 2022