[Consensus Observer] Add block payload verification. #14027

Merged: 3 commits, Jul 21, 2024

Conversation

JoshLind
Contributor

@JoshLind JoshLind commented Jul 17, 2024

Note: most of this code is new unit tests.

Description

This PR makes several improvements to consensus observer. Specifically, it offers the following commits:

  1. Add a reset() method to the execution client trait. This method is already provided internally by sync_to and we want to expose it so that we can call it directly whenever consensus observer needs to reset execution state (e.g., on subscription changes).
  2. Update consensus observer to reset the execution pipeline and all pending block state when switching subscriptions. This shouldn't be strictly necessary, but it seems like a safe fallback.
  3. Update consensus observer to perform message verification for block payloads. Specifically, we:
     (i) verify the payload batch digests (i.e., to make sure all transactions are valid and in the correct order);
     (ii) verify the payload signatures over the batches (if the payload is for the current epoch);
     (iii) if the payload signatures can't be verified in the current epoch, store the payloads and verify them once we transition epochs;
     (iv) when the payloads are retrieved (for an ordered block), make sure the batches and transactions match the expected values in the block (this ties the payload to a verified block); and
     (v) add a configurable maximum number of pending payloads (to avoid OOM attacks).
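
As a rough illustration of steps (i) and (v), here is a minimal, dependency-free sketch. All names (`Batch`, `digest_of`, `PendingPayloads`, `PayloadError`) are hypothetical and are not taken from the PR, and `DefaultHasher` is only a stand-in for a real cryptographic digest.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hypothetical batch: a claimed digest plus the transactions it covers.
#[derive(Clone, Debug)]
pub struct Batch {
    pub claimed_digest: u64,
    pub transactions: Vec<Vec<u8>>,
}

/// Stand-in digest (NOT cryptographic): hashes the transactions in order.
pub fn digest_of(transactions: &[Vec<u8>]) -> u64 {
    let mut hasher = DefaultHasher::new();
    transactions.hash(&mut hasher);
    hasher.finish()
}

#[derive(Debug, PartialEq)]
pub enum PayloadError {
    DigestMismatch { batch_index: usize },
    StoreFull,
}

/// Step (i): recompute each batch digest and compare it to the claimed one.
pub fn verify_batch_digests(batches: &[Batch]) -> Result<(), PayloadError> {
    for (batch_index, batch) in batches.iter().enumerate() {
        if digest_of(&batch.transactions) != batch.claimed_digest {
            return Err(PayloadError::DigestMismatch { batch_index });
        }
    }
    Ok(())
}

/// Step (v): a store keyed by (epoch, round) that refuses inserts past a cap.
pub struct PendingPayloads {
    max_pending: usize,
    payloads: BTreeMap<(u64, u64), Vec<Batch>>,
}

impl PendingPayloads {
    pub fn new(max_pending: usize) -> Self {
        Self { max_pending, payloads: BTreeMap::new() }
    }

    /// Rejects the payload if the store is full or any digest does not match.
    pub fn insert(&mut self, epoch: u64, round: u64, batches: Vec<Batch>) -> Result<(), PayloadError> {
        if self.payloads.len() >= self.max_pending {
            return Err(PayloadError::StoreFull);
        }
        verify_batch_digests(&batches)?;
        self.payloads.insert((epoch, round), batches);
        Ok(())
    }

    pub fn len(&self) -> usize {
        self.payloads.len()
    }
}
```

Checking the cap before doing any hashing means a flood of payloads costs the receiver very little once the store is full, which is the point of bounding pending state.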

Once this lands, there are a few cleanups and simplifications to be done. I'll have those in the next PR.

Testing Plan

New and existing test infrastructure.


trunk-io bot commented Jul 17, 2024

⏱️ 48h 19m total CI duration on this PR
Job Cumulative Duration Recent Runs
test-fuzzers 17h 22m 🟩🟥🟩🟩🟥 (+19 more)
execution-performance / single-node-performance 9h 43m 🟩🟩🟩🟩🟥 (+24 more)
forge-e2e-test / forge 5h 9m 🟥🟩🟥🟩🟩 (+16 more)
forge-compat-test / forge 4h 54m 🟩🟥🟩🟩🟩 (+16 more)
execution-performance / test-target-determinator 2h 15m 🟩🟩🟩🟩🟩 (+24 more)
test-target-determinator 1h 56m 🟩🟩🟩🟩🟩 (+21 more)
check 1h 36m 🟩🟩🟩🟩🟩 (+21 more)
general-lints 46m 🟩🟩🟩🟩🟩 (+21 more)
rust-cargo-deny 44m 🟩🟩🟩🟩🟩 (+21 more)
check-dynamic-deps 35m 🟩🟩🟩🟩🟩 (+25 more)
indexer-grpc-e2e-tests / test-indexer-grpc-docker-compose 32m 🟩🟩🟩🟩🟩 (+16 more)
forge-framework-upgrade-test / forge 16m 🟩
rust-doc-tests 14m 🟩
semgrep/ci 12m 🟩🟩🟩🟩🟩 (+25 more)
rust-doc-tests 11m
rust-move-tests 7m 🟩
rust-move-tests 6m 🟩
rust-move-tests 6m 🟩
rust-move-tests 6m 🟩
file_change_determinator 6m 🟩🟩🟩🟩🟩 (+21 more)
file_change_determinator 5m 🟩🟩🟩🟩🟩 (+22 more)
file_change_determinator 5m 🟩🟩🟩🟩🟩 (+22 more)
rust-move-tests 5m
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 3m 🟩
rust-move-tests 2m 🟩
rust-move-tests 2m 🟩
permission-check 2m 🟩🟩🟩🟩🟩 (+25 more)
permission-check 2m 🟩🟩🟩🟩🟩 (+25 more)
permission-check 2m 🟩🟩🟩🟩🟩 (+22 more)
adhoc-forge-test / forge 1m 🟥
permission-check 1m 🟩🟩🟩🟩🟩 (+22 more)
permission-check 1m 🟩🟩🟩🟩🟩 (+21 more)
determine-docker-build-metadata 1m 🟩🟩🟩🟩🟩 (+21 more)
rust-move-tests 32s
Backport PR 7s 🟥🟥
permission-check 5s 🟩🟩
determine-forge-run-metadata 3s 🟩
rust-move-tests 1s

🚨 3 jobs on the last run were significantly faster/slower than expected

Job Duration vs 7d avg Delta
execution-performance / single-node-performance 26m 12m +105%
test-fuzzers 48m 37m +28%
execution-performance / test-target-determinator 4m 5m -21%


@JoshLind JoshLind changed the base branch from main to co_peer_ref_4 July 17, 2024 12:44
@JoshLind JoshLind added the CICD:run-e2e-tests label (when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR) Jul 17, 2024


@JoshLind JoshLind force-pushed the co_peer_ref_4_v2 branch 2 times, most recently from 6fab6e1 to 423f631 Compare July 17, 2024 19:33
@JoshLind JoshLind force-pushed the co_peer_ref_4_v2 branch 2 times, most recently from 795b301 to da0b809 Compare July 18, 2024 23:42
@JoshLind JoshLind marked this pull request as ready for review July 18, 2024 23:45
@JoshLind JoshLind requested a review from bchocho July 18, 2024 23:45
},
};

// Verify the payload and inline batches before returning the data. The
Contributor Author

@JoshLind JoshLind Jul 19, 2024

FYI: this (final) verification logic will be moved into consensus observer in the next PR (after we clean up the ordered block verification) 😄

pub transactions: Vec<SignedTransaction>,
pub limit: Option<u64>,
pub proof_with_data: ProofWithData,
Contributor

This format looks weird. Maybe we should have an enum similar to Payload? (proof_with_data can carry the actual data in the data status.)

Contributor Author

Sounds good! I'll follow up in the next PR (just to unblock this one) 😄
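
One possible reading of the enum suggestion, sketched under assumptions: the variants, field names, and the simplified `ProofOfStore` below are hypothetical and are not the types from the PR or its follow-up.

```rust
/// Hypothetical, simplified proof type (the real one carries signatures).
#[derive(Debug, Clone, PartialEq)]
pub struct ProofOfStore {
    pub batch_digest: u64,
}

/// Replaces a struct with an `Option<u64>` limit by one variant per case,
/// so the "has a limit" state is explicit in the type.
#[derive(Debug, Clone, PartialEq)]
pub enum TransactionPayload {
    WithoutLimit {
        transactions: Vec<Vec<u8>>,
        proofs: Vec<ProofOfStore>,
    },
    WithLimit {
        transactions: Vec<Vec<u8>>,
        proofs: Vec<ProofOfStore>,
        limit: u64,
    },
}

impl TransactionPayload {
    /// Returns the transaction limit, if this variant carries one.
    pub fn limit(&self) -> Option<u64> {
        match self {
            Self::WithoutLimit { .. } => None,
            Self::WithLimit { limit, .. } => Some(*limit),
        }
    }

    /// Returns the transactions regardless of variant.
    pub fn transactions(&self) -> &[Vec<u8>] {
        match self {
            Self::WithoutLimit { transactions, .. } | Self::WithLimit { transactions, .. } => {
                transactions.as_slice()
            },
        }
    }
}
```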

.transactions
.iter()
.cloned()
.collect::<VecDeque<_>>();
Contributor

this doesn't have to be a VecDeque? just a regular iterator should work?

Contributor Author

Good spot, yes 😄
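
A minimal sketch of the iterator-based alternative, assuming the goal is to check that the payload's transactions are exactly the concatenation of the expected batches. The function name and signature are hypothetical.

```rust
/// Checks that `transactions` equals the concatenation of `expected_batches`,
/// walking a plain slice iterator instead of cloning into a VecDeque.
pub fn matches_expected_batches<T: PartialEq>(
    transactions: &[T],
    expected_batches: &[Vec<T>],
) -> bool {
    let mut remaining = transactions.iter();
    for batch in expected_batches {
        for expected in batch {
            match remaining.next() {
                Some(actual) if actual == expected => {},
                // Either a mismatch or the payload ran out of transactions
                _ => return false,
            }
        }
    }
    // Any leftover transactions are not covered by a batch
    remaining.next().is_none()
}
```

Because the iterator borrows the slice, nothing is cloned, and the single forward pass preserves the ordering check.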


// Verify each of the proof signatures
let validator_verifier = &epoch_state.verifier;
for proof_of_store in &self.transaction_payload.proof_with_data.proofs {
Contributor

We may need to parallelize this. I remember it previously took ~200ms to verify a big block.

Contributor Author

Sounds good. I'll add a comment and then follow up 😄
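
One way the parallelization could look, sketched with scoped threads from the standard library so that it has no dependencies. `Proof`, `verify_proof`, and `verify_proofs_parallel` are hypothetical, and the `valid` flag is a placeholder for a real signature check; a shared thread pool may be a better fit in practice.

```rust
use std::thread;

/// Hypothetical proof. `valid` stands in for the result of a real
/// (expensive) signature verification.
pub struct Proof {
    pub id: u64,
    pub valid: bool,
}

/// Placeholder for the per-proof signature verification.
fn verify_proof(proof: &Proof) -> Result<(), String> {
    if proof.valid {
        Ok(())
    } else {
        Err(format!("invalid signature on proof {}", proof.id))
    }
}

/// Splits the proofs into chunks and verifies each chunk on its own thread.
pub fn verify_proofs_parallel(proofs: &[Proof], num_threads: usize) -> Result<(), String> {
    if proofs.is_empty() {
        return Ok(());
    }
    let num_threads = num_threads.max(1);
    // Ceiling division; at least 1 because `proofs` is non-empty
    let chunk_size = (proofs.len() + num_threads - 1) / num_threads;
    thread::scope(|scope| {
        // Spawn all workers first so the chunks run concurrently
        let handles: Vec<_> = proofs
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || chunk.iter().try_for_each(verify_proof)))
            .collect();
        // Join in order and return the first error, if any
        handles
            .into_iter()
            .try_for_each(|handle| handle.join().expect("verifier thread panicked"))
    })
}
```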


// If the payload is for the current epoch, verify the proof signatures
let epoch_state = self.get_epoch_state();
let verified_payload = if block_epoch == epoch_state.epoch {
Contributor

Not sure if it's too much complexity to handle messages across epochs. We can state sync after joining the new epoch and then buffer messages from the same epoch.

Contributor Author

Yeah, I originally started with this, but it turns out to be pretty bad when running at max load across the epoch boundaries. The nodes fall behind and take a while to catch up (spiking the e2e latency). But, we can see how this performs and maybe simplify it if we feel the need (epoch changes are more frequent in our tests) 😄
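
The cross-epoch handling discussed here can be sketched roughly as follows. This is a simplified model under assumptions: `Payload`, `PayloadStore`, and `start_epoch` are hypothetical names, and `signatures_valid` is a placeholder for verifying signatures against the epoch's validator set.

```rust
use std::collections::BTreeMap;

/// Hypothetical payload. `signatures_valid` is a placeholder for the outcome
/// of a real signature check against the epoch's validator set.
pub struct Payload {
    pub epoch: u64,
    pub signatures_valid: bool,
}

struct Entry {
    payload: Payload,
    verified: bool,
}

pub struct PayloadStore {
    current_epoch: u64,
    // Keyed by (epoch, round)
    entries: BTreeMap<(u64, u64), Entry>,
}

impl PayloadStore {
    pub fn new(current_epoch: u64) -> Self {
        Self { current_epoch, entries: BTreeMap::new() }
    }

    /// Verifies immediately for the current epoch; otherwise buffers unverified.
    pub fn insert(&mut self, round: u64, payload: Payload) -> Result<(), String> {
        let verified = payload.epoch == self.current_epoch;
        if verified && !payload.signatures_valid {
            return Err(format!("invalid signatures for round {round}"));
        }
        self.entries.insert((payload.epoch, round), Entry { payload, verified });
        Ok(())
    }

    /// On an epoch change, verifies buffered payloads for the new epoch and
    /// drops the ones that fail.
    pub fn start_epoch(&mut self, new_epoch: u64) {
        self.current_epoch = new_epoch;
        self.entries.retain(|(epoch, _round), entry| {
            if *epoch != new_epoch || entry.verified {
                return true;
            }
            entry.verified = entry.payload.signatures_valid;
            entry.verified
        });
    }

    /// Returns the verification status, or None if no payload is stored.
    pub fn is_verified(&self, epoch: u64, round: u64) -> Option<bool> {
        self.entries.get(&(epoch, round)).map(|entry| entry.verified)
    }
}
```

Buffering lets the node keep payloads that arrive just before the epoch change instead of refetching them afterwards, which matches the latency concern described in the reply above.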

/// Removes the committed blocks from the payload store
pub fn remove_committed_blocks(&self, committed_blocks: &[Arc<PipelinedBlock>]) {
// Identify the highest epoch and round for the committed blocks
let (highest_epoch, highest_round) = committed_blocks
Contributor

It should be the last block, so there's no need to iterate.

Contributor Author

Aah, good spot -- yeah, the blocks should already be verified here. 😄
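
A sketch of the simplification, assuming the committed blocks arrive in ascending order and the payloads are kept in a map keyed by (epoch, round). `Block` and `remove_committed` are hypothetical stand-ins, not the PR's `PipelinedBlock` or payload store.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a committed block.
pub struct Block {
    pub epoch: u64,
    pub round: u64,
}

/// Drops every payload at or below the last committed block's (epoch, round).
pub fn remove_committed(
    payloads: &mut BTreeMap<(u64, u64), Vec<u8>>,
    committed_blocks: &[Block],
) {
    // Assumption: blocks are ordered, so the last one is the highest
    if let Some(last) = committed_blocks.last() {
        let highest = (last.epoch, last.round);
        // split_off returns the entries with keys >= highest
        let mut kept = payloads.split_off(&highest);
        // The highest committed payload itself is also no longer needed
        kept.remove(&highest);
        *payloads = kept;
    }
}
```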

@JoshLind JoshLind enabled auto-merge (rebase) July 21, 2024 14:09

Contributor

✅ Forge suite compat success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 6644003464ad6a03350d54e7eb17bab7745ae551

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 6644003464ad6a03350d54e7eb17bab7745ae551 (PR)
1. Check liveness of validators at old version: 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5
compatibility::simple-validator-upgrade::liveness-check : committed: 8392.624701977187 txn/s, latency: 3540.7282876091144 ms, (p50: 3000 ms, p90: 4200 ms, p99: 23500 ms), latency samples: 352840
2. Upgrading first Validator to new version: 6644003464ad6a03350d54e7eb17bab7745ae551
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7456.600881150239 txn/s, latency: 3562.465888111888 ms, (p50: 4000 ms, p90: 4300 ms, p99: 4400 ms), latency samples: 143000
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6562.12241028288 txn/s, latency: 4815.236905132193 ms, (p50: 4500 ms, p90: 7600 ms, p99: 9100 ms), latency samples: 257200
3. Upgrading rest of first batch to new version: 6644003464ad6a03350d54e7eb17bab7745ae551
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6169.597485347633 txn/s, latency: 4444.8713716373295 ms, (p50: 4600 ms, p90: 5500 ms, p99: 5600 ms), latency samples: 120440
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6482.983220546226 txn/s, latency: 4649.128990908346 ms, (p50: 4600 ms, p90: 5400 ms, p99: 6900 ms), latency samples: 244180
4. upgrading second batch to new version: 6644003464ad6a03350d54e7eb17bab7745ae551
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 12060.182329471822 txn/s, latency: 2220.835078337555 ms, (p50: 2400 ms, p90: 2600 ms, p99: 2800 ms), latency samples: 213180
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 12153.857692768552 txn/s, latency: 2811.8723151917184 ms, (p50: 2800 ms, p90: 3300 ms, p99: 3800 ms), latency samples: 397980
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 6644003464ad6a03350d54e7eb17bab7745ae551 passed
Test Ok

Contributor

✅ Forge suite realistic_env_max_load success on 6644003464ad6a03350d54e7eb17bab7745ae551

two traffics test: inner traffic : committed: 9272.781310402523 txn/s, submitted: 9501.802513351633 txn/s, failed submission: 0.2630009220821191 txn/s, expired: 229.02120294910932 txn/s, latency: 2883.5194531108186 ms, (p50: 2700 ms, p90: 3300 ms, p99: 6600 ms), latency samples: 3525760
two traffics test : committed: 100.07503682678474 txn/s, latency: 1940.233 ms, (p50: 2000 ms, p90: 2100 ms, p99: 6600 ms), latency samples: 2000
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.256, avg: 0.215", "QsPosToProposal: max: 1.162, avg: 0.708", "ConsensusProposalToOrdered: max: 0.326, avg: 0.291", "ConsensusOrderedToCommit: max: 0.386, avg: 0.369", "ConsensusProposalToCommit: max: 0.677, avg: 0.661"]
Max round gap was 1 [limit 4] at version 1918207. Max no progress secs was 5.776979 [limit 15] at version 1918207.
Test Ok

Contributor

✅ Forge suite framework_upgrade success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 6644003464ad6a03350d54e7eb17bab7745ae551

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 6644003464ad6a03350d54e7eb17bab7745ae551 (PR)
Upgrade the nodes to version: 6644003464ad6a03350d54e7eb17bab7745ae551
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1245.8203314752282 txn/s, submitted: 1247.3025686334377 txn/s, failed submission: 1.4822371582096705 txn/s, expired: 1.4822371582096705 txn/s, latency: 2921.1516061867937 ms, (p50: 2100 ms, p90: 5700 ms, p99: 12900 ms), latency samples: 100860
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1139.253602083359 txn/s, submitted: 1141.6035476322108 txn/s, failed submission: 2.349945548851813 txn/s, expired: 2.349945548851813 txn/s, latency: 2876.436159240924 ms, (p50: 2100 ms, p90: 5400 ms, p99: 11400 ms), latency samples: 96960
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 6644003464ad6a03350d54e7eb17bab7745ae551 passed
Upgrade the remaining nodes to version: 6644003464ad6a03350d54e7eb17bab7745ae551
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1131.0245716384527 txn/s, submitted: 1133.550205070857 txn/s, failed submission: 2.525633432404178 txn/s, expired: 2.525633432404178 txn/s, latency: 2726.697431993504 ms, (p50: 2100 ms, p90: 5100 ms, p99: 9300 ms), latency samples: 98520
Test Ok

@JoshLind JoshLind disabled auto-merge July 21, 2024 18:27
@JoshLind JoshLind merged commit ff35414 into main Jul 21, 2024
76 of 91 checks passed
@JoshLind JoshLind deleted the co_peer_ref_4_v2 branch July 21, 2024 18:27
Labels
CICD:run-e2e-tests (when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR)
2 participants