[release-builder] local simulation of governance proposals #13949

Merged
merged 1 commit into from
Jul 23, 2024

Conversation

@vgao1996 vgao1996 commented Jul 9, 2024

Description

This introduces a new release builder command that enables the simulation of governance proposals. Currently only multi-step proposals are supported.

It uses the remote debugger infrastructure to fetch real chain state for local simulation, and layers an in-memory database on top to store the side effects generated by the governance scripts.
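The layering described above can be sketched roughly as follows. This is a minimal illustration rather than the code in simulate.rs: `BaseState`, `OverlayState`, `FixedBase`, and the string keys are hypothetical names chosen for this example, and only `apply_write_set` echoes a method name that appears later in this review.

```rust
use std::collections::HashMap;

/// Hypothetical read-only source of chain state (stands in for the remote debugger).
pub trait BaseState {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

/// A write recorded by a simulated script: `Some` sets a value, `None` deletes it.
pub type WriteSet = Vec<(String, Option<Vec<u8>>)>;

/// Layers an in-memory delta on top of a base state that is never mutated.
pub struct OverlayState<B: BaseState> {
    base: B,
    delta: HashMap<String, Option<Vec<u8>>>,
}

impl<B: BaseState> OverlayState<B> {
    pub fn new(base: B) -> Self {
        Self { base, delta: HashMap::new() }
    }

    /// Reads prefer the local delta and fall back to the base state.
    pub fn get(&self, key: &str) -> Option<Vec<u8>> {
        match self.delta.get(key) {
            Some(local) => local.clone(),
            None => self.base.get(key),
        }
    }

    /// Records the side effects of one script so later scripts can observe them.
    pub fn apply_write_set(&mut self, write_set: WriteSet) {
        for (key, value) in write_set {
            self.delta.insert(key, value);
        }
    }
}

/// Toy base state backed by a map, used here in place of a real network fetch.
pub struct FixedBase(pub HashMap<String, Vec<u8>>);

impl BaseState for FixedBase {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.0.get(key).cloned()
    }
}
```

Keeping deletions as `None` entries in the delta lets a simulated delete hide a value that still exists in the base state, which a plain "remove from the map" would not.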

Normally, governance scripts need to be approved through on-chain governance before they can be executed. This process involves setting up various states (e.g., staking pool, delegated voter), which can be quite complex.

This simulation bypasses these challenges by patching specific Move functions with mock versions, most notably fun resolve_multi_step_proposal, thus allowing the governance process to be skipped altogether. In other words, this simulation is intended for checking whether a governance proposal will execute successfully, assuming it gets approved.

How to run simulation

First generate the proposal

cargo run -p aptos-release-builder generate-proposals --release-config data/release.yaml --output-dir output

Then run simulation via the following command

cargo run -p aptos-release-builder simulate-multi-step-proposal --network mainnet --proposal-dir output/sources/v1.14/step_1_upgrade_framework/

Here's what the output should look like

Found 2 scripts
    output/sources/v1.14/step_1_upgrade_framework/0-gas-schedule.move
    output/sources/v1.14/step_1_upgrade_framework/1-features.move
Compiling scripts...
Compiling, may take a little while to download git dependencies...
INCLUDING DEPENDENCY AptosFramework
INCLUDING DEPENDENCY AptosStdlib
INCLUDING DEPENDENCY MoveStdlib
BUILDING script
Compiling, may take a little while to download git dependencies...
INCLUDING DEPENDENCY AptosFramework
INCLUDING DEPENDENCY AptosStdlib
INCLUDING DEPENDENCY MoveStdlib
BUILDING script
Patching framework functions to bypass governance.. done
Creating and funding sender account.. done
Executing governance scripts...
    0-gas-schedule.move
        Keep(
            Success,
        )
    1-features.move
        Keep(
            Success,
        )
All scripts succeeded!

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Other (release tooling)

Key Areas to Review

simulate.rs

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

trunk-io bot commented Jul 9, 2024

⏱️ 2h 21m total CI duration on this PR
Job                        Cumulative Duration   Recent Runs
test-fuzzers               1h 51m                🟩🟩🟩
rust-move-tests            6m                    🟩
general-lints              6m                    🟩🟩🟩
rust-move-tests            5m                    🟩
check-dynamic-deps         4m                    🟩🟩🟩
rust-cargo-deny            3m                    🟩🟩
rust-move-tests            2m                    🟩
semgrep/ci                 1m                    🟩🟩🟩
file_change_determinator   36s                   🟩🟩🟩
file_change_determinator   30s                   🟩🟩🟩
permission-check           12s                   🟩🟩🟩
permission-check           9s                    🟩🟩🟩
permission-check           9s                    🟩🟩🟩
permission-check           8s                    🟩🟩🟩


@vgao1996 vgao1996 requested a review from davidiw July 9, 2024 16:48
@vgao1996 vgao1996 force-pushed the gov-sim branch 2 times, most recently from a62dd2d to ffe4265 Compare July 11, 2024 23:52
let txn_gas_params = &mut gas_params.vm.txn;
// Use the alternative limits for governance proposals
// TODO: In the future, consider adding the execution hashes of the scripts to the approval list.
txn_gas_params.max_execution_gas = txn_gas_params.max_execution_gas_gov;
Contributor

Doesn't your recent change to automatically bump these numbers to _gov if it is a governance proposal work here?

Contributor Author

This is one of the more ugly parts of the current implementation.

> Doesn't your recent change to automatically bump these numbers to _gov if it is a governance proposal work here?

Yes, but the alt limits will only kick in if the script has its hash added to the list of approved execution hashes, which gets skipped by the mock version of resolve_multi_step_proposal as well.

It's possible for us to manually add it in Rust and I tried it, but there are some complexities involved.
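The dependency described in this reply, where the governance limit only applies once the script hash is on the approved list, can be illustrated with a small sketch. The struct and function below are hypothetical simplifications written for this example; only the field names `max_execution_gas` and `max_execution_gas_gov` come from the snippet under review, and the real VM logic may differ.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for the transaction gas parameters.
pub struct TxnGasParams {
    pub max_execution_gas: u64,
    pub max_execution_gas_gov: u64,
}

/// Returns the execution gas limit that applies to a script with the given hash.
/// The governance limit is used only when the hash is on the approved list.
pub fn effective_max_execution_gas(
    params: &TxnGasParams,
    approved_hashes: &HashSet<Vec<u8>>,
    script_hash: &[u8],
) -> u64 {
    if approved_hashes.contains(script_hash) {
        params.max_execution_gas_gov
    } else {
        params.max_execution_gas
    }
}
```

With a mock `resolve_multi_step_proposal` that never populates the approved list, a function like this would always return the regular limit, which is why the snippet above overrides `max_execution_gas` directly.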

@vgao1996
Contributor Author

Made a major update to the PR

  • @georgemitenkov I addressed all your comments and added the should_restart_execution check you requested.
  • @runtian-zhou I implemented the check you requested, ensuring that the last script cannot have a next execution hash.
  • Also fixed two bugs
    • The warm vm cache is now flushed every time we execute a script, so that we always load the latest framework code cc @msmouse
    • The patching of the framework functions is also done every time we execute a script -- this is needed in case the framework has been overwritten by a previous script.

&log_context,
);
// We require all governance scripts to trigger reconfiguration so check it here.
if AptosVM::should_restart_execution(vm_output.events()) {
Contributor

Should it be the negation of this?

Contributor Author

Good catch.

However, it looks like this would not be an easy fix. With DKG, reconfiguration is started by the script but may not actually happen until the next epoch.

I guess I'll have to remove this check for now, and we'll need to discuss what the proper solution might be. I'm thinking that maybe reconfiguration_with_dkg::try_start should emit its own event indicating this.

let script_name = script_path.file_name().unwrap().to_string_lossy();
println!(" {}", script_name);

// Create a new VM to ensure the loader is clean.
Contributor

Have you checked this? Because I think warm vm cache does load PackageMetadata for core packages to see if it has changed... or this is the reason, we patch code and package metadata is the same?

Contributor Author

Yes, I spent a few hours debugging this yesterday.

> or this is the reason, we patch code and package metadata is the same?

Exactly, the patching done here does not change the metadata, so the warm vm cache fails to see it needs to reload everything.
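The failure mode discussed in this thread can be reproduced with a toy model. The sketch below assumes a cache that decides whether to reload purely by comparing metadata; `Package` and `WarmCache` are hypothetical types written for this illustration and do not reflect the actual warm VM cache implementation.

```rust
/// Hypothetical snapshot of a package as seen by a cache.
#[derive(Clone, PartialEq, Debug)]
pub struct Package {
    /// Stand-in for package metadata, e.g. an upgrade number or source digest.
    pub metadata: String,
    /// Stand-in for the compiled module bytes.
    pub code: Vec<u8>,
}

/// Toy cache that decides whether to reload by comparing metadata only.
pub struct WarmCache {
    cached: Option<Package>,
}

impl WarmCache {
    pub fn new() -> Self {
        Self { cached: None }
    }

    /// Returns the code the cache would execute for the current on-chain package.
    pub fn load(&mut self, on_chain: &Package) -> Vec<u8> {
        let needs_reload = match &self.cached {
            Some(cached) => cached.metadata != on_chain.metadata,
            None => true,
        };
        if needs_reload {
            self.cached = Some(on_chain.clone());
        }
        self.cached.as_ref().expect("populated above").code.clone()
    }

    /// Drops everything so the next load re-reads the package.
    pub fn flush(&mut self) {
        self.cached = None;
    }
}
```

If the code bytes are patched while the metadata stays the same, `load` keeps returning the old bytes until `flush` is called, which matches the reasoning given for flushing the cache before every script.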

@vgao1996 vgao1996 force-pushed the gov-sim branch 3 times, most recently from 030dd06 to 8f3c907 Compare July 17, 2024 19:10
Contributor

@perryjrandall perryjrandall left a comment

This is fucking magical, a really important tool for release verification <3

///
/// Possible values: devnet, testnet, mainnet, <url to rest endpoint>
#[clap(long)]
network: NetworkSelection,
Contributor

nice! validate proposal also has an "endpoint" argument, it would be great if we could replace its usage of URL with network so we don't have to specify the testnet / mainnet / devnet url all the time there either
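For readers wondering how an argument that accepts "devnet, testnet, mainnet, <url to rest endpoint>" might be parsed, here is a rough sketch. `NetworkChoice` is a hypothetical type written for this example and is not the `NetworkSelection` type from the PR, whose definition is not shown in this conversation; the rule that anything starting with `http://` or `https://` counts as an endpoint is also an assumption.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for a network argument.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NetworkChoice {
    Devnet,
    Testnet,
    Mainnet,
    /// Any http(s) URL is treated as a REST endpoint.
    RestEndpoint(String),
}

impl FromStr for NetworkChoice {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let trimmed = s.trim();
        match trimmed.to_ascii_lowercase().as_str() {
            "devnet" => Ok(Self::Devnet),
            "testnet" => Ok(Self::Testnet),
            "mainnet" => Ok(Self::Mainnet),
            other if other.starts_with("http://") || other.starts_with("https://") => {
                // Keep the original casing of the URL.
                Ok(Self::RestEndpoint(trimmed.to_string()))
            }
            _ => Err(format!("unknown network or endpoint: {}", trimmed)),
        }
    }
}
```

A type implementing `FromStr` like this can usually be used directly as a command-line argument type, so the well-known network names and custom endpoints share one flag.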

Contributor

@runtian-zhou runtian-zhou left a comment

I think the overall logic makes sense to me. The patching logic looked ugly but I don't see a way round for now. We still need a separate PR in the aptos-network to use this command btw.

state_view.apply_write_set(write_set);
}

println!("All scripts succeeded!");
Contributor

We need to check there's no pending script hash there. Can you add the check or at least add a todo here? I think it would be important for our release safety.

Contributor Author

I think I've implemented this check, no?

Contributor

Oh nvm. I thought you didn't.

Comment on lines +353 to +425
if forbid_next_execution_hash {
    // If it is needed to forbid a next execution hash, inject additional Move
    // code at the beginning that aborts with a magic number if the vector
    // representing the hash is not empty.
    //
    //     if (!vector::is_empty(&next_execution_hash)) {
    //         abort MAGIC_FAILED_NEXT_EXECUTION_HASH_CHECK;
    //     }
    //
    // The magic number can later be checked in Rust to determine if such violation
    // has happened.
    code.code.extend([
        ImmBorrowLoc(2),
        VecLen(sig_u8_idx),
        LdU64(0),
        Eq,
        BrTrue(7),
        LdU64(MAGIC_FAILED_NEXT_EXECUTION_HASH_CHECK),
        Abort,
    ]);
}
Contributor Author

@runtian-zhou here I check whether the next execution hash is empty, and abort with a magic number if it is not

Contributor

amazing!
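As a reading aid for the injected bytecode in this thread, the same check can be written in plain Rust. This is only an illustration of the logic in the code comment (abort when the hash vector is not empty); the constant's value below is a placeholder, since the actual magic number is not shown in this conversation, and both function names are hypothetical.

```rust
/// Placeholder value; the PR's actual magic number is not shown in this conversation.
pub const MAGIC_FAILED_NEXT_EXECUTION_HASH_CHECK: u64 = 0xDEAD_BEEF;

/// Mirrors the injected check: succeed when the hash is empty, otherwise abort
/// with the magic code.
pub fn check_no_next_execution_hash(next_execution_hash: &[u8]) -> Result<(), u64> {
    if next_execution_hash.is_empty() {
        Ok(())
    } else {
        Err(MAGIC_FAILED_NEXT_EXECUTION_HASH_CHECK)
    }
}

/// Host-side classification of an abort code returned by the simulated script.
pub fn is_next_hash_violation(abort_code: u64) -> bool {
    abort_code == MAGIC_FAILED_NEXT_EXECUTION_HASH_CHECK
}
```

Using a dedicated abort code lets the host distinguish "the last script tried to schedule another step" from ordinary script failures.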

@vgao1996
Contributor Author

@perryjrandall I've made some updates to the PR

  • Fixed the bug that caused the gas schedule hash to change, by adding script hashes to the approved list properly.
  • Added .context(..) and .with_context(..) in many places to improve error reporting
  • Added a helper to modify on-chain config

Additionally I've also created this issue #14044 for tracking follow-up items that I don't plan to address immediately.

@vgao1996
Contributor Author

vgao1996 commented Jul 23, 2024

Updated the PR again.

  • @perryjrandall I've renamed the command to "simulate" as you requested. Now it also searches the whole directory recursively, so you can pass in the output dir or any of its sub directories containing the proposals.
  • @georgemitenkov I had to change the way I inject create_signer cuz my previous implementation broke module compatibility. This results in an extra flag being passed into the VM, which is a bit ugly, but I guess this is something we can live with and refactor later.
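The recursive directory search mentioned in the first bullet could look roughly like the sketch below. It assumes that proposal scripts are identified by a `.move` extension, as in the sample output earlier in this PR; `find_move_scripts` is a hypothetical helper and not necessarily how the command locates proposals.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collects every `.move` file under `dir`, sorted for stable ordering.
pub fn find_move_scripts(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    // Explicit work list instead of recursion, so deep trees cannot overflow the stack.
    let mut pending = vec![dir.to_path_buf()];
    while let Some(current) = pending.pop() {
        for entry in fs::read_dir(&current)? {
            let path = entry?.path();
            if path.is_dir() {
                pending.push(path);
            } else if path.extension().map_or(false, |ext| ext == "move") {
                found.push(path);
            }
        }
    }
    found.sort();
    Ok(found)
}
```

Sorting the result keeps numbered scripts such as `0-gas-schedule.move` and `1-features.move` in execution order within a directory.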

Given that the PR is rather polished right now I'll proceed to landing. If there are additional feature requests I'll address them separately.

@vgao1996 vgao1996 enabled auto-merge (squash) July 23, 2024 18:06

Contributor

✅ Forge suite realistic_env_max_load success on a57745955e2a99457654d0214db1e840448ed85e

two traffics test: inner traffic : committed: 9312.372942499891 txn/s, latency: 4276.053475807454 ms, (p50: 4200 ms, p90: 4700 ms, p99: 10500 ms), latency samples: 3540760
two traffics test : committed: 99.99821397799764 txn/s, latency: 2277.031111111111 ms, (p50: 2100 ms, p90: 2400 ms, p99: 8900 ms), latency samples: 1800
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.242, avg: 0.224", "QsPosToProposal: max: 1.865, avg: 1.823", "ConsensusProposalToOrdered: max: 0.316, avg: 0.294", "ConsensusOrderedToCommit: max: 0.421, avg: 0.406", "ConsensusProposalToCommit: max: 0.712, avg: 0.699"]
Max round gap was 1 [limit 4] at version 1938233. Max no progress secs was 5.740667 [limit 15] at version 1938233.
Test Ok

Contributor

✅ Forge suite framework_upgrade success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> a57745955e2a99457654d0214db1e840448ed85e

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> a57745955e2a99457654d0214db1e840448ed85e (PR)
Upgrade the nodes to version: a57745955e2a99457654d0214db1e840448ed85e
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1337.7657697781233 txn/s, submitted: 1340.6729029250514 txn/s, failed submission: 2.9071331469281922 txn/s, expired: 2.9071331469281922 txn/s, latency: 2641.978069540022 ms, (p50: 2100 ms, p90: 4800 ms, p99: 10200 ms), latency samples: 110440
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1068.9917850611882 txn/s, submitted: 1071.180547929864 txn/s, failed submission: 2.1887628686756515 txn/s, expired: 2.1887628686756515 txn/s, latency: 2809.9613124488124 ms, (p50: 2100 ms, p90: 5200 ms, p99: 10500 ms), latency samples: 97680
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> a57745955e2a99457654d0214db1e840448ed85e passed
Upgrade the remaining nodes to version: a57745955e2a99457654d0214db1e840448ed85e
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1055.6579369475385 txn/s, submitted: 1057.8521958062945 txn/s, failed submission: 2.1942588587560556 txn/s, expired: 2.1942588587560556 txn/s, latency: 2995.9792766576597 ms, (p50: 2200 ms, p90: 5400 ms, p99: 10500 ms), latency samples: 96220
Test Ok

Contributor

✅ Forge suite compat success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> a57745955e2a99457654d0214db1e840448ed85e

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> a57745955e2a99457654d0214db1e840448ed85e (PR)
1. Check liveness of validators at old version: 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5
compatibility::simple-validator-upgrade::liveness-check : committed: 7526.7189752704135 txn/s, latency: 3789.1083689890797 ms, (p50: 2800 ms, p90: 4800 ms, p99: 27200 ms), latency samples: 313180
2. Upgrading first Validator to new version: a57745955e2a99457654d0214db1e840448ed85e
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 7427.550906292737 txn/s, latency: 3572.8080405598403 ms, (p50: 4000 ms, p90: 4200 ms, p99: 4400 ms), latency samples: 140040
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6434.249928332043 txn/s, latency: 4715.601646791596 ms, (p50: 4600 ms, p90: 5400 ms, p99: 8400 ms), latency samples: 246540
3. Upgrading rest of first batch to new version: a57745955e2a99457654d0214db1e840448ed85e
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 7196.484471921162 txn/s, latency: 3660.051948614319 ms, (p50: 4100 ms, p90: 4400 ms, p99: 4600 ms), latency samples: 138560
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6797.026845312571 txn/s, latency: 4693.143285443909 ms, (p50: 4800 ms, p90: 5400 ms, p99: 6100 ms), latency samples: 232480
4. upgrading second batch to new version: a57745955e2a99457654d0214db1e840448ed85e
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 2361.955587406585 txn/s, latency: 10972.883252517717 ms, (p50: 13700 ms, p90: 17500 ms, p99: 18300 ms), latency samples: 53620
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 9257.193034621272 txn/s, latency: 3469.895835780571 ms, (p50: 3000 ms, p90: 6900 ms, p99: 9300 ms), latency samples: 340520
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> a57745955e2a99457654d0214db1e840448ed85e passed
Test Ok

@vgao1996 vgao1996 merged commit 9301d80 into aptos-labs:main Jul 23, 2024
48 checks passed