
[Sharding] Refactor sharded block executor for better remote client support #9221

Merged · 5 commits · Jul 22, 2023

Conversation

sitalkedia
Contributor

Description

The following changes are included in this PR.

  1. Add a basic network controller that provides an interface for remote communication. Unfortunately, this network stack is not very reliable and drops network messages. It is added only as a proof of concept; we will work on deleting this stack and migrating to our production networking stack or something like gRPC.

  2. A refactor of the sharded block executor so that we can plug in both a local client and a remote client.
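The plug-in idea in point 2 could be sketched roughly as follows. This is an illustrative sketch only, not the PR's actual code: the type and method names mirror the PR (`ShardedBlockExecutor`, `ExecutorClient`, `start_execution`, `get_result`), but the signatures and the `u64` stand-in for transactions are invented for the example.

```rust
/// Client abstraction the sharded executor talks to. A local in-process
/// client and a remote (networked) client can both implement this trait.
trait ExecutorClient {
    fn start_execution(&mut self, partitioned_block: Vec<Vec<u64>>);
    fn get_result(&mut self) -> Vec<u64>;
}

/// In-process client: keeps the work locally and "executes" it on demand.
struct LocalExecutorClient {
    pending: Vec<Vec<u64>>,
}

impl ExecutorClient for LocalExecutorClient {
    fn start_execution(&mut self, partitioned_block: Vec<Vec<u64>>) {
        self.pending = partitioned_block;
    }
    fn get_result(&mut self) -> Vec<u64> {
        // Stand-in for per-shard execution: just flatten the shard outputs.
        self.pending.drain(..).flatten().collect()
    }
}

/// The executor depends only on the trait, so a RemoteExecutorClient that
/// speaks over the network controller could be plugged in unchanged.
struct ShardedBlockExecutor<C: ExecutorClient> {
    client: C,
}

impl<C: ExecutorClient> ShardedBlockExecutor<C> {
    fn execute_block(&mut self, partitioned_block: Vec<Vec<u64>>) -> Vec<u64> {
        self.client.start_execution(partitioned_block);
        self.client.get_result()
    }
}

fn main() {
    let mut executor = ShardedBlockExecutor {
        client: LocalExecutorClient { pending: vec![] },
    };
    let out = executor.execute_block(vec![vec![1, 2], vec![3]]);
    assert_eq!(out, vec![1, 2, 3]);
}
```

The point of the design is that `ShardedBlockExecutor` never knows whether execution happens in-process or on a remote machine; only the client implementation changes.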

Test Plan

  • Added a remote executor test without conflicts. The conflicting cases don't work yet because of unreliable message delivery.

@sitalkedia sitalkedia requested review from zjma, zekun000 and gelash and removed request for davidiw, lightmark, msmouse, wrwg and zekun000 July 21, 2023 03:58

@zjma zjma left a comment


The current type hierarchy seems to be:

ShardedBlockExecutor
  → takes in a partitioned block and invokes the async start_execution + get_result APIs of:
<X>ExecutorClient
  → dispatches the partitioned block to multiple instances of:
<X>ExecutorService
  → invokes:
ShardedExecutorService
  → handles num_rounds sub_blocks by invoking BlockSTM for each.

  • Can <X>ExecutionCoordinator be a better name for <X>ExecutorClient?
  • Is the <X>ExecutorService layer necessary? It doesn't seem to do a lot in the stack.
  • Can we have a README to keep track of this hierarchy? It's easy to get lost between all the Clients and Services :)
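For orientation, the lower two layers of the chain described above might look roughly like this. Again an illustrative sketch, not the PR's code: the names `ExecutorService` and `ShardedExecutorService` come from the review, but the signatures and the summing stand-in for BlockSTM are invented.

```rust
/// Bottom layer: executes one sub-block per round (BlockSTM stand-in).
struct ShardedExecutorService;

impl ShardedExecutorService {
    fn execute_sub_block(&self, _round: usize, sub_block: &[u64]) -> u64 {
        // Stand-in for BlockSTM: "execute" the round by summing its txns.
        sub_block.iter().sum()
    }
}

/// Thin middle layer the review asks about: it mostly just forwards
/// each round's sub-block down to ShardedExecutorService.
struct ExecutorService {
    inner: ShardedExecutorService,
}

impl ExecutorService {
    fn execute_shard(&self, sub_blocks_by_round: &[Vec<u64>]) -> Vec<u64> {
        sub_blocks_by_round
            .iter()
            .enumerate()
            .map(|(round, sub_block)| self.inner.execute_sub_block(round, sub_block))
            .collect()
    }
}

fn main() {
    let service = ExecutorService { inner: ShardedExecutorService };
    let per_round = service.execute_shard(&[vec![1, 2], vec![3, 4]]);
    assert_eq!(per_round, vec![3, 7]);
}
```

Seen this way, the middle layer is pure forwarding, which is why the review questions whether it can be folded into the client.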

@sitalkedia
Contributor Author

  • Can <X>ExecutionCoordinator be a better name for <X>ExecutorClient?

I am not sure coordinator is a better name; in fact, the coordinator holds the client.

Is the <X>ExecutorService layer necessary? It doesn't seem to do a lot in the stack.

Let me take this as a follow-up to see if it makes sense to merge the service into the client itself, and add some docs as well.

@sitalkedia sitalkedia added the CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR label Jul 21, 2023

@gelash
Contributor

gelash commented Jul 22, 2023

Approving, but should we consult the networking folks (@brianolson) about the basic controller, i.e., about having it on main, and also about whether using the existing stack could be simpler? We did that for the QuorumStore prototype and it could be made to work relatively easily, modulo @sasha8 finding one bug in networking itself; maybe here it's not as easy. Just mentioning.

@gelash gelash requested a review from brianolson July 22, 2023 14:59
@sitalkedia
Contributor Author

Approving, but should we consult the networking folks (@brianolson) about the basic controller, i.e., about having it on main, and also about whether using the existing stack could be simpler? We did that for the QuorumStore prototype and it could be made to work relatively easily, modulo @sasha8 finding one bug in networking itself; maybe here it's not as easy. Just mentioning.

Yes, I started a thread to discuss this here: https://aptos-org.slack.com/archives/C056ZGUSQRZ/p1689969386327379. Based on the discussion, I am leaning towards using gRPC for remote execution support.

@sitalkedia sitalkedia enabled auto-merge (squash) July 22, 2023 18:35

@github-actions

✅ Forge suite compat success on aptos-node-v1.5.1 ==> ab3e09584d637e23f45d4fc98ec948d02de4ee6c

Compatibility test results for aptos-node-v1.5.1 ==> ab3e09584d637e23f45d4fc98ec948d02de4ee6c (PR)
1. Check liveness of validators at old version: aptos-node-v1.5.1
compatibility::simple-validator-upgrade::liveness-check : committed: 4163 txn/s, latency: 7309 ms, (p50: 7100 ms, p90: 11000 ms, p99: 11800 ms), latency samples: 166520
2. Upgrading first Validator to new version: ab3e09584d637e23f45d4fc98ec948d02de4ee6c
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 1762 txn/s, latency: 15808 ms, (p50: 18900 ms, p90: 22000 ms, p99: 22500 ms), latency samples: 91660
3. Upgrading rest of first batch to new version: ab3e09584d637e23f45d4fc98ec948d02de4ee6c
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 1785 txn/s, latency: 16209 ms, (p50: 18900 ms, p90: 22200 ms, p99: 22500 ms), latency samples: 92820
4. Upgrading second batch to new version: ab3e09584d637e23f45d4fc98ec948d02de4ee6c
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 3395 txn/s, latency: 9160 ms, (p50: 9900 ms, p90: 12700 ms, p99: 13000 ms), latency samples: 139220
5. check swarm health
Compatibility test for aptos-node-v1.5.1 ==> ab3e09584d637e23f45d4fc98ec948d02de4ee6c passed
Test Ok

@github-actions

✅ Forge suite realistic_env_max_load success on ab3e09584d637e23f45d4fc98ec948d02de4ee6c

two traffics test: inner traffic : committed: 6940 txn/s, latency: 5639 ms, (p50: 5400 ms, p90: 6900 ms, p99: 10500 ms), latency samples: 3005220
two traffics test : committed: 100 txn/s, latency: 3221 ms, (p50: 3200 ms, p90: 4000 ms, p99: 4400 ms), latency samples: 1780
Max round gap was 1 [limit 4] at version 1472256. Max no progress secs was 3.875723 [limit 10] at version 1472256.
Test Ok

@github-actions

✅ Forge suite framework_upgrade success on aptos-node-v1.5.1 ==> ab3e09584d637e23f45d4fc98ec948d02de4ee6c

Compatibility test results for aptos-node-v1.5.1 ==> ab3e09584d637e23f45d4fc98ec948d02de4ee6c (PR)
Upgrade the nodes to version: ab3e09584d637e23f45d4fc98ec948d02de4ee6c
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 3851 txn/s, latency: 7747 ms, (p50: 7200 ms, p90: 11400 ms, p99: 21000 ms), latency samples: 161760
5. check swarm health
Compatibility test for aptos-node-v1.5.1 ==> ab3e09584d637e23f45d4fc98ec948d02de4ee6c passed
Test Ok

@sitalkedia sitalkedia merged commit cf1accf into main Jul 22, 2023
@sitalkedia sitalkedia deleted the remote_channel_support_4 branch July 22, 2023 19:18