Indexer grpc refactor cache performance #6711

Merged
merged 2 commits into from
Mar 3, 2023

Conversation

larry-aptos
Contributor

@larry-aptos larry-aptos commented Feb 21, 2023

Description

  • Improve cache worker performance by 3000% on GCP. Throughput is now 30k TPS.

Test Plan

Tested on devnet

image

@larry-aptos larry-aptos added the CICD:build-images when this label is present github actions will start build+push rust images from the PR. label Feb 21, 2023
@larry-aptos larry-aptos changed the base branch from main to indexer-grpc-refactor February 21, 2023 07:43
@larry-aptos larry-aptos marked this pull request as ready for review February 21, 2023 08:31
@larry-aptos larry-aptos force-pushed the indexer-grpc-refactor branch from 01457af to 26c18cb Compare February 23, 2023 00:08
@larry-aptos larry-aptos force-pushed the indexer-grpc-refactor branch from 26c18cb to 0ca31c4 Compare February 23, 2023 00:45
@larry-aptos larry-aptos force-pushed the indexer-grpc-refactor-cache-performance branch from 185c9e3 to 43e7a5d Compare February 23, 2023 00:46
@larry-aptos larry-aptos force-pushed the indexer-grpc-refactor branch 3 times, most recently from cbbd088 to dfa6bf7 Compare March 2, 2023 00:16
Base automatically changed from indexer-grpc-refactor to main March 2, 2023 01:20
@larry-aptos larry-aptos force-pushed the indexer-grpc-refactor-cache-performance branch from 43e7a5d to 1757d21 Compare March 3, 2023 09:34
@larry-aptos larry-aptos force-pushed the indexer-grpc-refactor-cache-performance branch from 1757d21 to b63791c Compare March 3, 2023 09:37
@larry-aptos larry-aptos enabled auto-merge (squash) March 3, 2023 10:01
@github-actions
Contributor

github-actions bot commented Mar 3, 2023

✅ Forge suite land_blocking success on fb9a6de7d1364108d92ed0a54b2a1a6838531200

performance benchmark with full nodes : 6106 TPS, 6507 ms latency, 9000 ms p99 latency,no expired txns
Test Ok

@github-actions
Contributor

github-actions bot commented Mar 3, 2023

✅ Forge suite framework_upgrade success on cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> fb9a6de7d1364108d92ed0a54b2a1a6838531200

Compatibility test results for cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> fb9a6de7d1364108d92ed0a54b2a1a6838531200 (PR)
Upgrade the nodes to version: fb9a6de7d1364108d92ed0a54b2a1a6838531200
framework_upgrade::framework-upgrade::full-framework-upgrade : 7066 TPS, 5436 ms latency, 7700 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> fb9a6de7d1364108d92ed0a54b2a1a6838531200 passed
Test Ok

@github-actions
Contributor

github-actions bot commented Mar 3, 2023

✅ Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> fb9a6de7d1364108d92ed0a54b2a1a6838531200

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> fb9a6de7d1364108d92ed0a54b2a1a6838531200 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 8144 TPS, 4690 ms latency, 6900 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: fb9a6de7d1364108d92ed0a54b2a1a6838531200
compatibility::simple-validator-upgrade::single-validator-upgrade : 5142 TPS, 7674 ms latency, 11300 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: fb9a6de7d1364108d92ed0a54b2a1a6838531200
compatibility::simple-validator-upgrade::half-validator-upgrade : 5422 TPS, 6999 ms latency, 9000 ms p99 latency,no expired txns
4. upgrading second batch to new version: fb9a6de7d1364108d92ed0a54b2a1a6838531200
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6996 TPS, 5519 ms latency, 9100 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> fb9a6de7d1364108d92ed0a54b2a1a6838531200 passed
Test Ok

@larry-aptos larry-aptos merged commit ab3983b into main Mar 3, 2023
@larry-aptos larry-aptos deleted the indexer-grpc-refactor-cache-performance branch March 3, 2023 11:09
ibalajiarun added a commit that referenced this pull request Mar 9, 2023
commit 4b2cddb
Author: Rati Gelashvili <[email protected]>
Date:   Wed Mar 8 22:42:48 2023 -0500

    [BlockSTM] No eager validation for first execution wave (#6996)

commit 8ab8c97
Author: Brian (Sunghoon) Cho <[email protected]>
Date:   Wed Mar 8 17:58:57 2023 -0800

    [cherry-pick] PR 7016: [Quorum Store] Downgrade most logs to trace (#7020)

    Downgrade most logs to trace, only keep around enough debug logs to track locally created batch/digest/proof through the stages.

    Run smoke test, observe logs are not too spammy.

commit b741221
Author: Brian (Sunghoon) Cho <[email protected]>
Date:   Wed Mar 8 17:03:35 2023 -0800

    [cherry-pick] PR 7013: [Quorum Store] adjust memory and db quotas (#7017)

    Set defaults to be suitable for a 100 validator network.

    Paraphrasing Sasha:

    20000 tps * 1000B (size of a txn) * 60 s (duration) = 1_200_000_000 bytes
    Divided by 100 validators = 12_000_000 bytes that each validator asks me to persist on average.

    Adding some buffer for peaks:
    Memory: 10x - 120_000_000 bytes
    Storage 25x - 300_000_000 bytes
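
The sizing arithmetic in this commit message can be checked in a few lines. This is a minimal sketch in Python; the function and variable names are illustrative and are not taken from the Quorum Store code.

```python
def per_validator_bytes(tps, txn_bytes, window_secs, validators):
    """Average bytes each validator asks a peer to persist over the window."""
    total_bytes = tps * txn_bytes * window_secs
    return total_bytes // validators


# Inputs quoted in the commit message: 20k TPS, 1000-byte txns, 60 s, 100 validators.
base = per_validator_bytes(tps=20_000, txn_bytes=1_000, window_secs=60, validators=100)

memory_quota = 10 * base  # 10x buffer for peaks
db_quota = 25 * base      # 25x buffer for peaks

print(base, memory_quota, db_quota)
```

Changing `validators` or `tps` shows how the defaults would need to move for a network of a different size.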

commit e694ec5
Author: Brian (Sunghoon) Cho <[email protected]>
Date:   Wed Mar 8 13:14:42 2023 -0800

    [cherry-pick] PR 6969: [Quorum Store] poll mempool at a faster pace to reduce mempool queueing time, except when backpressure (#7006)

    We poll mempool for transactions at a fast rate (default 25 ms), unless there is backpressure. If there is backpressure, each subsequent poll is delayed by an additional 25 ms, until a maximum of 250 ms is reached. Since this makes the poll duration variable, we compute the max txns to pull based on the time passed since the last poll.

    The backpressure is based on the number of proofs left in proof manager. So proofs are used for the rate of pulling, and txns are used for the max number of txns on the pull.

    Running tests with low TPS (2 TPS per validator), confirmed that the time spent in mempool reduced: roughly from an average of 125ms (from the previous 250ms poll time) to an average of 12.5ms (from the 25ms faster poll time).
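
The polling behaviour described above can be sketched as follows. This is an illustration in Python, not the Quorum Store implementation: the function names are hypothetical, the immediate reset to the fast interval once backpressure clears is an assumption, and `txns_per_sec` is a made-up budget parameter. The mean-wait helper assumes transactions arrive uniformly between polls, which is where the 125 ms and 12.5 ms averages come from.

```python
BASE_INTERVAL_MS = 25   # fast poll interval (default from the commit message)
MAX_INTERVAL_MS = 250   # cap reached under sustained backpressure


def next_poll_interval_ms(current_ms, backpressure):
    """Add 25 ms per backpressured poll, capped at 250 ms.

    Assumption: the interval drops straight back to the base value
    as soon as backpressure clears.
    """
    if not backpressure:
        return BASE_INTERVAL_MS
    return min(current_ms + BASE_INTERVAL_MS, MAX_INTERVAL_MS)


def max_txns_to_pull(elapsed_ms, txns_per_sec):
    """Scale the pull size with the time since the last poll.

    txns_per_sec is a hypothetical per-second budget.
    """
    return txns_per_sec * elapsed_ms // 1000


def mean_queue_wait_ms(poll_interval_ms):
    """Expected mempool wait if arrivals are uniform within a poll interval."""
    return poll_interval_ms / 2
```

Because the pull size grows with the elapsed time, a slower poll under backpressure still pulls at the same average rate.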

commit 1015b10
Author: Brian (Sunghoon) Cho <[email protected]>
Date:   Wed Mar 8 13:06:38 2023 -0800

    [cherry-pick] PR 6785: [quorum store] refactor batch_reader/store and batch requester (#7005)

    This commit merges batch_reader and batch_store into a single struct and uses it as a synchronous component (via function calls)
    instead of an asynchronous one (via channels).

    Also batch_requester is refactored to use network rpc directly.

    Co-authored-by: Zekun Li <[email protected]>

commit d49118b
Author: Igor <[email protected]>
Date:   Tue Mar 7 16:33:34 2023 -0800

    configs: enable wait_for_full_blocks and increase exclude_rounds

commit bda3c64
Author: Igor <[email protected]>
Date:   Tue Mar 7 08:58:20 2023 -0800

    Add new block poll based on time

commit 49c89fa
Author: igor-aptos <[email protected]>
Date:   Tue Mar 7 21:50:11 2023 -0800

    AverageIntCounter for backpressure states (#6963)

commit 1562e7f
Author: igor-aptos <[email protected]>
Date:   Tue Mar 7 18:35:53 2023 -0800

    [mempool] Wait until max block size or quorum_store_poll_count is reached (#6129)

    Adding two new flags, to improve blockchain performance.

    // Whether to create partial blocks when few transactions exist, or empty blocks when there is
    // pending ordering, or to wait for quorum_store_poll_count * 30ms to collect transactions for a block
    //
    // It is more efficient to execute larger blocks, as it creates less overhead. On the other hand,
    // waiting increases latency (unless we are under high load, in which case the added waiting
    // latency is compensated by faster execution time). So we want to balance the two, by waiting only
    // when we are saturating the execution pipeline:
    // - if there are more pending blocks than usual in the execution pipeline,
    // block is going to wait there anyway, so we can wait to create a bigger/more efficient block
    // - in case our node is faster than others, and we don't have many pending blocks,
    // but we still see very large recent (pending) blocks, we know that there is demand
    // and others are creating large blocks, so we can wait as well.

    * dynamic enabling of waiting for full blocks
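
The two conditions in the comment above can be sketched as a single predicate. This is a hedged illustration in Python: the function names, parameters and thresholds are hypothetical, and only the `quorum_store_poll_count * 30ms` wait bound comes from the commit message.

```python
POLL_PERIOD_MS = 30  # per-poll wait quoted in the commit message


def max_wait_ms(quorum_store_poll_count):
    """Upper bound on how long block creation may wait for more transactions."""
    return quorum_store_poll_count * POLL_PERIOD_MS


def should_wait_for_full_block(pending_blocks, usual_pending_blocks,
                               recent_block_sizes, large_block_txns):
    """Decide whether to wait for a fuller block (all thresholds hypothetical)."""
    # Case 1: the execution pipeline is backed up, so a new block would
    # sit in the queue anyway and waiting adds no extra latency.
    if pending_blocks > usual_pending_blocks:
        return True
    # Case 2: few pending blocks locally, but recent blocks are very large,
    # which signals demand even though this node is keeping up.
    return any(size >= large_block_txns for size in recent_block_sizes)
```

For example, `should_wait_for_full_block(1, 2, [1200], 1000)` returns `True` through the second case even though the local pipeline is not backed up.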

commit cbf1052
Author: Sital Kedia <[email protected]>
Date:   Wed Mar 8 09:26:40 2023 -0800

    [preview] Fix broken unit test in previewnet

commit 7a0002f
Author: Brian (Sunghoon) Cho <[email protected]>
Date:   Tue Mar 7 17:26:31 2023 -0800

    [mempool] eager expiration based on queued transaction age (#6965) (#6991)

    Introduces eager expiration, which expires transactions earlier (default: 3s) than their true client-provided expiration. This prevents transactions that are pulled from mempool from expiring upon execution, but more importantly it prevents these transactions from blocking transactions that would have succeeded at execution.

    Eager expiration is triggered if sufficiently old transactions (default: 10s) are observed. This internally signals to mempool that a backlog is building.

    Run an overload test `three_region_simulation_graceful_overload` with quorum store and observe that expirations drop from ~3K/s to < 100/s and this pushes TPS up +15% from 3.8K -> 4.4K
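
One way to read the eager-expiration rule is sketched below in Python. The names are hypothetical and the exact comparison is an assumption; only the 3 s and 10 s defaults come from the commit message.

```python
EAGER_EXPIRE_SECS = 3    # how much earlier than the client deadline to expire
BACKLOG_AGE_SECS = 10    # queued age that signals a backlog is building


def backlog_detected(oldest_insertion_time, now):
    """A sufficiently old queued transaction signals a growing backlog."""
    return now - oldest_insertion_time >= BACKLOG_AGE_SECS


def effective_expiration(client_expiration, oldest_insertion_time, now):
    """Expiration used by mempool; pulled forward only while a backlog exists."""
    if backlog_detected(oldest_insertion_time, now):
        return client_expiration - EAGER_EXPIRE_SECS
    return client_expiration


def is_expired(client_expiration, oldest_insertion_time, now):
    """True once the (possibly eager) expiration time has passed."""
    return now >= effective_expiration(client_expiration, oldest_insertion_time, now)
```

With no backlog, a transaction lives until its client-provided deadline; once the oldest queued transaction is at least 10 s old, everything is dropped 3 s early so that it cannot expire at execution time.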

commit db0a542
Author: Brian (Sunghoon) Cho <[email protected]>
Date:   Tue Mar 7 12:50:47 2023 -0800

    [Previewnet] [cherry-pick] PR 6978: [Quorum Store] Use a single quorum_store_db instance across epochs (#6986)

    Previously, the quorum_store_db instance was torn down and restarted on epoch changes.
    We observed this occasionally caused a panic when the new instance couldn't start.
    The DB doesn't have to be torn down and restarted. The only interesting thing here is the DB
    will be created regardless of whether quorum store is turned on, but that should be negligible overhead.

    Includes some refactoring to make twins unit tests work.

    ref: #6855

    Existing tests

    Co-authored-by: Balaji Arun <[email protected]>

commit 1aa86fa
Author: Stelian Ionescu <[email protected]>
Date:   Tue Mar 7 11:55:42 2023 -0500

    [PSS] Add script for migrating a cluster from PSP to PSS (#6952) (#6959)

    The script migrate_cluster_psp_to_pss.sh has two modes of operation:

     * check: will check whether there are pods that violate the PSS
     "baseline" profile. It's useful to see where the security policy can
     be tightened from the default "privileged" profile.
     * migrate: will perform the migration on the current K8s context.
     --policy-version should specify the target policy version, usually
     the same version as the K8s cluster.

    The migration works in two phases:

    1. Disabling PodSecurityPolicy

     * create an allow-everything security policy
     * create a rolebinding that binds each namespace service account
     to the security policy newly created. this effectively disables the
     PodSecurityPolicy admission controller

    2. Enabling Pod Security Standards

    * enforce the "privileged" profile on all namespaces
    * warn & audit violations of the "baseline" profile

    Example usage:

    $ ./migrate_cluster_psp_to_pss.sh --policy-version=v1.25 check

    $ ./migrate_cluster_psp_to_pss.sh --policy-version=v1.25 migrate

    If unspecified, the default target policy version is "v1.24".

commit d0f791e
Author: larry-aptos <[email protected]>
Date:   Mon Mar 6 15:20:02 2023 -0800

    [indexer grpc] Move the WriteResource data one-level down. (#6957)

commit 5aef30f
Author: larry-aptos <[email protected]>
Date:   Fri Mar 3 03:09:34 2023 -0800

    Indexer grpc refactor cache performance (#6711)

commit 19cbeb1
Author: larry-aptos <[email protected]>
Date:   Thu Mar 2 18:41:37 2023 -0800

    [indexer grpc][proto] add default type for all enum for datastream and its proto data. (#6834)

commit 29a4c0a
Author: Guoteng Rao <[email protected]>
Date:   Mon Mar 6 17:52:44 2023 -0800

    [CP][Preview] State kv db pruner, and restore. (#6960)

commit 282f963
Author: Rustie Lin <[email protected]>
Date:   Mon Mar 6 10:20:08 2023 -0800

    [PREVIEWNET ONLY][docker] expose validator REST API (#6948)

commit 278c75a
Author: Brian (Sunghoon) Cho <[email protected]>
Date:   Fri Mar 3 17:46:19 2023 -0800

    [Previewnet] [cherry-pick] PR 6889: [Quorum Store] Implement end_epoch to clear state_computer between epoch end and start (#6916)

    * [Quorum Store] Implement end_epoch to clear state_computer between epoch end and start (#6889)

    We should not be using a stale PayloadManager after epoch end, so adding an end_epoch function that sets the PayloadManager to None.

    This fixes a panic that was observed because the stale PayloadManager was still held at StateComputer during sync_to called from initiate_new_epoch. The PayloadManager expects to only see commits from its epoch, which is violated if the epoch change includes multiple epochs.

    changing_working_quorum_test (with failpoints) failed with panics. Rerun and observe no panics.
    TODO (not in this PR): would be nice to have a test that explicitly causes a multiple epoch sync_to from epoch manager.

    * Fix headers (to pass pre-commit hooks)
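
The end_epoch fix described in this commit can be illustrated with a toy model. This Python sketch is not the consensus code: the classes are simplified stand-ins that borrow the names from the commit message, and `new_epoch` and `notify_commit` are hypothetical methods.

```python
class PayloadManager:
    """Stand-in that only accepts commits from its own epoch."""

    def __init__(self, epoch):
        self.epoch = epoch

    def notify_commit(self, epoch):
        if epoch != self.epoch:
            raise RuntimeError(f"commit from epoch {epoch}, expected {self.epoch}")


class StateComputer:
    """Stand-in holding an optional PayloadManager."""

    def __init__(self):
        self.payload_manager = None

    def new_epoch(self, epoch):
        self.payload_manager = PayloadManager(epoch)

    def end_epoch(self):
        # Drop the manager so a later sync_to cannot reach a stale one.
        self.payload_manager = None

    def sync_to(self, epoch):
        if self.payload_manager is not None:
            self.payload_manager.notify_commit(epoch)


def sync_across_epochs(call_end_epoch):
    """Return True if a sync spanning multiple epochs completes without error."""
    sc = StateComputer()
    sc.new_epoch(1)
    if call_end_epoch:
        sc.end_epoch()
    try:
        sc.sync_to(3)  # epoch change that skips over epoch 2
        return True
    except RuntimeError:
        return False
```

Here `sync_across_epochs(False)` models the stale-manager failure and `sync_across_epochs(True)` models the behaviour after the fix.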

commit 328eaa9
Author: Brian (Sunghoon) Cho <[email protected]>
Date:   Fri Mar 3 16:40:16 2023 -0800

    [Previewnet] [cherry-pick] PR 6830: [consensus] better protect sync condition (#6915)

    This commit prevents sync_to from racing with commit in the state computer. Previously the race would result in
    a state sync error; with quorum store, it may panic the node because of a decreasing round number.

    Co-authored-by: Zekun Li <[email protected]>

commit 57cf378
Author: Rustie Lin <[email protected]>
Date:   Fri Mar 3 15:30:34 2023 -0800

    [PREVIEWNET ONLY][helm][aptos-node] update validator and VFN resources (#6913)

commit 97fa3e9
Author: Stelian Ionescu <[email protected]>
Date:   Fri Mar 3 18:19:16 2023 -0500

    [TF] Fix Kubernetes node taint (#6912)

commit add4b51
Author: Josh Lind <[email protected]>
Date:   Fri Mar 3 17:01:48 2023 -0500

    [State Sync] Update output syncing chunk sizes.

commit 0a0cb31
Author: Rustie Lin <[email protected]>
Date:   Fri Mar 3 11:37:48 2023 -0800

    [gha] copy preview images to dockerhub (#6897)

commit 856a0b1
Author: Sital Kedia <[email protected]>
Date:   Thu Mar 2 13:32:26 2023 -0800

    [previewnet] Enable transaction shuffling for previewnet

commit 0192ee9
Author: Sital Kedia <[email protected]>
Date:   Mon Feb 6 16:33:50 2023 -0800

    [BlockSTM] Optimized transaction shuffling

commit 5a7ca31
Author: Stelian Ionescu <[email protected]>
Date:   Thu Mar 2 15:02:07 2023 -0500

    Remove PodSecurityPolicy from Terraform configs (#6874)

    Also remove the ClusterRole and ClusterRoleBinding resources that were
    used to enact the PodSecurityPolicy policies.

    The current recommended Kubernetes version for these configs is 1.23
     * updated autoscaler image tag v.1.21.0 -> v.1.23.0
     * updated autoscaler permissions to the recommended set for this version

    The recommended mechanism to replace PodSecurityPolicy is [Pod
    Security Standards](https://v1-23.docs.kubernetes.io/docs/tasks/configure-pod-container/migrate-from-psp/).

     * removed SYS_RESOURCE from requested capability set for Haproxy
    Deployment for compatibility with the PSS Baseline profile. Without
    this change, the entire "default" namespace would have to run under
    the Privileged profile, possibly compromising the security of the
    validator nodes.

commit 079cc28
Author: Rustie Lin <[email protected]>
Date:   Thu Mar 2 11:15:34 2023 -0800

    [tf/gha] build on push to preview branch and bump machine type (#6872)

    * [gha] build on push to preview branch

    * [tf] update instance types for preview

commit 6b092f5
Author: Sital Kedia <[email protected]>
Date:   Tue Feb 28 18:01:00 2023 -0800

    Revert accidental change in preview commit

commit 75ae87a
Author: Sital Kedia <[email protected]>
Date:   Tue Feb 28 17:53:18 2023 -0800

    [Previewnet] Configuration tuning to achieve high TPS (#6825)

commit d883a7c
Author: Rustie Lin <[email protected]>
Date:   Thu Mar 2 13:00:50 2023 -0800

    [forge] enforce max pod name len (#6879)

    * [forge] enforce max pod name len (#6875)

    * [forge] enforce max pod name len fix (#6878)

commit 3be58b2
Author: Rustie Lin <[email protected]>
Date:   Thu Mar 2 09:53:25 2023 -0800

    [gha] rename dockerhub release workflow & avoid forge preemption (#6842)

    * [gha] rename dockerhub release workflow

    * [gha][forge] include image tag in namespace to prevent cadence clash
@bchocho bchocho mentioned this pull request Mar 17, 2023