Indexer grpc refactor cache performance #6711

larry-aptos · 2023-02-21T07:42:41Z

Description

Improves the cache worker performance by 3000% on GCP; throughput is now 30k TPS.

Test Plan

Tested on devnet.

github-actions · 2023-03-03T11:05:47Z

✅ Forge suite `land_blocking` success on `fb9a6de7d1364108d92ed0a54b2a1a6838531200`

performance benchmark with full nodes : 6106 TPS, 6507 ms latency, 9000 ms p99 latency,no expired txns
Test Ok

github-actions · 2023-03-03T11:07:27Z

✅ Forge suite `framework_upgrade` success on `cb4ba0a57c998c60cbab65af31a64875d2588ca5` ==> `fb9a6de7d1364108d92ed0a54b2a1a6838531200`

Compatibility test results for cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> fb9a6de7d1364108d92ed0a54b2a1a6838531200 (PR)
Upgrade the nodes to version: fb9a6de7d1364108d92ed0a54b2a1a6838531200
framework_upgrade::framework-upgrade::full-framework-upgrade : 7066 TPS, 5436 ms latency, 7700 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> fb9a6de7d1364108d92ed0a54b2a1a6838531200 passed
Test Ok

github-actions · 2023-03-03T11:08:55Z

✅ Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `fb9a6de7d1364108d92ed0a54b2a1a6838531200`

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> fb9a6de7d1364108d92ed0a54b2a1a6838531200 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 8144 TPS, 4690 ms latency, 6900 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: fb9a6de7d1364108d92ed0a54b2a1a6838531200
compatibility::simple-validator-upgrade::single-validator-upgrade : 5142 TPS, 7674 ms latency, 11300 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: fb9a6de7d1364108d92ed0a54b2a1a6838531200
compatibility::simple-validator-upgrade::half-validator-upgrade : 5422 TPS, 6999 ms latency, 9000 ms p99 latency,no expired txns
4. upgrading second batch to new version: fb9a6de7d1364108d92ed0a54b2a1a6838531200
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6996 TPS, 5519 ms latency, 9100 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> fb9a6de7d1364108d92ed0a54b2a1a6838531200 passed
Test Ok

commit 4b2cddb
Author: Rati Gelashvili <[email protected]>
Date: Wed Mar 8 22:42:48 2023 -0500

[BlockSTM] No eager validation for first execution wave (#6996)

commit 8ab8c97
Author: Brian (Sunghoon) Cho <[email protected]>
Date: Wed Mar 8 17:58:57 2023 -0800

[cherry-pick] PR 7016: [Quorum Store] Downgrade most logs to trace (#7020)

Downgrade most logs to trace, only keep around enough debug logs to track locally created batch/digest/proof through the stages.

Run smoke test, observe logs are not too spammy.

commit b741221
Author: Brian (Sunghoon) Cho <[email protected]>
Date: Wed Mar 8 17:03:35 2023 -0800

[cherry-pick] PR 7013: [Quorum Store] adjust memory and db quotas (#7017)

Set defaults to be suitable for a 100 validator network. Paraphrasing Sasha:

20000 tps * 1000B (size of a txn) * 60 s (duration) = 1_200_000_000 bytes
Divided by 100 validators = 12_000_000 bytes that each validator asks me to persist on average.

Adding some buffer for peaks:
Memory: 10x - 120_000_000 bytes
Storage: 25x - 300_000_000 bytes

commit e694ec5
Author: Brian (Sunghoon) Cho <[email protected]>
Date: Wed Mar 8 13:14:42 2023 -0800

[cherry-pick] PR 6969: [Quorum Store] poll mempool at a faster pace to reduce mempool queueing time, except when backpressure (#7006)

We poll mempool for transactions at a fast rate (default 25 ms), unless there is backpressure. If there is backpressure then the next poll will be tried in + 25ms, until a max of 250 ms is reached. Since this makes the poll duration variable, we compute the max txns to pull based on the time passed since the last poll.

The backpressure is based on the number of proofs left in proof manager. So proofs are used for the rate of pulling, and txns are used for the max number of txns on the pull.

Running tests with low TPS (2 TPS per validator), confirmed that the time spent in mempool reduced: roughly from an average of 125ms (from the previous 250ms poll time) to an average of 12.5ms (from the 25ms faster poll time).
commit 1015b10
Author: Brian (Sunghoon) Cho <[email protected]>
Date: Wed Mar 8 13:06:38 2023 -0800

[cherry-pick] PR 6785: [quorum store] refactor batch_reader/store and batch requester (#7005)

This commit merges batch_reader and batch_store into a single struct and uses it as a synchronous component (via function calls) instead of an asynchronous one (via channels). Also batch_requester is refactored to use network rpc directly.

Co-authored-by: Zekun Li <[email protected]>

commit d49118b
Author: Igor <[email protected]>
Date: Tue Mar 7 16:33:34 2023 -0800

configs: enable wait_for_full_blocks and increase exclude_rounds

commit bda3c64
Author: Igor <[email protected]>
Date: Tue Mar 7 08:58:20 2023 -0800

Add new block poll based on time

commit 49c89fa
Author: igor-aptos <[email protected]>
Date: Tue Mar 7 21:50:11 2023 -0800

AverageIntCounter for backpressure states (#6963)

commit 1562e7f
Author: igor-aptos <[email protected]>
Date: Tue Mar 7 18:35:53 2023 -0800

[mempool] Wait until max block size or quorum_store_poll_count is reached (#6129)

Adding two new flags, to improve blockchain performance.

// Whether to create partial blocks when few transactions exist, or empty blocks when there is
// pending ordering, or to wait for quorum_store_poll_count * 30ms to collect transactions for a block
//
// It is more efficient to execute larger blocks, as it creates less overhead. On the other hand
// waiting increases latency (unless we are under high load that added waiting latency
// is compensated by faster execution time). So we want to balance the two, by waiting only
// when we are saturating the execution pipeline:
// - if there are more pending blocks than usual in the execution pipeline,
//   block is going to wait there anyways, so we can wait to create a bigger/more efficient block
// - in case our node is faster than others, and we don't have many pending blocks,
//   but we still see very large recent (pending) blocks, we know that there is demand
//   and others are creating large blocks, so we can wait as well.

* dynamic enabling of waiting for full blocks

commit cbf1052
Author: Sital Kedia <[email protected]>
Date: Wed Mar 8 09:26:40 2023 -0800

[preview] Fix broken unit test in previewnet

commit 7a0002f
Author: Brian (Sunghoon) Cho <[email protected]>
Date: Tue Mar 7 17:26:31 2023 -0800

[mempool] eager expiration based on queued transaction age (#6965) (#6991)

Introduces eager expiration, which expires transactions earlier (default: 3s) than their true client-provided expiration. This prevents transactions that are pulled from mempool from expiring upon execution, but more importantly it prevents these transactions from blocking transactions that would have succeeded at execution.

Eager expiration is triggered if sufficiently old transactions (default: 10s) are observed. This internally signals to mempool that a backlog is building.

Run an overload test `three_region_simulation_graceful_overload` with quorum store and observe that expirations drop from ~3K/s to < 100/s and this pushes TPS up +15% from 3.8K -> 4.4K

commit db0a542
Author: Brian (Sunghoon) Cho <[email protected]>
Date: Tue Mar 7 12:50:47 2023 -0800

[Previewnet] [cherry-pick] PR 6978: [Quorum Store] Use a single quorum_store_db instance across epochs (#6986)

Previously, the quorum_store_db instance was torn down and restarted on epoch changes. We observed this occasionally caused a panic when the new instance couldn't start.

The DB doesn't have to be torn down and restarted. The only interesting thing here is the DB will be created regardless of whether quorum store is turned on, but that should be negligible overhead.

Includes some refactoring to make twins unit tests work.

ref: #6855

Existing tests

Co-authored-by: Balaji Arun <[email protected]>

commit 1aa86fa
Author: Stelian Ionescu <[email protected]>
Date: Tue Mar 7 11:55:42 2023 -0500

[PSS] Add script for migrating a cluster from PSP to PSS (#6952) (#6959)

The script migrate_cluster_psp_to_pss.sh has two modes of operation:

* check: will check whether there are pods that violate the PSS "baseline" profile. It's useful to see where the security policy can be tightened from the default "privileged" profile.
* migrate: will perform the migration on the current K8s context. --policy-version should specify the target policy version, usually the same version as the K8s cluster.

The migration works in two phases:

1. Disabling PodSecurityPolicy
   * create an allow-everything security policy
   * create a rolebinding that binds each namespace service account to the newly created security policy. This effectively disables the PodSecurityPolicy admission controller
2. Enabling Pod Security Standards
   * enforce the "privileged" profile on all namespaces
   * warn & audit violations of the "baseline" profile

Example usage:

$ ./migrate_cluster_psp_to_pss.sh --policy-version=v1.25 check
$ ./migrate_cluster_psp_to_pss.sh --policy-version=v1.25 migrate

If unspecified, the default target policy version is "v1.24".

commit d0f791e
Author: larry-aptos <[email protected]>
Date: Mon Mar 6 15:20:02 2023 -0800

[indexer grpc] Move the WriteResource data one-level down. (#6957)

commit 5aef30f
Author: larry-aptos <[email protected]>
Date: Fri Mar 3 03:09:34 2023 -0800

Indexer grpc refactor cache performance (#6711)

commit 19cbeb1
Author: larry-aptos <[email protected]>
Date: Thu Mar 2 18:41:37 2023 -0800

[indexer grpc][proto] add default type for all enum for datastream and its proto data. (#6834)

commit 29a4c0a
Author: Guoteng Rao <[email protected]>
Date: Mon Mar 6 17:52:44 2023 -0800

[CP][Preview] State kv db pruner, and restore. (#6960)

commit 282f963
Author: Rustie Lin <[email protected]>
Date: Mon Mar 6 10:20:08 2023 -0800

[PREVIEWNET ONLY][docker] expose validator REST API (#6948)

commit 278c75a
Author: Brian (Sunghoon) Cho <[email protected]>
Date: Fri Mar 3 17:46:19 2023 -0800

[Previewnet] [cherry-pick] PR 6889: [Quorum Store] Implement end_epoch to clear state_computer between epoch end and start (#6916)

* [Quorum Store] Implement end_epoch to clear state_computer between epoch end and start (#6889)

  We should not be using a stale PayloadManager after epoch end, so adding an end_epoch function that sets the PayloadManager to None.

  This fixes a panic that was observed because the stale PayloadManager was still held at StateComputer during sync_to called from initiate_new_epoch. The PayloadManager expects to only see commits from its epoch, which is violated if the epoch change includes multiple epochs.

  changing_working_quorum_test (with failpoints) failed with panics. Rerun and observe no panics.

  TODO (not in this PR): would be nice to have a test that explicitly causes a multiple epoch sync_to from epoch manager.

* Fix headers (to pass pre-commit hooks)

commit 328eaa9
Author: Brian (Sunghoon) Cho <[email protected]>
Date: Fri Mar 3 16:40:16 2023 -0800

[Previewnet] [cherry-pick] PR 6830: [consensus] better protect sync condition (#6915)

This commit prevents sync_to from racing with commit in the state computer. Previously this would result in a state sync error; with quorum store, it may panic the node because of a decreasing round number.

Co-authored-by: Zekun Li <[email protected]>

commit 57cf378
Author: Rustie Lin <[email protected]>
Date: Fri Mar 3 15:30:34 2023 -0800

[PREVIEWNET ONLY][helm][aptos-node] update validator and VFN resources (#6913)

commit 97fa3e9
Author: Stelian Ionescu <[email protected]>
Date: Fri Mar 3 18:19:16 2023 -0500

[TF] Fix Kubernetes node taint (#6912)

commit add4b51
Author: Josh Lind <[email protected]>
Date: Fri Mar 3 17:01:48 2023 -0500

[State Sync] Update output syncing chunk sizes.

commit 0a0cb31
Author: Rustie Lin <[email protected]>
Date: Fri Mar 3 11:37:48 2023 -0800

[gha] copy preview images to dockerhub (#6897)

commit 856a0b1
Author: Sital Kedia <[email protected]>
Date: Thu Mar 2 13:32:26 2023 -0800

[previewnet] Enable transaction shuffling for previewnet

commit 0192ee9
Author: Sital Kedia <[email protected]>
Date: Mon Feb 6 16:33:50 2023 -0800

[BlockSTM] Optimized transaction shuffling

commit 5a7ca31
Author: Stelian Ionescu <[email protected]>
Date: Thu Mar 2 15:02:07 2023 -0500

Remove PodSecurityPolicy from Terraform configs (#6874)

Remove also ClusterRole and ClusterRoleBinding resources that were used to enact the PodSecurityPolicy policies.

The current recommended Kubernetes version for these configs is 1.23

* updated autoscaler image tag v.1.21.0 -> v.1.23.0
* updated autoscaler permissions to the recommended set for this version

The recommended mechanism to replace PodSecurityPolicy is [Pod Security Standards](https://v1-23.docs.kubernetes.io/docs/tasks/configure-pod-container/migrate-from-psp/).

* removed SYS_RESOURCE from requested capability set for Haproxy Deployment for compatibility with the PSS Baseline profile. Without this change, the entire "default" namespace would have to run under the Privileged profile, possibly compromising the security of the validator nodes.
commit 079cc28
Author: Rustie Lin <[email protected]>
Date: Thu Mar 2 11:15:34 2023 -0800

[tf/gha] build on push to preview branch and bump machine type (#6872)

* [gha] build on push to preview branch
* [tf] update instance types for preview

commit 6b092f5
Author: Sital Kedia <[email protected]>
Date: Tue Feb 28 18:01:00 2023 -0800

Revert accidental change in preview commit

commit 75ae87a
Author: Sital Kedia <[email protected]>
Date: Tue Feb 28 17:53:18 2023 -0800

[Previewnet] Configuration tuning to achieve high TPS (#6825)

commit d883a7c
Author: Rustie Lin <[email protected]>
Date: Thu Mar 2 13:00:50 2023 -0800

[forge] enforce max pod name len (#6879)

* [forge] enforce max pod name len (#6875)
* [forge] enforce max pod name len fix (#6878)

commit 3be58b2
Author: Rustie Lin <[email protected]>
Date: Thu Mar 2 09:53:25 2023 -0800

[gha] rename dockerhub release workflow & avoid forge preemption (#6842)

* [gha] rename dockerhub release workflow
* [gha][forge] include image tag in namespace to prevent cadence clash
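
For readers skimming the log above, the adaptive mempool polling described in commit e694ec5 (PR 6969) can be sketched roughly as follows. This is an illustration only, not the code from that PR: the 25 ms base, 25 ms step and 250 ms cap come from the commit message, while the names `PollState`, `next_interval_ms`, `max_txns_to_pull` and `target_tps` are hypothetical. Two further assumptions are that the interval resets to the base once backpressure clears, and that the pull size scales linearly with the elapsed time.

```python
from dataclasses import dataclass

BASE_INTERVAL_MS = 25   # default fast poll interval (from the commit message)
STEP_MS = 25            # added to the interval on each backpressured poll
MAX_INTERVAL_MS = 250   # cap on the interval under sustained backpressure


@dataclass
class PollState:
    """Hypothetical holder for the current poll interval."""
    interval_ms: int = BASE_INTERVAL_MS


def next_interval_ms(state: PollState, backpressure: bool) -> int:
    """Update the state and return the delay before the next mempool poll."""
    if backpressure:
        # Back off in 25 ms steps, never exceeding the 250 ms cap.
        state.interval_ms = min(state.interval_ms + STEP_MS, MAX_INTERVAL_MS)
    else:
        # Assumption: return to the fast rate as soon as backpressure clears.
        state.interval_ms = BASE_INTERVAL_MS
    return state.interval_ms


def max_txns_to_pull(elapsed_ms: int, target_tps: int) -> int:
    """Assumed linear scaling: a longer gap since the last poll allows a larger pull."""
    return target_tps * elapsed_ms // 1000
```

With a hypothetical `target_tps` of 2000, a poll after 25 ms would pull at most 50 transactions and a poll after 250 ms at most 500, so the pull budget per second stays the same regardless of how far the interval has backed off.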

larry-aptos added the CICD:build-images label Feb 21, 2023

larry-aptos changed the base branch from main to indexer-grpc-refactor February 21, 2023 07:43

larry-aptos requested review from bowenyang007 and geekflyer February 21, 2023 08:29

larry-aptos marked this pull request as ready for review February 21, 2023 08:31

geekflyer approved these changes Feb 21, 2023

pavel001k approved these changes Feb 21, 2023

larry-aptos force-pushed the indexer-grpc-refactor branch from 01457af to 26c18cb February 23, 2023 00:08

larry-aptos requested review from clay-aptos and saharct February 23, 2023 00:08

larry-aptos force-pushed the indexer-grpc-refactor branch from 26c18cb to 0ca31c4 February 23, 2023 00:45

larry-aptos force-pushed the indexer-grpc-refactor-cache-performance branch from 185c9e3 to 43e7a5d February 23, 2023 00:46

bowenyang007 approved these changes Feb 24, 2023

larry-aptos force-pushed the indexer-grpc-refactor branch 3 times, most recently from cbbd088 to dfa6bf7 March 2, 2023 00:16

Base automatically changed from indexer-grpc-refactor to main March 2, 2023 01:20

larry-aptos force-pushed the indexer-grpc-refactor-cache-performance branch from 43e7a5d to 1757d21 March 3, 2023 09:34

update the cache worker performance.

b63791c

larry-aptos force-pushed the indexer-grpc-refactor-cache-performance branch from 1757d21 to b63791c March 3, 2023 09:37

update the cache worker performance.

fb9a6de

larry-aptos enabled auto-merge (squash) March 3, 2023 10:01

larry-aptos merged commit ab3983b into main Mar 3, 2023

larry-aptos deleted the indexer-grpc-refactor-cache-performance branch March 3, 2023 11:09

geekflyer mentioned this pull request Mar 7, 2023

cherry-pick indexer-grpc changes into preview branch #6968

Merged

geekflyer pushed a commit that referenced this pull request Mar 7, 2023

Indexer grpc refactor cache performance (#6711)

9ce880b

geekflyer pushed a commit that referenced this pull request Mar 7, 2023

Indexer grpc refactor cache performance (#6711)

5aef30f

bchocho mentioned this pull request Mar 17, 2023

[cherry-pick] PR 7256 #7259

Merged
