[TF] Fix Kubernetes node taint #6912
Merged
rustielin approved these changes on Mar 3, 2023
This was referenced Mar 7, 2023
[Previewnet] [cherry-pick] PR 6965: [mempool] eager expiration based on queued transaction age
#6991
Merged
ibalajiarun added a commit that referenced this pull request on Mar 9, 2023
commit 4b2cddb
Author: Rati Gelashvili
Date: Wed Mar 8 22:42:48 2023 -0500

    [BlockSTM] No eager validation for first execution wave (#6996)

commit 8ab8c97
Author: Brian (Sunghoon) Cho
Date: Wed Mar 8 17:58:57 2023 -0800

    [cherry-pick] PR 7016: [Quorum Store] Downgrade most logs to trace (#7020)

    Downgrade most logs to trace, only keep around enough debug logs to track locally created batch/digest/proof through the stages.

    Run smoke test, observe logs are not too spammy.

commit b741221
Author: Brian (Sunghoon) Cho
Date: Wed Mar 8 17:03:35 2023 -0800

    [cherry-pick] PR 7013: [Quorum Store] adjust memory and db quotas (#7017)

    Set defaults to be suitable for a 100 validator network. Paraphrasing Sasha:

    20000 tps * 1000B (size of a txn) * 60 s (duration) = 1_200_000_000 bytes
    Divide by 100 validators = 12_000_000 B that each validator asks me to persist on average.

    Adding some buffer for peaks:
    Memory: 10x - 120_000_000 bytes
    Storage: 25x - 300_000_000 bytes

commit e694ec5
Author: Brian (Sunghoon) Cho
Date: Wed Mar 8 13:14:42 2023 -0800

    [cherry-pick] PR 6969: [Quorum Store] poll mempool at a faster pace to reduce mempool queueing time, except when backpressure (#7006)

    We poll mempool for transactions at a fast rate (default 25 ms), unless there is backpressure. If there is backpressure then the next poll will be tried in + 25ms, until a max of 250 ms is reached.

    Since this makes the poll duration variable, we compute the max txns to pull based on the time passed since the last poll.

    The backpressure is based on the number of proofs left in proof manager. So proofs are used for the rate of pulling, and txns are used for the max number of txns on the pull.

    Running tests with low TPS (2 TPS per validator), confirmed that the time spent in mempool reduced: roughly from an average of 125ms (from the previous 250ms poll time) to an average of 12.5ms (from the 25ms faster poll time).

commit 1015b10
Author: Brian (Sunghoon) Cho
Date: Wed Mar 8 13:06:38 2023 -0800

    [cherry-pick] PR 6785: [quorum store] refactor batch_reader/store and batch requester (#7005)

    This commit merges batch_reader and batch_store into a single struct and uses it as a synchronous component (via function calls) instead of an asynchronous one (via channels). Also batch_requester is refactored to use network rpc directly.

    Co-authored-by: Zekun Li

commit d49118b
Author: Igor
Date: Tue Mar 7 16:33:34 2023 -0800

    configs: enable wait_for_full_blocks and increase exclude_rounds

commit bda3c64
Author: Igor
Date: Tue Mar 7 08:58:20 2023 -0800

    Add new block poll based on time

commit 49c89fa
Author: igor-aptos
Date: Tue Mar 7 21:50:11 2023 -0800

    AverageIntCounter for backpressure states (#6963)

commit 1562e7f
Author: igor-aptos
Date: Tue Mar 7 18:35:53 2023 -0800

    [mempool] Wait until max block size or quorum_store_poll_count is reached (#6129)

    Adding two new flags, to improve blockchain performance.

    // Whether to create partial blocks when few transactions exist, or empty blocks when there is
    // pending ordering, or to wait for quorum_store_poll_count * 30ms to collect transactions for a block
    //
    // It is more efficient to execute larger blocks, as it creates less overhead. On the other hand
    // waiting increases latency (unless we are under high load that added waiting latency
    // is compensated by faster execution time). So we want to balance the two, by waiting only
    // when we are saturating the execution pipeline:
    // - if there are more pending blocks than usual in the execution pipeline,
    //   block is going to wait there anyways, so we can wait to create a bigger/more efficient block
    // - in case our node is faster than others, and we don't have many pending blocks,
    //   but we still see very large recent (pending) blocks, we know that there is demand
    //   and others are creating large blocks, so we can wait as well.

    * dynamic enabling of waiting for full blocks

commit cbf1052
Author: Sital Kedia
Date: Wed Mar 8 09:26:40 2023 -0800

    [preview] Fix broken unit test in previewnet

commit 7a0002f
Author: Brian (Sunghoon) Cho
Date: Tue Mar 7 17:26:31 2023 -0800

    [mempool] eager expiration based on queued transaction age (#6965) (#6991)

    Introduces eager expiration, which expires transactions earlier (default: 3s) than their true client-provided expiration. This prevents transactions that are pulled from mempool from expiring upon execution, but more importantly it prevents these transactions from blocking transactions that would have succeeded at execution.

    Eager expiration is triggered if sufficiently old transactions (default: 10s) are observed. This internally signals to mempool that a backlog is building.

    Run an overload test `three_region_simulation_graceful_overload` with quorum store and observe that expirations drop from ~3K/s to < 100/s and this pushes TPS up +15% from 3.8K -> 4.4K

commit db0a542
Author: Brian (Sunghoon) Cho
Date: Tue Mar 7 12:50:47 2023 -0800

    [Previewnet] [cherry-pick] PR 6978: [Quorum Store] Use a single quorum_store_db instance across epochs (#6986)

    Previously, the quorum_store_db instance was torn down and restarted on epoch changes. We observed this occasionally caused a panic when the new instance couldn't start.

    The DB doesn't have to be torn down and restarted. The only interesting thing here is the DB will be created regardless of whether quorum store is turned on, but that should be negligible overhead.

    Includes some refactoring to make twins unit tests work.

    ref: #6855

    Existing tests

    Co-authored-by: Balaji Arun

commit 1aa86fa
Author: Stelian Ionescu
Date: Tue Mar 7 11:55:42 2023 -0500

    [PSS] Add script for migrating a cluster from PSP to PSS (#6952) (#6959)

    The script migrate_cluster_psp_to_pss.sh has two modes of operation:

    * check: will check whether there are pods that violate the PSS "baseline" profile. It's useful to see where the security policy can be tightened from the default "privileged" profile.
    * migrate: will perform the migration on the current K8s context. --policy-version should specify the target policy version, usually the same version as the K8s cluster.

    The migration works in two phases:

    1. Disabling PodSecurityPolicy
       * create an allow-everything security policy
       * create a rolebinding that binds each namespace service account to the newly created security policy. This effectively disables the PodSecurityPolicy admission controller
    2. Enabling Pod Security Standards
       * enforce the "privileged" profile on all namespaces
       * warn & audit violations of the "baseline" profile

    Example usage:

        $ ./migrate_cluster_psp_to_pss.sh --policy-version=v1.25 check
        $ ./migrate_cluster_psp_to_pss.sh --policy-version=v1.25 migrate

    If unspecified, the default target policy version is "v1.24".

commit d0f791e
Author: larry-aptos
Date: Mon Mar 6 15:20:02 2023 -0800

    [indexer grpc] Move the WriteResource data one-level down. (#6957)

commit 5aef30f
Author: larry-aptos
Date: Fri Mar 3 03:09:34 2023 -0800

    Indexer grpc refactor cache performance (#6711)

commit 19cbeb1
Author: larry-aptos
Date: Thu Mar 2 18:41:37 2023 -0800

    [indexer grpc][proto] add default type for all enum for datastream and its proto data. (#6834)

commit 29a4c0a
Author: Guoteng Rao
Date: Mon Mar 6 17:52:44 2023 -0800

    [CP][Preview] State kv db pruner, and restore. (#6960)

commit 282f963
Author: Rustie Lin
Date: Mon Mar 6 10:20:08 2023 -0800

    [PREVIEWNET ONLY][docker] expose validator REST API (#6948)

commit 278c75a
Author: Brian (Sunghoon) Cho
Date: Fri Mar 3 17:46:19 2023 -0800

    [Previewnet] [cherry-pick] PR 6889: [Quorum Store] Implement end_epoch to clear state_computer between epoch end and start (#6916)

    * [Quorum Store] Implement end_epoch to clear state_computer between epoch end and start (#6889)

      We should not be using a stale PayloadManager after epoch end, so adding an end_epoch function that sets the PayloadManager to None.

      This fixes a panic that was observed because the stale PayloadManager was still held at StateComputer during sync_to called from initiate_new_epoch. The PayloadManager expects to only see commits from its epoch, which is violated if the epoch change includes multiple epochs.

      changing_working_quorum_test (with failpoints) failed with panics. Rerun and observe no panics.

      TODO (not in this PR): would be nice to have a test that explicitly causes a multiple epoch sync_to from epoch manager.

    * Fix headers (to pass pre-commit hooks)

commit 328eaa9
Author: Brian (Sunghoon) Cho
Date: Fri Mar 3 16:40:16 2023 -0800

    [Previewnet] [cherry-pick] PR 6830: [consensus] better protect sync condition (#6915)

    This commit prevents sync_to from racing with commit in the state computer. Previously this would result in a state sync error; with quorum store, it may panic the node because of a decreasing round number.

    Co-authored-by: Zekun Li

commit 57cf378
Author: Rustie Lin
Date: Fri Mar 3 15:30:34 2023 -0800

    [PREVIEWNET ONLY][helm][aptos-node] update validator and VFN resources (#6913)

commit 97fa3e9
Author: Stelian Ionescu
Date: Fri Mar 3 18:19:16 2023 -0500

    [TF] Fix Kubernetes node taint (#6912)

commit add4b51
Author: Josh Lind
Date: Fri Mar 3 17:01:48 2023 -0500

    [State Sync] Update output syncing chunk sizes.

commit 0a0cb31
Author: Rustie Lin
Date: Fri Mar 3 11:37:48 2023 -0800

    [gha] copy preview images to dockerhub (#6897)

commit 856a0b1
Author: Sital Kedia
Date: Thu Mar 2 13:32:26 2023 -0800

    [previewnet] Enable transaction shuffling for previewnet

commit 0192ee9
Author: Sital Kedia
Date: Mon Feb 6 16:33:50 2023 -0800

    [BlockSTM] Optimized transaction shuffling

commit 5a7ca31
Author: Stelian Ionescu
Date: Thu Mar 2 15:02:07 2023 -0500

    Remove PodSecurityPolicy from Terraform configs (#6874)

    Remove also ClusterRole and ClusterRoleBinding resources that were used to enact the PodSecurityPolicy policies.

    The current recommended Kubernetes version for these configs is 1.23

    * updated autoscaler image tag v.1.21.0 -> v.1.23.0
    * updated autoscaler permissions to the recommended set for this version

    The recommended mechanism to replace PodSecurityPolicy is [Pod Security Standards](https://v1-23.docs.kubernetes.io/docs/tasks/configure-pod-container/migrate-from-psp/).

    * removed SYS_RESOURCE from requested capability set for Haproxy Deployment for compatibility with the PSS Baseline profile. Without this change, the entire "default" namespace would have to run under the Privileged profile, possibly compromising the security of the validator nodes.

commit 079cc28
Author: Rustie Lin
Date: Thu Mar 2 11:15:34 2023 -0800

    [tf/gha] build on push to preview branch and bump machine type (#6872)

    * [gha] build on push to preview branch
    * [tf] update instance types for preview

commit 6b092f5
Author: Sital Kedia
Date: Tue Feb 28 18:01:00 2023 -0800

    Revert accidental change in preview commit

commit 75ae87a
Author: Sital Kedia
Date: Tue Feb 28 17:53:18 2023 -0800

    [Previewnet] Configuration tuning to achieve high TPS (#6825)

commit d883a7c
Author: Rustie Lin
Date: Thu Mar 2 13:00:50 2023 -0800

    [forge] enforce max pod name len (#6879)

    * [forge] enforce max pod name len (#6875)
    * [forge] enforce max pod name len fix (#6878)

commit 3be58b2
Author: Rustie Lin
Date: Thu Mar 2 09:53:25 2023 -0800

    [gha] rename dockerhub release workflow & avoid forge preemption (#6842)

    * [gha] rename dockerhub release workflow
    * [gha][forge] include image tag in namespace to prevent cadence clash
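
The quota arithmetic quoted in commit b741221 (PR 7013) above can be re-derived with a short script. This is only a worked check of the numbers in that commit message; the function and parameter names are illustrative and are not taken from the Aptos codebase.

```python
# Worked check of the quota arithmetic quoted in commit b741221 (PR 7013).
# Only the input numbers come from the commit message; every name here is
# hypothetical and chosen for illustration.

def quorum_store_quotas(tps, txn_bytes, window_secs, validators,
                        memory_factor, storage_factor):
    """Return the total load and the per-validator quotas, in bytes."""
    total_bytes = tps * txn_bytes * window_secs
    per_validator = total_bytes // validators
    return {
        "total_bytes": total_bytes,
        "per_validator_bytes": per_validator,
        "memory_quota_bytes": per_validator * memory_factor,
        "storage_quota_bytes": per_validator * storage_factor,
    }


quotas = quorum_store_quotas(
    tps=20_000,          # transactions per second
    txn_bytes=1_000,     # assumed size of one transaction
    window_secs=60,      # duration to buffer
    validators=100,      # network size the defaults target
    memory_factor=10,    # buffer for peaks, in memory
    storage_factor=25,   # buffer for peaks, on disk
)

for name, value in quotas.items():
    print(f"{name}: {value:_}")
```

The printed values match the figures in the commit message: 1_200_000_000 bytes in total, 12_000_000 per validator, 120_000_000 for memory and 300_000_000 for storage.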
Description
Fix node taint key.
Tested by spinning up Previewnet on GCP.
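
The description does not say what the taint key was or how it was wrong. As a rough illustration of why the key matters, here is a minimal Python sketch of the Kubernetes rule for matching a pod toleration against a node taint. The key names and values (`example.com/dedicted`, `example.com/dedicated`, `validators`) are hypothetical and are not taken from this PR or from the Aptos Terraform configs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str          # e.g. "NoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"   # Kubernetes defaults to "Equal"
    value: str = ""
    effect: str = ""          # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Simplified version of the Kubernetes toleration/taint match."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if not toleration.key:
        # An empty key is only meaningful with "Exists": it matches any taint key.
        return toleration.operator == "Exists"
    if toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.operator == "Equal" and toleration.value == taint.value


# Hypothetical keys: the node is tainted with a misspelled key, so the
# workload's toleration never matches until the taint key is corrected.
pod_toleration = Toleration(key="example.com/dedicated", value="validators",
                            effect="NoExecute")
wrong_taint = Taint(key="example.com/dedicted", value="validators",
                    effect="NoExecute")
fixed_taint = Taint(key="example.com/dedicated", value="validators",
                    effect="NoExecute")

print(tolerates(pod_toleration, wrong_taint))  # False
print(tolerates(pod_toleration, fixed_taint))  # True
```

Under these assumptions, a taint whose key does not match the toleration's key either keeps the intended pods off the node or, if the workloads expect the taint to repel other pods, fails to reserve the node at all.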