[TEST ONLY] previewnet consensus-only baseline #6989

Closed
wants to merge 22 commits

Conversation

@bchocho bchocho commented Mar 7, 2023

Description

Test Plan

rustielin and others added 22 commits March 2, 2023 11:21
* [gha] rename dockerhub release workflow

* [gha][forge] include image tag in namespace to prevent cadence clash
* [forge] enforce max pod name len (#6875)

* [forge] enforce max pod name len fix (#6878)
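
For context on the two pod-name-length commits above: Kubernetes object names that double as DNS-1123 labels are capped at 63 characters, so generated Forge pod names have to be clamped. A minimal sketch of the idea, where the constant, function name, and truncation strategy are illustrative assumptions rather than the actual Forge code:

```rust
/// Kubernetes names used as DNS-1123 labels are limited to 63 characters.
const MAX_POD_NAME_LEN: usize = 63;

/// Hypothetical helper: clamp a generated pod name to the limit.
fn enforce_max_pod_name_len(name: &str) -> String {
    if name.len() <= MAX_POD_NAME_LEN {
        return name.to_string();
    }
    // Pod names are ASCII here, so byte indexing is safe; trim trailing '-'
    // so the truncated result remains a valid DNS label.
    name[..MAX_POD_NAME_LEN].trim_end_matches('-').to_string()
}
```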
* [gha] build on push to preview branch

* [tf] update instance types for preview
Also remove the ClusterRole and ClusterRoleBinding resources that were
used to enforce the PodSecurityPolicy policies.

The current recommended Kubernetes version for these configs is 1.23
 * updated autoscaler image tag v1.21.0 -> v1.23.0
 * updated autoscaler permissions to the recommended set for this version

The recommended mechanism to replace PodSecurityPolicy is [Pod
Security Standards](https://v1-23.docs.kubernetes.io/docs/tasks/configure-pod-container/migrate-from-psp/).

 * removed SYS_RESOURCE from the requested capability set for the HAProxy
Deployment, for compatibility with the PSS Baseline profile. Without
this change, the entire "default" namespace would have to run under
the Privileged profile, possibly compromising the security of the
validator nodes.
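
As background for the capability change: the PSS baseline profile only permits containers to add capabilities from a short allowed list, and SYS_RESOURCE is not on it. A hedged sketch of a baseline-compatible securityContext fragment, built with serde_json; NET_BIND_SERVICE is an assumed stand-in for what the HAProxy pod might still request, not taken from the actual Deployment:

```rust
use serde_json::json;

// Illustrative only: a container securityContext fragment that stays within
// the PSS "baseline" profile. SYS_RESOURCE is not baseline-allowed, so it is
// simply no longer requested; NET_BIND_SERVICE (baseline-allowed) is an
// assumed example capability.
fn baseline_compatible_security_context() -> serde_json::Value {
    json!({
        "capabilities": {
            "add": ["NET_BIND_SERVICE"]
        }
    })
}
```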
…ondition (#6915)

This commit prevents sync_to from racing with commit in the state computer. Previously this race resulted in a
state sync error; with quorum store, it may panic the node because of a decreasing round number.

Co-authored-by: Zekun Li <[email protected]>
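
To make the race concrete, here is a minimal sketch of one way to serialize the two paths. The types, the single lock, and the round bookkeeping are hypothetical; the real StateComputer is async and considerably richer:

```rust
use std::sync::Mutex;

// Hypothetical sketch: put `commit` and `sync_to` behind one lock so a
// state-sync request can never interleave with an in-flight commit.
struct StateComputer {
    committed_round: Mutex<u64>,
}

impl StateComputer {
    fn commit(&self, round: u64) {
        let mut committed = self.committed_round.lock().unwrap();
        // With quorum store, observing a decreasing round here is fatal.
        assert!(round > *committed, "commit round must increase");
        *committed = round;
    }

    fn sync_to(&self, target_round: u64) {
        // Holding the same lock means no concurrent commit can observe a
        // round going backwards while state sync rewrites storage.
        let mut committed = self.committed_round.lock().unwrap();
        if target_round > *committed {
            *committed = target_round;
        }
    }
}
```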
…h to clear state_computer between epoch end and start (#6916)

* [Quorum Store] Implement end_epoch to clear state_computer between epoch end and start (#6889)

We should not be using a stale PayloadManager after epoch end, so we add an end_epoch function that sets the PayloadManager to None.

This fixes a panic that was observed because the stale PayloadManager was still held by the StateComputer during a sync_to called from initiate_new_epoch. The PayloadManager expects to only see commits from its own epoch, which is violated if the epoch change spans multiple epochs.

changing_working_quorum_test (with failpoints) previously failed with panics. Reran and observed no panics.
TODO (not in this PR): would be nice to have a test that explicitly causes a multiple epoch sync_to from epoch manager.

* Fix headers (to pass pre-commit hooks)
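
A minimal sketch of the shape of this fix, with hypothetical types (the real PayloadManager and StateComputer are async and far richer):

```rust
use std::sync::{Arc, Mutex};

struct PayloadManager;

// Hypothetical sketch: the state computer holds its per-epoch PayloadManager
// behind an Option, and end_epoch clears it so a stale instance can never
// serve commits that span an epoch change.
struct StateComputer {
    payload_manager: Mutex<Option<Arc<PayloadManager>>>,
}

impl StateComputer {
    fn new_epoch(&self, payload_manager: Arc<PayloadManager>) {
        *self.payload_manager.lock().unwrap() = Some(payload_manager);
    }

    fn end_epoch(&self) {
        // Drop the stale PayloadManager between epoch end and start; a
        // commit arriving in that window now fails fast instead of
        // panicking on an epoch mismatch.
        *self.payload_manager.lock().unwrap() = None;
    }
}
```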
The script migrate_cluster_psp_to_pss.sh has two modes of operation:

 * check: will check whether there are pods that violate the PSS
 "baseline" profile. It's useful to see where the security policy can
 be tightened from the default "privileged" profile.
 * migrate: will perform the migration on the current K8s context.
 --policy-version should specify the target policy version, usually
 the same version as the K8s cluster.

The migration works in two phases:

1. Disabling PodSecurityPolicy

 * create an allow-everything security policy
 * create a RoleBinding that binds each namespace service account
 to the newly created security policy. This effectively disables the
 PodSecurityPolicy admission controller

2. Enabling Pod Security Standards

* enforce the "privileged" profile on all namespaces
* warn & audit violations of the "baseline" profile

Example usage:

$ ./migrate_cluster_psp_to_pss.sh --policy-version=v1.25 check

$ ./migrate_cluster_psp_to_pss.sh --policy-version=v1.25 migrate

If unspecified, the default target policy version is "v1.24".
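
For reference, the enforce/warn/audit behavior in phase 2 corresponds to the documented pod-security.kubernetes.io namespace labels. A hedged sketch of applying them to one namespace; the helper function and shelling out to kubectl are illustrative, not the script's actual contents:

```rust
use std::io::{Error, ErrorKind};
use std::process::Command;

// Hypothetical helper: apply the Pod Security Standards labels to a
// namespace via kubectl, mirroring phase 2 of the migration.
fn label_namespace_for_pss(ns: &str, policy_version: &str) -> std::io::Result<()> {
    let status = Command::new("kubectl")
        .arg("label")
        .arg("--overwrite")
        .arg("namespace")
        .arg(ns)
        // Enforce the permissive "privileged" profile on the namespace...
        .arg("pod-security.kubernetes.io/enforce=privileged")
        .arg(format!("pod-security.kubernetes.io/enforce-version={policy_version}"))
        // ...while warning about and auditing "baseline" violations.
        .arg("pod-security.kubernetes.io/warn=baseline")
        .arg("pod-security.kubernetes.io/audit=baseline")
        .status()?;
    if !status.success() {
        return Err(Error::new(
            ErrorKind::Other,
            format!("kubectl label failed for namespace {ns}"),
        ));
    }
    Ok(())
}
```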
…m_store_db instance across epochs (#6986)

### Description

Previously, the quorum_store_db instance was torn down and restarted on epoch changes.
We observed that this occasionally caused a panic when the new instance couldn't start.
The DB doesn't have to be torn down and restarted. The only notable side effect is that the DB
will now be created regardless of whether quorum store is turned on, but that should be negligible overhead.

Includes some refactoring to make twins unit tests work.

ref: #6855 
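
A sketch of the shape of the change, with hypothetical types (the real quorum_store_db is a RocksDB wrapper and the epoch manager is far more involved): open the DB once at node startup and hand the same handle to every epoch.

```rust
use std::path::Path;
use std::sync::Arc;

struct QuorumStoreDB;

impl QuorumStoreDB {
    fn open(_path: &Path) -> Self {
        // RocksDB open elided. Per the PR, the DB is now created
        // unconditionally, even when quorum store is disabled.
        QuorumStoreDB
    }
}

struct EpochManager {
    // Opened once at startup; lives across epoch changes.
    quorum_store_db: Arc<QuorumStoreDB>,
}

impl EpochManager {
    fn start_epoch(&self) -> Arc<QuorumStoreDB> {
        // Each epoch clones the handle instead of tearing the DB down and
        // reopening it, which occasionally failed on restart.
        Arc::clone(&self.quorum_store_db)
    }
}
```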

### Test Plan

Existing tests

Co-authored-by: Balaji Arun <[email protected]>
@bchocho bchocho added the CICD:run-consensus-only-perf-test (builds consensus-only aptos-node image and uses it to run forge), CICD:build-consensus-only-image, and CICD:run-e2e-tests (when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR) labels on Mar 7, 2023

github-actions bot commented Mar 7, 2023

❌ Forge suite compat failure on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 6616b93eb0f370de038a3a2c1bf1c21dd798db3a

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 6616b93eb0f370de038a3a2c1bf1c21dd798db3a (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 8061 TPS, 4726 ms latency, 7200 ms p99 latency, no expired txns
2. Upgrading first Validator to new version: 6616b93eb0f370de038a3a2c1bf1c21dd798db3a
Test Failed: Timed out waiting for Node validator-1:53972dc0df64582a73c6274b3d340822198465b9f957ce9cdc306687cc0f019b to be healthy
Trailing Log Lines:
::error::Timed out waiting for Node validator-1:53972dc0df64582a73c6274b3d340822198465b9f957ce9cdc306687cc0f019b to be healthy
test compatibility::simple-validator-upgrade ... FAILED
Error: Timed out waiting for Node validator-1:53972dc0df64582a73c6274b3d340822198465b9f957ce9cdc306687cc0f019b to be healthy


Swarm logs can be found here: See fgi output for more information.
{"level":"INFO","source":{"package":"aptos_forge","file":"testsuite/forge/src/backend/k8s/cluster_helper.rs:281"},"thread_name":"main","hostname":"forge-compat-pr-6989-1678228851-testnet-2d8b1b57553d869190f61df","timestamp":"2023-03-07T22:49:04.079640Z","message":"Deleting namespace forge-compat-pr-6989: Some(NamespaceStatus { phase: Some(\"Terminating\") })"}
{"level":"INFO","source":{"package":"aptos_forge","file":"testsuite/forge/src/backend/k8s/cluster_helper.rs:389"},"thread_name":"main","hostname":"forge-compat-pr-6989-1678228851-testnet-2d8b1b57553d869190f61df","timestamp":"2023-03-07T22:49:04.079671Z","message":"aptos-node resources for Forge removed in namespace: forge-compat-pr-6989"}

failures:
    compatibility::simple-validator-upgrade

test result: FAILED. 0 passed; 1 failed; 0 filtered out

Failed to run tests:
Tests Failed
Error: Tests Failed
Debugging output:
NAME                                   READY   STATUS      RESTARTS      AGE
aptos-node-0-validator-0               1/1     Running     0             7m20s
aptos-node-1-validator-0               0/1     Error       3 (38s ago)   2m57s
aptos-node-2-validator-0               1/1     Running     0             7m20s
aptos-node-3-validator-0               1/1     Running     0             7m20s
aptos-node-4-validator-0               1/1     Running     0             7m20s
genesis-aptos-genesis-eforge85-9pxvw   0/1     Completed   0             7m30s


github-actions bot commented Mar 7, 2023

❌ Forge suite framework_upgrade failure on cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> 6616b93eb0f370de038a3a2c1bf1c21dd798db3a

Compatibility test results for cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> 6616b93eb0f370de038a3a2c1bf1c21dd798db3a (PR)
Upgrade the nodes to version: 6616b93eb0f370de038a3a2c1bf1c21dd798db3a
Test Failed: Timed out waiting for Node validator-0:ac28811f42802c7d6383db5c3a2bc0e8204be94c9ef8835c23191da898a1380f to be healthy
Trailing Log Lines:
{"level":"INFO","source":{"package":"aptos_forge","file":"testsuite/forge/src/backend/k8s/stateful_set.rs:145"},"thread_name":"main","hostname":"forge-framework-upgrade-pr-6989-1678228856-cb4ba0a57c998c60cbab","timestamp":"2023-03-07T22:49:40.726602Z","message":"Waiting for pod aptos-node-0-validator-0"}
{"level":"INFO","source":{"package":"aptos_forge","file":"testsuite/forge/src/backend/k8s/stateful_set.rs:101"},"thread_name":"main","hostname":"forge-framework-upgrade-pr-6989-1678228856-cb4ba0a57c998c60cbab","timestamp":"2023-03-07T22:49:50.732256Z","message":"StatefulSet aptos-node-0-validator has scaled to 1"}
::error::Timed out waiting for Node validator-0:ac28811f42802c7d6383db5c3a2bc0e8204be94c9ef8835c23191da898a1380f to be healthy
test framework_upgrade::framework-upgrade ... FAILED
Error: Timed out waiting for Node validator-0:ac28811f42802c7d6383db5c3a2bc0e8204be94c9ef8835c23191da898a1380f to be healthy


Swarm logs can be found here: See fgi output for more information.
{"level":"INFO","source":{"package":"aptos_forge","file":"testsuite/forge/src/backend/k8s/cluster_helper.rs:281"},"thread_name":"main","hostname":"forge-framework-upgrade-pr-6989-1678228856-cb4ba0a57c998c60cbab","timestamp":"2023-03-07T22:50:51.006166Z","message":"Deleting namespace forge-framework-upgrade-pr-6989: Some(NamespaceStatus { phase: Some(\"Terminating\") })"}
{"level":"INFO","source":{"package":"aptos_forge","file":"testsuite/forge/src/backend/k8s/cluster_helper.rs:389"},"thread_name":"main","hostname":"forge-framework-upgrade-pr-6989-1678228856-cb4ba0a57c998c60cbab","timestamp":"2023-03-07T22:50:51.006189Z","message":"aptos-node resources for Forge removed in namespace: forge-framework-upgrade-pr-6989"}

failures:
    framework_upgrade::framework-upgrade

Failed to run tests:
Tests Failed
test result: FAILED. 0 passed; 1 failed; 0 filtered out

Error: Tests Failed
Debugging output:
NAME                       READY   STATUS        RESTARTS   AGE
aptos-node-1-validator-0   1/1     Terminating   0          7m15s
aptos-node-2-validator-0   1/1     Running       0          7m15s
aptos-node-3-validator-0   1/1     Running       0          7m15s
aptos-node-4-validator-0   1/1     Running       0          7m15s


github-actions bot commented Mar 7, 2023

✅ Forge suite consensus_only_perf_benchmark success on consensus_only_perf_test_6616b93eb0f370de038a3a2c1bf1c21dd798db3a ==> 6616b93eb0f370de038a3a2c1bf1c21dd798db3a

Test Ok


github-actions bot commented Mar 7, 2023

✅ Forge suite land_blocking success on 6616b93eb0f370de038a3a2c1bf1c21dd798db3a

performance benchmark with full nodes : 7260 TPS, 5464 ms latency, 8100 ms p99 latency, no expired txns
Test Ok

@bchocho bchocho closed this Mar 13, 2023
@bchocho bchocho deleted the brian/consensus-only-preview-net branch March 13, 2023 22:23
Labels
CICD:build-consensus-only-image, CICD:run-consensus-only-perf-test (builds consensus-only aptos-node image and uses it to run forge), CICD:run-e2e-tests (when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR)
7 participants