From b77d74b4f91b6bfafdd9dc7070fc6d0b6c821f95 Mon Sep 17 00:00:00 2001
From: Kate Osborn <50597707+kate-osborn@users.noreply.github.com>
Date: Fri, 8 Dec 2023 11:16:00 -0700
Subject: [PATCH] Add results for reconfig test (#1354)

Problem: We need to run the reconfig test against the 1.1 release.

Solution: Record the results of the reconfig test for the 1.1 release.
---
 tests/reconfig/results/1.0.0/1.0.0.md     |  2 +-
 tests/reconfig/results/1.1.0/1.1.0.md     | 92 +++++++++++++++++++++++
 tests/reconfig/scripts/delete-multiple.sh |  6 +-
 tests/reconfig/setup.md                   | 30 ++++----
 4 files changed, 114 insertions(+), 16 deletions(-)
 create mode 100644 tests/reconfig/results/1.1.0/1.1.0.md
 mode change 100644 => 100755 tests/reconfig/scripts/delete-multiple.sh

diff --git a/tests/reconfig/results/1.0.0/1.0.0.md b/tests/reconfig/results/1.0.0/1.0.0.md
index 89b9ceeb47..101bde04be 100644
--- a/tests/reconfig/results/1.0.0/1.0.0.md
+++ b/tests/reconfig/results/1.0.0/1.0.0.md
@@ -56,7 +56,7 @@ NGF deployment:
 ## NumResources -> Total Resources

 | NumResources | Gateways | Secrets | ReferenceGrants | Namespaces | application Pods | application Services | HTTPRoutes | Total Resources |
-| ------------ | -------- | ------- | --------------- | ---------- | ---------------- | -------------------- | ---------- | --------------- |
+|--------------|----------|---------|-----------------|------------|------------------|----------------------|------------|-----------------|
 | x            | 1        | 1       | 1               | x+1        | 2x               | 2x                   | 3x         |                 |
 | 30           | 1        | 1       | 1               | 31         | 60               | 60                   | 90         | 244             |
 | 150          | 1        | 1       | 1               | 151        | 300              | 300                  | 450        | 1204            |
diff --git a/tests/reconfig/results/1.1.0/1.1.0.md b/tests/reconfig/results/1.1.0/1.1.0.md
new file mode 100644
index 0000000000..3dcc8ed2e9
--- /dev/null
+++ b/tests/reconfig/results/1.1.0/1.1.0.md
@@ -0,0 +1,92 @@
+# Reconfiguration testing Results
+
+
+- [Reconfiguration testing Results](#reconfiguration-testing-results)
+  - [Summary](#summary)
+  - [Test environment](#test-environment)
+  - [Results Tables](#results-tables)
+    - [NGINX Reloads and Time to Ready](#nginx-reloads-and-time-to-ready)
+    - [Event Batch Processing](#event-batch-processing)
+  - [NumResources to Total Resources](#numresources-to-total-resources)
+  - [Observations](#observations)
+  - [Future Improvements](#future-improvements)
+
+
+## Summary
+
+- Better reload times across all tests
+- Similar TimeToReadyTotal and TimeToReadyAveSingle times
+- Similar event batch totals
+- Slightly better event batch processing average times
+- No new errors or issues
+
+## Test environment
+
+GKE cluster:
+
+- Node count: 4
+- Instance Type: n2d-standard-2
+- k8s version: 1.27.3-gke.100
+- Zone: us-west2-a
+- Total vCPUs: 8
+- Total RAM: 32GB
+- Max pods per node: 110
+
+NGF deployment:
+
+- NGF version: edge - git commit 3cab370a46bccd55c115c16e23a475df2497a3d2
+- NGINX Version: 1.25.3
+
+## Results Tables
+
+### NGINX Reloads and Time to Ready
+
+| Test number | NumResources | TimeToReadyTotal (s) | TimeToReadyAvgSingle (s) | NGINX reloads | NGINX reload avg time (ms) | <= 500ms | <= 1000ms |
+|-------------|--------------|----------------------|--------------------------|---------------|----------------------------|----------|-----------|
+| 1           | 30           | 1.5                  | <1                       | 2             | 158.5                      | 100%     | 100%      |
+| 1           | 150          | 3.5                  | 1                        | 2             | 272.5                      | 100%     | 100%      |
+| 2           | 30           | 34                   | <1                       | 93            | 136                        | 100%     | 100%      |
+| 2           | 150          | 176.5                | <1                       | 451           | 203.98                     | 100%     | 100%      |
+| 3           | 30           | <1                   | 1                        | 93            | 125.7                      | 100%     | 100%      |
+| 3           | 150          | 1                    | 1                        | 453           | 126.71                     | 100%     | 100%      |
+
+
+### Event Batch Processing
+
+| Test number | NumResources | Event Batch Total | Event Batch Processing avg time (ms) | <= 500ms | <= 1000ms | <= 5000ms | <= 10000ms | <= 30000ms |
+|-------------|--------------|-------------------|--------------------------------------|----------|-----------|-----------|------------|------------|
+| 1           | 30           | 70                | 5.12                                 | 100%     | 100%      | 100%      | 100%       | 100%       |
+| 1           | 150          | 309               | 2.14                                 | 100%     | 100%      | 100%      | 100%       | 100%       |
+| 2           | 30           | 442               | 35.4                                 | 100%     | 100%      | 100%      | 100%       | 100%       |
+| 2           | 150          | 2009              | 54.76                                | 100%     | 100%      | 100%      | 100%       | 100%       |
+| 3           | 30           | 373               | 35.72                                | 99.73%   | 99.73%    | 100%      | 100%       | 100%       |
+| 3           | 150          | 1813              | 39.46                                | 99.94%   | 99.94%    | 99.94%    | 99.94%     | 100%       |
+
+> Note: The outlier for test #3 is the event batch that contains the Gateway. It took ~13s to process.
+
+## NumResources to Total Resources
+
+| NumResources | Gateways | Secrets | ReferenceGrants | Namespaces | application Pods | application Services | HTTPRoutes | Attached HTTPRoutes | Total Resources |
+|--------------|----------|---------|-----------------|------------|------------------|----------------------|------------|---------------------|-----------------|
+| x            | 1        | 1       | 1               | x+1        | 2x               | 2x                   | 3x         | 2x                  |                 |
+| 30           | 1        | 1       | 1               | 31         | 60               | 60                   | 90         | 60                  | 244             |
+| 150          | 1        | 1       | 1               | 151        | 300              | 300                  | 450        | 300                 | 1204            |
+
+> Note: Only 2x HTTPRoutes attach to the Gateway because the parentRef name in the `cafe-tls-redirect` HTTPRoute is incorrect. This will be fixed in the next release.
+
+## Observations
+
+1. The following issues still exist:
+
+   - https://github.com/nginxinc/nginx-gateway-fabric/issues/1124
+   - https://github.com/nginxinc/nginx-gateway-fabric/issues/1123
+
+2. All NGINX reloads were in the <= 500ms bucket. An increase in the reload time based on number of configured resources resulting in NGINX configuration changes was observed.
+
+3. No errors (NGF or NGINX) were observed in any test run.
+
+4. The majority of the event batches were processed in 500ms or less except the 3rd test. In the 3rd test, we create the Gateway resource after all the apps and routes. The batch that contains the Gateway is the only one that takes longer than 500ms. It takes ~13s.
+
+## Future Improvements
+
+1. Fix the parentRef name in the `cafe-tls-redirect` [HTTPRoute](/tests/reconfig/scripts/cafe-routes.yaml), so it matches the deployed Gateway.
diff --git a/tests/reconfig/scripts/delete-multiple.sh b/tests/reconfig/scripts/delete-multiple.sh
old mode 100644
new mode 100755
index 0e46bc2759..19734932d1
--- a/tests/reconfig/scripts/delete-multiple.sh
+++ b/tests/reconfig/scripts/delete-multiple.sh
@@ -3,11 +3,13 @@ num_namespaces=$1


 # Delete namespaces
+namespaces=""
 for ((i=1; i<=$num_namespaces; i++)); do
-    namespace_name="namespace$i"
-    kubectl delete namespace "$namespace_name"
+    namespaces+="namespace$i "
 done

+kubectl delete namespace $namespaces
+
 # Delete single instance resources
 kubectl delete -f gateway.yaml
 kubectl delete -f reference-grant.yaml
diff --git a/tests/reconfig/setup.md b/tests/reconfig/setup.md
index 6462fe4629..1883786c5f 100644
--- a/tests/reconfig/setup.md
+++ b/tests/reconfig/setup.md
@@ -26,7 +26,7 @@

 The following cluster will be sufficient:

-- A Kubernetes cluster with 3 nodes on GKE
+- A Kubernetes cluster with 4 nodes on GKE
   - Node: e2-medium (2 vCPU, 4GB memory)

 ## Setup
@@ -43,7 +43,7 @@

    ```console
    helm install my-release oci://ghcr.io/nginxinc/charts/nginx-gateway-fabric --version 0.0.0-edge \
-     --create-namespace --wait -n nginx-gateway
+     --create-namespace --wait -n nginx-gateway --set nginxGateway.config.logging.level=debug
    ```

 4. Run tests:
@@ -58,13 +58,17 @@
    - Note: Clean up after each test run for isolated results. There's a script provided for removing all the test
      fixtures `scripts/delete-multiple.sh` which takes a number (needs to be the same number as what was used in the
      create script.)
-5. After each individual test run, grab logs of both NGF containers and grab metrics.
-   Note: You can expose metrics by running the below snippet and then navigating to `127.0.0.1:9113/metrics`:
-
-   ```console
-   GW_POD=$(k get pods -n nginx-gateway | sed -n '2s/^\([^[:space:]]*\).*$/\1/p')
-   kubectl port-forward $GW_POD -n nginx-gateway 9113:9113 &
-   ```
+5. After each individual test:
+   - Describe the Gateway resource and make sure the status is correct.
+   - Check the logs of both NGF containers for errors.
+   - Parse the logs for TimeToReady numbers (see steps 6-7 below).
+   - Grab metrics.
+     Note: You can expose metrics by running the below snippet and then navigating to `127.0.0.1:9113/metrics`:
+
+     ```console
+     GW_POD=$(k get pods -n nginx-gateway | sed -n '2s/^\([^[:space:]]*\).*$/\1/p')
+     kubectl port-forward $GW_POD -n nginx-gateway 9113:9113 &
+     ```

 6. Measure NGINX Reloads and Time to Ready Results
    1. TimeToReadyTotal as described in each test - NGF logs.
@@ -75,11 +79,11 @@
       1. The average reload duration can be computed by taking the `nginx_gateway_fabric_nginx_reloads_milliseconds_sum`
          metric value and dividing it by the `nginx_gateway_fabric_nginx_reloads_milliseconds_count` metric value.
 7. Measure Event Batch Processing Results
-   1. Event Batch Total - metrics.
+   1. Event Batch Total - `nginx_gateway_fabric_event_batch_processing_milliseconds_count` metric.
    2. Average Event Batch Processing duration - metrics.
-      1. The average event batch processing duraiton can be computed by taking the `nginx_gateway_fabric_event_batch_processing_milliseconds_sum`
+      1. The average event batch processing duration can be computed by taking the `nginx_gateway_fabric_event_batch_processing_milliseconds_sum`
          metric value and dividing it by the `nginx_gateway_fabric_event_batch_processing_milliseconds_count` metric value.
-8. For accuracy, repeat the test suite once or twice, take the averages, and look for any anomolies or outliers.
+8. For accuracy, repeat the test suite once or twice, take the averages, and look for any anomalies or outliers.

 ## Tests

@@ -90,7 +94,7 @@
    e.g. `cd scripts && bash create-resources-gw-last.sh 30`. The script will deploy backend apps and services, wait
    60 seconds for them to be ready, and deploy 1 Gateway, 1 RefGrant, 1 Secret, and HTTPRoutes.
 2. Deploy NGF
-3. Measure TimeToReadyTotal as the time it takes from start-up -> config written and
+3. Measure TimeToReadyTotal as the time it takes from start-up -> final config written and
    NGINX reloaded. Measure the other results as described in steps 6-7 of the [Setup](#setup) section.

 ### Test 2: Start NGF, deploy Gateway, create many resources attached to GW