Add results for reconfig test (#1354)
Problem: We need to run the reconfig test against the 1.1 release.

Solution: Record the results of the reconfig test for the 1.1 release.
kate-osborn authored Dec 8, 2023
1 parent 56016e9 commit b77d74b
Showing 4 changed files with 114 additions and 16 deletions.
2 changes: 1 addition & 1 deletion tests/reconfig/results/1.0.0/1.0.0.md
@@ -56,7 +56,7 @@ NGF deployment:
## NumResources -> Total Resources

| NumResources | Gateways | Secrets | ReferenceGrants | Namespaces | application Pods | application Services | HTTPRoutes | Total Resources |
| ------------ | -------- | ------- | --------------- | ---------- | ---------------- | -------------------- | ---------- | --------------- |
|--------------|----------|---------|-----------------|------------|------------------|----------------------|------------|-----------------|
| x | 1 | 1 | 1 | x+1 | 2x | 2x | 3x | <total> |
| 30 | 1 | 1 | 1 | 31 | 60 | 60 | 90 | 244 |
| 150 | 1 | 1 | 1 | 151 | 300 | 300 | 450 | 1204 |
92 changes: 92 additions & 0 deletions tests/reconfig/results/1.1.0/1.1.0.md
@@ -0,0 +1,92 @@
# Reconfiguration testing Results

<!-- TOC -->
- [Reconfiguration testing Results](#reconfiguration-testing-results)
- [Summary](#summary)
- [Test environment](#test-environment)
- [Results Tables](#results-tables)
- [NGINX Reloads and Time to Ready](#nginx-reloads-and-time-to-ready)
- [Event Batch Processing](#event-batch-processing)
- [NumResources to Total Resources](#numresources-to-total-resources)
- [Observations](#observations)
- [Future Improvements](#future-improvements)
<!-- TOC -->

## Summary

- Better reload times across all tests
- Similar TimeToReadyTotal and TimeToReadyAvgSingle times
- Similar event batch totals
- Slightly better event batch processing average times
- No new errors or issues

## Test environment

GKE cluster:

- Node count: 4
- Instance Type: n2d-standard-2
- k8s version: 1.27.3-gke.100
- Zone: us-west2-a
- Total vCPUs: 8
- Total RAM: 32GB
- Max pods per node: 110

NGF deployment:

- NGF version: edge - git commit 3cab370a46bccd55c115c16e23a475df2497a3d2
- NGINX Version: 1.25.3

## Results Tables

### NGINX Reloads and Time to Ready

| Test number | NumResources | TimeToReadyTotal (s) | TimeToReadyAvgSingle (s) | NGINX reloads | NGINX reload avg time (ms) | <= 500ms | <= 1000ms |
|-------------|--------------|----------------------|--------------------------|---------------|----------------------------|----------|-----------|
| 1 | 30 | 1.5 | <1 | 2 | 158.5 | 100% | 100% |
| 1 | 150 | 3.5 | 1 | 2 | 272.5 | 100% | 100% |
| 2 | 30 | 34 | <1 | 93 | 136 | 100% | 100% |
| 2 | 150 | 176.5 | <1 | 451 | 203.98 | 100% | 100% |
| 3 | 30 | <1 | 1 | 93 | 125.7 | 100% | 100% |
| 3 | 150 | 1 | 1 | 453 | 126.71 | 100% | 100% |


### Event Batch Processing

| Test number | NumResources | Event Batch Total | Event Batch Processing avg time (ms) | <= 500ms | <= 1000ms | <= 5000ms | <= 10000ms | <= 30000ms |
|-------------|--------------|-------------------|--------------------------------------|----------|-----------|-----------|------------|------------|
| 1 | 30 | 70 | 5.12 | 100% | 100% | 100% | 100% | 100% |
| 1 | 150 | 309 | 2.14 | 100% | 100% | 100% | 100% | 100% |
| 2 | 30 | 442 | 35.4 | 100% | 100% | 100% | 100% | 100% |
| 2 | 150 | 2009 | 54.76 | 100% | 100% | 100% | 100% | 100% |
| 3 | 30 | 373 | 35.72 | 99.73% | 99.73% | 100% | 100% | 100% |
| 3 | 150 | 1813 | 39.46 | 99.94% | 99.94% | 99.94% | 99.94% | 100% |

> Note: The outlier for test #3 is the event batch that contains the Gateway. In the 150-resource run it took ~13s to process; in the 30-resource run it fell in the <= 5000ms bucket.

## NumResources to Total Resources

| NumResources | Gateways | Secrets | ReferenceGrants | Namespaces | application Pods | application Services | HTTPRoutes | Attached HTTPRoutes | Total Resources |
|--------------|----------|---------|-----------------|------------|------------------|----------------------|------------|---------------------|-----------------|
| x | 1 | 1 | 1 | x+1 | 2x | 2x | 3x | 2x | <total> |
| 30 | 1 | 1 | 1 | 31 | 60 | 60 | 90 | 60 | 244 |
| 150 | 1 | 1 | 1 | 151 | 300 | 300 | 450 | 300 | 1204 |

> Note: Only 2x HTTPRoutes attach to the Gateway because the parentRef name in the `cafe-tls-redirect` HTTPRoute is incorrect. This will be fixed in the next release.
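
The `<total>` column can be checked by summing the per-type counts in the `x` row. The snippet below is a minimal sketch of that arithmetic; `total_resources` is a hypothetical helper name and is not part of the test scripts.

```shell
# total_resources X: total resources created for a NumResources value of X.
# Sums the per-type counts from the table:
#   1 + 1 + 1 + (x+1) + 2x + 2x + 3x = 8x + 4
# Attached HTTPRoutes is a subset of HTTPRoutes, so it is not added again.
total_resources() {
  echo $(( 8 * $1 + 4 ))
}

total_resources 30    # prints 244
total_resources 150   # prints 1204
```
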
## Observations

1. The following issues still exist:

- https://github.com/nginxinc/nginx-gateway-fabric/issues/1124
- https://github.com/nginxinc/nginx-gateway-fabric/issues/1123

2. All NGINX reloads were in the <= 500ms bucket. Reload time increased with the number of configured resources that result in NGINX configuration changes.

3. No errors (NGF or NGINX) were observed in any test run.

4. All event batches were processed in 500ms or less, except for one batch in each run of the 3rd test. In the 3rd test, we create the Gateway resource after all the apps and routes. The batch that contains the Gateway is the only one that takes longer than 500ms. In the 150-resource run it takes ~13s; in the 30-resource run it falls in the <= 5000ms bucket.
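
The sub-100% bucket values for test 3 are consistent with exactly one slow batch per run. A minimal sketch of that check, using the Event Batch Total values from the table; `pct_within` is a hypothetical helper name:

```shell
# pct_within TOTAL OUTLIERS: percentage of event batches that finished
# within a bucket, given the total batch count and the number that did not.
pct_within() {
  awk -v total="$1" -v out="$2" \
    'BEGIN { printf "%.2f%%\n", 100 * (total - out) / total }'
}

pct_within 373 1     # test 3, 30 resources:  prints 99.73%
pct_within 1813 1    # test 3, 150 resources: prints 99.94%
```
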

## Future Improvements

1. Fix the parentRef name in the `cafe-tls-redirect` [HTTPRoute](/tests/reconfig/scripts/cafe-routes.yaml), so it matches the deployed Gateway.
6 changes: 4 additions & 2 deletions tests/reconfig/scripts/delete-multiple.sh
100644 → 100755
@@ -3,11 +3,13 @@
num_namespaces=$1

# Delete namespaces
namespaces=""
for ((i=1; i<=$num_namespaces; i++)); do
namespace_name="namespace$i"
kubectl delete namespace "$namespace_name"
namespaces+="namespace$i "
done

kubectl delete namespace $namespaces

# Delete single instance resources
kubectl delete -f gateway.yaml
kubectl delete -f reference-grant.yaml
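
The hunk above replaces one `kubectl delete namespace` call per namespace with a single call that takes all the names. As a sketch of an alternative (not part of this commit), the names could be collected in a bash array instead of a space-separated string, so the expansion can stay quoted. `build_namespaces` is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# build_namespaces N: print namespace1..namespaceN, one name per line.
build_namespaces() {
  local n=$1 i
  local -a names=()
  for ((i = 1; i <= n; i++)); do
    names+=("namespace$i")
  done
  # Print nothing at all when N is 0.
  if ((${#names[@]} > 0)); then
    printf '%s\n' "${names[@]}"
  fi
}

# Usage (needs a cluster, so it is left commented out here):
#   mapfile -t namespaces < <(build_namespaces "$num_namespaces")
#   kubectl delete namespace "${namespaces[@]}"
```

The unquoted `$namespaces` in the committed script works because the generated names contain no whitespace or glob characters; the array form only removes the reliance on word splitting.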
30 changes: 17 additions & 13 deletions tests/reconfig/setup.md
@@ -26,7 +26,7 @@

The following cluster will be sufficient:

- A Kubernetes cluster with 3 nodes on GKE
- A Kubernetes cluster with 4 nodes on GKE
- Node: e2-medium (2 vCPU, 4GB memory)

## Setup
@@ -43,7 +43,7 @@

```console
helm install my-release oci://ghcr.io/nginxinc/charts/nginx-gateway-fabric --version 0.0.0-edge \
--create-namespace --wait -n nginx-gateway
--create-namespace --wait -n nginx-gateway --set nginxGateway.config.logging.level=debug
```

4. Run tests:
@@ -58,13 +58,17 @@
- Note: Clean up after each test run for isolated results. The provided script `scripts/delete-multiple.sh` removes
all the test fixtures. It takes a number, which must be the same number that was passed to the
create script.
5. After each individual test run, grab logs of both NGF containers and grab metrics.
Note: You can expose metrics by running the below snippet and then navigating to `127.0.0.1:9113/metrics`:

```console
GW_POD=$(k get pods -n nginx-gateway | sed -n '2s/^\([^[:space:]]*\).*$/\1/p')
kubectl port-forward $GW_POD -n nginx-gateway 9113:9113 &
```
5. After each individual test:
- Describe the Gateway resource and make sure the status is correct.
- Check the logs of both NGF containers for errors.
- Parse the logs for TimeToReady numbers (see steps 6-7 below).
- Grab metrics.
Note: You can expose metrics by running the below snippet and then navigating to `127.0.0.1:9113/metrics`:

```console
GW_POD=$(kubectl get pods -n nginx-gateway | sed -n '2s/^\([^[:space:]]*\).*$/\1/p')
kubectl port-forward "$GW_POD" -n nginx-gateway 9113:9113 &
```

6. Measure NGINX Reloads and Time to Ready Results
1. TimeToReadyTotal as described in each test - NGF logs.
@@ -75,11 +79,11 @@
1. The average reload duration can be computed by taking the `nginx_gateway_fabric_nginx_reloads_milliseconds_sum`
metric value and dividing it by the `nginx_gateway_fabric_nginx_reloads_milliseconds_count` metric value.
7. Measure Event Batch Processing Results
1. Event Batch Total - metrics.
1. Event Batch Total - `nginx_gateway_fabric_event_batch_processing_milliseconds_count` metric.
2. Average Event Batch Processing duration - metrics.
1. The average event batch processing duraiton can be computed by taking the `nginx_gateway_fabric_event_batch_processing_milliseconds_sum`
1. The average event batch processing duration can be computed by taking the `nginx_gateway_fabric_event_batch_processing_milliseconds_sum`
metric value and dividing it by the `nginx_gateway_fabric_event_batch_processing_milliseconds_count` metric value.
8. For accuracy, repeat the test suite once or twice, take the averages, and look for any anomolies or outliers.
8. For accuracy, repeat the test suite once or twice, take the averages, and look for any anomalies or outliers.
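
The averages in steps 6-7 divide a `_sum` metric by its matching `_count` metric. The helper below is a minimal sketch of that division, assuming the endpoint serves the Prometheus text format and that `curl` and `awk` are available. `avg_metric` is a hypothetical helper name and is not part of the test scripts.

```shell
# avg_metric BASE: read Prometheus text-format metrics on stdin and print
# BASE_sum / BASE_count, adding up values across all label sets.
avg_metric() {
  awk -v base="$1" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      line = $0
      sub(/[{].*[}]/, "", line)         # drop the label set, if present
      n = split(line, f, " ")
      if (n < 2) next
      if (f[1] == (base "_sum"))   sum   += f[2]
      if (f[1] == (base "_count")) count += f[2]
    }
    END {
      if (count > 0) printf "%.2f\n", sum / count
      else           print "n/a"        # metric missing or no samples yet
    }'
}

# Usage, with the port-forward from step 5 running:
#   curl -s 127.0.0.1:9113/metrics | avg_metric nginx_gateway_fabric_nginx_reloads_milliseconds
#   curl -s 127.0.0.1:9113/metrics | avg_metric nginx_gateway_fabric_event_batch_processing_milliseconds
```
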

## Tests

@@ -90,7 +94,7 @@
e.g. `cd scripts && bash create-resources-gw-last.sh 30`. The script will deploy backend apps and services, wait
60 seconds for them to be ready, and deploy 1 Gateway, 1 RefGrant, 1 Secret, and HTTPRoutes.
2. Deploy NGF
3. Measure TimeToReadyTotal as the time it takes from start-up -> config written and
3. Measure TimeToReadyTotal as the time it takes from start-up -> final config written and
NGINX reloaded. Measure the other results as described in steps 6-7 of the [Setup](#setup) section.

### Test 2: Start NGF, deploy Gateway, create many resources attached to GW
