Add event batch processing results and rerun reconfig test (#1186)
Add event batch processing metrics to reconfiguration test and rerun the test. Added some clarification and other small adjustments to test.

Problem: We already measured how long NGINX took to reload, but not how long NGF took to update the statuses of a resource.

Solution: Add event batch processing metrics to test.
bjee19 authored Oct 24, 2023
1 parent 9fa18fa commit 95e6ba6
Showing 3 changed files with 102 additions and 75 deletions.
78 changes: 78 additions & 0 deletions tests/reconfig/results/1.0.0/1.0.0.md
@@ -0,0 +1,78 @@
# Reconfiguration testing Results

<!-- TOC -->
- [Reconfiguration testing Results](#reconfiguration-testing-results)
- [Test environment](#test-environment)
- [Results Tables](#results-tables)
- [NGINX Reloads and Time to Ready](#nginx-reloads-and-time-to-ready)
- [Event Batch Processing](#event-batch-processing)
- [NumResources -> Total Resources](#numresources---total-resources)
- [Observations](#observations)
<!-- TOC -->

## Test environment

GKE cluster:

- Node count: 3
- Instance Type: e2-medium
- k8s version: 1.27.3-gke.100
- Zone: us-central1-c
- Total vCPUs: 6
- Total RAM: 12GB
- Max pods per node: 110

NGF deployment:

- NGF version: edge - git commit 29b45e38bacd7c4f22834938105e3cda4f29f6d1
- NGINX Version: 1.25.2

## Results Tables

### NGINX Reloads and Time to Ready

| Test number | NumResources | TimeToReadyTotal (s) | TimeToReadyAvgSingle (s) | NGINX reloads | NGINX reload avg time (ms) | <= 500ms | <= 1000ms |
|-------------|--------------|----------------------|--------------------------|---------------|----------------------------|----------|-----------|
| 1 | 30 | 1 | 1 | 2 | 191 | 100% | 100% |
| 1 | 150 | 2 | 2 | 2 | 440 | 50% | 100% |
| 2 | 30 | 50 | <1 | 93 | 162 | 100% | 100% |
| 2 | 150 | 208 | <1 | 396 | 281 | 96.46% | 100% |
| 3 | 30 | 1 | 1 | 93 | 129 | 100% | 100% |
| 3 | 150 | 1 | 1 | 453 | 130 | 100% | 100% |


### Event Batch Processing

| Test number | NumResources | Event Batch Total | Event Batch Processing avg time (ms) | <= 500ms | <= 1000ms |
|-------------|--------------|-------------------|--------------------------------------|----------|-----------|
| 1 | 30 | 69 | 6.232 | 100% | 100% |
| 1 | 150 | 309 | 3.638 | 99.68% | 100% |
| 2 | 30 | 465 | 38.759 | 100% | 100% |
| 2 | 150 | 1941 | 68.539 | 98.51% | 100% |
| 3 | 30 | 374 | 36.834 | 99.73% | 99.73% |
| 3 | 150 | 1812 | 40.411 | 99.94% | 99.94% |


## NumResources -> Total Resources
| NumResources | Gateways | Secrets | ReferenceGrants | Namespaces | application Pods | application Services | HTTPRoutes | Total Resources |
| ------------ | -------- | ------- | --------------- | ---------- | ---------------- | -------------------- | ---------- | --------------- |
| x | 1 | 1 | 1 | x+1 | 2x | 2x | 3x | 8x+4 |
| 30 | 1 | 1 | 1 | 31 | 60 | 60 | 90 | 244 |
| 150 | 1 | 1 | 1 | 151 | 300 | 300 | 450 | 1204 |
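
The per-resource formulas in the `x` row can be checked with a short calculation. The sketch below is only an illustration: the function names `resource_counts` and `total_resources` are hypothetical and are not part of the test scripts. It sums the columns of the table for a given `NumResources` value.

```python
def resource_counts(x: int) -> dict[str, int]:
    """Per-kind resource counts for a run with NumResources = x (from the table's x row)."""
    return {
        "Gateways": 1,
        "Secrets": 1,
        "ReferenceGrants": 1,
        "Namespaces": x + 1,
        "application Pods": 2 * x,
        "application Services": 2 * x,
        "HTTPRoutes": 3 * x,
    }


def total_resources(x: int) -> int:
    """Sum of all columns; algebraically this is 8x + 4."""
    return sum(resource_counts(x).values())


for x in (30, 150):
    print(x, total_resources(x))
```

For the two run sizes used here this reproduces the Total Resources column (244 and 1204).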

## Observations

1. We are reloading after reconciling a ReferenceGrant even when there is no Gateway. This is because we treat every
upsert/delete of a ReferenceGrant as a change. This means we will regenerate NGINX config every time a ReferenceGrant
is created, updated (generation must change), or deleted, even if it does not apply to the accepted Gateway.

Issue filed: https://github.com/nginxinc/nginx-gateway-fabric/issues/1124

2. We are reloading after reconciling an HTTPRoute even when there is no accepted Gateway and no config being generated.

Issue filed: https://github.com/nginxinc/nginx-gateway-fabric/issues/1123

3. The majority of NGINX reloads were in the <= 500ms bucket, and all of them were in the <= 1000ms bucket. Reload
time increased with the number of configured resources that resulted in NGINX configuration changes.

4. No errors (NGF or NGINX) were observed in any test run.
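
The `<= 500ms` and `<= 1000ms` columns are shares of observations that fall at or below a bucket boundary. A minimal sketch of that calculation follows, assuming the counts come from cumulative histogram buckets (the usual Prometheus convention). The function name `share_within` and the example counts are hypothetical; the counts are inferred from the Test 1 / 150-resource row (2 reloads, 50% within 500 ms), not read from recorded metrics.

```python
import math


def share_within(cumulative_buckets: dict[float, int], upper_bound_ms: float) -> float:
    """Percentage of observations at or below upper_bound_ms.

    cumulative_buckets maps each bucket upper bound (ms) to the cumulative
    count of observations <= that bound; math.inf holds the total count.
    """
    total = cumulative_buckets[math.inf]
    if total == 0:
        raise ValueError("no observations recorded")
    return 100.0 * cumulative_buckets[upper_bound_ms] / total


# Assumed example: 2 reloads, 1 of them faster than 500 ms.
reload_buckets = {500: 1, 1000: 2, math.inf: 2}
print(share_within(reload_buckets, 500))   # share in the <= 500ms column
print(share_within(reload_buckets, 1000))  # share in the <= 1000ms column
```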
61 changes: 0 additions & 61 deletions tests/reconfig/results/v1.0.0.md

This file was deleted.

38 changes: 24 additions & 14 deletions tests/reconfig/setup.md
@@ -13,8 +13,8 @@

## Goals

- Measure how long it takes NGF to reconfigure NGINX when a number of Gateway API and referenced core Kubernetes
resources are created at once.
- Measure how long it takes NGF to reconfigure NGINX and update statuses when a number of Gateway API and
referenced core Kubernetes resources are created at once.
- Two runs of each test should be run with differing numbers of resources. Each run will deploy:
- a single Gateway, Secret, and ReferenceGrant resources
- `x+1` number of namespaces
@@ -38,7 +38,8 @@
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v0.8.1/standard-install.yaml
```

3. Deploy NGF from edge using Helm install (NOTE: For Test 1, deploy AFTER resources):
3. Deploy NGF from edge using Helm install and wait for LoadBalancer Service to be ready
(NOTE: For Test 1, deploy AFTER resources):

```console
helm install my-release oci://ghcr.io/nginxinc/charts/nginx-gateway-fabric --version 0.0.0-edge \
@@ -65,10 +66,20 @@
kubectl port-forward $GW_POD -n nginx-gateway 9113:9113 &
```

6. Measure Time To Ready as described in each test, get the reload count, and get the average NGINX reload duration.
The average reload duration can be computed by taking the `nginx_gateway_fabric_nginx_reloads_milliseconds_sum`
metric value and dividing it by the `nginx_gateway_fabric_nginx_reloads_milliseconds_count` metric value.
7. For accuracy, repeat the test suite once or twice, take the averages, and look for any anomalies or outliers.
6. Measure NGINX Reloads and Time to Ready Results
   1. TimeToReadyTotal as described in each test - NGF logs.
   2. TimeToReadyAvgSingle, which is the average time between updating any resource and the
      NGINX configuration being reloaded - NGF logs.
   3. NGINX Reload count - metrics.
   4. Average NGINX reload duration - metrics.
      1. The average reload duration can be computed by taking the `nginx_gateway_fabric_nginx_reloads_milliseconds_sum`
         metric value and dividing it by the `nginx_gateway_fabric_nginx_reloads_milliseconds_count` metric value.
7. Measure Event Batch Processing Results
   1. Event Batch Total - metrics.
   2. Average Event Batch Processing duration - metrics.
      1. The average event batch processing duration can be computed by taking the `nginx_gateway_fabric_event_batch_processing_milliseconds_sum`
         metric value and dividing it by the `nginx_gateway_fabric_event_batch_processing_milliseconds_count` metric value.
8. For accuracy, repeat the test suite once or twice, take the averages, and look for any anomalies or outliers.
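
The two averages above are both `_sum / _count` divisions, which can be sketched as follows. This is a hedged illustration, not part of the test tooling: the helper names `parse_samples` and `histogram_average` are hypothetical, the parser assumes Prometheus text-format lines without trailing timestamps, and the `SAMPLE` values are made up to be consistent with the Test 1 / 30-resource results rather than copied from a real scrape. In a real run the text would come from the metrics endpoint exposed through the port-forward on 9113.

```python
from collections import defaultdict

# Illustrative sample only (assumed values, not a recorded scrape).
SAMPLE = """\
# TYPE nginx_gateway_fabric_nginx_reloads_milliseconds histogram
nginx_gateway_fabric_nginx_reloads_milliseconds_sum 382
nginx_gateway_fabric_nginx_reloads_milliseconds_count 2
# TYPE nginx_gateway_fabric_event_batch_processing_milliseconds histogram
nginx_gateway_fabric_event_batch_processing_milliseconds_sum 430
nginx_gateway_fabric_event_batch_processing_milliseconds_count 69
"""


def parse_samples(text: str) -> dict[str, float]:
    """Sum sample values per metric name, ignoring comments and label sets."""
    totals: dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token (no timestamps assumed).
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop any {label="..."} part
        totals[name] += float(value)
    return dict(totals)


def histogram_average(samples: dict[str, float], base_name: str) -> float:
    """Average observation in ms: <base_name>_sum divided by <base_name>_count."""
    count = samples[f"{base_name}_count"]
    if count == 0:
        raise ValueError(f"{base_name} has no observations")
    return samples[f"{base_name}_sum"] / count


samples = parse_samples(SAMPLE)
for base in (
    "nginx_gateway_fabric_nginx_reloads_milliseconds",
    "nginx_gateway_fabric_event_batch_processing_milliseconds",
):
    print(base, round(histogram_average(samples, base), 3))
```

The `_count` values double as the "NGINX reloads" and "Event Batch Total" figures, so one scrape per run is enough to fill in both metrics-based columns.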

## Tests

@@ -79,8 +90,8 @@
e.g. `cd scripts && bash create-resources-gw-last.sh 30`. The script will deploy backend apps and services, wait
60 seconds for them to be ready, and deploy 1 Gateway, 1 RefGrant, 1 Secret, and HTTPRoutes.
2. Deploy NGF
3. Check logs for time it takes from start-up -> config written and NGINX reloaded. Get reload count and average reload
duration from metrics and logs.
3. Measure TimeToReadyTotal as the time it takes from start-up -> config written and
NGINX reloaded. Measure the other results as described in steps 6-7 of the [Setup](#setup) section.

### Test 2: Start NGF, deploy Gateway, create many resources attached to GW

@@ -89,9 +100,8 @@
2. Run the provided script with the required number of resources,
e.g. `cd scripts && bash create-resources-routes-last.sh 30`. The script will deploy backend apps and services,
wait 60 seconds for them to be ready, and deploy 1 Gateway, 1 Secret, 1 RefGrant, and HTTPRoutes at the same time.
3. Check logs for time it takes from NGF receiving first resource update -> final config written, and NGINX's final
reload. Check logs for average individual HTTPRoute TTR also. Get reload count and average reload duration from
metrics and logs.
3. Measure TimeToReadyTotal as the time it takes from NGF receiving the first HTTPRoute resource update -> final
config written and NGINX reloaded. Measure the other results as described in steps 6-7 of the [Setup](#setup) section.

### Test 3: Start NGF, create many resources attached to a Gateway, deploy the Gateway

@@ -101,5 +111,5 @@
e.g. `cd scripts && bash create-resources-gw-last.sh 30`.
The script will deploy the namespaces, backend apps and services, 1 Secret, 1 ReferenceGrant, and the HTTPRoutes;
wait 60 seconds for the backend apps to be ready, and then deploy 1 Gateway for all HTTPRoutes.
3. Check logs for time it takes from NGF receiving gateway resource -> config written and NGINX reloaded. Get reload
count and average reload duration from metrics and logs.
3. Measure TimeToReadyTotal as the time it takes from NGF receiving gateway resource -> config written and NGINX reloaded.
Measure the other results as described in steps 6-7 of the [Setup](#setup) section.
