Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Support OTel tracing using ngx_otel_module #1238

Closed
wants to merge 5 commits into from

Conversation

ciarams87
Copy link
Member

@ciarams87 ciarams87 commented Nov 9, 2023

Proposed changes

Problem:
As a user of NGF
I want NGF to have the ability to generate and send traces
So that I can easily get tracing for all of my traffic to help debug my environment.

Solution: Use the NGINX Otel module to configure tracing from NGINX. Use the NginxProxy CRD to allow the user to configure tracing.

Testing: Unit testing, and e-2-e testing locally using Jaegar on Kind

Please focus on: This is based on a feature branch, to allow for someone else to also start work on #483. I could merge into the feature branch instead if required, but for now I am targeting merging to main as so far no work has happened yet on that issue.

Closes #1166

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

@ciarams87 ciarams87 requested a review from a team as a code owner November 9, 2023 18:04
@github-actions github-actions bot added documentation Improvements or additions to documentation enhancement New feature or request helm-chart Relates to helm chart labels Nov 9, 2023
build/Dockerfile.nginx Outdated Show resolved Hide resolved
@ciarams87 ciarams87 force-pushed the feat/support-otel-tracing branch 2 times, most recently from 6fcef3d to 9c11e1e Compare November 9, 2023 18:51
// +optional
// +kubebuilder:default="5s"
// +kubebuilder:validation:Pattern=`^(\d+y)??\s*(\d+M)??\s*(\d+w)??\s*(\d+d)??\s*(\d+h)??\s*(\d+m)??\s*(\d+s?)??\s*(\d+ms)??$`
Interval string `json:"interval,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional fields should be pointers

//
// +optional
// +kubebuilder:default=512
BatchSize int `json:"batchSize,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe k8s recommends integer types be int32, so they aren't ambiguous in their size.

build/Dockerfile.nginx Outdated Show resolved Hide resolved
tracing := ngfproxy.Spec.HTTP.Telemetry.Tracing
proxy.Tracing = &Tracing{
Endpoint: tracing.Endpoint,
ServiceName: gwcName + ":ngf",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we make this the Gateway name instead of the class name? Thinking ahead to having multiple Gateways per Class, may want to delineate between them better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps Gateway Namespace and Name, not just name?

@@ -58,6 +58,9 @@ nginx:
tag: edge
pullPolicy: Always

## The configuration for the data plane that is contained in the NginxProxy resource.
config: {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it easier on users to understand the structure, maybe we add the telemetry config in here but comment it out. Similar to how we did it in the old NSM chart

apis/v1alpha1/nginxproxy_types.go Outdated Show resolved Hide resolved
apis/v1alpha1/nginxproxy_types.go Outdated Show resolved Hide resolved
apis/v1alpha1/nginxproxy_types.go Show resolved Hide resolved
docs/data-plane-configuration.md Outdated Show resolved Hide resolved
docs/data-plane-configuration.md Outdated Show resolved Hide resolved
internal/mode/static/state/graph/gatewayclass.go Outdated Show resolved Hide resolved
internal/mode/static/nginx/config/tracing_template.go Outdated Show resolved Hide resolved
internal/mode/static/nginx/config/tracing_test.go Outdated Show resolved Hide resolved
internal/mode/static/nginx/config/tracing_test.go Outdated Show resolved Hide resolved
@@ -0,0 +1,12 @@
package config

var otelTemplateText = `
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use a const?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use var for the other <x>TemplateText, does it make sense to use const just here or everywhere else too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, const makes sense. not sure why we needed a var

//
// +optional
// +kubebuilder:default=false
Enabled bool `json:"enabled,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about it, this should probably be "enable" present tense to match our other enable/disable fields.

| --------- | ---------------------------------------------- | --------------------------- | -------- |
| telemetry | Telemetry defines the telemetry configuration. | [telemetry](#spectelemetry) | no |

### Spec.Telemetry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be Spec.HTTP.Telemetry?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, would it need to be Spec.HTTP.Telemetry.Tracing? But I also see there are parts where we just reference proxy.Tracing without needing to go through the objects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, would it need to be Spec.HTTP.Telemetry.Tracing

Yes, I've updated it 👍🏼

But I also see there are parts where we just reference proxy.Tracing without needing to go through the objects.

I think what you're referring to is when we're referencing the graph Tracing object?

Copy link
Contributor

@bjee19 bjee19 Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, it's in the configuration.go files I'm seeing the graph's NginxProxy directly reference proxy.Tracing, why is that able to do that?

@@ -0,0 +1,190 @@
# Example

In this example we deploy NGINX Gateway Fabric, a simple web application, configure NGINX Gateway Fabric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm finding this a bit hard to follow. I wonder if we need to describe everything we are going to do in the example or if it's enough to describe the tracing piece of it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I struggled a bit with this example 😅 I simplified the description, but maybe I should change it completely remove the deploy app/ gateway/ HTTPRoutes? Just describe how to configure the tracing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is documentation enough? Is there anything specific that would be in the example that wouldn't be in our docs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point. Happy to remove it if we don't feel it adds value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally like this example because it's an easy step-by-step guide a user can follow. Also, it makes it easy for us to test tracing.

@ciarams87 I think the new intro is much better 👍

1. Get the local DNS name of the all in one Jaegar collector service. It will follow the standard
[Kubernetes DNS naming conventions](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#a-aaaa-records)
e.g. `simplest-collector.default.svc.cluster.local`.
Also get the number of either the `grpc-otlp` port or the `http-otlp` port of the collector service, e.g.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comma after "Also"

```


## 2. Deploy the Cafe Application
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be step 3

@@ -76,7 +75,7 @@ func buildGatewayClass(gc *v1beta1.GatewayClass, npCfg *ngfAPI.NginxProxy) *Gate
}
}

func validateGatewayClass(gc *v1beta1.GatewayClass, npCfg *ngfAPI.NginxProxy) error {
func validateGatewayClass(gc *v1beta1.GatewayClass, npCfg *NginxProxy) error {
if gc.Spec.ParametersRef != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A related question about validation: If tracing is disabled, but a tracing endpoint is provided, what should we do?

@@ -7,18 +7,59 @@ import (
ngfAPI "github.com/nginxinc/nginx-gateway-fabric/apis/v1alpha1"
)

type NginxProxy struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably missing something, but why do we need these NginxProxy and Tracing struct definitions? Can't we use the ones from ngfAPI here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to simplify where we had to do the nested nil-checking, but also now with the validation code, I think it makes sense to have this is a separate type. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two pros for removing it:

  • for consistency with other types - we don't do the same thing for other types
  • although we will need to do nil checking, we're not gonna do it twice - we can convert NginxProxy into dataplane.Tracing when building dataplane.Configuration

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is around validation and building the status. I'd need to do something different here than what we do for the other resources, because I'm doing the validation here and need to propagate the conditions to the status updater.

@@ -4,11 +4,15 @@ FROM nginx:1.25.3-alpine
ARG NJS_DIR
ARG NGINX_CONF_DIR

RUN apk update && apk upgrade && apk add --no-cache libcap \
RUN apk update && apk upgrade && apk add --no-cache libcap curl ca-certificates \
&& printf "%s%s%s\n" "http://nginx.org/packages/mainline/alpine/v" `egrep -o '^[0-9]+\.[0-9]+' /etc/alpine-release` "/main" | tee -a /etc/apk/repositories \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
&& printf "%s%s%s\n" "http://nginx.org/packages/mainline/alpine/v" `egrep -o '^[0-9]+\.[0-9]+' /etc/alpine-release` "/main" | tee -a /etc/apk/repositories \
&& printf "%s\n" "http://nginx.org/packages/mainline/alpine/v$(egrep -o '^[0-9]+\.[0-9]+' /etc/alpine-release)/main" >> /etc/apk/repositories \

This seems cleaner to me 😄 (but anyway we don't need tee)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took this from the official NGINX docs, but yes, yours looks a lot cleaner 😂

@sjberman
Copy link
Contributor

Be sure to update the proposal for the data plane config to "Completed".

description: NginxProxySpec defines the desired state of the NginxProxy.
properties:
http:
description: Http defines the Http configuration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make the change above that @pleshakov suggested about Http -> HTTP, should change here too.

description: NginxProxySpec defines the desired state of the NginxProxy.
properties:
http:
description: Http defines the Http configuration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment from other CRD yaml about Http -> HTTP

| --------- | ---------------------------------------------- | --------------------------- | -------- |
| telemetry | Telemetry defines the telemetry configuration. | [telemetry](#spectelemetry) | no |

### Spec.Telemetry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, would it need to be Spec.HTTP.Telemetry.Tracing? But I also see there are parts where we just reference proxy.Tracing without needing to go through the objects.

HTTP *HTTP `json:"http,omitempty"`
}

// HTTP defines the HTTP configuration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be more explicit to say that this is for the nginx http block configuration?

//
// +optional
// +kubebuilder:default="5s"
// +kubebuilder:validation:Pattern=`^(\d+y)??\s*(\d+M)??\s*(\d+w)??\s*(\d+d)??\s*(\d+h)??\s*(\d+m)??\s*(\d+s?)??\s*(\d+ms)??$`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the same validation pattern that the otel module uses?

// BatchSize specifies the maximum number of spans to be sent in one batch per worker. Default is 512.
//
// +optional
// +kubebuilder:default=512
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's set validation to not allow a negative value. Not sure if we also should set a max?

// BatchCount specifies the number of pending batches per worker, spans exceeding the limit are dropped. Default is 4.
//
// +optional
// +kubebuilder:default=4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as previous.

//
// +required
// ++kubebuilder:validation:Required
// ++kubebuilder:validation:Pattern=`^(?:(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}|(?:\d{1,3}\.){3}\d{1,3}):\d{1,5}$`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as before about the pattern.

Copy link
Contributor

@pleshakov pleshakov Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are IPv6 addresses allowed?


The control plane only watches this single instance of the custom resource. If the resource is invalid per the OpenAPI
schema, the Kubernetes API server will reject the changes. If the resource is deleted or deemed invalid by NGINX
Gateway Fabric, a warning Event is created in the `nginx-gateway` Namespace, and the default values will be used by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually create an event for this?

of the resource is `<release-name>-proxy-config`. It is deployed in the same Namespace as the controller
(default `nginx-gateway`).

The control plane only watches this single instance of the custom resource. If the resource is invalid per the OpenAPI
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than "only watches this single instance of the custom resource", I'd mention that we only watch the one that's referenced in the associated GatewayClass for this control plane.


| name | description | type | required |
| ---------- | ------------------------------------------------------------------------------------------------------------------- | ------ | -------- |
| enable | Enabled enables or disables OpenTelemetry tracing at the HTTP context. Default is false. | bool | no |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| enable | Enabled enables or disables OpenTelemetry tracing at the HTTP context. Default is false. | bool | no |
| enable | Enable enables or disables OpenTelemetry tracing at the HTTP context. Default is false. | bool | no |


```shell
kubectl -n nginx-gateway describe nginxproxies nginx-proxy-config
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my other comment regarding whether or not we need an example, I would update this doc maybe with an example Jaeger deployment and telemetry config that would send traces there.

//
// +required
// ++kubebuilder:validation:Required
// ++kubebuilder:validation:Pattern=`^(?:(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}|(?:\d{1,3}\.){3}\d{1,3}):\d{1,5}$`
Copy link
Contributor

@pleshakov pleshakov Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are IPv6 addresses allowed?

@@ -1,4 +1,5 @@
load_module /usr/lib/nginx/modules/ngx_http_js_module.so;
load_module modules/ngx_otel_module.so;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We load this file into the base image, do you think we need to create a new includes at this level for global configuration options like this?

yeah. that could work

// +kubebuilder:storageversion
// +kubebuilder:subresource:status

// NginxProxy represents the dynamic configuration for an NGINX Gateway Fabric data plane.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// NginxProxy represents the dynamic configuration for an NGINX Gateway Fabric data plane.
// NginxProxy represents the configuration for an NGINX Gateway Fabric data plane.

@@ -0,0 +1,12 @@
package config

var otelTemplateText = `
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, const makes sense. not sure why we needed a var

if ngfTracing.Interval != nil && !reTime.MatchString(*ngfTracing.Interval) {
valErr = append(valErr, field.Invalid(
path.Child("interval"), *ngfTracing.Interval, "tracing.interval must be a valid time string"))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we also validate that BatchCount and BatchSize are not negative?

@sjberman
Copy link
Contributor

There are probably some v1beta1 references that were added that will need to be updated to v1 once #1250 is merged and this branch is rebased.

sjberman and others added 4 commits November 17, 2023 11:00
Problem: We have two upcoming stories that require the new NginxProxy CRD in order to be worked on.

Solution: To prevent one story from being blocked by the other, this work adds the new CRD with a blank spec, a controller that watches for it and triggers an update, and sets the status on the resource.
@ciarams87 ciarams87 force-pushed the feat/support-otel-tracing branch from 800bad7 to b806b53 Compare November 17, 2023 11:07
@ciarams87
Copy link
Member Author

Closed as we are taking a different direction - see #1258

@ciarams87 ciarams87 closed this Nov 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request helm-chart Relates to helm chart
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Support for tracing using OpenTelemetry
6 participants