-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Support OTel tracing using ngx_otel_module #1238
Conversation
6fcef3d
to
9c11e1e
Compare
apis/v1alpha1/nginxproxy_types.go
Outdated
// +optional | ||
// +kubebuilder:default="5s" | ||
// +kubebuilder:validation:Pattern=`^(\d+y)??\s*(\d+M)??\s*(\d+w)??\s*(\d+d)??\s*(\d+h)??\s*(\d+m)??\s*(\d+s?)??\s*(\d+ms)??$` | ||
Interval string `json:"interval,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optional fields should be pointers
apis/v1alpha1/nginxproxy_types.go
Outdated
// | ||
// +optional | ||
// +kubebuilder:default=512 | ||
BatchSize int `json:"batchSize,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe k8s recommends integer types be int32, so they aren't ambiguous in their size.
tracing := ngfproxy.Spec.HTTP.Telemetry.Tracing | ||
proxy.Tracing = &Tracing{ | ||
Endpoint: tracing.Endpoint, | ||
ServiceName: gwcName + ":ngf", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we make this the Gateway name instead of the class name? Thinking ahead to having multiple Gateways per Class, may want to delineate between them better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps Gateway Namespace and Name, not just name?
@@ -58,6 +58,9 @@ nginx: | |||
tag: edge | |||
pullPolicy: Always | |||
|
|||
## The configuration for the data plane that is contained in the NginxProxy resource. | |||
config: {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make it easier on users to understand the structure, maybe we add the telemetry config in here but comment it out. Similar to how we did it in the old NSM chart
@@ -0,0 +1,12 @@ | |||
package config | |||
|
|||
var otelTemplateText = ` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use a const?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use var
for the other <x>TemplateText
, does it make sense to use const
just here or everywhere else too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, const makes sense. not sure why we needed a var
apis/v1alpha1/nginxproxy_types.go
Outdated
// | ||
// +optional | ||
// +kubebuilder:default=false | ||
Enabled bool `json:"enabled,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I think about it, this should probably be "enable" present tense to match our other enable/disable fields.
docs/data-plane-configuration.md
Outdated
| --------- | ---------------------------------------------- | --------------------------- | -------- | | ||
| telemetry | Telemetry defines the telemetry configuration. | [telemetry](#spectelemetry) | no | | ||
|
||
### Spec.Telemetry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be Spec.HTTP.Telemetry
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, would it need to be Spec.HTTP.Telemetry.Tracing
? But I also see there are parts where we just reference proxy.Tracing
without needing to go through the objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, would it need to be Spec.HTTP.Telemetry.Tracing
Yes, I've updated it 👍🏼
But I also see there are parts where we just reference proxy.Tracing without needing to go through the objects.
I think what you're referring to is when we're referencing the graph
Tracing object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, it's in the configuration.go
files I'm seeing the graph's NginxProxy
directly reference proxy.Tracing
, why is that able to do that?
examples/tracing/README.md
Outdated
@@ -0,0 +1,190 @@ | |||
# Example | |||
|
|||
In this example we deploy NGINX Gateway Fabric, a simple web application, configure NGINX Gateway Fabric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm finding this a bit hard to follow. I wonder if we need to describe everything we are going to do in the example or if it's enough to describe the tracing piece of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I struggled a bit with this example 😅 I simplified the description, but maybe I should change it completely remove the deploy app/ gateway/ HTTPRoutes? Just describe how to configure the tracing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is documentation enough? Is there anything specific that would be in the example that wouldn't be in our docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a good point. Happy to remove it if we don't feel it adds value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally like this example because it's an easy step-by-step guide a user can follow. Also, it makes it easy for us to test tracing.
@ciarams87 I think the new intro is much better 👍
examples/tracing/README.md
Outdated
1. Get the local DNS name of the all in one Jaegar collector service. It will follow the standard | ||
[Kubernetes DNS naming conventions](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#a-aaaa-records) | ||
e.g. `simplest-collector.default.svc.cluster.local`. | ||
Also get the number of either the `grpc-otlp` port or the `http-otlp` port of the collector service, e.g. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: comma after "Also"
examples/tracing/README.md
Outdated
``` | ||
|
||
|
||
## 2. Deploy the Cafe Application |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be step 3
@@ -76,7 +75,7 @@ func buildGatewayClass(gc *v1beta1.GatewayClass, npCfg *ngfAPI.NginxProxy) *Gate | |||
} | |||
} | |||
|
|||
func validateGatewayClass(gc *v1beta1.GatewayClass, npCfg *ngfAPI.NginxProxy) error { | |||
func validateGatewayClass(gc *v1beta1.GatewayClass, npCfg *NginxProxy) error { | |||
if gc.Spec.ParametersRef != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A related question about validation: If tracing is disabled, but a tracing endpoint is provided, what should we do?
@@ -7,18 +7,59 @@ import ( | |||
ngfAPI "github.com/nginxinc/nginx-gateway-fabric/apis/v1alpha1" | |||
) | |||
|
|||
type NginxProxy struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm probably missing something, but why do we need these NginxProxy and Tracing struct definitions? Can't we use the ones from ngfAPI here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to simplify where we had to do the nested nil-checking, but also now with the validation code, I think it makes sense to have this is a separate type. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see two pros for removing it:
- for consistency with other types - we don't do the same thing for other types
- although we will need to do nil checking, we're not gonna do it twice - we can convert NginxProxy into dataplane.Tracing when building dataplane.Configuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is around validation and building the status. I'd need to do something different here than what we do for the other resources, because I'm doing the validation here and need to propagate the conditions to the status updater.
build/Dockerfile.nginx
Outdated
@@ -4,11 +4,15 @@ FROM nginx:1.25.3-alpine | |||
ARG NJS_DIR | |||
ARG NGINX_CONF_DIR | |||
|
|||
RUN apk update && apk upgrade && apk add --no-cache libcap \ | |||
RUN apk update && apk upgrade && apk add --no-cache libcap curl ca-certificates \ | |||
&& printf "%s%s%s\n" "http://nginx.org/packages/mainline/alpine/v" `egrep -o '^[0-9]+\.[0-9]+' /etc/alpine-release` "/main" | tee -a /etc/apk/repositories \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&& printf "%s%s%s\n" "http://nginx.org/packages/mainline/alpine/v" `egrep -o '^[0-9]+\.[0-9]+' /etc/alpine-release` "/main" | tee -a /etc/apk/repositories \ | |
&& printf "%s\n" "http://nginx.org/packages/mainline/alpine/v$(egrep -o '^[0-9]+\.[0-9]+' /etc/alpine-release)/main" >> /etc/apk/repositories \ |
This seems cleaner to me 😄 (but anyway we don't need tee
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took this from the official NGINX docs, but yes, yours looks a lot cleaner 😂
Be sure to update the proposal for the data plane config to "Completed". |
description: NginxProxySpec defines the desired state of the NginxProxy. | ||
properties: | ||
http: | ||
description: Http defines the Http configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we make the change above that @pleshakov suggested about Http -> HTTP, should change here too.
description: NginxProxySpec defines the desired state of the NginxProxy. | ||
properties: | ||
http: | ||
description: Http defines the Http configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment from other CRD yaml about Http -> HTTP
docs/data-plane-configuration.md
Outdated
| --------- | ---------------------------------------------- | --------------------------- | -------- | | ||
| telemetry | Telemetry defines the telemetry configuration. | [telemetry](#spectelemetry) | no | | ||
|
||
### Spec.Telemetry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, would it need to be Spec.HTTP.Telemetry.Tracing
? But I also see there are parts where we just reference proxy.Tracing
without needing to go through the objects.
623a8e1
to
a510628
Compare
apis/v1alpha1/nginxproxy_types.go
Outdated
HTTP *HTTP `json:"http,omitempty"` | ||
} | ||
|
||
// HTTP defines the HTTP configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we be more explicit to say that this is for the nginx http block configuration?
// | ||
// +optional | ||
// +kubebuilder:default="5s" | ||
// +kubebuilder:validation:Pattern=`^(\d+y)??\s*(\d+M)??\s*(\d+w)??\s*(\d+d)??\s*(\d+h)??\s*(\d+m)??\s*(\d+s?)??\s*(\d+ms)??$` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the same validation pattern that the otel module uses?
// BatchSize specifies the maximum number of spans to be sent in one batch per worker. Default is 512. | ||
// | ||
// +optional | ||
// +kubebuilder:default=512 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's set validation to not allow a negative value. Not sure if we also should set a max?
// BatchCount specifies the number of pending batches per worker, spans exceeding the limit are dropped. Default is 4. | ||
// | ||
// +optional | ||
// +kubebuilder:default=4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as previous.
apis/v1alpha1/nginxproxy_types.go
Outdated
// | ||
// +required | ||
// ++kubebuilder:validation:Required | ||
// ++kubebuilder:validation:Pattern=`^(?:(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}|(?:\d{1,3}\.){3}\d{1,3}):\d{1,5}$` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question as before about the pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are IPv6 addresses allowed?
|
||
The control plane only watches this single instance of the custom resource. If the resource is invalid per the OpenAPI | ||
schema, the Kubernetes API server will reject the changes. If the resource is deleted or deemed invalid by NGINX | ||
Gateway Fabric, a warning Event is created in the `nginx-gateway` Namespace, and the default values will be used by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually create an event for this?
of the resource is `<release-name>-proxy-config`. It is deployed in the same Namespace as the controller | ||
(default `nginx-gateway`). | ||
|
||
The control plane only watches this single instance of the custom resource. If the resource is invalid per the OpenAPI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than "only watches this single instance of the custom resource", I'd mention that we only watch the one that's referenced in the associated GatewayClass for this control plane.
|
||
| name | description | type | required | | ||
| ---------- | ------------------------------------------------------------------------------------------------------------------- | ------ | -------- | | ||
| enable | Enabled enables or disables OpenTelemetry tracing at the HTTP context. Default is false. | bool | no | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| enable | Enabled enables or disables OpenTelemetry tracing at the HTTP context. Default is false. | bool | no | | |
| enable | Enable enables or disables OpenTelemetry tracing at the HTTP context. Default is false. | bool | no | |
|
||
```shell | ||
kubectl -n nginx-gateway describe nginxproxies nginx-proxy-config | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my other comment regarding whether or not we need an example, I would update this doc maybe with an example Jaeger deployment and telemetry config that would send traces there.
apis/v1alpha1/nginxproxy_types.go
Outdated
// | ||
// +required | ||
// ++kubebuilder:validation:Required | ||
// ++kubebuilder:validation:Pattern=`^(?:(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}|(?:\d{1,3}\.){3}\d{1,3}):\d{1,5}$` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are IPv6 addresses allowed?
@@ -1,4 +1,5 @@ | |||
load_module /usr/lib/nginx/modules/ngx_http_js_module.so; | |||
load_module modules/ngx_otel_module.so; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We load this file into the base image, do you think we need to create a new includes at this level for global configuration options like this?
yeah. that could work
// +kubebuilder:storageversion | ||
// +kubebuilder:subresource:status | ||
|
||
// NginxProxy represents the dynamic configuration for an NGINX Gateway Fabric data plane. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// NginxProxy represents the dynamic configuration for an NGINX Gateway Fabric data plane. | |
// NginxProxy represents the configuration for an NGINX Gateway Fabric data plane. |
@@ -0,0 +1,12 @@ | |||
package config | |||
|
|||
var otelTemplateText = ` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, const makes sense. not sure why we needed a var
if ngfTracing.Interval != nil && !reTime.MatchString(*ngfTracing.Interval) { | ||
valErr = append(valErr, field.Invalid( | ||
path.Child("interval"), *ngfTracing.Interval, "tracing.interval must be a valid time string")) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we also validate that BatchCount and BatchSize are not negative?
There are probably some v1beta1 references that were added that will need to be updated to v1 once #1250 is merged and this branch is rebased. |
Problem: We have two upcoming stories that require the new NginxProxy CRD in order to be worked on. Solution: To prevent one story from being blocked by the other, this work adds the new CRD with a blank spec, a controller that watches for it and triggers an update, and sets the status on the resource.
800bad7
to
b806b53
Compare
Closed as we are taking a different direction - see #1258 |
Proposed changes
Problem:
As a user of NGF
I want NGF to have the ability to generate and send traces
So that I can easily get tracing for all of my traffic to help debug my environment.
Solution: Use the NGINX Otel module to configure tracing from NGINX. Use the NginxProxy CRD to allow the user to configure tracing.
Testing: Unit testing, and e-2-e testing locally using Jaegar on Kind
Please focus on: This is based on a feature branch, to allow for someone else to also start work on #483. I could merge into the feature branch instead if required, but for now I am targeting merging to main as so far no work has happened yet on that issue.
Closes #1166
Checklist
Before creating a PR, run through this checklist and mark each as complete.