-
Notifications
You must be signed in to change notification settings - Fork 681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
design: add tracing design doc #3802
Conversation
Closes projectcontour#3792. Signed-off-by: Steve Kriss <[email protected]>
Codecov Report
@@ Coverage Diff @@
## main #3802 +/- ##
=======================================
Coverage 75.83% 75.83%
=======================================
Files 107 107
Lines 7338 7338
=======================================
Hits 5565 5565
Misses 1652 1652
Partials 121 121 |
|
||
## Non Goals | ||
- **Per-HTTPProxy trace configuration.** There is a technical limitation to enabling per-proxy configuration. Trace settings are configured on the HTTP connection manager (HCM). Each TLS-enabled virtual host uses its own HCM, but *all* non-TLS virtual hosts share a single HCM. As such, while it's possible to have unique trace settings for each TLS virtual host, it's not possible to do the same for non-TLS virtual hosts. Note that this same limitation has come up when designing external authorization and global rate limiting as well. | ||
- **Trace formats other than OpenCensus and Zipkin.** While Contour *may* choose to add direct support for other trace formats in the future, our hope is that by supporting the OpenTelemetry collector, users are able to export traces to the backend of their choice without having to directly configure it in Contour/Envoy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our hope is that by supporting the OpenTelemetry collector, users are able to export traces to the backend of their choice without having to directly configure it in Contour/Envoy.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great, @skriss! Do you think there'll be anything for the alternatives or security concerns sections (fine if no)?
... | ||
``` | ||
|
||
For Zipkin, an Envoy cluster corresponding to the Zipkin collector must exist. There are a few different options for creating this cluster: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this exactly what ExtensionService is for? I'd prefer to see us reuse that for the second and third options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple challenges with using ExtensionService here:
- ExtensionService is tailored to gRPC services that are part of xDS, i.e. the services listed here, which these are not. Specifically, it includes the ProtocolVersion field and also requires either h2 or h2c.
- Zipkin doesn't necessarily use gRPC (though it can). Right now, ExtensionService pretty much assumes the service is gRPC.
- for OpenCensus, the API is gRPC, but Envoy's OC tracer does not allow using the Envoy gRPC client, only the Google gRPC client. As such, it can't be directly targeted at an Envoy cluster; it needs a gRPC address. I'm not sure off the top of my head how we'd make this work.
If we're committed to using ExtensionService here, then I can look more into what changes would be needed to make this work. TBH, I feel like this would be easier if instead of having a single generic ExtensionService type, we had AuthorizationService
, RateLimitService
, TracingService
, etc., that maybe shared some common configuration but could also vary independently to support the variations among the different services. I know that's a bigger redesign though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this would be easier if instead of having a single generic ExtensionService type, we had AuthorizationService, RateLimitService, TracingService, etc
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the last time we discussed this, @skriss was going to work on an updated ExtensionService design (which may include separate *Service objects for each use case). Is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I've been thinking about it and starting to write up a doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we moved to a CRD per service type (ExtAuthService, RateLimitService, TracingService), we could make this work since each type could be significantly different - and tracing would need to be significantly different than the others given that it's not an xDS gRPC service, and, depending on the provider, does not make use of an Envoy cluster at all (which is effectively what an ExtensionService is right now). However, I think it's going to be difficult to move from the generic ExtensionService to a CRD per service type since it'll require what I believe is a breaking API change for ExtAuth (HTTPProxy currently has an implicitly-typed reference to an ExtensionService), and disruptive/breaking config file changes for rate limiting. If we wanted to do that, it might need to wait for a v2 API/v2 Contour.
Another option would be to add type-specific subfields to ExtensionService; however, because tracing is so different from the existing ExtensionService types, many of the existing fields in ExtensionService don't apply to tracing, so we end up with a messy API where many fields just don't apply in certain circumstances. I'm not a huge fan of this option given where we're starting from -- even though ExtensionService is "alpha", we don't want to break users if we can avoid it, which makes it harder to make in-place changes.
We could introduce just a TracingService CRD in the v1alpha1
group which would give us a chance to experiment with having a CRD per service type, and then look to migrate the other existing service types to specific CRDs down the road. I like this approach in theory, though there are some challenges specifically around OpenCensus since it can't be configured more than once per Envoy lifetime (that would apply even if we used the config file).
Or, we could use the config file to define everything.
Separately, since a main driver for defining an ExtensionService CRD to begin with was that it gave a way to report status, I think we should do that work for the existing service types. We could start simple by reporting whether or not a given ExtensionService has healthy Kubernetes Endpoints for its services, and then decide if we wanted to report more information from there.
Before going much further down any one of these paths, I'd appreciate some input from others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for that summary @skriss. I kind of expected that if we were going to look at separate objects, we would start with TracingService and then come back for the others later, I think that is fine. That's why these are alpha, after all. If the way tracing needs to work is that the config file references the TracingService rather than configuring it at an object level, I thin that's fine. Having the TracingService then makes it easier to operate tracing sinks, since you can flip to a new one, check everything is working, then flip back if necessary (although obviously it's not that easy for OpenCensus if it's per-Envoy-lifetime).
I don't like the idea of adding extra fields to ExtensionService because of what you said - although the structs may live on in the new CRDs, I think the use cases are distinct enough to need separate things. We just need to keep an eye on how many we end up with.
I agree that we need to do something with Status for ExtensionService, and that "Are there a nonzero number of Endpoints available" is a good start on a Ready conditrion.
For moving the existing things like ExternalAuth references, we could do something like:
- add GVK fields to the references, default to ExtensionService in one release.
- Later, change default to reference AuthService instead. This is a breaking change, but an easy fix for users, just specify the kind.
- Later, officially announce ExtensionService as deprecated and ask people to move to AuthorizationService.
Total timeframe for this, I would guess nine to twelve months, however many releases that is.
- Contour config file takes a Kubernetes service name, and a DAG processor adds a cluster for it to the DAG as a root. | ||
- Contour config file takes an address, and a DAG processor adds a cluster for it to the DAG as a root. | ||
|
||
For OpenCensus, Envoy can only be configured once per lifetime (see [this Envoy docs section][6] for information). Attempts to modify the OpenCensus configuration will cause Envoy to NACK config updates which, because of https://github.com/projectcontour/contour/issues/1176, will cause all future config updates to be ignored until Envoy is restarted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: I was wrong about #1176 - at some point the behavior has changed and now the behavior is that no updates will be accepted until the incorrect config is removed, at which time Envoy will recover.
Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 30 days. |
## High-Level Design | ||
The Contour configuration file will be extended with a new optional `tracing` section. | ||
This configuration block, if present, will enable tracing and will define the trace format and other properties needed to generate and export trace data. | ||
Initially, the `opencensus` and `zipkin` trace formats will be supported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't you try to implement OpenTelemetry interface and leave it backend agnostic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pavelnikolov as far as I know Envoy does not yet support OpenTelemetry: envoyproxy/envoy#9958
Otherwise, yes,, OTel is the direction we'd prefer. One option here is to just wait on implementing support in Contour for tracing until Envoy does support OTel directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skriss it's already implemented in Envoy. Are you going to prioritise it now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd like to, we'll see what we can do.
```yaml | ||
tracing: | ||
# Trace format to use. Valid options are opencensus and zipkin. | ||
format: [opencensus, zipkin] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't care about the format if you use OpenTelemetry interface(s). Then the user can choose whatever exporter (backend) they see fit. What is more Zipkin is one of the supported backends for OpenTelemetry. So instead of format you might want to choose backend exporter here.
## Compatibility | ||
A discussion of any compatibility issues that need to be considered | ||
|
||
- Migration from OpenCensus to OpenTelemery in the future |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we start with OpenTelemetry in the first place?
Should this be closed? |
This is blocked behind a redesign of ExtensionService at the moment, which would move from the base ExtensionService (currently used for both external auth and rate limiting), to having separate CRDs, for AuthenticationService, RateLimitingService, and then add TracingService. The original ExtensionService idea ends up too overloaded with all the stuff that needs to be configured for tracing, so it's better to have a single object type per use-case. Once we've done that work, it should be reasonably straightforward to add the tracing support. We just have to pay down the tech debt first. |
@youngnick Is there an issue tracking the ExtensionService refactoring? Could I help? |
|
||
## Goals | ||
- **Support a single global trace configuration** that applies to all requests. | ||
- **Support OpenCensus and Zipkin trace formats**, since together they enable a large set of common backends to be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OpenCensus has folded into OpenTelemetry now. The opentelemetry collector can export to Zipkin using an exporter so I do not think supporting either should be needed.
Envoy already supports OpenTelemetry should the design document be updated to only support exporting tracing to OpenTelemetry? |
@yangyy93 yep I believe so, which should also make our work here a lot easier since it requires a gRPC cluster and not a static cluster or http cluster etc. that other export formats needed I believe |
If anyone involved wants to take this over feel free to start a new design PR, can use this as a reference! |
Putting this up as a draft in case anyone wants to preview. I"m still actively refining and working out some details.
Closes #3792.
xref #399.
Signed-off-by: Steve Kriss [email protected]