-
Notifications
You must be signed in to change notification settings - Fork 681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
design: add tracing design doc #3802
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,142 @@ | ||
# Design for Supporting Tracing in Contour | ||
|
||
Status: Draft | ||
|
||
## Abstract | ||
Envoy has rich support for [distributed tracing][1]. | ||
However, Contour does not currently offer a way to configure Envoy for tracing. | ||
This design document proposes an implementation of tracing support in Contour. | ||
|
||
## Background | ||
Envoy's [documentation on tracing][1] says: | ||
> Distributed tracing allows developers to obtain visualizations of call flows in large service oriented architectures. It can be invaluable in understanding serialization, parallelism, and sources of latency. | ||
|
||
Envoy supports: | ||
- generating traces for incoming requests | ||
- joining existing traces when a client sets the `x-client-trace-id` header | ||
- exporting trace data to various third-party providers (Zipkin, Jaeger, Datadog, etc.) | ||
|
||
The tracing ecosystem is complex right now. | ||
However, a few observations can be made: | ||
- [OpenTelemetry][2] is a CNCF project which is working to become a standard in the space. It was formed as a merger of the OpenTracing and OpenCensus projects. It supports ingesting trace data in a variety of formats, transforming it, and exporting it to a variety of backends. | ||
- [Zipkin][3] and [Jaeger][4] are two other common providers. Jaeger supports collecting data in the Zipkin format as well as in its own format. | ||
|
||
## Goals | ||
- **Support a single global trace configuration** that applies to all requests. | ||
- **Support OpenCensus and Zipkin trace formats**, since together they enable a large set of common backends to be used. | ||
|
||
## Non Goals | ||
- **Per-HTTPProxy trace configuration.** There is a technical limitation to enabling per-proxy configuration. Trace settings are configured on the HTTP connection manager (HCM). Each TLS-enabled virtual host uses its own HCM, but *all* non-TLS virtual hosts share a single HCM. As such, while it's possible to have unique trace settings for each TLS virtual host, it's not possible to do the same for non-TLS virtual hosts. Note that this same limitation has come up when designing external authorization and global rate limiting as well. | ||
- **Trace formats other than OpenCensus and Zipkin.** While Contour *may* choose to add direct support for other trace formats in the future, our hope is that by supporting the OpenTelemetry collector, users are able to export traces to the backend of their choice without having to directly configure it in Contour/Envoy. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
+1 |
||
|
||
## High-Level Design | ||
The Contour configuration file will be extended with a new optional `tracing` section. | ||
This configuration block, if present, will enable tracing and will define the trace format and other properties needed to generate and export trace data. | ||
Initially, the `opencensus` and `zipkin` trace formats will be supported. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shouldn't you try to implement OpenTelemetry interface and leave it backend agnostic? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @pavelnikolov as far as I know Envoy does not yet support OpenTelemetry: envoyproxy/envoy#9958 Otherwise, yes,, OTel is the direction we'd prefer. One option here is to just wait on implementing support in Contour for tracing until Envoy does support OTel directly. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @skriss it's already implemented in Envoy. Are you going to prioritise it now? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We'd like to, we'll see what we can do. |
||
|
||
The parameters defined in the configuration file will be used to configure each [HTTP connection manager's `tracing` section][5]. | ||
|
||
Since the tracing backend is pluggable, Contour will not package any particular backend. | ||
The user is responsible for deploying and operating the tracing backend of their choice, and configuring Contour to make use of it. | ||
|
||
## Detailed Design | ||
The `tracing` section of the Contour config file will look like: | ||
|
||
(WIP, this is the rough idea) | ||
```yaml | ||
tracing: | ||
# Trace format to use. Valid options are opencensus and zipkin. | ||
format: [opencensus, zipkin] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You shouldn't care about the format if you use OpenTelemetry interface(s). Then the user can choose whatever exporter (backend) they see fit. What is more Zipkin is one of the supported backends for OpenTelemetry. So instead of format you might want to choose backend exporter here. |
||
config: | ||
# If tracing.format == opencensus, then this block is expected | ||
# to be populated. | ||
opencensus: | ||
agent-address: otel-collector.default:55678 | ||
# If tracing.format == zipkin, then this block is expected | ||
# to be populated. | ||
zipkin: | ||
collector-cluster: <Envoy cluster name> | ||
collector-endpoint: /api/v2/spans | ||
collector-endpoint-version: ... | ||
``` | ||
|
||
In the presence of the `tracing` block above, all HTTP connection managers (HCM) will be configured for tracing. | ||
|
||
A Zipkin-enabled HCM will include the following as an example: | ||
```go | ||
... | ||
Tracing: &http.HttpConnectionManager_Tracing{ | ||
Provider: &envoy_config_trace_v3.Tracing_Http{ | ||
Name: "envoy.tracers.zipkin", | ||
ConfigType: &envoy_config_trace_v3.Tracing_Http_TypedConfig{ | ||
TypedConfig: protobuf.MustMarshalAny(&envoy_config_trace_v3.ZipkinConfig{ | ||
CollectorCluster: "default/zipkin/9411/da39a3ee5e", | ||
CollectorEndpoint: "/api/v2/spans", | ||
CollectorEndpointVersion: envoy_config_trace_v3.ZipkinConfig_HTTP_JSON, | ||
}), | ||
}, | ||
}, | ||
}, | ||
... | ||
``` | ||
|
||
An OpenCensus-enabled HCM will include the following as an example: | ||
```go | ||
... | ||
Tracing: &http.HttpConnectionManager_Tracing{ | ||
Provider: &envoy_config_trace_v3.Tracing_Http{ | ||
Name: "envoy.tracers.opencensus", | ||
ConfigType: &envoy_config_trace_v3.Tracing_Http_TypedConfig{ | ||
TypedConfig: protobuf.MustMarshalAny(&envoy_config_trace_v3.OpenCensusConfig{ | ||
OcagentAddress: "otel-collector.default:55678", | ||
OcagentExporterEnabled: true, | ||
}), | ||
}, | ||
}, | ||
... | ||
``` | ||
|
||
For Zipkin, an Envoy cluster corresponding to the Zipkin collector must exist. There are a few different options for creating this cluster: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Isn't this exactly what ExtensionService is for? I'd prefer to see us reuse that for the second and third options. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There are a couple challenges with using ExtensionService here:
If we're committed to using ExtensionService here, then I can look more into what changes would be needed to make this work. TBH, I feel like this would be easier if instead of having a single generic ExtensionService type, we had There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
+1 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the last time we discussed this, @skriss was going to work on an updated ExtensionService design (which may include separate *Service objects for each use case). Is that right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep, I've been thinking about it and starting to write up a doc. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think if we moved to a CRD per service type (ExtAuthService, RateLimitService, TracingService), we could make this work since each type could be significantly different - and tracing would need to be significantly different than the others given that it's not an xDS gRPC service, and, depending on the provider, does not make use of an Envoy cluster at all (which is effectively what an ExtensionService is right now). However, I think it's going to be difficult to move from the generic ExtensionService to a CRD per service type since it'll require what I believe is a breaking API change for ExtAuth (HTTPProxy currently has an implicitly-typed reference to an ExtensionService), and disruptive/breaking config file changes for rate limiting. If we wanted to do that, it might need to wait for a v2 API/v2 Contour. Another option would be to add type-specific subfields to ExtensionService; however, because tracing is so different from the existing ExtensionService types, many of the existing fields in ExtensionService don't apply to tracing, so we end up with a messy API where many fields just don't apply in certain circumstances. I'm not a huge fan of this option given where we're starting from -- even though ExtensionService is "alpha", we don't want to break users if we can avoid it, which makes it harder to make in-place changes. We could introduce just a TracingService CRD in the Or, we could use the config file to define everything. Separately, since a main driver for defining an ExtensionService CRD to begin with was that it gave a way to report status, I think we should do that work for the existing service types. We could start simple by reporting whether or not a given ExtensionService has healthy Kubernetes Endpoints for its services, and then decide if we wanted to report more information from there. Before going much further down any one of these paths, I'd appreciate some input from others. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for that summary @skriss. I kind of expected that if we were going to look at separate objects, we would start with TracingService and then come back for the others later, I think that is fine. That's why these are alpha, after all. If the way tracing needs to work is that the config file references the TracingService rather than configuring it at an object level, I thin that's fine. Having the TracingService then makes it easier to operate tracing sinks, since you can flip to a new one, check everything is working, then flip back if necessary (although obviously it's not that easy for OpenCensus if it's per-Envoy-lifetime). I don't like the idea of adding extra fields to ExtensionService because of what you said - although the structs may live on in the new CRDs, I think the use cases are distinct enough to need separate things. We just need to keep an eye on how many we end up with. I agree that we need to do something with Status for ExtensionService, and that "Are there a nonzero number of Endpoints available" is a good start on a Ready conditrion. For moving the existing things like ExternalAuth references, we could do something like:
|
||
- user explicitly defines the cluster as part of a custom bootstrap file | ||
- Contour config file takes a Kubernetes service name, and a DAG processor adds a cluster for it to the DAG as a root. | ||
- Contour config file takes an address, and a DAG processor adds a cluster for it to the DAG as a root. | ||
|
||
For OpenCensus, Envoy can only be configured once per lifetime (see [this Envoy docs section][6] for information). Attempts to modify the OpenCensus configuration will cause Envoy to NACK config updates which, because of https://github.com/projectcontour/contour/issues/1176, will cause all future config updates to be ignored until Envoy is restarted. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note: I was wrong about #1176 - at some point the behavior has changed and now the behavior is that no updates will be accepted until the incorrect config is removed, at which time Envoy will recover. |
||
|
||
## Alternatives Considered | ||
If there are alternative high level or detailed designs that were not pursued they should be called out here with a brief explanation of why they were not pursued. | ||
|
||
- additional providers (could be added down the road if needed, but desire is to standardize on OpenTelemetry eventually) | ||
|
||
## Security Considerations | ||
If this proposal has an impact to the security of the product, its users, or data stored or transmitted via the product, they must be addressed here. | ||
|
||
## Compatibility | ||
A discussion of any compatibility issues that need to be considered | ||
|
||
- Migration from OpenCensus to OpenTelemery in the future | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we start with OpenTelemetry in the first place? |
||
|
||
## Implementation | ||
A description of the implementation, timelines, and any resources that have agreed to contribute. | ||
|
||
## Open Issues | ||
- is there any reason to use ExtensionService for this? (I think not; it's designed for the gRPC extension services and doesn't fit well here) | ||
- OpenCensus can only be configured once per Envoy lifetime - so if a user changes the Contour config for OpenCensus and restarts Contour, Envoy will barf - how to deal with this? | ||
- Zipkin requires an Envoy cluster to exist for the collector - how to get this into the DAG? | ||
|
||
## Survey of Other Ingress Controllers | ||
|
||
### Ambassador | ||
https://www.getambassador.io/docs/edge-stack/latest/topics/running/services/tracing-service/ | ||
- zipkin, lightstep, datadog | ||
|
||
### Gloo | ||
https://docs.solo.io/gloo-edge/latest/guides/observability/tracing/ | ||
- anything Envoy supports | ||
|
||
[1]: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/observability/tracing | ||
[2]: https://opentelemetry.io/ | ||
[3]: https://zipkin.io/ | ||
[4]: https://www.jaegertracing.io/ | ||
[5]: https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/network/http_connection_manager/v3/http_connection_manager.proto#envoy-v3-api-msg-extensions-filters-network-http-connection-manager-v3-httpconnectionmanager-tracing | ||
[6]: https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/network/http_connection_manager/v3/http_connection_manager.proto#extensions-filters-network-http-connection-manager-v3-httpconnectionmanager-tracing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OpenCensus has folded into OpenTelemetry now. The opentelemetry collector can export to Zipkin using an exporter so I do not think supporting either should be needed.