diff --git a/pip/pip-320.md b/pip/pip-320.md
new file mode 100644
index 0000000000000..3c169e4340bd1
--- /dev/null
+++ b/pip/pip-320.md
@@ -0,0 +1,256 @@

# PIP-320 OpenTelemetry Scaffolding

# Background knowledge

## PIP-264 - parent PIP titled "Enhanced OTel-based metric system"
[PIP-264](https://github.com/apache/pulsar/pull/21080), which can also be viewed [here](pip-264.md), describes at a high
level a plan to greatly enhance the Pulsar metric system by replacing it with [OpenTelemetry](https://opentelemetry.io/).
In that PIP you can read about the numerous existing problems it solves. Among them are:
- Control which metrics to export per topic/group/namespace via the introduction of a metric filter configuration.
  This configuration is planned to be dynamic, as outlined in [PIP-264](pip-264.md).
- Reduce the immense metrics cardinality caused by high topic counts (one of Pulsar's great features) by introducing
  the concept of a Metric Group - a group of topics for metric purposes. Metric reporting will also be done at
  group granularity, so 100k topics can be downsized to 1k groups. The dynamic metric filter configuration would allow
  the user to control which metric group to un-filter.
- Proper histogram exporting.
- Cleaning up codebase clutter by relying on a single industry-standard API, SDK and metrics protocol (OTLP) instead of
  the existing mix of home-brew libraries and a hard-coded Prometheus exporter.
- And many more.

You can read [here](pip-264.md#why-opentelemetry) why OpenTelemetry was chosen.

## OpenTelemetry
Since OpenTelemetry (a.k.a. OTel) is an emerging industry standard, there are plenty of good articles, videos and
documentation about it. In this very short section I'll describe what you need to know about OTel from this PIP's
perspective.

OpenTelemetry is a project that aims to standardize the way we instrument, collect and ship metrics from applications
to telemetry backends, be they databases (e.g. Prometheus, Cortex, Thanos) or vendors (e.g. Datadog, Logz.io).
It is divided into API, SDK and Collector:
- API: the interfaces used to instrument: define a counter, record values to a histogram, etc.
- SDK: a library, available in many languages, implementing the API, plus other important features such as
  reading the metrics and exporting them to a telemetry backend or an OTel Collector.
- Collector: a lightweight process (application) which can receive or retrieve telemetry, transform it (e.g.
  filter, drop, aggregate) and export it (e.g. send it to various backends). The SDK supports out-of-the-box
  exporting metrics as a Prometheus HTTP endpoint or sending them out using the OTLP protocol. Companies often choose to
  ship to the Collector and from there ship to their preferred vendors, since each vendor has already published an exporter
  plugin for the OTel Collector. This keeps the SDK exporters very lightweight, as they don't need to support any
  vendor. It's also easier for the DevOps team, as they can make the OTel Collector their responsibility and have
  application developers focus only on shipping metrics to that collector.

Just to have some context: the Pulsar codebase will use the OTel API to create counters / histograms and record values to
them. So will Pulsar plugins and Pulsar Function authors. Pulsar itself will be the one creating the SDK
and using it to hand over an implementation of the API wherever needed in Pulsar. The Collector is up to the choice
of the user, as OTel provides a way to expose the metrics as a `/metrics` endpoint on a configured port, so Prometheus-compatible
scrapers can scrape it directly. The metrics can also be sent via OTLP to an OTel Collector.
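To make the split between API and SDK concrete, here is a minimal sketch of what instrumenting Pulsar code against the OTel API could look like. The class, instrument name and attribute usage are hypothetical and for illustration only; the actual instruments will be defined in follow-up PRs.

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class ExampleBrokerInstrumentation {

    private final LongCounter messagesReceived;

    public ExampleBrokerInstrumentation(OpenTelemetry openTelemetry) {
        // The OpenTelemetry instance is handed over by Pulsar; this code depends only on the API.
        Meter meter = openTelemetry.getMeter("org.apache.pulsar.broker");
        messagesReceived = meter
                .counterBuilder("pulsar.broker.example.messages.received") // hypothetical instrument name
                .setDescription("Number of messages received by the broker (illustrative only)")
                .setUnit("{message}")
                .build();
    }

    public void onMessageReceived(String topic) {
        // With the no-op implementation (OTel disabled), this call does nothing.
        messagesReceived.add(1, Attributes.of(AttributeKey.stringKey("pulsar.topic"), topic));
    }
}
```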
## Telemetry layers
PIP-264 clearly outlined that there will be two layers of metrics, collected and exported side by side: OpenTelemetry
and the existing metric system, which currently exports in Prometheus format. This PIP explains in detail how that will work.
The basic premise is that you will be able to enable or disable OTel metrics alongside the existing Prometheus
metric exporting.

## Why OTel in Pulsar will be marked experimental and not GA
As specified in [PIP-264](pip-264.md), the OpenTelemetry Java SDK has several fixes the Pulsar community must
complete before it can be used in production. They are [documented](pip-264.md#what-we-need-to-fix-in-opentelemetry)
in PIP-264. The most important one is reducing memory allocations to a negligible amount. The OTel SDK is built upon immutability,
hence it allocates memory in O(`#topics`), which is a performance killer for a low-latency application like Pulsar.

You can track the proposal and the progress the Pulsar and OTel communities are making in
[this issue](https://github.com/open-telemetry/opentelemetry-java/issues/5105).


## Metrics endpoint authentication
Today the Pulsar metrics endpoint `/metrics` has an option to be protected by the configured `AuthenticationProvider`.
The configuration option is named `authenticateMetricsEndpoint` in both the broker and the proxy.


# Motivation

Implementing PIP-264 consists of a long list of steps, which are detailed in
[this issue](https://github.com/apache/pulsar/issues/21121). The first step is to add all the bare-bones infrastructure
needed to use OpenTelemetry in Pulsar, such that subsequent PRs can use it to start translating existing metrics to their
OTel form. It means the same metrics will co-exist in the codebase and also at runtime, if OTel is enabled.

# Goals

## In Scope
- Ability to add metrics using OpenTelemetry to Pulsar components: Broker, Function Worker and Proxy.
- The user can enable or disable OpenTelemetry metrics; they will be disabled by default.
- OpenTelemetry metrics will be configured via the native OTel Java SDK configuration options.
- All the information necessary to use OTel with Pulsar will be documented on the Pulsar documentation site.
- The OpenTelemetry metrics layer is defined as experimental, and *not* GA.


## Out of Scope
- Ability to add metrics using OpenTelemetry as a Pulsar Function author.
- Restricting access to the OTel Prometheus endpoint to authenticated sessions, using Pulsar authentication.
- Metrics in Pulsar clients (as defined in [PIP-264](pip-264.md#out-of-scope)).

# High Level Design

## Configuration
OpenTelemetry, like any good telemetry library (e.g. log4j, logback), has its own configuration mechanisms:
- System properties
- Environment variables
- Experimental file-based configuration

Pulsar doesn't need to introduce any additional configuration. The user can decide, using OTel configuration,
things like:
* How to export the metrics - e.g. Prometheus, and on which port the Prometheus endpoint will be exposed
* Changing histogram buckets using Views
* and more

Pulsar will use `AutoConfiguredOpenTelemetrySdk`, which uses all the configuration mechanisms above
(documented [here](https://github.com/open-telemetry/opentelemetry-java/tree/main/sdk-extensions/autoconfigure)).
This class builds an `OpenTelemetrySdk` based on that configuration. It is the entry point to the OpenTelemetry API, as it
implements the `OpenTelemetry` API interface.
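For illustration, a minimal sketch of bootstrapping the SDK through auto-configuration; the wrapping class is hypothetical, while `AutoConfiguredOpenTelemetrySdk` and its builder come from the autoconfigure module linked above.

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;

public final class OtelBootstrapSketch {

    // Builds an SDK driven purely by OTel's own configuration mechanisms
    // (system properties, environment variables, experimental file config).
    public static OpenTelemetry create() {
        OpenTelemetrySdk sdk = AutoConfiguredOpenTelemetrySdk.builder()
                .build()
                .getOpenTelemetrySdk();
        // OpenTelemetrySdk implements the OpenTelemetry API entry point,
        // so it can be handed to any code that depends only on the API.
        return sdk;
    }
}
```

With such a setup the user controls exporting purely through OTel's own knobs, for example the autoconfigure properties `otel.metrics.exporter=prometheus` and `otel.exporter.prometheus.port=<port>` (or their environment-variable equivalents) to expose the Prometheus endpoint.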
### Setting sensible defaults for Pulsar
There are some configuration options whose defaults we wish to change, while still allowing users to override them
if they wish. We think these default values will make for a much better user experience.

* `otel.experimental.metrics.cardinality.limit` - value: 10,000
  This property sets an upper bound on the number of unique `Attributes` sets an instrument can have. Take Pulsar as an example:
  for an instrument like `pulsar.broker.messaging.topic.received.size`, the number of unique `Attributes` sets equals the number of
  active topics in the broker. Since Pulsar can handle up to 1M topics, it makes more sense to set the default value
  to 10k, which translates to 10k topics.

`AutoConfiguredOpenTelemetrySdkBuilder` allows adding properties using the method `addPropertiesSupplier`.
System properties and environment variables override properties supplied this way. The file-based configuration does not yet take
the supplied properties into account, but it will.
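A minimal sketch of supplying these defaults with `addPropertiesSupplier`; the wrapping class and method name are hypothetical, and `otel.sdk.disabled` is the opt-in default described in the next section.

```java
import java.util.Map;

import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdkBuilder;

public final class PulsarOtelDefaultsSketch {

    public static AutoConfiguredOpenTelemetrySdkBuilder withPulsarDefaults() {
        return AutoConfiguredOpenTelemetrySdk.builder()
                // Defaults only: system properties and environment variables still win.
                .addPropertiesSupplier(() -> Map.of(
                        // Cap unique attribute sets per instrument (roughly: topics per instrument).
                        "otel.experimental.metrics.cardinality.limit", "10000",
                        // Keep OTel opt-in while it is experimental (see "Opting in" below).
                        "otel.sdk.disabled", "true"));
    }
}
```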
## Opting in
We would like the ability to toggle OpenTelemetry-based metrics, as they are still new.
We won't need any special Pulsar configuration, as the OpenTelemetry SDK comes with a configuration key for that.
Since OTel is still experimental, it will have to be opt-in, hence we will add the following property as a default,
using the mechanism described [above](#setting-sensible-defaults-for-pulsar):

* `otel.sdk.disabled` - value: true
  This property value disables OpenTelemetry.

With OTel disabled, the user remains with the existing metrics system. OTel in a disabled state operates in
no-op mode. This means instruments do get built, but the instrument builders return the same instance of a
no-op instrument, which does nothing in its record-value methods (e.g. `add(number)`, `record(number)`). The no-op
`MeterProvider` has no registered `MetricReader`, hence no metric collection will be made. The memory impact
is almost zero, and the same goes for the CPU impact.

The current metric system doesn't have a toggle which causes all existing data structures to stop collecting
data. Introducing one would require changes in many places, since there is no single place through which
all metric instruments are created (one of the motivations for PIP-264).
The current system does have the toggle `exposeTopicLevelMetricsInPrometheus`. It enables toggling off
topic-level metrics, which means the highest-cardinality metrics will be at the namespace level.
Once that toggle is `false`, the number of data structures consuming memory would be in the range of
a few thousand, which shouldn't pose a memory burden. If the user refrains from calling
`/metrics`, that also reduces the CPU and memory cost associated with collecting metrics.

When the user enables OTel there will be a memory increase, but if the user disabled topic-level
metrics in the existing system, as described above, the majority of the memory increase will be due to topic-level
metrics in OTel, at the expense of not having them in the existing metric system.


## Cluster attribute name
A broker is part of a cluster, which is configured in the Pulsar configuration key `clusterName`. When the broker is part
of a cluster, it shares the topics defined in that cluster (persisted in the metadata service, e.g. ZooKeeper)
with the other brokers of that cluster.

Today, each unique time series emitted in Prometheus metrics contains the `cluster` label (almost all of them, as it
is added manually). We wish the same for OTel - to have that attribute in each exported unique time series.

OTel has the perfect location for attributes which are shared across all time series: the Resource. An application
can have multiple Resources, each with one or more attributes. You define it once, in OTel initialization or
configuration. It can contain attributes like the hostname, AWS region, etc. The default contains the service name
and some info on the SDK version.

Attributes can be added dynamically through `addResourceCustomizer()` in `AutoConfiguredOpenTelemetrySdkBuilder`.
We will use that to inject the `cluster` attribute, taken from the configuration.

We submitted a [proposal](https://github.com/open-telemetry/opentelemetry-specification/pull/3761)
to the OpenTelemetry specification, which was merged, to allow copying resource attributes into each exported
unique time series in the Prometheus exporter.
We plan to contribute its implementation to the OTel Java SDK.

Without that, the Prometheus exporter exports the Resource as `target_info{} 1`, and the resource attributes are added only to that
time series. Retrieving them would require joins, making the metrics extremely difficult to use.
The other alternative was to introduce our own `PulsarAttributesBuilder` class on top of
OTel's `AttributesBuilder`. Getting every contributor to know and use this class is hard; getting it
across to Pulsar Functions or plugin authors would be immensely hard. Also, when exporting via
OTLP, it is very inefficient to repeat the attribute across all unique time series instead of stating it once in the
Resource. Hence, this needed to be solved in the Prometheus exporter, as we did in the proposal.

The attribute will be named `pulsar.cluster`, as both the proxy and the broker are part of this cluster.
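A minimal sketch of injecting the cluster attribute through `addResourceCustomizer()`, under the assumption that the cluster name is simply passed in from the broker/proxy configuration; the wrapping class is hypothetical.

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdkBuilder;

public final class PulsarClusterResourceSketch {

    private static final AttributeKey<String> PULSAR_CLUSTER = AttributeKey.stringKey("pulsar.cluster");

    public static AutoConfiguredOpenTelemetrySdkBuilder withClusterAttribute(String clusterName) {
        return AutoConfiguredOpenTelemetrySdk.builder()
                // Add the cluster name (from Pulsar's `clusterName` configuration) to the SDK Resource,
                // so every exported unique time series can carry it.
                .addResourceCustomizer((resource, configProperties) ->
                        resource.toBuilder().put(PULSAR_CLUSTER, clusterName).build());
    }
}
```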
## Naming and using OpenTelemetry

### Attributes
* We shall prefix each attribute with `pulsar.`. Examples: `pulsar.topic`, `pulsar.cluster`.

### Instruments
We should have a clear hierarchy, hence we will use the following prefixes:
* `pulsar.broker`
* `pulsar.proxy`
* `pulsar.function_worker`

### Meter
It's customary to use a reverse domain name for meter names. Hence, we'll use:
* `org.apache.pulsar.broker`
* `org.apache.pulsar.proxy`
* `org.apache.pulsar.function_worker`

The OTel meter name is converted by the Prometheus exporter to the attribute `otel_scope_name` and added to the attributes
of each unique time series.

We won't specify a meter version: it is used solely to signify the version of the instrumentation, and since this is the
first version, we have no need for it yet.


# Detailed Design

## Design & Implementation Details

* `OpenTelemetryService` class
  * Parameters:
    * Cluster name
  * What it will do:
    - Override the default max cardinality to 10k
    - Register a resource with the cluster name
    - Place default settings to instruct the Prometheus exporter to copy resource attributes
    - In the future: place defaults for the memory mode to be REUSABLE_DATA

* `PulsarBrokerOpenTelemetry` class
  * Initialization
    * Constructs an `OpenTelemetryService` using the cluster name taken from the broker configuration
    * Constructs a Meter for the broker metrics
  * Methods
    * `getMeter()` returns the `Meter` for the broker
  * Notes
    * This is the class that will be passed along to other Pulsar service classes that need to define
      telemetry such as metrics (in the future: traces).

* `PulsarProxyOpenTelemetry` class
  * Same as `PulsarBrokerOpenTelemetry`, but for the Pulsar Proxy
* `PulsarWorkerOpenTelemetry` class
  * Same as `PulsarBrokerOpenTelemetry`, but for the Pulsar Function Worker


## Public-facing Changes

### Public API
* The OTel Prometheus exporter adds a `/metrics` endpoint on a user-defined port, if the user chooses to use it

### Configuration
* OTel configurations are used

# Security Considerations
* OTel currently does not support setting a custom authenticator for the Prometheus exporter.
  An issue has been raised [here](https://github.com/open-telemetry/opentelemetry-java/issues/6013).
  * Once it does, we can secure the Prometheus exporter metrics endpoint using `AuthenticationProvider`
* Any user can access metrics, and they are not protected per tenant, like today's implementation

# Links

* Mailing List discussion thread: https://lists.apache.org/thread/xcn9rm551tyf4vxrpb0th0wj0kktnrr2
* Mailing List voting thread: https://lists.apache.org/thread/zp6vl9z9dhwbvwbplm60no13t8fvlqs2