# Propose OpenTelemetry Profiling Vision

The following high-level items define the long-term vision we aspire to achieve for profiling support in the OpenTelemetry project.

While this vision document reflects our current desires, it is meant to be a guide towards a collectively agreed upon set of objectives rather than a checklist of requirements. A group of OpenTelemetry community members has participated in a series of bi-weekly meetings over the past two months. The group represents a cross-section of industry and domain expertise and has found common cause in the creation of this document. It is our shared intention to continue to ensure alignment moving forward. As our vision evolves and matures, we intend to incorporate our learnings further to facilitate an optimal outcome.

This document and efforts thus far are motivated by:

- This [long-standing issue](https://github.com/open-telemetry/oteps/issues/139) created in October 2020
- A conversation about priorities at the in-person OTel meeting at KubeCon EU 2022
- Increasing community interest in profiling as an observability signal alongside logs, metrics, and traces

## How Profiling aligns with the OpenTelemetry vision

The [OpenTelemetry vision](https://opentelemetry.io/mission/#vision-mdash-the-world-we-imagine-for-otel-end-users) states:

_Effective observability is powerful because it enables developers to innovate faster while maintaining high reliability. But effective observability absolutely requires high-quality telemetry – and the performant, consistent instrumentation that makes it possible._

While existing OpenTelemetry signals fit all of these criteria, until recently no effort had been explicitly geared towards creating performant and consistent instrumentation for profiling data.

## Making a well-rounded observability suite by adding profiling

Currently, logs, metrics, and traces are widely accepted as the main “pillars” of observability, each providing a different set of data that users can query to answer questions about their system or application. However, arbitrarily limiting observability to three pillars does a disservice to the main goal of observability.

Profiling data can help further this goal by answering certain questions about a system or application which logs, metrics, and traces are less equipped to answer. We aim to facilitate implementations capable of best-in-class support for collecting, processing, and transporting this profiling data.

Our goals for profiling align with those of OpenTelemetry as a whole:

- **Profiling should be easy**: profiling offers fast time-to-value; often, dropping in a minimal amount of code is enough to instantly get details about application resource utilization (see the sketch after this list)
- **Profiling should be universal**: profiling currently differs slightly from language to language, but with a little effort the representation of profiling data can be standardized so that it is consistent across languages and consistent with the other observability signals
- **Profiling should be vendor neutral**: from one profiling agent, users should be able to send data to whichever vendor they like (or no vendor at all) and interoperate with other OSS projects
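
As an illustration of the “easy” goal above, the sketch below shows how little code an existing profiler can require. It uses Go's built-in `net/http/pprof` package rather than any OpenTelemetry profiling API (none exists yet), so it demonstrates today's drop-in experience, not the eventual standard.

```go
// Illustrative only: Go's built-in net/http/pprof as an example of a
// "drop-in" profiler. This is not an OpenTelemetry API.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// CPU, heap, goroutine, and other profiles are now served at
	// http://localhost:6060/debug/pprof/ with no further instrumentation.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```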

## Current State of Profilers

As it currently stands, the method for collecting profiles for an application and the format of the profiles collected vary greatly depending on several factors, such as:

- Language (and language runtime)
- Profiler Type
- Data type being profiled (e.g. CPU, memory, etc.)
- Availability or utilization of symbolic information

A fairly comprehensive taxonomy of various profiling formats can be found on the [profilerpedia website](https://profilerpedia.markhansen.co.nz/formats/).

As a result of this variation, the tooling for and collection of profiling data fall short in exactly the areas that OpenTelemetry has established as its core engineering values:

- Profiling currently lacks compatibility: each vendor, open source project, and language has its own way of collecting, sending, and storing profiling data, often with no regard for linking it to other signals
- Profiling currently lacks consistency: profiling agents and formats can change arbitrarily, with no unified criteria for how end-users are taken into account

## Making Profiling Compatible with other Signals

Profiles are particularly useful in the context of other signals. For example, having a profile for a particular “slow” span in a trace yields more actionable information than simply knowing that the span was slow. The nature of profiling also offers the potential for a more “hands-off” approach to adding it to code, compared to the manual instrumentation needed for other signals.

OpenTelemetry will define how profiles will be correlated with logs, traces, and metrics and how this correlation information will be stored.

Correlation will work across two major dimensions:

- To correlate telemetry emitted for the same request (also known as request or trace context correlation); a sketch of one way this can be done today follows this list
- To correlate telemetry emitted from the same source (also known as resource context correlation)
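
The sketch below shows one way trace context correlation can already be approximated in Go: tagging CPU profile samples with the active trace and span IDs via pprof labels. The label keys `trace_id` and `span_id` are illustrative assumptions; OpenTelemetry has not yet defined how this correlation information will be represented or stored.

```go
// A sketch of trace context correlation using Go's pprof labels.
// The label keys are hypothetical; the eventual OpenTelemetry mechanism
// may differ.
package server

import (
	"context"
	"runtime/pprof"

	"go.opentelemetry.io/otel"
)

func handleRequest(ctx context.Context) {
	ctx, span := otel.Tracer("example").Start(ctx, "handleRequest")
	defer span.End()

	sc := span.SpanContext()
	labels := pprof.Labels(
		"trace_id", sc.TraceID().String(),
		"span_id", sc.SpanID().String(),
	)
	// While a CPU profile is being collected, samples taken inside doWork
	// carry the trace and span IDs, so the resulting profile can later be
	// joined with the corresponding "slow" span.
	pprof.Do(ctx, labels, func(ctx context.Context) {
		doWork(ctx)
	})
}

func doWork(ctx context.Context) { /* CPU-intensive request handling */ }
```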

## Standardize profiling data model for industry-wide sharing and reuse

We will design a profiling data model that will aim to represent the vast majority of profiling data, with the following goals in mind (a hypothetical sketch of such a model follows the list):

- Profile format should be as compact as possible
- Profiling data should be transferred as efficiently as possible, and the model should be lossless, with an intentional bias towards enabling efficient marshaling, transcoding, and analysis
- When needed, existing profiling formats (e.g. collapsed, pprof, JFR) should be able to be unambiguously mapped to the standardized data model
- The data model should provide minimal, terse components that show relationships with other telemetry signals, for example linking call stacks with spans
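
To make these goals concrete, the sketch below is a hypothetical, simplified shape such a data model could take, loosely following pprof's separation of samples from a deduplicated location table so that collapsed, pprof, or JFR data could be mapped onto it. All type and field names are illustrative; this is not the actual OpenTelemetry data model.

```go
// Hypothetical sketch only; not the OpenTelemetry profiling data model.
// Samples reference a deduplicated location table by index, keeping the
// format compact while remaining lossless and straightforward to map onto
// from existing formats such as collapsed, pprof, or JFR.
package profilemodel

type Profile struct {
	SampleType string     // e.g. "cpu", "alloc_space"
	Samples    []Sample
	Locations  []Location // deduplicated frame table, referenced by index
}

type Sample struct {
	LocationIndexes []int32 // call stack, leaf first, as indexes into Locations
	Value           int64   // e.g. nanoseconds of CPU time or bytes allocated
	// Minimal components linking the sample to other telemetry signals,
	// for example the span that was active when the samples were taken.
	TraceID string
	SpanID  string
}

type Location struct {
	FunctionName string
	FileName     string
	Line         int64
}
```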

## Supporting Legacy profiling formats

For existing profilers we will provide instructions on how these legacy formats can emit profiles in a manner that makes them compatible with OpenTelemetry’s approach and enables telemetry data correlation.

Particularly for popular profilers, such as the ones native to Go (pprof) and Java (JFR), we will help make them produce OpenTelemetry-compatible profiles with minimal overhead.
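
As a rough sketch of what such support could build on, the example below parses an existing pprof profile with the `github.com/google/pprof/profile` package. The `ConvertPprof` function and the idea of mapping each sample onto a standardized model are assumptions for illustration; no OpenTelemetry conversion API exists yet.

```go
// Hypothetical converter sketch: reads a legacy pprof profile and walks its
// samples, which is where a mapping onto a standardized OpenTelemetry model
// would happen once that model is defined.
package convert

import (
	"fmt"
	"os"

	"github.com/google/pprof/profile"
)

// ConvertPprof is an illustrative name; no such OpenTelemetry API exists yet.
func ConvertPprof(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	p, err := profile.Parse(f)
	if err != nil {
		return fmt.Errorf("parsing pprof profile: %w", err)
	}

	for _, s := range p.Sample {
		// Each pprof sample already carries a resolved call stack and values,
		// so a lossless mapping to a standardized model is feasible here.
		_ = s.Location
		_ = s.Value
	}
	return nil
}
```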

## Performance considerations

Profiling agents can be architected in a variety of ways, with reasonable trade-offs that may impact performance, completeness, accuracy, and so on. Similarly, the manner in which such a profiler might produce or consume OpenTelemetry-compatible data could vary significantly. As such, it is not feasible for our standardization effort to be prescriptive on the matter of resource usage for profilers.

However, the output of OpenTelemetry's standardization effort must take into account that some existing profilers are designed to be low overhead and high performance. For example, they may operate in a whole-datacenter, always-on manner, and/or in environments where they must guarantee low CPU/RAM/network usage. The standardization effort should strive to produce a format that is usable by profilers of this nature without sacrificing their guarantees.

As with other OTel signals, we target production environments. Thus, the profiling signal must be implementable with low overhead and must conform to OTel-wide requirements on runtime overhead, intrusiveness, and wire data size.

## Promoting Cloud-Native best practices with Profiling

The CNCF’s mission states:
_Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds_

We will have best-in-class support for profiles emitted in cloud native environments (e.g. Kubernetes, serverless), including legacy applications running in those environments. As we aim to achieve this goal, we will center our efforts around making profiling applications resilient, manageable, and observable. This is in line with the CNCF and OTel missions and will allow us to expand and leverage those communities to advance the respective missions.

## Profiling use cases

- Understanding what code is responsible for consuming resources (e.g. CPU, RAM, disk, network)
- Planning for resource allotment for a group of services running in production
- Comparing profiles of different versions of code to understand how code has improved or degraded over time
- Detecting frequently used and "dead" code in production
- Breaking a trace span into code-level granularity to understand the performance for that particular unit