[Donation Proposal]: Continuous Profiling Agent #1918

Closed
AlexanderWert opened this issue Jan 30, 2024 · 25 comments

Comments

@AlexanderWert
Member

AlexanderWert commented Jan 30, 2024

Description

Elastic would like to offer the donation of the Elastic profiling agent to the OpenTelemetry project.

The Elastic profiling agent is a mature, eBPF-based, multi-runtime/multi-language CPU profiler. It enables fleet-wide and system-wide continuous profiling without the need for any application instrumentation or even an application restart. It's currently, to the best of our knowledge, the only existing continuous profiler with no instrumentation/process restart for a broad range of real-world languages. The agent has been used in large-scale customer production environments since August 2021.

Some of the core features and strengths of the agent are:

  • Very low CPU and memory overhead (1% CPU and 250MB memory are our upper limits in testing and the agent typically manages to stay way below that)
  • Support for native C/C++ executables without the need for DWARF debug information (by leveraging .eh_frame data as described in US11604718B1)
  • Support for profiling system libraries without frame pointers and without debug symbols on the host.
  • Support for mixed stacktraces between runtimes - stacktraces go from Kernel space through unmodified system libraries all the way into high-level languages.
  • Support for native code (C/C++, Rust, Zig, Go, etc. without debug symbols on host)
  • Support for a broad set of HLLs (HotSpot JVM, Python, Ruby, PHP, Node.js, V8, Perl); .NET is in preparation.
  • 100% non-intrusive: there's no need to load agents or libraries into the processes that are being profiled.
  • No need for any reconfiguration, instrumentation or restarts of HLL interpreters and VMs: the agent supports unwinding each of the supported languages in the default configuration.
  • ARM64 support for all unwinders except NodeJS.
  • Support for native inline frames, which provide insights into compiler optimizations and offer a higher precision of function call chains.
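
To illustrate the idea behind unwinding without frame pointers, here is a minimal sketch. It is not the agent's implementation. It assumes a simplified x86-64 model in which each code range has a single rule of the form CFA = SP + offset (the kind of information .eh_frame encodes) and the return address sits 8 bytes below the CFA. The names `UnwindRule`, `unwind`, and the `memory` dictionary standing in for the stack are hypothetical.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class UnwindRule:
    start_pc: int    # first address covered by this rule
    cfa_offset: int  # CFA = SP + cfa_offset for any PC in this range


def unwind(pc, sp, rules, memory, max_frames=64):
    """Walk the stack using per-PC CFA rules instead of frame pointers.

    `rules` must be sorted by start_pc; `memory` maps stack addresses to
    the 8-byte values stored there (a stand-in for reading process memory).
    """
    starts = [r.start_pc for r in rules]
    frames = []
    while len(frames) < max_frames:
        frames.append(pc)
        idx = bisect.bisect_right(starts, pc) - 1
        if idx < 0:
            break  # no rule covers this PC
        cfa = sp + rules[idx].cfa_offset
        ret = memory.get(cfa - 8)  # return address sits just below the CFA
        if not ret:
            break  # reached the outermost frame
        pc, sp = ret, cfa  # the caller's SP is the callee's CFA
    return frames


# Three hypothetical functions at 0x1000, 0x2000 and 0x3000.
rules = [UnwindRule(0x1000, 8), UnwindRule(0x2000, 24), UnwindRule(0x3000, 16)]
# Simulated stack: return addresses stored at CFA - 8 for each frame.
memory = {0x7008: 0x2010, 0x7020: 0x1008}
print([hex(f) for f in unwind(0x3004, 0x7000, rules, memory)])
```

Real unwind tables also carry rules for other registers and more complex expressions; the sketch keeps only the part needed to show why frame pointers are not required.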

Benefits to the OpenTelemetry community

The donation of the Elastic profiling agent would fill the gap in OpenTelemetry's component landscape/architecture with a mature, feature-rich and efficient profiling solution. With that, cutting-edge technologies in eBPF and profiling would become a standard through OpenTelemetry for collecting in-production profiling data. Collecting profiling data with OpenTelemetry across a broad range of languages/technologies would come with a frictionless deployment experience.

OpenTelemetry users will get a solution for continuous profiling in production with all the previously mentioned core features and strengths. In addition, there is existing work on correlating profiling data from this profiling agent with the tracing data from the OTel Java Agent / SDK. This lays the basis for achieving cross-signal correlation for profiling data (beyond the pure correlation through resource attributes).

In addition, with this donation, the OpenTelemetry community would gain a team of profiling domain experts to (co-)maintain and advance OpenTelemetry's profiling story.

Reasons for donation

Elastic is dedicated to OpenTelemetry's vision to make it the single, ubiquitous standard and framework for Observability. With this donation, we strive to help OpenTelemetry to successfully expand into the area of continuous profiling by donating one of the industry's leading profiling agents to OpenTelemetry's ecosystem.

Repository

Currently, the code of Elastic's profiling agent is closed source. However, we are already in contact with the TC and working on providing access to the code for review and evaluation of the donation proposal in the next few weeks.

Existing usage

The agent has been used in large-scale production environments by hundreds of customers, on thousands of nodes, since August 2021 without notable incidents.

Maintenance

In case of a successful donation, Elastic's profiling team is committed to further (co-)maintaining and evolving OTel's profiling agent and to helping drive its adoption as an industry standard for collecting profiling data in production.

Licenses

Currently, the code of Elastic's profiling agent is closed source. However, we are already in contact with the TC and working on providing access to the code for review and evaluation of the donation proposal in the next few weeks.

In case of the acceptance of this donation proposal Elastic commits to fully open-source the profiling agent under the Apache 2.0 license.

Trademarks

The name "Elastic Universal Profiling" currently appears in the codebase, but we do not intend to donate any code that includes the name "Elastic" or "Elastic Universal Profiling". We will make sure that the name "Elastic" is removed from the codebase if the donation is accepted.

Other notes

Roadmap as part of / after the donation

We are aware that the current code base of the agent will require a few additional features to make the profiling agent usable in the context of OpenTelemetry. However, the above-mentioned strengths of the agent by far outweigh the remaining, manageable effort to make the agent OTel compliant. Also, if the donation proposal develops positively, Elastic commits to driving this roadmap and working on the following aspects as soon as possible, to make the profiling agent OTel compliant:

  • Implement the OTLP profiling protocol (This work is already in progress and is a matter of weeks)
  • Reporting meta information (e.g. host/container metadata) as resource attributes
  • Code cleanup, removal of bits not relevant to OTel

We are not aware of any technical blockers or difficulties in achieving the above.

Additional (potential) future work and contributions are:

  • Modularization of the agent and introduction of extension points (e.g. for exporters, etc.)
  • Contributing a specification and tooling for the symbolization service to the OTel specification
  • Further features are possible in the future, such as off-CPU profiling, memory profiling or network profiling

Examples of produced traces

Figure: flamegraph of a mixed native and interpreter workload (native code, Java, Python, and Ruby)


Figure: flamegraph excerpt of Java calling into libc.so which in turn does a syscall into the kernel.

Due Diligence Document

https://docs.google.com/document/d/1ro8qKlAOrxqHYE3YmfUBtiATrS7r267lg9aNdDS6Pps/edit

@brancz

brancz commented Jan 30, 2024

This is awesome!

It's currently, to the best of our knowledge, the only existing continuous profiler with no instrumentation/process restart for a broad range of real-world languages.

The apache2 licensed project, Parca Agent, which we work on at Polar Signals, comes pretty close I would say 😉 .

I'm much in favor of a neutral ground so we don't all have to keep reinventing the same things! Not writing a blank check of course, but we'd strongly think about merging the Parca Agent project with this including people to work on it and maintain it. For what it's worth we have a pretty well-thought-through protocol that is already open source for uploading debuginfos.

@kakkoyun

I'm much in favor of a neutral ground so we don't all have to keep reinventing the same things! Not writing a blank check of course, but we'd strongly think about merging the Parca Agent project with this including people to work on it and maintain it.

I couldn't agree more. I can't wait to contribute 🎉

@mtwo
Member

mtwo commented Jan 30, 2024

@AlexanderWert cool! Does this donation come with people who will continue working on it?

@iogbole
Contributor

iogbole commented Jan 30, 2024

@AlexanderWert cool! Does this donation come with people who will continue working on it?

Yes, it will. See

with this donation, the OpenTelemetry community would gain a team of profiling domain experts to (co-)maintain

@mtwo
Member

mtwo commented Jan 30, 2024

Read it twice and I somehow missed that both times, thanks!

@rektide

rektide commented Jan 30, 2024

ARM64 support for all unwinders except NodeJS.

Just curious, is this omission owing to anything in specific? How addressable is it or is there something obstructing implementation?

Would the instrumentation for node.js be at all useful as a starting point for Deno or Chrome (which also use v8), or is there more to it?

Thanks for this donation.

@athre0z
Member

athre0z commented Jan 30, 2024

Just curious, is this omission owing to anything in specific? How addressable is it or is there something obstructing implementation?

Not really, we just haven't gotten around to porting it, yet. :) We definitely expect this to be doable.

I'm much in favor of a neutral ground so we don't all have to keep reinventing the same things! Not writing a blank check of course, but we'd strongly think about merging the Parca Agent project with this including people to work on it and maintain it.

Sounds great to me -- I think our whole team will agree that we'd love to collaborate!

@petethepig
Member

This is fantastic news!

At Profiling SIG we're approaching a significant milestone with the impending merge of the OTEP that introduces profiling support to OTLP — really excited about seeing real-world implementations of it soon.

@austinlparker
Member

Question; what's the projected LOE to add support for other key OpenTelemetry languages (e.g., .NET)? Is this something that was roadmapped prior to the donation?

@athre0z
Member

athre0z commented Jan 31, 2024

Question; what's the projected LOE to add support for other key OpenTelemetry languages (e.g., .NET)? Is this something that was roadmapped prior to the donation?

It very much depends on the complexity of the language runtime. The bulk of the work in adding new language unwinders is in researching how exactly the interpreter or VM works internally. In particular, we need to understand execution flow through native, interpreted and JITed code and then develop methodology to reliably extract all the meta-data from process memory that is required to efficiently unwind and symbolize stack traces.
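
As a rough illustration of the symbolization step mentioned above, the following sketch maps raw addresses to function names using a sorted symbol table. It is not how the agent does it: the function `build_symbolizer` and the `(start, size, name)` tuple layout are assumptions made for this example, and the hard part for interpreted and JITed code, extracting such metadata from process memory, is left out entirely.

```python
import bisect


def build_symbolizer(symbols):
    """Return a function that maps an address to 'name+0xoffset'.

    `symbols` is an iterable of (start_address, size, name) tuples,
    a hypothetical stand-in for metadata recovered from a runtime.
    """
    table = sorted(symbols)
    starts = [start for start, _, _ in table]

    def symbolize(addr):
        # Find the last symbol whose start address is <= addr.
        idx = bisect.bisect_right(starts, addr) - 1
        if idx >= 0:
            start, size, name = table[idx]
            if addr < start + size:
                return f"{name}+0x{addr - start:x}"
        return f"0x{addr:x}"  # unknown address: keep it raw

    return symbolize


symbolize = build_symbolizer([
    (0x2000, 0x40, "handle_request"),
    (0x1000, 0x20, "main"),
])
print([symbolize(a) for a in (0x1008, 0x2010, 0x1020)])
```

The binary search keeps each lookup logarithmic in the number of symbols, which matters when many stack frames have to be resolved.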

For languages with a relatively simple interpreter (e.g. Python), this is a matter of days for a prototype and weeks to get it to production quality. For highly complex runtimes like JVM/HotSpot or .NET, it's more in the realm of months.

For .NET Core specifically, we already started working on that a while ago and it is making good progress: a draft PR was opened internally just a few days ago. Can't make any promises, but it appears likely that this would have reached completion by the time the proposal is accepted and the agent is actually donated.

@Kielek
Contributor

Kielek commented Jan 31, 2024

For .NET Core specifically, we already started working on that a while ago and it is making good progress: a draft PR was opened internally just a few days ago. Can't make any promises, but it appears likely that this would have reached completion by the time the proposal is accepted and the agent is actually donated.

@athre0z, for .NET we have the scaffolding of an in-process continuous profiler in place: open-telemetry/opentelemetry-dotnet-instrumentation#3196. The missing part is the exporter, due to the lack of a common protocol. I will be happy to look at and help with your solution when it is publicly available.

@brancz

brancz commented Jan 31, 2024

@Kielek the Elastic agent, much like Parca Agent, works by inspecting the profiled processes from the outside; no instrumentation or in-process components are necessary, so there's rather little overlap with the instrumented profiler work.

@iogbole
Contributor

iogbole commented Jan 31, 2024

This is awesome!
[...]
..but we'd strongly think about merging the Parca Agent project with this including people to work on it and maintain it.

@brancz the Elastic profiling team will be delighted to collaborate with Parca on this initiative. I believe we share a common goal, which is to help organizations optimize software efficiency. As evidenced by Elastic Universal Profiling testimonials, efficient software not only benefits businesses by reducing Cost of Goods Sold (COGS) but is also good for the planet 🌍, as it reduces carbon footprint.

A unified OTel (eBPF-based) continuous profiling agent is indeed a worthy cause to help everyone achieve hyperscaler efficiency.

@lizthegrey
Member

lizthegrey commented Jan 31, 2024

Some of the core features and strengths of the agent are:

  • Support for native C/C++ executables without the need for DWARF debug information (by leveraging .eh_frame data as described in US11604718B1)

Licenses

Currently, the code of Elastic's profiling agent is closed source. However, we are already in contact with the TC and working on providing access to the code for review and evaluation of the donation proposal in the next few weeks.

In case of the acceptance of this donation proposal Elastic commits to fully open-source the profiling agent under the Apache 2.0 license.

For abundance of clarity, if it is going to be Apache 2.0 licenced, I believe this means that a licence to use the mentioned patent is conveyed to those using/modifying the source code implementing that patent. But I am not a lawyer and this seems important to explicitly call out and verify. Don't want anyone stumbling into a patent trap!

@iogbole
Contributor

iogbole commented Jan 31, 2024

For abundance of clarity, if it is going to be Apache 2.0 licenced, I believe this means that a licence to use the mentioned patent is conveyed to those using/modifying the source code implementing that patent. But I am not a lawyer and this seems important to explicitly call out and verify. Don't want anyone stumbling into a patent trap!

@lizthegrey , the Apache 2.0 license addresses this concern. The license includes a specific patent clause 👇🏽 that grants users a free license to any related patents embodied in the software. This clause ensures that users can utilize, modify, and distribute the software without worrying about patent infringement claims from the contributors.

https://www.apache.org/licenses/LICENSE-2.0

  3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted.

@felixge
Member

felixge commented Feb 1, 2024

On behalf of Datadog I want to express our gratitude to the Elastic/Optimyze team for their contributions to continuous profiling in the observability industry, including this proposed donation.

As contributors to the Profiling SIG, we are very excited about this development and would be interested in joining the OpenTelemetry effort of evaluating the source code in order to explore the possibility of collaborations around this technology.

@tigrannajaryan
Member

The TC discussed this donation proposal.

One of the differences of this proposal from others we had in the past is that the profiling agent is closed source and the due diligence process will need to be performed while it is still closed source.

The TC made a decision that it does not prevent us from being able to do the evaluation provided that:

  • There is an understanding that once due diligence is completed the due diligence report will be made publicly available regardless of the outcome (whether the recommendation is positive or negative).
  • The due diligence participants will not be bound by any sort of confidentiality agreement.

The TC will further discuss this with the GC and will come back with an update on how we can move forward.

@mtwo
Member

mtwo commented Feb 1, 2024

Following up on the donation process after discussing with the GC and TC:

Donations to OpenTelemetry typically have a due diligence report included with them. This is written by OTel maintainers and community members who are experts about the particular area (profiling in this case). Here are examples for logging (related doc) and Android client instrumentation. These documents inform the GC and TC, who aren't always experts on the topic being evaluated.

The Profiling SIG will need to do the same to review the proposal from Elastic. Things that this report will want to evaluate include (this list is not exhaustive):

  • Does the donation fit with goals around profiling (given the comments on the donation, I assume yes, but the profiling SIG will still want to discuss this)?
  • Are there alternatives that are worth considering? These could be other open source projects, implementing this within OTel, or others.
  • What changes will need to be made for it to be accepted? Data model, protocol, interfaces, etc.
  • Would OTel still offer direct language instrumentation for profiling after accepting this? I'm guessing that the TC will have questions around the benefits of using eBPF vs. direct instrumentation, particularly for languages that already have profiling interfaces (this topic is more to inform people about the benefits of different mechanisms and our plans, rather than impacting our recommendation)
  • An overall recommendation

@tsloughter
Member

we'd strongly think about merging the Parca Agent project with this including people to work on it and maintain it.

This would be awesome. The Elastic profiler does not support the use of perf data for profiling, so, unlike Parca, it does not support Erlang. Not that this is the only way to support more languages like Erlang, but it is the only way I see it happening, since it isn't a high-demand ask :).

@syepes

syepes commented Mar 19, 2024

+1

@juergen-walter

Is there an expected date by which the donation will be finished (i.e., when it can be consumed)?
Is there some way to follow the progress or see timelines?

@AlexanderWert
Member Author

AlexanderWert commented Mar 27, 2024

Hi Jürgen 👋 ,

the profiling agent / donation is currently under review by a group of volunteers from the Profiling SIG.
Once that is done, there will be a broader review, and we will start working on more concrete plans for how to integrate it with the OTel ecosystem, address review comments, etc. Once the donation is officially accepted, we will start creating issues for the above-mentioned work; that's where you can follow the progress.

I would guess it will be a matter of several months until we have a first, fully integrated version of the profiling agent / functionality (depending on the review result, the concrete realization plans, etc.).

@mtwo
Member

mtwo commented May 16, 2024

The Profiling SIG has completed their due diligence, which is ready for TC and GC review. We strongly approve of this donation!

@jsuereth
Contributor

jsuereth commented Jun 5, 2024

The OpenTelemetry technical committee approves of this donation.

  • We see this contribution greatly accelerating the mission of observability from OpenTelemetry via Profiles.
  • We see the donation coming with dedicated contributors, which will reduce the burden of integrating with already overloaded OpenTelemetry components.
  • We see clear integration points and integration into the overall whole of OpenTelemetry.
  • We agree to next steps and acceptance criteria.

Looking forward to this new functionality!

@mtwo
Member

mtwo commented Jun 6, 2024

The OpenTelemetry Governance Committee approves of this donation, meaning that all required approvals are now complete. This can now become part of OpenTelemetry, thank you to Elastic for donating this and to everyone who participated in the review process!
