[Donation Proposal]: Continuous Profiling Agent #1918
Comments
This is awesome!
The Apache 2.0-licensed project, Parca Agent, which we work on at Polar Signals, comes pretty close I would say 😉. I'm much in favor of a neutral ground so we don't all have to keep reinventing the same things! Not writing a blank check of course, but we'd strongly consider merging the Parca Agent project with this, including people to work on it and maintain it. For what it's worth, we have a pretty well-thought-through protocol for uploading debuginfos that is already open source. |
I couldn't agree more. I can't wait to contribute 🎉 |
@AlexanderWert cool! Does this donation come with people who will continue working on it? |
Yes, it will. See
|
Read it twice and I somehow missed that both times, thanks! |
Just curious, is this omission owing to anything in particular? How addressable is it, or is there something obstructing implementation? Would the instrumentation for Node.js be at all useful as a starting point for Deno or Chrome (which also use V8), or is there more to it? Thanks for this donation. |
Not really, we just haven't gotten around to porting it, yet. :) We definitely expect this to be doable.
Sounds great to me -- I think our whole team will agree that we'd love to collaborate! |
This is fantastic news! At Profiling SIG we're approaching a significant milestone with the impending merge of the OTEP that introduces profiling support to OTLP — really excited about seeing real-world implementations of it soon. |
Question: what's the projected level of effort (LOE) to add support for other key OpenTelemetry languages (e.g., .NET)? Is this something that was roadmapped prior to the donation? |
It very much depends on the complexity of the language runtime. The bulk of the work in adding new language unwinders is in researching how exactly the interpreter or VM works internally. In particular, we need to understand execution flow through native, interpreted and JITed code and then develop a methodology to reliably extract all the metadata from process memory that is required to efficiently unwind and symbolize stack traces. For languages with a relatively simple interpreter (e.g. Python), this is a matter of days for a prototype and weeks to get it to production quality. For highly complex runtimes like JVM/HotSpot or .NET, it's more in the realm of months. For .NET Core specifically, we already started working on that a while ago and it is making good progress: a draft PR was opened internally just a few days ago. Can't make any promises, but it appears likely that this will have reached completion by the time the proposal is accepted and the agent is actually donated. |
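To make the "unwind and symbolize" step concrete, here is a minimal Python sketch. It is not the agent's implementation. It assumes a hypothetical, hard-coded symbol table (`SYMBOLS`) and made-up addresses, whereas a real unwinder would have to extract this metadata from process memory. Each address is resolved to the function with the closest start address at or below it, and identical stacks are then folded into counts, which is the usual input for a flamegraph.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical symbol table: (start_address, function_name), sorted by address.
# All names and addresses here are made up for illustration.
SYMBOLS = [
    (0x1000, "main"),
    (0x1400, "parse_request"),
    (0x1A00, "handle_request"),
    (0x2000, "write_response"),
]
STARTS = [start for start, _ in SYMBOLS]


def symbolize(address):
    """Map an address to the function whose start is the closest at or below it.

    This sketch has no function sizes, so any address above the last start
    is attributed to the last function.
    """
    index = bisect_right(STARTS, address) - 1
    if index < 0:
        return f"<unknown 0x{address:x}>"
    return SYMBOLS[index][1]


def fold(samples):
    """Count identical symbolized stacks, written root frame first."""
    counts = Counter()
    for frames in samples:
        # Samples list the innermost frame first; reverse to root-first order.
        names = [symbolize(address) for address in reversed(frames)]
        counts[";".join(names)] += 1
    return counts


if __name__ == "__main__":
    samples = [
        [0x1A10, 0x1008],
        [0x1A20, 0x1010],
        [0x2004, 0x1A30, 0x1008],
    ]
    for stack, count in fold(samples).items():
        print(stack, count)
```

The binary search keeps each lookup logarithmic in the number of symbols, which matters when a profiler resolves many frames per second. The hard part described in the comment above is producing a reliable table like `SYMBOLS` for interpreted and JITed code in the first place.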
@athre0z, for .NET we have scaffolding for an in-process continuous profiler in place: open-telemetry/opentelemetry-dotnet-instrumentation#3196. The missing part is the exporter, due to the lack of a common protocol. I will be happy to look at and help with your solution when it is publicly available. |
@Kielek the Elastic agent, much like Parca Agent, works by inspecting the profiled processes from the outside. No instrumentation or in-process components are necessary, so there's rather little overlap with the instrumented profiler work. |
@brancz the Elastic profiling team will be delighted to collaborate with Parca on this initiative. I believe we share a common goal, which is to help organizations optimize software efficiency. As evidenced by Elastic Universal Profiling testimonials, efficient software not only benefits businesses by reducing Cost of Goods Sold (COGS) but is also good for the planet 🌍, as it reduces carbon footprint. A unified OTel (eBPF-based) continuous profiling agent is indeed a worthy cause to help everyone achieve hyperscaler efficiency. |
For abundance of clarity, if it is going to be Apache 2.0 licenced, I believe this means that a licence to use the mentioned patent is conveyed to those using/modifying the source code implementing that patent. But I am not a lawyer and this seems important to explicitly call out and verify. Don't want anyone stumbling into a patent trap! |
@lizthegrey, the Apache 2.0 license addresses this concern. The license includes a specific patent clause 👇🏽 that grants users a free license to any related patents embodied in the software. This clause ensures that users can utilize, modify, and distribute the software without worrying about patent infringement claims from the contributors. https://www.apache.org/licenses/LICENSE-2.0
|
On behalf of Datadog I want to express our gratitude to the Elastic/Optimyze team for their contributions to continuous profiling in the observability industry, including this proposed donation. As contributors to the Profiling SIG, we are very excited about this development and would be interested in joining the OpenTelemetry effort of evaluating the source code in order to explore the possibility of collaborations around this technology. |
The TC discussed this donation proposal. One of the differences of this proposal from others we had in the past is that the profiling agent is closed source and the due diligence process will need to be performed while it is still closed source. The TC made a decision that it does not prevent us from being able to do the evaluation provided that:
The TC will further discuss this with the GC and will come back with an update on how we can move forward. |
Following up on the donation process after discussing with the GC and TC: Donations to OpenTelemetry typically have a due diligence report included with them. This is written by OTel maintainers and community members who are experts about the particular area (profiling in this case). Here are examples for logging (related doc) and Android client instrumentation. These documents inform the GC and TC, who aren't always experts on the topic being evaluated. The Profiling SIG will need to do the same to review the proposal from Elastic. Things that this report will want to evaluate include (this list is not exhaustive):
|
This would be awesome. The Elastic profiler does not support the use of perf data for profiling, so, unlike Parca, it does not support Erlang. Not that this is the only way to support more languages like Erlang, but it is the only way I see it happening, since it isn't a high-demand ask :). |
+1 |
Is there an expected date by which the donation will be finished (i.e. when the donated agent can be consumed)? |
Hi Jürgen 👋 , the profiling agent / donation is currently under review by a group of volunteers from the Profiling SIG. I would guess this to be rather a matter of several months until we have a first, fully integrated version of the profiling agent / functionality (depending on the review result, the concrete realization plans, etc.). |
The Profiling SIG has completed their due diligence, which is ready for TC and GC review. We strongly approve of this donation! |
The OpenTelemetry technical committee approves of this donation.
Looking forward to this new functionality! |
The OpenTelemetry Governance Committee approves of this donation, meaning that all required approvals are now complete. This can now become part of OpenTelemetry, thank you to Elastic for donating this and to everyone who participated in the review process! |
Description
Elastic would like to offer the donation of the Elastic profiling agent to the OpenTelemetry project.
The Elastic profiling agent is a mature, eBPF-based, multi-runtime/multi-language CPU profiler. It enables fleet-wide and system-wide continuous profiling without the need for any application instrumentation or even an application restart. To the best of our knowledge, it is currently the only continuous profiler that requires neither instrumentation nor a process restart across a broad range of real-world languages. The agent has been used in real large-scale customer production environments since August 2021.
Some of the core features and strengths of the agent are:
[...] .eh_frame data as described in US11604718B1)
[...] inline frames, which provide insights into compiler optimizations and offer a higher precision of function call chains.
Benefits to the OpenTelemetry community
The donation of the Elastic profiling agent would fill the gap in OpenTelemetry's component landscape/architecture with a mature, feature-rich and efficient profiling solution. With that, cutting-edge technologies in eBPF and profiling would become a standard through OpenTelemetry for collecting in-production profiling data. Collecting profiling data with OpenTelemetry across a broad range of languages/technologies would come with a frictionless deployment experience.
OpenTelemetry users will get an in-production continuous profiling solution with all the previously mentioned core features and strengths. In addition, there is existing work on correlating profiling data from this profiling agent with the tracing data from the OTel Java Agent / SDK. This lays the groundwork for achieving cross-signal correlation for profiling data (beyond the pure correlation through resource attributes).
In addition, with this donation, the OpenTelemetry community would gain a team of profiling domain experts to (co-)maintain and advance OpenTelemetry's profiling story.
Reasons for donation
Elastic is dedicated to OpenTelemetry's vision to make it the single, ubiquitous standard and framework for Observability. With this donation, we strive to help OpenTelemetry to successfully expand into the area of continuous profiling by donating one of the industry's leading profiling agents to OpenTelemetry's ecosystem.
Repository
Currently, the code of Elastic's profiling agent is closed source. However, we are already in contact with the TC and working on providing access to the code for review and evaluation of the donation proposal in the next few weeks.
Existing usage
The agent has been running in the large-scale production environments of hundreds of customers, on thousands of nodes, since August 2021 without notable incidents.
Maintenance
If the donation is successful, Elastic's profiling team is committed to continuing to (co-)maintain and evolve OTel's profiling agent and to help drive its adoption as an industry standard for collecting profiling data in production.
Licenses
Currently, the code of Elastic's profiling agent is closed source. However, we are already in contact with the TC and working on providing access to the code for review and evaluation of the donation proposal in the next few weeks.
If this donation proposal is accepted, Elastic commits to fully open-sourcing the profiling agent under the Apache 2.0 license.
Trademarks
The name "Elastic Universal Profiling" currently appears in the codebase, but we do not intend to donate any code that includes the name "Elastic" or "Elastic Universal Profiling". We will make sure the name "Elastic" is removed from the codebase if the donation is accepted.
Other notes
Roadmap as part of / after the donation
We are aware that the current code base of the agent will require a few additional features to make the profiling agent usable in the context of OpenTelemetry. However, the above-mentioned strengths of the agent far outweigh the remaining, manageable effort to make the agent OTel compliant. Also, if the donation proposal progresses positively, Elastic commits to driving this roadmap and working on the following aspects as soon as possible, to make the profiling agent OTel compliant:
We are not aware of any technical blockers or difficulties in achieving the above.
Additional (potential) future work and contributions are:
Examples of produced traces
Figure: flamegraph of a mixed native and interpreter workload (native code, Java, Python, and Ruby)
Figure: flamegraph excerpt of Java calling into libc.so which in turn does a syscall into the kernel.
Due Diligence Document
https://docs.google.com/document/d/1ro8qKlAOrxqHYE3YmfUBtiATrS7r267lg9aNdDS6Pps/edit