Proposal: Adding profiling as a support event type #139
Comments
FYI, JFR is probably the top JVM profiling tool, as it's built-in to the JVM these days. |
That is awesome to know @jkwatson :D I don't often work with JVM based languages but I will 100% have a look :D |
@jkwatson @MovieStoreGuy The top profiling tool for JVM is async-profiler :) |
The docs on that are seriously out of date...they still reference JFR as a commercial product. I guess that's true if you're profiling java 7, but I don't wish that on anyone. |
What docs are out of date? |
well, now I can't find the ones I was just looking at, so /shrug. |
I agree with @jkwatson. I appreciate you bringing JVM tools to my attention, but that is not the focus of this proposal :) |
We're interested in being able to collect CPU, memory, contention and other profiles with OpenTelemetry and have representations of profiles in OTLP and support in the collector. We are currently also looking into existing data model alternatives such as pprof as an option given its wide use in open source and language support. We want to enable cases where we can use OpenTelemetry attributes to label profiles as well. pprof has support for labelling (an example can be seen at https://rakyll.org/profiler-labels/). As of today, it's very difficult for our users to enable profiling at a later time, especially in production. They need to add CodeGuru Profiler libraries, rebuild and redeploy. As more and more of them are linking OpenTelemetry for other telemetry collection, we want to enable cases where we can enable profile collection dynamically in runtime. This use case will require the OpenTelemetry client libraries to speak to the collector (or another control plane) to enable/disable collection. |
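The runtime enable/disable case described above could look roughly like the following in Go. This is only a sketch under stated assumptions: `ProfileToggle` and its methods are hypothetical names, the control plane that would call them is left out, and only the standard library's `runtime/pprof` CPU profiler is used.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"runtime/pprof"
	"sync"
)

// ProfileToggle starts and stops CPU profiling on demand.
// It is a hypothetical helper, not part of OpenTelemetry or pprof.
type ProfileToggle struct {
	mu      sync.Mutex
	running bool
	buf     bytes.Buffer
}

// Enable begins CPU profiling into an in-memory buffer.
func (p *ProfileToggle) Enable() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.running {
		return errors.New("profiling already enabled")
	}
	p.buf.Reset()
	if err := pprof.StartCPUProfile(&p.buf); err != nil {
		return err
	}
	p.running = true
	return nil
}

// Disable stops profiling and returns the pprof-encoded profile bytes.
func (p *ProfileToggle) Disable() ([]byte, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if !p.running {
		return nil, errors.New("profiling not enabled")
	}
	pprof.StopCPUProfile() // flushes the profile into p.buf
	p.running = false
	out := make([]byte, p.buf.Len())
	copy(out, p.buf.Bytes())
	return out, nil
}

func main() {
	var toggle ProfileToggle
	if err := toggle.Enable(); err != nil {
		fmt.Println("enable failed:", err)
		return
	}
	// ... application work would run here ...
	data, err := toggle.Disable()
	if err != nil {
		fmt.Println("disable failed:", err)
		return
	}
	fmt.Printf("collected %d bytes of profile data\n", len(data))
}
```

In a real deployment, `Enable` and `Disable` would be driven by whatever channel the client library uses to talk to the collector or control plane, and the returned bytes would be exported rather than printed.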
Not sure if this will help, but I thought I'd chip in with what we're doing at Datadog. For the continuous profiler (which is integrating with our tracer), we're using our own profiling libraries for most platforms, and our own agent using JFR on the JVM. For the JVM we've added our own profiling events for various different kinds of profiling (e.g. rate limited exception profiling). For non-JVM languages we're partly using pprof as the serialization format (some data doesn't fit well into the model, so it's currently an archive with multiple files in it). For the JVM we're using JFR for the serialization format. There are a few interesting initiatives for JFR in recent and upcoming versions - such as a new allocation profiler in JDK 16, and much faster stack trace capturing (I believe JDK 17). We (Datadog), are also considering contributing an all new, full process, proper CPU profiler and some neat new capabilities allowing you to, for example, easily implement your own dynamic wall clock profiler. |
It has been some time since I opened this, but I'd like to know how I could speed up anything that is required to make this part of the default otel offering. |
Hi @MovieStoreGuy. We're pretty heads down getting metrics and logs completed, as well as expanding and improving library instrumentation. There probably will not be a lot of bandwidth from the current community until these components are stable; apologies in advance, it will probably be slow going. However, profiling is definitely top priority after metrics and logs! If you, @thegreystone, and others are interested in contributing work towards this project, I would suggest the following steps, which any new signal would need to take:
If there is a group willing to put in the time to prototype, we can help by creating an OTel SIG for this work (a repo plus a Slack channel for discussion). But again, I'm concerned that the spec reviewers and language maintainers are fully committed, so there may not be a lot of bandwidth for review or assistance until we clear the deck. I hate saying "next year" but six months to complete metrics and the remaining current initiatives is probably realistic. If there are well thought out proposals and prototypes by then, it would definitely give this project a speed boost. :) |
We (owners of https://github.com/google/pprof repo) would be curious what it would take to standardize on the profile.proto as the wire format for profiling data in OTel. @thegreystone RE "some data doesn't fit well into the model, so it's currently an archive with multiple files in it" - do you mind elaborating on that? |
Is pprof being evaluated? It would be great to have a formal issue in the community repo. Ty! |
Hey @alolita , Which community repo are you referring to? |
@alolita do you mean this repository? https://github.com/open-telemetry/community If yes, could you point out which SIGs or teams to chime for this topic? |
Here's the donation process for contributing code.
No matter which process is in place, we should have a location where we collect documentation on:
|
I think I can help with this. I've just researched the ecosystem of profilers, profile data formats, data format converters, and profile analysis UIs: https://www.markhansen.co.nz/profilerpedia/. I'm probably missing a few, but I think I've covered most of the main ones. I hope this can be a useful starting point for the standardisation process. |
FYI, I've now made a website for Profilerpedia (it's not just a Google Sheet any more): https://profilerpedia.markhansen.co.nz/, and the site renders directed graphs of profilers, their data formats, the transitive closure of data formats you can convert to, and UIs that can read those formats. For example, the transitive set of profilers that are convertible to pprof (warning: huge graph, and some conversions are lossy): https://profilerpedia.markhansen.co.nz/formats/pprof/#converts-from-transitive |
For what it's worth: We've been running prodfiler's continuous profiling service for the last 15 months, and have collected extensive experience with the various footguns involved in collecting profiling data & how to make use of it. Would be more than happy to help share what we've learnt and what to watch out for or otherwise assist in the design process. A few things to keep in mind:
On (1): For a good user experience, it is often necessary for users to drill down into fine-grained profiling event data, which means filtering profiling events by things like container, thread, and timeframes. This ends up creating problems when the data is pre-aggregated too early at too coarse a granularity. The ideal format for the recipient is actual individual sampling events. This ideal format then needs to be balanced with other requirements. (2) It's important to be careful about data volume. Given that the ideal format sends individual samples, and given that one wants to sample at anywhere between 20Hz and 200Hz per core, we are looking at 20 * 2^6 to 200 * 2^6 events per second (1,280 to 12,800) in the worst case on a 64-core server. This means that sending out full stack traces for each event quickly becomes prohibitive: a Java method name can easily have 32-64 characters, and a deep Java stack can be 128+ frames. So if we look at:
We ended up solving this by not transmitting full stack traces, and just hashes of traces, which reduces the amount of data dramatically. Happy to help & provide more input! |
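To make the volume argument concrete, here is a small Go sketch of the arithmetic and of the "send a hash instead of the full trace" idea. It is an illustration only: the comment above does not describe prodfiler's actual scheme, so the FNV-1a hash, the `Deduper` type, and all function names are assumptions, and a production system would likely want a wider hash to keep collisions negligible.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// eventsPerSecond returns the worst-case sample rate for a host where
// every core is sampled at hz.
func eventsPerSecond(cores, hz int) int { return cores * hz }

// stackBytes estimates the size of one full stack trace sent as text.
func stackBytes(frames, charsPerFrame int) int { return frames * charsPerFrame }

// traceHash returns a 64-bit FNV-1a hash over the frames of a stack trace.
func traceHash(frames []string) uint64 {
	h := fnv.New64a()
	for _, f := range frames {
		h.Write([]byte(f))
		h.Write([]byte{0}) // separator between frame names
	}
	return h.Sum64()
}

// Deduper remembers which traces have already been transmitted in full.
// It is a hypothetical type used only for this illustration.
type Deduper struct {
	seen map[uint64]bool
}

// NewDeduper returns an empty Deduper.
func NewDeduper() *Deduper { return &Deduper{seen: make(map[uint64]bool)} }

// Record returns the trace hash and whether the full trace still needs to
// be sent (true only the first time a given hash is seen).
func (d *Deduper) Record(frames []string) (hash uint64, sendFull bool) {
	hash = traceHash(frames)
	if d.seen[hash] {
		return hash, false
	}
	d.seen[hash] = true
	return hash, true
}

func main() {
	high := eventsPerSecond(64, 200) // 64 cores sampled at 200Hz
	perTrace := stackBytes(128, 64)  // 128 frames of 64 characters each
	fmt.Printf("%d events/s x %d bytes = %d bytes/s of raw stack text\n",
		high, perTrace, high*perTrace)

	d := NewDeduper()
	trace := []string{"com.example.Service.handle", "com.example.Db.query"}
	for i := 0; i < 3; i++ {
		hash, sendFull := d.Record(trace)
		fmt.Printf("sample %d: hash=%x sendFull=%v\n", i, hash, sendFull)
	}
}
```

With the upper-end numbers from the comment, that is 12,800 events per second at 8,192 bytes each, or roughly 100 MiB per second of stack text from a single host, whereas a repeated trace costs only 8 bytes once its hash is known to the receiver.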
Tagging @brancz, who should have an opinion or two about this. |
At Pyroscope we've been building an open source continuous profiling platform for over a year now. We integrate with many different profilers from various languages and other open source projects in our agents:
Since we've had to deal with supporting all these different formats of profiles in order to store them, we are also looking forward to an agreed upon standardized format for profiles — especially as more tooling gets created to analyze and interact with profiles. For example, we recently created an otelpyroscope package to link traces to profiles. Thanks to label support in pprof, this was really easy to implement. On the other hand, some agents report profiling data in a format that doesn't support "labels" which makes an integration like this impossible since labels are needed to link profiles to other types of telemetry data. Another example of where standardization would be useful is that to support Java profiles from async-profiler we had to write a JFR Parser in Go so that we can ingest profiles from async-profiler. Again, if all profilers were using (or at least supported) one output format it would have made this much easier. All that being said, every profiler on this list also has its own quirks and nuances in output formats which make supporting them all overly complicated compared to if they supported the same standardized format. Happy to help provide our thoughts and experience as we've gone through supporting many profiling formats across languages and projects and would love to help contribute to this effort. |
With the metrics effort hitting release candidate stage (yay!) we're hopefully approaching a period when reviewers have a bit more time available. However, that only matters if there is something to review... |
This is perfect timing! We discussed the project roadmap during the in-person community meeting at Kubecon last week, and profiling support was the second most popular topic, after logging (which is already in-flight)! The process that you mentioned (contributing seed work, writing requirements, forming a SIG) is what we used for logging, and I think that it makes sense to follow that here as well. Do people want to discuss this on a call sometime next week? Any objections to 8:00 AM PT on Friday, June 3rd? |
I've created a meeting in the OpenTelemetry calendar for 8:00 AM PT this Friday for us to meet! |
would be good to include eBPF tools like Pixie |
Hi all, as many of you know, there has been a working group of many people in this thread meeting to come up with a collective vision for profiling. A PR has been submitted detailing that vision and we'd love to get more feedback on it! Please check it out and comment if you have any feedback, or if you are generally in agreement we'd love to get more approvals from various community members who have expressed interest in this (even if you are not part of the OTel org)! |
This change proposes high-level items that define our long-term vision for Profiling support in OpenTelemetry project. A group of open source maintainers, vendors, end-users, and developers excited about profiling / otel have been meeting for ~8 weeks now and this document was created collectively in order to share with the broader community for feedback. We've had ~15 people contribute directly to the document and over 60 people who have attended our meetings over the past couple of months. This idea of "Adding profiling as a support event type" has also been discussed at length in [this issue](#139) created in October 2020. If this proposal is accepted/approved then we will proceed with filling out this [project tracking issue](open-telemetry/opentelemetry-specification#2731) and following the other procedures outlined in the [project management instructions](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/project-management.md). Comments, ideas, feedback, etc. are all very welcome and highly appreciated!
Are there any experiments or alpha tests around this subject? |
hi @gillg we are actively doing tests around this subject right now. You can follow the progress of our most recent benchmarks here, but yes we are definitely planning on something close to pprof. The majority of the discussion is happening in the #otel-profiles channel in the cncf slack. Would love to have you hop in and give your thoughts there! |
I guess this proposal: open-telemetry/community#1918 might fix this issue. |
@brunobat Came here to post the same thing. |
Closed by #239 |
Profiling events
There is a shifting concept that performance monitoring and application monitoring (the idea of tracking the time spent in functions and/or methods, vs how long it takes to serve a request) are near identical and come under the realm of Observability (understanding how your service is performing).
How is this different from tracing
Conventional tracing looks at showing the user's request flow through the application to show time spent in different operations. However, this can miss any background operations that indirectly impact the user request flow.
For example, if I take a rate limiting service that has a background sync to share state among other nodes:
In the above example, I can clearly see how the function ShouldRateLimit impacts the request's processing time, since the context used as part of the request can be used to link spans together. But there is a hidden cost here with SyncLimits that currently cannot be exposed, because it runs independently from inbound requests and thus cannot / should not share the same context.
Now, the SyncLimits function could implement metrics to help expose runtime performance issues, but that could be problematic due to:
Suggestion
At least within the golang community, https://github.com/google/pprof has been the leading tool for answering these kinds of questions, while also having first-party support within Go. Moreover, AWS also has its own solution, https://aws.amazon.com/codeguru/, which offers something similar for JVM based applications.
Desired outcomes of data:
Desired outcomes of orchestration:
I understand that software based profiling is not 100% accurate, as per the write-up here: https://go.googlesource.com/proposal/+/refs/changes/08/219508/2/design/36821-perf-counter-pprof.md. However, this could give an amazing insight into hidden application performance that could help increase reliability and performance, and uncover resource issues that are hard to find with the existing events being emitted.