
Metrics for System.Runtime #85372

Closed
JamesNK opened this issue Apr 26, 2023 · 53 comments · Fixed by #104680
Labels
area-System.Diagnostics.Metric enhancement Product code improvement that does NOT require public API changes/additions feature-request in-pr There is an active PR which will close this issue when it is merged

JamesNK (Member) commented Apr 26, 2023

Today there are event counters for System.Runtime: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/available-counters#systemruntime-counters

Metrics should be added to this area. Advantages:

  1. Metrics has new features such as tags (which allow for dimensions) and histograms.
  2. Metrics are easier for tests and libraries to listen to and consume with MeterListener (see the listener sketch after this list).
  3. Some libraries only support collecting custom counters that use System.Diagnostics.Metrics. For example, opentelemetry-dotnet.
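
A minimal sketch of point 2, assuming a hypothetical runtime meter named "System.Runtime" (the actual meter and instrument names are part of this proposal and not decided here):

```csharp
using System;
using System.Diagnostics.Metrics;

// Subscribe to instruments published by a meter with a well-known name and print measurements.
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "System.Runtime")
    {
        l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
    Console.WriteLine($"{instrument.Name} = {value}"));
listener.Start();
// For observable instruments, call listener.RecordObservableInstruments() periodically to poll.
```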

What instruments should we have?

  • Existing System.Runtime event counters provide a good starting place.
    • Some of these counters could be combined by using tags.
    • Or tags could provide more information. For example, should the exception-count counter include the exception type name as a tag? Then tooling can provide a breakdown of not just the total exception count but the exception count grouped by type (see the sketch after this list).
  • OpenTelemetry has specs with conventions for counters that a system provides. We should try to provide the same data. Note that counter and tag names don't need to match. We should use .NET conventions.
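
For illustration, an exception counter with a type tag could look roughly like this; the meter, instrument, and tag names below are placeholders, not a proposal:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

static class ExceptionMetrics
{
    private static readonly Meter s_meter = new("System.Runtime");
    private static readonly Counter<long> s_exceptionCount =
        s_meter.CreateCounter<long>("dotnet.exceptions", unit: "{exception}",
            description: "Number of exceptions thrown, grouped by type.");

    public static void Hook() =>
        AppDomain.CurrentDomain.FirstChanceException += (_, e) =>
            s_exceptionCount.Add(1,
                new KeyValuePair<string, object?>("exception.type", e.Exception.GetType().FullName));
}
```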
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Apr 26, 2023
JamesNK (Member, Author) commented Apr 26, 2023

tarekgh (Member) commented Apr 26, 2023

CC @reyang

reyang commented Apr 26, 2023

@JamesNK FYI in OpenTelemetry .NET we've implemented these:

Having these in future versions of the runtime would be awesome; the existing OpenTelemetry instrumentation libraries can do runtime detection and leverage the runtime instrumentation if it is there. Eventually, as old versions of the runtime get deprecated, we'll land in a better situation where we don't need a separate instrumentation library because things are "baked in".

JamesNK (Member, Author) commented Apr 26, 2023

I think OTel would still have something because .NET counters won't follow the OTel naming standard. However, the implementation should be very simple because built-in counters will provide all the information needed.

noahfalk (Member) commented:

@JamesNK is this work you were planning to pursue yourself, or are you just recording the request?

@tommcdon tommcdon added this to the 8.0.0 milestone Apr 26, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Apr 26, 2023
@tommcdon tommcdon added enhancement Product code improvement that does NOT require public API changes/additions untriaged New issue has not been triaged by the area owner and removed untriaged New issue has not been triaged by the area owner labels Apr 26, 2023
JamesNK (Member, Author) commented Apr 26, 2023

Recording the request.

@tarekgh tarekgh modified the milestones: 8.0.0, Future Apr 26, 2023
tarekgh (Member) commented Apr 26, 2023

@JamesNK I moved this to the Future milestone. Please let me know if there is strong demand to have this in .NET 8.0.

samsp-msft (Member) commented:

The main reason to implement these as metrics in .NET 8 is so that we can wean people off EventCounters and onto metrics instead. As these are the main process-wide counters, getting them converted would be a major signal toward that goal.

There are likely few counters that need many dimensions here, as most are process-wide. We should evaluate the work in comparison to the infrastructure needed to implement it.

omajid (Member) commented May 3, 2023

Hey folks! I am interested in trying to help implement this.

noahfalk (Member) commented May 5, 2023

Hi @omajid, glad to have help! I'm guessing that most of the work on this feature will be investigating design options and trying to get a consensus on the best design rather than writing the implementation code. If that is something you are interested in taking a stab at, that's great. If you are interested in having someone else work through the design first, that's fine too, but I don't necessarily know when that would occur.

If you did want to pursue the design part, these are the major questions that come to mind right now:

  1. Do we exactly duplicate the System.Runtime EventCounters for maximal compatibility, or are we going to intentionally make some changes?
  2. If we are making changes, what kind of changes?
    • Making use of new instrument types that weren't available before like Histogram?
    • Making use of tags which weren't available before?
    • Re-naming anything we believe was poorly named before?
    • Removing metrics that were potentially confusing?
    • Reorganizing what metrics are in which Meters?
  3. Is the Meter a static singleton, or are we somehow using the new DI Meter work to create per-DI-container instantiations?

My hunch is that, yes, some kinds of changes are going to be appealing, but we need to figure out the impact of different kinds of changes, whether there is anything we can do to make migration easier, and then decide which changes seem worthwhile. For (3), my guess is that we would make it a static singleton, but we need to figure out how that intersects with the DI Meter work and the new Meter config work, so there might be stalls where one design needs to wait for things to resolve in the other, or they have to be resolved simultaneously.

I think there is design inspiration we could take from https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Process. For changes/removals to existing counters there is also some past discussion here: #77530

So if all this sounds like something you still want to dive into I think a first step would be to create an initial proposal (in a gist or a PR'ed markdown doc) describing what instruments would be exposed. Thanks!

omajid (Member) commented May 5, 2023

Hey, @noahfalk! Thanks for the various links. These are great questions.

I have been prototyping an implementation and I came up with some similar questions (and some possible answers to what you asked). I wouldn't mind helping with the design, though I am not a runtime or OpenTelemetry expert. Advice from anyone more familiar with this is more than welcome.

I have been looking at OpenTelemetry's docs as a great starting point from which to evaluate design ideas.

Do we exactly duplicate the System.Runtime EventCounters for maximal compatibility, or are we going to intentionally make some changes?

I think if we are creating a Metrics based implementation for first-class support for OpenTelemetry, we should take advantage of that and provide similar (or additional) information, but in a way that is easier to consume and/or feels more natural for anyone looking to consume it via an OpenTelemetry-compatible tool.

The opentelemetry-dotnet-contrib docs almost match the existing EventCounters of System.Runtime, with a few differences.

If we are making changes, what kind of changes?

  • Making use of new instrument types that weren't available before like Histogram?

This isn't currently listed as something used in the OpenTelemetry docs, and isn't done in the EventCounter implementation either. So I think we can pass on this for a first stab? If we find some good use cases, we should consider adding Histograms for those.

  • Making use of tags which weren't available before?
  • Re-naming anything we believe was poorly named before?

Yes. In fact, I think we have to. Otherwise we provide OpenTelemetry-compatible metrics but violate all assumptions in the ecosystem, making things harder to parse and use. For example, all our metrics via EventCounters have a single name, but OpenTelemetry expects metrics to be namespaced via dots:

Associated metrics SHOULD be nested together in a hierarchy based on their usage. Define a top-level hierarchy for common metric categories: for OS metrics, like CPU and network; for app runtimes, like GC internals. Libraries and frameworks should nest their metrics into a hierarchy as well. This aids in discovery and adhoc comparison. This allows a user to find similar metrics given a certain metric.

There's also prior art in the form of https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Runtime which does a great job creating a hierarchy, in the form of process.runtime.dotnet.{gc,jit,memory,etc}.
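
For a concrete sense of that hierarchy, here is a rough sketch in the style of the contrib package; the instrument names only approximate what that library publishes and are illustrative, not the proposed .NET names:

```csharp
using System;
using System.Diagnostics.Metrics;

// Dotted, hierarchical instrument names in the style of OpenTelemetry.Instrumentation.Runtime.
var meter = new Meter("OpenTelemetry.Instrumentation.Runtime");

meter.CreateObservableCounter(
    "process.runtime.dotnet.gc.collections.count",
    () => GC.CollectionCount(2),          // simplified: the real library reports per-generation values
    unit: "{collections}",
    description: "Number of garbage collections since process start.");

meter.CreateObservableGauge(
    "process.runtime.dotnet.gc.heap.size",
    () => GC.GetTotalMemory(forceFullCollection: false),
    unit: "bytes",
    description: "Heap size, simplified here to the total managed heap.");
```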

Though I see that https://github.com/dotnet/runtime/pull/85447/files does things differently and I am not sure why.

Removing metrics that were potentially confusing?

I think we should? We aren't putting our users on a path to success by putting in things that are confusing or likely to be misinterpreted, especially in a new/fresh design.

I also think we should leave out various -rate (or -per-second) metrics. That's something the OpenTelemetry tooling should compute/handle from the raw data.

Reorganizing what metrics are in which Meters?

I hadn't really thought about this.

The current design (e.g., looking at the output of dotnet-counters) uses the Meter name for namespacing instead of the name of the Instrument. From skimming the OpenTelemetry docs, the convention appears to be to use Meter names for scope. As a non-expert, it's hard for me to understand the scope of, for example, System.Net.Http vs Microsoft-AspNetCore-Server-Kestrel vs System.Net.Sockets. There's a lot of overlap in terms of bytes/handshakes/connections/requests that all three seem to track to some extent.

Is the Meter a static singleton, or are we somehow using the new #77514 to create per-DI-container instantiations?

This shouldn't matter from a usage point of view, right? Could we make a static singleton for now and later switch to DI without breaking users?

noahfalk (Member) commented May 6, 2023

I wouldn't mind helping with the design, though I am not a runtime or OpenTelemetry expert. Advice from anyone more familiar with this is more than welcome.

No worries. I think it's fine to toss out ideas and then get feedback on them. If we need folks with certain areas of expertise, we'll try to find them. Ultimately, if no consensus forms and a contentious decision needs to be made, I can make it.

[Referring to histograms]
This isn't currently listed as something used in the OpenTelemetry docs, and isn't done in the EventCounter implementation either. So I think we can pass on this for a first stab?

If we add a histogram in the future where the Meter previously had no similar instrument defined, that seems straightforward and easy to postpone. What feels less straightforward would be adding a counter or gauge now, then later deciding it would have been better to define that instrument as a histogram. For example, we might propose an ObservableGauge that reports an average GC pause duration, then later think: oops, maybe that should have been a GC pause duration histogram instead.
So, yes, we could pass on it if we need to, as long as we are careful not to box ourselves into a corner.
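
To make the trade-off concrete, a rough sketch of the two shapes being compared (instrument names and the recording mechanism are hypothetical):

```csharp
using System.Diagnostics.Metrics;

var meter = new Meter("System.Runtime");

// Shape 1: an observable gauge polled by listeners, reporting e.g. an average pause value.
// Easy to add now, but awkward to replace with a histogram later without duplication.
meter.CreateObservableGauge<double>("dotnet.gc.pause.average",
    () => 0.0 /* placeholder for some hypothetical source of an average pause in ms */,
    unit: "ms");

// Shape 2: a histogram recorded once per GC pause. Count, sum, and percentiles can all be
// derived from it, which is why adding it later duplicates an existing counter or gauge.
Histogram<double> pauseDuration = meter.CreateHistogram<double>("dotnet.gc.pause.duration", unit: "ms");
// ...in hypothetical GC-pause reporting code:
// pauseDuration.Record(pauseMilliseconds);
```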

Otherwise we provide OpenTelemetry-compatible metrics but violate all assumptions in the ecosystem, making things harder to parse and use. For example, all our metrics via EventCounters have a single name but OpenTelemetry expects metrics to be namespaced via dots:

A common pattern that has arisen with the OTel work is that .NET will have some pre-existing convention or naming scheme, then OTel defines a new scheme that isn't consistent with it. No matter which one we choose, it will always be inconsistent with something: either inconsistent with OTel recommendations or inconsistent with a .NET developer's past experience of the platform. When this happens we try to make a judgement call about which behavior more .NET developers are going to prefer in the long run, and often we do wind up favoring .NET self-consistency over OTel consistency. The pattern we've landed on in other places (example) is that we stay consistent with .NET metric naming conventions rather than switching to OTel naming conventions. I'm expecting we'd do the same here. For folks who want something that conforms tightly to OTel naming and semantic conventions, the instrumentation packages from OTel, such as https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Runtime, better fill that role right now. I expect what we'll want to build fairly soon (but not as part of this PR) are mechanisms that make schema conversion very easy so that users can get the data into whatever shape they need.

There's also prior art in the form of https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Runtime

Whoops, that is actually the one I meant to link to above rather than Process. Glad you found it anyways :)

Though I see that https://github.com/dotnet/runtime/pull/85447/files does things differently and I am not sure why.

Above I mentioned how we can't be consistent with both past precedent and OTel, so we had to make a choice. As for why we chose this way, a few reasons:

  • For many .NET users, Meters are simply an updated API used for metrics. Although we designed the API in coordination with OpenTelemetry, that doesn't mean OpenTelemetry is the only scenario that makes use of them and users may not even be aware of OpenTelemetry. For example Meters can also be used with Prometheus.NET, dotnet-counters, and dotnet-monitor. So it isn't automatically going to follow for many users that changing an API also means all the past conventions need to change too.
  • The OTel semantic conventions are currently all marked Experimental, and they have been that way for a long time. Choosing to use an experimental standard in a platform that tries to maintain strong back-compat is risky because the specification could change after we've already shipped the experimental spec. If our built-in metrics follow a noticeably different convention, then users don't expect that the data uses the OTel schema and they know to look around for an OTel-convention instrumentation or a conversion option if that is what they need. However, if the built-in metrics look pretty close, users probably assume that they exactly implement the convention and get aggravated when more subtle discrepancies emerge.
  • As a platform we need to be able to add metrics without waiting for OTel to define a convention for a given scenario first. It's awkward if OTel adds a convention later, chooses different semantics than what we picked, and then developers are retroactively aggravated that the platform data matches OTel conventions in some places but not others. I can imagine versioning schemes might help resolve this scoping issue in the future, but as far as I am aware the OTel schema versioning mechanisms are also experimental right now.
  • Even though OTel is gaining a good deal of mind-share, there are a variety of other standards and conventions out there, including private ones used within a particular company or project. We expect that no matter what we start with there will be a desire for conversions for a long time to come, so we are better off leaving our default conventions as-is and investing in conversion mechanisms.
  • The OTel.NET project is providing instrumentation that follows OTel conventions closely so .NET devs that want that option do have it.

I also think we should leave out various -rate (or -per-second) metrics. That's something the OpenTelemetry tooling should compute/handle from the raw data.

Yeah, that sounds pretty reasonable to me as well.

Reorganizing what metrics are in which Meters?
I hadn't really thought about this.

I think the one place this came up in the past was in the discussion of GC-related metrics. One set of customers will ask for fairly detailed metrics in a specific area, but I worried that if we add too much to the runtime Meter it will be confusing for users who have simple needs. I think the direction things were going, though, is that we shouldn't worry too much about adding more detailed metrics as long as most users are seeing the metrics via dashboards and docs that can guide them in slowly, rather than seeing raw dumps of every available instrument.

This shouldn't matter from a usage point of view, right? Could we make a static singleton for now and later switch to DI without breaking users?

One area it might matter is with the Meter config work. For example, in logging there is no concept of a static singleton like static Logger s_logger = new Logger("System.Runtime"). This means that APIs like LoggingBuilder.AddFilter("System.Runtime", ...) only apply to ILoggers that were created from the associated ILoggerFactory and not to singletons that have no connection to the DI container. If there were an equivalent MetricsBuilder.AddFilter("System.Runtime", ...), it raises the questions: would it recognize the static singleton, would the singleton be ignored, or would a special overload or different API call be required for static things?
I'm guessing static singleton is the likely outcome, and this is probably orthogonal to most of the other design choices, so we could just assume that is the case for now. But at some point we'll need to nail it down.
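
A minimal sketch of the static-singleton shape being discussed, assuming a "System.Runtime" meter name and made-up instrument names (MetricsBuilder.AddFilter above is hypothetical; nothing here is a committed API):

```csharp
using System;
using System.Diagnostics.Metrics;

// A process-wide singleton Meter, analogous to a hypothetical static Logger singleton.
// No DI container knows about this instance, which is exactly the configuration question
// raised above: container-scoped filters have no natural way to address it.
internal static class RuntimeMetrics
{
    internal static readonly Meter Meter = new("System.Runtime");

    internal static readonly ObservableCounter<long> Gen0Collections =
        Meter.CreateObservableCounter<long>("dotnet.gc.collections.gen0",
            () => (long)GC.CollectionCount(0));
}
```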

samsp-msft (Member) commented:

A couple of questions:

  • Do we want to take a dependency on DI for basic counters? ASP.NET Core is all-in on DI, so it's not a concern for that workload, but is it something that all processes will be using?
  • About half of the existing "runtime" counters are for the GC. Should they be namespaced into something more specific for the GC, and maybe have one more general counter for the overall memory usage (the number that, once exceeded, will cause Kubernetes etc. to kill the process) in the main bucket? Is that the working set?
    • I'm thinking of when you go to something like App Insights and a hierarchy of metrics is shown: how do customers know which are the most important metrics to monitor to understand process health, and to base their load balancing etc. on?
    • It feels like some of these counters are more for diagnostics purposes than for proactive health monitoring. Should we be naming/namespacing them to make it clearer to users how the counters are intended to be used?
  • Which dimensions should be added to counters is a lot less obvious than for http/networking.
    • GC counters are global, and I don't think we want to start tracking the types for each allocation.
    • Exception count could be bucketed by the type of the exception, but I'd be worried about an explosion in dimension values.
    • Deeper diagnostics on what is on the threadpool queue feels like something that should be exposed through the debugger, rather than as dimensions.
    • JIT counters such as method-jitted-count could be dimensioned based on the assembly or namespace. I am wondering whether that would make it easier to understand cases where lambdas are causing constant re-JIT.

noahfalk (Member) commented:

Do we want to take a dependency on DI for basic counters?

Recently I've been assuming that the runtime counters will be a static singleton Meter defined in System.Diagnostics.DiagnosticSource, so no dependency on DI there. However, if you want to listen to it via the Meter config work, that would take a DI dependency. Other ways, such as MeterListener, OpenTelemetry, or external tools, do not require DI.

About half of the existing "runtime" counters are for the GC. Should they be namespaced into something more specific for the GC, and maybe have one more general counter for the overall memory usage

I think a good portion of the counters are the result of EventCounters compensating for not having dimensions. For example, there are five different heap size metrics (gen0, gen1, gen2, LOH, POH) that could probably be a single metric with a dimension (see the sketch below). I'd suggest we start with a design that doesn't explicitly namespace them, and if it still feels overwhelming, then we can think about how we'd split them.
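
A rough sketch of that single-metric-with-a-dimension shape, using GC.GetGCMemoryInfo() and made-up instrument/tag names purely for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

var meter = new Meter("System.Runtime");

// One observable instrument that reports a measurement per generation,
// instead of five separate gen0/gen1/gen2/LOH/POH instruments.
meter.CreateObservableUpDownCounter<long>("dotnet.gc.heap.size", () =>
{
    string[] generations = { "gen0", "gen1", "gen2", "loh", "poh" };
    GCMemoryInfo info = GC.GetGCMemoryInfo();
    var measurements = new List<Measurement<long>>();
    for (int i = 0; i < info.GenerationInfo.Length && i < generations.Length; i++)
    {
        measurements.Add(new Measurement<long>(
            info.GenerationInfo[i].SizeAfterBytes,
            new KeyValuePair<string, object?>("generation", generations[i])));
    }
    return measurements;
}, unit: "bytes");
```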

is that the working set?

Probably. I like OpenTelemetry's approach of separating these metrics as 'Process' metrics rather than 'Runtime'. These high-level stats like CPU-usage and VM-usage are measurements the OS tracks for all processes rather than anything specific to a language runtime like Java, Python, or .NET.

I'm thinking when you go to something like App Insights and a hierarchy of metrics is shown, how do customers know which are the most important metrics to monitor to understand the process health, and base their load balancing etc on.

I'm hoping we don't have a huge number of them and we can provide better guidance than we currently do. Today our docs mostly say "Here is what each counter measures." I think we should get to the point where the docs say "These are the counters we think are most useful for health monitoring, here is a default dashboard you can use, here is how you might use this data to start an investigation of different common problems..."

should we be naming/namespacing them to make it clearer to users on how counters are intended to be used?

I'm hoping we aren't going to have so many that more elaborate naming schemes are needed, but I certainly don't rule it out. I'd propose starting with something that looks like OTel's runtime metrics but using .NET's traditional naming conventions.

Which dimensions should be added to counters is a lot less obvious than for http/networking...

I think looking at what OpenTelemetry did with runtime metrics is a good starting point. I'm guessing we'd land somewhere quite similar.

MihaZupan (Member) commented:

From #79459 (comment):

Hey folks, we had another ask from the same team for an additional memory metric.

Ask: The ask is for visibility into the free space in each generation of the heap.

Context: We are sometimes seeing large heap footprints and are unable to determine if they are mostly free but for whatever reason uncompacted (transient allocation activity leaving large uncompacted holes), or if they are actually rooted and we need to intervene, reboot the node, and investigate a likely leak.

davidfowl (Member) commented:

cc @Maoni0 for the last metric

JamesNK (Member, Author) commented May 15, 2024

I've opened an initial PR, open-telemetry/semantic-conventions#1035, to propose adding experimental runtime metrics.

@stevejgordon I am wondering why we need to have this in OpenTelemetry? Why can't we list the detailed proposal here and then proceed with that?

I highly recommend getting input from OTEL experts, e.g. @lmolkova, on counters, tags and naming. It was a great help when putting together aspnetcore metrics.

Also, we should document the metrics on learn.microsoft.com and in the OTEL semantic conventions docs. With aspnetcore metrics there are lightweight docs on learn.microsoft.com, and links to detailed docs in the OTEL semantic conventions for people who want more detail.

noahfalk (Member) commented:

I mean tools like dotnet-counters can report the runtime metrics for apps that didn't initialize a listener?

dotnet-counters connects to the MetricsEventSource which uses a MeterListener internally to obtain the data. There shouldn't be any alternative path to get the data from arbitrary Meters (excluding truly shady approaches like private reflection).

I highly recommend getting input from OTEL experts

+1. I think that is the path we are already on by virtue of posting the sem-conv proposal in the OTel repo.

stevejgordon (Contributor) commented:

@noahfalk Yeah, I was heading in that direction myself, although I was hoping to avoid it if there was some clever way. One thing I did consider this morning was whether AppContext.Setup could also be used to trigger the initialisation. Using the MeterListener should work, though, as someone, whether that be an end user or the OTel SDK, will end up creating a listener for this if they care about observing the metrics.

noahfalk (Member) commented:

Btw @lmolkova is currently out, but she is scheduled to be back in a week. I'm glad to get other feedback but I do want to get her feedback specifically on this one :)

stevejgordon (Contributor) commented:

I'm out for two weeks starting on Monday, but I will keep an eye on these discussions. I'll continue playing with a POC to implement this, and once we have a design, we can determine what (if anything) needs to be prepared for API review, etc.

tarekgh (Member) commented May 15, 2024

@noahfalk @stevejgordon I'm looking at the PR open-telemetry/semantic-conventions#1035 and I see the proposal is missing at least three metrics compared to what we expose in https://learn.microsoft.com/en-us/dotnet/core/diagnostics/available-counters#systemruntime-counters.

  • Working Set (working-set) The number of megabytes of physical memory mapped to the process context at a point in time based on Environment.WorkingSet.
  • Gen 0 GC Budget based on GC.GetGenerationBudget(0)
  • % Time in GC since last GC based on GC.GetLastGCPercentTimeInGC().

https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/RuntimeEventSource.cs

Is it intentional we don't want to include these?

stevejgordon (Contributor) commented:

Is it intentional we don't want to include these?

@tarekgh, not really. @noahfalk suggested starting with the metrics exposed via the existing OTel contrib library, so I didn't review the runtime event source. We can consider proposing those, too, or they could be added later.

@noahfalk An alternative implementation I've been thinking about since last night is whether we should consider adding an EventListener for the GC events, which could be used to set most of the GC instruments. One advantage of this is that instead of observable instruments, we can switch to non-observable forms, with their values updated only after GC events. For the GC metrics, at least, I think most (maybe all) of them will only change after a GC occurs. Perhaps preferring events is slightly more efficient since we don't poll the various GC.xyz() methods. I've not dug deeply into this yet, but I'm throwing it out for discussion. Perhaps @Maoni0 has a view of the "best" way to collect and update GC metrics?
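
For illustration, a rough sketch of that EventListener idea; the provider name and GC keyword are the documented runtime provider and keyword, but the exact event names to match on, and the instruments to update, are assumptions:

```csharp
using System.Diagnostics.Tracing;

// Enable the runtime's GC events in-process and bump non-observable instruments when a GC finishes.
internal sealed class GcEventMetrics : EventListener
{
    private const EventKeywords GcKeyword = (EventKeywords)0x1; // GC keyword of the runtime provider

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
        {
            EnableEvents(eventSource, EventLevel.Informational, GcKeyword);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // Event names would need verifying against the runtime; filtering on a GC-end event here.
        if (eventData.EventName is not null && eventData.EventName.StartsWith("GCEnd"))
        {
            // e.g. update a Counter<long> or Histogram<double> here (hypothetical instruments).
        }
    }
}
```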

KalleOlaviNiemitalo commented:

Can those GC events be received by an EventListener? I'd imagine that the events are fired to ETW from the unmanaged part of the runtime, rather than by EventSource, and that makes them invisible to EventListener.

stevejgordon (Contributor) commented:

Can those GC events be received by an EventListener? I'd imagine that the events are fired to ETW from the unmanaged part of the runtime, rather than by EventSource, and that makes them invisible to EventListener.

@KalleOlaviNiemitalo, I believe they are piped through and can be observed as per this post from @Maoni0.

The reason I am considering this as an option is it opens the door to collecting GC duration and perhaps some other useful metrics if we base at least some of them on these richer events.

KalleOlaviNiemitalo commented:

I see; the events are buffered in unmanaged EventPipe code, and a thread in managed code pulls them via EventPipeInternal.GetNextEvent, so the runtime doesn't need to call managed code in the middle of garbage collection.

noahfalk (Member) commented:

Working Set (working-set) The number of megabytes of physical memory mapped to the process context at a point in time based on Environment.WorkingSet.

I would deliberately not include this one. OpenTelemetry includes WorkingSet, CPU, and other OS-level metrics in a separate group of process metrics. I think it's fine if we had a built-in implementation of process metrics too; I just wouldn't lump them in the same Meter with the runtime metrics.

Gen 0 GC Budget based on GC.GetGenerationBudget(0)

I'd be fine with it as long as @Maoni0 is. It also raises the question of whether we only want the gen0 value of this or higher-generation budgets too.

% Time in GC since last GC based on GC.GetLastGCPercentTimeInGC().

This metric has a history of being confusing, and I think folks would be better off observing the rate of change in the clr.gc.pause.time metric. We did look at adding this to the OTel metrics in the past and decided against it. Some past discussion.

An alternative implementation I've been thinking about last night is whether we should consider adding an EventListener for the GC events which could be used to set most of the GC instruments

Although it works functionally, I'd worry you are going to incur higher perf overhead for no clear benefit. Creating the first EventListener in the process requires a thread to pump the events for callbacks, and blocks of virtual memory are allocated to store the buffered events prior to dispatching them.

lmolkova commented:

Catching up on the discussion here.

Having CLR metrics would be great!

I've shared some feedback on open-telemetry/semantic-conventions#1035 and am happy to help polish names and attributes.

One thing I hope we can discuss more is whether translating existing counters is enough. I believe in some cases we can do better, specifically when we want to measure duration or a distribution of something:

  • Would it make sense, and be possible, to report GC pause duration as a histogram? What about lock contention duration? JIT compilation time?
  • Would we add new metrics if we were designing them from scratch? E.g., would we add a histogram representing the time a work item spends waiting for a thread?

Some of these can be addressed incrementally, but if we'd rather eventually report GC pause time as a histogram, we should not add it as a counter now; adding a histogram in the future would introduce duplication, since all the counts can be derived from it.

noahfalk (Member) commented:

Would it make sense, and be possible, to report GC pause duration as a histogram? What about lock contention duration? JIT compilation time?

All of those things are theoretically possible (and sound nice!), but there are two big caveats:

  • JIT compilation, GCs, lock contention, and threadpool scheduling are things that can occur very frequently. Some of them might be occurring on the order of 1M iterations/sec/core. This means any implementation would need to be very performance-sensitive, and we'd likely find multiple places where the straightforward implementation doesn't scale well to those frequencies.
  • The deadline for feature PRs in .NET 9 isn't far away (call it ~6 weeks) and we've got lots of work competing to get into that window. Even if @stevejgordon had unlimited time for this, the folks on the runtime and BCL teams who would be needed for advice and review are also time-limited.

I'm glad to contemplate brand new metrics and think about them as part of a roadmap, but I'd put implementing them as part of a different feature that we do in a future .NET release.

if we'd rather eventually report GC pause time as a histogram, we should not add it as a counter now - adding a histogram in future will introduce duplication as it will allow to derive all the counts from it.

This seems like an awkward place for us to be. The GC pause time counter is a pretty important metric that many people count on. If we omit it I think it could be a hard sell that anyone should bother with the built-in runtime counters at all for .NET 9. What are the concerns around adding a pause time Counter now and potentially in the future adding a histogram? I see that one is derivable from the other so enabling both isn't optimal storage efficiency, but it doesn't seem like it would rise to the level that I'd expect most .NET developers to care about it. If they did care about it couldn't they easily opt to drop one of the two metrics?

lmolkova commented:

Thanks for the context @noahfalk !

To confirm my understanding:

  • existing counters are super-useful, performant and we'd like to have them expressed as otel metrics (with dimensions and following otel naming patterns)
  • adding more metrics is possible, but is unlikely to happen in .NET 9 timeframe
  • histograms (in general case) would be more advanced and affect perf much more, so they are not part of 'default' experience and if we do them eventually, we'd want users to opt into them separately

It makes sense and sounds good!

From .NET 9 timeline perspective, would it make sense to limit the scope to the most basic metrics (or at least make sure they are in)?
I think CPU and memory + GC should be in the minimal set.

WDYT?

noahfalk (Member) commented:

Yep!

histograms (in general case) would be more advanced and affect perf much more, so they are not part of 'default' experience and if we do them eventually, we'd want users to opt into them separately

I think the performance penalty for histograms largely depends on how frequently the thing they are measuring occurs and how good a job we do at optimizing everything. There is some level of performance overhead where it's OK to still be on by default, so we'd probably decide case by case which ones were efficient/optimized enough and which ones should be opt-in.

I think CPU and memory + GC should be in the minimal set.

Yeah, I'm hoping we would be able to include all the metrics that OTel's separate runtime instrumentation offers. For CPU and memory it seemed like OTel recommended having separate 'Runtime' and 'Process' meters, which seemed fine to me.

stevejgordon (Contributor) commented:

From my side, I agree that it would be ideal to port the existing OTel contrib runtime metrics as the initial set with the goal that consumers no longer need an extra instrumentation package to collect the "core" metrics. The timescales are tight if the implementation gets much more complex than mostly porting the contrib implementation. I'll try to prototype this during the week, but I think finalising the conventions is a blocker to getting this in before the .NET 9 cutoff.

lmolkova commented Jun 4, 2024

I think having runtime metrics in .NET is something that would have an impact for years, and we should not rush it for the sake of eliminating the existing OTel library.

I'm happy to work with you on semconv and make sure it's not a blocker. Again, I don't think we need to port all the metrics and by reducing the scope we can significantly reduce the complexity.

noahfalk (Member) commented Jun 4, 2024

we should not rush it for the sake of eliminating existing otel library.

It's not primarily about eliminating that library. My main motivation is to get OTel-style metrics available out of the box for all .NET projects. We've already done this for a bunch of the stack in .NET 8, but runtime metrics stick out like a sore thumb now. Also, using the OTel SDK isn't the only way that folks see metrics; we want to provide them in other experiences like dotnet-counters and codeless instrumentation, which only happens when they are built in.

Right now we've got 17 proposed metrics, all with a working reference implementation and docs; customers are using them right now, and they've already gotten through a previous attempt at making them align nicely with the OTel semantic conventions. Spending a few weeks to review them seems like plenty of time, no?

Again, I don't think we need to port all the metrics and by reducing the scope we can significantly reduce the complexity.

Reduce complexity for whom? I think it's pretty likely that if we cut any of the current OTel runtime instrumentation metrics (and provide no obvious alternative), then next release I'll have GH issues and emails from .NET customers asking "what happened to XYZ metric?" We already cut a bunch of those metrics in the past when we switched from Windows Performance Counters -> EventCounters, and then customers told us where we went too far. A couple of examples here and here.

stevejgordon (Contributor) commented:

I've started an initial port of the metrics into System.Diagnostics.Metrics. The implementation adds no new public APIs, but we must agree on the name for the Meter, which is essentially public since developers who want to subscribe to the metrics, even with a custom MeterListener or, most likely, via the OpenTelemetry SDK, will need to use this well-known name.

For now, I've gone with:

private const string MeterName = "System.Diagnostics.Runtime";
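
For context, whichever well-known name is chosen is what consumers would pass when subscribing, for example with the OpenTelemetry SDK (a sketch assuming the OpenTelemetry and console exporter packages are referenced; the name below follows the draft above and may change):

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Opt in to the runtime meter by its well-known name and export its measurements.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("System.Diagnostics.Runtime")
    .AddConsoleExporter()
    .Build();
```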

@tarekgh Regarding the .NET API review process, will this name also need discussion there?

tarekgh (Member) commented Jun 21, 2024

Regarding the .NET API review process, will this name also need discussion there?

If we are not exposing any public API, we don't need to go through the design review.

For the Meter name, if the involved people @noahfalk @lmolkova @reyang @CodeBlanch @JamesNK @cijothomas agree on it, that will be enough to proceed. The proposed name looks good to me as it conforms to the guidelines.


The name passed to the Meter constructor should be unique to distinguish it from other Meters. We recommend OpenTelemetry naming guidelines, which use dotted hierarchical names. Assembly names or namespace names for code being instrumented are usually a good choice. If an assembly adds instrumentation for code in a second, independent assembly, the name should be based on the assembly that defines the Meter, not the assembly whose code is being instrumented.

JamesNK (Member, Author) commented Jun 22, 2024

Metric names are public API. They're not a .NET API and don't need to go through the standard .NET review process, but they definitely need to be reviewed.

noahfalk (Member) commented:

I would propose "System.Runtime" for the Meter name, the same as we are currently using for our runtime EventCounters. We've followed the same pattern elsewhere: we have EventCounters registered in EventSources named "System.Net.Http", "System.Net.NameResolution", "Microsoft.AspNetCore.Hosting", etc., and we named the Meter identically.

Effectively it means that folks only need to remember one name, rather than one name for the EventCounters-based data and a second name for the Meter-based data. It also implies that as soon as the Meter named "System.Runtime" shows up, that is what dotnet-counters is going to show by default, because it has a hard-coded default on that name and a policy that automatically prefers a Meter over an EventSource when both exist with the same name.

noahfalk (Member) commented:

but it definitely needs to be reviewed.

Not sure if you were proposing some form of review beyond Tarek pinging folks above, who will hopefully weigh in on this issue via comments? That's all I was planning on.

Also adding @samsp-msft to the review ping list :)

stevejgordon (Contributor) commented:

@noahfalk, I've put together an initial draft of the PR for the implementation in my fork. Would you mind taking a look and seeing if it aligns with what you were expecting? I'll keep the PR there until the semantic conventions are finalised, and then I can create one in runtime.

@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Jul 10, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Aug 18, 2024