-
Notifications
You must be signed in to change notification settings - Fork 894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Meter and instrument identity #2226
Comments
Maybe we need to improve the wording of the SDK spec. Take an example:
The SDK SHOULD use the following logic to determine how to process Measurements
made with an Instrument:
* Determine the `MeterProvider` which "owns" the Instrument.
* If the `MeterProvider` has no `View` registered, take the Instrument and apply
the default configuration.
* If the `MeterProvider` has one or more `View`(s) registered:
* For each View, if the Instrument could match the instrument selection
criteria:
* Try to apply the View configuration. If there is an error or a conflict
(e.g. the View requires to export the metrics using a certain name, but
the name is already used by another View), provide a way to let the user
know (e.g. expose
[self-diagnostics logs](../error-handling.md#self-diagnostics)).
* If the Instrument could not match with any of the registered `View`(s), the
SDK SHOULD provide a default behavior. The SDK SHOULD also provide a way for
the user to turn off the default behavior via MeterProvider (which means the
Instrument will be ignored when there is no match). Individual
implementations can decide what the default behavior is, and how to turn the
default behavior off.
* END. |
@reyang it seems to me in the following case:
The second instrument won't take effect in the "DIFFERENT instances" case since it is determined with view names? If that is true, it sounds to me that the behavior variation still exists between returning "SAME instance of Meter" and "DIFFERENT instances of Meter". |
I don't get the question. Would you elaborate more? Thanks! |
In the above (#2226 (comment)) explanations, it is said when the SDK decided to return DIFFERENT instances for meters, the name conflicts are resolved when
Should the Instrument("Bar") of Meter("Bar") be resolved within MeterProvider's scope of Views, or should it be resolved in Meter's scope? |
I see! Thanks for the clarification! This is an ongoing discussion which we haven't yet solved. My understanding is:
|
@reyang you discussed the following example in your comment:
I think we need to adjust the wording of the spec to relax the requirement that "2nd instrument creation would fail according to the API spec". The java SDK returns the same meter as long if the name, version, and schemaUrl match a previously registered meter. And it also allows the same instrument (same name, description, units, instrument type) to be registered from a meter multiple times, as long as the name, description, unit, and instrument type match. If you register an instrument that is in conflict with a previously registered instrument, we log a warning, and values recorded to the instrument are dropped. This behavior seems acceptable for synchronous instruments. It seems like it could also be acceptable for asynchronous instruments if we decide that if multiple callbacks are registered, each is called. |
The Java SIG has been talking and its become clear that the API spec's current requirement that "Meter implementations MUST return an error when multiple Instruments are registered under the same Meter instance using the same name" is very detrimental to java's instrumentation. It's common to ask for the same meter multiple times, and for the same instrument from that meter multiple times. Requiring that the API return an error in this situation reduces API usability, forcing users to think carefully about instrument initialization ordering details and requiring clunky code for correctness. #2270 alleviates much of the concern, but there is still asymmetry between synchronous and asynchronous instruments. #2281 resolves the asymmetry, but with multiple callbacks, we find ourselves wanting / needing the ability to delete or unregister callbacks (#2232), else apps risk accumulating many callbacks without a way to release resources. |
Thanks @jack-berg. To add some insight, one way to think about this is that we have semantic conventions that will be shared among multiple components. For example, we have Anecdotally, I've worked with an internal metrics framework that did prevent duplicate registration and had a flag to suppress the exception - almost every app enabled the flag. It's probably for reasons similar to the above. Sorry for not noticing the gap between the Java implementation and the spec until now - but this does seem to be an issue with the spec and now is probably a good time to fix it. Otherwise Java would probably launch with this non-compliance, expecting it to be fixed forward in a future release since we don't really have a good way of preventing duplicate registration right now. |
I'd like to point out that this is still a valid concern even without multiple callbacks support - you can still create lots of async instruments that'll leak resources (although it is maybe a bit less plausible).
I was forced to do exactly that in my micrometer integration in the Java instrumentation repo - with the one-callback limitation in place it was impossible to use plain OTel metrics API to implement that. I believe that not being able to register multiple callbacks is going to block us from implementing further metrics instrumentations - for example, the database connection pool instrumentations that I recently created a spec PR for can't possibly happen without being able to register (and remove) multiple callbacks. |
As an FYI - when implementing the specification for the Java SDK (and when initially reviewing) I interpreted the following:
As this:
Specifically, I though this segment of the API was to prevent obvious logic/typing errors in metrics for instrumentation authors. I had to read through this whole issue to realize I still consistently mis-read the spec as such. Would it be feasible for us to just limit multiple instrument types (e.g. not allowing a counter + gauge to register with the same name) and consider this a 'bug fix' to the APi specification? |
This seems like a nice improvement to the current wording. But thinking about it, I wonder if even validation of types may be more harm than good. We have to keep in mind that the restriction only applies within a It is probably aimed more at app authors. In that case, failing with duplicate types seems to increase the attack surface of OTel's existing trap, the Another sharp edge I can think of is there will be cases, possibly many, where a company's internal libraries don't care about the meter name because it's not used by their backend, and just use empty string, or maybe the company name or something. This isn't compliant with our spec, but they're internal and shouldn't need to be, they are just interested in getting the instrument to get some metrics. But such practices would increase the attack surface of the trap even further, if it's an internal library consumed by the app, the app will generally have no good solution to this error. I have a feeling that such confusing situations will come up more often then the user being saved by the validation, though I am not sure of this. We need to keep in mind that many, probably most, users of our APIs, especially metrics since it's commonly used within app code as opposed to tracing more commonly consumed through instrumentation, will not be even aware of the opentelemetry specification let alone be happy if it constrains them too much. |
FWIW, #2270 is adding resolutions to duplicated instruments, which is based on the type and unit of the instruments. Error is returned on the instrument registration with inconsistent type and unit. |
If a violation occurs we shouldn't we should return a no-op instrument and log it. Truly erroring when these situations occur would be an especially bad experience for apps which create new instruments throughout the app's lifecycle. E.g. the app is running fine for days but at some point the conditions are met for the creation of a new instrument and the app crashes - yikes. Are you suggesting that we try to accommodate having a meter support instruments with the same name but different types? If so, I don't think this is a good idea. I think we ought to log an error and return a no-op instrument. I interpret logging an error as in compliance with the spec's wording of |
Yeah that is my idea - if attributes were also part of identity then I think the chance of problems goes down but I outlined in my comment the reasons I feel like any sort of dedupe seems like it will cause more problems then help for users in the end. The main issue for me is that a name, e.g. something plain like queue.size, doesn't seem to be intrinsically typed in every single case. Semantic conventions only help a little since user code will not nor should not need to comply with semantic conventions. |
I understand this sentiment, but it's really mostly valid in a typeless backend (like Prometheus). One of the key-tenants of OTel (and OpenMetrics) is typed metrics, as well as "default aggregations". I know for a fact that this suggestion would break some backends (as well as OpenMetrics conventions). e.g. even InfluxDb one of the most flexible, if you have a schematized bucket this would be an error. Additionally, we have an issue where 1 timeseries in OTEL != 1 timeseries in Prometheus (or other backends) e.g. Specifically, Histograms turn into a family of metrics in prometheus. Allowing the same metric name here could cause issues. I also suspect if we have type-conflicts, we can't really guarantee we won't also have label conflicts. What you're suggesting is a (large) data model specification change. I think it would require a siginificant rewrite of the metric specification, starting from the data model.
You're right. We can mitigate the error message with auto-generated helper libraries. I'd argue that folks will still interact with the API so we shouldn't leave sharp edges for them even if we expect to provide nicer solutions over time.
This is the primary reason why we want this error message. It's a sharp edge in our design. I'd love to find ways to reduce this friction, but I think a warning message and reporting as many metrics as possible is the best we can do right now in this situation. There are some scenarios we need to treat as a "bug", and I think our design and documentation should highly encourage others to pull a new Meter (or allow a Meter to be passed in) for their libraries. We've sort of built the API around this notion. In fact, if we got rid of the importance of Meter, we'd still have timeseries overlap as a sharp-edge. We wouldn't be able to get rid of the warnings, we'd just see them more often. Right now, we've included Meter as part of the metric identity which allows as at least SOME way to avoid conflicts across groups in a large organization. If we get rid of this, I think it just makes the situation worse. |
Thanks @jsuereth - framing it as a sharp edge in the data model itself makes sense to me and agree that there isn't much we can do at a higher level like the SDK to work around that. |
The setup for this scenario is a large code base with existing instrumentation APIs that the owner wishes to adapt to use OTel. Let's say the instrument is one of these ambiguous Prometheus-style Gauges that could be up to three different instruments to the user unfamiliar with OTel: it has So without a discriminating instrumentation library in this scenario, let me suggest using one instrumentation library per kind of instrument in your code base. E.g., "mycompany.com#counters", "mycompany.com#gauges", "mycompany.com#histograms", and "mycompany.com#updowncounters". This seems like no harm done, provided the receiver of the OTLP is willing to recognize instrumentation library as a distinguishing feature. I want to point out that I support a mixture of temporality for the same instrumentation library name and metric name across a system. I.e., OK to have synchronous Counters here and asynchronous Counters there, because these are semantically compatible and can be changed either by exporter configuration (e.g., expressing a temporality preference for performance reasons) or by re-instrumentation (e.g., replacing synchronous Counters with asynchronous Counters). However, in this case, I would expect to see Resource attributes--part of the stream's identity--to serve as a distinguishing feature. I do not like the idea of having attributes or resource attributes be identifying in this way. If pressed to do something more accommodating, which is what I feel you're doing, I'd rather allow the instrument kind to be considered part of the identity. If you happen to report a Counter and a Gauge with the same name, that would be fine I suppose, as long as they're defined as having distinct metric identities. |
Exploring the compromise I hinted at above, I see a solution where we tolerate duplicate-conflicting instruments.
I'm not suggesting we change OTLP or the data model. This is a compromise to make the SDK simpler and "pass along" the ambiguity when duplicate-conflicting instruments are registered. For SDKs this is a simple change, just include the instrument kind in whatever identity map you have. There's no need to return a "registration error" from the API, as a result. Does this still deserve a warning? Yes, because it will generate errors for certain metric-type-aware vendors. I would continue to say that users SHOULD NOT re-use instrument names for the different-semantic-kind instruments. This is not good, we're just tolerating it within the SDK specification. I would not recommend consumers of OTLP change in response to this, unless it is to pass through the same. For example in an OTel Collector metrics processor stage, where an identity map is being built for some purpose, it will be natural to just pass-through the ambiguity by recognizing them as distinct metrics. Speaking for my employer, for example, Lightstep will experience an error (*) if the sender writes a different type of metric stream than its type on the first arrival. Errors will result when the SDK "passes along" the ambiguity described above to Lightstep. We want metrics to have types for a lot of reasons, but the one that matters most to users is that our query builder is type-aware. We build different sorts of queries about counters as we do about gauges, and we believe the user is helped by having this type awareness build in to the product. There's one exception to the Lightstep example above, we will not complain if you switch the type between UpDownCounter and Gauge. So, you may ask why not? It's because we're not recognizing the distinction; however, this decision has created problems in our query builder that would be fixed if we'd begin to recognize the distinction. (Here it is: (1) When your Gauge query matches multiple timeseries and you need an action to reduce those series into one series, the user is best served to Average those series by default; (2) When your UpDownCounter query matches multiple timeseries and you need an action to reduce those series into one series, the user is best served to Sum those series by default. By not recognizing the distinction, we were forced to choose one default for the two underlying instruments. We chose to Sum timeseries by default for the UpDownCounter/Gauge type, and well, that's not always the best for the uyser. |
(*) I wrote "experience an error" because this condition is difficult to return to the user so that they will see it. We don't want to fail all the metrics that arrived in the same batch, and we don't want to signal anything that looks retry-able. So, we return gRPC or HTTP response trailers to tell the user about this sort of conflict and only the original-type metric stream will continue writing. |
@jmacd Your proposal sounds good to me, as your example shows the type + name identity may be important for certain backends but may not be - logging a warning to the user but still passing through the ambiguity instead of explicitly failing or forcing a no-op will still allow users some flexibility / time if they do have issues or know their backend won't be affected. |
To confirm my understanding, the proposal is to allow creating instruments with the same name and different types from the same meter. If this occurs, the resulting exported metrics would contain metrics with the same name and different types, and its up to the backend to figure out with them. For example, the following code would be allowed:
And on export would produce something like:
Is that an accurate representation? If so, what are the implications on prometheus export? |
TL;DR Lengthy response in support of my own argument based on experimenting with Prometheus. First, a Prometheus library (Golang) returns error if you try to register the same instrument name, regardless of identical instrument kind or not. If you're using Next, I produced a fake scrape target w/ duplicates-- a counter and a gauge with the same name.
I tried this with and without using the same label set, but in both cases Prometheus doesn't care. I tested with both Prometheus v2.23 and v2.33. It appears to always choose the first value to appear in the scrape. No warnings were emitted on the console. Historically, the internal representation of Prometheus doesn't depend on instrument metadata such as this, as far as I know, so whether you report a counter or a gauge you end up with a timeseries of numbers. This is the sense in which @anuraaga's request above ("use labels for identity") is realistic: as long as the user uses different labels on the different same-name instruments, the data is preserved and can be used--technically the single-writer rule is not being broken. However, this seems to introduce confusion that could've been prevented earlier (e.g., with a registration error, as we're discussing). We may want to ask Prometheus developers for more context. Like the current OTel data model specification, the OpenMetrics specification says that metric family names MUST be unique in a metric set. I am aware of optional support in Prometheus to include instrument metadata in the PRW data stream (see, e.g., prometheus/prometheus#6815, prometheus/prometheus#7771); I am not sure what would happen with that support turned on, but it appears that the tide is shifting on this topic. The reason Prometheus is adding metadata to PRW is because vendors are demanding it, which is because to build a scalable Prometheus data store you some times want to automatically remove labels, and when you're automatically removing labels we get back to one of OpenTelemetry Metric's tenets: there's a natural merge function that should be applied and if you don't know Counter vs. Gauge you can't automatically merge timeseries. |
Did you do this via the client library or via the prometheus backend? The prometheus backend is untyped, so effectively the type label is just ignored, I'd be surprised if it had failed to scrape. However, at least in Java, there's SOME restrictions in the prometheus client library, e.g. you can't expose the same metric name with different label sets for the whole process/endpoint. When it comes to types in Prometheus, given the direction OpenMetrics is going, I'd like for us to prefer preserving types and having a "strict" mode that abides by best practices. However, we also need to deal with the real world and give users every out we can. I also think (or assume?) we designed the perfect solution to the problem of different typed instruments: Views. Specifically, my straw man scenario:
the original Java view implementation was designed with the flexibility to allow views to resolve metric-name conflicts for THIS EXACT SCENARIO. I think that may have been undone (cc @jack-berg for confirmation) to abide by the strict definition in the specification. EDIT: Additionally I think we COULD even specify an error message that explains to users HOW to register a view to remove the conflict. E.g.
|
Thank you @jsuereth -- especially for connecting this problem with a Views solution. (Nice suggestion that we print a Views configuration to fix!) This suggests I should close #2270 and #2281 and work on combining these revisions with a new position. Can we separate the notion of a "strict mode" from this problem, specifically? I can imagine other forms of strictness, and we might be able to agree on a general statement in the library guidelines. Then we'd write something on this topic in the API spec like duplicate instrument registration with different instrument kind (including synchronous vs. asynchronous) MUST be permitted, except in strict mode where duplicate-conflicting registration is considered a runtime error. For the SDK spec, the implementation SHOULD print instructions on how to resolve duplicate-conflicting registration errors by printing an example View configuration that selects on instrument type. For an OTel-Go implementation, e.g., strict mode would even prevent registration of both |
This proposal makes sense to me. Thanks @jmacd and @jsuereth for the additional context. There's one scenario I still have a question about. Suppose a user asks a meter for the same instrument type, and same name, but a different unit or description:
The SDK could extend the proposal to propagate ambiguity to the exporter as follows:
Or it could attempt to merge and resolve the ambiguity before export, for example by taking the first unit / description registered:
|
@jack-berg that's a good call out. I'm a lot more biased against failing for different units and descriptions. I think the suggestion of "first one wins" is good. We could also do "first non-default wins, or use the default", however that's a bit riskier. |
First one wins makes most sense to me because any other approach implies that a metric's description / unit is subject to being updated from one export to another, which feels wrong. If a user wants a certain description and unit, they can simply take care to make sure that all calls to create the instrument include the same description / unit. |
@jack-berg and @jsuereth about the I don't want to require the SDK to handle unit conversion, so different units have to be treated as different streams. If a user provides two descriptions, it's easy to make an exception and say "first wins" for description. It's probably even easier to say that all fields are identifying, so that two descriptions means two streams with a warning about a conflict. If the user is really trying to register the identical metric stream, it should be easy for them to supply the identical description. If this happens and a user configures a Prometheus exporter, they'd SHOULD be using distinct attributes, otherwise should work fine. |
What are you trying to achieve?
In the metrics API spec it is said for
MeterProvider.getMeter
:It also suggests that:
So, as long as SDK propagates the configurations to all meters created without another
getMeter
call, it should be spec-compliant for either returning the same or a new Meter instance for every name+version+schema_url pair.However, for the instrument name in API spec, it is said:
It is not clear when the SDK always returning a new Meter instance for every name+version+schema_url pair, if the instrument name should be resolved in the Meter instance, or it should be checked SDK-wide with the same meter name+version+schema_url? If it is resolved within a single Meter instance, users could create multiple meter instances with the same meter name+version+schema_url and create an instrument with the same name in these meters. In that case, I'm not sure how the unique instrument name constraints can effectively work, since the exporters may observe multiple same name instruments associated with identical instrumentation libraries.
The text was updated successfully, but these errors were encountered: