-
Notifications
You must be signed in to change notification settings - Fork 888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refine which attributes of Resource contribute to Metric Identity. #2775
Comments
ProposalIf we follow the previous work for Prometheus, then we can leverage the fact that OpenTelemetry metrics relies on service Resource attributes for identity:
These attributes become required and identifying for all resources. All other attributes are considered "descriptive" and may be appended to a resource at any time. Specifically, this enables the following:
|
This is only sufficient for Resources which are Services. I think we can say that Resources that are emitted using Otel SDK are Services (augmented with additional useful non-Service attributes). However, not all Resources that we can imagine in Otel are Services. For example the Collector emits metrics about a Host or about a Process. The Host and the Process are the Resource in this case. There is no Service to associate this data with. There are many other such examples possible. I think we need to generalize this to the following:
Given the above, I believe we can say the following: the identify of a metric timeseries must include the minimal set of the identifying attributes of its Resource plus the identifying attributes of the entities in the context of which the identity is defined. So, for example for a metric of a Service both (service.name, service.instance.id) attributes and I think the consequence of this is that the definition of a telemetry emitting Resource must include not only its identifying attributes but also reference the containing Resource in the context of which the identity is defined. As for the cardinality of some identifiers. I think it is the reality of telemetry data that some Resources have naturally high cardinality. It is their important inherent property. I do not want us to be perpetually locked in a world where technical limitations of some technologies dictate our understanding of what comprises the identity of the object of interest and force us to drop vital information about the object right at the source. For example, when emitting metrics about the Process I don't think we should omit the process id. Whether it is high cardinality or no, it is the only single attribute that correctly and uniquely identifies the Process (within its Host). If a certain destination has trouble dealing with the cardinality of the attribute either it is that particular destination's problem they need to deal with or if we believe it is our problem then we can think about how to record the knowledge about expected cardinality of attributes and use it for lossy transformation in the exporter targeting that particular destination while retaining full fidelity when sending data over OTLP to destinations which do not have such cardinality limitations. Perhaps this means that high-cardinality is one of the labels that we associate with a particular attribute in the semantic conventions, record the label in the Schema files, and use the Schema file for automatic cardinality reduction in the exporters. |
+1 on all of this with one caveat. The non-identifying attributes of the Resource may change over time. This seems to contradict with our current definition of a Resource, which says it is immutable. However, I don't think there is a contradiction. We need to revise the definition to the following: the Resource emitted by an Otel SDK is immutable within the lifetime of the SDK runtime. Outside of that the Resource's non-identifying attributes can certainly mutate and there is no reasonable mechanism in Otel SDK that can prevent that so we may as well embrace it. An example of mutating non-identifying attribute is the following: a Kubernetes Pod is unique identified by (k8s.cluster.name, k8s.namespace.name, k8s.pod.name) triplet. The Pod may also have multiple labels associated with it, which can be recorded as non-identifying attributes. In Kubernetes world these labels can change over time and so as a consequence can change the non-identifying attributes that record the labels. Given that non-identifying attribute of a Resource may mutate over time, I think we need to be careful with saying that "Backends can merge descriptive attributes at any time". This is certainly doable as long as we understand the values of merged attribute can change overtime, it should be considered as immutable and no such consideration should be made when querying the stored data. |
Tigran - You have a lot of good thoughts here so I'm going to respond to things in individual comments (hopefully easier to reference).
This is actually where the rub is. Emitting process metrics with an attirbute of PID is fine. What I'm objecting to us is having PID be an attribute of every metric an SDK produces because that SDK happens to be a process resource. The difference here is nuanced but important. TODAY, even http latency will have the process command line + PID attached given the way we do resources. Does that make sense from a metric standpoint? I think this data should be joinable with the actual timeseries, but it doesn't need to be a label on that timeseries itself. (e.g. see kubestate metrics in practice). The key here is we should limit cardinality to what's necessary. The example of process-metrics, PID is necessary, and beneficial. I'd argue having PID on HTTP latency metrics, not so much. The key distinction I want to make here is resource attributes are appended to ALL metrics not just ones where that nuance is important. I think fundamentally the issue here is that Spans/Logs are inherently an "(sampled) event" model, while Metrics are an "aggregated event" model. I suspect the profiling signal is likely to run into similar concerns/issues for their aggregated events (although profiling will have both aggregated + sampled use cases combined). |
This is a great point. However, we see in Prometheus, Jaeger, DataDog,etc. a "simplification" where these Resources are "forced" (or coerced) to belong to a service for simplicity in representation and understanding. You're right it's not a very accurate model, but is it more useful than something more nuanced? Perhaps. I don't think we should dismiss this simplification just yet though. Remember all models are lies, but some are useful At minimum though, if I read your agreement:
I'd like to hear your thoughts on keeping a simplified Resource model (I'm challenging my own thought with it given what I've seen from Jaeger, Prometheus + DataDog, e.g.), but additionally I have some alternative thoughts/designs for Resource around identifying attributes, kinds + how to make progress there. |
Yeah. I'm actually thinking of splitting resource into a few components as a way to make non-breaking changes to it in the current form. More specifically:
// Resource information.
message Resource {
// Defines the core type of this resource, e.g k8s_pod, etc.
// A resource kind defines the minimal set of key attributes that can uniquely identify the resource.
// If empty, the resource is assumed to be "generic_resource".
// Note: rather than resource_kind, this could ALSO be a telemetry-schema reference for the resource type.
string resource_kind = 3;
// Set of attributes that identify the resource.
// Attribute keys MUST be unique (it is not allowed to have more than one
// attribute with the same key).
repeated opentelemetry.proto.common.v1.KeyValue attributes = 1;
// Set of attributes that describe the resource.
// Attribute keys MUST be unique (it is not allowed to have more than one
// attribute with the same key).
repeated opentelemetry.proto.common.v1.KeyValue descriptive_attributes = 4;
// dropped_attributes_count is the number of dropped descriptive attributes. If the value is 0, then
// no attributes were dropped.
// Note: No identifying attributes are allowed to be dropped.
uint32 dropped_attributes_count = 2;
} |
I believe what you say is exactly correct. In my view every metric is emitted for a particular entity. Metrics don't exist in vacuum. They are attached to some entity that we can pinpoint to (this is a simplification that I can argue against as well, but bear with me for now). IMO the identity of metric timeseries is defined as a combination of (metric datapoint dimensions + minimal globally unique identifier of the emitting entity). So, when emitting metrics from Otel SDK the emitting entity is the Service. Thus only Service's uniquely identifying attributes should be part of the metric's timeseries. Yes, the Service is also associated with a Process. This is important for other purposes, but is irrelevant from timeseries identification purposes. |
No, I think it doesn't. See my previous comment to why I think so.
I assume by "joinable" you mean the following: if we know the Process's full identity (e.g. by knowing its process id, host name, etc) there must be a way to locate the timeseries of all metrics that are emitted by a Service Instance that is associated with that Process. Notice there is more than one level of indirection here - we must find the Service Instance from the Process, then find the timeseries from the Service Instance's attributes. If this is what you mean then I definitely agree.
Agreed. |
Yes.
Yes.
Yes.
Possibly. I am not yet sure about this. I have some possible different use-case for the "kind" in my mind, but not yet sure about it. However, I would like to understand the use case that you describe. How do you intend the "kind" to interact with the merging logic? |
There is likely a few more nuances that we haven't discussed yet. The most important unresolved problem for me is: what constitutes the globally unique identifier of a Resource? I can define a locally unique identifiers easily. For example the process id is a locally unique identifier for a Process in the context of a Host where the process runs. However, the process id is not sufficient as a global identifier. We need something more there, in particular we need to in addition know the context within which the local identifier was recorded (i.e. the identity of the Host in my example). I am not sure how exactly we solve this. The globally unique identifier of a Resource depends on the environment where it runs. There is no single definition of the global identifier for each individual Resource that we are interested in. I will try to allocate some time to this and see if I can put together an OTEP that addresses all these questions. |
For backward compatibility reasons we cannot change the semantics of the current // Resource information.
message Resource {
// Set of attributes that describe the resource.
repeated opentelemetry.proto.common.v1.KeyValue attributes = 1;
// dropped_attributes_count is the number of dropped attributes. If the value is 0, then
// no attributes were dropped.
uint32 dropped_attributes_count = 2;
// Defines the core type of this resource, e.g k8s_pod, etc.
string resource_kind = 3;
// Set of attributes that identify the resource. Note that the identifying attributes may be
// also recorded in the "attributes" field.
repeated opentelemetry.proto.common.v1.KeyValue id = 4;
} This definition is backward compatible, so we won't break OTLP. However, I am not happy with the definition of the |
I actually wasn't changing the semantics. Today, by specification, all attributes in the Resource are descriptive and identifying. What I was proposing was a forward+backward compatible change (older users don't need to change their notion of identity or know to look for new fields). Specifically, if you look at use cases: Old Telemetry:
By specification, we treat all of those as identifying when we merge together telemetry by-resource. New Telemetry:
Here, an old client would still join together streams by attributes and be correct about it. |
Unfortunately today we allow to (and do) record all sorts of junk information in the Resource. Any assumption that current Resource attributes are identifying is incorrect and any recipient that makes this assumption is generally speaking broken.
I think this is broken today. It probably works in some cases. The best we can do is to leave it alone so that it continues working the way it does. My suggestion is that we should continue recording what we record today in the Resource's In the new telemetry we introduce an additional field Old Telemetry:
New Telemetry:
Here the old recipients are oblivious to the existence of the New recipients are now wise enough to know that the |
So I think we both agree in general on the shape of the solution, just not specifics or what will be more annoying to fix going forward. I've been focusing on the way we bundle-by Specifically, we still have the major issues of:
|
Agreed. (BTW, @bogdandrutu suggested to consider an alternate where we only have a single identifier string, instead of a set of attributes. We can consider that, but I think in the context of the current discussion the difference is not yet important).
Yes and in addition to this: which identifying attributes become descriptive when merging? For example if I merge a Process with a Host then both
No, I think it applies to all attributes, both identifying and non-identifying.
Yes. In addition - conventions on which identifying attributes can be eliminated/aggregated to reduce cardinality when exporting to low cardinality formats/destinations.
If we accept this rule we will need to have guidelines about how to compute Also, AFAIK, Collector does not necessarily follow this rule rigorously. We will need to double check. |
This implies, then, that ALL resources (that can merge) need to be part of the same schema, which basically means ALL resources must be fully specified by OTEL semantic conventions. Is this what you intend? E.g. where is the room for user-generated descriptive attributes?
Yeah, I know the collector does not follow this and it's been the source of some friction. SO I think our major disagreements are on:
Otherwise, we're both looking to solve the same problem. I think this means we're ready to start exploring the solution space. I'll try to put together a proposal this week and link it here for review. |
No, that is not my intent. The presence of schema_url in the Resource does not mean that all Resource attributes belong to that schema. Only the ones that have a matching name defined by that schema. User-generated attributes are of course possible. We already have guidelines about application-specific and company-specific attribute naming here. Because these are named differently they naturally are not covered by Otel schema and essentially are exempt from schema-related functionality. Note that if in the future we introduce Derived Schemas it will be possible to have a schema_url which refers to an application-specific or company-specific Otel-derived Schema, such that even application-specific and company-specific attributes are schema-compliant, potentially allow to subject all Resource attributes to schema rules. However, I do not think it is a mandatory requirement to support Derived Schemas or to ensure that all attributes are covered by Otel Schema. Do you see any problems with this approach? To me this does not look like a major disagreement area. I think we agree more than disagree :-) |
I think my view on schema-url is slightly different than yours, so we probably need to have that conversation elsewhere. Specifically, the notion of Dervied Schema or company-supported schema. In practice we don't allow Resource.merge to work for different schema-urls unless one is blank. We need to directly address that verbage. Perhaps we both agree we need to address it, and you plan to do so through expansion on Schemas themselves whereas I'd like to better denote how Schema + Resource interact (especially since I don't think schema_url exists on resource but beside resource in OTLP). |
@jsuereth @tigrannajaryan are you planning to take this issue forward anytime soon? When you discuss on a SIG meeting I would like to join in the conversation and explore the possibility of introducing one more collection of attributes in |
For reference: OTEP open-telemetry/community#1976 aims to address this. |
I've asked a related question on Stack Overflow: "Do the OpenTelemetry spec/docs recommend/require OTLP consumer behavior for attribute names that are not defined in the semantic conventions?". That question is related to this issue because it refers to the different ways in which OTLP consumers ("telemetry backends") handle "unrecognized" resource attributes and metric data point attributes. Although I don't specifically mention it in that question, a workaround to getting some telemetry backends to ingest unrecognized attributes is to characterize them as metric data point attributes. (Sorry, I don't want to name specific backends. Neither, at this time, do I want to elaborate on the specific nature of these attributes, other than to say they're not in the current registry or semantic conventions.) With apologies if referring to a question on an external site is poor etiquette (I'd like to get informed eyeballs on that question, but I don't want to annoy anyone), I invite anyone interested in this issue to read, and answer, that question. |
The Metric Cardinality Problem
Today, every attribute in an OpenTelemetry Resource, according to the metric datamodel, is used to determine the identity of metric. Given known issues in metric time-series database implementation around cardinality, this can cause major issues if Resources are allowed to leverage high cardinality attributes.
Given many Resource attributes semantic conventions today were defined for the tracing instrumentation, we do find many high cardinality definitions, e.g. the Process resource includes
pid
and andparent_pid
, which are known to churn between instances of an application and would lead to higher cardinality streams.Many metric backends are simply erasing resource attributes from metrics to workaround the issue. Here's an example solution for prometheus, and another proposal for yet another point-fix for prometheus.
However, these workarounds prevent Metrics users from regaining descriptive attributes (and benefits) of current OTEL Resource detection.
The Resource identity problem
It turns out that the identity problem isn't isolated to metrics alone. It also impacts our ability to work with Resources within OpenTelemetry. For example, these bugs
Rather than isolating point-fixes as they arise (e.g. the prometheus exporter case), it's become clear we need a better hollistic solution.
The text was updated successfully, but these errors were encountered: