Clarify that service.* conventions apply to all telemetry sources #630

Merged

jack-berg
Member

@jack-berg jack-berg commented Jan 8, 2024

The question of whether the service.* conventions are applicable only to web services is an important one that has come up several times. If the answer is yes, then everything that isn't a web service needs its own version of service.name, service.instance.id, service.namespace, and service.version to uniquely define the thing producing telemetry. We've discussed this at length several times, and while we don't have anything written down yet in the spec / semantic-conventions, the actions we've taken (e.g. rejecting alternatives like telemetry.source, app.name, etc.) confirm that the service.* attributes are applicable to all telemetry sources, not some narrower subset that some people consider a web service.

This PR aims to clarify this to avoid repeating the same discussion.

Some PRs, issues that are related:

@jack-berg jack-berg requested review from a team January 8, 2024 21:18
@jack-berg
Member Author

cc @open-telemetry/technical-committee since this is an important detail that extends beyond just semantic-conventions

Member

@reyang reyang left a comment

LGTM with a changelog entry.

@svrnm
Member

svrnm commented Jan 9, 2024

Does not have to be part of this PR, but adding the same wording / a link to that definition to the glossary may resolve this long standing issue (or make it obsolete if it is not added to the glossary: open-telemetry/opentelemetry-specification#2050)

I am wondering if there should be an explicit note stating that this is "set in stone", and that even if people disagree with this wording and definition, it will not be changed because code relies on it. Otherwise, if someone not familiar with the matter or new to the OTel community comes across this definition and disagrees with it, they can (and will?) still go ahead and raise an issue/PR/OTEP asking for a change.

@joaopgrassi
Member

@svrnm

Does not have to be part of this PR, but adding the same wording / a link to that definition to the glossary may resolve this long standing issue (or make it obsolete if it is not added to the glossary: open-telemetry/opentelemetry-specification#2050)

Would you want to add this to the glossary, since you have the issue? Otherwise I'm glad to work on it. Let me know :)

@svrnm
Member

svrnm commented Jan 9, 2024

@svrnm

Does not have to be part of this PR, but adding the same wording / a link to that definition to the glossary may resolve this long standing issue (or make it obsolete if it is not added to the glossary: open-telemetry/opentelemetry-specification#2050)

Would you want to add this to the glossary, since you have the issue? Otherwise I'm glad to work on it. Let me know :)

Sure, I can do that after this PR is merged.

@joaopgrassi joaopgrassi merged commit 7f35c49 into open-telemetry:main Jan 10, 2024
9 checks passed
@yurishkuro
Member

This was merged too fast for such a significant change. What does it mean to treat hardware devices as services? Are we saying that instead of some_metric{host=xyz} the correct way is to emit some_metric{service=xyz}?

@tigrannajaryan
Member

This was merged too fast for such a significant change. What does it mean to treat hardware devices as services? Are we saying that instead of some_metric{host=xyz} the correct way is to emit some_metric{service=xyz}?

That's a good point. I understood the change as applying to software that uses the OTel SDK to emit telemetry, but not necessarily to everything at all that can have telemetry associated with it.

@jack-berg
Member Author

This was merged too fast for such a significant change.

Agree, given the disagreement about this in the past.

I understood the change as applying to software that uses the OTel SDK to emit telemetry, but not necessarily to everything at all that can have telemetry associated with it.

I believe it should apply to all telemetry sources. Limiting it to those which use the OTel SDK to emit is a strange distinction: between async instruments and the metric producer concept, it's possible to imagine telemetry sources like the hostmetricsreceiver emitting data from the SDK instead of directly producing metric data points.

I see no advantage to having separate sets of identifying attributes for different types of telemetry sources.

@yurishkuro
Member

I see no advantage to having separate sets of identifying attributes for different types of telemetry sources.

Cf. Monarch paper https://www.vldb.org/pvldb/vol13/p3181-adams.pdf

We have a similar design at Meta - different "sources" may need different identity schemas, and it's not reasonable to think that a single string field, service, can cover all of the varieties. We even specifically covered this point in our recent paper: you get all kinds of weird distribution skews if service is the only identifying dimension, even when referring to software workloads that do conceptually look like services.

@jack-berg
Member Author

The service.* attributes are not solely responsible for describing the telemetry source - you have to look at the entire set of resource attributes for that. But they include powerful general concepts that we have the option to either re-use, or continually reinvent:

  • service.name - a logical name.
  • service.instance.id - a unique identifier when there are multiple instances of what is logically the same thing.
  • service.version - the version of the thing.
  • service.namespace - a namespace for service.name.

Putting the service.* prefix aside, those concepts are broadly applicable, so why reinvent the same concepts and assign them different names? This isn't to say that we shouldn't have domain-specific attributes to provide additional information (identifying or descriptive) about the source.
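
To make the reuse argument above concrete, here is a minimal sketch in plain Python (no OpenTelemetry SDK involved). The two resources and all of their values are hypothetical, and `service_identity` is an illustrative helper, not an existing API. It only shows that the same four keys can describe quite different telemetry sources, with domain-specific attributes sitting alongside them.

```python
from typing import Dict, Optional

# The four service.* keys listed above.
SERVICE_KEYS = (
    "service.name",
    "service.namespace",
    "service.instance.id",
    "service.version",
)


def service_identity(resource: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return the service.* view of a resource; missing keys map to None."""
    return {key: resource.get(key) for key in SERVICE_KEYS}


# Hypothetical resources: a backend service and a mobile app, both described
# with the same generic concepts plus domain-specific attributes.
backend = {
    "service.name": "checkout",
    "service.namespace": "shop",
    "service.instance.id": "checkout-7d9f",
    "service.version": "1.4.2",
    "host.name": "node-12",
}
mobile_app = {
    "service.name": "shop-android",
    "service.instance.id": "install-42",
    "service.version": "3.0.1",
    "os.name": "Android",
}

for resource in (backend, mobile_app):
    print(service_identity(resource))
```

The sketch does not settle whether the `service.` prefix is the right name for these concepts, which is the point under dispute in this thread.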

@yurishkuro
Member

those concepts are broadly applicable

Broadly perhaps, but not universally. What is the meaning of service.name, service.instance.id, service.version for a bare metal host? Whatever the answer is, I'd bet it's not intuitive and not the only possible mapping.

so why reinvent the same concepts and assign to a different name

Who are we optimizing for: ourselves, to make the spec easier to maintain, or the end user, to make the telemetry easier to understand? And it's not like we don't have precedents - we have various vendor-specific conventions (aws, gcp) which are distinct because they describe unique business entities, even though they may have some overlap in conceptually similar dimensions.

@jack-berg
Member Author

What is the meaning of service.name, service.instance.id, service.version for a bare metal host? Whatever the answer is, I'd bet it's not intuitive and not the only possible mapping.

When I see the service.* attributes, in my head I just omit the service. prefix and see name, id, version, and namespace attributes. With the interpretation that service.* is applicable to everything, we might as well think of the attributes without the prefix. So for a bare metal host, what are the name, id, version, and namespace? I see clear matches in the host semantic conventions: host.name and host.id. There are no obvious matches for version and namespace.
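
The mapping described above can be sketched as follows. This is only an illustration in plain Python: `generic_identity` is a hypothetical helper, the host values are made up, and the choice to leave version and namespace unset simply mirrors the "no obvious matches" remark.

```python
from typing import Dict, Optional


def generic_identity(host_resource: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map host.* attributes onto the prefix-free name/id/version/namespace view."""
    return {
        "name": host_resource.get("host.name"),
        "id": host_resource.get("host.id"),
        # No obvious host.* counterpart for these two, so they stay unset.
        "version": None,
        "namespace": None,
    }


# Hypothetical bare metal host.
bare_metal = {"host.name": "rack3-node07", "host.id": "4c4c4544-0052"}
print(generic_identity(bare_metal))
```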

I admit it's not the most intuitive. In hindsight we might have used something like telemetry.source.*, which doesn't have the same connotations as service.*.

This is the real crux of the issue: we've rejected proposals to use alternatives to service.* for domains like mobile and browser applications, arguing that everything is a service in the abstract and that there are stable documents that can't be updated to use different core identifying attributes. I don't see what makes a host different from a mobile application - in both cases the argument is that the service.* attributes are unintuitive. Sure, we could draw the distinction that everything producing data from an SDK is a service, and carve out an exemption for the hostmetricsreceiver, but this seems flimsy and hard to defend.

@tigrannajaryan
Member

Putting the service.* prefix aside, those concepts are broadly applicable, so why reinvent the same concepts and assign them different names? This isn't to say that we shouldn't have domain-specific attributes to provide additional information (identifying or descriptive) about the source.

I disagree with this. There are clearly entities which are not Services (e.g. a Host, a Process, a Kubernetes Node or a Pod, etc). Including the service.* attributes in their telemetry is misleading and unnecessary since they have all the necessary attributes to describe themselves.

If this was the intent behind this PR I think it needs to be reversed since I personally completely misunderstood it and would have not approved it.

@jack-berg
Member Author

Yes, this was the intent of the PR. I was under the impression I was clarifying the stance that has been implied by our actions (see PR description). Agree that we should probably revert to continue discussion.

A telemetry source. OpenTelemetry has adopted a broad interpretation such that every telemetry source is a service. Examples include, but are not limited to: [..] Specific types of telemetry sources may have additional conventions defining domain specific information, but the service conventions are applicable to all telemetry sources.

I can't reconcile that we reject attempts to add new sets of identifying attributes for specific types of entities, like app.name, but need type-specific identifying attributes for things like a host. I can't think of a good heuristic for when a type of telemetry source needs special identifying attributes which aligns with the precedent of rejecting app.name. To me, we need to go one way or the other: either all telemetry sources share a set of attributes, or we allow special identifying attributes for all types of telemetry sources and accept proposals like app.name.

There are clearly entities which are not Services (e.g. a Host, a Process, a Kubernetes Node or a Pod, etc). Including the service.* attributes in their telemetry is misleading and unnecessary since they have all the necessary attributes to describe themselves.

They have attributes to describe themselves, and those same attributes are attached to other telemetry sources running on the host, process, or node as correlating metadata. This means it's challenging / impossible to tell whether something with a host.name is the host itself acting as the source of the telemetry or something else running on the host. You have to use some sort of algorithm where you arrange the types of telemetry sources hierarchically (e.g. k8s cluster -> node -> pod -> service) and look for the most specific identifying attributes to figure out what type of thing is actually sending telemetry. And since we don't define such a hierarchy or algorithm, it's essentially guesswork how to distinguish the type of thing sending data. And you need to know the type in order to know which of the identifying attribute sets present represents the actual telemetry source.

I think rejecting the idea that service.* attributes apply to all telemetry sources necessitates defining how to determine the identity of a resource which contains multiple sets of identifying attributes. I.e. does a resource with host.* attributes actually represent the host, or does it represent something running on the host?
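
The kind of heuristic described above can be sketched as follows. Everything here is an assumption for illustration: OpenTelemetry does not define such a hierarchy, the ordering and the choice of one identifying key per type are made up, and `guess_source_type` is a hypothetical helper. The sketch only shows the shape of the guesswork a consumer would have to do.

```python
from typing import Dict, Optional, Sequence, Tuple

# ASSUMPTION: an ordering from most to least specific. OpenTelemetry does not
# define this hierarchy; it is made up here to show the guesswork involved.
HIERARCHY: Sequence[Tuple[str, str]] = (
    ("service", "service.name"),
    ("k8s.pod", "k8s.pod.name"),
    ("k8s.node", "k8s.node.name"),
    ("host", "host.name"),
    ("k8s.cluster", "k8s.cluster.name"),
)


def guess_source_type(resource: Dict[str, str]) -> Optional[str]:
    """Return the most specific entity type whose identifying key is present."""
    for entity_type, identifying_key in HIERARCHY:
        if identifying_key in resource:
            return entity_type
    return None


# A service running on a host carries host.name as correlating metadata...
service_on_host = {"service.name": "checkout", "host.name": "node-12"}
# ...while the host's own telemetry carries host.name as its identity.
host_itself = {"host.name": "node-12"}

print(guess_source_type(service_on_host))  # service
print(guess_source_type(host_itself))      # host
```

Both resources contain `host.name`, and only the presence or absence of a more specific key separates them, so the result depends entirely on the assumed ordering.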

@yurishkuro
Member

Here's a possible solution:

  • resource.type: service | host | ...
  • resource.name: my-service | my.host.com | ...

Basically just replacing s/service/resource/ seems to address the concerns. And more resource-specific attributes like app.app-store (making it up) are both possible and recognizable because of resource.type. service.name can be treated as a fallback to resource.name (OTEL schema transforms should support such upgrade).
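
A minimal sketch of the fallback idea above, in plain Python. `resource.type` and `resource.name` are the attributes proposed in this comment, not existing semantic conventions, and `resolve_identity` is a hypothetical helper. An actual schema transform might look quite different.

```python
from typing import Dict, Optional, Tuple


def resolve_identity(resource: Dict[str, str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (type, name), preferring resource.* and falling back to service.name."""
    if "resource.name" in resource:
        return resource.get("resource.type"), resource["resource.name"]
    if "service.name" in resource:
        # Legacy data: treat service.name as a resource of type "service".
        return "service", resource["service.name"]
    return None, None


print(resolve_identity({"resource.type": "host", "resource.name": "my.host.com"}))
print(resolve_identity({"service.name": "my-service"}))
```

With an explicit type, a consumer would not need to infer what kind of thing is sending telemetry from which attribute prefixes happen to be present.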

@tigrannajaryan
Member

tigrannajaryan commented Jan 12, 2024

@jack-berg OK, I am open to considering it, but I haven't thought through the implications of every source having service.* attributes. I would feel a lot more comfortable if we reverted it and gave ourselves more time to carefully consider all implications. I now believe that I rushed my approval without fully understanding the PR; I am not sure about the other approvers.

@trask
Member

trask commented Jan 12, 2024

I agree with reverting for now, definitely some good points raised worth discussing more

joaopgrassi added a commit to dynatrace-oss-contrib/semantic-conventions that referenced this pull request Jan 12, 2024
@joaopgrassi
Member

I created the revert PR #638. Note that we will need a new PR when we are ready to do this change again.

10 participants