Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: add otel sdk tracing and metric providers to the core cli #4889

Merged
merged 1 commit into from
Mar 25, 2024

Conversation

jsternberg
Copy link
Contributor

@jsternberg jsternberg commented Feb 21, 2024

- What I did

This adds the code used by buildx and compose into the default CLI program
to help normalize the usage of these APIs and allow code reuse between
projects. It also allows these projects to benefit from improvements or
changes that may be made by another team.

At the moment, these APIs are a pretty thin layer on the OTEL SDK. It
configures an additional exporter to a docker endpoint that's used for
usage collection and is only active if the option is configured in docker
desktop.

This also upgrades the OTEL version to v1.19 which is the one being used by
buildkit, buildx, compose, etc.

- How I did it

I added a new interface to the DockerCli and implemented that interface using
code that is mostly identical to the same code in buildx which was adapted from
compose. It's missing the user-facing OTEL environment components that buildx
has though.

- How to verify it

Not done yet.

- Description for the changelog

Added configurable OTEL utilities and basic instrumentation to commands.

- A picture of a cute animal (not mandatory but encouraged)

@jsternberg
Copy link
Contributor Author

I've created this as a draft, but the code itself is mostly complete and could be merged as-is.

The code in this PR is already part of buildx and there's a version this was based off of in compose for the telemetry. I thought it might be a good idea to start working on integrating these two so buildx and compose could be using the same code and could benefit from the same changes.

@milas I'm really interested in feedback from you on this too since you have worked on the compose side of this.

@jsternberg jsternberg force-pushed the universal-telemetry-client branch from f5bc6da to e809aba Compare February 21, 2024 15:55
Copy link
Contributor

@krissetto krissetto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general i like the approach.

I do have a few thoughts though, I'll leave them here so we can discuss (maybe i'm missing some details):

  • I think integrating otel things into the cli like this could cause some issues down the road, since we'd be using otel structs and methods directly even in projects that import the cli to use this common code. Maybe it could be worth it to decouple our usage of telemetry things from our actual implementation using otel, also for keeping things more easily standardized across projects;
  • Given this ^, it might be a good idea to separate these bits into a dedicated repo/pkg that only exports the necessary functionality and abstracts otel resources away from the projects using it. Since the CLI is already doing double duty as both an application and a "library" for plugins, I'd try to avoid adding more responsibilities here. Separating things out, we can also collaborate on improvements to the required functionality more easily. Do you see any issues with this?;
  • Shall we also include some functionality for adding Attributes, span Events, etc so clients can do so without depending directly on otel pkgs?

cli/command/cli.go Outdated Show resolved Hide resolved
@jsternberg
Copy link
Contributor Author

@krissetto I agree with some of what you're saying on principle, but I do worry a bit about overabstracting in the case of the OTEL libraries.

I think integrating otel things into the cli like this could cause some issues down the road, since we'd be using otel structs and methods directly even in projects that import the cli to use this common code. Maybe it could be worth it to decouple our usage of telemetry things from our actual implementation using otel, also for keeping things more easily standardized across projects;

I think there's two components of otel that need to be treated separately. There's the library portion, which is mostly just interfaces, and the SDK, which is the implementation of the SDK. For most code, they should use the library portion and we shouldn't really worry about the actual implementation. The purpose of these interfaces is to create this separation and adding our own separation would probably not work very well.

For the SDK, the only part that's really supposed to use the SDK is the outermost section of the program. The part that configures the SDK and passes the TracerProvider or MeterProvider to the remaining code. I think this area would be the place to consider putting our own wrapper around it, but I suspect the effort to put a wrapper around it will result in code bloat that shows up mostly as a leaky abstraction rather than anything useful.

In particular, there is one area of this PR that's a leaky abstraction when it comes to the SDK.

// TelemetryClient provides the methods for using OTEL tracing or metrics.
type TelemetryClient interface {
	// Resource returns the OTEL Resource configured with this TelemetryClient.
	// This resource may be created lazily, but the resource should be the same
	// each time this function is invoked.
	Resource() *resource.Resource

        // TracerProvider returns a TracerProvider. This TracerProvider will be configured
	// with the default tracing components for a CLI program along with any options given
	// for the SDK.
	TracerProvider(opts ...sdktrace.TracerProviderOption) TracerProvider

	// MeterProvider returns a MeterProvider. This MeterProvider will be configured
	// with the default metric components for a CLI program along with any options given
	// for the SDK.
	MeterProvider(opts ...sdkmetric.Option) MeterProvider
}

The options to TracerProvider and MeterProvider are transparently taking the sdk options and using the sdk versions of these implementations. The returned providers themselves are abstracted behind interfaces.

While I'm not thrilled with this, the alternative is wrapping each option in a separate library. But, these options get fairly involved and you suddenly end up creating an interface for most of the structs in the SDK.

At the end of the day, no application or plugin is required to use the elements provided by the CLI. It's just convenient and takes away the hassle of figuring out how to do specific CLI-specific things with it. And I can't think of a situation where a CLI application that is using tracing or metrics wouldn't just use the official OTEL SDK or provide their own custom implementation of these methods.

Given this ^, it might be a good idea to separate these bits into a dedicated repo/pkg that only exports the necessary functionality and abstracts otel resources away from the projects using it. Since the CLI is already doing double duty as both an application and a "library" for plugins, I'd try to avoid adding more responsibilities here. Separating things out, we can also collaborate on improvements to the required functionality more easily. Do you see any issues with this?;

Mentioned above, but my primary concern is that the abstraction won't yield better code. The code in this PR is mostly low/no-cost if the functions aren't invoked. If you're integrating tracing or metrics into your docker cli plugin, they're pretty much exactly what you would want with options to replace certain aspects (such as the default resource).

Shall we also include some functionality for adding Attributes, span Events, etc so clients can do so without depending directly on otel pkgs?

Attributes, span events, and other things are included as part of the OTEL library and not part of the SDK so I don't think there's much value in abstracting these away. The otel libraries themselves should be safe to use directly and are considered stable for tracing and metrics. OTEL logs are mostly an internal abstraction inside of the collector and can mostly be ignored from an application point of view.

@jsternberg jsternberg force-pushed the universal-telemetry-client branch from e809aba to 0f9e229 Compare March 6, 2024 17:02
@codecov-commenter
Copy link

codecov-commenter commented Mar 6, 2024

Codecov Report

Merging #4889 (b484bc0) into master (a2f3f40) will decrease coverage by 0.43%.
Report is 11 commits behind head on master.
The diff coverage is 0.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4889      +/-   ##
==========================================
- Coverage   61.43%   61.01%   -0.43%     
==========================================
  Files         289      291       +2     
  Lines       20237    20374     +137     
==========================================
- Hits        12433    12431       -2     
- Misses       6903     7041     +138     
- Partials      901      902       +1     

@jsternberg jsternberg marked this pull request as ready for review March 6, 2024 20:14
@krissetto
Copy link
Contributor

@krissetto I agree with some of what you're saying on principle, but I do worry a bit about overabstracting in the case of the OTEL libraries.

I think integrating otel things into the cli like this could cause some issues down the road, since we'd be using otel structs and methods directly even in projects that import the cli to use this common code. Maybe it could be worth it to decouple our usage of telemetry things from our actual implementation using otel, also for keeping things more easily standardized across projects;

I think there's two components of otel that need to be treated separately. There's the library portion, which is mostly just interfaces, and the SDK, which is the implementation of the SDK. For most code, they should use the library portion and we shouldn't really worry about the actual implementation. The purpose of these interfaces is to create this separation and adding our own separation would probably not work very well.

Yeah, I agree with your reasoning here. I got a bit confused between sdk and lib things being new to otel things.

For the SDK, the only part that's really supposed to use the SDK is the outermost section of the program. The part that configures the SDK and passes the TracerProvider or MeterProvider to the remaining code. I think this area would be the place to consider putting our own wrapper around it, but I suspect the effort to put a wrapper around it will result in code bloat that shows up mostly as a leaky abstraction rather than anything useful.

In particular, there is one area of this PR that's a leaky abstraction when it comes to the SDK.

// TelemetryClient provides the methods for using OTEL tracing or metrics.
type TelemetryClient interface {
	// Resource returns the OTEL Resource configured with this TelemetryClient.
	// This resource may be created lazily, but the resource should be the same
	// each time this function is invoked.
	Resource() *resource.Resource

        // TracerProvider returns a TracerProvider. This TracerProvider will be configured
	// with the default tracing components for a CLI program along with any options given
	// for the SDK.
	TracerProvider(opts ...sdktrace.TracerProviderOption) TracerProvider

	// MeterProvider returns a MeterProvider. This MeterProvider will be configured
	// with the default metric components for a CLI program along with any options given
	// for the SDK.
	MeterProvider(opts ...sdkmetric.Option) MeterProvider
}

The options to TracerProvider and MeterProvider are transparently taking the sdk options and using the sdk versions of these implementations. The returned providers themselves are abstracted behind interfaces.

While I'm not thrilled with this, the alternative is wrapping each option in a separate library. But, these options get fairly involved and you suddenly end up creating an interface for most of the structs in the SDK.

Totally agree with not wrapping most of the SDK structs, I was initially thinking about only a core set of basic functionality that we expect most apps/plugins to use. Given the comments above, i see this makes less sense than I initially thought.

At the end of the day, no application or plugin is required to use the elements provided by the CLI. It's just convenient and takes away the hassle of figuring out how to do specific CLI-specific things with it. And I can't think of a situation where a CLI application that is using tracing or metrics wouldn't just use the official OTEL SDK or provide their own custom implementation of these methods.

This is true, even if I'd personally push for standardizing our approaches to use these common bits as much as possible.

Given this ^, it might be a good idea to separate these bits into a dedicated repo/pkg that only exports the necessary functionality and abstracts otel resources away from the projects using it. Since the CLI is already doing double duty as both an application and a "library" for plugins, I'd try to avoid adding more responsibilities here. Separating things out, we can also collaborate on improvements to the required functionality more easily. Do you see any issues with this?;

Mentioned above, but my primary concern is that the abstraction won't yield better code. The code in this PR is mostly low/no-cost if the functions aren't invoked. If you're integrating tracing or metrics into your docker cli plugin, they're pretty much exactly what you would want with options to replace certain aspects (such as the default resource).

Shall we also include some functionality for adding Attributes, span Events, etc so clients can do so without depending directly on otel pkgs?

Attributes, span events, and other things are included as part of the OTEL library and not part of the SDK so I don't think there's much value in abstracting these away. The otel libraries themselves should be safe to use directly and are considered stable for tracing and metrics. OTEL logs are mostly an internal abstraction inside of the collector and can mostly be ignored from an application point of view.

My main concern here was with our dependencies, and keeping the various projects all on the same version of the sdk/libs. I generally always try to avoid having many projs on potentially different versions of the same libs (even though SemVer should theoretically give us some guarantees, i've had too many bad experiences 😂).

No hard feelings though. Let's see what others think, and let's keep this PR moving!

@jsternberg
Copy link
Contributor Author

My main concern here was with our dependencies, and keeping the various projects all on the same version of the sdk/libs. I generally always try to avoid having many projs on potentially different versions of the same libs (even though SemVer should theoretically give us some guarantees, i've had too many bad experiences 😂).

My personal hope is that, with this in the CLI, it'll mostly standardize the version the various CLI programs use. I do think a follow up thing after that is maybe creating a separate repository for OTEL stuff in general so server and cli components are synced on the versions.

@jsternberg jsternberg force-pushed the universal-telemetry-client branch 2 times, most recently from 4fceb95 to f7e1640 Compare March 11, 2024 15:04
Copy link
Collaborator

@laurazard laurazard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks really good to me, thank you @jsternberg! I'd like to get these basic bits merged in quickly so we can start making use of them and moving forward with a proper OTEL implementation on the CLI side.
Just a couple of questions –

cli/command/telemetry.go Outdated Show resolved Hide resolved
cli/command/telemetry.go Show resolved Hide resolved
@jsternberg jsternberg force-pushed the universal-telemetry-client branch from f7e1640 to a41fbee Compare March 12, 2024 15:23
cli/command/telemetry.go Outdated Show resolved Hide resolved
cli/command/telemetry.go Outdated Show resolved Hide resolved
cli/command/telemetry_docker.go Outdated Show resolved Hide resolved
cli/command/telemetry.go Outdated Show resolved Hide resolved
@jsternberg jsternberg force-pushed the universal-telemetry-client branch from a41fbee to b484bc0 Compare March 15, 2024 16:42
Copy link
Member

@Benehiko Benehiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me at least this looks good! :)

@laurazard
Copy link
Collaborator

@neersighted @thaJeztah can you TAL when you get a sec?

Copy link
Member

@neersighted neersighted left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some (mostly Go-level/without a strong OTel background) thoughts; but overall the direction looks good.

cli/command/telemetry.go Outdated Show resolved Hide resolved
cli/command/telemetry.go Outdated Show resolved Hide resolved
cli/command/telemetry.go Show resolved Hide resolved
cli/command/telemetry.go Outdated Show resolved Hide resolved
cli/command/telemetry_docker.go Outdated Show resolved Hide resolved

// dockerExporterOTLPEndpoint retrieves the OTLP endpoint used for the docker reporter
// from the current context.
func dockerExporterOTLPEndpoint(cli Cli) (endpoint string, secure bool, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To elaborate on the above, this is more-or-less the code I want to see moved into the context itself (making "OTEL" a first class property/component of the context).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand a bit on this? I'm not really sure I understand what you mean.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see this eventually become part of the Context struct/have the lifecycle handled there; this is really just a note for a future refactor.

@laurazard laurazard requested a review from milas March 25, 2024 15:33
This adds the code used by buildx and compose into the default CLI
program to help normalize the usage of these APIs and allow code reuse
between projects. It also allows these projects to benefit from
improvements or changes that may be made by another team.

At the moment, these APIs are a pretty thin layer on the OTEL SDK. It
configures an additional exporter to a docker endpoint that's used for
usage collection and is only active if the option is configured in
docker desktop.

This also upgrades the OTEL version to v1.19 which is the one being used
by buildkit, buildx, compose, etc.

Signed-off-by: Jonathan A. Sternberg <[email protected]>
@jsternberg jsternberg force-pushed the universal-telemetry-client branch from b484bc0 to 89db01e Compare March 25, 2024 16:11
Copy link
Collaborator

@laurazard laurazard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🥳

@neersighted neersighted merged commit b39bbb4 into docker:master Mar 25, 2024
88 checks passed
@thaJeztah thaJeztah added this to the 27.0.0 milestone Mar 25, 2024
@jsternberg jsternberg deleted the universal-telemetry-client branch March 25, 2024 17:02
@vvoland vvoland modified the milestones: 27.0.0, 26.1.0 Apr 10, 2024
renovate bot added a commit to earthly/dind that referenced this pull request Apr 29, 2024
[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [docker/docker](https://togithub.com/docker/docker) | minor | `26.0.2`
-> `26.1.0` |

---

### Release Notes

<details>
<summary>docker/docker (docker/docker)</summary>

### [`v26.1.0`](https://togithub.com/moby/moby/releases/tag/v26.1.0)

[Compare
Source](https://togithub.com/docker/docker/compare/v26.0.2...v26.1.0)

#### 26.1.0

For a full list of pull requests and changes in this release, refer to
the relevant GitHub milestones:

- [docker/cli, 26.1.0
milestone](https://togithub.com/docker/cli/issues?q=is%3Aclosed+milestone%3A26.1.0)
- [moby/moby, 26.1.0
milestone](https://togithub.com/moby/moby/issues?q=is%3Aclosed+milestone%3A26.1.0)
- Deprecated and removed features, see [Deprecated
Features](https://togithub.com/docker/cli/blob/v26.1.0/docs/deprecated.md).
- Changes to the Engine API, see [API version
history](https://togithub.com/moby/moby/blob/v26.1.0/docs/api/version-history.md).

##### New

- Add configurable OpenTelemetry utilities and basic instrumentation to
commands.
For more information, see [OpenTelemetry for the Docker
CLI](https://docs.docker.com/config/otel).
[docker/cli#4889](https://togithub.com/docker/cli/pull/4889)

##### Bug fixes and enhancements

- Native Windows containers are configured with an internal DNS server
for container name resolution, and external DNS servers for other
lookups. Not all resolvers, including `nslookup`, fall back to the
external resolvers when they get a `SERVFAIL` answer from the internal
server. So, the internal DNS server can now be configured to forward
requests to the external resolvers, by setting `"features":
{"windows-dns-proxy": true }` in the `daemon.json` file.
[moby/moby#47584](https://togithub.com/moby/moby/pull/47584)

> \[!NOTE]
> This will be the new default behavior in Docker Engine 27.0.

> \[!WARNING]
> The `windows-dns-proxy` feature flag will be removed in a future
release.

- Swarm: Fix `Subpath` not being passed to the container config.
[moby/moby#47711](https://togithub.com/moby/moby/pull/47711)
- Classic builder: Fix cache miss on `WORKDIR <directory>/` build step
(directory with a trailing slash).
[moby/moby#47723](https://togithub.com/moby/moby/pull/47723)
- containerd image store: Fix `docker images` failing when any image in
the store has unexpected target.
[moby/moby#47738](https://togithub.com/moby/moby/pull/47738)

</details>

---

### Configuration

📅 **Schedule**: Branch creation - "after 6am on monday" (UTC), Automerge
- At any time (no schedule defined).

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://developer.mend.io/github/earthly/dind).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4zMjEuMiIsInVwZGF0ZWRJblZlciI6IjM3LjMyMS4yIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJyZW5vdmF0ZSJdfQ==-->

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: idodod <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants