Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Donation proposal: OpenTelemetry Go Automatic Instrumentation #2

Closed
edeNFed opened this issue May 18, 2022 · 36 comments
Closed

Donation proposal: OpenTelemetry Go Automatic Instrumentation #2

edeNFed opened this issue May 18, 2022 · 36 comments

Comments

@edeNFed
Copy link
Contributor

edeNFed commented May 18, 2022

Description of donated project

This project adds OpenTelemetry instrumentation to Go applications without having to modify their source code.
Instrumentation is done by using eBPF uprobes.

Automatic instrumentation is available for a wide range of Go applications: Go version 1.12 and above, in addition to supporting stripped binaries (compiled with go build -ldflags "-s -w")

Instrumented libraries follow the OpenTelemetry specification and semantic conventions to produce standard OpenTelemetry data.

GitHub repos:
https://github.com/keyval-dev/opentelemetry-go-instrumentation
https://github.com/keyval-dev/offsets-tracker (used by this project, can also be donated if needed)

A detailed technical explanation is documented here.

Summary of existing usage of donated project

This project is a core part of keyval's product and production systems.
We are using this instrumentation successfully on many Go applications including multiple popular open-source Go projects.

Benefits to the OpenTelemetry community

OpenTelemetry Adoption

Go is the most popular programming language in the CNCF landscape (see DevStats)
Providing automatic instrumentation for Go applications would allow projects without the resources needed for implementing manual instrumentation to easily adopt OpenTelemetry.

Alignment

Currently, automatic instrumentation exists for Python, .NET, JavaScript, and Java. Our goal is to provide the same level of automatic instrumentation for Go applications. We hope this project will lead the way for automatic instrumentation for other compiled languages such as Rust and C++ which may also be implemented using eBPF.

Security

Running eBPF programs requires elevated privileges. We believe that being part of the OpenTelemetry community would make users more comfortable using this project.

Maintenance plan

Keyval will continue to maintain this project and we welcome the opportunity to work with more contributors.
Multiple developers in the OpenTelemetry community (from Go SDK SIG, Operator SIG & eBPF SIG) and in the eBPF community have expressed interest in contributing to this project.

Our current roadmap contains the following tasks:

  1. Adding more instrumentations (with PostgreSQL and MySQL coming soon)
  2. Integration with Go manual instrumentation
  3. Better goroutine id tracking (supporting more complex scenarios)
  4. Add Go as part of automatic instrumentation done by OpenTelemetry Operator - In Progress
  5. Context propagation
  6. Metrics support

Existing software license(s)

This project is licensed under the terms of the Apache 2.0 open source license.

List of trademarks (if any)

None

@tigrannajaryan
Copy link
Member

@edeNFed thank you for the proposal. I believe this can be a very valuable addition to OpenTelemetry.

I had a quick glance at the project and it looks very promising. I am going to add this to today's TC meeting agenda (good timing!). A few quick questions to help us get started:

  • What's the history of the project? I see the first commit is a few months ago. How much real-world production usage has the instrumentation seen?

  • Does the solution work on all platforms targeted by Go compiler or only on some subset? (e.g. x86-32/64, arm, etc).

  • Which functions are instrumented? Do you locate and instrument specific functions from specific libraries? If yes, is there a list of functions instrumented somewhere?

  • What is the typical runtime overhead? I assume there is a startup cost to inject the probes and there is an ongoing cost when the probes are hit. Do you have any perf measurements done one some typical binary?

  • What are the failure modes? If the probes do something wrong can they crash the instrumented application?

A few more questions (quotes from the "How it works" doc):

Notice that one of our design goals is to support stripped Go binaries - meaning binaries that do not contain debug information. In order to support stripped binaries and to create a stable instrumentation, we created a library called offsets-tracker. This library tracks the offset of different fields across versions.

Do you regenerate the offsets every time a new version of a supported library is released?

What happens if the version of library used is unsupported? Does instrumentation detect it and bail out?

We overcome this issue by analyzing the target binary and detecting all the return statements in the instrumented functions. We then place a uprobe at the end of each return statement. This uprobe invokes the eBPF code that collects the end timestamp.

How reliable is this detection? Is it possible to miss a return statement? What if for example the function exits by panic and not by return? If missed does the span remain forever unfinished?

Are these analysis dependent on the platform the machine code is in? Does it work on x86-64 only or other platforms too?

Since version 1.17 and above, Go changed the way it passes arguments to functions. Prior to version 1.17, Go placed arguments in the stack in the order they were defined in the function signature. Version 1.17 and above uses the machine registers to pass arguments.
We overcome this by analyzing the target binary and detecting the compiled Go version.

I assume this is dependent not only on the compiler version but also the target platform, so each specific combination may require a different analyzing code. That sounds like quite a lot of targets to support and keep maintaining. What has been your experience with this? How cumbersome it is?

@edeNFed
Copy link
Contributor Author

edeNFed commented May 18, 2022

Hi @tigrannajaryan, thank you for the quick response.

  1. We have been using this instrumentation for about 6 months. Overall, I estimate we tested it on about 30 different projects.
  2. This project requires an operating system that supports eBPF - which basically means Linux (maybe windows soon). We currently test it only on x86 but there should be no technical limitation to running it on other platforms.
  3. Each instrumentation declares the instrumented functions at the FuncNames() function. We dynamically detect the available functions in the target binary and invoke only the relevant instrumentations. Currently, the following functions are instrumented:
  • google.golang.org/grpc.(*Server).handleStream
  • github.com/gorilla/mux.(*Router).ServeHTTP
  • google.golang.org/grpc.(*ClientConn).Invoke
  • net/http.(*ServeMux).ServeHTTP
  1. As this instrumentation runs as a separate process from the target application, there is no startup overhead. We are currently working on performing performance testing so I don't have exact numbers yet. Theoretically, the performance overhead would be doing another context switch every time uprobe is hit. This should be neglectable compared to operations such as HTTP requests or DB queries. However, as I said, we are still validating this.
  2. In case of failure, we detach all the attached uprobes and exit. The target application is not affected even if the instrumentation process itself crashes.
  3. Currently, offsets are generated periodically. However, this is something that is definitely on our roadmap and we want to do very soon.
  4. Currently unsupported version would not be instrumented. We are debating about trying to guess the offset (if $current_version > $max_tracked_version use $max_tracked version, if $current_version < $min_tracked_version use $min_tracked_version or something like that).
  5. panics are not handled currently. I guess we can detect unfinished spans and terminate them after some timeout.
  6. The registers are selected at runtime (see https://github.com/keyval-dev/opentelemetry-go-instrumentation/blob/master/pkg/instrumentors/bpf/headers/arguments.h) If I am not mistaken, supporting other platforms like ARM should be a matter of implementing a version of get_argument_by_reg with the order of the registers matching to ARM.

Let me know if I can help with anything else 😄

@tigrannajaryan
Copy link
Member

@edeNFed thanks a lot for the answers, they are very helpful.

The TC has discussed the donation. We believe that it indeed can be a very valuable addition to OpenTelemetry. We have 2 primary concerns:

  1. How is context propagation going to work?
  2. How will auto instrumentation work with manual instrumentation?

Both of these are on your roadmap already, which is great. It would be very useful if you could provide more details about what you think the solutions will look like. Are there any design documents on these topics?

We do not necessarily require that donations are complete and done before we accept them. However we do need certain level of confidence that important OpenTelemetry functionality is implementable, even if some of the implementation work happens after the donation.

The TC believes the above 2 topics are important for an OpenTelemetry instrumentation. We would like to learn a bit more about how you intend to solve these before we are ready to move forward.

@edeNFed
Copy link
Contributor Author

edeNFed commented Jun 2, 2022

Makes sense.
I hope to finish those documents by next week.

@edeNFed
Copy link
Contributor Author

edeNFed commented Jun 30, 2022

Hi @tigrannajaryan,
Sorry for the delay, here are the two requested design documents:

I believe those documents should give a good idea regarding how we are going to implement both features.
Let me know if you have any other questions

@tigrannajaryan
Copy link
Member

Thanks @edeNFed

I will add this to the next week's TC meeting agenda to review the proposal with the new design documents again.

@tigrannajaryan
Copy link
Member

I had a quick look at the Context Propagation proposal. The proposal uses goroutine ids to associate incoming and outgoing requests. AFAIK there is no official way to obtain goroutine ids, in fact this is actively discouraged by Go language documentation. I know there are some hacky ways to obtain the goroutine id but I am not aware of any production quality implementations to get the goroutine id.

IMO this may be a deal breaker. I wonder what @open-telemetry/go-approvers, @open-telemetry/collector-approvers, @open-telemetry/collector-contrib-approvers think.

@edeNFed
Copy link
Contributor Author

edeNFed commented Jul 5, 2022

Yuri had a similar opinion, reposting what I answered on the other issue:

Regarding depending on goroutine ids: The main objection that I saw is that Go's internal structures like runtime.G can greatly change between Go versions and therefore it is hard to assume which offset the goid field will have in the G struct. This is solved by automatically tracking the goroutine id field in every version of Go. In the very unlikely case that the goid field will be removed (Go's scheduler also need some way to track goroutines) the instrumentation will fail at compile time. We then will probably write a different logic based on some other unique identifier for goroutines.

@RichiH
Copy link
Member

RichiH commented Jul 6, 2022

Context: I stumbled over this issue following up on some notes from OpenTelemetry Day in Austin; we had a well-attended eBPF breakout session there. See https://photos.app.goo.gl/EboFNpkGz41EE6ve6 for some notes from that session.

@tigrannajaryan the FAQ discourages the use of goroutine IDs by developers while designing and writing code. It lists good arguments there. It does not seem to argue against using them for automated extraction of information at runtime.

Carrying a small lookup table for well-known offsets as linked to by @edeNFed seems to be a reasonable and lightweight trade-off. The one constraint it introduces is that the autoinstrumentation needs to be at least as current as the Go version used for compilation, which seems to be another reasonable trade-off.

@edeNFed
Copy link
Contributor Author

edeNFed commented Jul 11, 2022

There is another approach to avoid tracking goroutine ids (suggested by @tedsuo):
Storing the spancontext in the context.Context argument of the instrumented function instead of in an eBPF map.
The downsides to this approach are:

  • We lose the ability to create spans for functions without context argument. For instance, currently, we are able to create spans for database/sql driver invocations such as result := db.Query("SELECT * FROM table").
  • If the user won't pass the context argument correctly between functions the trace will break.

Overall I think this approach is less hacky but may be more error-prune (from the user perspective). If the TC believes this is a better way, I'm also good with this implementation.

@tigrannajaryan
Copy link
Member

Overall I think this approach is less hacky but may be more error-prune (from the user perspective). If the TC believes this is a better way, I'm also good with this implementation.

It may also be less error prone in certain situations. context.Context is the standard propagation mechanism. The goroutine id-based proposal in a sense is an alternate way to propagate execution context which only works if the execution stays in the same goroutine. However, in many cases in Go code the receiving goroutine is different from the processing or sending one. context.Context correctly works in such cases but goroutine-id doesn't.

My personal opinion is that context.Context-based approach is preferable. Perhaps, in addition to that the goroutine id tracking can be optionally enabled as an experimental feature.

@edeNFed
Copy link
Contributor Author

edeNFed commented Jul 13, 2022

👍🏻 Agree. Please let me know if there are more pending questions.

@jsuereth
Copy link

Regrading manual + automatic instrumentation coordination, can you update with what you'd do for Metrics (and soon, Logs) APIs?

I assume it'd be similar but would be good to write down.

@edeNFed
Copy link
Contributor Author

edeNFed commented Jul 19, 2022

@jsuereth sure here are my initial thoughts about coordination with metrics/logs manual instrumentation:

Metrics:
I think that when talking about combining manual and automatic instrumentation for metrics, the main benefit is to automatically add exemplars for manually created metrics. It looks like exemplars support for Go SDK is not implemented yet.
I see that all the Metrics APIs functions get context.Context as an argument so I think the same approach of modifying it to contain the current span context should work also for metrics. I don't see a use case of the same metric being incremented by both manual and automatic instrumentations but if you feel like this is also something that needs to be supported, I think we can also achieve this by attaching uprobes to the Metrics SDK (very similar to the approach used for traces).

Logs:
logs are pretty early, but I think that we could add traceId and spanId to any logging library that gets context.Context as an argument.

@jsuereth
Copy link

Understand the point. My main concerns are as follows:

  1. Autoinstrumentation + Manual instrumentation can co-exist
  2. Users know where to configure exporters/SDK features. For Autoinstrumentation, can we do so ONLY in the autoinstrumentation and have it hijack any manual?
  3. Will all features work in a way users can't tell the difference between Auto + Manual instrumentation.

It sounds like yes, but for (2) wantedto make sure, e.g. Auto + manual instrumetnation use the same exporters/buffering/etc.

@edeNFed
Copy link
Contributor Author

edeNFed commented Jul 19, 2022

Yes this is the plan. I will add it to the roadmap and will extend the design documents to also describe metrics and logs.

@MrAlias
Copy link
Contributor

MrAlias commented Jul 19, 2022

+1 on not using the goroutine ID as the source of operation identity.

The important part of tracing instrumentation is the ability to connect spans into a unified trace that represents the operation workflow.

I'm guessing there is a plan to introspect a context to determine if there is an active span present that can act as a parent, but what happens when there is not? There is a good chance that many orphaned spans will be created if every function that accepts a context is encapsulated in a span, but there is an origination issue if you never start a span because of the absence of a parent. How do you plan to originate traces?

@dashpole I know has put some thought into this. I would value his take if he has time to provide it 😄.

@edeNFed
Copy link
Contributor Author

edeNFed commented Jul 20, 2022

if every function that accepts a context is encapsulated in a span

This is not how it works.
For automatic instrumentation, spans are created only for a specific set of instrumented functions (gRPC calls, HTTP, DB, etc).
For manual instrumentation, everything is the same, the user has to explicitly create spans by invoking OTel API
and the manually created spans will be placed correctly in a trace with the automatically created spans.

Orphans are also not a problem, If there is no active span in the context a new span will be created. Basically, it means that the root span of the trace will be the first instrumented function that got invoked in the request.

@MrAlias
Copy link
Contributor

MrAlias commented Jul 20, 2022

spans are created only for a specific set of instrumented functions (gRPC calls, HTTP, DB, etc).

Gotcha 👍

Do we have a list of functions this will support? Also, how will library support be expanded?

I would be hesitant to add gRPC support, at least initially, as it is not included in the standard library. But it brings up the question about how 3rd party libraries will be supported and how will we separate them (or not?).

@edeNFed
Copy link
Contributor Author

edeNFed commented Jul 20, 2022

Every implementation of the Instrumentor interface declares which target functions are instrumented. You can find all the implementations here. The automatic instrumentation looks for the specified function in the target binary file if the function does not exists (for example a Go app that does not use gRPC) than the instrumentor is not loaded.

@MrAlias
Copy link
Contributor

MrAlias commented Jul 21, 2022

Every implementation of the Instrumentor interface declares which target functions are instrumented. You can find all the implementations here. The automatic instrumentation looks for the specified function in the target binary file if the function does not exists (for example a Go app that does not use gRPC) than the instrumentor is not loaded.

Gotcha, thanks.

From a community support perspective, how are we planing to maintain this? Who will comprise the maintainer roles and what groups plan to support the development?

@edeNFed
Copy link
Contributor Author

edeNFed commented Jul 21, 2022

Maintenance plan

Keyval will continue to maintain this project and we welcome the opportunity to work with more contributors.
Multiple developers in the OpenTelemetry community (from Go SDK SIG, Operator SIG & eBPF SIG) and in the eBPF community have expressed interest in contributing to this project.

I guess we should follow the same process that other donations went through.

@MrAlias
Copy link
Contributor

MrAlias commented Jul 22, 2022

Maintenance plan
Keyval will continue to maintain this project and we welcome the opportunity to work with more contributors.
Multiple developers in the OpenTelemetry community (from Go SDK SIG, Operator SIG & eBPF SIG) and in the eBPF community have expressed interest in contributing to this project.

I guess we should follow the same process that other donations went through.

@open-telemetry/technical-committee correct me if I'm wrong, but I'm not sure there is a precedence. It looked like other donations were handed off to existing SIGs.

I imagined this living in its own repository, like the Java auto-instrumentation. Is that the intention?

If so, we might also need a distinct maintainer group from the opentelemetry-go maintainers. I wouldn't want to assume we all have the bandwidth to be a part of this. I would be happy to be a maintainer for this, but would want to ensure others would as well.

Similarly for Approvers, we would want to build an initial group that might differ from the opentelemetry-go group.

@pellared
Copy link
Member

pellared commented Jul 25, 2022

.NET auto-instrumentation has it in a similar way: separate SIG + teams from the .NET SDK.

@MrAlias
Copy link
Contributor

MrAlias commented Jul 25, 2022

There is another approach to avoid tracking goroutine ids (suggested by @tedsuo): Storing the spancontext in the context.Context argument of the instrumented function instead of in an eBPF map. The downsides to this approach are:

Is it possible for for an eBPF to modify the argument to a function? I'm not the most familiar with the technology, but I thought the eBPF virtual machine has read-only access to the syscall parameters and return value.

@edeNFed
Copy link
Contributor Author

edeNFed commented Jul 26, 2022

Yes. Please look at the Context propagation design doc I added a proof of concept for that in the document.

@jmacd
Copy link
Contributor

jmacd commented Jul 29, 2022

@edeNFed OpenTelemetry Community guidelines for donations require a timely response, I'm very sorry we have not been able to assemble one until now.

The community reviewed this proposal of yours and @pdelewski's #3. We see both of these as technically viable solutions to the auto-instrumentation problem. OpenTelemetry endorses work on auto-instrumentation as a sub-group of each language SIG, and it appears that we have the making of a new OpenTelemetry-Go auto-instrumentation group in these two donation proposals.

We like both of these proposals, and we recognize that methods for auto-instrumentation are only part of the larger-problem that includes all aspects of automatic SDK setup and program initialization. From what we can tell, there is not a hard boundary between automatically configuring the Golang SDK and automatically configuring Golang instrumentation. We recommend the creation of an opentelemetry-go-instrumentation GH repository, that will be initially populated with current opentelemetry-go approvers and maintainers. The goal should be to establish a connected but separate SIG and to plan and develop common frameworks for Golang auto-instrumentation. We expect new members to join this group with expertise in source-level auto-instrumentation and eBPF-level auto-instrumentation; eventually we expect these repositories to have separate ownership, the way Java and .NET languages have done already in OpenTelemetry.

We also expect members of the OpenTelemetry-Go vendor community looking to streamline and automate SDK setup for vendor-specific configuration (or, e.g., those looking for open-telemetry/opentelemetry-specification#1773 for Go), to work together in this new repository.

Finally, we look forward to members of the auto-instrumentation group to work with the OpenTelemetry-Go maintainers and Go developers to bridge gaps between built-in diagnostic features and telemetry SDK support. See, for example,

@edeNFed are you willing to take ownership of the code that you propose donating and eventually become a maintainer of the (proposed) new opentelemetry-go-instrumentation repository?

@edeNFed
Copy link
Contributor Author

edeNFed commented Jul 29, 2022

@jmacd Yes. I agree to take ownership over the proposed code and become a maintainer in the new repository.

Thank you for taking the time to review this donation proposal.

@MikeGoldsmith
Copy link
Member

@jmacd @edeNFed We're definitely open to helping on this project alongside the work to simplify SDK config / setup for the main SDK 👍🏻

cc @JamieDanielson @robbkidd @kentquirk

@jmacd
Copy link
Contributor

jmacd commented Aug 3, 2022

@edeNFed I created https://github.com/open-telemetry/opentelemetry-go-instrumentation. One of the first tasks for you and @pdelewski will be to become members of the OpenTelemetry org and begin sending and reviewing PRs in this repository. I copied the current Go maintainers in as maintainers of this project temporarily; please consider me your sponsor, I will be glad to help review the first steps in this repository as we work to establish you as new a new maintainer.

@jmacd jmacd transferred this issue from open-telemetry/community Aug 3, 2022
@jmacd
Copy link
Contributor

jmacd commented Aug 3, 2022

I've tranfered this issue. We can close this when the donation has been merged. Thanks, @edeNFed

@edeNFed
Copy link
Contributor Author

edeNFed commented Aug 3, 2022

Opened #4

@dineshg13
Copy link
Member

@edeNFed The go-instrumentation uses https://github.com/keyval-dev/offsets-tracker to generate offset_results.json, I believe that is the critical input to the code. What are your thoughts , on how this work once the code is merged ?
Ideally , i believe offsets-tracker should be part of open-telemetry , but would be like to know your thoughts .

@edeNFed
Copy link
Contributor Author

edeNFed commented Nov 2, 2022

Hi @dineshg13 I totally agree (I mentioned the offset tracker repo in the original donation proposal). Not sure if this is needed as a first step but ultimately I also think this should be owned by the community.

@dineshg13
Copy link
Member

Thanks @edeNFed , i don't think it is necessary as a first step. We can add it to our roadmap

@dineshg13
Copy link
Member

Closing issue as PR is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants