Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autoconfigure experimental OTLP retry #3791

Merged
merged 7 commits into from
Nov 9, 2021

Conversation

jack-berg
Copy link
Member

I'm taking another try at making progress on OTLP retry! 😬

The PR #3636 draft PR demonstrated some strategies for implementing the OTLP retry required as a "MUST" in the spec. There were some good conversations and the resolution was to go to try to improve the wording of the spec so that the java implementation could be confident in its alignment.

I've attempted to improve the spec language, but that has stalled due to what I would describe as a catch 22. The spec says SDKs "MUST" implement retry, but lacks specificity needed by SDKs to implement with confidence that they'll be aligned with future language. At the same time the spec is reluctant to add more specific language without supporting data.

This PR proposes a strategy for adding experimental support for OTLP retry configuration. I hope it can help progress the issue by providing additional data to influence spec language. Here's what I propose:

  • Define a retry policy as a bundle of the parameters described in the gRPC retry specification.
  • For gRPC exporters using a gRPC client (i.e. not okhttp), define a method for configuring the retry policy via DefaultGrpcExporterBuilder#addRetryPolicy(RetryPolicy). This method is on a class in an internal package. Neither the gRPC over okhttp or http/protobuf exporters have support for retry in this PR.
  • Provide an internal accessor for accessing the DefaultGrpcExporterBuilder delegate of the exporter builders. I.e. you can access the OtlpGrpcSpanExporterBuilder#delegate field reflectively, and call addRetryPolicy() on it.
  • Provide an experimental autoconfigure parameter for enabling a reasonable default retry policy via otel.experimental.exporter.otlp.retry.enabled. When enabled, this calls sets up the retry policy via the internal accessors.

The net affect is the ability to add a retry policy without any additional public API surface area. The compromise needed to accomplish this was an accessor that uses reflection. Maybe there's another way, but I couldn't think of one.

Why should we care about retry?
This isn't a theoretical problem - I run otel for applications and I regularly see failed exports. Some applications may be able to tolerate intermittent failures in exports, but some will not. Lack of tolerance for network faults is a gap for production readiness.

@jack-berg jack-berg requested a review from a user October 26, 2021 20:12
@jack-berg jack-berg requested a review from Oberon00 as a code owner October 26, 2021 20:12
@jack-berg jack-berg mentioned this pull request Oct 26, 2021
@codecov
Copy link

codecov bot commented Oct 26, 2021

Codecov Report

Merging #3791 (16a663f) into main (f6754b6) will increase coverage by 0.01%.
The diff coverage is 91.17%.

Impacted file tree graph

@@             Coverage Diff              @@
##               main    #3791      +/-   ##
============================================
+ Coverage     89.28%   89.30%   +0.01%     
- Complexity     3956     4017      +61     
============================================
  Files           473      476       +3     
  Lines         12307    12458     +151     
  Branches       1207     1217      +10     
============================================
+ Hits          10988    11125     +137     
- Misses          908      920      +12     
- Partials        411      413       +2     
Impacted Files Coverage Δ
...etry/exporter/otlp/internal/grpc/GrpcExporter.java 100.00% <ø> (ø)
.../otlp/internal/grpc/OkHttpGrpcExporterBuilder.java 71.73% <0.00%> (-1.60%) ⬇️
...exporter/otlp/logs/OtlpGrpcLogExporterBuilder.java 100.00% <ø> (ø)
...er/otlp/metrics/OtlpGrpcMetricExporterBuilder.java 100.00% <ø> (ø)
...porter/otlp/trace/OtlpGrpcSpanExporterBuilder.java 100.00% <ø> (ø)
...pentelemetry/sdk/autoconfigure/OtlpConfigUtil.java 90.90% <66.66%> (-0.99%) ⬇️
...otlp/internal/grpc/DefaultGrpcExporterBuilder.java 93.44% <71.42%> (-6.56%) ⬇️
...ntelemetry/exporter/otlp/internal/RetryPolicy.java 100.00% <100.00%> (ø)
...try/exporter/otlp/internal/RetryPolicyBuilder.java 100.00% <100.00%> (ø)
.../exporter/otlp/internal/grpc/GrpcExporterUtil.java 100.00% <100.00%> (ø)
... and 30 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6754b6...16a663f. Read the comment docs.

Copy link
Contributor

@anuraaga anuraaga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach of avoiding public API with reflection seems quite reasonable to me

@@ -19,8 +19,7 @@ testSets {

dependencies {
api(project(":sdk:logs"))

implementation(project(":exporters:otlp:common"))
api(project(":exporters:otlp:common"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change needed? common only has internal code, so we'd expect an implementation dependency to always suffice

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is.

Code in :sdk-extensions:autoconfigure needs to access internal code of :exporters:otlp:common (i.e. the RetryPolicy class). The :sdk-extensions:autoconfigure module has only a compileOnly dependency on :exporters:otlp:all (trace) and :exporters:otlp:metrics, since it requires users to add the runtime dependency. But with a runtime dependency on :exporters:otlp:all and :exporters:otlp:metrics, they also now need a runtime dependency on :exporters:otlp:common in order to have access to RetryPolicy. Making :exporters:otlp:common an api dependency makes it work without the additional dependency.

If this eventually matures, I think :exporters:otlp:common would be the natural home for RetryPolicy, so at that point a api gradle dependency on :exporters:otlp:common would make sense.

@jack-berg
Copy link
Member Author

@jkwatson / @anuraaga what are you thoughts on including the upcoming release?

@anuraaga
Copy link
Contributor

anuraaga commented Nov 8, 2021

@jkwatson Want to review this? Otherwise we can probably go ahead and merge

@jkwatson
Copy link
Contributor

jkwatson commented Nov 9, 2021

@jkwatson Want to review this? Otherwise we can probably go ahead and merge

Should be fine from my perspective. I can give it a thorough review tomorrow if you think it needs it

@anuraaga anuraaga merged commit a1a45d2 into open-telemetry:main Nov 9, 2021
* @throws IllegalArgumentException if the instance does not contain a field called "delegate" of
* type {@link DefaultGrpcExporterBuilder}
*/
public static <T> DefaultGrpcExporterBuilder<?> getDelegateBuilder(Class<T> type, T instance) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we scope T to at least have some bounds? Do we have a common interface we could enforce here to make it a bit clearer what this is for/about? This is a weird API to have be public at the moment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we scope T to at least have some bounds? Do we have a common interface we could enforce here to make it a bit clearer what this is for/about?

Unfortunately not.

This is a weird API to have be public at the moment.

Yeah its not ideal, but at least its an internal package. There's tons of public stuff in the internal packages that is questionable if you view it as truly public.

The alternative to this was to have this functionality package private in the autoconfigure module. I like this approach though because it allows someone to programmatically configure a retry policy that is different from the default one that is used when you enable it via env vars.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow I missed this was internal. Carry on!

@jack-berg jack-berg mentioned this pull request Nov 9, 2021
This was referenced Dec 19, 2021
@tylerbenson
Copy link
Member

@jack-berg is this still "experimental"? Should we consider making it official at this point?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants