
OTLP retry #3636

Closed

Conversation

jack-berg
Member

There's an outstanding issue related to supporting this in Java, but the spec is unambiguous about the need for a retry strategy in SDKs:

Transient errors MUST be handled with a retry strategy. This retry strategy MUST implement an exponential back-off with jitter to avoid overwhelming the destination until the network is restored or the destination has recovered.

While I think the spec could use clarity about the default retry strategy, there's enough prior art that we should be able to at least provide a mechanism for users to configure a retry strategy if they choose.

I put together this branch to demonstrate what a retry implementation might look like. Along the way I learned:

  • gRPC's built-in retry functionality is insufficient. It doesn't support jitter, which is required by the spec. And it wouldn't be applicable to our otlp/http exporters.
  • It's not clear we can use an off-the-shelf retry library. This branch uses Failsafe, which is great, but I learned it's littered with usages of CompletableFuture (which I learned today is not available in the version of Android we target). Another popular library called resilience4j also uses CompletableFuture. There's also guava-retrying and retry4j, which haven't been updated in a couple of years.

Wanted to propose this to help the discussion move forward.

@jkwatson
Contributor

If the core gRPC retries don't meet the requirements of the spec, I think we should go back to the spec and request a re-think.

@jack-berg
Member Author

The thing is, the jitter functionality the spec dictates, but which gRPC doesn't support out of the box, is really useful for reducing contention.

Also, if we're going to need retry for otlp/http anyway, it's not such a bad thing to use a different retry mechanism, because otlp/grpc and otlp/http can then be consistent.

I think it would be pretty straightforward to build a simple retry mechanism. It could live in a :exporters:otlp:common *.internal.* package. It could be built with Future instead of CompletableFuture, and have no external dependencies.
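To make that concrete, here is a minimal sketch of what such a zero-dependency helper could look like, assuming a blocking retry loop with exponential backoff and full jitter; the class and parameter names are illustrative, not the branch's actual code. An async variant could instead schedule attempts on a ScheduledExecutorService and hand back a plain Future.

import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

final class SimpleRetrier {
  private static final Random RANDOM = new Random();

  private SimpleRetrier() {}

  // Calls the action up to maxAttempts times, sleeping a random duration in
  // [0, backoff) between attempts (full jitter), with the backoff window growing
  // by multiplier up to maxBackoffMillis.
  static <T> T callWithRetry(
      Callable<T> action,
      int maxAttempts,
      long initialBackoffMillis,
      long maxBackoffMillis,
      double multiplier)
      throws Exception {
    long backoff = initialBackoffMillis;
    Exception lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        lastFailure = e;
        if (attempt == maxAttempts) {
          break;
        }
        TimeUnit.MILLISECONDS.sleep((long) (RANDOM.nextDouble() * backoff));
        backoff = Math.min((long) (backoff * multiplier), maxBackoffMillis);
      }
    }
    throw lastFailure;
  }
}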

@jkwatson
Contributor

It's a huge bummer that we can't use CompletableFuture, agreed. Perhaps this feature can be unsupported on Android, or we can provide a less sophisticated option for Android, maybe?

import io.opentelemetry.exporter.otlp.internal.ExponentialBackoffConfig;
import java.time.Duration;

@SuppressWarnings("InterfaceWithOnlyStatics")

heh. Let's not make it an interface, but a utility class. :)

@jkwatson
Contributor

I'm a little leery of adding a mandatory dependency into the OTLP exporter modules. What if we define our own Retryer interface and provide a range of implementations in extension modules?
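As a sketch of that idea, the interface could be as small as the following (a hypothetical shape, not a concrete API proposal); extension modules could then ship, say, a gRPC-native implementation alongside a simple built-in one.

import java.util.concurrent.Callable;

public interface Retryer {
  // Invokes the action, retrying according to this Retryer's policy,
  // and returns the result of the first successful attempt.
  <T> T execute(Callable<T> action) throws Exception;
}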

@jack-berg
Member Author

It's a huge bummer that we can't use CompletableFuture, agreed. Perhaps this feature can be unsupported on Android, or we can provide a less sophisticated option for Android, maybe?

I'm a little leery of adding a mandatory dependency into the OTLP exporter modules.

Agree that it's a bummer, and also don't like the idea of adding another dependency to the exporter modules.

I've been doing some prototyping to see what it would take to have a zero-dependency, minimalist retry executor. My goal is to have a contract that's something like:

public class RetryExecutor {
  public <T> ListenableFuture<T> submit(Function<RetryContext, ListenableFuture<T>> supplier);
}

Where the supplier is called one or more times to obtain a future result based on the rules of a RetryConfig, and aborts are handled according to a Predicate<Throwable>.

It's actually not bad. Pretty easy to understand and not too much code. Failsafe and these other libraries are great, but they have a much wider scope than what we need. I think we can definitely get away with a custom solution.
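As a rough illustration of what those "rules of a RetryConfig" might contain (a hypothetical shape, not the prototype's actual class):

import java.time.Duration;
import java.util.function.Predicate;

public final class RetryConfig {
  // Maximum number of times the supplier will be invoked, including the first attempt.
  private final int maxAttempts;
  // Backoff window for the first retry; later windows grow by backoffMultiplier up to maxBackoff.
  private final Duration initialBackoff;
  private final Duration maxBackoff;
  private final double backoffMultiplier;
  // Throwables matching this predicate cause the retry loop to abort immediately.
  private final Predicate<Throwable> abortPredicate;

  public RetryConfig(
      int maxAttempts,
      Duration initialBackoff,
      Duration maxBackoff,
      double backoffMultiplier,
      Predicate<Throwable> abortPredicate) {
    this.maxAttempts = maxAttempts;
    this.initialBackoff = initialBackoff;
    this.maxBackoff = maxBackoff;
    this.backoffMultiplier = backoffMultiplier;
    this.abortPredicate = abortPredicate;
  }

  public int getMaxAttempts() {
    return maxAttempts;
  }

  public Duration getInitialBackoff() {
    return initialBackoff;
  }

  public Duration getMaxBackoff() {
    return maxBackoff;
  }

  public double getBackoffMultiplier() {
    return backoffMultiplier;
  }

  public Predicate<Throwable> getAbortPredicate() {
    return abortPredicate;
  }
}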

@anuraaga
Contributor

anuraaga commented Sep 22, 2021

From what I can tell, gRPC retry does have jitter

The initial retry attempt will occur at random(0, initialBackoff). In general, the n-th attempt will occur at random(0, min(initialBackoff*backoffMultiplier**(n-1), maxBackoff)).

Isn't the random in these equations the jitter? (Edit: Ah, yeah these equations look the same as FullJitter from that link @jack-berg )

Definitely want to stick to gRPC retry (which is enabled by default in gRPC 1.40, which I think happens with OTel 1.7) if possible; the features built into the gRPC protocol itself, like server-side configuration, seem like things we'd want to benefit from rather than throw away.

We do need something for OkHttp - OkHttp seems to automatically handle the Retry-After header, which is also a cool one for giving control to the server. Is it worth bringing up to the spec the idea of putting the retry computation in the collector instead of the client?

@anuraaga
Contributor

On second thought, Retry-After doesn't help if we're not able to actually get a response back, so scratch that. Since OkHttp interceptors are synchronous, though, I think implementing a simple interceptor with the retry logic embedded shouldn't be as hard as it would be for async.
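A rough sketch of the kind of synchronous interceptor being described, assuming OkHttp and placeholder backoff numbers (not values from the spec):

import java.io.IOException;
import java.util.Random;
import okhttp3.Interceptor;
import okhttp3.Response;

final class RetryInterceptor implements Interceptor {
  private static final int MAX_ATTEMPTS = 5;
  private final Random random = new Random();

  @Override
  public Response intercept(Interceptor.Chain chain) throws IOException {
    long backoffMillis = 1000;
    Response response = null;
    IOException lastException = null;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      if (attempt > 1) {
        try {
          // Exponential backoff with full jitter between attempts.
          Thread.sleep((long) (random.nextDouble() * backoffMillis));
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted while waiting to retry", e);
        }
        backoffMillis = Math.min(backoffMillis * 2, 30_000);
      }
      try {
        if (response != null) {
          // Release the previous attempt's response before retrying.
          response.close();
        }
        response = chain.proceed(chain.request());
        // Only retry status codes that look like transient server-side problems.
        int code = response.code();
        if (code != 429 && code != 502 && code != 503 && code != 504) {
          return response;
        }
        lastException = null;
      } catch (IOException e) {
        lastException = e;
      }
    }
    if (lastException != null) {
      throw lastException;
    }
    return response;
  }
}

(Application interceptors are allowed to call chain.proceed() more than once, provided earlier response bodies are closed first.)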

I think it could be good to add clarification to the spec:

  • Use default gRPC retry for gRPC exporters
  • Provide the backoff parameters for HTTP exporters, and don't make this configurable for simplicity

@jack-berg
Member Author

From what I can tell, gRPC retry does have jitter

That's a good catch. I guess I expected a configuration property that specified the jitter factor, but selecting a random backoff within a range definitely qualifies.

is it worth bringing up the idea of putting the retry computation in the collector instead of the client to the spec

The collector already supports OTLP retry. I suppose we could ask whether retry support in SDKs is really necessary given the support in the collector? But I think we should have support in the SDKs. It's a substantial improvement to data durability if someone is running without the collector, and even if the collector is running, retry can help prevent data loss due to transient network errors between a Java application and a collector in the same environment.

Definitely want to stick to gRPC retry (which is enabled by default in gRPC 1.40 which I think happens with OTel 1.7) if possible

I believe that while retries are "enabled" by default in gRPC 1.40, you'd still have to specify a service config defining the policy via ManagedChannelBuilder#defaultServiceConfig(Map<String, ?>) to actually get retries (or configure the gRPC server to pass back the config).
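For reference, enabling the built-in retry through a default service config looks roughly like this; the policy values and the TraceService method name are just an example, not proposed defaults:

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class GrpcRetryConfigExample {
  static ManagedChannel buildChannel() {
    // Retry policy expressed in gRPC's service config format; values are illustrative.
    Map<String, Object> retryPolicy = new HashMap<>();
    retryPolicy.put("maxAttempts", 5.0);
    retryPolicy.put("initialBackoff", "1s");
    retryPolicy.put("maxBackoff", "30s");
    retryPolicy.put("backoffMultiplier", 2.0);
    retryPolicy.put("retryableStatusCodes", Arrays.asList("UNAVAILABLE"));

    Map<String, Object> methodConfig = new HashMap<>();
    // Apply the policy to all methods of the OTLP trace service.
    methodConfig.put(
        "name",
        Collections.singletonList(
            Collections.singletonMap(
                "service", "opentelemetry.proto.collector.trace.v1.TraceService")));
    methodConfig.put("retryPolicy", retryPolicy);

    return ManagedChannelBuilder.forAddress("localhost", 4317)
        .defaultServiceConfig(
            Collections.singletonMap("methodConfig", Collections.singletonList(methodConfig)))
        .enableRetry()
        .build();
  }
}

(The explicit enableRetry() call is only needed on gRPC versions where retries aren't already enabled by default.)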

I agree that it would be nice to stick to gRPC retry, but given the need for a retry for http/protobuf I also think that a consistent mechanism between the protocols is advantageous.

I've pushed up a commit that has a RetryExecutor which only depends on Guava (for the ListenableFuture stuff), and which is used by both the grpc and http/protobuf exporters. This type of strategy has an advantage over utilizing the built-in gRPC retry mechanism in that new retry policies can easily be added.

Whether or not we use the built-in gRPC retry mechanism, I do think it would be good to define a data class describing an exponential backoff policy. The parameters of a retry policy are unlikely to be one-size-fits-all, I think.
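To make the parameter interaction concrete, here's a tiny demo of how the jittered delay bounds would grow under one such policy (the numbers are placeholders, not suggested defaults):

import java.time.Duration;

class BackoffDemo {
  public static void main(String[] args) {
    Duration initialBackoff = Duration.ofSeconds(1);
    Duration maxBackoff = Duration.ofSeconds(10);
    double backoffMultiplier = 2.0;
    for (int retry = 1; retry <= 5; retry++) {
      // The n-th retry sleeps a random duration in [0, min(initial * multiplier^(n-1), max)),
      // matching the formula quoted from the gRPC docs above.
      double upperBoundSeconds =
          Math.min(
              initialBackoff.toMillis() / 1000.0 * Math.pow(backoffMultiplier, retry - 1),
              maxBackoff.toMillis() / 1000.0);
      System.out.printf("retry %d: sleep random(0, %.1fs)%n", retry, upperBoundSeconds);
    }
    // Prints bounds of 1.0s, 2.0s, 4.0s, 8.0s, and 10.0s (capped by maxBackoff).
  }
}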

@anuraaga
Contributor

anuraaga commented Sep 23, 2021

I think we should step back on whether we use a common mechanism for retry.

If we only had the grpc exporter, would we be using gRPC retry? I'm pretty sure the answer is yes - it satisfies the requirements and has some extra power-user features that may as well be unlocked. We wouldn't really consider any other approach - and I don't think the presence of the http exporter changes this that much, especially given how it's basically free code-wise to use the gRPC mechanism.

We can say something similar for http - if it were the only exporter, would we use a ListenableFuture-based solution? Nah, OkHttp requests run synchronously on a thread pool, and we assuredly would have used a simple interceptor rather than mangling it into futures; it's much more complicated that way. And we definitely want to avoid adding the large Guava dependency to it - one of the main goals of that exporter is to be as small as possible.

I think exposing a RetryPolicy for configurability could make sense - but not without adding it to the spec. The spec is very lightweight on the topic right now. We can add retry with hard-coded parameters under the current spec, and it probably makes sense to do so - but we should wait for a spec update with more details on configuration knobs before adding an API for a policy. I suspect such a spec update would also include a point that gRPC exporters should use gRPC retry.

@jack-berg
Member Author

I think I agree with you @anuraaga. Additionally, we likely don't need to be concerned with supporting retry policies not supported by gRPC, since it seems improbable that the spec would ever require such a thing.

I'll try to get the spec to provide more details on this issue. Specifically, I think the spec should:

  1. Specify the ways that exponential retry can be configured.
  2. Specify default values for those parameters. This would allow SDKs to implement retry and enable it by default (as the spec suggests is required) without the risk of behavioral changes if the spec were to later add those defaults.

@jack-berg
Member Author

Also, I may have been mistaken (or at least misleading) about the collector supporting retries. If I'm reading this correctly, then the collector will only retry if the status code is retryable AND the server specified the retry delay via RetryInfo. So it doesn't seem like exponential backoff with jitter is implemented. I may be missing something, though.

@anuraaga
Contributor

anuraaga commented Sep 23, 2021

Thanks @jack-berg, sounds good. Also, to clarify: the collector retry I mentioned was probably confusing - I meant having the collector tell the client when it should retry via a response header (not retry from the collector to the backend). But it's probably not general enough to replace the client config.

But for the HTTP exporter we probably do need to consider whether the spec should also mandate respecting HTTP Retry-After or rate-limit headers if they happen to be present - I guess the answer is yes.

@jack-berg
Member Author

Proposing PR #3791 as an alternative.

jack-berg closed this Oct 26, 2021