
OTLP retry #3636

Closed

Conversation

jack-berg
Member

There's an outstanding issue related to supporting this in Java, but the spec is unambiguous about the need for a retry strategy in SDKs:

Transient errors MUST be handled with a retry strategy. This retry strategy MUST implement an exponential back-off with jitter to avoid overwhelming the destination until the network is restored or the destination has recovered.

While I think the spec could use clarity about the default retry strategy, there's enough prior art that we should be able to at least provide a mechanism for users to configure a retry strategy if they choose.

I put together this branch to demonstrate what a retry implementation might look like. Along the way I learned:

  • gRPC's built-in retry functionality is insufficient. It doesn't support jitter, which is required by the spec. And it wouldn't be applicable to our otlp/http exporters.
  • It's not clear we can use an off-the-shelf retry library. This branch uses Failsafe, which is great, but I learned it's littered with usages of CompletableFuture (which I learned today is not available in the version of Android we target). Another popular library called resilience4j also uses CompletableFuture. There's also guava-retrying and retry4j, which haven't been updated in a couple of years.

Wanted to propose this to help the discussion move forward.

@jkwatson
Contributor

If the core gRPC retries don't meet the requirements of the spec, I think we should go back to the spec and request a re-think.

@jack-berg
Member Author

The thing is, the jitter functionality the spec dictates, but which gRPC doesn't support out of the box, is really useful for reducing contention.

Also, if we're going to need retry for otlp/http anyway, it's not such a bad thing to use a different retry mechanism, because otlp/grpc and otlp/http can then be consistent.

I think it would be pretty straightforward to build a simple retry mechanism. It could live in a :exporters:otlp:common *.internal.* package. It could be built with Future instead of CompletableFuture, and have no external dependencies.
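To make that concrete, here is a minimal sketch of what such a zero-dependency helper could look like, assuming a blocking retry loop with exponential backoff and full jitter; the class and parameter names are illustrative, not the branch's actual code. An async variant could instead schedule attempts on a ScheduledExecutorService and hand back a plain Future.

import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

final class SimpleRetrier {
  private static final Random RANDOM = new Random();

  private SimpleRetrier() {}

  // Calls the action up to maxAttempts times, sleeping a random duration in
  // [0, backoff) between attempts (full jitter), with the backoff window growing
  // by multiplier up to maxBackoffMillis.
  static <T> T callWithRetry(
      Callable<T> action,
      int maxAttempts,
      long initialBackoffMillis,
      long maxBackoffMillis,
      double multiplier)
      throws Exception {
    long backoff = initialBackoffMillis;
    Exception lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        lastFailure = e;
        if (attempt == maxAttempts) {
          break;
        }
        TimeUnit.MILLISECONDS.sleep((long) (RANDOM.nextDouble() * backoff));
        backoff = Math.min((long) (backoff * multiplier), maxBackoffMillis);
      }
    }
    throw lastFailure;
  }
}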

@jkwatson
Contributor

It's a huge bummer that we can't use CompletableFuture, agreed. Perhaps this feature can be unsupported on Android, or we can provide a less sophisticated option for Android, maybe?

import io.opentelemetry.exporter.otlp.internal.ExponentialBackoffConfig;
import java.time.Duration;

@SuppressWarnings("InterfaceWithOnlyStatics")

heh. Let's not make it an interface, but a utility class. :)

@jkwatson
Contributor

I'm a little leery of adding a mandatory dependency into the OTLP exporter modules. What if we define our own Retryer interface and provide a range of implementations in extension modules?
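As a sketch of that idea, the interface could be as small as the following (a hypothetical shape, not a concrete API proposal); extension modules could then ship, say, a gRPC-native implementation alongside a simple built-in one.

import java.util.concurrent.Callable;

public interface Retryer {
  // Invokes the action, retrying according to this Retryer's policy,
  // and returns the result of the first successful attempt.
  <T> T execute(Callable<T> action) throws Exception;
}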

@jack-berg
Member Author

It's a huge bummer that we can't use CompletableFuture, agreed. Perhaps this feature can be unsupported on Android, or we can provide a less sophisticated option for Android, maybe?

I'm a little leery of adding a mandatory dependency into the OTLP exporter modules.

Agree that it's a bummer, and also don't like the idea of adding another dependency to the exporter modules.

I've been doing some prototyping to see what it would take to have a zero-dependency, minimalist retry executor. My goal is to have a contract that's something like:

public class RetryExecutor {
  public <T> ListenableFuture<T> submit(Function<RetryContext, ListenableFuture<T>> supplier);
}

Where the supplier is called one or more times to obtain a future result based on the rules of a RetryConfig, and aborts are handled according to a Predicate<Throwable>.

It's actually not bad. Pretty easy to understand and not too much code. Failsafe and these other libraries are great, but they have a much wider scope than what we need. I think we can definitely get away with a custom solution.
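As a rough illustration of what those "rules of a RetryConfig" might contain (a hypothetical shape, not the prototype's actual class):

import java.time.Duration;
import java.util.function.Predicate;

public final class RetryConfig {
  // Maximum number of times the supplier will be invoked, including the first attempt.
  private final int maxAttempts;
  // Backoff window for the first retry; later windows grow by backoffMultiplier up to maxBackoff.
  private final Duration initialBackoff;
  private final Duration maxBackoff;
  private final double backoffMultiplier;
  // Throwables matching this predicate cause the retry loop to abort immediately.
  private final Predicate<Throwable> abortPredicate;

  public RetryConfig(
      int maxAttempts,
      Duration initialBackoff,
      Duration maxBackoff,
      double backoffMultiplier,
      Predicate<Throwable> abortPredicate) {
    this.maxAttempts = maxAttempts;
    this.initialBackoff = initialBackoff;
    this.maxBackoff = maxBackoff;
    this.backoffMultiplier = backoffMultiplier;
    this.abortPredicate = abortPredicate;
  }

  public int getMaxAttempts() {
    return maxAttempts;
  }

  public Duration getInitialBackoff() {
    return initialBackoff;
  }

  public Duration getMaxBackoff() {
    return maxBackoff;
  }

  public double getBackoffMultiplier() {
    return backoffMultiplier;
  }

  public Predicate<Throwable> getAbortPredicate() {
    return abortPredicate;
  }
}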

@anuraaga
Contributor

anuraaga commented Sep 22, 2021

From what I can tell, gRPC retry does have jitter

The initial retry attempt will occur at random(0, initialBackoff). In general, the n-th attempt will occur at random(0, min(initialBackoff*backoffMultiplier**(n-1), maxBackoff)).

Isn't the random in these equations the jitter? (Edit: Ah, yeah these equations look the same as FullJitter from that link @jack-berg )

Definitely want to stick to gRPC retry (which is enabled by default in gRPC 1.40, which I think happens with OTel 1.7) if possible; the features built into the gRPC protocol itself, like server-side configuration, seem like things we'd want to benefit from rather than throw away.

We do need something for OkHttp - OkHttp seems to automatically handle the Retry-After header, which is also a cool one for giving control to the server. Is it worth bringing up to the spec the idea of putting the retry computation in the collector instead of the client?

@anuraaga
Contributor

On second thought, Retry-After doesn't help if we're not able to actually get a response back, so scratch that. Since OkHttp interceptors are synchronous, though, I think implementing a simple interceptor with the retry logic embedded shouldn't be as hard as it would be for async.
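A rough sketch of the kind of synchronous interceptor being described, assuming OkHttp and placeholder backoff numbers (not values from the spec):

import java.io.IOException;
import java.util.Random;
import okhttp3.Interceptor;
import okhttp3.Response;

final class RetryInterceptor implements Interceptor {
  private static final int MAX_ATTEMPTS = 5;
  private final Random random = new Random();

  @Override
  public Response intercept(Interceptor.Chain chain) throws IOException {
    long backoffMillis = 1000;
    Response response = null;
    IOException lastException = null;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      if (attempt > 1) {
        try {
          // Exponential backoff with full jitter between attempts.
          Thread.sleep((long) (random.nextDouble() * backoffMillis));
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted while waiting to retry", e);
        }
        backoffMillis = Math.min(backoffMillis * 2, 30_000);
      }
      try {
        if (response != null) {
          // Release the previous attempt's response before retrying.
          response.close();
        }
        response = chain.proceed(chain.request());
        // Only retry status codes that look like transient server-side problems.
        int code = response.code();
        if (code != 429 && code != 502 && code != 503 && code != 504) {
          return response;
        }
        lastException = null;
      } catch (IOException e) {
        lastException = e;
      }
    }
    if (lastException != null) {
      throw lastException;
    }
    return response;
  }
}

(Application interceptors are allowed to call chain.proceed() more than once, provided earlier response bodies are closed first.)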

I think it could be good to add clarification to the spec:

  • Use default gRPC retry for gRPC exporters
  • Provide the backoff parameters for HTTP exporters, and don't make this configurable for simplicity

@jack-berg
Member Author

From what I can tell, gRPC retry does have jitter

That's a good catch. I guess I expected a configuration property that specified the jitter factor, but selecting a random backoff within a range definitely qualifies.

is it worth bringing up the idea of putting the retry computation in the collector instead of the client to the spec

The collector already supports OTLP retry. I suppose we could ask whether retry support in SDKs is really necessary given the support in the collector? But I think we should have support in the SDKs. It's a substantial improvement to data durability if someone is running without the collector, and even if the collector is running, retry can help prevent data loss due to transient network errors between a Java application and a collector in the same environment.

Definitely want to stick to gRPC retry (which is enabled by default in gRPC 1.40 which I think happens with OTel 1.7) if possible

I believe that while retries are "enabled" by default in gRPC 1.40, you'd still have to specify a service config defining the policy via ManagedChannelBuilder#defaultServiceConfig(Map<String, ?>) to actually get retries (or configure the gRPC server to pass back the config).
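For reference, enabling the built-in retry through a default service config looks roughly like this; the policy values and the TraceService method name are just an example, not proposed defaults:

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class GrpcRetryConfigExample {
  static ManagedChannel buildChannel() {
    // Retry policy expressed in gRPC's service config format; values are illustrative.
    Map<String, Object> retryPolicy = new HashMap<>();
    retryPolicy.put("maxAttempts", 5.0);
    retryPolicy.put("initialBackoff", "1s");
    retryPolicy.put("maxBackoff", "30s");
    retryPolicy.put("backoffMultiplier", 2.0);
    retryPolicy.put("retryableStatusCodes", Arrays.asList("UNAVAILABLE"));

    Map<String, Object> methodConfig = new HashMap<>();
    // Apply the policy to all methods of the OTLP trace service.
    methodConfig.put(
        "name",
        Collections.singletonList(
            Collections.singletonMap(
                "service", "opentelemetry.proto.collector.trace.v1.TraceService")));
    methodConfig.put("retryPolicy", retryPolicy);

    return ManagedChannelBuilder.forAddress("localhost", 4317)
        .defaultServiceConfig(
            Collections.singletonMap("methodConfig", Collections.singletonList(methodConfig)))
        .enableRetry()
        .build();
  }
}

(The explicit enableRetry() call is only needed on gRPC versions where retries aren't already enabled by default.)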

I agree that it would be nice to stick to gRPC retry, but given the need for a retry for http/protobuf I also think that a consistent mechanism between the protocols is advantageous.

I've pushed up a commit that has a RetryExecutor which only depends on Guava (for the ListenableFuture stuff), and which is used by both the grpc and http/protobuf exporters. This type of strategy has an advantage over utilizing the built-in gRPC retry mechanism in that new retry policies can easily be added.

Whether or not we use the built-in gRPC retry mechanism, I do think it would be good to define a data class describing an exponential backoff policy. The parameters of a retry policy are unlikely to be one-size-fits-all, I think.
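To make the parameter interaction concrete, here's a tiny demo of how the jittered delay bounds would grow under one such policy (the numbers are placeholders, not suggested defaults):

import java.time.Duration;

class BackoffDemo {
  public static void main(String[] args) {
    Duration initialBackoff = Duration.ofSeconds(1);
    Duration maxBackoff = Duration.ofSeconds(10);
    double backoffMultiplier = 2.0;
    for (int retry = 1; retry <= 5; retry++) {
      // The n-th retry sleeps a random duration in [0, min(initial * multiplier^(n-1), max)),
      // matching the formula quoted from the gRPC docs above.
      double upperBoundSeconds =
          Math.min(
              initialBackoff.toMillis() / 1000.0 * Math.pow(backoffMultiplier, retry - 1),
              maxBackoff.toMillis() / 1000.0);
      System.out.printf("retry %d: sleep random(0, %.1fs)%n", retry, upperBoundSeconds);
    }
    // Prints bounds of 1.0s, 2.0s, 4.0s, 8.0s, and 10.0s (capped by maxBackoff).
  }
}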

@anuraaga
Contributor

anuraaga commented Sep 23, 2021

I think we should step back on whether we use a common mechanism for retry.

If we only had the grpc exporter, would we be using gRPC retry? I'm pretty sure the answer is yes - it satisfies the requirements and has some extra power-user features that may as well be unlocked. We wouldn't really consider any other approach - and I don't think the presence of the http exporter changes this that much, especially given how it's basically free code-wise to use the gRPC mechanism.

We can say something similar for http - if it were the only exporter, would we use a ListenableFuture-based solution? Nah, OkHttp requests run synchronously on a thread pool, and we assuredly would have used a simple interceptor rather than mangling it into futures; it's much more complicated that way. And we definitely want to avoid adding the large Guava dependency to it - one of the main goals of that exporter is to be as small as possible.

I think exposing a RetryPolicy for configurability could make sense - but not without adding it to the spec. The spec is very lightweight on the topic right now. We can add retry with hard-coded parameters under the current spec, and it probably makes sense to do so - but we should wait for a spec update with more details on configuration knobs before adding an API for a policy. I suspect such a spec update would also include a point that gRPC exporters should use gRPC retry.

@jack-berg
Member Author

I think I agree with you @anuraaga. Additionally, we likely don't need to be concerned with supporting retry policies not supported by gRPC, since it seems improbable that the spec would ever require such a thing.

I'll try to get the spec to provide more details on this issue. Specifically, I think the spec should:

  1. Specify the ways that exponential retry can be configured.
  2. Specify default values for those parameters. This would allow SDKs to implement retry and enable it by default (as the spec suggests is required) without the risk of behavioral changes if the spec were to later add those defaults.

@jack-berg
Member Author

Also, I may have been mistaken (or at least misleading) about the collector supporting retries. If I'm reading this correctly, then the collector will only retry if the status code is retryable AND the server specified the retry delay via RetryInfo. So it doesn't seem like exponential backoff with jitter is implemented. I may be missing something, though.

@anuraaga
Contributor

anuraaga commented Sep 23, 2021

Thanks @jack-berg, sounds good. Also, to clarify: the collector retry I mentioned was probably confusing - I meant having the collector tell the client when it should retry via a response header (not retry from the collector to the backend). But it's probably not general enough to replace the client config.

But for the HTTP exporter we probably do need to consider whether the spec should also mandate respecting HTTP Retry-After or rate-limit headers if they happen to be present - I guess the answer is yes.

@jack-berg
Member Author

Proposing PR #3791 as an alternative.

jack-berg closed this Oct 26, 2021