OTLP retry #3636
Conversation
If the core grpc retries don't meet the requirements of the spec, I think we should go back to the spec and request a re-think.
The thing is, the jitter functionality the spec dictates, but which is unsupported out of the box by grpc, is really useful for reducing contention. Also, if we're going to need retry for the OkHttp exporter anyway, I think it would be pretty straightforward to build a simple retry mechanism. It could live in a shared module.
It's a huge bummer that we can't use an existing retry library like Failsafe here.
```java
import io.opentelemetry.exporter.otlp.internal.ExponentialBackoffConfig;
import java.time.Duration;

@SuppressWarnings("InterfaceWithOnlyStatics")
```
heh. Let's not make it an interface, but a utility class. :)
I'm a little leery of adding a mandatory dependency into the OTLP exporter modules. What if we define our own `Retryer` interface and provide a range of implementations in extension modules?
Agree that it's a bummer, and also don't like the idea of adding another dependency to the exporter modules. I've been doing some prototyping to see what it would take to have a zero-dependency, minimalist retry executor. My goal is to have a contract that's something like: take a task and a backoff policy, retry the task according to the policy, and report the eventual outcome.

It's actually not bad. Pretty easy to understand and not too much code. Failsafe and these other libraries are great, but have a much wider scope than what we need. I think we can definitely get away with a custom solution.
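A zero-dependency contract like the one described might look something like the following sketch. All class and parameter names here are hypothetical, not taken from the actual prototype:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical minimal retry helper: retries a task on exception using
// exponential backoff with full jitter. Illustrative only.
final class SimpleRetrier {
  private final int maxAttempts;
  private final long maxBackoffMillis;
  private final long initialBackoffMillis;
  private final double backoffMultiplier;

  SimpleRetrier(int maxAttempts, long initialBackoffMillis, long maxBackoffMillis, double backoffMultiplier) {
    this.maxAttempts = maxAttempts;
    this.initialBackoffMillis = initialBackoffMillis;
    this.maxBackoffMillis = maxBackoffMillis;
    this.backoffMultiplier = backoffMultiplier;
  }

  // Calls the task, retrying on exception; rethrows the last failure.
  <T> T call(Callable<T> task) throws Exception {
    long backoff = initialBackoffMillis;
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return task.call();
      } catch (Exception e) {
        last = e;
        if (attempt == maxAttempts) {
          break;
        }
        // Full jitter: sleep a random duration in [0, backoff].
        Thread.sleep(ThreadLocalRandom.current().nextLong(backoff + 1));
        backoff = Math.min((long) (backoff * backoffMultiplier), maxBackoffMillis);
      }
    }
    throw last;
  }
}
```

The whole thing fits in one small class, which is roughly the point being made about a custom solution.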
From what I can tell, gRPC retry does have jitter.
Isn't the randomized backoff that gRPC's retry policy selects within a range effectively jitter?

Definitely want to stick to gRPC retry (which is enabled by default in gRPC 1.40, which I think happens with OTel 1.7) if possible; the features built into the gRPC protocol itself, like server-side configuration, seem like things we'd like to benefit from rather than throw away. We do need something for OkHttp. OkHttp seems to automatically handle the `Retry-After` header.
On second thought, `Retry-After` doesn't help if we're not able to actually get a response back, so scratch that. Since OkHttp interceptors are synchronous, though, I think implementing a simple interceptor with the retry logic embedded shouldn't be as hard as for async. I think it could be good to add clarification to the spec.
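For illustration, the interceptor-with-a-loop idea could be sketched like this. The `Chain` and `Response` types below are simplified stand-ins so the example is self-contained; the real okhttp3 `Interceptor` API differs:

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

// Simplified stand-ins for okhttp3's Interceptor.Chain and Response,
// so the retry loop can be shown without the real dependency.
interface Chain {
  Response proceed() throws java.io.IOException;
}

final class Response {
  final int code;

  Response(int code) {
    this.code = code;
  }
}

final class RetryInterceptor {
  // HTTP status codes treated as retryable; illustrative, not spec-mandated.
  private static final Set<Integer> RETRYABLE = Set.of(429, 502, 503, 504);

  private final int maxAttempts;
  private final long initialBackoffMillis;

  RetryInterceptor(int maxAttempts, long initialBackoffMillis) {
    this.maxAttempts = maxAttempts;
    this.initialBackoffMillis = initialBackoffMillis;
  }

  // Synchronous retry loop: re-issue the request with jittered exponential
  // backoff while the response code is retryable and attempts remain.
  Response intercept(Chain chain) throws java.io.IOException, InterruptedException {
    long backoff = initialBackoffMillis;
    Response response = chain.proceed();
    for (int attempt = 1; attempt < maxAttempts && RETRYABLE.contains(response.code); attempt++) {
      Thread.sleep(ThreadLocalRandom.current().nextLong(backoff + 1));
      backoff *= 2;
      response = chain.proceed();
    }
    return response;
  }
}
```

Because the interceptor runs on the request thread, sleeping between attempts is fine; the same logic for an async client would need a scheduler instead.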
That's a good catch. I guess I expected a configuration property that specified the jitter factor, but selecting a random backoff within a range definitely qualifies.
The collector already supports OTLP retry. I suppose we could ask if retry support in SDKs is really necessary given support in the collector? But I think we should have support in the SDKs. It's a substantial improvement to data durability if someone is running without the collector, and even if the collector is running, retry can help prevent data loss due to transient network errors between a Java application and a collector in the same environment.
I believe that while retries are "enabled" by default in gRPC 1.40, you'd still have to specify a service config defining the policy via `defaultServiceConfig`.

I agree that it would be nice to stick to gRPC retry, but given the need for a retry mechanism for the OkHttp exporter anyway, a common solution is worth considering. I've pushed up a commit that has a prototype.

Whether or not we use the built-in grpc retry mechanism, I do think it would be good to define a data class describing an exponential backoff policy. The parameters of a retry policy are unlikely to be one-size-fits-all, I think.
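For reference, enabling gRPC's built-in retry means supplying a service config shaped like the JSON that gRPC documents. A sketch of building that map follows; the attempt counts, backoff values, and status codes are illustrative, not spec-mandated defaults:

```java
import java.util.List;
import java.util.Map;

// Builds a gRPC service-config map enabling retries for the OTLP trace
// service. Mirrors the structure of gRPC's service config JSON.
final class GrpcRetryConfig {
  static Map<String, ?> serviceConfig() {
    Map<String, ?> retryPolicy = Map.of(
        "maxAttempts", 5.0, // numeric JSON values are represented as doubles
        "initialBackoff", "0.1s",
        "maxBackoff", "5s",
        "backoffMultiplier", 2.0,
        "retryableStatusCodes", List.of("UNAVAILABLE"));
    Map<String, ?> methodConfig = Map.of(
        "name", List.of(Map.of("service", "opentelemetry.proto.collector.trace.v1.TraceService")),
        "retryPolicy", retryPolicy);
    return Map.of("methodConfig", List.of(methodConfig));
  }
}
```

Such a map would then be passed to `ManagedChannelBuilder.defaultServiceConfig(...)` together with `enableRetry()`.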
I think we should step back on whether we use a common mechanism for retry. If we only had the grpc exporter, would we be using grpc retry? I'm pretty sure the answer is yes: it satisfies the requirements and has some extra power-user features that may as well be unlocked. We wouldn't consider any other approach really, and I don't think the presence of the http exporter changes this that much, especially given how it's basically free code-wise to use the grpc mechanism.

We can say similar for http: if it was the only exporter, would we use a listenable-future-based solution? Nah, okhttp requests run synchronously on a threadpool, and we assuredly would have used a simple interceptor rather than mangling it into futures; it's much more complicated that way. And we definitely want to avoid adding the large guava dependency to it; one of the main goals of that exporter is to be as small as possible.

I think exposing a RetryPolicy for configurability could make sense, but not without adding it to the spec. The spec is very lightweight on the topic right now. We can add retry with hard-coded parameters under the current spec, and it probably makes sense to do so, but we should wait for a spec update with more details on configuration knobs before adding API for a policy. I suspect such a spec update would also include a point that grpc exporters should use grpc retry.
I think I agree with you @anuraaga. Additionally, we likely don't need to be concerned with supporting retry policies not supported by gRPC, since it seems improbable that the spec would ever require such a thing. I'll try to get the spec to provide more details on this issue. Specifically, I think the spec should clarify the default retry policy and which configuration parameters are exposed.
Also, I may have been mistaken (or at least misleading) about the collector supporting retries. If I'm reading this correctly, the collector will only retry if the status code is retryable AND the server specified the retry delay via RetryInfo. So it doesn't seem like exponential backoff w/ jitter is implemented. May be missing something though.
Thanks @jack-berg, sounds good. Also, to clarify: the collector retry I mentioned was probably confusing. I meant having the collector return to the client, in a response header, when it should retry (not retry from the collector to the backend). But it's probably not general enough to replace the client config. We probably do need to consider, for the HTTP exporter, whether the spec should also mandate respecting HTTP `Retry-After` or rate-limit headers if they happen to be present; I guess the answer is yes.
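If the spec does end up mandating `Retry-After` support, the parsing side is small. Here is a minimal sketch handling only the delta-seconds form; the header can also carry an HTTP-date, which this sketch deliberately leaves unhandled:

```java
import java.time.Duration;
import java.util.Optional;

// Sketch of parsing an HTTP Retry-After header value such as "120".
// A real implementation would also need to handle the HTTP-date form.
final class RetryAfter {
  static Optional<Duration> parse(String headerValue) {
    if (headerValue == null) {
      return Optional.empty();
    }
    try {
      long seconds = Long.parseLong(headerValue.trim());
      return seconds >= 0 ? Optional.of(Duration.ofSeconds(seconds)) : Optional.empty();
    } catch (NumberFormatException e) {
      // Possibly an HTTP-date; not handled in this sketch.
      return Optional.empty();
    }
  }
}
```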
Proposing PR #3791 as an alternative.
There's an outstanding issue related to supporting this in Java, but the spec is unambiguous about the need for a retry strategy in SDKs.
While I think the spec could use clarity about the default retry strategy, there's enough prior art that we should be able to at least provide a mechanism for users to configure a retry strategy if they choose.
I put together this branch to demonstrate what a retry implementation might look like. Along the way I learned:

- Failsafe's async retry support is built on `CompletableFuture` (which I learned today is available in the version of Android we target). Another popular library, resilience4j, also uses `CompletableFuture`.
- There's also guava-retrying and retry4j, which haven't been updated in a couple of years.

Wanted to propose this to help the discussion move forward.
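In the spirit of the `CompletableFuture`-based approaches mentioned above, an async retry loop can be sketched with a scheduler and recursion. All names here are illustrative, not taken from the linked branch:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical async retrier: retries a CompletableFuture-returning task,
// scheduling each subsequent attempt on a ScheduledExecutorService.
final class AsyncRetrier {
  private final ScheduledExecutorService scheduler;
  private final int maxAttempts;
  private final long backoffMillis; // fixed backoff, for brevity

  AsyncRetrier(ScheduledExecutorService scheduler, int maxAttempts, long backoffMillis) {
    this.scheduler = scheduler;
    this.maxAttempts = maxAttempts;
    this.backoffMillis = backoffMillis;
  }

  <T> CompletableFuture<T> retry(Supplier<CompletableFuture<T>> task) {
    CompletableFuture<T> result = new CompletableFuture<>();
    attempt(task, 1, result);
    return result;
  }

  private <T> void attempt(Supplier<CompletableFuture<T>> task, int attempt, CompletableFuture<T> result) {
    task.get().whenComplete((value, error) -> {
      if (error == null) {
        result.complete(value);
      } else if (attempt >= maxAttempts) {
        result.completeExceptionally(error);
      } else {
        // Schedule the next attempt instead of blocking a thread.
        scheduler.schedule(
            () -> attempt(task, attempt + 1, result), backoffMillis, TimeUnit.MILLISECONDS);
      }
    });
  }
}
```

The caller gets back a single `CompletableFuture` that completes only when an attempt succeeds or the attempt budget is exhausted, which is roughly the contract Failsafe and resilience4j expose.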