Implement OTLP retry mechanism #4616
Conversation
// TODO: Consider introducing a fixed max number of retries (e.g. max 5 retries).
// The spec does not specify a max number of retries, but it may be bad to not cap the number of attempts.
// Without a max number of retry attempts, retries would continue until the deadline.
// Maybe this is ok? However, it may lead to an unexpected behavioral change. For example:
// 1) When using a batch processor, a longer delay due to repeated
//    retries up to the deadline may lead to a higher chance that the queue will be exhausted.
// 2) When using the simple processor, a longer delay due to repeated
//    retries up to the deadline will lead to a prolonged blocking call.
// if (attemptCount >= MaxAttempts)
// {
//     return false;
// }
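A minimal sketch of what such a cap might look like, layered on top of the deadline check (MaxAttempts and the method shape here are assumptions for illustration, not code from this PR):

private const int MaxAttempts = 5;

private static bool ShouldRetry(int attemptCount, DateTime deadline)
{
    // Stop once the hypothetical cap is reached, even if time remains
    // before the deadline.
    if (attemptCount >= MaxAttempts)
    {
        return false;
    }

    // Otherwise keep retrying only while the export deadline has not passed.
    return DateTime.UtcNow < deadline;
}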
I'm inclined to set a maximum cap of 5 retries, but I'd like to get other people's thoughts.
5 sounds good. For 1), users can adjust the queue length if they frequently see dropped items. For 2), there is no mitigation given the retries are not configurable, correct?
> 5 sounds good.
Yes, I agree. I plan to leave this as a follow-up for now and put it in place when I actually wire this up with the exporter.
Codecov Report
@@            Coverage Diff             @@
##             main    #4616      +/-   ##
==========================================
+ Coverage   83.91%   83.94%   +0.02%
==========================================
  Files         292      293       +1
  Lines       11908    11966      +58
==========================================
+ Hits         9993    10045      +52
- Misses       1915     1921       +6
Having this on by default for the Http exporter may impact certain users who have a custom HttpClient set up with their own retry mechanism. I have not seen this, but it is something we might run into, so bringing it up.
private static bool IsGrpcStatusCodeRetryable(StatusCode statusCode, bool hasRetryDelay)
{
    switch (statusCode)
@alanwest Is it possible to do this custom retry for Unavailable and ResourceExhausted only? And the rest with https://learn.microsoft.com/en-us/aspnet/core/grpc/retries?view=aspnetcore-7.0#configure-a-grpc-retry-policy
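For reference, the built-in policy from the linked docs is configured on the channel like this. This is a sketch of the standard Grpc.Net.Client API, separate from the OTLP-spec retry implemented in this PR; the endpoint and values are placeholders:

using System;
using Grpc.Core;
using Grpc.Net.Client;
using Grpc.Net.Client.Configuration;

var channel = GrpcChannel.ForAddress("http://localhost:4317", new GrpcChannelOptions
{
    ServiceConfig = new ServiceConfig
    {
        MethodConfigs =
        {
            new MethodConfig
            {
                Names = { MethodName.Default },
                RetryPolicy = new RetryPolicy
                {
                    MaxAttempts = 5,
                    InitialBackoff = TimeSpan.FromSeconds(1),
                    MaxBackoff = TimeSpan.FromSeconds(5),
                    BackoffMultiplier = 1.5,
                    // Retries inside the client library would only cover these codes.
                    RetryableStatusCodes = { StatusCode.Unavailable, StatusCode.ResourceExhausted },
                },
            },
        },
    },
});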
Probably not, especially if we introduce a max number of retries. There's no way to know how many attempts the client library made vs. our special handling for unavailable/resource exhausted.
Though, even if it were possible, I'd be hard pressed to say the complexity would be buying us anything. I think it's pretty nice that gRPC and HTTP use the exact same retry logic.
I'd like to think on this a bit more to gauge how concerned we need to be about this. Worst case, we could disable retry if someone is using the HttpClientFactory option - though I think this would be a bit of a bummer. This is one reason I think I'd like to hold off on introducing support for the HttpClientFactory for the gRPC exporter. That is, I'd like to land retry first and then consider adding the HttpClientFactory support proposed in #4625.
Would it be better if retry were not enabled by default but offered as an option?
I want to understand the direction here. The above code snippet has two parts that are concerning:
What's breaking?
Looked into this a bit more. Just sharing some points for consideration: I think we should consider this as an opt-in feature, or on by default with the ability to turn it off. More so for http/protobuf, as we allow a custom HttpClient. a) Retrying with this approach has some overhead involved in recreating request messages for each retry during failures.
The call to Export is a blocking call. Task.Delay could be used, but probably to the same effect. I studied the retry mechanism for Grpc.Net.Client extensively to guide this implementation. It uses Task.Delay because the delay is made from an asynchronous method: https://github.com/grpc/grpc-dotnet/blob/16e01b9369a1c2b1440bbe7f77c1520b11dc6db3/src/Grpc.Net.Client/Internal/Retry/RetryCall.cs#L219
The strategy is guided by the spec. I'm not looking to make anything up here, so I think our goal should be that we mutually understand the specification. Here's my interpretation with regards to your question: If export fails with a retryable status code, then the OTLP exporter should retry the same data. Per the spec, retry attempts will continue up until the configured timeout, which defaults to 10 seconds. If using the batch export processor, this could mean the buffer fills up and new data is dropped. That said, I am proposing one additional requirement not in the spec, and that is a cap of 5 retry attempts - this is not currently implemented in this PR.
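To make that interpretation concrete, a hypothetical sketch of the loop shape being described, in the style of the gRPC retry spec that guided this implementation (tryExport and all constants are illustrative, not the PR's actual code):

using System;
using System.Threading;

static void ExportWithRetry(Func<bool> tryExport)
{
    var nextBackoff = TimeSpan.FromSeconds(1);      // assumed initial backoff
    var maxBackoff = TimeSpan.FromSeconds(5);       // assumed backoff cap
    const double BackoffMultiplier = 1.5;           // assumed multiplier
    var deadline = DateTime.UtcNow.AddSeconds(10);  // spec's default 10s timeout

    while (!tryExport() && DateTime.UtcNow < deadline)
    {
        // Sleep a random (jittered) duration in [0, nextBackoff), then grow
        // the backoff exponentially up to the cap.
        var delayMs = Random.Shared.Next(0, (int)nextBackoff.TotalMilliseconds);
        Thread.Sleep(delayMs);

        var grown = TimeSpan.FromMilliseconds(nextBackoff.TotalMilliseconds * BackoffMultiplier);
        nextBackoff = grown < maxBackoff ? grown : maxBackoff;
    }
}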
Where are these coming from? Is this the one we're trying to follow? https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/exporter.md#retry
I don't see how this section in the spec is related to the code snippet from #4616 (comment). I also couldn't find where "per the spec, retry attempts will continue up until the configured timeout which defaults to 10 seconds. If using the batch export processor this could mean the buffer fills up and new data is dropped" is coming from.
This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.
LGTM, considering this will be enabled via a feature flag, allowing us to clarify things with the spec and gather feedback.
The feature flag can be added as a separate PR while wiring things up.
Files with review threads:
src/OpenTelemetry.Exporter.OpenTelemetryProtocol/Implementation/google/rpc/status.proto
src/OpenTelemetry.Exporter.OpenTelemetryProtocol/Implementation/google/rpc/error_details.proto
src/OpenTelemetry.Exporter.OpenTelemetryProtocol/Implementation/ExportClient/OtlpRetry.cs
// Consider pulling in Random.Shared implementation.
lock (Random)
{
    return Random.Next(min, max);
Do we want to use max + 1? The second argument for Random.Next is the exclusive upper bound of the random number returned.
🤷 Not sure it matters a whole lot. I'm happy to go either way. Just for some additional context, the behavior of the retry logic in this PR is very much inspired by grpc-dotnet's implementation of the gRPC retry specification, including the calculation for random jitter. Using Random.Next(min, max) is equivalent to grpc-dotnet's implementation.
> Not sure it matters a whole lot.

Probably not. It's more of a code readability thing. It might seem a bit odd, as the variable names suggest that the max value could indeed be a possible delayDuration value. I'm okay with the code as-is.
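For context on the TODO in the snippet above: on .NET 6+ the lock can indeed be replaced with Random.Shared, which is a thread-safe shared instance. A sketch of both forms (method names are illustrative):

private static readonly Random Random = new Random();

private static int GetRandomIntWithLock(int min, int max)
{
    // System.Random instances are not thread-safe, hence the lock.
    lock (Random)
    {
        return Random.Next(min, max); // upper bound is exclusive
    }
}

// .NET 6+ alternative: no lock needed.
private static int GetRandomInt(int min, int max) => Random.Shared.Next(min, max);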
    return null;
}

var statusDetails = trailers.Get(GrpcStatusDetailsHeader);
Is every Grpc service expected to set this trailer?
Metadata.Get returns null if the header does not exist.
https://grpc.github.io/grpc/csharp/api/Grpc.Core.Metadata.html#Grpc_Core_Metadata_Get_System_String_
My question was more around the general availability of this header for our use. Is this left to the service implementation? Could we expect OTLP Collector to be setting this header?
> Could we expect OTLP Collector to be setting this header?

Yes, per the spec, any OTLP receiver like the one for the collector should support this header for throttling requests. https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md#otlpgrpc-throttling
Ah, but yea I think I see the confusion... the spec does not make it obvious that it's a header by the name grpc-status-details-bin. This is the header the collector responds with. Similarly, it's the header that New Relic's OTLP ingest responds with.
> This is the header the collector responds with.

I think it would be good to add a code comment about this.
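Building on that, a sketch of such a code comment plus the trailer handling, assuming the vendored google/rpc protos generate the usual Google.Rpc.Status and RetryInfo types (the method name and surrounding shape are illustrative):

using System;
using Google.Rpc;
using Grpc.Core;

// The OTLP spec's throttling signal arrives as a google.rpc.Status message
// serialized into the "grpc-status-details-bin" response trailer. The
// collector (and e.g. New Relic's OTLP ingest) responds with this trailer.
private const string GrpcStatusDetailsHeader = "grpc-status-details-bin";

private static TimeSpan? TryGetServerRetryDelay(Metadata trailers)
{
    // Metadata.Get returns null when the server did not set the trailer.
    var entry = trailers.Get(GrpcStatusDetailsHeader);
    if (entry == null)
    {
        return null;
    }

    var status = Google.Rpc.Status.Parser.ParseFrom(entry.ValueBytes);
    foreach (var detail in status.Details)
    {
        // RetryInfo carries the server-suggested delay before the next attempt.
        if (detail.Is(RetryInfo.Descriptor))
        {
            return detail.Unpack<RetryInfo>().RetryDelay?.ToTimeSpan();
        }
    }

    return null;
}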
#4587 implemented a retry mechanism using Grpc.Net.Client's built-in retry mechanism. However, the requirements of the OTLP specification differ from the standard gRPC retry mechanism.
This PR contains a spec-compliant OTLP retry mechanism. It can be used by both the gRPC and HTTP exporters.
This PR does not yet wire up the exporters to use this retry mechanism. This will be done in a follow-up.
Example usage
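A hypothetical sketch of how an export client might drive the retry helper; the OtlpRetry method name, its signature, and the RetryResult shape below are guesses based on the review threads above, not the PR's confirmed API:

var attemptCount = 0;
var deadline = DateTime.UtcNow.AddSeconds(10); // default export timeout

while (true)
{
    try
    {
        client.Export(request); // placeholder for the gRPC export call
        break; // success
    }
    catch (RpcException ex)
    {
        attemptCount++;

        // Assumed helper: decides retryability from the gRPC status code and
        // trailers (including grpc-status-details-bin), honors the deadline,
        // and computes the next jittered backoff.
        if (!OtlpRetry.TryGetGrpcRetryResult(ex.StatusCode, deadline, ex.Trailers, attemptCount, out var retryResult))
        {
            throw;
        }

        Thread.Sleep(retryResult.RetryDelay);
    }
}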
Follow-up PRs: