-
Notifications
You must be signed in to change notification settings - Fork 775
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Otlp] Fix Http Retry to cover network failure and add tests #5394
[Otlp] Fix Http Retry to cover network failure and add tests #5394
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5394 +/- ##
==========================================
+ Coverage 83.38% 84.76% +1.38%
==========================================
Files 297 282 -15
Lines 12531 12127 -404
==========================================
- Hits 10449 10280 -169
+ Misses 2082 1847 -235
Flags with carried forward coverage won't be shown. Click here to find out more.
|
src/OpenTelemetry.Exporter.OpenTelemetryProtocol/Implementation/ExportClient/OtlpRetry.cs
Outdated
Show resolved
Hide resolved
src/OpenTelemetry.Exporter.OpenTelemetryProtocol/Implementation/ExportClient/OtlpRetry.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
src/OpenTelemetry.Exporter.OpenTelemetryProtocol/Implementation/ExportClient/OtlpRetry.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple nits but LGTM
if (!statusCode.HasValue) | ||
{ | ||
return true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think returning true from this method when HttpStatusCode is null is a good thing to do. I think this method should only be used when we actually have a status code, so the statusCode
parameter should not be nullable.
This is essentially saying: all requests that failed without a status code response (e.g., timeout) are retryable. I do not think this is what we want.
For example, consider the code where we catch failures:
Lines 63 to 68 in 77ef123
catch (HttpRequestException ex) | |
{ | |
OpenTelemetryProtocolExporterEventSource.Log.FailedToReachCollector(this.Endpoint, ex); | |
return new ExportClientHttpResponse(success: false, deadlineUtc: deadline, response: null, exception: ex); | |
} |
An HttpRequestException
can mean many different things. I think we need to be selective and determine which things we deem retryable.
Take a look at the docs for HttpRequestException
.
I think a combination of the InnerException
and maybe the HttpRequestError
property should be used to clue us into whether a request is retryable. For example, if the inner exception is a TimeoutException
then this should be retryable. However, there are a number of values for HttpRequestError
that should probably not be retried.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HttpRequestError
is .NET8.0 specific concept. Before this, there was no way to drill down the HttpRequestException
to the details you see in that enum.
My intention here was to keep the changes to minimum. What I can do is add an additional method in this class to inspect the exception first and fail fast in case of http (assuming there is some way to distinguish a retriable error when status code is not available for other versions). (follow up).
This method is called from within TryGetRetryResult<TStatusCode, TCarrier> and it is possible that in some cases statuscode will be null (for e.g. connection error). So we can return true here then.
what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HttpRequestError is .NET8.0 specific concept.
The InnerException
however predates .NET 8, so it can be used for your purposes.
statuscode will be null (for e.g. connection error)
Yes, correct, but what this PR is currently saying is that when status code is null it means the request is retryable. This is not always correct. Null status code can mean connection error, but does not always mean connection error.
HTTP is more complicated than gRPC because you must use two signals to determine retriability: the status code (when present) or the exception (when status code is not present). If my memory serves me, in the case of a connection error the exception will be a TimeoutException
.
I'd have to play around with things myself to recommend how I'd refactor things, but I think you should give this a shot: my first thought is that it may make sense to pass both the status code and exception (or exception type) to TryGetRetryResult
. This way you can properly decouple the check for a retryable status code from a retryable exception type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For more context, well engineered HTTP retry strategies take into account what I'm getting at. Take some time and study Polly as a point of reference: https://www.pollydocs.org/strategies/retry#defaults. Note that when constructing your retry policy using Polly, you can be explicit about how certain types of exceptions are retried (or not). https://www.pollydocs.org/strategies/retry#overusing-builder-methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If my memory serves me, in the case of a connection error the exception will be a TimeoutException.
No, timeout exception is different than connection error.
The InnerException however predates .NET 8, so it can be used for your purposes.
Yes, but innerException won't give you the level of details that HttpRequestError
does. And there is no defined set other than the one in SocketException
where you can look up the ErrorCode.
Let me see if I can do the refactor to remove the need for null StatusCode. Most likely I will have to do separate TryGetRetryResult for http and grpc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay on this
I went back and checked the references. Here are the things I have found:
- There is no clear contract on HttpRequestException's innerException prior to .NET8.0.
Even withHttRequestError
we have to analyze the codes thoroughly to see which ones could not retried. - Transient Errors covers all exceptions within
HttpRequestException
within Polly as well. Looking at the link you sent above and also found something similar here. - I did refer some other implementations as well. I could not find any implementations further
drilling down the HttpRequestException in case of no response.
I removed the null status code checks and did some refactoring for http case.
Current implementation has no way of retrying the network errors which is very common issue users run into. The scope of this PR is to solve that specific issue.
I created 5425 to investigate further drill down of HttpRequestException
in case of no response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I think the discussions around fine-tuning the logic on how to handle different kinds of HttpRequestException
can be done outside of this PR.
Towards #1779
Design discussion issue #
Changes
Fixes the retry to consider network failures. Adds few tests (not a comprehensive list, planning to add more as a follow up).
Merge requirement checklist
CHANGELOG.md
files updated for non-trivial changes