[Otlp] Fix Http Retry to cover network failure and add tests #5394

Merged

Conversation

vishweshbankwar
Member

@vishweshbankwar vishweshbankwar commented Feb 27, 2024

Towards #1779
Design discussion issue #

Changes

Fixes the retry logic to account for network failures. Adds a few tests (not a comprehensive list; more are planned as a follow-up).

Merge requirement checklist

  • CONTRIBUTING guidelines followed (license requirements, nullable enabled, static analysis, etc.)
  • Unit tests added/updated
  • Appropriate CHANGELOG.md files updated for non-trivial changes
  • Changes in public API reviewed (if applicable)

codecov bot commented Feb 27, 2024

Codecov Report

Attention: Patch coverage is 91.66667%, with 1 line in your changes missing coverage. Please review.

Project coverage is 84.76%. Comparing base (6250307) to head (4172097).
Report is 122 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5394      +/-   ##
==========================================
+ Coverage   83.38%   84.76%   +1.38%     
==========================================
  Files         297      282      -15     
  Lines       12531    12127     -404     
==========================================
- Hits        10449    10280     -169     
+ Misses       2082     1847     -235     
Flag Coverage Δ
unittests ?
unittests-Solution-Experimental 84.49% <91.66%> (?)
unittests-Solution-Stable 84.72% <91.66%> (?)

Files Coverage Δ
...yProtocol/Implementation/ExportClient/OtlpRetry.cs 97.01% <91.66%> (+14.25%) ⬆️

... and 58 files with indirect coverage changes

@vishweshbankwar vishweshbankwar marked this pull request as ready for review February 27, 2024 21:22
@vishweshbankwar vishweshbankwar requested a review from a team February 27, 2024 21:22
Contributor

@rajkumar-rangaraj rajkumar-rangaraj left a comment

LGTM

@CodeBlanch CodeBlanch added the pkg:OpenTelemetry.Exporter.OpenTelemetryProtocol Issues related to OpenTelemetry.Exporter.OpenTelemetryProtocol NuGet package label Feb 29, 2024
Member

@CodeBlanch CodeBlanch left a comment

Couple nits but LGTM

Comment on lines 200 to 203
if (!statusCode.HasValue)
{
    return true;
}
Member

I don't think returning true from this method when HttpStatusCode is null is a good thing to do. I think this method should only be used when we actually have a status code, so the statusCode parameter should not be nullable.

This is essentially saying: all requests that failed without a status code response (e.g., timeout) are retryable. I do not think this is what we want.

For example, consider the code where we catch failures:

catch (HttpRequestException ex)
{
    OpenTelemetryProtocolExporterEventSource.Log.FailedToReachCollector(this.Endpoint, ex);
    return new ExportClientHttpResponse(success: false, deadlineUtc: deadline, response: null, exception: ex);
}

An HttpRequestException can mean many different things. I think we need to be selective and determine which things we deem retryable.

Take a look at the docs for HttpRequestException.
I think a combination of the InnerException and maybe the HttpRequestError property should be used to clue us into whether a request is retryable. For example, if the inner exception is a TimeoutException then this should be retryable. However, there are a number of values for HttpRequestError that should probably not be retried.
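
To make the idea concrete, here is a minimal sketch of such a check. The class and method names are hypothetical, and the choice of which HttpRequestError values count as retryable is only an assumption for illustration, not something settled in this thread. The HttpRequestError branch compiles only on .NET 8 or later; on older targets only the InnerException check applies.

```csharp
using System;
using System.Net.Http;

// Hypothetical helper, not part of the exporter.
public static class HttpRequestExceptionSketch
{
    public static bool IsRetryable(HttpRequestException ex)
    {
        // InnerException is available on every target framework.
        if (ex.InnerException is TimeoutException)
        {
            return true;
        }

#if NET8_0_OR_GREATER
        // Assumption: connection-level failures are transient,
        // while TLS and authentication failures are not.
        switch (ex.HttpRequestError)
        {
            case HttpRequestError.ConnectionError:
            case HttpRequestError.ResponseEnded:
                return true;
            case HttpRequestError.SecureConnectionError:
            case HttpRequestError.UserAuthenticationError:
                return false;
            default:
                return false;
        }
#else
        // No HttpRequestError before .NET 8: stay conservative.
        return false;
#endif
    }
}
```

With this shape, an HttpRequestException wrapping a TimeoutException is retried on every target, and anything the sketch does not recognize is not.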

Member Author

HttpRequestError is .NET8.0 specific concept. Before that, there was no way to drill down into an HttpRequestException to the level of detail you see in that enum.

My intention here was to keep the changes to a minimum. As a follow-up, I can add a method to this class that inspects the exception first and fails fast in the HTTP case (assuming there is some way to distinguish a retriable error when the status code is not available on other .NET versions).

This method is called from within TryGetRetryResult<TStatusCode, TCarrier> and it is possible that in some cases statuscode will be null (for e.g. connection error). So we can return true here then.

What do you think?

Member

> HttpRequestError is .NET8.0 specific concept.

The InnerException however predates .NET 8, so it can be used for your purposes.

> statuscode will be null (for e.g. connection error)

Yes, correct, but what this PR is currently saying is that when status code is null it means the request is retryable. This is not always correct. Null status code can mean connection error, but does not always mean connection error.

HTTP is more complicated than gRPC because you must use two signals to determine retriability: the status code (when present) or the exception (when status code is not present). If my memory serves me, in the case of a connection error the exception will be a TimeoutException.

I'd have to play around with things myself to recommend how I'd refactor things, but I think you should give this a shot: my first thought is that it may make sense to pass both the status code and exception (or exception type) to TryGetRetryResult. This way you can properly decouple the check for a retryable status code from a retryable exception type.
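
As a rough illustration of that shape, a minimal sketch follows. All names are hypothetical and this is not the exporter's actual TryGetRetryResult. The status codes are the ones the OTLP/HTTP specification lists as retryable (429, 502, 503, 504); treating a SocketException or TimeoutException inner exception as transient is an assumption made only for the example.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;

// Hypothetical helper, not the exporter's actual code.
public static class HttpRetrySketch
{
    // Signal 1: a response arrived, so only the status code decides.
    public static bool IsRetryableStatusCode(HttpStatusCode statusCode)
    {
        switch ((int)statusCode)
        {
            case 429: // Too Many Requests
            case 502: // Bad Gateway
            case 503: // Service Unavailable
            case 504: // Gateway Timeout
                return true;
            default:
                return false;
        }
    }

    // Signal 2: no response arrived, so only the exception decides.
    // Assumption: socket failures and timeouts are transient.
    public static bool IsRetryableException(Exception exception)
    {
        return exception is HttpRequestException
            && (exception.InnerException is SocketException
                || exception.InnerException is TimeoutException);
    }

    // The status code wins when present; otherwise fall back to the exception.
    public static bool IsRetryable(HttpStatusCode? statusCode, Exception exception)
    {
        if (statusCode.HasValue)
        {
            return IsRetryableStatusCode(statusCode.Value);
        }

        return exception != null && IsRetryableException(exception);
    }
}
```

Keeping the two checks in separate methods means a null status code never implies "retryable" on its own; the exception has to qualify independently.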

Member

For more context, well-engineered HTTP retry strategies take into account what I'm getting at. Take some time to study Polly as a point of reference: https://www.pollydocs.org/strategies/retry#defaults. Note that when constructing a retry policy with Polly, you can be explicit about how certain types of exceptions are retried (or not): https://www.pollydocs.org/strategies/retry#overusing-builder-methods

Member Author

> If my memory serves me, in the case of a connection error the exception will be a TimeoutException.

No, a timeout exception is different from a connection error.

> The InnerException however predates .NET 8, so it can be used for your purposes.

Yes, but InnerException won't give you the level of detail that HttpRequestError does. And there is no defined set of values to check, other than the one in SocketException, where you can look up the ErrorCode.

Let me see if I can do the refactor to remove the need for a null StatusCode. Most likely I will need a separate TryGetRetryResult for HTTP and gRPC.

Member Author

Sorry for the delay on this.

I went back and checked the references. Here is what I found:

  1. There is no clear contract on HttpRequestException's InnerException prior to .NET 8.0.
    Even with HttpRequestError, we have to analyze the codes thoroughly to see which ones should not be retried.
  2. Polly's transient-error handling also covers all exceptions within HttpRequestException. This is based on the link you sent above; I also found something similar here.
  3. I looked at some other implementations as well. I could not find any that drill further
    down into the HttpRequestException when there is no response.

I removed the null status code checks and did some refactoring for the HTTP case.

The current implementation has no way of retrying network errors, which is a very common issue users run into. The scope of this PR is to solve that specific issue.
I created #5425 to investigate drilling further down into HttpRequestException when there is no response.

Member Author

@alanwest Also, regarding the specific timeout case: we already handle that by checking IsDeadlineExceeded before deciding whether to retry. Let's discuss other cases where you think we should not retry when an HttpRequestException is thrown with no response in #5425.

Contributor

@utpilla utpilla left a comment

LGTM.

I think the discussions around fine-tuning the logic on how to handle different kinds of HttpRequestException can be done outside of this PR.

@utpilla utpilla merged commit 3cd4c62 into open-telemetry:main Mar 11, 2024
33 checks passed