Implement OTLP retry mechanism #4616

Merged: 6 commits into open-telemetry:main on Aug 30, 2023

Conversation

alanwest (Member) commented on Jun 26, 2023

#4587 implemented a retry mechanism using Grpc.Net.Client's built-in retry mechanism. However, the requirements of the OTLP specification differ from those of the standard gRPC retry mechanism.

This PR contains a spec-compliant OTLP retry mechanism. It can be used by both the gRPC and HTTP exporters.

This PR does not yet wire up the exporters to use this retry mechanism. That will come in a follow-up.

Example usage

public ExportResult Export(...)
{
    var nextRetryDelayMilliseconds = OtlpRetry.InitialBackoffMilliseconds;

    while (true)
    {
        try
        {
            // Do the export
        }
        catch (RpcException ex)
        {
            if (OtlpRetry.TryGetGrpcRetryResult(ex.StatusCode, deadline, ex.Trailers, nextRetryDelayMilliseconds, out var retryResult))
            {
                Thread.Sleep(retryResult.RetryDelay);
                nextRetryDelayMilliseconds = retryResult.NextRetryDelayMilliseconds;
            }
            else
            {
                throw;
            }
        }
    }
}

Follow-up PRs:

  • Expand tests to cover HTTP
  • Integrate retry with gRPC exporters
  • Integrate retry with HTTP exporters

alanwest requested a review from a team on June 26, 2023 at 21:33
Comment on lines +82 to +93
// TODO: Consider introducing a fixed max number of retries (e.g. max 5 retries).
// The spec does not specify a max number of retries, but it may be bad to not cap the number of attempts.
// Without a max number of retry attempts, retries would continue until the deadline.
// Maybe this is ok? However, it may lead to an unexpected behavioral change. For example:
// 1) When using a batch processor, a longer delay due to repeated
// retries up to the deadline may lead to a higher chance that the queue will be exhausted.
// 2) When using the simple processor, a longer delay due to repeated
// retries up to the deadline will lead to a prolonged blocking call.
// if (attemptCount >= MaxAttempts)
// {
//     return false;
// }
alanwest (Member, Author):

I'm inclined to set a maximum cap of 5 retries, but I'd like to get other people's thoughts.

Member:

5 sounds good.
For 1), users can adjust the queue length if they frequently see dropped items. For 2), there is no mitigation, given that the retries are not configurable, correct?

alanwest (Member, Author):

5 sounds good.

Yes, I agree. I plan to leave this as a follow-up for now and put it in place when I actually wire this up with the exporter.
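
A minimal sketch of how such a cap could be layered onto the example loop from the PR description. The MaxAttempts value of 5 and the attemptCount bookkeeping are assumptions taken from this thread; they are not part of the PR as merged.

const int MaxAttempts = 5;
var attemptCount = 0;
var nextRetryDelayMilliseconds = OtlpRetry.InitialBackoffMilliseconds;

while (true)
{
    try
    {
        // Do the export
    }
    catch (RpcException ex)
    {
        attemptCount++;

        // Give up once the attempt budget is spent, even if the status code is retryable.
        if (attemptCount < MaxAttempts
            && OtlpRetry.TryGetGrpcRetryResult(ex.StatusCode, deadline, ex.Trailers, nextRetryDelayMilliseconds, out var retryResult))
        {
            Thread.Sleep(retryResult.RetryDelay);
            nextRetryDelayMilliseconds = retryResult.NextRetryDelayMilliseconds;
        }
        else
        {
            throw;
        }
    }
}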

codecov bot commented on Jun 26, 2023

Codecov Report

Merging #4616 (e3e0805) into main (bb1253e) will increase coverage by 0.02%.
The diff coverage is 82.75%.


@@            Coverage Diff             @@
##             main    #4616      +/-   ##
==========================================
+ Coverage   83.91%   83.94%   +0.02%     
==========================================
  Files         292      293       +1     
  Lines       11908    11966      +58     
==========================================
+ Hits         9993    10045      +52     
- Misses       1915     1921       +6     
Files Changed Coverage
...yProtocol/Implementation/ExportClient/OtlpRetry.cs 82.75%

vishweshbankwar (Member) commented:

Having this on by default for the Http exporter may impact certain users who have a custom HttpClient set up with their own retry mechanism. I have not seen this, but it is something we might run into, so bringing it up.


private static bool IsGrpcStatusCodeRetryable(StatusCode statusCode, bool hasRetryDelay)
{
    switch (statusCode)
Member:

@alanwest Is it possible to do this custom retry for Unavailable and ResourceExhausted only, and handle the rest with https://learn.microsoft.com/en-us/aspnet/core/grpc/retries?view=aspnetcore-7.0#configure-a-grpc-retry-policy

alanwest (Member, Author):

Probably not, especially if we introduce a max number of retries. There's no way to know how many attempts the client library made vs. how many came from special handling for Unavailable/ResourceExhausted.

Though, even if it were possible, I'd be hard pressed to say the complexity would be buying us anything. I think it's pretty nice that gRPC and HTTP use the exact same retry logic.
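
For reference, a sketch of what a spec-aligned body for this method might look like, based on the retryable gRPC status codes listed in the OTLP specification; this is an assumption, not code copied from the PR.

private static bool IsGrpcStatusCodeRetryable(StatusCode statusCode, bool hasRetryDelay)
{
    switch (statusCode)
    {
        case StatusCode.Cancelled:
        case StatusCode.DeadlineExceeded:
        case StatusCode.Aborted:
        case StatusCode.OutOfRange:
        case StatusCode.Unavailable:
        case StatusCode.DataLoss:
            return true;
        case StatusCode.ResourceExhausted:
            // Per the spec, ResourceExhausted is retryable only when the server
            // signals how long to wait (RetryInfo), which hasRetryDelay conveys.
            return hasRetryDelay;
        default:
            return false;
    }
}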

alanwest (Member, Author) commented on Jun 30, 2023

Having this on by default for the Http exporter may impact certain users who have a custom HttpClient set up with their own retry mechanism. I have not seen this, but it is something we might run into, so bringing it up.

I'd like to think on this a bit more to gauge how concerned we need to be about this. Worst case, we could disable retry if someone is using the HttpClientFactory option - though I think this would be a bit of a bummer.

This is one reason I think I'd like to hold on introducing support for the HttpClientFactory for the gRPC exporter. That is, I'd like to land retry first and then consider adding HttpClientFactory support proposed in #4625.

dhhoang commented on Jun 30, 2023

I'd like to think on this a bit more to gauge how concerned we need to be about this. Worst case, we could disable retry if someone is using the HttpClientFactory option - though I think this would be a bit of a bummer.

This is one reason I think I'd like to hold on introducing support for the HttpClientFactory for the gRPC exporter. That is, I'd like to land retry first and then consider adding HttpClientFactory support proposed in #4625.

Would it be better if retry is not enabled by default but offered as an option?

  • If users upgrade to new versions -> no breaking change.
  • Add explicit documentation saying that this retry mechanism is separate from any custom retry configured via HttpClientFactory, so that users are aware when they opt in.

reyang (Member) commented on Jun 30, 2023

Example usage

public ExportResult Export(...)
{
    var nextRetryDelayMilliseconds = OtlpRetry.InitialBackoffMilliseconds;

    while (true)
    {
        try
        {
            // Do the export
        }
        catch (RpcException ex)
        {
            if (OtlpRetry.TryGetGrpcRetryResult(ex.StatusCode, context.Options.Deadline, ex.Trailers, nextRetryDelayMilliseconds, out var retryResult))
            {
                Thread.Sleep(retryResult.RetryDelay);
                nextRetryDelayMilliseconds = retryResult.NextRetryDelayMilliseconds;
            }
            else
            {
                throw;
            }
        }
    }
}

I want to understand the direction here; the above code snippet has two parts that are concerning:

  1. Thread.Sleep based on an external input without a policy; this would block the exporting thread indefinitely.
  2. What's the overall strategy: if something fails to export, do we prioritize exporting the latest data, or do we keep retrying the old data (while the new data might fill up the buffer and get dropped)?

reyang (Member) commented on Jun 30, 2023

  • If users upgrade to new versions -> no breaking change.

What's breaking?

vishweshbankwar (Member) commented:

Having this on by default for the Http exporter may impact certain users who have a custom HttpClient set up with their own retry mechanism. I have not seen this, but it is something we might run into, so bringing it up.

I'd like to think on this a bit more to gauge how concerned we need to be about this. Worst case, we could disable retry if someone is using the HttpClientFactory option - though I think this would be a bit of a bummer.

This is one reason I think I'd like to hold on introducing support for the HttpClientFactory for the gRPC exporter. That is, I'd like to land retry first and then consider adding HttpClientFactory support proposed in #4625.

Looked into this a bit more. Just sharing some points for consideration:

I think we should consider this as an opt-in feature, or on by default with the ability to turn it off. More so for http/protobuf, since we allow a custom HttpClient.

a) Retrying with this approach has some overhead from recreating request messages for each retry during failures.
b) There can be other scenarios, such as a circuit breaker, that users might want to implement.
c) Could there be other response codes, depending on the server's implementation?

alanwest (Member, Author) commented:

@reyang

Thread.Sleep based on an external input without policy, this would block the exporting thread indefinitely.

The call to Export is a blocking call. Task.Delay could be used, but probably to the same effect. I studied the retry mechanism of Grpc.Net.Client extensively to guide this implementation. It uses Task.Delay because the delay is made from an asynchronous method: https://github.com/grpc/grpc-dotnet/blob/16e01b9369a1c2b1440bbe7f77c1520b11dc6db3/src/Grpc.Net.Client/Internal/Retry/RetryCall.cs#L219.

What's the overall strategy: if something fails to export, do we prioritize exporting the latest data, or do we keep retrying the old data (while the new data might fill up the buffer and get dropped)?

The strategy is guided by the spec. I'm not looking to make anything up here, so I think our goal should be that we mutually understand the specification. Here's my interpretation with regards to your question:

If export fails with a retryable status code, then the OTLP exporter should retry the same data. Per the spec, retry attempts will continue up until the configured timeout, which defaults to 10 seconds. If using the batch export processor, this could mean the buffer fills up and new data is dropped.

That said, I am proposing one additional requirement not in the spec: a cap of 5 retry attempts. This is not currently implemented in this PR.

reyang (Member) commented on Jun 30, 2023

The strategy is guided by the spec. I'm not looking to make anything up here, so I think our goal should be that we mutually understand the specification. Here's my interpretation with regards to your question:

If export fails with a retryable status code, then the OTLP exporter should retry the same data. Per the spec, retry attempts will continue up until the configured timeout, which defaults to 10 seconds. If using the batch export processor, this could mean the buffer fills up and new data is dropped.

Where are these coming from? Is this the one we're trying to follow? https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/exporter.md#retry

reyang (Member) commented on Jun 30, 2023

https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md#failures

I don't see how this section in the spec is related to the code snippet from #4616 (comment).

I also couldn't find where "per the spec, retry attempts will continue up until the configured timeout, which defaults to 10 seconds. If using the batch export processor, this could mean the buffer fills up and new data is dropped" is coming from.

github-actions bot repeatedly marked this PR as stale due to inactivity between Jul 8 and Aug 24, 2023; the Stale label was removed each time.
vishweshbankwar (Member) left a comment:

LGTM, considering this will be enabled via a feature flag, allowing us to clarify things with the spec and gather feedback.

The feature flag can be added in a separate PR while wiring things up.
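
A minimal sketch of what such an opt-in flag could look like once the retry handler is wired up; the AppContext switch name used here is hypothetical, and the actual flag is left to the follow-up PR mentioned above.

internal static class OtlpRetryFeatureFlag
{
    // Hypothetical switch name; hosts would opt in via
    // AppContext.SetSwitch("OpenTelemetry.Experimental.OtlpRetry", true);
    private const string SwitchName = "OpenTelemetry.Experimental.OtlpRetry";

    public static bool IsEnabled()
        => System.AppContext.TryGetSwitch(SwitchName, out var enabled) && enabled;
}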

// Consider pulling in Random.Shared implementation.
lock (Random)
{
    return Random.Next(min, max);
Contributor:

Do we want to use max + 1? The second argument for Random.Next is the exclusive upper bound of the random number returned.

alanwest (Member, Author):

🤷 Not sure it matters a whole lot. I'm happy to go either way. Just for some additional context, the behavior of the retry logic in this PR is very much inspired by grpc-dotnet's implementation of the gRPC retry specification, including the calculation for random jitter.

Using Random.Next(min, max) is equivalent to grpc-dotnet's implementation:

https://github.com/grpc/grpc-dotnet/blob/01049834beb8342d1cf93bc5902a7fb2d3a310e1/src/Grpc.Net.Client/Internal/Retry/RetryCall.cs#L216

Contributor:

Not sure it matters a whole lot.

Probably not. It's more of a code readability thing. It might seem a bit odd, as the variable names suggest that the max value could indeed be a possible delayDuration value.

I'm okay with the code as-is.
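
For illustration, a small standalone snippet showing the exclusive upper bound being discussed; the min/max values here are arbitrary and not from the PR.

using System;

var random = new Random();
int min = 1000;
int max = 5000;

// Random.Next(min, max) returns a value in [min, max), so 5000 is never produced.
int exclusive = random.Next(min, max);      // 1000..4999
int inclusive = random.Next(min, max + 1);  // 1000..5000

Console.WriteLine($"{exclusive} {inclusive}");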

return null;
}

var statusDetails = trailers.Get(GrpcStatusDetailsHeader);
Contributor:

Is every gRPC service expected to set this trailer?

Contributor:

My question was more around the general availability of this header for our use. Is this left to the service implementation? Could we expect OTLP Collector to be setting this header?

alanwest (Member, Author):

Could we expect OTLP Collector to be setting this header?

Yes, per the spec, any OTLP receiver like the one for the collector should support this header for throttling requests. https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md#otlpgrpc-throttling

alanwest (Member, Author):

Ah, but yeah, I think I see the confusion... the spec does not make it obvious that it's a header named grpc-status-details-bin. This is the header the collector responds with. Similarly, it's the header that New Relic's OTLP ingest responds with.

Contributor:

This is the header the collector responds with.

I think it would be good to add a code comment about this.
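
For context, a minimal sketch of how a client can read the throttling delay the collector sends in the grpc-status-details-bin trailer. It assumes the Grpc.Core.Api and Google.Api.CommonProtos packages; the helper name is hypothetical, and this is not the code from this PR.

using System;
using Google.Protobuf.WellKnownTypes;
using Google.Rpc;
using Grpc.Core;

internal static class ThrottlingTrailerExample
{
    private const string GrpcStatusDetailsHeader = "grpc-status-details-bin";

    public static TimeSpan? TryGetServerRetryDelay(Metadata trailers)
    {
        // Throttling receivers respond with a binary google.rpc.Status in this trailer.
        var entry = trailers.Get(GrpcStatusDetailsHeader);
        if (entry == null || !entry.IsBinary)
        {
            return null;
        }

        var status = Google.Rpc.Status.Parser.ParseFrom(entry.ValueBytes);

        // RetryInfo, if present, carries the delay the server asks the client to wait.
        foreach (Any detail in status.Details)
        {
            if (detail.TryUnpack(out RetryInfo retryInfo))
            {
                return retryInfo.RetryDelay?.ToTimeSpan();
            }
        }

        return null;
    }
}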

utpilla merged commit 1da0297 into open-telemetry:main on Aug 30, 2023
27 checks passed