Implement OTLP retry mechanism #4616

Merged: 6 commits into open-telemetry:main on Aug 30, 2023

Conversation

alanwest (Member) commented on Jun 26, 2023

#4587 implemented a retry mechanism using Grpc.Net.Client's built-in retry mechanism. However, the requirements of the OTLP specification differ from those of the standard gRPC retry mechanism.

This PR contains a spec-compliant OTLP retry mechanism. It can be used by both the gRPC and HTTP exporters.

This PR does not yet wire up the exporters to use this retry mechanism. That will come in a follow-up.

Example usage

public ExportResult Export(...)
{
    var nextRetryDelayMilliseconds = OtlpRetry.InitialBackoffMilliseconds;

    while (true)
    {
        try
        {
            // Do the export
        }
        catch (RpcException ex)
        {
            if (OtlpRetry.TryGetGrpcRetryResult(ex.StatusCode, deadline, ex.Trailers, nextRetryDelayMilliseconds, out var retryResult))
            {
                Thread.Sleep(retryResult.RetryDelay);
                nextRetryDelayMilliseconds = retryResult.NextRetryDelayMilliseconds;
            }
            else
            {
                throw;
            }
        }
    }
}

Follow-up PRs:

  • Expand tests to cover HTTP
  • Integrate retry with gRPC exporters
  • Integrate retry with HTTP exporters

alanwest requested a review from a team on June 26, 2023 at 21:33
Comment on lines +82 to +93
// TODO: Consider introducing a fixed max number of retries (e.g. max 5 retries).
// The spec does not specify a max number of retries, but it may be bad to not cap the number of attempts.
// Without a max number of retry attempts, retries would continue until the deadline.
// Maybe this is ok? However, it may lead to an unexpected behavioral change. For example:
// 1) When using a batch processor, a longer delay due to repeated
// retries up to the deadline may lead to a higher chance that the queue will be exhausted.
// 2) When using the simple processor, a longer delay due to repeated
// retries up to the deadline will lead to a prolonged blocking call.
// if (attemptCount >= MaxAttempts)
// {
//     return false;
// }
alanwest (Member, Author):

I'm inclined to set a maximum cap of 5 retries, but I'd like to get other people's thoughts.

Member:

5 sounds good.
For 1), users can adjust the queue length if they frequently see dropped items. For 2), there is no mitigation, given that the retries are not configurable, correct?

alanwest (Member, Author):

5 sounds good.

Yes, I agree. I plan to leave this as a follow-up for now and put it in place when I actually wire this up with the exporter.
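
A minimal sketch of how such a cap could be layered onto the example loop from the PR description. The MaxAttempts value of 5 and the attemptCount bookkeeping are assumptions taken from this thread; they are not part of the PR as merged.

const int MaxAttempts = 5;
var attemptCount = 0;
var nextRetryDelayMilliseconds = OtlpRetry.InitialBackoffMilliseconds;

while (true)
{
    try
    {
        // Do the export
    }
    catch (RpcException ex)
    {
        attemptCount++;

        // Give up once the attempt budget is spent, even if the status code is retryable.
        if (attemptCount < MaxAttempts
            && OtlpRetry.TryGetGrpcRetryResult(ex.StatusCode, deadline, ex.Trailers, nextRetryDelayMilliseconds, out var retryResult))
        {
            Thread.Sleep(retryResult.RetryDelay);
            nextRetryDelayMilliseconds = retryResult.NextRetryDelayMilliseconds;
        }
        else
        {
            throw;
        }
    }
}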

codecov bot commented on Jun 26, 2023

Codecov Report

Merging #4616 (e3e0805) into main (bb1253e) will increase coverage by 0.02%.
The diff coverage is 82.75%.


@@            Coverage Diff             @@
##             main    #4616      +/-   ##
==========================================
+ Coverage   83.91%   83.94%   +0.02%     
==========================================
  Files         292      293       +1     
  Lines       11908    11966      +58     
==========================================
+ Hits         9993    10045      +52     
- Misses       1915     1921       +6     
Files Changed Coverage
...yProtocol/Implementation/ExportClient/OtlpRetry.cs 82.75%

vishweshbankwar (Member) commented:

Having this on by default for the Http exporter may impact certain users who have a custom HttpClient set up with their own retry mechanism. I have not seen this, but it is something we might run into, so bringing it up.


private static bool IsGrpcStatusCodeRetryable(StatusCode statusCode, bool hasRetryDelay)
{
    switch (statusCode)
Member:

@alanwest Is it possible to do this custom retry for Unavailable and ResourceExhausted only, and handle the rest with https://learn.microsoft.com/en-us/aspnet/core/grpc/retries?view=aspnetcore-7.0#configure-a-grpc-retry-policy

alanwest (Member, Author):

Probably not, especially if we introduce a max number of retries. There's no way to know how many attempts the client library made vs. how many came from special handling for Unavailable/ResourceExhausted.

Though, even if it were possible, I'd be hard pressed to say the complexity would be buying us anything. I think it's pretty nice that gRPC and HTTP use the exact same retry logic.
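
For reference, a sketch of what a spec-aligned body for this method might look like, based on the retryable gRPC status codes listed in the OTLP specification; this is an assumption, not code copied from the PR.

private static bool IsGrpcStatusCodeRetryable(StatusCode statusCode, bool hasRetryDelay)
{
    switch (statusCode)
    {
        case StatusCode.Cancelled:
        case StatusCode.DeadlineExceeded:
        case StatusCode.Aborted:
        case StatusCode.OutOfRange:
        case StatusCode.Unavailable:
        case StatusCode.DataLoss:
            return true;
        case StatusCode.ResourceExhausted:
            // Per the spec, ResourceExhausted is retryable only when the server
            // signals how long to wait (RetryInfo), which hasRetryDelay conveys.
            return hasRetryDelay;
        default:
            return false;
    }
}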

alanwest (Member, Author) commented on Jun 30, 2023

Having this on by default for the Http exporter may impact certain users who have a custom HttpClient set up with their own retry mechanism. I have not seen this, but it is something we might run into, so bringing it up.

I'd like to think on this a bit more to gauge how concerned we need to be about this. Worst case, we could disable retry if someone is using the HttpClientFactory option - though I think this would be a bit of a bummer.

This is one reason I think I'd like to hold on introducing support for the HttpClientFactory for the gRPC exporter. That is, I'd like to land retry first and then consider adding HttpClientFactory support proposed in #4625.

dhhoang commented on Jun 30, 2023

I'd like to think on this a bit more to gauge how concerned we need to be about this. Worst case, we could disable retry if someone is using the HttpClientFactory option - though I think this would be a bit of a bummer.

This is one reason I think I'd like to hold on introducing support for the HttpClientFactory for the gRPC exporter. That is, I'd like to land retry first and then consider adding HttpClientFactory support proposed in #4625.

Would it be better if retry is not enabled by default but offered as an option?

  • If users upgrade to new versions -> no breaking change.
  • Add explicit documentation saying that this retry mechanism is separate from any custom retry configured via HttpClientFactory, so that users are aware when they opt in.

reyang (Member) commented on Jun 30, 2023

Example usage

public ExportResult Export(...)
{
    var nextRetryDelayMilliseconds = OtlpRetry.InitialBackoffMilliseconds;

    while (true)
    {
        try
        {
            // Do the export
        }
        catch (RpcException ex)
        {
            if (OtlpRetry.TryGetGrpcRetryResult(ex.StatusCode, context.Options.Deadline, ex.Trailers, nextRetryDelayMilliseconds, out var retryResult))
            {
                Thread.Sleep(retryResult.RetryDelay);
                nextRetryDelayMilliseconds = retryResult.NextRetryDelayMilliseconds;
            }
            else
            {
                throw;
            }
        }
    }
}

I want to understand the direction here; the above code snippet has two parts that are concerning:

  1. Thread.Sleep based on an external input without a policy; this would block the exporting thread indefinitely.
  2. What's the overall strategy: if something fails to export, do we prioritize exporting the latest data, or do we keep retrying the old data (while the new data might fill up the buffer and get dropped)?

reyang (Member) commented on Jun 30, 2023

  • If users upgrade to new versions -> no breaking change.

What's breaking?

vishweshbankwar (Member) commented:

Having this on by default for the Http exporter may impact certain users who have a custom HttpClient set up with their own retry mechanism. I have not seen this, but it is something we might run into, so bringing it up.

I'd like to think on this a bit more to gauge how concerned we need to be about this. Worst case, we could disable retry if someone is using the HttpClientFactory option - though I think this would be a bit of a bummer.

This is one reason I think I'd like to hold on introducing support for the HttpClientFactory for the gRPC exporter. That is, I'd like to land retry first and then consider adding HttpClientFactory support proposed in #4625.

Looked into this a bit more. Just sharing some points for consideration:

I think we should consider this as an opt-in feature, or on by default with the ability to turn it off. More so for http/protobuf, since we allow a custom HttpClient.

a) Retrying with this approach has some overhead from recreating request messages for each retry during failures.
b) There can be other scenarios, such as a circuit breaker, that users might want to implement.
c) Could there be other response codes, depending on the server's implementation?

alanwest (Member, Author) commented:

@reyang

Thread.Sleep based on an external input without policy, this would block the exporting thread indefinitely.

The call to Export is a blocking call. Task.Delay could be used, but probably to the same effect. I studied the retry mechanism of Grpc.Net.Client extensively to guide this implementation. It uses Task.Delay because the delay is made from an asynchronous method: https://github.com/grpc/grpc-dotnet/blob/16e01b9369a1c2b1440bbe7f77c1520b11dc6db3/src/Grpc.Net.Client/Internal/Retry/RetryCall.cs#L219.

What's the overall strategy: if something fails to export, do we prioritize exporting the latest data, or do we keep retrying the old data (while the new data might fill up the buffer and get dropped)?

The strategy is guided by the spec. I'm not looking to make anything up here, so I think our goal should be that we mutually understand the specification. Here's my interpretation with regards to your question:

If export fails with a retryable status code, then the OTLP exporter should retry the same data. Per the spec, retry attempts will continue up until the configured timeout, which defaults to 10 seconds. If using the batch export processor, this could mean the buffer fills up and new data is dropped.

That said, I am proposing one additional requirement not in the spec: a cap of 5 retry attempts. This is not currently implemented in this PR.

reyang (Member) commented on Jun 30, 2023

The strategy is guided by the spec. I'm not looking to make anything up here, so I think our goal should be that we mutually understand the specification. Here's my interpretation with regards to your question:

If export fails with a retryable status code, then the OTLP exporter should retry the same data. Per the spec, retry attempts will continue up until the configured timeout, which defaults to 10 seconds. If using the batch export processor, this could mean the buffer fills up and new data is dropped.

Where are these coming from? Is this the one we're trying to follow? https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/exporter.md#retry

reyang (Member) commented on Jun 30, 2023

https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md#failures

I don't see how this section in the spec is related to the code snippet from #4616 (comment).

I also couldn't find where "per the spec, retry attempts will continue up until the configured timeout, which defaults to 10 seconds. If using the batch export processor, this could mean the buffer fills up and new data is dropped" is coming from.

github-actions bot repeatedly marked this PR as stale due to inactivity between Jul 8 and Aug 24, 2023; the Stale label was removed each time.
vishweshbankwar (Member) left a comment:

LGTM, considering this will be enabled via a feature flag, allowing us to clarify things with the spec and gather feedback.

The feature flag can be added in a separate PR while wiring things up.
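
A minimal sketch of what such an opt-in flag could look like once the retry handler is wired up; the AppContext switch name used here is hypothetical, and the actual flag is left to the follow-up PR mentioned above.

internal static class OtlpRetryFeatureFlag
{
    // Hypothetical switch name; hosts would opt in via
    // AppContext.SetSwitch("OpenTelemetry.Experimental.OtlpRetry", true);
    private const string SwitchName = "OpenTelemetry.Experimental.OtlpRetry";

    public static bool IsEnabled()
        => System.AppContext.TryGetSwitch(SwitchName, out var enabled) && enabled;
}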

// Consider pulling in Random.Shared implementation.
lock (Random)
{
    return Random.Next(min, max);
Contributor:

Do we want to use max + 1? The second argument for Random.Next is the exclusive upper bound of the random number returned.

alanwest (Member, Author):

🤷 Not sure it matters a whole lot. I'm happy to go either way. Just for some additional context, the behavior of the retry logic in this PR is very much inspired by grpc-dotnet's implementation of the gRPC retry specification, including the calculation for random jitter.

Using Random.Next(min, max) is equivalent to grpc-dotnet's implementation:

https://github.com/grpc/grpc-dotnet/blob/01049834beb8342d1cf93bc5902a7fb2d3a310e1/src/Grpc.Net.Client/Internal/Retry/RetryCall.cs#L216

Contributor:

Not sure it matters a whole lot.

Probably not. It's more of a code readability thing. It might seem a bit odd, as the variable names suggest that the max value could indeed be a possible delayDuration value.

I'm okay with the code as-is.
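
For illustration, a small standalone snippet showing the exclusive upper bound being discussed; the min/max values here are arbitrary and not from the PR.

using System;

var random = new Random();
int min = 1000;
int max = 5000;

// Random.Next(min, max) returns a value in [min, max), so 5000 is never produced.
int exclusive = random.Next(min, max);      // 1000..4999
int inclusive = random.Next(min, max + 1);  // 1000..5000

Console.WriteLine($"{exclusive} {inclusive}");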

return null;
}

var statusDetails = trailers.Get(GrpcStatusDetailsHeader);
Contributor:

Is every gRPC service expected to set this trailer?

Contributor:

My question was more around the general availability of this header for our use. Is this left to the service implementation? Could we expect OTLP Collector to be setting this header?

alanwest (Member, Author):

Could we expect OTLP Collector to be setting this header?

Yes, per the spec, any OTLP receiver like the one for the collector should support this header for throttling requests. https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md#otlpgrpc-throttling

alanwest (Member, Author):

Ah, but yeah, I think I see the confusion... the spec does not make it obvious that it's a header named grpc-status-details-bin. This is the header the collector responds with. Similarly, it's the header that New Relic's OTLP ingest responds with.

Contributor:

This is the header the collector responds with.

I think it would be good to add a code comment about this.
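
For context, a minimal sketch of how a client can read the throttling delay the collector sends in the grpc-status-details-bin trailer. It assumes the Grpc.Core.Api and Google.Api.CommonProtos packages; the helper name is hypothetical, and this is not the code from this PR.

using System;
using Google.Protobuf.WellKnownTypes;
using Google.Rpc;
using Grpc.Core;

internal static class ThrottlingTrailerExample
{
    private const string GrpcStatusDetailsHeader = "grpc-status-details-bin";

    public static TimeSpan? TryGetServerRetryDelay(Metadata trailers)
    {
        // Throttling receivers respond with a binary google.rpc.Status in this trailer.
        var entry = trailers.Get(GrpcStatusDetailsHeader);
        if (entry == null || !entry.IsBinary)
        {
            return null;
        }

        var status = Google.Rpc.Status.Parser.ParseFrom(entry.ValueBytes);

        // RetryInfo, if present, carries the delay the server asks the client to wait.
        foreach (Any detail in status.Details)
        {
            if (detail.TryUnpack(out RetryInfo retryInfo))
            {
                return retryInfo.RetryDelay?.ToTimeSpan();
            }
        }

        return null;
    }
}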

utpilla merged commit 1da0297 into open-telemetry:main on Aug 30, 2023
27 checks passed