Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry policy for OTLP/gRPC exporter #4587

Closed

Conversation

alanwest
Copy link
Member

Implements the retry behavior described by the specification for the OTLP/gRPC exporter.

@alanwest alanwest requested a review from a team June 12, 2023 23:09
{
StatusCode.Cancelled,
StatusCode.DeadlineExceeded,
StatusCode.ResourceExhausted,
Copy link
Member Author

@alanwest alanwest Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: add a test validating that the Grpc.Net.Client library honors the retry_delay sent from the server when StatusCode.ResourceExhausted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The server could be sending a retry_delay even with StatusCode.Unavailable. Client should be respecting that as well?

Copy link
Member Author

@alanwest alanwest Jun 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JamesNK I'm curious if you have a moment to share guidance here...

This PR implements retry for our gRPC exporter using a RetryPolicy. There are a number of status codes which can be retried, however, we have a requirement that cannot be expressed using a RetryPolicy alone when status is Unavailable or Resource_Exhausted:

The server MUST indicate retryable errors using code Unavailable and MAY supply additional details via status using RetryInfo containing 0 value of RetryDelay.

My hope was to avoid rolling our own retry mechanism, but from what I can tell Grpc.Net.Client does not support handling RetryInfo.

Options I can think of

  1. Roll our own retry mechanism for all status codes conforming to our requirements.
  2. Use a RetryPolicy for retryable codes which do not require handling RetryInfo and only roll our own logic for Unavailable and Resource_Exhaused.

Are there benefits to one option over the other? Are there better options?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grpc.Net.Client follows the retry spec: https://github.com/grpc/proposal/blob/master/A6-client-retries.md. It doesn't have anything related to RetryInfo.

Do other gRPC client libraries support RetryInfo?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do other gRPC client libraries support RetryInfo?

No, not from what I've seen. This RetryInfo seems to be something specific to some Google APIs. It's unfortunate that the OTel spec adopted this over the standard gRPC retry spec.

OTLP supports both gRPC and HTTP and since HTTP has a Retry-After response header I suspect the decision to adopt this RetryInfo thing for gRPC was inspired by the desire to keep parity with the retry behavior between both HTTP and gRPC.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gRPC retry supports the grpc-retry-pushback-ms response header which seems similar.

Grpc.Net.Client retries doesn't support RetryInfo. I suggest you look at other implementations to see what they do. If they have a clever technique then maybe that is something that could also be done in .NET. If they are all doing retries manually then you'll have to do it manually.

@codecov
Copy link

codecov bot commented Jun 12, 2023

Codecov Report

Merging #4587 (9d9be1d) into main (039556d) will increase coverage by 0.05%.
The diff coverage is 100.00%.

❗ Current head 9d9be1d differs from pull request most recent head 3bd6412. Consider uploading reports for the commit 3bd6412 to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4587      +/-   ##
==========================================
+ Coverage   85.13%   85.18%   +0.05%     
==========================================
  Files         320      320              
  Lines       12679    12707      +28     
==========================================
+ Hits        10794    10825      +31     
+ Misses       1885     1882       -3     
Impacted Files Coverage Δ
...TelemetryProtocol/OtlpExporterOptionsExtensions.cs 97.54% <100.00%> (+0.73%) ⬆️

... and 6 files with indirect coverage changes

@@ -10,6 +10,14 @@
are now included in this package.
([#4556](https://github.com/open-telemetry/opentelemetry-dotnet/pull/4556))

* Added support for the OTLP gRPC exporter to retry failed requests due to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't cover PartialSuccess, right? We wouldn't be retrying any of the rejected items from a partially successful response.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the spec does not support retrying rejected items from a partially successful response. However, you bring up a good point, the spec does say

The client MUST NOT retry the request when it receives a partial success response where the partial_success is populated.

I need to fix this.

return GrpcChannel.ForAddress(options.Endpoint);
// The specification does not dictate the parameters of the default retry policy.
// These defaults are borrowed from opentelemetry-java.
var retryPolicy = new RetryPolicy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this impact the default deadline we set?

Is it possible that while retrying the deadline is exceeded and the retry does not complete or not performed(if retryAfter is more than default deadline)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

@github-actions github-actions bot added the Stale Issues and pull requests which have been flagged for closing due to inactivity label Jun 21, 2023
@alanwest
Copy link
Member Author

It seems that using the built in retry mechanism of Grpc.Net.Client may not work for our purposes since the OTel spec has some unique non-standard requirements. I've got another solution in the works that is spec compliant. Closing this PR for now and will open another soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Stale Issues and pull requests which have been flagged for closing due to inactivity
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants