Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add specification for otlp retry configuration parameters and defaults #1974

Closed
wants to merge 3 commits into from

Conversation

jack-berg
Copy link
Member

Fixes #1742.

Changes

Expand the OTLP exporter retry language to include language for using the built-in gRPC retry mechanism. This allows SDKs to leverage the mechanism as a standard without precluding the use of other retry mechanisms (such as the one used in the collector).

It also establishes declares the configurable parameters if the built-in gRPC retry mechanism is used, with default values.

Related issues #

  • opentelemetry-java PR #3636 prototypes adding a retry mechanism but is blocked on spec language
  • opentelemetry-php Issue #328 requests an OTLP retry implementation
  • opentelemetry-dotnet Issue #1779 requests an OTLP retry implementation

@arminru arminru added area:configuration Related to configuring the SDK area:sdk Related to the SDK spec:protocol Related to the specification/protocol directory labels Sep 30, 2021
@tigrannajaryan
Copy link
Member

opentelemetry-java PR #3636 prototypes adding a retry mechanism but is blocked on spec language

Can you please point to the part of the spec that is blocking this?

For OTLP/HTTP, the errors `408 (Request Timeout)` and `5xx (Server Errors)` are defined as transient, detailed information about erros can be found in the [HTTP failures section](otlp.md#failures). For the OTLP/gRPC, the full list of the gRPC retryable status codes can be found in the [gRPC response section](otlp.md#otlpgrpc-response).
For OTLP/HTTP, the errors `408 (Request Timeout)` and `5xx (Server Errors)` are defined as transient, detailed information about errors can be found in the [HTTP failures section](otlp.md#failures). For the OTLP/gRPC, the full list of the gRPC retryable status codes can be found in the [gRPC response section](otlp.md#otlpgrpc-response).

SDKs MAY use the built-in [gRPC Retry](https://github.com/grpc/proposal/blob/master/A6-client-retries.md) mechanism to facilitate exponential back-off. If the built-in gRPC mechanism is used, the following values SHOULD be available for configuration:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For retry, I guess we need to consider the following situation:

  1. When transport error happened - the exporter didn't hear back from the receiver. (in this case the retry policy might be different from the situation where the exporter got the server response - where the server might give hint on Retry-After or Server Too Busy / Not Available).
  2. When partial success occurred - e.g. HTTP 206. The response payload needs to include what succeeded and what failed, and the exporter will need to understand and respect that.
  3. How to de-dupe - if there is no partial success, a simple unique sequence id (scoped to the connection or session) might be used. If there is partial success, a more sophisticated mechanism might be needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the 10/5/2021 Spec SIG, we discussed that retry behavior in partial success scenarios could be a separate area of future work.

Upon closer inspection, of the otlp spec, I believe each of these points are already documented as known limitations.

  1. When transport error happened - the exporter didn't hear back from the receiver. (in this case the retry policy might be different from the situation where the exporter got the server response - where the server might give hint on Retry-After or Server Too Busy / Not Available).

This falls under the Duplicate Data limitation.

  1. When partial success occurred - e.g. HTTP 206. The response payload needs to include what succeeded and what failed, and the exporter will need to understand and respect that.

This falls under the Partial Success limitation.

  1. How to de-dupe - if there is no partial success, a simple unique sequence id (scoped to the connection or session) might be used. If there is partial success, a more sophisticated mechanism might be needed.

This falls under the Request Acknowledgements / Partial Success limitations.

I could add language clarifying that these limitations also apply to retry attempts, but that's already implied. I think the OTLP spec's language is already sufficiently clear on what happens in retry edge cases: data duplication can occur if things like network interruptions happen, and there's no accommodation for partial success (for retry attempts or in general).

@jack-berg
Copy link
Member Author

Can you please point to the part of the spec that is blocking this?

@tigrannajaryan talked about it in the 10/05/2021 Spec SIG but reiterating here, there is no spec language that blocks usage of the gRPC retry mechanism. However, the language SIGs would be more comfortable implementing if the language was more explicit.

- `maxBackoff`: Must be a duration greater than 0.
- `backoffMultiplier` Must be a number greater than 0.

SDKs have unspecified default values for these properties. They are used to compute the backoff as follows:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't in the SIG where I think it was discussed, but why do we leave the defaults unspecified? Actually having agreed upon defaults to use in the implementation was one of the main points I was hoping for from the spec language.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SDKs have unspecified default values for these properties.

This sentence is unnecessary. It is implicitly assumed. If the spec does not define something then it is unspecified.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually having agreed upon defaults to use in the implementation was one of the main points I was hoping for from the spec language.

Why do we need it in the spec? gRPC has built-in defaults. What's wrong with those defaults?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gRPC has built-in defaults.

I can't find any evidence that gRPC has built in defaults.

This sentence is unnecessary. It is implicitly assumed. If the spec does not define something then it is unspecified.

I think there's an important distinction. If something isn't in the spec, then it could be an accidental omission that could be added in the future. Having explicit language in the spec that says "SDKs have unspecified defaults" makes it clear that its not an omission but a conscious choice to let the SDKs choose the default values. Here's another example of this language.

Copy link
Contributor

@anuraaga anuraaga Oct 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If gRPC had defaults to copy from, using those could probably work well but indeed I can't find them.

I think having defaults in the spec is important because I expect all languages to have the same retry behavior as a user - there doesn't seem to be anyone language specific about it. We've seen some env variables diverge, probably due to ordering of SDK implementation and spec, but I think the intent is for it to define cross-language concerns where possible to avoid divergence. Making it easier to implement an SDK by not needing to wonder as much is a great side-benefit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came to the same conclusion as @anuraaga here. We need this language to ensure a consistent user experience.

Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am in favour of adding this proposed sentence:

SDKs MAY use the built-in gRPC Retry mechanism to facilitate exponential back-off.

Other changes proposed in this PR I think are unnecessary.

- `maxBackoff`: Must be a duration greater than 0.
- `backoffMultiplier` Must be a number greater than 0.

SDKs have unspecified default values for these properties. They are used to compute the backoff as follows:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SDKs have unspecified default values for these properties.

This sentence is unnecessary. It is implicitly assumed. If the spec does not define something then it is unspecified.

Comment on lines 76 to 77
- The initial retry attempt will occur after `random(0, initialBackoff)`
- The `n`-th retry attempt will occur after `random(0, min(initialBackoff*backoffMultiplier**(n-1), maxBackoff))`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we need this in the spec. To make such a recommendation we need to back it off by a research and demonstrate that a particular algorithm for some reason is more suitable for OTLP than others. I do not think it is necessary though. It can remain an implementation detail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reply to this comment is bundled here.

Comment on lines 69 to 72
- `maxAttempts`: The maximum number of attempts, including the original request. Must be an integer greater than 1 and less than 6.
- `initialBackoff`: Must be a duration greater than 0.
- `maxBackoff`: Must be a duration greater than 0.
- `backoffMultiplier` Must be a number greater than 0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are unnecessarily exposing very fine grained parameters. I would prefer not to overload the space of configuration parameters unless there is a clear evidence it is necessary (is there?).
gRPC has defaults that SDK authors can use. If they feel strong about using different default values they can do it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think we need specification around this.

Let's say we don't add language like this, and just add simpler language that allows use of the built-in gRPC retry mechanism. SDKs may end up:

  • Choosing different variations of exponential backoff retry mechanisms, with different configuration parameters, different default values, and different ways of computing the delay. This is a bad user story because users have to be aware of the particulars of the SDKs they rely on to understand the retry behavior.
  • Its reasonable that users would want to reconfigure the default retry parameters - an app writing to a collector in its same data center is going to want different parameters than one writing to a vendor in a different region / continent. Since the configuration parameters are free to be different across SDKs, there's no uniform way to specify env var retry configuration. SDKs could use language specific environment variables, but this is a bad user story because ops has to be aware of language particulars.
  • Because there's no language requiring consistency between otlp grpc, http/protobuf, http/json, an SDK might choose to use the built in mechanism for grpc, and different mechanisms for http/protobuf and http/json. This is a bad user story because it allows for behavioral changes in retry based on otlp protocol. Also, it allows for an inconsistent configuration experience between different otlp protocols.

I actually think the language should go further and: 1. Specify that SDKs "MUST" compute backoff using a strategy like the built-in gRPC retry mechanism. 2. Reiterate the configuration parameters and the algorithm to calculate the delay. 3. Specify reasonable defaults.

This would make implementations unambiguous to implement for SDK authors. Behavior would be consistent across languages and protocols for users. And we could eventually add env (or other configuration) that would allow users to reconfigure the defaults in a consistent way.

In terms of research, the gRPC retry mechanism has been incubating for years, and it will be difficult for any research we do to supplant it. It uses a "full jitter" calculation, where the delay is a random value between 0 and the target delay for the attempt. This amazon post concludes that full jitter is a good implementation.

- `maxBackoff`: Must be a duration greater than 0.
- `backoffMultiplier` Must be a number greater than 0.

SDKs have unspecified default values for these properties. They are used to compute the backoff as follows:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually having agreed upon defaults to use in the implementation was one of the main points I was hoping for from the spec language.

Why do we need it in the spec? gRPC has built-in defaults. What's wrong with those defaults?

@jack-berg
Copy link
Member Author

Ping @tigrannajaryan. Curious how we can make progress on this. I'd really like to find a path forward for having the language SDKs (especially java) have retry logic for transient errors, as is already required by the spec.

@tigrannajaryan
Copy link
Member

I'd really like to find a path forward for having the language SDKs (especially java) have retry logic for transient errors, as is already required by the spec.

I think part of the proposed changes are good, the rest I am not convinced about. To make progress it may be better to split the PR into 2 parts, merge the part on which we agree quickly and continue discussing the debatable parts on a new PR.

@jack-berg
Copy link
Member Author

How about this for a reduced scope version:

SDKs MAY use the built-in gRPC Retry mechanism to facilitate exponential back-off.
SDKs SHOULD have retry configuration and mechanics that are consistent across OTLP protocols. For example, if the built-in gRPC Retry mechanism is used for the grpc protocol, the http/protobuf and http/json protocols should expose the same configuration options and compute the backoff duration in the same manner.

It explicitly approves the gRPC mechanism, and at least provides consistency within a language SDK for how the retry mechanism works across protocols.

I still think it would be better to have consistency across all SDKs, in terms of configuration and defaults, but maybe this language would be enough to get some movement.

@anuraaga what do you think? With this language I imagine opentelemetry-java could add setRetryPolicy(RetryPolicy retryPolicy) methods on all the otlp exporter builders, initially without any default values. We could add experimental properties that the autoconfigure module could key off of to configure the properties like:

otel.experimental.exporter.otlp.retry.maxAttempts
otel.experimental.exporter.otlp.retry.initialBackoff
otel.experimental.exporter.otlp.retry.maxBackoff
otel.experimental.exporter.otlp.retry.backoffMultiplier

@anuraaga
Copy link
Contributor

Scoping down to just those settings, and no specified defaults, seems fine for me.

Note that if we don't decide on any settings here, then to add retry in Java we would probably

  1. Flip on gRPC retry when using that library
  2. Implement a similar mechanism for when not using the gRPC library in our code
  3. Decide on default parameters by finding some reference out there
  4. Provide no configuration knobs to the user - without a spec, we can't expose this anymore than as an internal implementation detail subject to change and removal at any time.

I'd say the crux of the issue is retry is underspec'd - retry is already required with a MUST, but the spec provides no guidelines on how an SDK should actually implement this. I don't think the intent of the spec is to provide such general points without providing help to the SDKs to result in convergence, and easier to implement SDKs.

@tigrannajaryan
Copy link
Member

@jack-berg to be clear: I am all for consistency between SDKs and transports, especially from the end-user's configuration perspective.

I am against prescribing a specific backoff algorithm and parameters for the algorithm because we don't know where the optimum is (we can't know unless we do an extensive research using a wide range of representative network configurations, data flows and failure modes).

Without that knowledge there is a danger of over-specifying. Then people may start to depend on the exact behavior and we will be stuck with it even if we later discover it is not the best approach for our use cases.

How about this for a reduced scope version:

SDKs MAY use the built-in gRPC Retry mechanism to facilitate exponential back-off.
SDKs SHOULD have retry configuration and mechanics that are consistent across OTLP protocols. For example, if the built-in gRPC Retry mechanism is used for the grpc protocol, the http/protobuf and http/json protocols should expose the same configuration options and compute the backoff duration in the same manner.

This sounds reasonable to me.

@jack-berg
Copy link
Member Author

I'd say the crux of the issue is retry is underspec'd - retry is already required with a MUST, but the spec provides no guidelines on how an SDK should actually implement this.

Yes. Also agree with @tigrannajaryan that its not good to prescribe a parameters when "we don't know where the optimum is". The way I see it, we're in a catch 22.

I hope that we can get the data needed to prescribe parameters by scoping down the language such that SDKs can offer opt-in retry such that users can begin to experiment with different configurations and eventually allow us to arrive at a consensus.

  1. Decide on default parameters by finding some reference out there
  2. Provide no configuration knobs to the user - without a spec, we can't expose this anymore than as an internal implementation detail subject to change and removal at any time.

@anuraaga I've searched and I can't find a reference. I think that after this language is in and we have an explicit greenlight for the built in gRPC mechanism, its reasonable to offer a method on the OTLP exporters to configure a policy with parameters defined in the gRPC retry spec. The only thing missing from the spec at that point would be a set of retry environment variables and their default values, but we can still open up retry to users that programmatically configure the SDK. And perhaps we could also justify some experimental environment variables to make it easier to collect enough data needed to make a decision.

@jack-berg
Copy link
Member Author

@tigrannajaryan / @anuraaga thoughts on moving forward with the reduced scope?


SDKs MAY use the built-in [gRPC Retry](https://github.com/grpc/proposal/blob/master/A6-client-retries.md) mechanism to facilitate exponential back-off.

SDKs SHOULD have retry configuration and mechanics that are consistent across OTLP protocols. For example, if the built-in gRPC Retry mechanism is used for the `grpc` protocol, the `http/protobuf` and `http/json` protocols should expose the same configuration options and compute the backoff duration in the same manner.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I may not have articulated my point correctly - I don't think we should have this language without specifying at least what knobs are present. Different SDKs having different knobs is not the intent of the OTel spec I think. If we can't come up with any standard for at least the retry configuration, than I think the spec can only suggest "no knobs". SDKs could still add knobs in a way that is obvious they can be removed at any time, perhaps with an experimental prefix to the env variables or in an experimental package, but any official knob would be spec-incompliant, which is good to ensure that languages don't diverge in their knobs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tigrannajaryan Is there a mechanism to add this paragraph in an experimental fashion to the spec itself?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have this language as well as corresponding knobs in the spec. Removing the knobs was a compromise to try to get some movement.

I accepted the compromise of removing the knobs because by explicitly allowing the built in gRPC Retry mechanism, and saying that SDKS SHOULD have retry configuration ..., the knobs defined in gRPC are the implied means of configuration.

@tigrannajaryan
Copy link
Member

We have 3 different opinions about how we go with this change. I believe it means that this change is not ready to be accepted in any form. We need significantly better level of consensus to make changes to the spec. I think we should keep this unspecified until there is new evidence and better understanding on what's the right approach.

Is there a mechanism to add this paragraph in an experimental fashion to the spec itself?

I do not think it is necessary. SDKs are free to experiment with the behavior as long as it does not contradict with the spec. The outcome of the experiments can feed back to the spec. As long as you clearly mark the SDK features as experimental in the SDK itself you are free to iterate, change, remove and if you find the "best" way to handle retries you can then legalize it in the spec and force all other SDKs to follow it.

@github-actions
Copy link

github-actions bot commented Nov 3, 2021

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 3, 2021
@github-actions
Copy link

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Nov 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:configuration Related to configuring the SDK area:sdk Related to the SDK spec:protocol Related to the specification/protocol directory Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Additional specification around OTLP exporter retry requirements
5 participants