Add specification for otlp retry configuration parameters and defaults #1974

jack-berg · 2021-09-27T23:24:02Z

Changes

Expand the OTLP exporter retry language to include language for using the built-in gRPC retry mechanism. This allows SDKs to leverage the mechanism as a standard without precluding the use of other retry mechanisms (such as the one used in the collector).

It also declares the parameters that are configurable when the built-in gRPC retry mechanism is used, along with their default values.

Related issues

opentelemetry-java PR #3636 prototypes adding a retry mechanism but is blocked on spec language
opentelemetry-php Issue #328 requests an OTLP retry implementation
opentelemetry-dotnet Issue #1779 requests an OTLP retry implementation

specification/protocol/exporter.md

tigrannajaryan · 2021-10-05T14:41:24Z

opentelemetry-java PR #3636 prototypes adding a retry mechanism but is blocked on spec language

Can you please point to the part of the spec that is blocking this?

reyang · 2021-10-05T15:22:42Z

specification/protocol/exporter.md

-For OTLP/HTTP, the errors `408 (Request Timeout)` and `5xx (Server Errors)` are defined as transient, detailed information about erros can be found in the [HTTP failures section](otlp.md#failures). For the OTLP/gRPC, the full list of the gRPC retryable status codes can be found in the [gRPC response section](otlp.md#otlpgrpc-response).
+For OTLP/HTTP, the errors `408 (Request Timeout)` and `5xx (Server Errors)` are defined as transient, detailed information about errors can be found in the [HTTP failures section](otlp.md#failures). For the OTLP/gRPC, the full list of the gRPC retryable status codes can be found in the [gRPC response section](otlp.md#otlpgrpc-response).
+
+SDKs MAY use the built-in [gRPC Retry](https://github.com/grpc/proposal/blob/master/A6-client-retries.md) mechanism to facilitate exponential back-off. If the built-in gRPC mechanism is used, the following values SHOULD be available for configuration:


For retry, I guess we need to consider the following situation:

When transport error happened - the exporter didn't hear back from the receiver. (in this case the retry policy might be different from the situation where the exporter got the server response - where the server might give hint on Retry-After or Server Too Busy / Not Available).

When partial success occurred - e.g. HTTP 206. The response payload needs to include what succeeded and what failed, and the exporter will need to understand and respect that.

How to de-dupe - if there is no partial success, a simple unique sequence id (scoped to the connection or session) might be used. If there is partial success, a more sophisticated mechanism might be needed.
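The first distinction above (transport error versus a server response that may carry a hint) could be sketched roughly as follows. This is only an illustration under stated assumptions: `Decision` and `classify_attempt` are hypothetical names, not part of any SDK, and only the integer-seconds form of `Retry-After` is handled. The transient set (408 and 5xx) follows the exporter spec text quoted in the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    retry: bool
    # None means "no server hint; use the client's own backoff".
    delay_seconds: Optional[float]


def classify_attempt(status: Optional[int], retry_after: Optional[str] = None) -> Decision:
    """Decide whether one OTLP/HTTP export attempt should be retried.

    status is None when a transport error occurred (no response was received).
    """
    if status is None:
        # Transport error: the server gave no hint, so rely on client backoff.
        return Decision(retry=True, delay_seconds=None)
    transient = status == 408 or 500 <= status <= 599
    if not transient:
        return Decision(retry=False, delay_seconds=None)
    if retry_after is not None and retry_after.strip().isdigit():
        # Server hint available: honor it instead of the client's own backoff.
        return Decision(retry=True, delay_seconds=float(retry_after.strip()))
    return Decision(retry=True, delay_seconds=None)
```

Partial success and de-duplication are deliberately left out of this sketch, since the thread treats them as separate areas of work.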

In the 10/5/2021 Spec SIG, we discussed that retry behavior in partial success scenarios could be a separate area of future work.

Upon closer inspection of the OTLP spec, I believe each of these points is already documented as a known limitation.

When transport error happened - the exporter didn't hear back from the receiver. (in this case the retry policy might be different from the situation where the exporter got the server response - where the server might give hint on Retry-After or Server Too Busy / Not Available).

This falls under the Duplicate Data limitation.

When partial success occurred - e.g. HTTP 206. The response payload needs to include what succeeded and what failed, and the exporter will need to understand and respect that.

This falls under the Partial Success limitation.

How to de-dupe - if there is no partial success, a simple unique sequence id (scoped to the connection or session) might be used. If there is partial success, a more sophisticated mechanism might be needed.

This falls under the Request Acknowledgements / Partial Success limitations.

I could add language clarifying that these limitations also apply to retry attempts, but that's already implied. I think the OTLP spec's language is already sufficiently clear on what happens in retry edge cases: data duplication can occur if things like network interruptions happen, and there's no accommodation for partial success (for retry attempts or in general).

specification/protocol/exporter.md

jack-berg · 2021-10-05T20:51:07Z

Can you please point to the part of the spec that is blocking this?

@tigrannajaryan talked about it in the 10/05/2021 Spec SIG but reiterating here, there is no spec language that blocks usage of the gRPC retry mechanism. However, the language SIGs would be more comfortable implementing if the language was more explicit.

anuraaga · 2021-10-06T08:46:55Z

specification/protocol/exporter.md

+- `maxBackoff`: Must be a duration greater than 0.
+- `backoffMultiplier` Must be a number greater than 0.
+
+SDKs have unspecified default values for these properties. They are used to compute the backoff as follows:


Wasn't in the SIG where I think it was discussed, but why do we leave the defaults unspecified? Actually having agreed upon defaults to use in the implementation was one of the main points I was hoping for from the spec language.

SDKs have unspecified default values for these properties.

This sentence is unnecessary. It is implicitly assumed. If the spec does not define something then it is unspecified.

Actually having agreed upon defaults to use in the implementation was one of the main points I was hoping for from the spec language.

Why do we need it in the spec? gRPC has built-in defaults. What's wrong with those defaults?

gRPC has built-in defaults.

I can't find any evidence that gRPC has built in defaults.

This sentence is unnecessary. It is implicitly assumed. If the spec does not define something then it is unspecified.

I think there's an important distinction. If something isn't in the spec, then it could be an accidental omission that could be added in the future. Having explicit language in the spec that says "SDKs have unspecified defaults" makes it clear that it's not an omission but a conscious choice to let the SDKs choose the default values. Here's another example of this language.

If gRPC had defaults to copy from, using those could probably work well but indeed I can't find them.

I think having defaults in the spec is important because, as a user, I expect all languages to have the same retry behavior - there doesn't seem to be anything language specific about it. We've seen some env variables diverge, probably due to ordering of SDK implementation and spec, but I think the intent is for the spec to define cross-language concerns where possible to avoid divergence. Making it easier to implement an SDK by not needing to wonder as much is a great side-benefit.

I came to the same conclusion as @anuraaga here. We need this language to ensure a consistent user experience.

tigrannajaryan

I am in favour of adding this proposed sentence:

SDKs MAY use the built-in gRPC Retry mechanism to facilitate exponential back-off.

Other changes proposed in this PR I think are unnecessary.

tigrannajaryan · 2021-10-06T11:43:55Z

specification/protocol/exporter.md

+- `maxBackoff`: Must be a duration greater than 0.
+- `backoffMultiplier` Must be a number greater than 0.
+
+SDKs have unspecified default values for these properties. They are used to compute the backoff as follows:


SDKs have unspecified default values for these properties.

This sentence is unnecessary. It is implicitly assumed. If the spec does not define something then it is unspecified.

tigrannajaryan · 2021-10-06T11:48:12Z

specification/protocol/exporter.md

+- The initial retry attempt will occur after `random(0, initialBackoff)`
+- The `n`-th retry attempt will occur after `random(0, min(initialBackoff*backoffMultiplier**(n-1), maxBackoff))`


I do not think we need this in the spec. To make such a recommendation we need to back it up with research and demonstrate that a particular algorithm is for some reason more suitable for OTLP than others. I do not think it is necessary though. It can remain an implementation detail.

My reply to this comment is bundled here.

tigrannajaryan · 2021-10-06T11:56:12Z

specification/protocol/exporter.md

+- `maxAttempts`: The maximum number of attempts, including the original request. Must be an integer greater than 1 and less than 6.
+- `initialBackoff`: Must be a duration greater than 0.
+- `maxBackoff`: Must be a duration greater than 0.
+- `backoffMultiplier` Must be a number greater than 0.
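For illustration, the constraints in the hunk above could be enforced with a small value type like the one below. This is a sketch under assumptions: the `RetryPolicy` class and its snake_case field names are hypothetical, durations are represented as seconds, and no default values are supplied because the PR leaves them unspecified at this point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int          # maxAttempts, including the original request
    initial_backoff: float     # initialBackoff, in seconds (assumed unit)
    max_backoff: float         # maxBackoff, in seconds (assumed unit)
    backoff_multiplier: float  # backoffMultiplier

    def __post_init__(self) -> None:
        # Each check mirrors one bullet from the proposed spec text.
        if not isinstance(self.max_attempts, int) or not 1 < self.max_attempts < 6:
            raise ValueError("maxAttempts must be an integer greater than 1 and less than 6")
        if self.initial_backoff <= 0:
            raise ValueError("initialBackoff must be a duration greater than 0")
        if self.max_backoff <= 0:
            raise ValueError("maxBackoff must be a duration greater than 0")
        if self.backoff_multiplier <= 0:
            raise ValueError("backoffMultiplier must be a number greater than 0")
```

The values passed in by a caller, for example `RetryPolicy(5, 1.0, 5.0, 1.5)`, are placeholders chosen for the example and are not proposed defaults.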


I think we are unnecessarily exposing very fine-grained parameters. I would prefer not to overload the space of configuration parameters unless there is clear evidence it is necessary (is there?).
gRPC has defaults that SDK authors can use. If they feel strongly about using different default values they can do it.

I do think we need specification around this.

Let's say we don't add language like this, and just add simpler language that allows use of the built-in gRPC retry mechanism. SDKs may end up:

Choosing different variations of exponential backoff retry mechanisms, with different configuration parameters, different default values, and different ways of computing the delay. This is a bad user story because users have to be aware of the particulars of the SDKs they rely on to understand the retry behavior.

It's reasonable that users would want to reconfigure the default retry parameters - an app writing to a collector in its same data center is going to want different parameters than one writing to a vendor in a different region / continent. Since the configuration parameters are free to be different across SDKs, there's no uniform way to specify env var retry configuration. SDKs could use language specific environment variables, but this is a bad user story because ops has to be aware of language particulars.

Because there's no language requiring consistency between otlp grpc, http/protobuf, http/json, an SDK might choose to use the built in mechanism for grpc, and different mechanisms for http/protobuf and http/json. This is a bad user story because it allows for behavioral changes in retry based on otlp protocol. Also, it allows for an inconsistent configuration experience between different otlp protocols.

I actually think the language should go further and: 1. Specify that SDKs "MUST" compute backoff using a strategy like the built-in gRPC retry mechanism. 2. Reiterate the configuration parameters and the algorithm to calculate the delay. 3. Specify reasonable defaults.

This would make implementations unambiguous to implement for SDK authors. Behavior would be consistent across languages and protocols for users. And we could eventually add env (or other configuration) that would allow users to reconfigure the defaults in a consistent way.

In terms of research, the gRPC retry mechanism has been incubating for years, and it will be difficult for any research we do to supplant it. It uses a "full jitter" calculation, where the delay is a random value between 0 and the target delay for the attempt. This amazon post concludes that full jitter is a good implementation.
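The "full jitter" calculation described here, combined with the formula from the proposed spec text (`random(0, min(initialBackoff*backoffMultiplier**(n-1), maxBackoff))`), can be sketched as below. The function names are hypothetical and durations are assumed to be in seconds; this is an illustration of the formula, not code taken from gRPC or from any SDK.

```python
import random
from typing import Optional


def backoff_upper_bound(n: int, initial_backoff: float, max_backoff: float,
                        backoff_multiplier: float) -> float:
    """Upper bound of the delay before the n-th retry (n starts at 1)."""
    return min(initial_backoff * backoff_multiplier ** (n - 1), max_backoff)


def full_jitter_delay(n: int, initial_backoff: float, max_backoff: float,
                      backoff_multiplier: float,
                      rng: Optional[random.Random] = None) -> float:
    """Pick a delay uniformly between 0 and the bound for the n-th retry."""
    if rng is None:
        rng = random.Random()
    bound = backoff_upper_bound(n, initial_backoff, max_backoff, backoff_multiplier)
    return rng.uniform(0.0, bound)
```

With the purely illustrative values `initial_backoff=1.0`, `max_backoff=5.0`, and `backoff_multiplier=2.0`, the upper bounds for retries 1 through 4 are 1, 2, 4, and 5 seconds; the actual delay for each retry is a random value between 0 and that bound.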

tigrannajaryan · 2021-10-06T11:57:12Z

specification/protocol/exporter.md

+- `maxBackoff`: Must be a duration greater than 0.
+- `backoffMultiplier` Must be a number greater than 0.
+
+SDKs have unspecified default values for these properties. They are used to compute the backoff as follows:


Actually having agreed upon defaults to use in the implementation was one of the main points I was hoping for from the spec language.

Why do we need it in the spec? gRPC has built-in defaults. What's wrong with those defaults?

jack-berg · 2021-10-12T16:08:40Z

Ping @tigrannajaryan. Curious how we can make progress on this. I'd really like to find a path forward for having the language SDKs (especially java) have retry logic for transient errors, as is already required by the spec.

tigrannajaryan · 2021-10-12T17:36:15Z

I'd really like to find a path forward for having the language SDKs (especially java) have retry logic for transient errors, as is already required by the spec.

I think part of the proposed changes are good, the rest I am not convinced about. To make progress it may be better to split the PR into 2 parts, merge the part on which we agree quickly and continue discussing the debatable parts on a new PR.

jack-berg · 2021-10-12T18:37:23Z

How about this for a reduced scope version:

SDKs MAY use the built-in gRPC Retry mechanism to facilitate exponential back-off.
SDKs SHOULD have retry configuration and mechanics that are consistent across OTLP protocols. For example, if the built-in gRPC Retry mechanism is used for the grpc protocol, the http/protobuf and http/json protocols should expose the same configuration options and compute the backoff duration in the same manner.

It explicitly approves the gRPC mechanism, and at least provides consistency within a language SDK for how the retry mechanism works across protocols.
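One way to picture the consistency requirement is a single set of retry values that feeds both transports. The sketch below is an assumption-laden illustration: the function names are hypothetical, the single retryable status code is a placeholder rather than the full list from the OTLP spec, and the duration strings follow the `retryPolicy` shape described in the gRPC A6 proposal.

```python
import json
from typing import List


def grpc_service_config(max_attempts: int, initial_backoff: float, max_backoff: float,
                        backoff_multiplier: float,
                        service: str = "opentelemetry.proto.collector.trace.v1.TraceService") -> str:
    """Render the shared retry values as a gRPC service config JSON string."""
    retry_policy = {
        "maxAttempts": max_attempts,
        "initialBackoff": f"{initial_backoff:.3f}s",
        "maxBackoff": f"{max_backoff:.3f}s",
        "backoffMultiplier": backoff_multiplier,
        # Placeholder only; the OTLP spec defines the real retryable codes.
        "retryableStatusCodes": ["UNAVAILABLE"],
    }
    method_config = {"name": [{"service": service}], "retryPolicy": retry_policy}
    return json.dumps({"methodConfig": [method_config]})


def http_backoff_bounds(max_attempts: int, initial_backoff: float, max_backoff: float,
                        backoff_multiplier: float) -> List[float]:
    """Upper bounds an http/protobuf or http/json retry loop would use.

    max_attempts includes the original request, so there are max_attempts - 1 retries.
    """
    return [min(initial_backoff * backoff_multiplier ** (n - 1), max_backoff)
            for n in range(1, max_attempts)]
```

Because both functions take the same four values, a user who configures retry once would see the same backoff schedule regardless of which OTLP protocol the exporter uses.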

I still think it would be better to have consistency across all SDKs, in terms of configuration and defaults, but maybe this language would be enough to get some movement.

@anuraaga what do you think? With this language I imagine opentelemetry-java could add setRetryPolicy(RetryPolicy retryPolicy) methods on all the otlp exporter builders, initially without any default values. We could add experimental properties that the autoconfigure module could key off of to configure the properties like:

otel.experimental.exporter.otlp.retry.maxAttempts
otel.experimental.exporter.otlp.retry.initialBackoff
otel.experimental.exporter.otlp.retry.maxBackoff
otel.experimental.exporter.otlp.retry.backoffMultiplier
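As a rough sketch of how such properties might be read, the snippet below parses the four keys from a plain mapping. It is written in Python for brevity rather than in Java, the helper name `retry_settings` is hypothetical, and the value formats are assumptions (an integer for `maxAttempts`, plain seconds for the two durations), since the comment does not define them. Keys that are absent are simply omitted, because no defaults are specified.

```python
from typing import Callable, Dict, Mapping, Union

PREFIX = "otel.experimental.exporter.otlp.retry."

# Assumed value formats; the thread does not define them.
_PARSERS: Dict[str, Callable[[str], Union[int, float]]] = {
    "maxAttempts": int,
    "initialBackoff": float,     # seconds (assumption)
    "maxBackoff": float,         # seconds (assumption)
    "backoffMultiplier": float,
}


def retry_settings(props: Mapping[str, str]) -> Dict[str, Union[int, float]]:
    """Collect whichever experimental retry properties are present."""
    settings: Dict[str, Union[int, float]] = {}
    for name, parse in _PARSERS.items():
        raw = props.get(PREFIX + name)
        if raw is not None:
            settings[name] = parse(raw)
    return settings
```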

anuraaga · 2021-10-13T05:41:32Z

Scoping down to just those settings, and no specified defaults, seems fine for me.

Note that if we don't decide on any settings here, then to add retry in Java we would probably

Flip on gRPC retry when using that library
Implement a similar mechanism for when not using the gRPC library in our code
Decide on default parameters by finding some reference out there
Provide no configuration knobs to the user - without a spec, we can't expose this any more than as an internal implementation detail subject to change and removal at any time.

I'd say the crux of the issue is retry is underspec'd - retry is already required with a MUST, but the spec provides no guidelines on how an SDK should actually implement this. I don't think the intent of the spec is to make such general points without helping SDKs converge and become easier to implement.

tigrannajaryan · 2021-10-13T14:57:26Z

@jack-berg to be clear: I am all for consistency between SDKs and transports, especially from the end-user's configuration perspective.

I am against prescribing a specific backoff algorithm and parameters for the algorithm because we don't know where the optimum is (we can't know unless we do extensive research using a wide range of representative network configurations, data flows and failure modes).

Without that knowledge there is a danger of over-specifying. Then people may start to depend on the exact behavior and we will be stuck with it even if we later discover it is not the best approach for our use cases.

How about this for a reduced scope version:

SDKs MAY use the built-in gRPC Retry mechanism to facilitate exponential back-off.
SDKs SHOULD have retry configuration and mechanics that are consistent across OTLP protocols. For example, if the built-in gRPC Retry mechanism is used for the grpc protocol, the http/protobuf and http/json protocols should expose the same configuration options and compute the backoff duration in the same manner.

This sounds reasonable to me.

jack-berg · 2021-10-14T15:19:32Z

I'd say the crux of the issue is retry is underspec'd - retry is already required with a MUST, but the spec provides no guidelines on how an SDK should actually implement this.

Yes. I also agree with @tigrannajaryan that it's not good to prescribe parameters when "we don't know where the optimum is". The way I see it, we're in a catch-22.

I hope we can gather the data needed to prescribe parameters by scoping down the language so that SDKs can offer opt-in retry. Users could then experiment with different configurations, which would eventually let us arrive at a consensus.

Decide on default parameters by finding some reference out there

Provide no configuration knobs to the user - without a spec, we can't expose this any more than as an internal implementation detail subject to change and removal at any time.

@anuraaga I've searched and I can't find a reference. I think that once this language is in and we have an explicit green light for the built-in gRPC mechanism, it's reasonable to offer a method on the OTLP exporters to configure a policy with the parameters defined in the gRPC retry spec. The only thing missing from the spec at that point would be a set of retry environment variables and their default values, but we can still open up retry to users that programmatically configure the SDK. And perhaps we could also justify some experimental environment variables to make it easier to collect the data needed to make a decision.

jack-berg · 2021-10-19T18:21:13Z

@tigrannajaryan / @anuraaga thoughts on moving forward with the reduced scope?

anuraaga · 2021-10-20T00:55:58Z

specification/protocol/exporter.md

+
+SDKs MAY use the built-in [gRPC Retry](https://github.com/grpc/proposal/blob/master/A6-client-retries.md) mechanism to facilitate exponential back-off.
+
+SDKs SHOULD have retry configuration and mechanics that are consistent across OTLP protocols. For example, if the built-in gRPC Retry mechanism is used for the `grpc` protocol, the `http/protobuf` and `http/json` protocols should expose the same configuration options and compute the backoff duration in the same manner.


I think I may not have articulated my point correctly - I don't think we should have this language without specifying at least what knobs are present. Different SDKs having different knobs is not the intent of the OTel spec I think. If we can't come up with any standard for at least the retry configuration, then I think the spec can only suggest "no knobs". SDKs could still add knobs in a way that makes it obvious they can be removed at any time, perhaps with an experimental prefix on the env variables or in an experimental package, but any official knob would be non-compliant with the spec, which is good for ensuring that languages don't diverge in their knobs.

@tigrannajaryan Is there a mechanism to add this paragraph in an experimental fashion to the spec itself?

I think we should have this language as well as corresponding knobs in the spec. Removing the knobs was a compromise to try to get some movement.

I accepted the compromise of removing the knobs because by explicitly allowing the built-in gRPC Retry mechanism, and saying that SDKs SHOULD have retry configuration ..., the knobs defined in gRPC are the implied means of configuration.

tigrannajaryan · 2021-10-20T21:45:21Z

We have 3 different opinions about how to proceed with this change. I believe this means the change is not ready to be accepted in any form. We need a significantly better level of consensus to make changes to the spec. I think we should keep this unspecified until there is new evidence and a better understanding of what the right approach is.

Is there a mechanism to add this paragraph in an experimental fashion to the spec itself?

I do not think it is necessary. SDKs are free to experiment with the behavior as long as it does not contradict the spec. The outcome of the experiments can feed back into the spec. As long as you clearly mark the SDK features as experimental in the SDK itself you are free to iterate, change, and remove them, and if you find the "best" way to handle retries you can then legalize it in the spec and force all other SDKs to follow it.

github-actions · 2021-11-03T03:17:19Z

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions · 2021-11-10T03:17:24Z

Closed as inactive. Feel free to reopen if this PR is still being worked on.

Add specification for otlp retry configuration parameters and defaults

2042bc2

jack-berg requested review from a team September 27, 2021 23:24

github-actions bot assigned tigrannajaryan Sep 27, 2021

arminru added the area:configuration, area:sdk, and spec:protocol labels Sep 30, 2021

reyang reviewed Oct 5, 2021

Remove default values for retry properties.

c1eb940

anuraaga reviewed Oct 6, 2021

tigrannajaryan reviewed Oct 6, 2021

Reduce scope

ec4a4e9

anuraaga reviewed Oct 20, 2021

jack-berg mentioned this pull request Oct 26, 2021

Autoconfigure experimental OTLP retry open-telemetry/opentelemetry-java#3791

Merged

github-actions bot added the Stale label Nov 3, 2021

github-actions bot closed this Nov 10, 2021

alanwest mentioned this pull request Jan 20, 2022

Collector exporter retries and backoff open-telemetry/opentelemetry-js#1233

Closed
