-
Notifications
You must be signed in to change notification settings - Fork 894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add specification for otlp retry configuration parameters and defaults #1974
Conversation
Can you please point to the part of the spec that is blocking this? |
specification/protocol/exporter.md
Outdated
For OTLP/HTTP, the errors `408 (Request Timeout)` and `5xx (Server Errors)` are defined as transient, detailed information about erros can be found in the [HTTP failures section](otlp.md#failures). For the OTLP/gRPC, the full list of the gRPC retryable status codes can be found in the [gRPC response section](otlp.md#otlpgrpc-response). | ||
For OTLP/HTTP, the errors `408 (Request Timeout)` and `5xx (Server Errors)` are defined as transient, detailed information about errors can be found in the [HTTP failures section](otlp.md#failures). For the OTLP/gRPC, the full list of the gRPC retryable status codes can be found in the [gRPC response section](otlp.md#otlpgrpc-response). | ||
|
||
SDKs MAY use the built-in [gRPC Retry](https://github.com/grpc/proposal/blob/master/A6-client-retries.md) mechanism to facilitate exponential back-off. If the built-in gRPC mechanism is used, the following values SHOULD be available for configuration: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For retry, I guess we need to consider the following situation:
- When transport error happened - the exporter didn't hear back from the receiver. (in this case the retry policy might be different from the situation where the exporter got the server response - where the server might give hint on Retry-After or Server Too Busy / Not Available).
- When partial success occurred - e.g. HTTP 206. The response payload needs to include what succeeded and what failed, and the exporter will need to understand and respect that.
- How to de-dupe - if there is no partial success, a simple unique sequence id (scoped to the connection or session) might be used. If there is partial success, a more sophisticated mechanism might be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the 10/5/2021 Spec SIG, we discussed that retry behavior in partial success scenarios could be a separate area of future work.
Upon closer inspection, of the otlp spec, I believe each of these points are already documented as known limitations.
- When transport error happened - the exporter didn't hear back from the receiver. (in this case the retry policy might be different from the situation where the exporter got the server response - where the server might give hint on Retry-After or Server Too Busy / Not Available).
This falls under the Duplicate Data limitation.
- When partial success occurred - e.g. HTTP 206. The response payload needs to include what succeeded and what failed, and the exporter will need to understand and respect that.
This falls under the Partial Success limitation.
- How to de-dupe - if there is no partial success, a simple unique sequence id (scoped to the connection or session) might be used. If there is partial success, a more sophisticated mechanism might be needed.
This falls under the Request Acknowledgements / Partial Success limitations.
I could add language clarifying that these limitations also apply to retry attempts, but that's already implied. I think the OTLP spec's language is already sufficiently clear on what happens in retry edge cases: data duplication can occur if things like network interruptions happen, and there's no accommodation for partial success (for retry attempts or in general).
@tigrannajaryan talked about it in the 10/05/2021 Spec SIG but reiterating here, there is no spec language that blocks usage of the gRPC retry mechanism. However, the language SIGs would be more comfortable implementing if the language was more explicit. |
specification/protocol/exporter.md
Outdated
- `maxBackoff`: Must be a duration greater than 0. | ||
- `backoffMultiplier` Must be a number greater than 0. | ||
|
||
SDKs have unspecified default values for these properties. They are used to compute the backoff as follows: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't in the SIG where I think it was discussed, but why do we leave the defaults unspecified? Actually having agreed upon defaults to use in the implementation was one of the main points I was hoping for from the spec language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SDKs have unspecified default values for these properties.
This sentence is unnecessary. It is implicitly assumed. If the spec does not define something then it is unspecified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually having agreed upon defaults to use in the implementation was one of the main points I was hoping for from the spec language.
Why do we need it in the spec? gRPC has built-in defaults. What's wrong with those defaults?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gRPC has built-in defaults.
I can't find any evidence that gRPC has built in defaults.
This sentence is unnecessary. It is implicitly assumed. If the spec does not define something then it is unspecified.
I think there's an important distinction. If something isn't in the spec, then it could be an accidental omission that could be added in the future. Having explicit language in the spec that says "SDKs have unspecified defaults" makes it clear that its not an omission but a conscious choice to let the SDKs choose the default values. Here's another example of this language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If gRPC had defaults to copy from, using those could probably work well but indeed I can't find them.
I think having defaults in the spec is important because I expect all languages to have the same retry behavior as a user - there doesn't seem to be anyone language specific about it. We've seen some env variables diverge, probably due to ordering of SDK implementation and spec, but I think the intent is for it to define cross-language concerns where possible to avoid divergence. Making it easier to implement an SDK by not needing to wonder as much is a great side-benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am in favour of adding this proposed sentence:
SDKs MAY use the built-in gRPC Retry mechanism to facilitate exponential back-off.
Other changes proposed in this PR I think are unnecessary.
specification/protocol/exporter.md
Outdated
- `maxBackoff`: Must be a duration greater than 0. | ||
- `backoffMultiplier` Must be a number greater than 0. | ||
|
||
SDKs have unspecified default values for these properties. They are used to compute the backoff as follows: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SDKs have unspecified default values for these properties.
This sentence is unnecessary. It is implicitly assumed. If the spec does not define something then it is unspecified.
specification/protocol/exporter.md
Outdated
- The initial retry attempt will occur after `random(0, initialBackoff)` | ||
- The `n`-th retry attempt will occur after `random(0, min(initialBackoff*backoffMultiplier**(n-1), maxBackoff))` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we need this in the spec. To make such a recommendation we need to back it off by a research and demonstrate that a particular algorithm for some reason is more suitable for OTLP than others. I do not think it is necessary though. It can remain an implementation detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reply to this comment is bundled here.
specification/protocol/exporter.md
Outdated
- `maxAttempts`: The maximum number of attempts, including the original request. Must be an integer greater than 1 and less than 6. | ||
- `initialBackoff`: Must be a duration greater than 0. | ||
- `maxBackoff`: Must be a duration greater than 0. | ||
- `backoffMultiplier` Must be a number greater than 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are unnecessarily exposing very fine grained parameters. I would prefer not to overload the space of configuration parameters unless there is a clear evidence it is necessary (is there?).
gRPC has defaults that SDK authors can use. If they feel strong about using different default values they can do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think we need specification around this.
Let's say we don't add language like this, and just add simpler language that allows use of the built-in gRPC retry mechanism. SDKs may end up:
- Choosing different variations of exponential backoff retry mechanisms, with different configuration parameters, different default values, and different ways of computing the delay. This is a bad user story because users have to be aware of the particulars of the SDKs they rely on to understand the retry behavior.
- Its reasonable that users would want to reconfigure the default retry parameters - an app writing to a collector in its same data center is going to want different parameters than one writing to a vendor in a different region / continent. Since the configuration parameters are free to be different across SDKs, there's no uniform way to specify env var retry configuration. SDKs could use language specific environment variables, but this is a bad user story because ops has to be aware of language particulars.
- Because there's no language requiring consistency between otlp
grpc
,http/protobuf
,http/json
, an SDK might choose to use the built in mechanism forgrpc
, and different mechanisms forhttp/protobuf
andhttp/json
. This is a bad user story because it allows for behavioral changes in retry based on otlp protocol. Also, it allows for an inconsistent configuration experience between different otlp protocols.
I actually think the language should go further and: 1. Specify that SDKs "MUST" compute backoff using a strategy like the built-in gRPC retry mechanism. 2. Reiterate the configuration parameters and the algorithm to calculate the delay. 3. Specify reasonable defaults.
This would make implementations unambiguous to implement for SDK authors. Behavior would be consistent across languages and protocols for users. And we could eventually add env (or other configuration) that would allow users to reconfigure the defaults in a consistent way.
In terms of research, the gRPC retry mechanism has been incubating for years, and it will be difficult for any research we do to supplant it. It uses a "full jitter" calculation, where the delay is a random value between 0 and the target delay for the attempt. This amazon post concludes that full jitter is a good implementation.
specification/protocol/exporter.md
Outdated
- `maxBackoff`: Must be a duration greater than 0. | ||
- `backoffMultiplier` Must be a number greater than 0. | ||
|
||
SDKs have unspecified default values for these properties. They are used to compute the backoff as follows: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually having agreed upon defaults to use in the implementation was one of the main points I was hoping for from the spec language.
Why do we need it in the spec? gRPC has built-in defaults. What's wrong with those defaults?
Ping @tigrannajaryan. Curious how we can make progress on this. I'd really like to find a path forward for having the language SDKs (especially java) have retry logic for transient errors, as is already required by the spec. |
I think part of the proposed changes are good, the rest I am not convinced about. To make progress it may be better to split the PR into 2 parts, merge the part on which we agree quickly and continue discussing the debatable parts on a new PR. |
How about this for a reduced scope version:
It explicitly approves the gRPC mechanism, and at least provides consistency within a language SDK for how the retry mechanism works across protocols. I still think it would be better to have consistency across all SDKs, in terms of configuration and defaults, but maybe this language would be enough to get some movement. @anuraaga what do you think? With this language I imagine
|
Scoping down to just those settings, and no specified defaults, seems fine for me. Note that if we don't decide on any settings here, then to add retry in Java we would probably
I'd say the crux of the issue is retry is underspec'd - retry is already required with a |
@jack-berg to be clear: I am all for consistency between SDKs and transports, especially from the end-user's configuration perspective. I am against prescribing a specific backoff algorithm and parameters for the algorithm because we don't know where the optimum is (we can't know unless we do an extensive research using a wide range of representative network configurations, data flows and failure modes). Without that knowledge there is a danger of over-specifying. Then people may start to depend on the exact behavior and we will be stuck with it even if we later discover it is not the best approach for our use cases.
This sounds reasonable to me. |
Yes. Also agree with @tigrannajaryan that its not good to prescribe a parameters when "we don't know where the optimum is". The way I see it, we're in a catch 22. I hope that we can get the data needed to prescribe parameters by scoping down the language such that SDKs can offer opt-in retry such that users can begin to experiment with different configurations and eventually allow us to arrive at a consensus.
@anuraaga I've searched and I can't find a reference. I think that after this language is in and we have an explicit greenlight for the built in gRPC mechanism, its reasonable to offer a method on the OTLP exporters to configure a policy with parameters defined in the gRPC retry spec. The only thing missing from the spec at that point would be a set of retry environment variables and their default values, but we can still open up retry to users that programmatically configure the SDK. And perhaps we could also justify some experimental environment variables to make it easier to collect enough data needed to make a decision. |
@tigrannajaryan / @anuraaga thoughts on moving forward with the reduced scope? |
|
||
SDKs MAY use the built-in [gRPC Retry](https://github.com/grpc/proposal/blob/master/A6-client-retries.md) mechanism to facilitate exponential back-off. | ||
|
||
SDKs SHOULD have retry configuration and mechanics that are consistent across OTLP protocols. For example, if the built-in gRPC Retry mechanism is used for the `grpc` protocol, the `http/protobuf` and `http/json` protocols should expose the same configuration options and compute the backoff duration in the same manner. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I may not have articulated my point correctly - I don't think we should have this language without specifying at least what knobs are present. Different SDKs having different knobs is not the intent of the OTel spec I think. If we can't come up with any standard for at least the retry configuration, than I think the spec can only suggest "no knobs". SDKs could still add knobs in a way that is obvious they can be removed at any time, perhaps with an experimental prefix to the env variables or in an experimental package, but any official knob would be spec-incompliant, which is good to ensure that languages don't diverge in their knobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tigrannajaryan Is there a mechanism to add this paragraph in an experimental fashion to the spec itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have this language as well as corresponding knobs in the spec. Removing the knobs was a compromise to try to get some movement.
I accepted the compromise of removing the knobs because by explicitly allowing the built in gRPC Retry mechanism, and saying that SDKS SHOULD have retry configuration ...
, the knobs defined in gRPC are the implied means of configuration.
We have 3 different opinions about how we go with this change. I believe it means that this change is not ready to be accepted in any form. We need significantly better level of consensus to make changes to the spec. I think we should keep this unspecified until there is new evidence and better understanding on what's the right approach.
I do not think it is necessary. SDKs are free to experiment with the behavior as long as it does not contradict with the spec. The outcome of the experiments can feed back to the spec. As long as you clearly mark the SDK features as experimental in the SDK itself you are free to iterate, change, remove and if you find the "best" way to handle retries you can then legalize it in the spec and force all other SDKs to follow it. |
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
Fixes #1742.
Changes
Expand the OTLP exporter retry language to include language for using the built-in gRPC retry mechanism. This allows SDKs to leverage the mechanism as a standard without precluding the use of other retry mechanisms (such as the one used in the collector).
It also establishes declares the configurable parameters if the built-in gRPC retry mechanism is used, with default values.
Related issues #