Exporterhelper does not respect context deadlines #11183

Open
jmacd opened this issue Sep 16, 2024 · 0 comments
Labels
area:exporter · bug (Something isn't working) · priority:p2 (Medium)

Comments

@jmacd
Contributor

jmacd commented Sep 16, 2024

Describe the bug

In a gateway configuration of the collector, it is likely for an OTLP receiver and exporter to be configured. Any gRPC exporter/receiver pair that both propagate timeout information are likely to fall into this problem scenario. (For example, the OTel-Arrow components will behave the same after open-telemetry/opentelemetry-collector-contrib#34742 merges.)

In a synchronous pipeline, this means the caller's context is likely to carry a deadline by the time it reaches an exporter, typically set by the pipeline's receiver or propagated from an earlier pipeline segment's exporter.

Note that the timeout sender does recognize the incoming deadline, but not intentionally. The Go context package never extends a deadline; it only shortens one:

> WithDeadline returns a copy of the parent context with the deadline adjusted to be no later than d. If the parent's deadline is already earlier than d, WithDeadline(parent, d) is semantically equivalent to parent.
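To make the accidental behavior concrete, here is a minimal, self-contained Go sketch (the 5s/30s values are illustrative, not taken from any real configuration) showing that layering the configured timeout on top of a shorter incoming deadline leaves the shorter deadline in effect:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// The caller (e.g. an OTLP receiver) arrives with a 5s deadline.
	parent, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The exporter's timeout sender layers its configured 30s timeout on top.
	child, cancelChild := context.WithTimeout(parent, 30*time.Second)
	defer cancelChild()

	// Prints roughly 5s, not 30s: the context package never extends a deadline.
	deadline, _ := child.Deadline()
	fmt.Println("effective timeout:", time.Until(deadline).Round(time.Second))
}
```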

The result is that exporterhelper will pass through a timeout shorter than its configured timeout. This may be working as intended for some, and a bug for others. When I set a 30s timeout on the exporter and my receiver observes a 5s timeout instead, it's likely to cause confusion.

The behavior of the retry sender is even more problematic. The retry sender is likely to create the problem described above because it ignores the arriving context timeout. When the retry sender's max_elapsed_time setting is greater than the arriving context timeout, we are virtually assured of issuing one short-timeout request at the end of a series of retries.

Say you've configured your exporterhelper like this:

```yaml
retry_on_failure:
  enabled: true
timeout: 30s
```

If the incoming context deadline is shorter than the configured max_elapsed_time, retry sending will continue past the original deadline. Since the arriving context deadline is preserved through all of this, the final export request is likely to have a timeout far shorter than 30s. This leads to confusion.
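Concretely, with illustrative numbers: if the arriving context carries a 10s deadline and the retry sender backs off for 1s, 2s, and 4s between attempts, the attempt that begins around t≈7s has only ~3s of deadline remaining, so the downstream receiver observes a ~3s timeout even though the exporter is configured with timeout: 30s.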

Steps to reproduce
To reproduce this, I created a client application that sends as fast as it can through a collector pipeline with both the timeout and retry senders configured.

What did you expect to see?
I was expecting to see what I saw, because I've seen this in production and wanted to reproduce it.

What did you wish to see instead?
I have several wishes here.

First, the retry sender should consider the context deadline. When backoffDelay is computed, it should be compared against the context deadline (if set). If the remaining deadline is less than the backoff delay, we should not fall into the select statement and wait for cancellation; we should fail fast at that point with a gRPC Aborted code and the message "insufficient context deadline for retry".
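A minimal sketch of this check, using a hypothetical waitForRetry helper rather than the actual exporterhelper code:

```go
package retrysketch

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// waitForRetry is a hypothetical helper: it refuses to wait out a backoff
// delay that the caller's remaining deadline cannot accommodate.
func waitForRetry(ctx context.Context, backoffDelay time.Duration) error {
	if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) < backoffDelay {
		// Fail fast: the deadline will expire before the next attempt could start.
		return status.Error(codes.Aborted, "insufficient context deadline for retry")
	}
	select {
	case <-time.After(backoffDelay):
		return nil // proceed with the next attempt
	case <-ctx.Done():
		return ctx.Err()
	}
}
```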

Second, the exporter timeout sender would ideally have a way to impose a minimum timeout. When the timeout sender is configured with a 5s timeout and the arriving context has a 1s deadline, I think I'd like two optional behaviors (see the sketch after this list):

  • Ignore the arriving timeout and let the configured timeout override it. With this option, I expect the caller's export to end with deadline-exceeded while the pipeline continues to export the data with an independent deadline.
  • Fail fast. When the arriving context timeout is less than the configured timeout, fail immediately with an Aborted code.
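A sketch of those two behaviors; the override/failFast knobs are hypothetical and do not correspond to existing exporterhelper configuration:

```go
package timeoutsketch

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type timeoutSender struct {
	timeout  time.Duration // configured exporter timeout, e.g. 5s
	override bool          // option 1: ignore a shorter arriving deadline
	failFast bool          // option 2: abort when the arriving deadline is shorter
}

func (s *timeoutSender) exportContext(ctx context.Context) (context.Context, context.CancelFunc, error) {
	if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) < s.timeout {
		switch {
		case s.override:
			// Detach from the caller's deadline (keeps values, drops cancellation);
			// the caller may see deadline-exceeded while the export proceeds.
			ctx = context.WithoutCancel(ctx) // Go 1.21+
		case s.failFast:
			return nil, nil, status.Error(codes.Aborted,
				"arriving context deadline is shorter than the configured timeout")
		}
	}
	newCtx, cancel := context.WithTimeout(ctx, s.timeout)
	return newCtx, cancel, nil
}
```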

Third, I would like the OTLP receiver to support the same feature: a configuration that allows the receiver to ignore arriving timeouts when they are unreasonably short. This will have to be done on a per-receiver basis, but a receiverhelper library could possibly help receivers behave uniformly.
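A sketch of what such a receiverhelper utility might look like (the withMinimumTimeout helper is hypothetical):

```go
package receiversketch

import (
	"context"
	"time"
)

// withMinimumTimeout is a hypothetical helper: if the arriving deadline is
// shorter than minTimeout, detach from it and impose minTimeout instead.
func withMinimumTimeout(ctx context.Context, minTimeout time.Duration) (context.Context, context.CancelFunc) {
	if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) < minTimeout {
		// context.WithoutCancel (Go 1.21+) keeps request values such as client
		// metadata but drops the caller's short deadline.
		return context.WithTimeout(context.WithoutCancel(ctx), minTimeout)
	}
	return ctx, func() {}
}
```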

What version did you use?
v0.108.x

What config did you use?
As shown above.

Environment

Additional context

@jmacd jmacd added the bug Something isn't working label Sep 16, 2024
bogdandrutu pushed a commit that referenced this issue Oct 5, 2024
… retry (#11331)

#### Description
The retry sender will delay until the context is canceled, where instead
it could fail fast with a transient error and a clear message that no
more retries are possible given the configuration.

#### Link to tracking issue
Part of #11183 

#### Testing
One new test.
jackgopack4 pushed a commit to jackgopack4/opentelemetry-collector that referenced this issue Oct 8, 2024
… retry (open-telemetry#11331)
