Exporterhelper does not respect context deadlines #11183

Open
jmacd opened this issue Sep 16, 2024 · 0 comments
Labels
area:exporter · bug (Something isn't working) · priority:p2 (Medium)

Comments

@jmacd
Contributor

jmacd commented Sep 16, 2024

Describe the bug

In a gateway configuration of the collector, it is likely for an OTLP receiver and exporter to be configured. Any gRPC exporter/receiver pair that both propagate timeout information are likely to fall into this problem scenario. (For example, the OTel-Arrow components will behave the same after open-telemetry/opentelemetry-collector-contrib#34742 merges.)

In a synchronous pipeline, this means the caller's context is likely to carry a deadline by the time it reaches an exporter, typically set by the pipeline's receiver or propagated from an earlier pipeline segment's exporter.

Note that the timeout sender does recognize the incoming deadline, but not intentionally. The Go context package never extends a deadline; it only shortens one:

> WithDeadline returns a copy of the parent context with the deadline adjusted to be no later than d. If the parent's deadline is already earlier than d, WithDeadline(parent, d) is semantically equivalent to parent.
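To make the accidental behavior concrete, here is a minimal, self-contained Go sketch (the 5s/30s values are illustrative, not taken from any real configuration) showing that layering the configured timeout on top of a shorter incoming deadline leaves the shorter deadline in effect:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// The caller (e.g. an OTLP receiver) arrives with a 5s deadline.
	parent, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The exporter's timeout sender layers its configured 30s timeout on top.
	child, cancelChild := context.WithTimeout(parent, 30*time.Second)
	defer cancelChild()

	// Prints roughly 5s, not 30s: the context package never extends a deadline.
	deadline, _ := child.Deadline()
	fmt.Println("effective timeout:", time.Until(deadline).Round(time.Second))
}
```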

The result is that exporterhelper will pass through a timeout shorter than its configured timeout. This may be working as intended for some, and a bug for others. When I set a 30s timeout on the exporter and my receiver observes a 5s timeout instead, it's likely to cause confusion.

The behavior of the retry sender is even more problematic. The retry sender is likely to create the problem described above because it ignores the arriving context timeout. When the retry sender's max_elapsed_time setting is greater than the arriving context timeout, we are virtually assured of issuing one short-timeout request at the end of a series of retries.

Say you've configured your exporterhelper like this:

```yaml
retry_on_failure:
  enabled: true
timeout: 30s
```

If the incoming context deadline is shorter than the configured max_elapsed_time, retry sending will continue past the original deadline. Since the arriving context deadline is preserved through all of this, the final export request is likely to have a timeout far shorter than 30s. This leads to confusion.
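Concretely, with illustrative numbers: if the arriving context carries a 10s deadline and the retry sender backs off for 1s, 2s, and 4s between attempts, the attempt that begins around t≈7s has only ~3s of deadline remaining, so the downstream receiver observes a ~3s timeout even though the exporter is configured with timeout: 30s.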

Steps to reproduce
To reproduce this, I created a client application that sends as fast as it can through a collector pipeline with both the timeout and retry senders configured.

What did you expect to see?
I was expecting to see what I saw, because I've seen this in production and wanted to reproduce it.

What did you wish to see instead?
I have several wishes here.

First, the retry sender should consider the context deadline. When backoffDelay is computed, it should be compared against the context deadline (if set). If the remaining deadline is less than the backoff delay, we should not fall into the select statement and wait for cancellation; we should fail fast at that point with a gRPC Aborted code and the message "insufficient context deadline for retry".
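A minimal sketch of this check, using a hypothetical waitForRetry helper rather than the actual exporterhelper code:

```go
package retrysketch

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// waitForRetry is a hypothetical helper: it refuses to wait out a backoff
// delay that the caller's remaining deadline cannot accommodate.
func waitForRetry(ctx context.Context, backoffDelay time.Duration) error {
	if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) < backoffDelay {
		// Fail fast: the deadline will expire before the next attempt could start.
		return status.Error(codes.Aborted, "insufficient context deadline for retry")
	}
	select {
	case <-time.After(backoffDelay):
		return nil // proceed with the next attempt
	case <-ctx.Done():
		return ctx.Err()
	}
}
```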

Second, the exporter timeout sender would ideally have a way to impose a minimum timeout. When the timeout sender is configured with a 5s timeout and the arriving context has a 1s deadline, I think I'd like two optional behaviors (see the sketch after this list):

  • Ignore the arriving timeout and let the configured timeout override it. With this option, I expect the caller's export to end with deadline-exceeded while the pipeline continues to export the data with an independent deadline.
  • Fail fast. When the arriving context timeout is less than the configured timeout, fail immediately with an Aborted code.
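A sketch of those two behaviors; the override/failFast knobs are hypothetical and do not correspond to existing exporterhelper configuration:

```go
package timeoutsketch

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type timeoutSender struct {
	timeout  time.Duration // configured exporter timeout, e.g. 5s
	override bool          // option 1: ignore a shorter arriving deadline
	failFast bool          // option 2: abort when the arriving deadline is shorter
}

func (s *timeoutSender) exportContext(ctx context.Context) (context.Context, context.CancelFunc, error) {
	if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) < s.timeout {
		switch {
		case s.override:
			// Detach from the caller's deadline (keeps values, drops cancellation);
			// the caller may see deadline-exceeded while the export proceeds.
			ctx = context.WithoutCancel(ctx) // Go 1.21+
		case s.failFast:
			return nil, nil, status.Error(codes.Aborted,
				"arriving context deadline is shorter than the configured timeout")
		}
	}
	newCtx, cancel := context.WithTimeout(ctx, s.timeout)
	return newCtx, cancel, nil
}
```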

Third, I would like the OTLP receiver to support the same feature: a configuration that allows the receiver to ignore arriving timeouts when they are unreasonably short. This will have to be done on a per-receiver basis, but a receiverhelper library could possibly help receivers behave uniformly.
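A sketch of what such a receiverhelper utility might look like (the withMinimumTimeout helper is hypothetical):

```go
package receiversketch

import (
	"context"
	"time"
)

// withMinimumTimeout is a hypothetical helper: if the arriving deadline is
// shorter than minTimeout, detach from it and impose minTimeout instead.
func withMinimumTimeout(ctx context.Context, minTimeout time.Duration) (context.Context, context.CancelFunc) {
	if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) < minTimeout {
		// context.WithoutCancel (Go 1.21+) keeps request values such as client
		// metadata but drops the caller's short deadline.
		return context.WithTimeout(context.WithoutCancel(ctx), minTimeout)
	}
	return ctx, func() {}
}
```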

What version did you use?
v0.108.x

What config did you use?
As shown above.

Environment

Additional context

@jmacd jmacd added the bug Something isn't working label Sep 16, 2024
bogdandrutu pushed a commit that referenced this issue Oct 5, 2024
… retry (#11331)

#### Description
The retry sender will delay until the context is canceled, where instead
it could fail fast with a transient error and a clear message that no
more retries are possible given the configuration.

#### Link to tracking issue
Part of #11183 

#### Testing
One new test.
jackgopack4 pushed a commit to jackgopack4/opentelemetry-collector that referenced this issue Oct 8, 2024
… retry (open-telemetry#11331)
