
OpenTelemetry::Exporter::OTLP::Exporter should re-establish an HTTP connection on error #1658

Open
SophieDeBenedetto opened this issue Jul 30, 2024 · 6 comments

@SophieDeBenedetto

The Problem

The OpenTelemetry::Exporter::OTLP::Exporter#send_bytes method establishes a persistent HTTP connection and reuses that connection when it receives certain error statuses back from the server and retries the export request.

At GitHub, we observed that this can cause a pile-on effect on certain backend nodes (in our case, an OTel collector backend). A node that receives "bad" requests, or that returns errors to the client for other reasons, continues to receive all of the retries for a given request because the client reuses the same persistent HTTP connection. The collector node then comes under increased pressure, and where it was already under memory or CPU pressure, this exacerbates the situation.

So, we introduced a monkey patch to the OTLP::Exporter to force it to create a new HTTP connection in the event of an error response. As a result, we saw a marked decrease in client exporter failure rates and OTel collector span refusal and drop rates, and we saw improvements in the distribution of memory usage across our fleet of OTel collector pods.

The Proposal

The OTLP::Exporter should close the current HTTP connection and open a new one when #send_bytes gets an error response back from the backend.

Implementation Suggestion

Our monkey patch looks like this:

def backoff?(retry_count:, reason:, retry_after: nil)
  # Close the persistent connection so the retry opens a fresh one,
  # which may land on a different backend node.
  @http.finish if @http.started?
  super
end

The #backoff? method is called before any call to #redo to retry the request in #send_bytes. So, the #backoff? method would be an appropriate place to close the HTTP connection. Then, the code already present in #send_bytes will start a fresh connection when #redo is called.
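To make the mechanics concrete, here is a minimal, self-contained sketch of wiring such a patch in via Module#prepend rather than reopening the class. The FakeExporter class and ForceReconnectOnBackoff module are stand-ins invented for illustration; in practice the module would be prepended onto OpenTelemetry::Exporter::OTLP::Exporter, whose real #backoff? also sleeps between attempts.

```ruby
require "net/http"

# Stand-in for the real exporter: just enough surface to show the patch.
class FakeExporter
  attr_reader :http

  def initialize
    # Hypothetical collector endpoint; Net::HTTP.new does not connect yet.
    @http = Net::HTTP.new("collector.example.com", 4318)
  end

  def backoff?(retry_count:, reason:, retry_after: nil)
    retry_count < 5 # the real method decides whether to retry and sleeps
  end
end

# The patch: close the persistent connection before the caller issues
# `redo`, so the retry in #send_bytes has to open a fresh connection.
module ForceReconnectOnBackoff
  def backoff?(retry_count:, reason:, retry_after: nil)
    @http.finish if @http.started?
    super
  end
end

FakeExporter.prepend(ForceReconnectOnBackoff)
```

Because the module is prepended, its #backoff? runs first, closes any open connection, and then delegates to the original method via super, leaving the retry decision itself unchanged.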

@xuan-cao-swi
Contributor

I think it's a good idea to close the HTTP connection on every retry (even outside a multi-node environment).

I'm a little concerned about the TCP connection overhead (though it may be negligible), so I'll bring this up at the next SIG (everyone is welcome to join).

@SophieDeBenedetto
Author

> I think it's a good idea to close the HTTP connection on every retry (even outside a multi-node environment).
>
> I'm a little concerned about the TCP connection overhead (though it may be negligible), so I'll bring this up at the next SIG (everyone is welcome to join).

Thanks @xuan-cao-swi! I might not be able to make the next SIG, but I'll try to arrange for one of my teammates to attend if I can't make it, since we're interested in helping move this forward.

@kaylareopelle
Contributor

Hi @SophieDeBenedetto! Thanks for your patience here! We discussed this during the SIG today.

We're still trying to figure out the best way to respond to this problem, but wanted to give you an update.

We’re open to adding an option to enable this functionality, but haven’t found an example of another OTLP exporter following this pattern. When it seems like we're adding a feature for the first time across OTel, the topic generally warrants a larger conversation across the SIGs.

It seems like open-telemetry/opentelemetry-specification#4148 relates to this problem and may provide a spec'd path to implement a configuration option to toggle this functionality. There may also be nuances about what types of errors should relate to a retry and which should force a reconnection.

We're going to try to start/resume this cross-SIG conversation and will get back to you when we know more/have a place for the discussion to point you to.


👋 This issue has been marked as stale because it has been open with no activity. You can comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing, this issue will eventually be closed by the stale bot.

@kaylareopelle
Contributor

Oh no, this one fell off our radar! I'm sorry about the delay. I'll try to get an answer in the next SIG.

@xuan-cao-swi
Contributor

Sorry for the delayed response.

I think we could add a configurable option to close connections on every retry. Since this wouldn't be spec-compliant, it would be best to ask about this behavior change in the spec repository on GitHub for confirmation.

Regarding spec compliance:
After reviewing the spec, it seems that closing existing connections does not comply with the current specifications. The spec acknowledges the potential to overwhelm the server with failed requests and indicates that the retry interval should use an exponential backoff strategy.
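For illustration, the exponential backoff strategy the spec describes can be sketched as below. The method name and constants are made up for this example, not taken from the spec or from the gem's actual implementation.

```ruby
# Jittered exponential backoff: the retry window doubles with each
# attempt (capped), and the actual delay is drawn uniformly from it so
# that many clients retrying at once do not synchronize their requests.
def retry_delay(retry_count, base: 0.1, cap: 30.0)
  window = [base * (2**retry_count), cap].min
  rand * window # full jitter: uniform in [0, window)
end
```

The jitter is what spreads retries out over time for a single client; the question in this issue is the orthogonal one of spreading retries across backend nodes.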

I checked the Python OTLP exporter as a reference, and it also does not close connections between retries (I believe the JavaScript exporter behaves similarly, but please correct me if I'm wrong).

Would a queuing system (as suggested here) mitigate this issue if one destination becomes overwhelmed?
