OpenTelemetry::Exporter::OTLP::Exporter should re-establish an HTTP connection on error #1658

SophieDeBenedetto · 2024-07-30T15:47:39Z

The Problem

The OpenTelemetry::Exporter::OTLP::Exporter#send_bytes method establishes a persistent HTTP connection and re-uses that connection when receiving certain error statuses back from the server and retrying the export request.

At GitHub, we observed that this can cause a pile-on effect impacting certain backend nodes (in our case, we have an OTel collector backend). Nodes that received "bad" requests or that are returning errors to the client for other reasons then continue to receive all the retries from a given request since the client is re-using the same persistent HTTP connection. The collector node would then be under increased pressure, and where the collector node was already under memory or CPU pressure, this would exacerbate the situation.

So, we introduced a monkey patch to the OTLP::Exporter to force it to create a new HTTP connection in the event of an error response. As a result, we saw a marked decrease in client exporter failure rates and OTel collector span refusal and drop rates, and we saw improvements in the distribution of memory usage across our fleet of OTel collector pods.

The Proposal

The OTLP::Exporter should close the current HTTP connection and open a new one when #send_bytes gets an error response back from the backend.

Implementation Suggestion

Our monkey patch looks like this:

def backoff?(retry_count:, reason:, retry_after: nil)
  @http.finish if @http.started?
  super
end

The #backoff? method is called before any call to #redo to retry the request in #send_bytes. So, the #backoff? method would be an appropriate place to close the HTTP connection. Then, the code already present in #send_bytes will start a fresh connection when #redo is called.

The text was updated successfully, but these errors were encountered:

xuan-cao-swi · 2024-08-09T15:56:40Z

I think it's a good idea to close http connection for every retry (even without multi-node environment).

Although I am little bit concerned about tcp connection overhead (maybe it's negligible), but I will bring it up to next SIG (everyone is welcome to join)

SophieDeBenedetto · 2024-08-09T16:22:09Z

I think it's a good idea to close http connection for every retry (even without multi-node environment).

Although I am little bit concerned about tcp connection overhead (maybe it's negligible), but I will bring it up to next SIG (everyone is welcome to join)

Thanks @xuan-cao-swi! I might not be able to make the next Sig but I'll try to set up one of my teammates to attend if I can't make it since we're interested in helping move this forward.

kaylareopelle · 2024-08-20T16:56:08Z

Hi @SophieDeBenedetto! Thanks for your patience here! We discussed this during the SIG today.

We're still trying to figure out the best way to respond to this problem, but wanted to give you an update.

We’re open to adding an option to enable this functionality, but haven’t found an example of another OTLP exporter following this pattern. When it seems like we're adding a feature for the first time across OTel, the topic generally warrants a larger conversation across the SIGs.

It seems like open-telemetry/opentelemetry-specification#4148 relates to this problem and may provide a spec'd path to implement a configuration option to toggle this functionality. There may also be nuances about what types of errors should relate to a retry and which should force a reconnection.

We're going to try to start/resume this cross-SIG conversation and will get back to you when we know more/have a place for the discussion to point you to.

github-actions · 2024-09-20T01:58:40Z

👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.

kaylareopelle · 2024-09-20T16:53:25Z

Oh no, this one fell off our radar! I'm sorry about the delay. I'll try to get an answer in the next SIG.

xuan-cao-swi · 2024-09-26T22:26:41Z

Sorry for the delayed response.

I think we could add a configurable option to close connections for all retries. Although this wouldn't be spec-compliant, it would be best to ask about this behavior change in the spec GitHub for confirmation.

Regarding spec compliance:
After reviewing the spec, it seems that closing existing connections does not comply with the current specifications. The spec acknowledges the potential to overwhelm the server with failed requests and indicates that the retry interval should use an exponential backoff strategy.

I checked the Python OTLP exporter as a reference, and it also does not close connections between retries (I believe the JavaScript exporter behaves similarly, but please correct me if I'm wrong).

Would a queuing system (as suggested here) mitigate this issue if one destination becomes overwhelmed?

github-actions bot added the stale label Sep 20, 2024

kaylareopelle added keep and removed stale labels Sep 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OpenTelemetry::Exporter::OTLP::Exporter should re-establish an HTTP connection on error #1658

OpenTelemetry::Exporter::OTLP::Exporter should re-establish an HTTP connection on error #1658

SophieDeBenedetto commented Jul 30, 2024

xuan-cao-swi commented Aug 9, 2024

SophieDeBenedetto commented Aug 9, 2024

kaylareopelle commented Aug 20, 2024

github-actions bot commented Sep 20, 2024

kaylareopelle commented Sep 20, 2024

xuan-cao-swi commented Sep 26, 2024

OpenTelemetry::Exporter::OTLP::Exporter should re-establish an HTTP connection on error #1658

OpenTelemetry::Exporter::OTLP::Exporter should re-establish an HTTP connection on error #1658

Comments

SophieDeBenedetto commented Jul 30, 2024

The Problem

The Proposal

Implementation Suggestion

xuan-cao-swi commented Aug 9, 2024

SophieDeBenedetto commented Aug 9, 2024

kaylareopelle commented Aug 20, 2024

github-actions bot commented Sep 20, 2024

kaylareopelle commented Sep 20, 2024

xuan-cao-swi commented Sep 26, 2024