-
Notifications
You must be signed in to change notification settings - Fork 248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OpenTelemetry::Exporter::OTLP::Exporter should re-establish an HTTP connection on error #1658
Comments
I think it's a good idea to close http connection for every retry (even without multi-node environment). Although I am little bit concerned about tcp connection overhead (maybe it's negligible), but I will bring it up to next SIG (everyone is welcome to join) |
Thanks @xuan-cao-swi! I might not be able to make the next Sig but I'll try to set up one of my teammates to attend if I can't make it since we're interested in helping move this forward. |
Hi @SophieDeBenedetto! Thanks for your patience here! We discussed this during the SIG today. We're still trying to figure out the best way to respond to this problem, but wanted to give you an update. We’re open to adding an option to enable this functionality, but haven’t found an example of another OTLP exporter following this pattern. When it seems like we're adding a feature for the first time across OTel, the topic generally warrants a larger conversation across the SIGs. It seems like open-telemetry/opentelemetry-specification#4148 relates to this problem and may provide a spec'd path to implement a configuration option to toggle this functionality. There may also be nuances about what types of errors should relate to a retry and which should force a reconnection. We're going to try to start/resume this cross-SIG conversation and will get back to you when we know more/have a place for the discussion to point you to. |
👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the |
Oh no, this one fell off our radar! I'm sorry about the delay. I'll try to get an answer in the next SIG. |
Sorry for the delayed response. I think we could add a configurable option to close connections for all retries. Although this wouldn't be spec-compliant, it would be best to ask about this behavior change in the spec GitHub for confirmation. Regarding spec compliance: I checked the Python OTLP exporter as a reference, and it also does not close connections between retries (I believe the JavaScript exporter behaves similarly, but please correct me if I'm wrong). Would a queuing system (as suggested here) mitigate this issue if one destination becomes overwhelmed? |
The Problem
The
OpenTelemetry::Exporter::OTLP::Exporter#send_bytes
method establishes a persistent HTTP connection and re-uses that connection when receiving certain error statuses back from the server and retrying the export request.At GitHub, we observed that this can cause a pile-on effect impacting certain backend nodes (in our case, we have an OTel collector backend). Nodes that received "bad" requests or that are returning errors to the client for other reasons then continue to receive all the retries from a given request since the client is re-using the same persistent HTTP connection. The collector node would then be under increased pressure, and where the collector node was already under memory or CPU pressure, this would exacerbate the situation.
So, we introduced a monkey patch to the
OTLP::Exporter
to force it to create a new HTTP connection in the event of an error response. As a result, we saw a marked decrease in client exporter failure rates and OTel collector span refusal and drop rates, and we saw improvements in the distribution of memory usage across our fleet of OTel collector pods.The Proposal
The
OTLP::Exporter
should close the current HTTP connection and open a new one when#send_bytes
gets an error response back from the backend.Implementation Suggestion
Our monkey patch looks like this:
The
#backoff?
method is called before any call to#redo
to retry the request in#send_bytes
. So, the#backoff?
method would be an appropriate place to close the HTTP connection. Then, the code already present in#send_bytes
will start a fresh connection when#redo
is called.The text was updated successfully, but these errors were encountered: