
OTLP exporters don't respect timeout #2663

Open
aabmass opened this issue May 5, 2022 · 4 comments
Labels
bug Something isn't working metrics release:after-ga To be resolved after GA release

Comments

@aabmass
Member

aabmass commented May 5, 2022

Discussed further in #2402 (comment)

OTLP exporter as of today

  • try to export, with a 10 second timeout, fail and wait 2 seconds
  • try to export, with a 10 second timeout, fail and wait 4 seconds
  • try to export, with a 10 second timeout, fail and wait 8 seconds
  • try to export, with a 10 second timeout, fail and wait 16 seconds
  • try to export, with a 10 second timeout, fail and wait 32 seconds
  • try to export, with a 10 second timeout, fail and wait 64 seconds and finally give up

In an ideal scenario the BSP should cancel the export call after 30s, irrespective of the exporter's retry handling. But it doesn't do that with the 30s export timeout (in millis) passed via env var/argument.
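For illustration, here is a hypothetical sketch (not the SDK's actual implementation) of a retry loop that respects an overall deadline: the total budget caps both each request's timeout and each backoff sleep, so the sequence above could never run past the configured 30s. The `send` callback, its `per_request_timeout` parameter, and the 10s per-request default are assumptions for the example.

```python
import time


def export_with_deadline(send, timeout_s: float = 30.0, max_backoff_s: float = 64.0) -> bool:
    """Hypothetical deadline-aware retry loop.

    `send` is a stand-in for a single export attempt; it takes a
    per_request_timeout keyword and returns True on success.
    """
    deadline = time.monotonic() + timeout_s
    backoff = 1.0
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # overall deadline exceeded, give up
        # Cap the per-request timeout (10s mirrors the OTLP default)
        # by whatever budget is left.
        if send(per_request_timeout=min(10.0, remaining)):
            return True
        # Never sleep past the deadline.
        sleep_for = min(backoff, max(0.0, deadline - time.monotonic()))
        time.sleep(sleep_for)
        backoff = min(backoff * 2, max_backoff_s)
```

With `timeout_s=30`, the loop would stop after roughly 30 seconds of wall-clock time regardless of how many retries fit in that window.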

@aabmass aabmass added bug Something isn't working metrics 1.10.0rc1 release candidate 1 for metrics GA labels May 5, 2022
@aabmass
Member Author

aabmass commented May 6, 2022

Moving some discussion from the PR here:

    def _export(
        self,
        data: TypingSequence[SDKDataT],
        timeout_millis: Optional[float] = None,
    ) -> ExportResultT:
        if timeout_millis is not None:
            timeout_seconds = timeout_millis / 10**3
        else:
            timeout_seconds = self._timeout

Help me understand this. It is generally self._timeout, except when we want to finish work as early as possible (force_flush or shutdown), in which case we pass the timeout_millis from the processor/periodic exporting reader?
From our discussion last week, IIRC you said the timeout configured directly on the exporter (via constructor or environment vars) should apply to individual requests, potentially within a series of retries.

cc @srikanthccv

I checked the Go impl (where this is a common pattern) and it actually does something different.

  • If the exporter's config timeout (self._timeout for us) is present, it uses that.
    • Note, if that is longer than the context's timeout, it will actually use the shorter one
  • If not present, it uses just the parent context
  • It doesn't have a different context for individual requests vs retries (retry logic here)

This is really confusing to me because DEADLINE_EXCEEDED is retryable but it will never happen? I'm not really sure what the intention of the spec is. Maybe a few examples can help tease out the details:

BSP has a longer timeout

  • user sets OTEL_EXPORTER_OTLP_TIMEOUT=10000
  • they set OTEL_BSP_EXPORT_TIMEOUT=30000

does the export abort after 10s? Or should it retry until 30 seconds? What if the user calls force flush with 15s timeout?

BSP has a shorter timeout

  • user sets OTEL_EXPORTER_OTLP_TIMEOUT=30000
  • they set OTEL_BSP_EXPORT_TIMEOUT=10000

presumably the BSP should win out here

User passes a longer timeout than either

  • user sets OTEL_EXPORTER_OTLP_TIMEOUT=10000
  • they set OTEL_BSP_EXPORT_TIMEOUT=10000

What if the user calls force flush with a 30s timeout? Should we wait 30s or respect the shorter configured options of BSP/OTLP? Maybe there is an argument for making the timeout params Optional to indicate no deadline from the parent call
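The Go behavior described above (take the shorter of the exporter's configured timeout and the caller's deadline) can be sketched as a small helper. This is an illustrative assumption about how the precedence could work, not existing SDK API; the function name and parameters are made up for the example.

```python
from typing import Optional


def effective_timeout(
    exporter_timeout_s: Optional[float],
    caller_timeout_s: Optional[float],
) -> Optional[float]:
    """Pick the shorter of the exporter's configured timeout and the
    caller's (BSP export / force_flush) deadline; None means no deadline."""
    timeouts = [t for t in (exporter_timeout_s, caller_timeout_s) if t is not None]
    return min(timeouts) if timeouts else None
```

Under this rule, the first scenario (exporter 10s, BSP 30s) aborts at 10s, the second (exporter 30s, BSP 10s) at 10s, and an `Optional` caller timeout of None would fall back to the exporter's own setting.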

@gsakkis

gsakkis commented Feb 9, 2024

Any update on this? Not sure if it's the same bug, but I set the timeout to 1 sec and it keeps trying to reach the (nonexistent) endpoint:

WARNING - 2024-02-10 00:16:51,218 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 1s.
WARNING - 2024-02-10 00:16:52,227 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 2s.
WARNING - 2024-02-10 00:16:54,233 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 4s.
WARNING - 2024-02-10 00:16:58,238 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 8s.
WARNING - 2024-02-10 00:17:06,245 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 16s.
WARNING - 2024-02-10 00:17:22,255 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 32s.

@havrylenkok

This behavior is EXTREMELY problematic when the collector has worse performance than the application, which for many is acceptable; after all, this is telemetry data.

Especially when we are talking about Lambda, with sub-second invocations and short timeouts.

The current workaround I see for autoinstrumentation is to patch the retries out completely during cold start:

from opentelemetry.metrics._internal import get_meter_provider
meter_provider = get_meter_provider()  # should return the global object
meter_provider._sdk_config.metric_readers[0]._exporter._retryable = lambda _: False

@jonapich

I wanted to keep some form of retry so I made this patch:

from typing import Any

from opentelemetry.exporter.otlp.proto.common._internal import _create_exp_backoff_generator  # this is a copy


def _patched_otel_create_exp_backoff_generator(*args: Any, **kwargs: Any) -> Any:
    original_max_value = kwargs.get("max_value", 0)
    new_max_value = 2
    kwargs["max_value"] = new_max_value

    for backoff in _create_exp_backoff_generator(*args, **kwargs):
        yield backoff
        if backoff == new_max_value:
            if original_max_value:
                yield original_max_value  # the original function will stop if we return its max value
            break

Here's how you can apply this patch:

import logging

def _patch_otel_backoff_strategy() -> None:
    try:
        # import without the "from" to get the reference, not a copy
        import opentelemetry.exporter.otlp.proto.common._internal
        import opentelemetry.exporter.otlp.proto.grpc.exporter

        # patch the source as well as its copy in the exporter module
        opentelemetry.exporter.otlp.proto.common._internal._create_exp_backoff_generator = (
            opentelemetry.exporter.otlp.proto.grpc.exporter._create_exp_backoff_generator
        ) = _patched_otel_create_exp_backoff_generator
    except Exception:
        logging.warning("Failed to patch the OpenTelemetry exporter backoff strategy.")
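To see what the patched generator yields, here is a self-contained sketch using a stand-in for `_create_exp_backoff_generator` (assumed to yield 1, 2, 4, ... capped at `max_value`, as in the OTLP common package; the stand-in is an assumption, not the real helper):

```python
from itertools import count
from typing import Any, Iterator


def _create_exp_backoff_generator(max_value: int = 0) -> Iterator[int]:
    # Stand-in for the real helper: yields 1, 2, 4, ... capped at max_value.
    for i in count(0):
        delay = 2**i
        yield min(delay, max_value) if max_value else delay


def _patched_otel_create_exp_backoff_generator(*args: Any, **kwargs: Any) -> Iterator[int]:
    original_max_value = kwargs.get("max_value", 0)
    new_max_value = 2
    kwargs["max_value"] = new_max_value
    for backoff in _create_exp_backoff_generator(*args, **kwargs):
        yield backoff
        if backoff == new_max_value:
            if original_max_value:
                yield original_max_value  # the exporter gives up when it sees its max value
            break


# With max_value=64 (the exporter's retry ceiling), the delays are 1s and 2s,
# then 64 is yielded so the exporter's loop sees its max value and gives up.
delays = list(_patched_otel_create_exp_backoff_generator(max_value=64))
```

So instead of waiting up to 1+2+4+...+64 seconds across retries, the patched exporter retries twice (1s and 2s) and then stops.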
