
Implement timeout mechanism for several metrics components #2402

Closed
ocelotl opened this issue Jan 24, 2022 · 21 comments · Fixed by #2653
Labels: 1.10.0rc1 (release candidate 1 for metrics GA), metrics, sdk (Affects the SDK package)

Comments

@ocelotl
Contributor

ocelotl commented Jan 24, 2022

collect doesn't return anything, so returning True until we implement a timeout/error handling mechanism seemed reasonable to me. What do you think?

Originally posted by @lonewolf3739 in #2401 (comment)

Also, the timeout mechanism is required in several parts of the metrics spec:

  1. MeterProvider.shutdown
  2. MeterProvider.force_flush
  3. Asynchronous callbacks
  4. MetricReader.collect
  5. MetricReader.shutdown
  6. PushMetricExporter.force_flush
  7. PushMetricExporter.shutdown
@ocelotl
Contributor Author

ocelotl commented Mar 24, 2022

Just for the record: https://pypi.org/project/timeout-decorator/

@ocelotl
Contributor Author

ocelotl commented Mar 24, 2022

Also #385

@ocelotl ocelotl added the sdk (Affects the SDK package) label Mar 24, 2022
@ocelotl ocelotl changed the title from "Implement timeout mechanism for collect" to "Implement timeout mechanism for several metrics components" Mar 24, 2022
@ocelotl ocelotl added the 1.10.0rc1 (release candidate 1 for metrics GA) label Apr 12, 2022
@aabmass
Member

aabmass commented Apr 26, 2022

Based on the linked issues, the previous decision was not to use the signal-based approach for implementing a generic timeout mechanism, because it won't work outside of the main thread and it can interfere with applications that already use the same signal handler.
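For context, a minimal sketch of the kind of signal-based timeout that was ruled out (Unix-only; time.sleep here is just a stand-in for a blocking collect/export call):

import signal
import time

def _on_timeout(signum, frame):
    raise TimeoutError("operation timed out")

# SIGALRM-based timeouts only work in the main thread of the main interpreter:
# signal.signal() raises ValueError anywhere else, and installing the handler
# clobbers any SIGALRM handler the application set up itself.
signal.signal(signal.SIGALRM, _on_timeout)
signal.alarm(2)  # deliver SIGALRM in 2 seconds
try:
    time.sleep(10)  # stand-in for a blocking collect()/export() call
except TimeoutError as err:
    print(err)
finally:
    signal.alarm(0)  # cancel any pending alarm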

For some background,

  • The Java methods return a future-like object which lets the user join() with a certain timeout
  • JS doesn't appear to do anything special, just returns a Promise or accepts a callback. There is no way to cancel
  • Go simply passes the ctx and tells the implementor to respect the cancellation/timeout
  • .NET accepts timeout for shutdown and force flush which the implementor should respect

Based on that, I think it's reasonable to just pass down a timeout duration where possible and document that implementors should respect the timeout. This doesn't actually implement "SHOULD complete or abort within some timeout" but is not terribly opinionated.

It would be great if the mechanism we add for metrics can also be used in the tracing SDK, where timeouts were never implemented previously. The problem is that adding an optional timeout parameter to the SpanExporter, SpanProcessor, etc. interface signatures will break existing implementations of those interfaces when we go to call them. We could potentially call defensively with the timeout param and fall back to not passing the timeout, e.g. here:

try:
    # Newer implementations accept the timeout keyword.
    self._active_span_processor.shutdown(timeout=timeout)
except TypeError:
    # Older implementations don't take a timeout; call without it.
    self._active_span_processor.shutdown()

@aabmass
Member

aabmass commented Apr 26, 2022

I think potentially we should try to add *args, **kwargs or config objects to any user-implementable interfaces going forward to avoid this kind of issue.
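For example, a sketch of what a forward-compatible signature could look like (hypothetical class, not the actual SDK interface):

# A forward-compatible interface: a keyword argument added by a future SDK
# version is absorbed by **kwargs instead of breaking older third-party
# implementations when the SDK calls them.
class MySpanProcessor:  # hypothetical user implementation
    def shutdown(self, **kwargs) -> None:
        timeout_millis = kwargs.get("timeout_millis", 30_000)
        print(f"shutting down within {timeout_millis} ms")

# Both call styles keep working:
MySpanProcessor().shutdown()
MySpanProcessor().shutdown(timeout_millis=5_000)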

@aabmass
Member

aabmass commented Apr 28, 2022

Looks like I lied about JS. They accept some timeout parameters as config options and apply them: https://github.com/open-telemetry/opentelemetry-js/blob/cfda625e83a164c9e19745392879c32aedcfe76f/packages/opentelemetry-sdk-trace-base/src/BasicTracerProvider.ts#L153

The obvious downside with this approach is you can't change the timeout at different call sites. I think it still satisfies the spec though. And we could add it without the try/except stuff.

@aabmass
Member

aabmass commented Apr 29, 2022

Ok.. new issue: the OTLP exporter accepts a timeout parameter on its own. The really weird thing is, this is considered a timeout for just the RPCs to the OTLP endpoint, and we use an exponential backoff that will delay things way past the default of 10s, all the way to 63 seconds plus whatever delay we had from the RPCs.

And on top of that, PeriodicExportingMetricReader and BatchSpanProcessor have an export_timeout_millis of 30s (way shorter than the max backoff), but neither of those classes passes the timeout to the exporter.

I'm not even sure what the correct behavior is at this point.

@lzchen
Contributor

lzchen commented May 2, 2022

So my understanding is that there are several timeouts that we have implemented per component, all of which could contribute to (or be) the actual timeout that users experience in the metric pipeline, and we are not sure which one(s), if any, to use?

@aabmass
Member

aabmass commented May 2, 2022

I put up a draft PR to add some of the metrics timeouts: #2653

So my understanding is that there are several timeouts that we have implemented per component

For tracing, the only implemented timeout is force flush, which is indeed implemented at several levels:

  • SynchronousMultiSpanProcessor passes a timeout to each sub-processor and checks the deadline for the whole thing:

    deadline_ns = _time_ns() + timeout_millis * 1000000
    for sp in self._span_processors:
        current_time_ns = _time_ns()
        if current_time_ns >= deadline_ns:
            return False
        if not sp.force_flush((deadline_ns - current_time_ns) // 1000000):
            return False

  • ConcurrentMultiSpanProcessor does similar, but runs the sub-processors in a thread pool executor:

    for sp in self._span_processors:  # type: SpanProcessor
        future = self._executor.submit(sp.force_flush, timeout_millis)
        futures.append(future)

    timeout_sec = timeout_millis / 1e3
    done_futures, not_done_futures = concurrent.futures.wait(
        futures, timeout_sec
    )

    Note this can be redundant when used with SpanProcessors that already respect the timeout, like BatchSpanProcessor

  • BatchSpanProcessor sets the timeout while waiting for the background thread, which should respect the timeout:

    ret = flush_request.event.wait(timeout_millis / 1e3)


The BatchSpanProcessor is completely ignoring the export_timeout_millis constructor param, except that it uses it as the default force flush timeout.

@lzchen
Contributor

lzchen commented May 2, 2022

So a couple of things to do here then:

  1. Implement metrics similarly to tracing in terms of supporting export_timeout for export in exporters and for flush in metric readers (do we even support flush for this?).
  2. Implement shutdown_timeout for shutdown in metric readers and exporters.
  3. Implement shutdown_timeout for shutdown in tracing exporters.
  4. Implement export_timeout for export in trace exporters.

@srikanthccv
Member

Ok.. new issue: the OTLP exporter accepts a timeout parameter on its own. The really weird thing is, this is considered a timeout for just the RPCs to the OTLP endpoint, and we use an exponential backoff that will delay things way past the default of 10s, all the way to 63 seconds plus whatever delay we had from the RPCs.

And on top of that, PeriodicExportingMetricReader and BatchSpanProcessor have an export_timeout_millis of 30s (way shorter than the max backoff), but neither of those classes passes the timeout to the exporter.

I'm not even sure what the correct behavior is at this point.

Not just OTLP; other exporters (Jaeger, Zipkin) have their own timeout that can be configured. The 63-second delay in OTLP is an outlier because we only do that when we encounter transient network issues. The number used to be much higher (15 min) but we bumped it down to 63 seconds after some deliberation (the spec doesn't really say anything as of today). My understanding is that export_timeout_millis is for the export call from the processor, but the exporter itself could do different things. For example, an exporter X divides the batch into 4 different chunks and exports them. Here the export timeout on the exporter is for each such call, but the processor has the maximum timeout of 30s for the export call regardless of what the exporter does internally.
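To illustrate the distinction (toy code with made-up names, not the actual OTLP exporter):

# The exporter-level timeout bounds each chunk's RPC, while the processor's
# export_timeout_millis is meant to bound the whole export() call.
def send_chunk(chunk, timeout_millis: float) -> None:
    print(f"exporting {len(chunk)} points with a {timeout_millis:.0f} ms RPC timeout")

def export(batch, rpc_timeout_millis: float = 10_000) -> None:
    chunks = [batch[i::4] for i in range(4)]  # split the batch into 4 chunks
    for chunk in chunks:
        # Each RPC gets its own 10 s budget, so export() as a whole can take
        # roughly 4x that -- which is why the processor needs its own overall cap.
        send_chunk(chunk, timeout_millis=rpc_timeout_millis)

export(list(range(100)))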

@aabmass
Member

aabmass commented May 3, 2022

For example, an exporter X divides the batch into 4 different chunks and exports them. Here the export timeout on the exporter is for each such call, but the processor has the maximum timeout of 30s for the export call regardless of what the exporter does internally.

I think that's true if the exporter performs that work asynchronously after returning from the export() call. Correct me if I'm wrong, but I think our OTLP exporter blocks in this case, which also blocks the BSP worker thread. The spec (and our implementations) will call export() once at a time. So the worst case result is that the OTLP exporter will

  • try to export, with a 30 second timeout, fail and wait 2 seconds
  • try to export, with a 30 second timeout, fail and wait 4 seconds
  • try to export, with a 30 second timeout, fail and wait 8 seconds
  • try to export, with a 30 second timeout, fail and wait 16 seconds
  • try to export, with a 30 second timeout, fail and wait 32 seconds

for a total of ~213 seconds. All the while blocking the BSP's worker thread, meaning it also won't respond to force flush events (on shutdown). I suppose this is a separate issue from this one, but the situation is currently very broken.

@aabmass
Member

aabmass commented May 3, 2022

@lzchen that's right, thanks for outlining it. I guess my proposals are:

1. pass timeout duration down through the call stack

e.g. MeterProvider.shutdown(timeout_millis) -> PeriodicExportingMetricReader.collect(timeout_millis) -> PeriodicExportingMetricReader -> Exporter.export(timeout_millis). This allows implementing the timeout only at the lowest level.

Pros:

  • explicit
  • we make no opinion on how to implement the timeout
    • most RPC/HTTP frameworks that exporters use already accept and handle timeouts.
  • more compatible with different asynchronous frameworks going forward

Cons:

  • have to pass timeout arg through the call chain
  • can be ignored by misbehaving exporters

2. handle timeout at the highest level

e.g. MeterProvider.shutdown() would start a background thread (or submit to a thread pool) and wait for the thread to finish for timeout_millis.

Pros:

  • simple
  • one size fits all
  • timeout is configurable

Cons:

  • paying for an extra thread/thread pool which could probably be avoided.
    • the SDK is already creating quite a few threads so this cost is growing.
  • not very compatible with other asynchronous frameworks like asyncio
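For illustration, a minimal sketch of this second approach (slow_shutdown is a stand-in for a component that ignores its deadline):

import concurrent.futures
import time

def slow_shutdown() -> None:  # stand-in for a misbehaving exporter/processor
    time.sleep(5)

executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = executor.submit(slow_shutdown)
try:
    future.result(timeout=1.0)
    print("shut down in time")
except concurrent.futures.TimeoutError:
    # We stop waiting, but the worker thread is leaked: it keeps running and
    # keeps the process alive until it finishes on its own.
    print("gave up waiting")
finally:
    executor.shutdown(wait=False)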

My PR implements the first approach.
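For reference, a rough sketch of what passing the timeout down could look like (simplified, hypothetical names, not the PR's actual code):

from time import time_ns

class FakeReader:  # stand-in for a MetricReader implementation
    def shutdown(self, timeout_millis: float = 30_000) -> None:
        print(f"reader gets at most {timeout_millis:.0f} ms")

def shutdown_all(readers, timeout_millis: float = 30_000) -> None:
    # The caller computes how much of its budget remains and passes the
    # remainder down, so only the lowest level has to actually block.
    deadline_ns = time_ns() + int(timeout_millis * 1e6)
    for reader in readers:
        remaining_millis = (deadline_ns - time_ns()) / 1e6
        if remaining_millis <= 0:
            raise TimeoutError("shutdown exceeded its deadline")
        reader.shutdown(timeout_millis=remaining_millis)

shutdown_all([FakeReader(), FakeReader()])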

@lzchen
Contributor

lzchen commented May 3, 2022

+1 for the first approach.

@srikanthccv
Member

For example, an exporter X divides the batch into 4 different chunks and exports them. Here the export timeout on the exporter is for each such call, but the processor has the maximum timeout of 30s for the export call regardless of what the exporter does internally.

I think that's true if the exporter performs that work asynchronously after returning from the export() call. Correct me if I'm wrong, but I think our OTLP exporter blocks in this case, which also blocks the BSP worker thread. The spec (and our implementations) will call export() once at a time. So the worst case result is that the OTLP exporter will

  • try to export, with a 30 second timeout, fail and wait 2 seconds
  • try to export, with a 30 second timeout, fail and wait 4 seconds
  • try to export, with a 30 second timeout, fail and wait 8 seconds
  • try to export, with a 30 second timeout, fail and wait 16 seconds
  • try to export, with a 30 second timeout, fail and wait 32 seconds

for a total of ~213 seconds. All the while blocking the BSP's worker thread, meaning it also won't respond to force flush events (on shutdown). I suppose this is a separate issue from this one, but the situation is currently very broken.

The OTLP exporter as of today:

  • try to export, with a 10 second timeout (not 30), fail and wait 2 seconds
  • try to export, with a 10 second timeout, fail and wait 4 seconds
  • try to export, with a 10 second timeout, fail and wait 8 seconds
  • try to export, with a 10 second timeout, fail and wait 16 seconds
  • try to export, with a 10 second timeout, fail and wait 32 seconds
  • try to export, with a 10 second timeout, fail and wait 64 seconds, and finally give up

In an ideal scenario the BSP has to cancel the export call after 30s irrespective of the exporter's handling mechanism. But it doesn't do that with the 30s export_timeout_millis passed via env/arg.

Question for the first approach: I may be misunderstanding something, but there are two different timeouts to be considered: 1. OTEL_METRIC_EXPORT_TIMEOUT, which I think is for the entire export call (30s). 2. OTEL_EXPORTER_OTLP_TIMEOUT (10s), which is for the RPC call (an exporter may have multiple RPC calls). So passing timeout_millis down the call stack to the exporter is not really the expected behaviour?

@aabmass
Member

aabmass commented May 4, 2022

So passing timeout_millis down the call stack to the exporter is not really the expected behaviour?

I guess what I'm proposing is that all of these blocking calls respect the timeout they receive as the maximum for the entire blocking call, e.g. Exporter.export(..., timeout) should never block longer than the timeout passed in. If the exporter is doing queueing or backoffs, it can implement that as it pleases, respecting the overall deadline for the export. That would work for the scenario where OTEL_METRIC_EXPORT_TIMEOUT < OTEL_EXPORTER_OTLP_TIMEOUT, or maybe you're sending cumulatives every 5 seconds and would rather drop a slow request and send the next collection interval's data instead of doing exponential backoff.

This is basically how Go contexts work out of the box. When you create a context with a timeout, it calculates the deadline as now + timeout. If you make a child context with a different timeout (e.g. a single request in a set of backoffs), it will respect the sooner deadline between the parent and child contexts. That is the most intuitive behavior to me.
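A tiny sketch of that "sooner deadline wins" rule (assumed helper, not SDK code):

from time import time_ns

def effective_timeout_millis(parent_deadline_ns: int, attempt_timeout_millis: float) -> float:
    # A per-attempt timeout (e.g. one RPC in a series of backoffs) never
    # extends past the overall deadline of the export call.
    remaining_millis = (parent_deadline_ns - time_ns()) / 1e6
    return max(0.0, min(remaining_millis, attempt_timeout_millis))

# e.g. a 30 s overall budget with a 10 s per-RPC timeout: the first attempt
# gets 10 s, but an attempt starting 25 s in only gets the remaining ~5 s.
overall_deadline_ns = time_ns() + 30_000 * 1_000_000
print(effective_timeout_millis(overall_deadline_ns, 10_000))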

Our SynchronousMultiSpanProcessor.force_flush() already behaves this way for example.

@aabmass
Member

aabmass commented May 4, 2022

The other problem with approach 2 is that if the thread running the blocking call doesn't complete within the timeout, it will just be leaked and there is no real way to cancel it. We could use multiprocessing instead and actually terminate the background process, but then the exporters/processors and call arguments must be picklable. With approach 1, an exporter ignoring the timeout will block indefinitely, but a well-behaved exporter/processor will actually be able to degrade more gracefully.

The exception to this is if we ever support asyncio, where awaitables can easily be cancelled/timed out from the parent call and the child may respond to cancellation gracefully. We could do both options 1 and 2 for asyncio.

@srikanthccv
Member

Makes sense, +1 for the first approach.

@ocelotl
Contributor Author

ocelotl commented May 4, 2022

All right +1 for the first approach as well 👍

@aabmass
Member

aabmass commented May 5, 2022

Cool, PR is ready for review for everything except async callbacks. I will do that in a separate PR, I think. Async callbacks also need to be written with a forward-compatible signature. Maybe a config dataclass would make more sense for async callbacks?

def callback(config: CallbackConfig) -> Iterable[Observation]:
  ...

I feel like users are more likely to forget to add **kwargs for callbacks, and this way is easier for them to grok. We really don't want to break async instrument callbacks for users since this is a common use case.
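A hypothetical shape for such a config object (the actual design is worked out in the follow-up PR; the field here is just the one this issue needs):

from dataclasses import dataclass
from typing import Iterable

@dataclass
class CallbackConfig:  # hypothetical name and shape
    # Maximum time the SDK is willing to wait for this callback.
    timeout_millis: float = 10_000

def cpu_time_callback(config: CallbackConfig) -> Iterable[float]:
    # A well-behaved callback keeps its own work under config.timeout_millis.
    return [0.0]

print(list(cpu_time_callback(CallbackConfig(timeout_millis=500))))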

@lzchen
Contributor

lzchen commented May 6, 2022

I feel like users are more likely to forget to add **kwargs for callbacks, and this way is easier for them to grok. We really don't want to break async instrument callbacks for users since this is a common use case.

  1. What does grok mean?
  2. Do you mean that if we recommended users to use **kwargs they would misuse it? Are we trying to make sure mistakes won't be made by forcing them to pass in a dataclass?

@aabmass
Member

aabmass commented May 6, 2022

2. Do you mean that if we recommended users to use **kwargs they would misuse it? Are we trying to make sure mistakes won't be made by forcing them to pass in a dataclass?

Yes, trying to prevent them from making mistakes and messing up the signature. I'm implementing this in #2664
