OTLP Exporter Retry and Persist Feature #4115

Open
mic-max opened this issue Jan 28, 2023 · 13 comments
Labels
enhancement New feature or request pkg:OpenTelemetry.Exporter.OpenTelemetryProtocol Issues related to OpenTelemetry.Exporter.OpenTelemetryProtocol NuGet package


@mic-max
Contributor

mic-max commented Jan 28, 2023

Feature Request

Is your feature request related to a problem?

When exporting data, a transient server issue that prevents the request from being processed correctly causes the data to be lost. Likewise, when a program is shut down, any data not yet exported before the process terminates is lost.

Describe the solution you'd like:

The exporter should attempt to export the data again when the error is considered retryable. On program shutdown, any data not yet exported should first be saved to disk and then exported, in case the transmission fails or does not have enough time to complete. On the next program execution, the telemetry saved to disk is exported. This will reduce the amount of telemetry lost.

Additional Context

Add the ability for the OTLP exporters to retry exports that fail in a defined way, including across program shutdowns by persisting the data to disk upon failure. This will help improve the reliability of OTel on the client side.

Original GitHub Issue: #1278

The first set of PRs will focus on a single, to-be-decided cell in the following matrix; follow-up PRs will enable the others, reusing as much code as is reasonable.

|      | Metrics | Logs | Traces |
| ---- | ------- | ---- | ------ |
| HTTP | x       | x    | x      |
| gRPC | x       | x    | x      |

src/OpenTelemetry.Exporter.OpenTelemetryProtocol

PR Roadmap

  1. OTLP Exporter Options: Opt-in.
    1. FileBlobProvider:
      1. Path: the folder location, as defined in the "Storage folder" section of Add persistent storage to exporter #1278
      2. Max size (is new or old data dropped when it is exceeded?)
      3. Maintenance period (the 2-minute default is likely too long for releasing a lease)
      4. Retention period
      5. Write timeout
    2. RetryThreadPeriodMilliseconds (default of 1000? or combine into one field together with the maintenance period?)
  2. Write
    1. Each ExportClient will be modified
    2. If a retryable error is returned:
      1. Log the error (info level).
      2. Create a blob: `persistentBlobProvider.TryCreateBlob(data, RetryAfter.Milliseconds, out var blob);`
    3. If a non-retryable error is returned:
      1. Drop the data.
      2. Log the error (warning level).
      3. Increment a dropped item counter.
  3. Retry
    1. Implemented as a new thread that will wake up every XX milliseconds and perform a retry on all blobs it can get a lease for (see the sketch after this roadmap).
    2. `foreach (var blobItem in persistentBlobProvider.GetBlobs()) { ... }`
    3. `blob.TryLease(1000); blob.TryRead(out var data);`
    4. If successfully exported, or the response says to no longer retry this blob:
      1. Delete the blob: `blob.TryDelete();`
      2. Log the error (warning level).
      3. Increment a dropped item counter.
    5. If not successful, but another retryable error is returned
      1. Log the error (info level).
  4. Cleanup
    1. This part may already be taken care of by the FileBlobProvider and the above Retry scenario.
  5. Shutdown
    1. Write data to disk, then attempt to send that data.
    2. Exit grace period for Console-like apps; see point #3 in the original issue.
  6. Add Guards to FileBlobProvider interface. Ref
  7. Optimisations
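
The sketch below illustrates how the write and retry steps above could fit together. It is a rough sketch only: it assumes the `PersistentBlobProvider`/`FileBlobProvider` APIs from the OpenTelemetry.PersistentStorage.* packages (`TryCreateBlob`, `GetBlobs`, `TryLease`, `TryRead`, `TryDelete`), while `TrySend`, the storage path, and the period/counter fields are hypothetical placeholders for the real ExportClient and options wiring.

```csharp
using System.Threading;
using OpenTelemetry.PersistentStorage.Abstractions;
using OpenTelemetry.PersistentStorage.FileSystem;

// Rough sketch only, not the final design.
internal sealed class OtlpRetrySketch
{
    private readonly PersistentBlobProvider persistentBlobProvider =
        new FileBlobProvider("/var/tmp/otlp"); // would come from the exporter options
    private readonly int retryThreadPeriodMilliseconds = 1000;
    private long droppedItems;

    // Write path: on a retryable failure, persist the serialized request to disk.
    public void OnExportFailure(byte[] data, bool retryable)
    {
        if (retryable)
        {
            // Log at info level, then persist the payload for the retry thread.
            if (!this.persistentBlobProvider.TryCreateBlob(data, out _))
            {
                Interlocked.Increment(ref this.droppedItems); // storage unavailable: drop
            }
        }
        else
        {
            // Non-retryable: log at warning level, drop the data, bump the counter.
            Interlocked.Increment(ref this.droppedItems);
        }
    }

    // Retry path: a background thread wakes up periodically and replays stored blobs.
    public void RetryLoop(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            foreach (var blob in this.persistentBlobProvider.GetBlobs())
            {
                // Lease the blob so other threads/processes skip it while we retry.
                if (!blob.TryLease(1000) || !blob.TryRead(out var data))
                {
                    continue;
                }

                if (this.TrySend(data, out var canRetryLater) || !canRetryLater)
                {
                    blob.TryDelete(); // exported, or the response says not to retry
                }

                // Otherwise leave the blob on disk; the lease expires and a later
                // pass picks it up again.
            }

            token.WaitHandle.WaitOne(this.retryThreadPeriodMilliseconds);
        }
    }

    // Placeholder for the real ExportClient send + retryability classification.
    private bool TrySend(byte[] data, out bool retryable)
    {
        retryable = false;
        return true;
    }
}
```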

Retryable Errors

  • gRPC
    • UNAVAILABLE
    • CANCELLED
    • DEADLINE_EXCEEDED
    • ABORTED
    • OUT_OF_RANGE
    • DATA_LOSS
    • RESOURCE_EXHAUSTED - Nuanced, see the spec.
  • HTTP
    • 408: Request Timeout
    • 503: Server Too Busy; do not retry too frequently, use exponential backoff, and respect the 'Retry-After' header
    • 5xx: Server Errors
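
For illustration, a hedged sketch of how the codes above might be classified in the exporter. The authoritative list is the OTLP exporter spec; `Grpc.Core.StatusCode` and `System.Net.HttpStatusCode` are used only as convenient enums here, and the helper class and method names are hypothetical.

```csharp
using System.Net;
using Grpc.Core;

// Sketch only: classifying retryable responses per the lists above.
internal static class RetryClassification
{
    public static bool IsRetryable(StatusCode code) => code switch
    {
        StatusCode.Cancelled => true,
        StatusCode.DeadlineExceeded => true,
        StatusCode.Aborted => true,
        StatusCode.OutOfRange => true,
        StatusCode.Unavailable => true,
        StatusCode.DataLoss => true,
        StatusCode.ResourceExhausted => true, // nuanced, see the spec
        _ => false,
    };

    public static bool IsRetryable(HttpStatusCode code) =>
        code == HttpStatusCode.RequestTimeout         // 408
        || code == HttpStatusCode.ServiceUnavailable  // 503, honor Retry-After
        || ((int)code >= 500 && (int)code < 600);     // other 5xx server errors
}
```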

Testing Strategy

Make use of the test/OpenTelemetry.Exporter.OpenTelemetryProtocol.Tests/MockCollectorIntegrationTests.cs class. Some examples can be seen in this closed PR; the changes made to that file in that PR should be reusable.

References

@mic-max mic-max added the enhancement New feature or request label Jan 28, 2023
@cijothomas cijothomas added the pkg:OpenTelemetry.Exporter.OpenTelemetryProtocol Issues related to OpenTelemetry.Exporter.OpenTelemetryProtocol NuGet package label Jan 28, 2023
@reyang
Member

reyang commented Jan 30, 2023

For retry, we need to consider the following cases:

  1. HTTP 503 Server Too Busy - we don't want to retry too frequently, probably exponential backoff, plus respecting 'Retry-After' header if there is one.
  2. When we retry, what's the ordering? E.g. for metrics it might make more sense to always prioritize the newest data, while for logs it might make sense to prioritize the older data.
  3. We might need to introduce some QoS metrics (e.g. SDK internal queue size, error count, retry count). These metrics might need to be treated as high priority, so that if the exporter is too busy to handle all the data, at least the QoS data is prioritized.
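
To make point 1 concrete, here is one possible way to compute the next retry delay: exponential backoff with jitter, capped, and overridden by the server's `Retry-After` header when present. This is only a sketch; the constants are illustrative, not values chosen by this issue.

```csharp
using System;
using System.Net.Http.Headers;

// Sketch: next-delay calculation for retries, honoring Retry-After when present.
internal static class BackoffSketch
{
    private static readonly Random Jitter = new Random();

    public static TimeSpan GetNextDelay(int attempt, RetryConditionHeaderValue retryAfter)
    {
        // 1. Respect an explicit Retry-After (delta or absolute date) if one was sent.
        if (retryAfter?.Delta is TimeSpan delta)
        {
            return delta;
        }

        if (retryAfter?.Date is DateTimeOffset date && date > DateTimeOffset.UtcNow)
        {
            return date - DateTimeOffset.UtcNow;
        }

        // 2. Otherwise exponential backoff: 2^attempt seconds, capped at 30s,
        //    multiplied by a jitter factor in [0.5, 1.0) to avoid retry storms.
        var seconds = Math.Min(Math.Pow(2, attempt), 30);
        return TimeSpan.FromSeconds(seconds * (0.5 + (Jitter.NextDouble() / 2)));
    }
}
```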

@pjanotti
Contributor

Confirming the end result here:

  1. OTLP Exporter Options: Opt-in.

All of the work planned for this issue stays as opt-in, correct?

@vishweshbankwar
Member

Confirming the end result here:

  1. OTLP Exporter Options: Opt-in.

All of the work planned for this issue stays as opt-in, correct?

That is correct. This will be an opt-in feature until we have spec.

@alanwest
Member

This will be an opt-in feature until we have spec.

Is there related spec work for these options in-progress already?

If a retryable error is returned.
Log the error (info level).
Create a blob.

A retry policy is useful independently from persistent storage. That is, the gRPC client (or Polly if using HTTP) could be configured with a retry policy which can handle transient network errors. This handling would be opaque to the OTLP exporter.

Do you plan to implement retry w/o also requiring the use of persistent storage?
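
For the HTTP/protobuf path, a retry policy that is independent of persistent storage can already be approximated by plugging a retrying handler into `OtlpExporterOptions.HttpClientFactory`. A minimal sketch, assuming Polly's v7-style API and the `PolicyHttpMessageHandler` from Microsoft.Extensions.Http.Polly; the status codes, retry count, and source name are illustrative only:

```csharp
using System;
using System.Net;
using System.Net.Http;
using OpenTelemetry;
using OpenTelemetry.Exporter;
using OpenTelemetry.Trace;
using Polly;

// Sketch only: wires a Polly retry policy into the OTLP HTTP exporter through
// OtlpExporterOptions.HttpClientFactory, so transient failures are retried
// without the exporter (or persistent storage) being involved.
var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("MyCompany.MyApp") // hypothetical ActivitySource name
    .AddOtlpExporter(options =>
    {
        options.Protocol = OtlpExportProtocol.HttpProtobuf;
        options.HttpClientFactory = () =>
        {
            var retryPolicy = Policy
                .Handle<HttpRequestException>()
                .OrResult<HttpResponseMessage>(r =>
                    r.StatusCode == HttpStatusCode.RequestTimeout
                    || (int)r.StatusCode >= 500)
                .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

            // PolicyHttpMessageHandler ships in Microsoft.Extensions.Http.Polly.
            var handler = new Microsoft.Extensions.Http.PolicyHttpMessageHandler(retryPolicy)
            {
                InnerHandler = new HttpClientHandler(),
            };

            return new HttpClient(handler) { Timeout = TimeSpan.FromSeconds(10) };
        };
    })
    .Build();
```

For gRPC, the rough equivalent would be a retry policy configured on the gRPC channel/client itself, again opaque to the OTLP exporter.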

@cijothomas
Member

cijothomas commented Jan 31, 2023

Do you plan to implement retry w/o also requiring the use of persistent storage?

We should do this; it is required by the spec: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/exporter.md#retry

Persistent storage is an optional, opt-in feature.

@CodeBlanch
Member

Just wanted to share a thought. @alanwest kind of scratched at this on the SIG yesterday, I think. What does "persistent storage" mean? In the scope of this work, it seems to me we persist the data for retries. I'm working on an exporter for some client teams, some (maybe all) of which seem to want "persistent storage" of the always & up-front variety, meaning the exporter just writes to disk and then some other thread tries to ship the on-disk data on its own cadence.

Only sharing so we can be clear what kind of "persistent storage" we aim to support in OTLP and make sure the documentation is also clear 😄

@Kadajski

Are there any plans still in motion regarding some kind of persistence support? Once OpenTelemetry.PersistentStorage is released, is the intention to integrate it into the OTLP exporter or to create a separate persisted OTLP exporter?

https://opentelemetry.io/docs/specs/otel/library-guidelines/#protocol-exporters

if an application’s telemetry data must be delivered to a remote backend that has no guaranteed availability the end user may choose to use a persistent local queue and an Exporter to retry sending on failures.

This is more or less the scenario I would like to cater for, similar to what @CodeBlanch mentioned, and similar to how the collector supports it: https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md#persistent-queue .

Right now the issue I experience is that if I create my own BatchExportProcessor using a persisted local queue and try to re-use the OtlpTraceExporter.Export function, I then need to serialize/deserialize System.Diagnostics.Activity to store it on a persistent queue somewhere, and serializing the Activity object does not really seem like a viable option. The only other option I am left with, unless I am missing something, is to implement the entire OTLP exporter myself, do the mapping to the proto types, and store the serialization of that format in my local file queue.

@smoms

smoms commented Feb 19, 2024

Am I correct that there is no retry by default? We would need to inject a retry policy ourselves via the HttpClientFactory (if using HTTP), e.g. with Polly? If so, we would need to handle the error codes described in the spec? Thanks.

@cijothomas
Member

Am I correct that there is no retry by default? We would need to inject a retry policy ourselves via the HttpClientFactory (if using HTTP), e.g. with Polly? If so, we would need to handle the error codes described in the spec? Thanks.

True. See #1779 . Some PRs are in-flight now to make this happen automatically, so you don't have to manually deal with it.

@smoms

smoms commented Feb 20, 2024

@cijothomas thanks.
However, I understand that the point here is to implement the HTTP retryable codes as defined by the OTel protocol spec.
If this is the case, I may still need to cover other failure situations with a retry mechanism myself, for example as specified here. Am I correct?

@cijothomas
Member

@cijothomas thanks. However, I understand that the point here is to implement the HTTP retryable codes as defined by the OTel protocol spec. If this is the case, I may still need to cover other server error codes with a retry mechanism myself, for example as specified here. Am I correct?

Yes.

@github-actions bot

This issue was marked stale due to lack of activity and will be closed in 7 days. Commenting will instruct the bot to automatically remove the label. This bot runs once per day.

@github-actions github-actions bot added the Stale Issues and pull requests which have been flagged for closing due to inactivity label Dec 17, 2024
@nijave

nijave commented Dec 18, 2024

not stale

@github-actions github-actions bot removed the Stale Issues and pull requests which have been flagged for closing due to inactivity label Dec 19, 2024