Processed/exported SDK metrics #83

Open
carlosalberto opened this issue Jun 5, 2023 · 6 comments

@carlosalberto (Contributor)

Opening this issue mainly to get the ball rolling, as I have had users asking for metrics around processed/dropped/exported data (starting with traces, then following up with metrics/logs). I'd like to initially add the following metrics (with some inspiration taken from the current metrics in the Java SDK):

  • otel.exporter.exported, counter, with attributes:
    • success = true|false
    • type = span|metric|log
    • exporterType = <exporter type, e.g. GrpcSpanExporter>
  • otel.processor.processed, counter, with attributes:
    • dropped = true|false (buffer overflow)
    • type = span|metric|log
    • processorType = <processor type, e.g. BatchSpanProcessor>

Although this is mostly targeted at SDKs, the Collector could use these metrics as well - in which case we may want to add a component or pipeline.component attribute (or similar) to signal whether the source is an SDK or a Collector.
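For illustration, here is a minimal sketch of recording the proposed otel.exporter.exported counter with the OpenTelemetry Java metrics API (the Java SDK being the inspiration mentioned above). The class name, meter name, and method signature are assumptions for this sketch; only the metric and attribute names come from the proposal itself.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

// Sketch of SDK self-monitoring; everything except the proposed metric and
// attribute names is an illustrative assumption, not an existing SDK API.
final class ExporterSelfMetrics {
  private final LongCounter exported;

  ExporterSelfMetrics() {
    Meter meter = GlobalOpenTelemetry.getMeter("io.opentelemetry.sdk.self-metrics");
    exported = meter.counterBuilder("otel.exporter.exported")
        .setDescription("Telemetry items handed to an exporter")
        .setUnit("1")
        .build();
  }

  // Called by an exporter after each export attempt.
  void recordExport(long count, boolean success, String type, String exporterType) {
    exported.add(count, Attributes.builder()
        .put("success", success)           // true | false
        .put("type", type)                 // span | metric | log
        .put("exporterType", exporterType) // e.g. GrpcSpanExporter
        .build());
  }
}
```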

@arminru (Member) commented Jun 6, 2023

Do you intend to just introduce a semantic convention for this, or would this be added to the SDK specification (in https://github.com/open-telemetry/opentelemetry-specification) as well to ensure a consistent implementation?
The relevant SDK spec parts are already stable, but this could be introduced as an optional feature.

@jsuereth (Contributor) commented Jun 6, 2023

+1 on semconv. This also walks into the "namespaced attributes" debate.

@Oberon00 (Member) commented Jun 6, 2023

> The relevant SDK spec parts are already stable, but this could be introduced as an optional feature.

I don't think that "stable" is that restrictive, but I think this would best be made optional anyway.

@fbogsany (Contributor) commented Jun 6, 2023

This is exceptionally useful. We added hooks to enable metrics capture in the Ruby SDK a couple of years ago: open-telemetry/opentelemetry-ruby#510. The metrics we defined include:

  • otel.otlp_exporter.request_duration
  • otel.otlp_exporter.failure ("soft" failure - request will be retried)
  • otel.bsp.buffer_utilization (a snapshot of "fullness" of the BSP buffer)
  • otel.bsp.export.success
  • otel.bsp.export.failure (hard failure - request will not be retried)
  • otel.bsp.exported_spans
  • otel.bsp.dropped_spans

At Shopify, we find these metrics very useful for monitoring the health of our trace collection pipeline. We have added these metrics in various hacky ways to other language SDKs (e.g. Go). It would be great to standardize them across SDK implementations.
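A rough sketch of how an SDK could record a few of these, written against the OpenTelemetry Java metrics API for consistency with the rest of this thread. The wrapper class and its method are assumptions, not the Ruby SDK's actual hook API; only the metric names are taken from the list above.

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Illustrative only: the metric names follow the Ruby SDK hooks listed above,
// but this timing wrapper is an assumption, not an existing API.
final class ExportTimer {
  private final DoubleHistogram requestDuration;
  private final LongCounter exportSuccess;
  private final LongCounter exportFailure;

  ExportTimer(Meter meter) {
    requestDuration = meter.histogramBuilder("otel.otlp_exporter.request_duration")
        .setUnit("s").build();
    exportSuccess = meter.counterBuilder("otel.bsp.export.success").build();
    exportFailure = meter.counterBuilder("otel.bsp.export.failure").build();
  }

  // Times one export attempt and records success or (hard) failure.
  <T> T time(Supplier<T> export, Predicate<T> ok) {
    long start = System.nanoTime();
    T result = export.get();
    requestDuration.record((System.nanoTime() - start) / 1e9, Attributes.empty());
    (ok.test(result) ? exportSuccess : exportFailure).add(1);
    return result;
  }
}
```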

@robertlaurin

The Ruby SDK also reports the compressed and uncompressed sizes of the batch before exporting. We have found this to be a better indicator of load on our collection infrastructure than span volume alone, and we often feel its absence in other SDK implementations where we have not hacked it in.
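A hedged sketch of recording batch payload sizes with the OpenTelemetry Java metrics API. The metric names and the helper class are hypothetical, chosen for this sketch; the Ruby SDK's actual names may differ.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.LongHistogram;
import io.opentelemetry.api.metrics.Meter;

// Hypothetical helper; metric names are placeholders for this sketch.
final class BatchSizeMetrics {
  private final LongHistogram uncompressedSize;
  private final LongHistogram compressedSize;

  BatchSizeMetrics() {
    Meter meter = GlobalOpenTelemetry.getMeter("io.opentelemetry.sdk.self-metrics");
    uncompressedSize = meter.histogramBuilder("otel.otlp_exporter.message.uncompressed_size")
        .ofLongs().setUnit("By").build();
    compressedSize = meter.histogramBuilder("otel.otlp_exporter.message.compressed_size")
        .ofLongs().setUnit("By").build();
  }

  // Record both payload sizes just before the batch is sent.
  void record(byte[] serialized, byte[] compressed) {
    uncompressedSize.record(serialized.length);
    compressedSize.record(compressed.length);
  }
}
```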

@tiithansen commented Sep 5, 2024

It would be nice if the BSP exported the following metrics:

otel.bsp.queue.capacity - Maximum size of queue (Gauge)
otel.bsp.queue.size - Number of items in queue (Gauge)
otel.bsp.queue.max_batch_size - Maximum size of a batch (Gauge)
otel.bsp.queue.timeout - Timeout after which a batch is exported regardless of its size (Gauge)
otel.bsp.queue.exports - With labels reason=size|timeout (Counter)

This would make it easy to build dashboards and alerts that detect problematic applications: queue size can be compared against capacity, and it becomes visible whether exports are triggered mostly by timeouts or by batch-size limits.
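A sketch of how these could be wired up with the OpenTelemetry Java metrics API, using asynchronous gauges for the queue measurements. The class, constructor parameters, and meter name are assumptions; the metric and label names are taken from the list above.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import java.util.concurrent.BlockingQueue;

// Sketch only: the queue reference and configuration values are assumptions;
// the metric and label names come from the comment above.
final class BspQueueMetrics {
  private final LongCounter exports;

  BspQueueMetrics(BlockingQueue<?> queue, long capacity, long maxBatchSize, long exportTimeoutMillis) {
    Meter meter = GlobalOpenTelemetry.getMeter("io.opentelemetry.sdk.self-metrics");
    meter.gaugeBuilder("otel.bsp.queue.capacity").ofLongs()
        .buildWithCallback(m -> m.record(capacity));
    meter.gaugeBuilder("otel.bsp.queue.size").ofLongs()
        .buildWithCallback(m -> m.record(queue.size()));
    meter.gaugeBuilder("otel.bsp.queue.max_batch_size").ofLongs()
        .buildWithCallback(m -> m.record(maxBatchSize));
    meter.gaugeBuilder("otel.bsp.queue.timeout").ofLongs().setUnit("ms")
        .buildWithCallback(m -> m.record(exportTimeoutMillis));
    exports = meter.counterBuilder("otel.bsp.queue.exports").build();
  }

  // reason is "size" or "timeout", matching the proposed labels.
  void recordExport(String reason) {
    exports.add(1, Attributes.of(AttributeKey.stringKey("reason"), reason));
  }
}
```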
