Exemplars disappear during OpenTelemetry metrics to Prometheus metrics conversion #320

Closed
akselleirv opened this issue Oct 27, 2023 · 7 comments
Labels: bug (Something isn't working), frozen-due-to-age

akselleirv commented Oct 27, 2023

What's wrong?

OpenTelemetry exemplars are not converted in the otelcol.exporter.prometheus component.

I'm using the OpenTelemetry Java instrumentation, started with java -javaagent:opentelemetry-javaagent.jar -jar target/rest-service-complete-0.0.1.jar, which instruments the SpringBoot application. It sends traces and metrics to the Grafana Agent, which converts them to Prometheus metrics and sends them to Mimir. I then expect exemplars to be available when querying in Grafana, which is not the case.

When I replace the Grafana Agent with the OpenTelemetry Collector instead, expose the converted metrics on the Collector, and let the Grafana Agent scrape the Collector and send the metrics to Mimir, the exemplars are available in Grafana.

So it seems that the issue is within the Grafana Agent metrics conversion.

Steps to reproduce

  1. Start a SpringBoot HTTP server with OpenTelemetry Java agent: java -javaagent:opentelemetry-javaagent.jar -jar target/rest-service-complete-0.0.1.jar
  2. Start Grafana Agent
  3. Send requests to the HTTP server; the Java agent then exports metrics to the Grafana Agent within 60s.
  4. Query for http_server_duration_milliseconds_bucket in Grafana with exemplars enabled; no exemplars are displayed in the graph window.
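
As a side note, one way to tell whether any exemplars reached Mimir at all is to query the Prometheus-compatible query_exemplars API; an empty result points at the conversion/remote-write path rather than the Grafana panel settings. This is only a rough sketch: the /prometheus path prefix, the nginx hostname (taken from the configuration below), and the absence of multi-tenancy (no X-Scope-OrgID header) are assumptions about the Mimir setup.

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "time"
)

func main() {
    // Base URL is an assumption: a default Mimir install exposes the Prometheus
    // query API under a /prometheus prefix; adjust to your own routing.
    base := "http://mimir-nginx.mimir.svc/prometheus/api/v1/query_exemplars"

    q := url.Values{}
    q.Set("query", "http_server_duration_milliseconds_bucket")
    q.Set("start", time.Now().Add(-15*time.Minute).Format(time.RFC3339))
    q.Set("end", time.Now().Format(time.RFC3339))

    resp, err := http.Get(base + "?" + q.Encode())
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // An empty "data" array means no exemplars were written at all, which
    // points at the conversion/remote-write path rather than the dashboard.
    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status)
    fmt.Println(string(body))
}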

System information

darwin/arm64

Software version

Grafana Agent v0.37.3

Configuration

logging {
  level  = "info"
  format = "logfmt"
}

otelcol.receiver.otlp "default" {
  http { }

  grpc { }

  output {
    traces  = [otelcol.processor.batch.default.input]
    metrics = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  output {
    traces  = [otelcol.exporter.otlp.default.input]
    metrics = [otelcol.processor.transform.default.input]
  }
}

otelcol.exporter.otlp "default" {
  client {
    endpoint = "tempo-distributor.tempo.svc:4317"

    tls {
        insecure             = true
        insecure_skip_verify = true
    }
  }
}

// Adds the namespace, pod and container labels to the metrics
otelcol.processor.transform "default" {
  metric_statements {
    context    = "datapoint"
    statements = [
      "set(attributes[\"namespace\"], resource.attributes[\"k8s.namespace.name\"])",
      "set(attributes[\"container\"], resource.attributes[\"k8s.container.name\"])",
      "set(attributes[\"pod\"], resource.attributes[\"k8s.pod.name\"])",
    ]
  }

  output {
    metrics = [otelcol.exporter.prometheus.default.input]
  }
}

otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.relabel.default.receiver]
}

prometheus.relabel "default" {
  rule {
    action        = "drop"
    source_labels = ["http_route"]
    regex         = "/metrics|/actuator.*"
  }

  rule {
    action = "labeldrop"
    regex  = "container_id|k8s_pod_name|host_name|k8s_node_name|k8s_replicaset_name|process_command_args|os_type|process_executable_path|process_pid|process_runtime_description|os_description"
  }

  forward_to = [prometheus.remote_write.remote.receiver]
}

prometheus.remote_write "remote" {
  external_labels = {
    telemetry_auto_instrumentation = "true",
  }

  endpoint {
    url               =  "http://mimir-nginx.mimir.svc/api/v1/push"
  }
}

Logs

No response

akselleirv added the bug (Something isn't working) label Oct 27, 2023

akselleirv commented Oct 30, 2023

It seems the converter skips writing exemplars because they are considered out-of-order by this if-statement: https://github.com/grafana/agent/blob/main/component/otelcol/exporter/prometheus/internal/convert/convert.go#L335

Here ts is time.Time(2023-10-30T07:23:23Z) while the series timestamp is time.Time(2023-10-30T07:24:14Z), so the exemplar is treated as out-of-order and dropped. Not sure why that is the case.

EDIT: Might be related to this: open-telemetry/opentelemetry-java#4193.

It seems that the OTEL Collector's converter does not check for out-of-order timestamps, which is why it is able to convert the exemplars: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/prometheusexporter/collector.go#L50
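
Not the agent's actual code, just a minimal Go sketch of the behaviour described above, assuming the converter keeps a last-seen timestamp per series and silently skips any exemplar that falls behind it (the type and function names here are made up for illustration):

package main

import (
    "fmt"
    "time"
)

// series mimics the converter's hypothetical per-series bookkeeping: it
// remembers the timestamp of the last sample written for that series.
type series struct {
    name     string
    lastSeen time.Time
}

type exemplar struct {
    traceID string
    value   float64
    ts      time.Time
}

// appendExemplar mirrors the out-of-order guard referenced above: an exemplar
// older than the series' last-seen timestamp is skipped without any error.
func appendExemplar(s *series, e exemplar) bool {
    if e.ts.Before(s.lastSeen) {
        return false // silently dropped, so the exemplar "disappears"
    }
    s.lastSeen = e.ts
    return true
}

func main() {
    s := &series{
        name:     "http_server_duration_milliseconds_bucket",
        lastSeen: time.Date(2023, 10, 30, 7, 24, 14, 0, time.UTC), // series timestamp from above
    }
    e := exemplar{
        traceID: "abc123", // placeholder
        value:   42,
        ts:      time.Date(2023, 10, 30, 7, 23, 23, 0, time.UTC), // exemplar ts from above
    }
    fmt.Println("exemplar kept:", appendExemplar(s, e)) // prints: exemplar kept: false
}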

Here is my OTEL collector config for reference:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317

processors:
  batch: {}

exporters:
  prometheus:
    endpoint: "0.0.0.0:1234"
    metric_expiration: 180m
    enable_open_metrics: true
    add_metric_suffixes: false
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

@akselleirv

Would it be possible to have an option to allow out-of-order exemplars?


github-actions bot commented Dec 9, 2023

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!


justinbwood commented Mar 28, 2024

Looks like the code for the converter moved since @akselleirv's last comment:
https://github.com/grafana/agent/blob/main/internal/component/otelcol/exporter/prometheus/internal/convert/convert.go#L335

Would it make sense to ignore the timestamp entirely, since Mimir (currently) rejects out-of-order exemplars anyway?

For example, I get this in Grafana Agent's logs when it attempts to push some exemplars to Mimir:

ts=2024-03-28T21:22:49.191638465Z level=error msg="non-recoverable error" component=prometheus.remote_write.mimir subcomponent=rw remote_name=546f14 url=https://mimir.example.com/api/v1/push count=1943 exemplarCount=57 err="server returned HTTP status 400 Bad Request: failed pushing to ingester: user=anonymous: err: out of order exemplar. timestamp=2024-03-28T21:22:09.783Z, 
  series={__name__=\"duration_milliseconds_bucket\", env=\"dev\", http_method=\"GET\", http_status_code=\"200\", job=\"job-xxx\", le=\"2\", namespace=\"ns-xxx\", service=\"svc-xxx\", span_kind=\"SPAN_KIND_SERVER\", span_name=\"GET /xxx/api/v1/**\", status_code=\"STATUS_CODE_UNSET\"}, exemplar={trace_id=\"5f23d20e507fb4c177566f9782097345\", span_id=\"d3ac98b035724486\"}"
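
Purely as an illustration of that suggestion (the types and function below are hypothetical, not the agent's real ones): if the converter ignored the exemplar timestamp, every exemplar would be forwarded and the out-of-order decision would be left to the receiver, which today means Mimir rejecting it with a 400 like the one above.

package main

import (
    "fmt"
    "time"
)

// exemplar and series are hypothetical stand-ins, not the agent's real types.
type exemplar struct {
    traceID string
    ts      time.Time
}

type series struct {
    lastSeen  time.Time
    exemplars []exemplar
}

// appendExemplarIgnoringOrder forwards every exemplar regardless of its
// timestamp; accepting or rejecting an out-of-order exemplar is then up to the
// remote-write receiver (Mimir currently rejects it with HTTP 400, as above).
func appendExemplarIgnoringOrder(s *series, e exemplar) {
    s.exemplars = append(s.exemplars, e)
    if e.ts.After(s.lastSeen) {
        s.lastSeen = e.ts
    }
}

func main() {
    s := &series{lastSeen: time.Now()}
    older := exemplar{traceID: "5f23d20e507fb4c177566f9782097345", ts: time.Now().Add(-time.Minute)}
    appendExemplarIgnoringOrder(s, older)
    fmt.Println("exemplars buffered:", len(s.exemplars)) // 1: nothing is dropped locally
}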

I also opened grafana/mimir#7748 to hopefully resolve this from the Mimir side.

@justinbwood

Upstream issue for out-of-order exemplar support in Prometheus: prometheus/prometheus#13577


rfratto commented Apr 11, 2024

Hi there 👋

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)

rfratto transferred this issue from grafana/agent Apr 11, 2024
@akselleirv

This is working for me, so I'll close the issue.

github-actions bot locked as resolved and limited conversation to collaborators Oct 13, 2024