[connector/spanmetricsconnector] Generated counter drops then disappears #33421

Closed
duc12597 opened this issue Jun 7, 2024 · 13 comments
Labels: bug, connector/spanmetrics, needs triage

Comments

@duc12597

duc12597 commented Jun 7, 2024

Component(s)

connector/spanmetrics

What happened?

Description

Our collector receives OTLP traces from Kafka, converts them into metrics, and exports them to a TSDB. After a certain period of collector uptime (24-48 hours), the value of the generated calls_total counter drops significantly. Eventually no more metrics are exported.
[Image: spanmetrics]

Steps to Reproduce

Run the collector with the configuration below.

Expected Result

The calls_total counter only ever increases.

Actual Result

The calls_total counter drops then disappears.

Collector version

v0.101.0

Environment information

Environment

AWS EKS 1.24

OpenTelemetry Collector configuration

extensions:
  sigv4auth:
    region: ap-southeast-1
    service: "aps"
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 03-sink-metric-prometheus
          scrape_interval: 10s
          static_configs:
            - targets: ['127.0.0.1:8888']
  kafka/traces:
    protocol_version: 3.3.1
    brokers:
      - b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
      - b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
    auth:
      tls:
        insecure: true
    topic: otlp_spans
    group_id: 03-sink-metric-prometheus
  kafka/metrics:
    protocol_version: 3.3.1
    brokers:
      - b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
      - b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
    auth:
      tls:
        insecure: true
    topic: otlp_metrics
    group_id: 03-sink-metric-prometheus
processors:
  filter:
    error_mode: ignore
    metrics:
      datapoint:
        - 'IsMatch(attributes["http.target"], ".*.(css|js)")'
  transform:
    error_mode: ignore
    metric_statements:
      - context: datapoint
        statements:
          # reduce the cardinality of metrics with params
          - replace_pattern(attributes["http.target"], "/users/[0-9]{13}", "/users/{userId}")
connectors:
  spanmetrics:
    dimensions:
      - name: http.method
      - name: http.target
      - name: http.status_code
      - name: host.name
      - name: myCustomLabel
    exclude_dimensions:
      - span.kind
      - span.name
      - status.code
    exemplars:
      enabled: true
    metrics_flush_interval: 15s
exporters:
  debug:
  prometheusremotewrite:
    endpoint: https://aps-workspaces.ap-southeast-1.amazonaws.com/workspaces/<prometheus-workspace>/api/v1/remote_write
    auth:
      authenticator: sigv4auth
    external_labels:
      cluster_name: my-cluster
      collector: 03-sink-metric-prometheus
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 10s
      max_elapsed_time: 30s
    send_metadata: true
    max_batch_size_bytes: 3000000
service:
  telemetry:
    metrics:
      address: 127.0.0.1:8888
      level: detailed
  extensions:
    - sigv4auth
  pipelines:
    traces:
      receivers:
        - kafka/traces
      processors: []
      exporters:
        - spanmetrics
    metrics:
      receivers:
        - kafka/metrics
        - prometheus
        - spanmetrics
      processors:
        - filter
        - transform
      exporters:
        - debug
        - prometheusremotewrite

Log output

2024-06-07T01:38:51.776Z    error    exporterhelper/queue_sender.go:101    Exporting failed. Dropping data.    {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", 
"error": "Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded", "errorCauses": [{"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}], "dropped_items": 58510}   
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
    go.opentelemetry.io/collector/exporter@v0.101.0/exporterhelper/queue_sender.go:101
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
    go.opentelemetry.io/collector/exporter@v0.101.0/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
    go.opentelemetry.io/collector/exporter@v0.101.0/internal/queue/consumers.go:43

Additional context

  • Our application uses the HyperTrace Java agent to send telemetry data to Kafka in OTLP format
  • The problem persists across different TSDBs (AWS Prometheus, self-hosted Prometheus, Mimir) and different numbers of collector replicas (1, 3)
duc12597 added the bug and needs triage labels on Jun 7, 2024
Contributor

github-actions bot commented Jun 7, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@ankitpatel96
Contributor

I have a few questions that might help us track down this issue: Is there any chance your collector is restarting at these points? Are you running just one collector or many in a gateway mode?

@duc12597
Author

I'm running the collector as a deployment, and have tried both 1 and 3 replicas. The collector did not restart; I had to terminate the pods to get metrics exporting again.

@ankitpatel96
Contributor

I see... honestly at this point I don't quite know what would cause it to eventually stop emitting metrics at all - that's the symptom that is really throwing me for a loop.

Are you still having these problems? Can you try increasing resource_metrics_cache_size? The thought is that this might prevent evictions which might prevent the resets.
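A minimal sketch of that change, assuming the connector block from the configuration above. The value 10000 is illustrative rather than a recommendation; the right size depends on how many distinct resources send spans.

```yaml
connectors:
  spanmetrics:
    metrics_flush_interval: 15s
    # Number of per-resource metric sets the connector keeps in memory.
    # Illustrative value: the idea is to size it above the number of
    # distinct resources so that entries are not evicted and later
    # recreated with their counters starting over.
    resource_metrics_cache_size: 10000
```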

Other things that might help us track down this problem: what is the count of unique series within calls_total over time? Are resets happening for series that the TSDB has already received, or are entirely new series appearing?
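Both questions can be tracked with PromQL. The sketch below expresses them as a Prometheus rule file, assuming the TSDB in use supports recording rules; the group and record names are hypothetical.

```yaml
groups:
  - name: spanmetrics-debug            # hypothetical group name
    rules:
      # Number of unique calls_total series at each evaluation.
      - record: spanmetrics:calls_total:series_count
        expr: count(calls_total)
      # Counter resets per existing series over the last hour.
      - record: spanmetrics:calls_total:resets_1h
        expr: resets(calls_total[1h])
```

A rising series count points to new series being created, while non-zero resets on long-lived series point to counters restarting.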

@duc12597
Author

[Image: graph of count(calls_total)]
This is count(calls_total) at approximately the time the counter decreases.

@duc12597
Author

Further observation shows that, of the 3 metrics receivers in my collector configuration, kafka/metrics and prometheus worked fine:
[Image: metrics from the kafka/metrics and prometheus receivers]

Only metrics from spanmetrics failed:
[Image: metrics from spanmetrics]

@ankitpatel96
Contributor

Thanks for the update. Did you try changing the cache size? I'm honestly a little stumped. Any ideas, @portertech @Frapschen?

@swar8080
Contributor

swar8080 commented Jul 3, 2024

With the current config, the connector will permanently cache every series it sees and send them all during each flush, even the ones where nothing has changed.

So eventually the payload flushed to prometheusremotewrite gets so large that the remote write request times out ("context deadline exceeded" is a timeout), and the remote write target likely rejects the request because of its size:

Permanent error: context deadline exceeded"}], "dropped_items": 58510}   

Possible things that could help are:

  • Setting metrics_expiration on the connector so that infrequently updated span metrics are removed. You then have to deal with Prometheus counter resets
  • Breaking up the remote write requests into smaller batches, possibly using the batch processor and/or prometheusremotewrite's built-in config
  • Switching to scraping the span metrics with Prometheus, since it is optimized for a large number of series
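The first two suggestions could be sketched roughly as follows. All values are illustrative assumptions, not tuned recommendations, and the endpoint is a placeholder.

```yaml
connectors:
  spanmetrics:
    # Stop emitting series that have not been updated for this long.
    metrics_expiration: 30m

processors:
  batch:
    # Cap the number of data points per exported batch. The batch
    # processor requires send_batch_max_size >= send_batch_size.
    send_batch_size: 5000
    send_batch_max_size: 5000

exporters:
  prometheusremotewrite:
    endpoint: <remote-write-endpoint>
    # HTTP client timeout for each remote write request.
    timeout: 30s
    # Smaller than the 3000000 used in the original configuration.
    max_batch_size_bytes: 1000000

service:
  pipelines:
    metrics:
      receivers: [kafka/metrics, prometheus, spanmetrics]
      processors: [filter, transform, batch]
      exporters: [debug, prometheusremotewrite]
```

Expiring series trades payload size for counter resets whenever an expired series reappears, so queries should use functions such as rate() or increase(), which tolerate resets.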

@duc12597
Author

I set metrics_expiration: 30m, but the metrics still disappeared altogether. They returned after ~6 hours, even though the collectors did not restart.
[Image]

@Frapschen
Contributor

@duc12597 Have you tried switching from the push model to pull? Replace prometheusremotewrite with prometheusexporter.
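A minimal sketch of the pull model, assuming the metrics pipeline from the configuration above. The listen port and expiration value are hypothetical, and a Prometheus server would still need a scrape job pointing at each collector replica.

```yaml
exporters:
  debug:
  prometheus:
    # Address the collector listens on for scrapes (hypothetical port).
    endpoint: 0.0.0.0:8889
    # How long a series stays exposed without receiving updates.
    metric_expiration: 1h

service:
  pipelines:
    metrics:
      receivers:
        - kafka/metrics
        - prometheus
        - spanmetrics
      processors:
        - filter
        - transform
      exporters:
        - debug
        - prometheus
```

Here `prometheus` under receivers and `prometheus` under exporters are different components that happen to share a name.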

@duc12597
Author

> @duc12597 Have you tried switching from the push model to pull? Replace prometheusremotewrite with prometheusexporter.

We will consider this option. As of now, the collector has been running for 2 weeks without any errors, although there are still counter fluctuations. I'm not sure whether that is thanks to any changes on our side. I will close this issue for now and re-open it if the problem resurfaces.

This is my complete collector manifest:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: 03-sink-metric-prometheus
spec:
  image: mirror.gcr.io/otel/opentelemetry-collector-contrib:0.102.0
  replicas: 5
  nodeSelector:
    mycompany.com/service: observability
    kubernetes.io/arch: amd64
  tolerations:
    - effect: NoSchedule
      key: mycompany.com/service
      value: observability
      operator: Equal
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: 03-sink-metric-prometheus
              scrape_interval: 10s
              static_configs:
                - targets: ['127.0.0.1:8888']
      kafka/traces:
        protocol_version: 3.3.1
        brokers:
          - b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
          - b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
        auth:
          tls:
            insecure: true
        topic: otlp_spans
        group_id: 03-sink-metric-prometheus
      kafka/metrics:
        protocol_version: 3.3.1
        brokers:
          - b-1.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
          - b-2.<msk-endpoint>.ap-southeast-1.amazonaws.com:9094
        auth:
          tls:
            insecure: true
        topic: otlp_metrics
        group_id: 03-sink-metric-prometheus
    processors:
      filter:
        error_mode: ignore
        metrics:
          datapoint:
            - 'IsMatch(attributes["http.target"], ".*.(css|js)")'
      transform:
        error_mode: ignore
        metric_statements:
          - context: datapoint
            statements:
              # reduce the cardinality of metrics with params
              - replace_pattern(attributes["http.target"], "/users/[0-9]{13}", "/users/{userId}")
    connectors:
      spanmetrics:
        dimensions:
          - name: http.method
          - name: http.target
          - name: http.status_code
          - name: host.name
          - name: myCustomLabel
        exclude_dimensions:
          - span.kind
          - span.name
          - status.code
        exemplars:
          enabled: true
        metrics_flush_interval: 15s
        metrics_expiration: 1h
        resource_metrics_key_attributes:
          - service.name
          - telemetry.sdk.language
          - telemetry.sdk.name
        resource_metrics_cache_size: 10000
    exporters:
      debug:
      prometheusremotewrite:
        endpoint: http://mimir-nginx/api/v1/push
        send_metadata: true
    service:
      telemetry:
        metrics:
          address: 127.0.0.1:8888
          level: detailed
      extensions:
        - sigv4auth
      pipelines:
        traces:
          receivers:
            - kafka/traces
          processors: []
          exporters:
            - spanmetrics
        metrics:
          receivers:
            - kafka/metrics
            - prometheus
            - spanmetrics
          processors:
            - filter
            - transform
          exporters:
            - debug
            - prometheusremotewrite
  env:
    - name: GOMEMLIMIT
      value: 1640MiB # 80% of resources.limits.memory
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
    limits:
      cpu: 500m
      memory: 2Gi

@Frapschen
Contributor

@duc12597 Sorry for pinging you. There is a related issue about counter fluctuation; please see #34126 (comment) for a fix.

@duc12597
Author

duc12597 commented Aug 7, 2024

> @duc12597 Sorry for pinging you. There is a related issue about counter fluctuation; please see #34126 (comment) for a fix.

If I understand correctly, this will add a UUID as a label for every metric generated by each collector pod. Will this explode the cardinality? Why does a UUID solve the fluctuation? Can you give an example config?
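As a rough illustration of the general idea (not necessarily the exact change proposed in #34126), each replica can attach a label that identifies itself, so that replicas stop writing different cumulative values into the same series. This sketch assumes the pod name is an acceptable identifier; the `POD_NAME` variable and the `collector_pod` label name are hypothetical.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: 03-sink-metric-prometheus
spec:
  env:
    # Expose the pod name to the collector through the downward API.
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
  config: |
    exporters:
      prometheusremotewrite:
        endpoint: http://mimir-nginx/api/v1/push
        send_metadata: true
        external_labels:
          # One label value per replica, not one per metric.
          collector_pod: ${env:POD_NAME}
```

With a per-replica label the series count grows by roughly the number of replicas, and old series go stale when pods are replaced. The label applies to everything this exporter sends, including the kafka/metrics and prometheus receiver data, and queries then need to aggregate across replicas, for example with sum without (collector_pod).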

Thanks a ton.
