
Google Cloud exporter reports unexpected error #36602

Open
AlexisBRENON opened this issue Nov 29, 2024 · 11 comments
Labels
bug Something isn't working exporter/googlecloud needs triage New item requiring triage processor/interval

Comments

@AlexisBRENON

AlexisBRENON commented Nov 29, 2024

Component(s)

exporter/googlecloud
processor/interval

What happened?

Description

I send HTTP server request duration metrics to an OTel collector, which sends them to Google Cloud Monitoring.
The Google Cloud exporter may report errors, while the Debug exporter outputs the metrics as expected.

Steps to Reproduce

Use the Python FastAPI instrumentation to report HTTP request durations.

from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

@app.post("/")
async def post():
    return {"message": "Hello POST"}

@app.put("/")
async def put():
    return {"message": "Hello PUT"}

Expected Result

I expect to export the metrics to Google Cloud Monitoring without any error in the logs.

Actual Result

There are some cases (not always) where the Google Cloud exporter reports error messages about out-of-order data points.

Collector version

v0.114.0, a custom distribution built to add the interval processor

Environment information

Environment

Kubernetes
Collector deployed with the Kubernetes Operator.

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      http:
processors:
  resourcedetection:
    detectors: [gcp]
    timeout: 10s
  batch:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 65
    spike_limit_percentage: 20
  resource/merge_instances:
    attributes:
      - key: service.instance.id
        action: delete
  interval:
    interval: 15s
  filter/in_duration_only:
    error_mode: ignore
    metrics:
      metric:
        - name != "http.server.duration"
  transform/interval_set_timestamp:
    metric_statements:
      - context: datapoint
        statements:
          - set(time, Now())
exporters:
  debug:
    verbosity: detailed # basic # detailed for debug
    sampling_initial: 5
  googlecloud/new:
    project: myproject
service:
  pipelines:
    metrics/new:
      receivers: [otlp]
      processors:
      - memory_limiter
      - batch
      - filter/in_duration_only
      # Aggregate metrics from different instances
      - resource/merge_instances
      - interval
      # Try to fix the out of order error
      - transform/interval_set_timestamp
      # Add GCP attributes
      - resourcedetection
      exporters:
      - googlecloud/new
      - debug

Log output

[...]
2024-11-29T14:18:01.905Z	info	Metrics	{"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 0, "metrics": 0, "data points": 0}
2024-11-29T14:18:01.906Z	info		{"kind": "exporter", "data_type": "metrics", "name": "debug"}
2024-11-29T14:18:01.937Z	info	Metrics	{"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 2}
2024-11-29T14:18:01.938Z	info	ResourceMetrics #0
Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
Resource attributes:
     -> telemetry.sdk.language: Str(python)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.27.0)
     -> service.name: Str(my-service-name)
     -> service.namespace: Str(greenbids)
     -> service.version: Str(0.2.5)
     -> cloud.provider: Str(gcp)
     -> cloud.account.id: Str(my-gcp-project)
     -> cloud.platform: Str(gcp_kubernetes_engine)
     -> cloud.region: Str(europe-west1)
     -> k8s.cluster.name: Str(my-gke-cluster)
     -> host.id: Str(7813299496046632564)
     -> host.name: Str(gke-my-gke-cluster-system-blue-8be1a7a8-x775)
ScopeMetrics #0
ScopeMetrics SchemaURL:
InstrumentationScope opentelemetry.instrumentation.fastapi 0.48b0
Metric #0
Descriptor:
     -> Name: http.server.duration
     -> Description: Duration of HTTP server requests.
     -> Unit: ms
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative
HistogramDataPoints #0
Data point attributes:
     -> http.scheme: Str(http)
     -> http.host: Str(127.0.0.1:8000)
     -> net.host.port: Int(8000)
     -> http.flavor: Str(1.1)
     -> http.method: Str(PUT)
     -> http.server_name: Str(localhost:8000)
     -> http.status_code: Int(200)
     -> http.target: Str(/)
StartTimestamp: 2024-11-29 13:58:12.570889662 +0000 UTC
Timestamp: 2024-11-29 14:18:01.937084789 +0000 UTC
Count: 6690
Sum: 3500.000000
Min: 0.000000
Max: 2.000000
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 25.000000
ExplicitBounds #4: 50.000000
ExplicitBounds #5: 75.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 750.000000
ExplicitBounds #10: 1000.000000
ExplicitBounds #11: 2500.000000
ExplicitBounds #12: 5000.000000
ExplicitBounds #13: 7500.000000
ExplicitBounds #14: 10000.000000
Buckets #0, Count: 3200
Buckets #1, Count: 3490
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 0
Buckets #9, Count: 0
Buckets #10, Count: 0
Buckets #11, Count: 0
Buckets #12, Count: 0
Buckets #13, Count: 0
Buckets #14, Count: 0
Buckets #15, Count: 0
HistogramDataPoints #1
Data point attributes:
     -> http.scheme: Str(http)
     -> http.host: Str(127.0.0.1:8000)
     -> net.host.port: Int(8000)
     -> http.flavor: Str(1.1)
     -> http.method: Str(POST)
     -> http.server_name: Str(localhost:8000)
     -> http.status_code: Int(200)
     -> http.target: Str(/)
StartTimestamp: 2024-11-29 13:58:12.577578762 +0000 UTC
Timestamp: 2024-11-29 14:18:01.93709106 +0000 UTC
Count: 1382
Sum: 1023.000000
Min: 0.000000
Max: 2.000000
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 25.000000
ExplicitBounds #4: 50.000000
ExplicitBounds #5: 75.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 750.000000
ExplicitBounds #10: 1000.000000
ExplicitBounds #11: 2500.000000
ExplicitBounds #12: 5000.000000
ExplicitBounds #13: 7500.000000
ExplicitBounds #14: 10000.000000
Buckets #0, Count: 361
Buckets #1, Count: 1021
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 0
Buckets #9, Count: 0
Buckets #10, Count: 0
Buckets #11, Count: 0
Buckets #12, Count: 0
Buckets #13, Count: 0
Buckets #14, Count: 0
Buckets #15, Count: 0

2024-11-29T14:18:01.962Z	error	internal/queue_sender.go:92	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "googlecloud/new", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: timeSeries[0,1]: write for resource=k8s_cluster{cluster_name:my-gke-cluster,location:europe-west1} failed with: Points must be written in order. One or more of the points specified had an older start time than the most recent point.\nerror details: name = Unknown  desc = total_point_count:2  errors:{status:{code:3}  point_count:2}", "dropped_items": 2}
go.opentelemetry.io/collector/exporter/exporterhelper/internal.NewQueueSender.func1
	go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/queue_sender.go:92
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
	go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43
2024-11-29T14:18:03.711Z	info	Metrics	{"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 0, "metrics": 0, "data points": 0}
2024-11-29T14:18:03.711Z	info		{"kind": "exporter", "data_type": "metrics", "name": "debug"}
[...]

Additional context

As you can see, I have two time series, differentiated by the http.method data point attribute (POST and PUT), so I don't understand why the Google Cloud exporter complains about out-of-order points.

There are some cases where the error logs do not show up. In the few examples I found, this happens when my data points have ordered timestamps.

But are timestamps expected to be ordered across different time series? If so, how can I make sure that the exporter sends the time series in the right order?

This seems linked to the addition of the interval processor (and the resource/merge_instances processor used to merge metrics from multiple instances), as I don't get this kind of error when I remove them.

@AlexisBRENON AlexisBRENON added bug Something isn't working needs triage New item requiring triage labels Nov 29, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@AlexisBRENON
Author

/label processor/interval


Pinging code owners for processor/interval: @RichieSams @sh0rez @djaglowski. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.

@dashpole
Contributor

dashpole commented Dec 2, 2024

The out-of-order error can happen within a batch of metrics, or between two batches. Given that the current batch doesn't conflict with itself, it likely means that a previous batch was sent with a later start time.

Note that this does not "merge" the instance label--it just deletes it:

  resource/merge_instances:
    attributes:
      - key: service.instance.id
        action: delete

That is probably part of the problem. If the instance id is required to differentiate between two series and you drop it, you will get errors like the one you got. You will need to use a different processor, like the metrics transform processor, to aggregate away the attribute rather than just removing it: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstransformprocessor#aggregate-labels
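
For illustration, a minimal sketch of such an aggregation (the processor name and label_set are placeholders to adapt to your setup; this assumes the listed keys are the only data point attributes you want to keep):

processors:
  metricstransform/aggregate:
    transforms:
      - include: http.server.duration
        match_type: strict
        action: update
        operations:
          # Keep only the listed labels; data points that become identical
          # once the other labels are dropped are summed together.
          - action: aggregate_labels
            label_set: [http.method, http.status_code, http.target]
            aggregation_type: sum

Note that aggregate_labels works on data point attributes, not resource attributes, so it only helps once the distinguishing attribute lives on the data points.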

I also see resource=k8s_cluster{cluster_name:my-gke-cluster,location:europe-west1}. I don't see a namespace or pod name, which will cause collisions if you have multiple pods writing the same metrics. If you use the k8sattributes processor, you can get these for all pods in the cluster.
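
A minimal sketch of a k8sattributes setup, assuming the collector has the RBAC permissions that processor needs (the metadata list below is an example, not exhaustive):

processors:
  k8sattributes:
    extract:
      metadata:
        # Resource attributes looked up from the Kubernetes API and added to incoming telemetry.
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name

It would sit in the metrics pipeline alongside resourcedetection, before the googlecloud exporter.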

@AlexisBRENON
Author

@dashpole Thanks for the feedback.
Yes, the resource/merge_instances processor actually just deletes the attribute; however, the series are then merged by the interval processor.
The metricstransform processor seems to only affect the data point labels, not the resource attributes (or maybe I didn't understand how to use it properly); that's why I tried the interval processor (see my original StackOverflow post).
What I ultimately want is time series aggregated across service.instance.id (to reduce cardinality, because I don't actually need to plot charts split by instance).

Nice catch about the missing namespace and pod name. It should not have any impact for the moment, as I only have one pod, but I will fix it to get a clean setup.

@dashpole
Contributor

dashpole commented Dec 2, 2024

I'm not sure whether the interval processor aggregates over resource attributes. Maybe @RichieSams @sh0rez @djaglowski know.

@RichieSams
Contributor

The interval processor takes the "latest" data point over the aggregation period and drops all others. It doesn't sum, average, or otherwise combine the values.

It uses the unique combination of scope info, resource info, metric info, and data point info to identify a given data point. If any of those differ, the interval processor will not aggregate them; it will keep both.

@RichieSams
Contributor

You could temporarily add a text exporter in addition to the Google Cloud exporter, which could help with debugging what values are being exported.

@AlexisBRENON
Author

The interval processor takes the "latest" data point over the aggregation period and drops all others. It doesn't sum, average, or otherwise combine the values.

Ah... So I completely misused it 😅

You could temporarily add a text exporter in addition to the Google Cloud exporter, which could help with debugging what values are being exported.

This is what I did: as you can see in the first post of this issue, the debug exporter is configured with detailed verbosity.

@AlexisBRENON
Author

To give a bit more context about my use case: I have a service over which I don't have much control (I can change neither the telemetry it sends nor the PeriodicExporter interval). Moreover, the service scales horizontally, sometimes running a single instance, sometimes more than 100. I'm looking for a way to control cardinality and time granularity on my end (through the collector) to reduce cost.
What would be the best processors to use? Or maybe this GitHub issue is not the right place to discuss it?

@dashpole
Contributor

dashpole commented Dec 2, 2024

I think you can move the resource attribute to a metric attribute using the transform processor, and then use the metrics transform processor to aggregate. The transform processor would use the datapoint context and set attributes["service.instance.id"] to resource.attributes["service.instance.id"].
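
A minimal sketch of that suggestion, with an illustrative processor name; it would be followed by the aggregate_labels operation sketched above (with service.instance.id left out of the label_set), and the resource-level attribute would still need to be deleted so the resources themselves merge:

processors:
  transform/copy_instance_id:
    metric_statements:
      - context: datapoint
        statements:
          # Copy the resource attribute onto each data point so the
          # metrics transform processor can later aggregate it away.
          - set(attributes["service.instance.id"], resource.attributes["service.instance.id"])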
