
500 error when scraping metrics from otel-collector pod when loadbalancing exporter is used #30477

Closed
juissi-t opened this issue Jan 12, 2024 · 23 comments

@juissi-t

Component(s)

exporter/loadbalancing

What happened?

Description

I enabled the loadbalancing exporter on our collector pods. After a while (~1 hour), Prometheus fails to scrape metrics from the pods that have the exporter configured. Below is the error output from one pod.

* Connected to localhost (::1) port 8888
> GET /metrics HTTP/1.1
> Host: localhost:8888
> User-Agent: curl/8.2.1
> Accept: */*
> 
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Fri, 12 Jan 2024 13:41:47 GMT
< Content-Length: 401
< 
An error has occurred while serving metrics:

collected metric "otelcol_exporter_queue_size" { label:{name:"exporter"  value:"loadbalancing"}  label:{name:"service_instance_id"  value:"f3215806-0275-446f-acdc-32c306d25337"}  label:{name:"service_name"  value:"otelcol-contrib"}  label:{name:"service_version"  value:"0.92.0"}  gauge:{value:0}} was collected before with the same name and label values

Steps to Reproduce

  • Enable the loadbalancing exporter.

Expected Result

  • Prometheus metrics work correctly

Actual Result

  • Prometheus metrics fail after a while

Collector version

0.92.0

Environment information

Environment

OS: EKS 1.26 Bottlerocket

OpenTelemetry Collector configuration

exporters:
  debug: {}
  loadbalancing:
    protocol:
      otlp:
        sending_queue:
          queue_size: 10000
        tls:
          insecure: true
    resolver:
      k8s:
        service: opentelemetry-collector-sts.monitoring
  logging: {}
extensions:
  health_check: {}
processors:
  batch: {}
  k8sattributes:
    extract:
      labels:
      - from: pod
        key_regex: (.*)
        tag_name: $$1
      metadata:
      - k8s.namespace.name
      - k8s.deployment.name
      - k8s.statefulset.name
      - k8s.daemonset.name
      - k8s.cronjob.name
      - k8s.job.name
      - k8s.node.name
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.pod.start_time
    filter:
      node_from_env_var: K8S_NODE_NAME
    passthrough: false
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.ip
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: connection
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25
  resource:
    attributes:
    - action: upsert
      key: cluster
      value: dev
receivers:
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_compact:
        endpoint: 0.0.0.0:6831
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
      - job_name: opentelemetry-collector
        scrape_interval: 10s
        static_configs:
        - targets:
          - ${env:MY_POD_IP}:8888
  zipkin:
    endpoint: 0.0.0.0:9411
service:
  extensions:
  - health_check
  pipelines:
    metrics:
      exporters:
      - loadbalancing
      processors:
      - k8sattributes
      - memory_limiter
      - resource
      - batch
      receivers:
      - otlp
    traces:
      exporters:
      - loadbalancing
      processors:
      - k8sattributes
      - memory_limiter
      - resource
      - batch
      receivers:
      - otlp
      - jaeger
  telemetry:
    metrics:
      address: 0.0.0.0:8888

Log output

No response

Additional context

No response

@juissi-t added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Jan 12, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@juissi-t
Author

juissi-t commented Jan 15, 2024

Some things I tried:

  • I enabled debug logging for the weekend --> no problems scraping metrics for 48 hours.
  • I disabled debug logging and changed the metrics pipeline to use the prometheusremotewrite exporter, so only traces go through the loadbalancing exporter (see the sketch below) --> the issue reproduced after about 2 hours.
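
For reference, a minimal sketch of that pipeline split, reusing the receivers and processors from the config above; the prometheusremotewrite exporter and its endpoint are assumptions, since they are not shown in the original config:

exporters:
  prometheusremotewrite:
    endpoint: https://prometheus.example.com/api/v1/write  # assumed endpoint
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [k8sattributes, memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]  # metrics no longer go through loadbalancing
    traces:
      receivers: [otlp, jaeger]
      processors: [k8sattributes, memory_limiter, resource, batch]
      exporters: [loadbalancing]          # traces still use the loadbalancing exporter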

@jpkrohling
Member

I think this is more likely to be something with the Prometheus receiver/exporter than with the loadbalancing exporter, given that this seems to be about the component's own metrics rather than the load-balanced telemetry.

Contributor

Pinging code owners for exporter/prometheusremotewrite: @Aneurysm9 @rapphil. See Adding Labels via Comments if you do not have permissions to add labels yourself.

Contributor

Pinging code owners for receiver/prometheus: @Aneurysm9 @dashpole. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@swar8080
Contributor

Getting the same error with this loadbalancing configuration:

    exporters:
      loadbalancing:
        protocol:
          otlp:
            sending_queue:
              queue_size: 10000
            tls:
              insecure: true
        resolver:
          k8s:
            service: abc-collector-headless.observability

    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [otlp/forwarder, debug]
        traces:
          receivers: [otlp]
          exporters: [loadbalancing, debug]

@juissi-t
Author

juissi-t commented Jan 31, 2024

Based on the comments in #30697, I could reproduce this:

  1. Set up the loadbalancing exporter with K8s resolver with two target pods.
  2. Restart one of the target pods.
  3. Metrics start to fail.

I attached pod logs showing what happens, if they are of any help:

agent-logs.txt

Edit: I tried with 0.91.0, and couldn't reproduce. Pod logs from that version:

agent-0.91-logs.txt

Edit 2: Using 0.92.0 but disabling sending_queue works, as per #30697 (comment).

Edit 3: Managed to reproduce the issue with the DNS resolver as well.
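
A sketch of the workaround from Edit 2, disabling the exporter helper's sending queue on the loadbalancing exporter's OTLP protocol; the rest of the exporter config stays as in the original report:

exporters:
  loadbalancing:
    protocol:
      otlp:
        sending_queue:
          enabled: false  # per Edit 2, the scrape error no longer occurs with the queue disabled
        tls:
          insecure: true
    resolver:
      k8s:
        service: opentelemetry-collector-sts.monitoring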

@dashpole
Contributor

From the collector config, it doesn't look like you are actually using the prometheus receiver or prometheus exporter in a pipeline?

@juissi-t
Author

juissi-t commented Feb 1, 2024

From the collector config, it doesn't look like you are actually using the prometheus receiver or prometheus exporter in a pipeline?

No, I'm not. This can be reproduced easily without those, just by using the loadbalancing exporter.

Contributor

github-actions bot commented Feb 1, 2024

Pinging code owners for exporter/loadbalancing: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@juissi-t
Author

juissi-t commented Feb 1, 2024

Trying to gather my thoughts a bit here, so please forgive me if you find this messy.

  1. The loadbalancing exporter starts and registers a metric otelcol_exporter_queue_size{exporter="loadbalancing",service_instance_id="8b8c3359-8b34-488b-9f22-8cfa8081db97",service_name="otelcol-contrib",service_version="0.92.0"}.
    • What registers this metric? Is it the OTLP exporter?
  2. When a load-balanced endpoint changes, the metric seems to get duplicated and we get the error from the original report (the sketch after this list shows how a duplicate like this surfaces as that error).
    • Is there now a second OTLP exporter process/thread/goroutine running?
  3. When sending_queue is disabled, the error does not happen.
    • Does this imply that the problem is on the OTLP exporter side, and that only the code registering this one queue-related metric is broken?
  4. The problem does not happen with version 0.91.0, but it does with 0.92.0. What changed between those versions that might have caused this?
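
This is not the collector's actual code path, but a minimal client_golang sketch showing how two observers reporting the same metric name and label values produce exactly this scrape error; the metric name and "exporter" label come from the report above, everything else is illustrative:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// duplicatingCollector reports the same gauge sample twice per scrape,
// mimicking two queue-size observers that share the same label values.
type duplicatingCollector struct {
	desc *prometheus.Desc
}

func (c *duplicatingCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *duplicatingCollector) Collect(ch chan<- prometheus.Metric) {
	for i := 0; i < 2; i++ { // two callbacks, identical name and label values
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, 0, "loadbalancing")
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&duplicatingCollector{
		desc: prometheus.NewDesc(
			"otelcol_exporter_queue_size",
			"Current size of the retry queue (in batches)",
			[]string{"exporter"}, nil,
		),
	})

	// Gather fails the same way the /metrics handler does:
	// "... was collected before with the same name and label values"
	if _, err := reg.Gather(); err != nil {
		fmt.Println(err)
	}
}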

@Juliaj
Contributor

Juliaj commented Feb 1, 2024

@juissi-t, some information on your questions above:

  1. This metric is registered as part of the retry queue, see the code here.
  2. This is a good hypothesis and needs more debugging to confirm.
  3. I noticed this too. In my repro, the issue also doesn't reproduce when using the DNS resolver, so it's the combination of the k8s resolver + sending_queue that makes this happen (a DNS-resolver variant is sketched after this list for comparison).
  4. This is a good find! I can spend time looking into this.
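
For comparison, a sketch of the DNS-resolver variant mentioned in point 3; the hostname is a hypothetical headless-service DNS name, and the rest mirrors the config from the original report:

exporters:
  loadbalancing:
    protocol:
      otlp:
        sending_queue:
          queue_size: 10000
        tls:
          insecure: true
    resolver:
      dns:
        hostname: opentelemetry-collector-sts.monitoring.svc.cluster.local  # hypothetical hostname
        port: "4317"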

@Juliaj
Contributor

Juliaj commented Feb 2, 2024

With more debugging, the code in queue_sender.go from the OTel Collector repo (recordWithOtel, introduced in release 0.92.0) may be problematic. After switching the code to call recordWithOC, the issue was no longer reproducible.

@dmitryax, would you be able to shed some light?

func (qs *queueSender) recordWithOtel(meter otelmetric.Meter) error {
	var err, errs error

	attrs := otelmetric.WithAttributeSet(attribute.NewSet(attribute.String(obsmetrics.ExporterKey, qs.fullName)))

	qs.metricSize, err = meter.Int64ObservableGauge(
		obsmetrics.ExporterKey+"/queue_size",
		otelmetric.WithDescription("Current size of the retry queue (in batches)"),
		otelmetric.WithUnit("1"),
		otelmetric.WithInt64Callback(func(_ context.Context, o otelmetric.Int64Observer) error {
			o.Observe(int64(qs.queue.Size()), attrs)
			return nil
		}),
	)

@jpkrohling
Member

jpkrohling commented Feb 2, 2024

Feels like it's a duplicate of #16826.

@Juliaj
Contributor

Juliaj commented Feb 2, 2024

@jpkrohling, could you provide more insight on how this is connected to #16826?

@Juliaj
Contributor

Juliaj commented Feb 2, 2024

This issue also happens with the 0.93.0 release. For folks running 0.92.0 who need a workaround, the feature gate mentioned in the release notes can be used to disable useOtelForInternalMetrics:

- `service`: Enable `telemetry.useOtelForInternalMetrics` ...  Users can disable the behaviour
  by setting `--feature-gates -telemetry.useOtelForInternalMetrics` at
  collector start.

Note that you can't turn this off in 0.93.0 because the feature gate is marked as stable.
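
For example, on a Kubernetes setup like the one above, the gate could be disabled on 0.92.0 by adding the flag to the collector container's arguments; a minimal sketch, where the container name, image, and config path are assumptions:

containers:
  - name: otel-collector  # hypothetical container name
    image: otel/opentelemetry-collector-contrib:0.92.0
    args:
      - --config=/conf/relay.yaml  # hypothetical config path
      - --feature-gates=-telemetry.useOtelForInternalMetrics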

@alolita
Member

alolita commented Feb 2, 2024

@jpkrohling @open-telemetry/collector-contrib-maintainer can you assign this issue to @Juliaj to investigate and file a PR? Thanks!

@Juliaj
Contributor

Juliaj commented Feb 7, 2024

@juissi-t, would you be able to test the repro in your environment with the current commits from this repository? I built an image from the recent commits of the OTel Collector repository and this repository, and I am not able to reproduce with the steps above. Just wondering whether you could help verify.

@juissi-t
Author

juissi-t commented Feb 8, 2024

@juissi-t, would you be able to test the repro in your environment with the current commits from this repository? I built an image from the recent commits of the OTel Collector repository and this repository, and I am not able to reproduce with the steps above. Just wondering whether you could help verify.

Yes, I can deploy the image to our development environment to check. Please let me know where I can get the image from.

Edit: I managed to build the image myself and can no longer reproduce the issue.

@codeboten
Contributor

@juissi-t can you confirm that this issue is not reproducible with v0.94.0, which was released yesterday?

@Juliaj
Contributor

Juliaj commented Feb 13, 2024

@codeboten, @juissi-t, I verified that with v0.94.0 this issue was not reproducible in our setup.

@jpkrohling
Member

I'm closing this; feel free to reopen if it's still an issue.
