
500 error when scraping metrics from otel-collector pod when loadbalancing exporter is used #30477

Closed
juissi-t opened this issue Jan 12, 2024 · 23 comments

@juissi-t

Component(s)

exporter/loadbalancing

What happened?

Description

I enabled the loadbalancing exporter on our collector pods. After a while (~1 hour), Prometheus fails to scrape metrics from the pods that have the exporter configured. Below is the error output from one pod.

* Connected to localhost (::1) port 8888
> GET /metrics HTTP/1.1
> Host: localhost:8888
> User-Agent: curl/8.2.1
> Accept: */*
> 
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Fri, 12 Jan 2024 13:41:47 GMT
< Content-Length: 401
< 
An error has occurred while serving metrics:

collected metric "otelcol_exporter_queue_size" { label:{name:"exporter"  value:"loadbalancing"}  label:{name:"service_instance_id"  value:"f3215806-0275-446f-acdc-32c306d25337"}  label:{name:"service_name"  value:"otelcol-contrib"}  label:{name:"service_version"  value:"0.92.0"}  gauge:{value:0}} was collected before with the same name and label values

Steps to Reproduce

  • Enable the loadbalancing exporter.

Expected Result

  • Prometheus metrics work correctly

Actual Result

  • Prometheus metrics fail after a while

Collector version

0.92.0

Environment information

Environment

OS: EKS 1.26 Bottlerocket

OpenTelemetry Collector configuration

exporters:
  debug: {}
  loadbalancing:
    protocol:
      otlp:
        sending_queue:
          queue_size: 10000
        tls:
          insecure: true
    resolver:
      k8s:
        service: opentelemetry-collector-sts.monitoring
  logging: {}
extensions:
  health_check: {}
processors:
  batch: {}
  k8sattributes:
    extract:
      labels:
      - from: pod
        key_regex: (.*)
        tag_name: $$1
      metadata:
      - k8s.namespace.name
      - k8s.deployment.name
      - k8s.statefulset.name
      - k8s.daemonset.name
      - k8s.cronjob.name
      - k8s.job.name
      - k8s.node.name
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.pod.start_time
    filter:
      node_from_env_var: K8S_NODE_NAME
    passthrough: false
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.ip
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: connection
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25
  resource:
    attributes:
    - action: upsert
      key: cluster
      value: dev
receivers:
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_compact:
        endpoint: 0.0.0.0:6831
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
      - job_name: opentelemetry-collector
        scrape_interval: 10s
        static_configs:
        - targets:
          - ${env:MY_POD_IP}:8888
  zipkin:
    endpoint: 0.0.0.0:9411
service:
  extensions:
  - health_check
  pipelines:
    metrics:
      exporters:
      - loadbalancing
      processors:
      - k8sattributes
      - memory_limiter
      - resource
      - batch
      receivers:
      - otlp
    traces:
      exporters:
      - loadbalancing
      processors:
      - k8sattributes
      - memory_limiter
      - resource
      - batch
      receivers:
      - otlp
      - jaeger
  telemetry:
    metrics:
      address: 0.0.0.0:8888

Log output

No response

Additional context

No response

@juissi-t added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Jan 12, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@juissi-t
Author

juissi-t commented Jan 15, 2024

Some things I tried:

  • I enabled debug logging for the weekend --> no problems scraping metrics for 48 hours.
  • I disabled debug logging and changed the metrics pipeline to use the prometheusremotewrite exporter, so only traces go through the loadbalancing exporter (see the sketch below) --> the issue reproduced after about 2 hours.
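
For reference, a minimal sketch of that pipeline split, reusing the receivers and processors from the config above; the prometheusremotewrite exporter and its endpoint are assumptions, since they are not shown in the original config:

exporters:
  prometheusremotewrite:
    endpoint: https://prometheus.example.com/api/v1/write  # assumed endpoint
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [k8sattributes, memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]  # metrics no longer go through loadbalancing
    traces:
      receivers: [otlp, jaeger]
      processors: [k8sattributes, memory_limiter, resource, batch]
      exporters: [loadbalancing]          # traces still use the loadbalancing exporter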

@jpkrohling
Member

I think this is more likely to be something with the Prometheus receiver/exporter than with the loadbalancing exporter, given that this seems to be about the component's own metrics rather than the load-balanced telemetry.

Contributor

Pinging code owners for exporter/prometheusremotewrite: @Aneurysm9 @rapphil. See Adding Labels via Comments if you do not have permissions to add labels yourself.

Contributor

Pinging code owners for receiver/prometheus: @Aneurysm9 @dashpole. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@swar8080
Contributor

Getting the same error with this loadbalancing configuration:

    exporters:
      loadbalancing:
        protocol:
          otlp:
            sending_queue:
              queue_size: 10000
            tls:
              insecure: true
        resolver:
          k8s:
            service: abc-collector-headless.observability

    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [otlp/forwarder, debug]
        traces:
          receivers: [otlp]
          exporters: [loadbalancing, debug]

@juissi-t
Author

juissi-t commented Jan 31, 2024

Based on the comments in #30697, I could reproduce this:

  1. Set up the loadbalancing exporter with K8s resolver with two target pods.
  2. Restart one of the target pods.
  3. Metrics start to fail.

I attached pod logs showing what happens, if they are of any help:

agent-logs.txt

Edit: I tried with 0.91.0, and couldn't reproduce. Pod logs from that version:

agent-0.91-logs.txt

Edit 2: Using 0.92.0 but disabling sending_queue works, as per #30697 (comment).

Edit 3: Managed to reproduce the issue with the DNS resolver as well.
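
A sketch of the workaround from Edit 2, disabling the exporter helper's sending queue on the loadbalancing exporter's OTLP protocol; the rest of the exporter config stays as in the original report:

exporters:
  loadbalancing:
    protocol:
      otlp:
        sending_queue:
          enabled: false  # per Edit 2, the scrape error no longer occurs with the queue disabled
        tls:
          insecure: true
    resolver:
      k8s:
        service: opentelemetry-collector-sts.monitoring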

@dashpole
Contributor

From the collector config, it doesn't look like you are actually using the prometheus receiver or prometheus exporter in a pipeline?

@juissi-t
Author

juissi-t commented Feb 1, 2024

From the collector config, it doesn't look like you are actually using the prometheus receiver or prometheus exporter in a pipeline?

No, I'm not. This can be reproduced easily without those, just by using the loadbalancing exporter.

Contributor

github-actions bot commented Feb 1, 2024

Pinging code owners for exporter/loadbalancing: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@juissi-t
Author

juissi-t commented Feb 1, 2024

Trying to gather my thoughts a bit here, so please forgive me if you find this messy.

  1. The loadbalancing exporter starts and registers a metric otelcol_exporter_queue_size{exporter="loadbalancing",service_instance_id="8b8c3359-8b34-488b-9f22-8cfa8081db97",service_name="otelcol-contrib",service_version="0.92.0"}.
    • What registers this metric? Is it the OTLP exporter?
  2. When a load-balanced endpoint changes, the metric seems to get duplicated and we get the error from the original report (the sketch after this list shows how a duplicate like this surfaces as that error).
    • Is there now a second OTLP exporter process/thread/goroutine running?
  3. When sending_queue is disabled, the error does not happen.
    • Does this imply that the problem is on the OTLP exporter side, and that only the code registering this one queue-related metric is broken?
  4. The problem does not happen with version 0.91.0, but it does with 0.92.0. What changed between those versions that might have caused this?
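
This is not the collector's actual code path, but a minimal client_golang sketch showing how two observers reporting the same metric name and label values produce exactly this scrape error; the metric name and "exporter" label come from the report above, everything else is illustrative:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// duplicatingCollector reports the same gauge sample twice per scrape,
// mimicking two queue-size observers that share the same label values.
type duplicatingCollector struct {
	desc *prometheus.Desc
}

func (c *duplicatingCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *duplicatingCollector) Collect(ch chan<- prometheus.Metric) {
	for i := 0; i < 2; i++ { // two callbacks, identical name and label values
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, 0, "loadbalancing")
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&duplicatingCollector{
		desc: prometheus.NewDesc(
			"otelcol_exporter_queue_size",
			"Current size of the retry queue (in batches)",
			[]string{"exporter"}, nil,
		),
	})

	// Gather fails the same way the /metrics handler does:
	// "... was collected before with the same name and label values"
	if _, err := reg.Gather(); err != nil {
		fmt.Println(err)
	}
}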

@Juliaj
Contributor

Juliaj commented Feb 1, 2024

@juissi-t, some information on your questions above:

  1. This metric is registered as part of the retry queue, see the code here.
  2. This is a good hypothesis and needs more debugging to confirm.
  3. I noticed this too. In my repro, the issue also doesn't reproduce when using the DNS resolver, so it's the combination of the k8s resolver + sending_queue that makes this happen (a DNS-resolver variant is sketched after this list for comparison).
  4. This is a good find! I can spend time looking into this.
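
For comparison, a sketch of the DNS-resolver variant mentioned in point 3; the hostname is a hypothetical headless-service DNS name, and the rest mirrors the config from the original report:

exporters:
  loadbalancing:
    protocol:
      otlp:
        sending_queue:
          queue_size: 10000
        tls:
          insecure: true
    resolver:
      dns:
        hostname: opentelemetry-collector-sts.monitoring.svc.cluster.local  # hypothetical hostname
        port: "4317"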

@Juliaj
Contributor

Juliaj commented Feb 2, 2024

With more debugging, the code in queue_sender.go from the OTel Collector repo (recordWithOtel, introduced in release 0.92.0) may be problematic. After switching the code to call recordWithOC, the issue was no longer reproducible.

@dmitryax, would you be able to shed some light?

func (qs *queueSender) recordWithOtel(meter otelmetric.Meter) error {
	var err, errs error

	attrs := otelmetric.WithAttributeSet(attribute.NewSet(attribute.String(obsmetrics.ExporterKey, qs.fullName)))

	qs.metricSize, err = meter.Int64ObservableGauge(
		obsmetrics.ExporterKey+"/queue_size",
		otelmetric.WithDescription("Current size of the retry queue (in batches)"),
		otelmetric.WithUnit("1"),
		otelmetric.WithInt64Callback(func(_ context.Context, o otelmetric.Int64Observer) error {
			o.Observe(int64(qs.queue.Size()), attrs)
			return nil
		}),
	)

@jpkrohling
Member

jpkrohling commented Feb 2, 2024

Feels like it's a duplicate of #16826.

@Juliaj
Contributor

Juliaj commented Feb 2, 2024

@jpkrohling, could you provide more insight on how this is connected to #16826?

@Juliaj
Contributor

Juliaj commented Feb 2, 2024

This issue also happens with the 0.93.0 release. For folks running 0.92.0 who need a workaround, the feature gate mentioned in the release notes can be used to disable useOtelForInternalMetrics:

- `service`: Enable `telemetry.useOtelForInternalMetrics` ...  Users can disable the behaviour
  by setting `--feature-gates -telemetry.useOtelForInternalMetrics` at
  collector start.

Note that you can't turn this off in 0.93.0 because the feature gate is marked as stable.
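
For example, on a Kubernetes setup like the one above, the gate could be disabled on 0.92.0 by adding the flag to the collector container's arguments; a minimal sketch, where the container name, image, and config path are assumptions:

containers:
  - name: otel-collector  # hypothetical container name
    image: otel/opentelemetry-collector-contrib:0.92.0
    args:
      - --config=/conf/relay.yaml  # hypothetical config path
      - --feature-gates=-telemetry.useOtelForInternalMetrics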

@alolita
Member

alolita commented Feb 2, 2024

@jpkrohling @open-telemetry/collector-contrib-maintainer can you assign this issue to @Juliaj to investigate and file a PR? Thanks!

@Juliaj
Contributor

Juliaj commented Feb 7, 2024

@juissi-t, would you be able to test the repro in your environment with the current commits from this repository? I built an image from the recent commits of the OTel Collector repository and this repository, and I am not able to reproduce with the steps above. Just wondering whether you could help verify.

@juissi-t
Author

juissi-t commented Feb 8, 2024

@juissi-t, would you be able to test the repro in your environment with the current commits from this repository? I built an image from the recent commits of the OTel Collector repository and this repository, and I am not able to reproduce with the steps above. Just wondering whether you could help verify.

Yes, I can deploy the image to our development environment to check. Please let me know where I can get the image from.

Edit: I managed to build the image myself and can no longer reproduce the issue.

@codeboten
Contributor

@juissi-t can you confirm that this issue is not reproducible with v0.94.0, which was released yesterday?

@Juliaj
Contributor

Juliaj commented Feb 13, 2024

@codeboten, @juissi-t, I verified that with v0.94.0 this issue was not reproducible in our setup.

@jpkrohling
Member

I'm closing this; feel free to reopen if it's still an issue.
