scrape/scrape.go:1313 Scrape commit failed "error": "data refused due to high memory usage" #8217

pilot513 · 2023-08-10T12:54:00Z

Describe the bug
ota collector can't scrape pod metrics

Steps to reproduce
Configure prometheus exporter with prometheus endpoint

What did you expect to see?
Scrape metrics from pods and receive it to 'prometheusremotewrite'

What did you see instead?
error scrape/scrape.go:1313 Scrape commit failed {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "otel_kubernetes_podscraper", "target": "http://ip:port/metrics", "error": "data refused due to high memory usage"}

What version did you use?
Version: 0.82

What config did you use?
Config:

...
prometheus:
endpoint: 0.0.0.0:port
metric_expiration: 120m
resource_to_telemetry_conversion:
enabled: true
send_timestamps: true
prometheusremotewrite:
endpoint: http://hostname/prometheus/api/v1/write
extensions:
health_check: {}
memory_ballast: {}
processors:
batch: {}
memory_limiter:
check_interval: 3s
limit_mib: 6553
spike_limit_mib: 2048
...

Environment
k8s Pod (from helm chart) with Limit 8G 2cores

CarlosLanderas · 2023-08-10T13:47:40Z

I'm having the same problem with prometheus receiver performing scrapes since the container is started and have tried different collector versions and settings:

resources:
  requests:
    cpu: 500m
    memory: 4096Mi
  limits:
    memory: 4096Mi
 processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 4000
        spike_limit_mib: 800
      batch:
        send_batch_size: 1000
        timeout: 1s
        send_batch_max_size: 1500
 extensions:          
     memory_ballast:
          size_mib: 2000

pilot513 · 2023-08-10T15:17:57Z

I have settings:

resources:
  requests:
    cpu: 2
    memory: 4096Mi
  limits:
    memory: 8192Mi
 processors:
      memory_limiter:
        check_interval: 3s
        limit_mib: 6553
        spike_limit_mib: 2048

I should try memory_ballast as well ...

pilot513 · 2023-08-10T15:23:16Z

Does anyone have normal scrape results from pods under load?

pilot513 · 2023-08-10T17:52:27Z

Added definitions for batch

      batch:
        send_batch_size: 1000
        timeout: 1s
        send_batch_max_size: 1500

Let's see how it will be

pilot513 · 2023-08-11T08:01:31Z

At first it worked fine. And then the problems started again:

pilot513 · 2023-08-20T12:41:27Z

After a few days of work (5), history repeats itself.

pilot513 · 2023-08-20T13:14:06Z

bhupeshpadiyar · 2024-01-19T02:55:34Z

Hey @pilot513 , did you managed to fix this issue some how?
I am also facing the same issue with histogram metrics export. Please help to suggest the solution if you managed to find the one.

Thanks

pilot513 · 2024-01-19T10:27:51Z

In my case, I noticed that the number of metrics was constantly growing. I began to study this issue, and discovered that one application was generating a constant increase in unique metrics. It shouldn't be this way. I pointed this out to the developers, and they fixed it because their code for expose metrics was incorrect. As soon as I reinstalled the application, the problem went away.

martinohansen · 2024-04-10T06:44:58Z

I'm seeing the same thing, memory usage keeps going up until the receiver starts falling. At that point I begin to see export failures and the export queue goes up as well. We're sending around 35K/s data points across 350 scrape targets.

Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector error scrape/scrape.go:1351	Scrape commit failed	{"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "k8s", "target": "http://10.30.28.208:6666/metrics", "error": "data refused due to high memory usage"}
Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
	github.com/prometheus/[email protected]/scrape/scrape.go:1351
Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
	github.com/prometheus/[email protected]/scrape/scrape.go:1429
Apr 10 06:34:35 otel-collector-7855596df8-lz6cv otel-collector github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
	github.com/prometheus/[email protected]/scrape/scrape.go:1306

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: k8s
          tls_config:
            insecure_skip_verify: true
          scrape_interval: 10s
          ...
processors:
  batch:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  k8sattributes:
    extract:
      metadata:
        - k8s.container.name
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
        - k8s.replicaset.name
        - k8s.node.name
        - k8s.daemonset.name
        - k8s.cronjob.name
        - k8s.job.name
        - k8s.statefulset.name
      labels:
        - tag_name: k8s.pod.label.app
          key: app
          from: pod
        - tag_name: k8s.pod.label.component
          key: component
          from: pod
        - tag_name: k8s.pod.label.zone
          key: zone
          from: pod
    pod_association:
      - sources:
        - from: resource_attribute
          name: k8s.pod.ip
      - sources:
        - from: resource_attribute
          name: k8s.pod.uid
      - sources:
        - from: connection
  transform/add-workload-label:
    metric_statements:
      - context: datapoint
        statements:
        - set(attributes["kube_workload_name"], resource.attributes["k8s.deployment.name"])
        - set(attributes["kube_workload_name"], resource.attributes["k8s.statefulset.name"])
        - set(attributes["kube_workload_type"], "deployment") where resource.attributes["k8s.deployment.name"] != nil
        - set(attributes["kube_workload_type"], "statefulset") where resource.attributes["k8s.statefulset.name"] != nil
exporters:
  prometheusremotewrite:
    endpoint: ${env:PROMETHEUSREMOTEWRITE_ENDPOINT}
    headers:
      Authorization: ${env:PROMETHEUSREMOTEWRITE_TOKEN}
    resource_to_telemetry_conversion:
      enabled: true
    max_batch_size_bytes: 2000000
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch, k8sattributes, transform/add-workload-label]
      exporters: [prometheusremotewrite]

containers:
- command:
  - /otelcol-contrib
  - --config=/conf/otel-collector-config.yaml
  image: otel/opentelemetry-collector-contrib:0.96.0
  imagePullPolicy: IfNotPresent
  name: otel-collector
  ports:
  - containerPort: 55679
    protocol: TCP
  - containerPort: 4317
    protocol: TCP
  - containerPort: 4318
    protocol: TCP
  - containerPort: 14250
    protocol: TCP
  - containerPort: 14268
    protocol: TCP
  - containerPort: 9411
    protocol: TCP
  - containerPort: 8888
    protocol: TCP
  resources:
    limits:
      cpu: "4"
      memory: 16Gi
    requests:
      cpu: "4"
      memory: 16Gi
  volumeMounts:
  - mountPath: /conf
    name: otel-collector-config-vol
  env:
  - name: "GOMEMLIMIT"
    value: "12GiB" # 80% of memory request
  envFrom:
  - secretRef:
      name: otel-collector

martinohansen · 2024-04-10T07:49:57Z

Just tested on v0.97, same failure pattern. I did noticed this error message as well:

Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector error exporterhelper/queue_sender.go:101	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded; Permanent error: Permanent error: context deadline exceeded", "errorCauses": [{"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}, {"error": "Permanent error: Permanent error: context deadline exceeded"}], "dropped_items": 108155}
Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
	go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:101
Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
	go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:57
Apr 10 07:45:24 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
	go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector error exporterhelper/queue_sender.go:101	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: context deadline exceeded", "dropped_items": 108341}
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
	go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:101
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
	go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:57
Apr 10 07:45:34 otel-collector-7cbf899d45-cswl5 otel-collector go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
	go.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43

chenlujjj · 2024-05-27T10:01:40Z

Met similar error on v0.97

garg031 · 2024-07-22T05:02:27Z

I am facing similar issue..

Scrape continouly fails with the below error -

github.com/prometheus/[email protected]/scrape/scrape.go:1306
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
github.com/prometheus/[email protected]/scrape/scrape.go:1429
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
github.com/prometheus/[email protected]/scrape/scrape.go:1351
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1
error scrape/scrape.go:1351 Scrape commit failed {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_pool": "itomperftesting-otel-collector-job", "target": "http://0.0.0.0:8888/metrics", "error": "data refused due to high memory usage"}

This also leads high memory usage at otel-collector..

Do we have any work-around for this ?

benzch · 2024-10-29T11:04:25Z

Same issue version 0.100. any work around?

pilot513 added the bug Something isn't working label Aug 10, 2023

martinohansen mentioned this issue Apr 10, 2024

prometheusremotewrite context deadline exceeded open-telemetry/opentelemetry-collector-contrib#31910

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scrape/scrape.go:1313 Scrape commit failed "error": "data refused due to high memory usage" #8217

scrape/scrape.go:1313 Scrape commit failed "error": "data refused due to high memory usage" #8217

pilot513 commented Aug 10, 2023 •

edited

Loading

CarlosLanderas commented Aug 10, 2023 •

edited

Loading

pilot513 commented Aug 10, 2023 •

edited

Loading

pilot513 commented Aug 10, 2023

pilot513 commented Aug 10, 2023

pilot513 commented Aug 11, 2023

pilot513 commented Aug 20, 2023 •

edited

Loading

pilot513 commented Aug 20, 2023

bhupeshpadiyar commented Jan 19, 2024

pilot513 commented Jan 19, 2024

martinohansen commented Apr 10, 2024

martinohansen commented Apr 10, 2024 •

edited

Loading

chenlujjj commented May 27, 2024

garg031 commented Jul 22, 2024 •

edited

Loading

benzch commented Oct 29, 2024

scrape/scrape.go:1313 Scrape commit failed "error": "data refused due to high memory usage" #8217

scrape/scrape.go:1313 Scrape commit failed "error": "data refused due to high memory usage" #8217

Comments

pilot513 commented Aug 10, 2023 • edited Loading

CarlosLanderas commented Aug 10, 2023 • edited Loading

pilot513 commented Aug 10, 2023 • edited Loading

pilot513 commented Aug 10, 2023

pilot513 commented Aug 10, 2023

pilot513 commented Aug 11, 2023

pilot513 commented Aug 20, 2023 • edited Loading

pilot513 commented Aug 20, 2023

bhupeshpadiyar commented Jan 19, 2024

pilot513 commented Jan 19, 2024

martinohansen commented Apr 10, 2024

martinohansen commented Apr 10, 2024 • edited Loading

chenlujjj commented May 27, 2024

garg031 commented Jul 22, 2024 • edited Loading

benzch commented Oct 29, 2024

pilot513 commented Aug 10, 2023 •

edited

Loading

CarlosLanderas commented Aug 10, 2023 •

edited

Loading

pilot513 commented Aug 10, 2023 •

edited

Loading

pilot513 commented Aug 20, 2023 •

edited

Loading

martinohansen commented Apr 10, 2024 •

edited

Loading

garg031 commented Jul 22, 2024 •

edited

Loading